
Quick Answer

max_tokens controls the maximum number of tokens the model can generate in a single response. APIYI does not impose any additional limits on max_tokens — the parameter is passed directly to the upstream model. You can set it yourself; if not set, the model’s default value applies.
APIYI’s approach: We do not enforce any max_tokens limit. You have full control. When not set, each model uses its own default output behavior.

What max_tokens Does

max_tokens (maximum output tokens) is one of the most common parameters when calling LLM APIs. It tells the model: generate at most this many tokens in your response.
  • Set it too low: The model may be cut off mid-response (returns finish_reason: "length")
  • Set it too high: The model won’t be forced to generate that many tokens, but you may incur higher costs (some models charge per output token)
  • Not set: Uses the model’s default value (varies by provider — see table below)
Token ≠ character. In English, roughly 1 word ≈ 1-1.5 tokens. In Chinese, roughly 1 character ≈ 1-2 tokens. 4,096 tokens is approximately 3,000 English words.
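To make the truncation signal concrete, here is a minimal sketch. The helper below encodes the rule described above (a capped response returns finish_reason: "length"); the commented call shows how it might be used with the openai Python SDK against an OpenAI-compatible endpoint (the base URL, key, and prompt are placeholders, not real values):

```python
def was_truncated(finish_reason: str) -> bool:
    """A response cut off by max_tokens comes back with finish_reason == "length"."""
    return finish_reason == "length"

# Sketch of a call through an OpenAI-compatible endpoint (placeholders, not runnable as-is):
#   from openai import OpenAI
#   client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="sk-...")
#   resp = client.chat.completions.create(
#       model="gpt-4o",
#       messages=[{"role": "user", "content": "Summarize this document."}],
#       max_tokens=50,  # deliberately small cap to demonstrate truncation
#   )
#   if was_truncated(resp.choices[0].finish_reason):
#       print("Output was cut off at the max_tokens cap")
```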

OpenAI Parameter Naming Evolution

OpenAI has used different parameter names across different APIs and time periods, which can cause confusion:
| API Type | Parameter Name | Applicable Models | Introduced |
| --- | --- | --- | --- |
| Chat Completions API | max_tokens | GPT-3.5, GPT-4, GPT-4o, etc. | Original version |
| Chat Completions API | max_completion_tokens | o1, o3, o4-mini reasoning models | September 2024 (o1 launch) |
| Responses API | max_output_tokens | All models (GPT-4o, GPT-5.4, o3, etc.) | 2025 |

Why the Rename?

When OpenAI released the o1 reasoning model in September 2024, it introduced “hidden reasoning tokens” — the model generates extensive internal reasoning tokens that do not appear in your response. The original max_tokens meant both “tokens generated” and “tokens you receive,” but with reasoning models these are no longer equal. So OpenAI introduced max_completion_tokens to explicitly mean “the cap on tokens you receive in the response.” Later, the Responses API unified on the more intuitive name max_output_tokens.
Important: When using OpenAI’s o-series reasoning models (e.g., o3, o4-mini) with the Chat Completions API, you must use max_completion_tokens instead of max_tokens, or you’ll get an error.
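The table and note above can be sketched as a small helper that picks the correctly named parameter before building a Chat Completions request. The prefix check below is a simplification of "o-series reasoning model" and is an assumption, not an official rule:

```python
def output_cap_param(model: str) -> str:
    """Pick the output-cap parameter name for the Chat Completions API.

    o-series reasoning models reject max_tokens and require
    max_completion_tokens; matching on the model-name prefix is a
    simplification for illustration.
    """
    if model.startswith(("o1", "o3", "o4")):
        return "max_completion_tokens"
    return "max_tokens"

def build_payload(model: str, prompt: str, cap: int) -> dict:
    """Build a Chat Completions request body with the correctly named cap."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        output_cap_param(model): cap,
    }
```

For example, build_payload("o3", "...", 1024) carries max_completion_tokens, while build_payload("gpt-4o", "...", 1024) carries max_tokens.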

What Happens If max_tokens Is Not Set?

Different providers handle this differently:
| Provider | Default Behavior When Not Set | Notes |
| --- | --- | --- |
| OpenAI | No limit (outputs until context window is exhausted) | Model decides output length naturally |
| Anthropic Claude | Required parameter; errors if not set | Claude API requires explicit max_tokens |
| Google Gemini | Defaults to 8,192 tokens | Even if the model supports more, only 8,192 tokens are returned |
| DeepSeek (chat) | Defaults to 4,000 tokens | Can be manually increased to 8,000 |
| DeepSeek (reasoner) | Defaults to 32,000 tokens | Includes chain-of-thought output; max 64,000 |
Special Note: Anthropic Claude API’s max_tokens is a required parameter. If you don’t include it, the API will return an error. Always set it when using Claude models.
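A small guard that mirrors this contract can catch the missing parameter before the request leaves your code. The validator below is illustrative (the error message is ours, not Anthropic's); the commented lines sketch the equivalent call with the anthropic SDK:

```python
def validate_claude_payload(payload: dict) -> dict:
    """Mirror the Claude Messages API contract: max_tokens is mandatory."""
    if "max_tokens" not in payload:
        raise ValueError("Anthropic's Messages API requires max_tokens; set it explicitly")
    return payload

# Equivalent call with the anthropic SDK (sketch; key and prompt are placeholders):
#   import anthropic
#   client = anthropic.Anthropic(api_key="...")
#   message = client.messages.create(
#       model="claude-sonnet-4-6",
#       max_tokens=4096,  # required: omitting this makes the API return an error
#       messages=[{"role": "user", "content": "..."}],
#   )
```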

Maximum Output Tokens Reference

Below are the maximum output token limits for popular models. Always check the official documentation for the latest values, as models are updated frequently.
| Model | Model ID | Max Output Tokens | Context Window |
| --- | --- | --- | --- |
| GPT-5.4 | gpt-5.4-2026-03-05 | 128,000 | 1,047,576 |
| GPT-4o | gpt-4o | 16,384 | 128,000 |
| o3 | o3 | 100,000 | 200,000 |
| Claude Opus 4.6 | claude-opus-4-6 | 128,000 | 1,000,000 |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 64,000 | 1,000,000 |
| Gemini 3.1 Pro | gemini-3.1-pro | 65,536 | 2,000,000 |
| DeepSeek V3 | deepseek-chat | 8,000 | 64,000 |
| DeepSeek R1 | deepseek-reasoner | 64,000 | 64,000 |
Official Documentation (for the latest values):
  • OpenAI: platform.openai.com/docs/models
  • Anthropic Claude: docs.anthropic.com/en/docs/about-claude/models
  • Google Gemini: ai.google.dev/gemini-api/docs/models
  • DeepSeek: api-docs.deepseek.com/api/create-chat-completion
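If you keep a copy of these caps in code, you can clamp user-supplied values before sending a request. The numbers below are copied from the table above and will go stale; re-check the official documentation before relying on them:

```python
# Max output token caps, copied from the reference table above (verify against
# the official docs; these values change as models are updated).
MAX_OUTPUT_TOKENS = {
    "gpt-4o": 16_384,
    "o3": 100_000,
    "claude-sonnet-4-6": 64_000,
    "deepseek-chat": 8_000,
    "deepseek-reasoner": 64_000,
}

def clamp_max_tokens(model: str, requested: int) -> int:
    """Clamp a requested cap to the model's known maximum, if we have one."""
    cap = MAX_OUTPUT_TOKENS.get(model)
    return min(requested, cap) if cap is not None else requested
```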

Recommendations

Best Practice: We recommend explicitly setting max_tokens in every API call:
  • Different models/providers have different defaults, which can cause unexpected truncation
  • An explicit cap keeps output length predictable and prevents unnecessary token consumption
  • The Claude API requires it; setting it consistently reduces errors
Typical settings: general chat 2,048-4,096; long-form generation 8,192-16,384; code generation 4,096-8,192.
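The typical settings above can be kept as named defaults so callers pick a task type instead of a raw number. The task names and chosen values are our own starting points within the suggested ranges, not prescribed by any API:

```python
# Starting-point caps within the typical ranges above (tune per application).
RECOMMENDED_MAX_TOKENS = {
    "chat": 4096,        # general chat: 2,048-4,096
    "long_form": 16384,  # long-form generation: 8,192-16,384
    "code": 8192,        # code generation: 4,096-8,192
}

def recommended_cap(task: str, fallback: int = 4096) -> int:
    """Return a recommended max_tokens for a task type, or a safe fallback."""
    return RECOMMENDED_MAX_TOKENS.get(task, fallback)
```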

FAQ

Does APIYI impose its own limit on max_tokens?
No. APIYI passes the max_tokens parameter directly to the upstream model without any additional restrictions. Whatever you set is what the upstream model receives. The only limit comes from the model’s own maximum output token cap.
What happens if I set max_tokens higher than the model’s maximum?
No error will occur — the model will simply generate up to its own maximum. For example, GPT-4o has a max output of 16,384 tokens; even if you set max_tokens: 100000, it will output at most 16,384 tokens.
What is the difference between max_tokens, max_completion_tokens, and max_output_tokens?
They serve the same purpose — limiting output tokens. The difference is naming:
  • max_tokens: OpenAI’s original parameter name, used for GPT series non-reasoning models
  • max_completion_tokens: Since September 2024, used for OpenAI’s o-series reasoning models
  • max_output_tokens: The unified parameter name in OpenAI’s Responses API
When calling through APIYI, use the appropriate parameter name based on the model and API format you’re using.
What does finish_reason: "length" mean?
This means the model’s output reached the max_tokens limit. Solutions:
  1. Increase the max_tokens value
  2. Optimize your prompt to get more concise responses
  3. Check that you’re using the correct parameter name (o-series models require max_completion_tokens)
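One simple way to apply step 1 programmatically is to retry a truncated request with a larger cap, bounded by the model's maximum. The doubling strategy below is our own suggestion, not an official recommendation:

```python
def next_max_tokens(current: int, model_cap: int) -> int:
    """After a finish_reason "length" response, retry with a doubled cap,
    never exceeding the model's own maximum output tokens."""
    return min(current * 2, model_cap)
```

For example, a request truncated at max_tokens=2048 against GPT-4o (cap 16,384) would retry with 4,096, then 8,192, then 16,384.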
