This page focuses on
/v1/chat/completions compatible mode. For Claude’s native thinking blocks (the thinking field on /v1/messages), see the Claude Effort & Thinking Guide. For Gemini’s native thinking_level and thought_signature, see Gemini Native Calls.Overview
In compatible mode, reasoning models fall into three groups by “does it emit thinking text”:| Type | Example models | Thinking content | Getting the answer |
|---|---|---|---|
| No thinking emitted | gpt-4.1-mini, gemini-3.1-flash-lite, claude-haiku-4-5 | None | Same as regular models |
| Emits thinking text | grok-4.3, qwen3.6-plus, glm-5.1 | reasoning_content field | Answer in content, thinking in reasoning_content |
| Tokens only | gpt-5.4-mini | No text, only usage.reasoning_tokens | Same as regular models |
Thinking content: reasoning_content
Models that emit thinking text put the chain of thought inreasoning_content, parallel to content.
Non-streaming — message carries both:
delta.reasoning_content is pushed first; delta.content starts only once thinking is done. Be sure to render the two separately (collapse the thinking, stream the answer), or the UI will flash a wall of thoughts first:
Reasoning tokens can dwarf the answer. In testing, a trivial “1+1” question produced hundreds of
reasoning_tokens on grok-4.3 against just a few answer tokens. Thinking is billed as output tokens, so evaluate whether to enable / display it for latency- and cost-sensitive use cases.Thought signatures and multi-turn
A “thought signature” is a Gemini native concept: in native multimodal / function calling, the model returns an encryptedthought_signature that must be passed back across turns to preserve reasoning continuity (see Gemini Native Calls and Gemini Function Calling).
In /v1/chat/completions compatible mode, reasoning models are stateless:
- Multi-turn only requires putting the previous assistant turn’s
contentinto the message history; - There is no need to pass back
reasoning_content, and no signature field appears in the response; - In testing, gemini-3.1-flash-lite and grok-4.3 both kept multi-turn context correctly while only
contentwas passed back.
Structured output
Useresponse_format to make the model emit JSON only. Two types:
Per-model support (tested)
json_schema support varies widely — this is the biggest pitfall in structured output:
| Model | json_schema | json_object | Can content be parsed directly |
|---|---|---|---|
| gpt-4.1-mini | ✅ strictly honored | ✅ | ✅ pure JSON |
| gemini-3.1-flash-lite | ✅ | ✅ | ✅ |
| grok-4.3 | ✅ (also emits thinking) | ✅ | ✅ content is pure JSON |
| gpt-5.4-mini | ⚠️ JSON correct, but prefixed with <think>…</think> | — | ❌ strip the think block first |
| qwen3.6-plus | ⚠️ returns 400: messages must contain the word “json” | ✅ | ✅ |
| claude-haiku-4-5 | ❌ ignores schema, returns Markdown | — | ❌ |
| glm-5.1 | ❌ ignores schema, returns prose | ✅ | ✅ under json_object |
Getting JSON reliably across models
Related links
- Same group: Handling Responses · Compatible Mode Calls · Function Calling
- Native thinking: Claude Effort & Thinking Guide · Gemini Native Calls
- Models & pricing: Models & Pricing Overview