In compatible mode, every model (OpenAI, Claude, Gemini, Grok, Qwen, GLM and others) returns the same OpenAI schema. Almost all of your parsing logic is shared: follow the patterns below and switching models needs no code changes. This page helps you get response handling right the first time. It covers what the models have in common first, then gives a single table of the few differences you must tolerate (none of which block integration).
The request side (base_url, auth, switching models) is covered in Compatible Mode Calls. This page is purely about the response side: how to parse what comes back.

Two modes, one endpoint

Both modes use the same /v1/chat/completions endpoint; only the stream flag changes the shape of the response:
| | stream: false (default) | stream: true |
|---|---|---|
| Shape | A single JSON object | An SSE stream (many data: lines) |
| Top-level type | chat.completion | chat.completion.chunk |
| Get the text | choices[0].message.content | Accumulate each choices[0].delta.content |
| Use case | Backends, batch jobs, full result | Chat UIs, token-by-token rendering |
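
The two shapes in the table above can be told apart by the top-level object field. The following is a minimal sketch, not part of any SDK: the helper name text_from_payload and the sample payloads are made up for illustration, and it assumes the payload has already been decoded from JSON into a dict.

```python
def text_from_payload(payload: dict) -> str:
    """Return the text carried by one decoded payload of either shape."""
    choices = payload.get("choices") or []
    if not choices:                       # nothing to read in this payload
        return ""
    first = choices[0]
    if payload.get("object") == "chat.completion.chunk":
        # Streaming: text arrives piece by piece under delta.content
        return (first.get("delta") or {}).get("content") or ""
    # Non-streaming: the whole answer sits under message.content
    return (first.get("message") or {}).get("content") or ""


# Illustrative payloads (hypothetical data, trimmed to the relevant fields)
full = {
    "object": "chat.completion",
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "1+1 equals 2."}}],
}
chunks = [
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "1"}}]},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0, "delta": {"content": "+1"}}]},
]

print(text_from_payload(full))
print("".join(text_from_payload(c) for c in chunks))
```

In practice you usually know which mode you requested, so the dispatch on object is mainly useful in shared logging or test code that sees both kinds of payload.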

Non-streaming response

Stable structure — just read choices[0].message.content:
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "gpt-4.1-mini",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "1+1 equals 2." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 31, "completion_tokens": 8, "total_tokens": 39 }
}
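With the official SDK, the same fields are attributes on the response object:
from openai import OpenAI

# Client setup (base_url, key) is covered in Compatible Mode Calls
client = OpenAI(base_url="https://api.apiyi.com/v1", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "What is 1+1?"}]
)
print(resp.choices[0].message.content)  # the assistant's reply text
print(resp.usage.total_tokens)          # total tokens reported for this call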
Non-streaming output is highly consistent across all major models — choices[0].message.content works everywhere. Some models (e.g. the OpenAI family) also add annotations and refusal on message; read them if you need them, ignore them otherwise.
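
One way to read the optional fields without depending on them is to treat every key other than content as possibly absent. This is a sketch over a raw decoded message dict; the helper name read_message and the sample messages are hypothetical.

```python
def read_message(message: dict) -> dict:
    """Read content plus the optional fields, tolerating their absence."""
    return {
        "content": message.get("content") or "",
        "refusal": message.get("refusal"),              # None when not sent
        "annotations": message.get("annotations") or [],
    }


# Illustrative messages: one with the extra fields, one without
with_extras = {"role": "assistant", "content": "1+1 equals 2.",
               "refusal": None, "annotations": []}
plain = {"role": "assistant", "content": "1+1 equals 2."}

print(read_message(with_extras) == read_message(plain))
```

Both messages normalize to the same result, so downstream code does not need to know which model family produced the response.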

Streaming response (SSE)

Streaming pushes chunks as Server-Sent Events, one per line as data: {...}, ending with data: [DONE]:
data: {"choices":[{"delta":{"content":"1"},"index":0}], ...}
data: {"choices":[{"delta":{"content":"+1"},"index":0}], ...}
data: {"choices":[{"delta":{},"finish_reason":"stop","index":0}], ...}
data: [DONE]
With the official SDK just iterate; the core is accumulating delta.content:
stream = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Write a short poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Integration notes: a few differences, handled uniformly

Streaming details vary slightly between models, but following the rules below lets one code path cover them all.
The final chunk's choices may be an empty array. On some models (gpt-4.1-mini, grok, qwen, glm), the last chunk, which carries usage, arrives with "choices":[]; indexing choices[0] on it raises an error. Check that choices is non-empty before reading it.
| Difference | What you see | Uniform handling |
|---|---|---|
| Final-chunk choices | Empty ([]) or non-empty | Check choices is non-empty before reading delta |
| finish_reason mid-value | Usually null; Claude uses "" (empty string) | Treat null and "" alike as "not finished"; a non-empty value such as "stop" marks the end |
| usage location | Empty-choices chunk / non-empty chunk / same chunk as stop | Check every chunk; record usage whenever present |
| Chunk granularity | Per-token (OpenAI) or per-sentence (Gemini/Claude) | Irrelevant; just accumulate |
| First role-declaration chunk | Some send an empty-content chunk declaring role | Skip when content is empty; don't treat as text |
| Vendor-private fields | obfuscation, system_fingerprint, first_token_return_time, etc. | Ignore; never depend on them |
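
The rows of the table above can be exercised offline, without a network call. The sketch below assumes chunks that are already decoded into dicts; the function name fold_chunks and the sample stream are invented for illustration and do not reproduce any particular model's output.

```python
def fold_chunks(chunks):
    """Fold decoded stream chunks into (text, usage, finish_reason)."""
    pieces, usage, finish = [], None, None
    for chunk in chunks:
        if chunk.get("usage"):            # usage may ride on any chunk
            usage = chunk["usage"]
        choices = chunk.get("choices") or []
        if not choices:                   # empty final chunk: nothing to index
            continue
        choice = choices[0]
        piece = (choice.get("delta") or {}).get("content")
        if piece:                         # skips role-only and empty-content chunks
            pieces.append(piece)
        if choice.get("finish_reason"):   # None and "" both mean "not finished"
            finish = choice["finish_reason"]
    return "".join(pieces), usage, finish


# Hypothetical stream mixing the variations from the table
sample = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""},
                  "finish_reason": ""}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello, "},
                  "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "world."},
                  "finish_reason": ""}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
     "obfuscation": "xyz"},               # vendor-private field, ignored
    {"choices": [],
     "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8}},
]

text, usage, finish = fold_chunks(sample)
print(text, finish, usage["total_tokens"])
```

Keeping the folding logic separate from the transport makes it easy to replay recorded streams from different models through the same function in tests.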

Robust reference parser

When you handle the raw SSE yourself (no SDK), this covers every difference above:
import json, requests

def stream_chat(model, messages, api_key):
    resp = requests.post(
        "https://api.apiyi.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        json={"model": model, "messages": messages, "stream": True},
        stream=True, timeout=300,
    )
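    resp.raise_for_status()             # fail fast on HTTP errors instead of parsing an error body
    resp.encoding = "utf-8"             # SSE is UTF-8; without a charset requests may assume ISO-8859-1
    text, usage = "", None
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):   # skip blank lines and non-data fields
            continue
        data = line[5:].strip()         # accept "data:" with or without a following space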
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):          # usage may appear in any chunk
            usage = chunk["usage"]
        choices = chunk.get("choices")
        if not choices:                 # final chunk may be empty; guard it
            continue
        delta = choices[0].get("delta", {})
        piece = delta.get("content")
        if piece:                       # skip role-only / empty-content chunks
            text += piece
            print(piece, end="", flush=True)
        # finish_reason == "stop" is just a marker; don't break (usage often follows)
    return text, usage
Reasoning models (grok, qwen, glm, etc.) first stream delta.reasoning_content (the chain of thought), then delta.content (the answer). The parser above reads only content, so the thinking is skipped automatically. To display the thinking, see Reasoning Model Output.
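
If you do want to keep the thinking apart from the answer, the same loop can collect both fields. This is only a sketch over already-decoded chunks; the helper name split_reasoning and the sample chunks are hypothetical, and the Reasoning Model Output page remains the reference for display details.

```python
def split_reasoning(chunks):
    """Collect the chain of thought and the answer from decoded chunks."""
    thinking, answer = [], []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:                   # guard the empty final chunk
            continue
        delta = choices[0].get("delta") or {}
        if delta.get("reasoning_content"):
            thinking.append(delta["reasoning_content"])
        if delta.get("content"):
            answer.append(delta["content"])
    return "".join(thinking), "".join(answer)


# Illustrative stream: reasoning first, then the answer
demo = [
    {"choices": [{"index": 0, "delta": {"reasoning_content": "Add the two numbers. "}}]},
    {"choices": [{"index": 0, "delta": {"reasoning_content": "1 + 1 = 2."}}]},
    {"choices": [{"index": 0, "delta": {"content": "1+1 equals 2."}}]},
    {"choices": []},
]

thinking, answer = split_reasoning(demo)
print(answer)
```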

Usage and billing

  • usage comes back inline in non-streaming responses; in streaming it arrives in a trailing chunk (location per the table above — “record whenever present”).
  • Field breakdowns differ: the OpenAI family adds completion_tokens_details, Gemini/Claude add input_tokens/output_tokens, reasoning models add reasoning_tokens. Rely on the three standard fields: prompt_tokens / completion_tokens / total_tokens.
Don’t trust the streamed total_tokens. In testing, a few models (e.g. gpt-5.4-mini) emit a trailing frame where total ≠ prompt + completion, while the same model is correct non-streaming. Bill from your account statement, not from that streamed frame.
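
The advice above can be captured in a small normalizer that keeps only the three standard fields and flags a frame whose total does not add up. This is a sketch; the name standard_usage and the consistent flag are made up for illustration, and the flag is a diagnostic for logging, not a billing source.

```python
STANDARD_FIELDS = ("prompt_tokens", "completion_tokens", "total_tokens")


def standard_usage(usage):
    """Keep the three standard fields and flag whether the total adds up."""
    if not usage:
        return None
    picked = {name: usage.get(name) for name in STANDARD_FIELDS}
    complete = all(isinstance(value, int) for value in picked.values())
    picked["consistent"] = complete and (
        picked["total_tokens"]
        == picked["prompt_tokens"] + picked["completion_tokens"]
    )
    return picked


# Values from the non-streaming example, plus a vendor-specific breakdown
ok = standard_usage({"prompt_tokens": 31, "completion_tokens": 8,
                     "total_tokens": 39,
                     "completion_tokens_details": {"reasoning_tokens": 0}})
# Hypothetical streamed frame whose total does not match the sum
odd = standard_usage({"prompt_tokens": 31, "completion_tokens": 8,
                      "total_tokens": 45})

print(ok["consistent"], odd["consistent"])
```

Vendor-specific breakdowns are dropped on purpose, so the result has the same keys regardless of which model produced the usage object.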