/v1/responses is OpenAI’s current flagship native endpoint. In OpenAI’s own words: “While Chat Completions remains supported, Responses is recommended for all new projects.” APIYI fully supports this endpoint — just point base_url at https://api.apiyi.com/v1. This page is based on the official OpenAI documentation (developers.openai.com/api/docs, as of June 2026). All examples are copy-paste ready.

Why Responses

Compared with Chat Completions, OpenAI cites three advantages:
  • Better reasoning: the same reasoning model scores about 3% higher on SWE-bench via Responses (reasoning state persists across turns)
  • Cheaper input: cache utilization is 40%–80% higher than Chat Completions (OpenAI internal testing), which directly cuts your input bill
  • More tools: built-in tools like web_search and code_interpreter are Responses-only
When Chat Completions is still the right choice: you rely on existing frameworks (LangChain and most clients default to /v1/chat/completions), or you want one codebase that also calls Claude, Gemini, and other non-OpenAI models — see Compatible Mode.
What is deprecated is the Assistants API, which is scheduled to shut down on August 26, 2026 (UTC). Chat Completions is not deprecated: both endpoints remain supported long-term, and new features simply land on Responses first.

Quick Start

curl https://api.apiyi.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-5.4",
    "input": "Introduce yourself in one sentence",
    "instructions": "You are a concise assistant"
  }'
Prefer response.output_text over hand-written output[0].content[0].text — for reasoning models, the first item in output is often a reasoning item, not a message, so hard-coded indexing breaks.
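
When you work with the raw JSON (for example, the body returned by the curl request above) instead of an SDK object, the same advice can be sketched as a small helper. This is a minimal sketch, assuming the response has the shape shown under Response Structure; the name extract_output_text is hypothetical and not part of any SDK.

```python
def extract_output_text(response: dict) -> str:
    """Concatenate every output_text part from every message item."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning, function_call, and other item types
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


# Sample payload: the first item is a reasoning item, so output[0] is not the reply
sample = {
    "output": [
        {"type": "reasoning", "summary": []},
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Hi! I'm an AI assistant."}],
        },
    ]
}
print(extract_output_text(sample))  # Hi! I'm an AI assistant.
```

Filtering by item type rather than by position keeps the helper working whether or not a reasoning item comes first.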

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | required | e.g. gpt-5.4, gpt-5.5 |
| input | string / array | required | User input; supports multimodal content arrays |
| instructions | string | null | System instructions (system prompt equivalent) |
| max_output_tokens | int | null | Output cap (includes reasoning tokens) |
| reasoning | object | medium | {"effort": "none/low/medium/high/xhigh"} |
| text | object | | format (output format), verbosity (low/medium/high) |
| tools | array | [] | Functions + built-in tools |
| tool_choice | string | "auto" | auto / required / none / a specific tool |
| parallel_tool_calls | boolean | true | Allow parallel tool calls |
| store | boolean | true | Keep the response object server-side (30 days) |
| previous_response_id | string | null | Chain to a prior response; context carries over |
| conversation | string | null | Bind to a persistent /v1/conversations object |
| background | boolean | false | Async background execution (long tasks / Pro models) |
| stream | boolean | false | Streaming (semantic events) |
| prompt_cache_key | string | null | Cache routing key; see Cache Billing |
| metadata | object | | Custom metadata |
gpt-5 series reasoning models do not support temperature / top_p — passing them raises an error. Use reasoning.effort and text.verbosity instead.
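
One way to keep legacy sampling parameters out of a request is to build the payload through a small guard. The sketch below is an illustration only: build_request is a hypothetical helper, and the "gpt-5" name-prefix check is an assumed heuristic for detecting gpt-5 series models, not an official rule.

```python
VALID_EFFORTS = {"none", "low", "medium", "high", "xhigh"}
VALID_VERBOSITY = {"low", "medium", "high"}
UNSUPPORTED_ON_GPT5 = ("temperature", "top_p")


def build_request(model, user_input, effort="medium", verbosity="medium", **extra):
    """Return keyword arguments suitable for client.responses.create(**kwargs)."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    if verbosity not in VALID_VERBOSITY:
        raise ValueError(f"unknown verbosity: {verbosity!r}")

    params = dict(extra)
    if model.startswith("gpt-5"):  # assumed heuristic, see lead-in
        for name in UNSUPPORTED_ON_GPT5:
            params.pop(name, None)  # drop parameters that would raise an error

    params.update(
        model=model,
        input=user_input,
        reasoning={"effort": effort},
        text={"verbosity": verbosity},
    )
    return params


request = build_request(
    "gpt-5.4", "Explain closures", effort="low", verbosity="low", temperature=0.2
)
print(sorted(request))  # ['input', 'model', 'reasoning', 'text']
```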

Response Structure

output is an array of items. The three common types: reasoning (reasoning summary), message (text reply), and function_call (a function call request). A trimmed example:
{
  "id": "resp_abc123",
  "object": "response",
  "status": "completed",
  "model": "gpt-5.4-2026-03-05",
  "output": [
    { "type": "reasoning", "summary": [] },
    {
      "type": "message",
      "role": "assistant",
      "content": [{ "type": "output_text", "text": "Hi! I'm an AI assistant." }]
    }
  ],
  "usage": {
    "input_tokens": 24,
    "input_tokens_details": { "cached_tokens": 0 },
    "output_tokens": 58,
    "output_tokens_details": { "reasoning_tokens": 40 },
    "total_tokens": 82
  }
}
Two usage fields worth watching:
  • input_tokens_details.cached_tokens: input that hit the cache (billed at 0.1×)
  • output_tokens_details.reasoning_tokens: reasoning spend (billed at the output rate; tune with reasoning.effort)
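
The two fields above are enough for a rough cost estimate. The sketch below assumes that reasoning_tokens are already counted inside output_tokens (consistent with the example, where total_tokens equals input_tokens plus output_tokens) and that cached input is billed at 0.1× the input rate. The function name and its parameters are hypothetical; the prices used are the gpt-5.4 rates from the pricing table on this page.

```python
def estimate_cost_usd(usage, input_price_per_m, output_price_per_m, cached_multiplier=0.1):
    """Estimate the cost of one response from its usage object (prices per 1M tokens)."""
    cached = usage.get("input_tokens_details", {}).get("cached_tokens", 0)
    uncached = usage["input_tokens"] - cached
    input_cost = (uncached + cached * cached_multiplier) * input_price_per_m / 1_000_000
    # reasoning tokens are part of output_tokens, so they are not added separately
    output_cost = usage["output_tokens"] * output_price_per_m / 1_000_000
    return input_cost + output_cost


usage = {
    "input_tokens": 24,
    "input_tokens_details": {"cached_tokens": 0},
    "output_tokens": 58,
    "output_tokens_details": {"reasoning_tokens": 40},
    "total_tokens": 82,
}
cost = estimate_cost_usd(usage, input_price_per_m=2.50, output_price_per_m=15.00)
print(f"{cost:.6f}")  # 0.000930
```

In this example 40 of the 58 output tokens are reasoning tokens, so most of the output spend comes from reasoning; lowering reasoning.effort is the lever that reduces it.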

State Management: Three Options

1. previous_response_id chaining

The simplest multi-turn approach — no manual history assembly:
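from openai import OpenAI

# Point the SDK at APIYI; replace YOUR_API_KEY with your own key
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.apiyi.com/v1")

r1 = client.responses.create(
    model="gpt-5.4",
    input="My name is Alice. Please remember it."
)

# Chain the second turn to the first by passing its id
r2 = client.responses.create(
    model="gpt-5.4",
    input="What's my name?",
    previous_response_id=r1.id
)

print(r2.output_text)  # Answers "Alice"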

2. conversation objects

Create a conversation object and attach requests to the same conversation. Not subject to the 30-day response retention window — good for long-lived sessions.
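
A minimal sketch of that flow with the OpenAI Python SDK follows. It assumes a recent SDK version that exposes client.conversations.create() and that the conversation id is passed through the conversation parameter listed under Request Parameters; start_conversation and ask are hypothetical helper names. Whether conversation objects are passed through on the APIYI channel should be confirmed by testing.

```python
def start_conversation(client):
    """Create a persistent conversation object and return its id."""
    return client.conversations.create().id


def ask(client, conversation_id, text, model="gpt-5.4"):
    """Send one turn inside an existing conversation and return the reply text."""
    response = client.responses.create(
        model=model,
        input=text,
        conversation=conversation_id,  # binds the request to the conversation
    )
    return response.output_text


# Usage, assuming `client` is an OpenAI SDK client pointed at https://api.apiyi.com/v1:
#   conv_id = start_conversation(client)
#   ask(client, conv_id, "My name is Alice. Please remember it.")
#   print(ask(client, conv_id, "What's my name?"))
```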

3. The store flag

store defaults to true: response objects are retained server-side for 30 days so previous_response_id can reference them. For data-residency-sensitive workloads, pass store: false explicitly — but that response can no longer be chained.
Chaining does not reduce input billing: all prior context pulled in via previous_response_id is still billed as input tokens in full. Long conversations save money through cache discounts (the historical prefix auto-hits the 0.1× cache rate), not through chaining itself — see Cache Billing.
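
The arithmetic behind this can be sketched with made-up numbers. The helper below is hypothetical and deliberately optimistic: it assumes every token of prior context hits the cache at 0.1×, which real traffic will not always achieve. Each turn is given as (user tokens, assistant tokens).

```python
def chained_input_tokens(turns, cached_multiplier=0.1):
    """Return (input tokens counted in full, token-equivalents after cache discount)."""
    history = 0        # context carried in by previous_response_id
    full = 0           # what is billed if nothing hits the cache
    discounted = 0.0   # what is billed if the whole history hits the cache
    for user_tokens, assistant_tokens in turns:
        full += history + user_tokens
        discounted += history * cached_multiplier + user_tokens
        history += user_tokens + assistant_tokens
    return full, discounted


# Three turns, each with 100 user tokens and 200 assistant tokens (illustrative values)
full, discounted = chained_input_tokens([(100, 200)] * 3)
print(full, round(discounted))  # 1200 390
```

Only 300 tokens of new user input were sent, yet 1200 input tokens are counted because the history is re-read on every turn; the saving comes from the cache rate applied to that history.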

Reasoning and Output Controls

Choosing reasoning.effort

| Level | When to use |
| --- | --- |
| none | Simple Q&A and format conversion; fast and cheap |
| low | Routine chat, summaries |
| medium (default) | Balanced choice for everyday development |
| high | Complex code, multi-step reasoning |
| xhigh | The hardest problems, with gpt-5.5 / gpt-5.4 |
response = client.responses.create(
    model="gpt-5.5",
    input="Prove that the square root of 2 is irrational",
    reasoning={"effort": "xhigh"}
)

text.verbosity

low / medium (default) / high controls answer length. Responses-only:
response = client.responses.create(
    model="gpt-5.4",
    input="Explain closures",
    text={"verbosity": "low"}  # short version
)

Streaming

Responses streams semantic events, not the generic choices[0].delta chunks of Chat Completions. Core events:
| Event | Meaning |
| --- | --- |
| response.created | Response started |
| response.output_item.added | A new output item (message / function_call / …) |
| response.output_text.delta | Text increment |
| response.function_call_arguments.delta | Function-argument increment |
| response.completed | Done (includes final usage) |
| error | Failure |
stream = client.responses.create(
    model="gpt-5.4",
    input="Write a short poem about autumn",
    stream=True
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.completed":
        print("\n\nUsage:", event.response.usage)
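
If you need the full text and the final usage as return values rather than printed output, the same loop can be wrapped in a function. This is a sketch, not SDK functionality: collect_stream is a hypothetical name, and reading a message attribute from the error event is an assumption (the code falls back to the event itself if the attribute is absent).

```python
def collect_stream(events):
    """Consume a Responses event stream; return (full_text, usage)."""
    chunks = []
    usage = None
    for event in events:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event.type == "response.completed":
            usage = event.response.usage  # final usage arrives with this event
        elif event.type == "error":
            detail = getattr(event, "message", event)  # assumed attribute name
            raise RuntimeError(f"stream failed: {detail}")
    return "".join(chunks), usage


# Usage, assuming `stream` was created with stream=True as shown above:
#   text, usage = collect_stream(stream)
```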

Built-in Tools

Built-in tools are a Responses-only capability — declare them in tools and OpenAI executes them server-side:
| Tool | type | Description |
| --- | --- | --- |
| Web search | web_search | Model searches the web autonomously |
| File search | file_search | Query uploaded vector stores |
| Code interpreter | code_interpreter | Run Python in a sandbox |
| Computer use | computer_use | Drive a virtual desktop |
| Remote MCP | mcp | Connect to remote MCP servers |
| Image generation | image_generation | Inline image generation |
| Tool search | tool_search | Dynamic retrieval over large tool sets (gpt-5.4 and later) |
Minimal web_search example:
response = client.responses.create(
    model="gpt-5.4",
    input="What are today's major AI news stories?",
    tools=[{"type": "web_search"}]
)
print(response.output_text)
Built-in tools execute on OpenAI’s side; pass-through support per tool on the APIYI channel should be confirmed by testing. Custom function calling is fully supported — see Function Calling.

Pro Models and Background Mode

gpt-5.4-pro and gpt-5.5-pro are deep-reasoning models for professional workloads ($30 / $180 per million tokens, svip group only) and are, in practice, available via /v1/responses only. A single request can take minutes — pair them with background: true:
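import time

# Submit a background task
response = client.responses.create(
    model="gpt-5.4-pro",
    input="Do a deep review of this architecture proposal: ...",
    background=True
)

# Poll until the task leaves the queued / in_progress states
while response.status in ("queued", "in_progress"):
    time.sleep(10)
    response = client.responses.retrieve(response.id)

# Only a completed response carries a usable answer
if response.status == "completed":
    print(response.output_text)
else:
    print("Background task ended with status:", response.status)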
Pro models are expensive and slow — the trade is “minutes of waiting for a more reliable answer”. For everyday development use gpt-5.4 / gpt-5.5; don’t reach for Pro without a clear deep-reasoning need.
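
Because a Pro request can run for minutes, an unbounded polling loop can hang a worker. The sketch below adds a deadline to the polling pattern; wait_for_response, its default interval and timeout, and the injectable sleep and clock parameters are all hypothetical choices made for illustration and testability.

```python
import time

PENDING_STATUSES = ("queued", "in_progress")


def wait_for_response(client, response_id, interval=10.0, timeout=1800.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll a background response until it leaves the pending states or times out."""
    deadline = clock() + timeout
    while True:
        response = client.responses.retrieve(response_id)
        if response.status not in PENDING_STATUSES:
            return response  # caller should still check for "completed"
        if clock() >= deadline:
            raise TimeoutError(f"{response_id} still {response.status} after {timeout}s")
        sleep(interval)


# Usage, assuming `client` and a background `response` as in the example above:
#   final = wait_for_response(client, response.id)
#   if final.status == "completed":
#       print(final.output_text)
```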

Supported Models and Pricing

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
| --- | --- | --- | --- |
| gpt-5.4 | $2.50 | $15.00 | Current workhorse, 1M context |
| gpt-5.4-mini | $0.75 | $4.50 | Lightweight, great value |
| gpt-5.5 | $5.00 | $30.00 | Flagship, complex reasoning |
| gpt-5.2 | $1.75 | $14.00 | Previous workhorse |
| gpt-5.1 / gpt-5 | $1.25 | $10.00 | Budget-friendly |
| gpt-5.4-pro | $30.00 | $180.00 | svip only, responses only, professional use |
| gpt-5.5-pro | $30.00 | $180.00 | svip only, responses only, professional use |
Pinned date versions (e.g. gpt-5.4-2026-03-05) are also available at the same price. Full list: Models & Pricing.

Mapping from Chat Completions

Field mapping when migrating from /v1/chat/completions:
| Chat Completions | Responses | Notes |
| --- | --- | --- |
| messages array | input | Plain string works for simple cases |
| system message | instructions | Standalone parameter |
| max_tokens / max_completion_tokens | max_output_tokens | |
| response_format | text.format | |
| top-level reasoning_effort | reasoning.effort | Nested object in Responses |
| choices[0].message.content | output_text | Reading the result |
| Stateless, manual history | previous_response_id / conversation | Stateful |
| usage.prompt_tokens | usage.input_tokens | Different field names |
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "You are a concise assistant"},
        {"role": "user", "content": "Hello"}
    ]
)
content = response.choices[0].message.content
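
The first two rows of the mapping can be sketched as a conversion helper for the Chat Completions call above. chat_to_responses_kwargs is a hypothetical name, and the sketch assumes plain-text system, user, and assistant messages only; tool messages and multimodal content would need their own handling.

```python
def chat_to_responses_kwargs(model, messages):
    """Translate a Chat Completions messages array into Responses keyword arguments."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    kwargs = {"model": model}
    if system_parts:
        kwargs["instructions"] = "\n\n".join(system_parts)  # system message -> instructions
    if len(turns) == 1 and turns[0]["role"] == "user":
        kwargs["input"] = turns[0]["content"]  # plain string for the simple case
    else:
        kwargs["input"] = turns  # keep multi-turn history as an array
    return kwargs


kwargs = chat_to_responses_kwargs(
    "gpt-5.4",
    [
        {"role": "system", "content": "You are a concise assistant"},
        {"role": "user", "content": "Hello"},
    ],
)
print(kwargs)
# {'model': 'gpt-5.4', 'instructions': 'You are a concise assistant', 'input': 'Hello'}

# With an SDK client, the migrated call would then read:
#   response = client.responses.create(**kwargs)
#   content = response.output_text
```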

Troubleshooting

| Symptom | Cause and fix |
| --- | --- |
| model_not_supported error | The model doesn’t support the responses endpoint; use the gpt-5 series |
| previous_response_id reference fails | The prior response used store: false, or the 30-day retention expired |
| output_text is empty | The output is all function_call items (the model wants tools run); iterate over output |
| Error when passing temperature | Unsupported on gpt-5 reasoning models; remove it and use reasoning.effort |
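
For the empty output_text case, iterating over output can be sketched as follows. The helper works on the raw JSON form of a response; pending_function_calls is a hypothetical name, and the call_id, name, and arguments fields (with arguments as a JSON string) are assumptions about the function_call item shape that this page does not show, so check them against a real response.

```python
import json


def pending_function_calls(response: dict) -> list:
    """Collect the function_call items the model is asking the caller to run."""
    calls = []
    for item in response.get("output", []):
        if item.get("type") != "function_call":
            continue
        calls.append({
            "call_id": item.get("call_id"),  # assumed field name
            "name": item.get("name"),        # assumed field name
            "arguments": json.loads(item.get("arguments") or "{}"),
        })
    return calls


# Illustrative payload; get_weather is a made-up function name
sample_response = {
    "output": [
        {"type": "reasoning", "summary": []},
        {"type": "function_call", "call_id": "call_1", "name": "get_weather",
         "arguments": "{\"city\": \"Paris\"}"},
    ]
}
for call in pending_function_calls(sample_response):
    print(call["name"], call["arguments"])  # get_weather {'city': 'Paris'}
```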