developers.openai.com/api/docs/guides/prompt-caching, as of June 2026), with examples adapted for APIYI.
The One-Sentence Version
Whenever a request’s opening segment (prefix) exactly matches a recent request and is at least 1024 tokens long, the server skips reprocessing it: the matched portion bills at 0.1× and latency drops by up to 80%. The two biggest differences from Claude’s caching:- No markers: there is no
cache_control— caching kicks in automatically when conditions are met - No write fee: Claude charges 1.25× / 2× to write; OpenAI writes for free
Why Bother — the Billing Multipliers
With the model’s raw input token price as 1×:| Type | Price | Notes |
|---|---|---|
| Regular input | 1× | The unmatched portion, full price |
| Cache write | 0× (free) | Happens automatically, costs nothing |
| Cache hit | 0.1× | Matched portion is 90% off |
| Model | Regular input | Cache hit |
|---|---|---|
gpt-5.4 | $2.50 | $0.25 |
gpt-5.4-mini | $0.75 | $0.075 |
gpt-5.5 | $5.00 | $0.50 |
gpt-5.1 / gpt-5 | $1.25 | $0.125 |
Good fits
- A long system prompt + tool definitions reused across calls (agents, support bots)
- Multi-turn conversations (each new turn auto-hits on all prior history)
- Batch-processing one document (asking 50 questions about one contract)
- RAG with stable document chunks placed at the front of the prompt
Poor fits
- Requests that differ from the very first character
- Prompts under 1024 tokens total (below the caching threshold)
Three Hard Conditions for a Hit
All three are required.1. Prefix of at least 1024 tokens
Requests shorter than 1024 tokens are never cached (no error — it silently doesn’t apply). Beyond 1024, hits extend in 128-token increments: the matched length lands on steps like 1024, 1152, 1280 …, socached_tokens typically reads slightly below your full stable prefix. That’s normal.
2. Byte-for-byte identical prefix
Caching is prefix matching: comparison starts at the first character and stops at the first difference. Any change — a timestamp, a username, JSON key order — makes everything after it bill at full price. Practical rule: stable content first, volatile content last.3. Reuse within the retention window
- Base retention: evicted after 5–10 minutes idle, at most 1 hour
- Since May 29, 2026 (UTC), gpt-5.1 and later models (including pro variants) default to 24-hour extended retention (
prompt_cache_retention: "24h") for non-ZDR organizations, at no extra cost — same-day reuse essentially always hits
Minimal Working Example
Send the same long prefix twice with different questions — the first write is automatic, the second hits:cached is close to the system prompt length (rounded to 128) — that portion bills at 10%.
The
/v1/responses endpoint caches automatically too; the field is usage.input_tokens_details.cached_tokens. OpenAI’s internal testing shows cache utilization on Responses runs 40%–80% higher than Chat Completions — for multi-turn agents, prefer Native Calls.Did It Hit? Read the usage Fields
| Endpoint | Hit field |
|---|---|
/v1/chat/completions | usage.prompt_tokens_details.cached_tokens |
/v1/responses | usage.input_tokens_details.cached_tokens |
cached_tokens > 0 means you’re saving: that portion bills at 0.1×, and the remaining prompt_tokens - cached_tokens bills at full price.
Advanced: Raising Your Hit Rate
prompt_cache_key routing
A hit requires the request to land on the same cache machine. Default prefix-hash routing usually suffices, but when many users share similar prefixes or concurrency is high, an explicitprompt_cache_key noticeably improves hit rates:
Engineering a stable prefix
- Keep tool definition order and JSON serialization fixed (don’t let your serializer randomize key order)
- Image inputs participate in prefix matching too — keep the URL / base64 and
detailparameter identical when reusing - To vary available tools per scenario, use
allowed_toolsto restrict the subset instead of editing thetoolslist — the former doesn’t break the cache prefix
Multi-turn chats hit for free
An append-only messages array naturally satisfies prefix stability: each turn’s history is the previous turn’s full prefix. Hits happen automatically with no extra work.Common Pitfalls
| Symptom | Cause |
|---|---|
cached_tokens always 0 | Under 1024 tokens total / dynamic content at the start of the prefix (timestamps, UUIDs, random IDs) |
| Intermittent hits | High concurrency without split prompt_cache_key / idle past retention |
| Hit count lower than expected | 128-token step truncation (normal) / dynamic content leaked into the middle of the prefix |
| No hits after switching models | Caches are isolated per model — gpt-5.4 and gpt-5.4-mini don’t share |
| No cached_tokens when calling Claude | OpenAI-compatible calls to Claude can’t use Claude’s cache — use Claude Native Calls |
OpenAI vs. Claude Caching at a Glance
| OpenAI (gpt-5 series) | Claude | |
|---|---|---|
| Trigger | Fully automatic, zero code | Manual cache_control markers |
| Write fee | Free | 1.25× (5 min) / 2× (1 hour) |
| Hit price | 0.1× | 0.1× |
| Minimum threshold | 1024 tokens | 1024–4096 tokens by model |
| Retention | From 5 min; 24h default on gpt-5.1+ | 5 min / 1 hour (sliding renewal) |
| Field to watch | cached_tokens | cache_read_input_tokens |
APIYI and Caching
The APIYI OpenAI channel supports cache hits. Requests are forwarded upstream as-is, the
cached_tokens field is returned to you untouched, and the billing dashboard lists the matched portion as a separate “cache read” line item at the official 0.1× rate — no middleware-specific adaptation needed in your code.- Build a stable prefix of at least 1024 tokens and send 2 requests back to back
- The 2nd response should show
cached_tokens > 0 - In the call logs, the 2nd request’s input cost should be visibly lower than the 1st
Key Takeaways
1. Fully automatic
No markers, no write fee — caching applies automatically and the 2nd use is pure savings.
2. Long enough
At least 1024 tokens of prefix to start caching; hits count in 128-token steps.
3. Stable prefix
Stable content first, volatile content last; keep timestamps and random IDs out of the opening.
4. Watch usage
Only
cached_tokens > 0 proves a hit — that portion bills at 10%.Related Links
- This group: Native Calls · Compatible Mode · Function Calling
- Claude-side caching: Claude Cache Billing Guide
- Get / manage tokens:
https://api.apiyi.com/token - Official OpenAI docs:
developers.openai.com/api/docs/guides/prompt-caching