If you’re using Claude Code, Cline, Cursor, or hand-rolling your own Claude API calls, prompt caching is the single biggest knob for lowering your bill: cached input tokens are billed at just 0.1×, a 90% discount. This page is based on Anthropic’s official documentation
(docs.claude.com/en/docs/build-with-claude/prompt-caching) and adapted to API易’s setup with copy-paste-ready examples.
In one sentence
Mark a long, reused prompt prefix (system instructions / a long document / few-shot examples) with `cache_control`. The server stores it; on the next request with the same prefix, it skips reprocessing, making those tokens roughly 10× cheaper and the response faster. The cache expires after a period of inactivity.
Why bother — read the multipliers
Relative to the model’s base input token price (1×):
| Type | Price | Notes |
|---|---|---|
| Plain input | 1× | Whatever isn’t cached, full price |
| Cache write (5-min TTL) | 1.25× | First-time write costs 25% more |
| Cache write (1-hour TTL) | 2× | Pay more to store longer |
| Cache read (hit) | 0.1× | The whole point. 90% off thereafter |
- 5-min TTL: breaks even after a single reuse (1.25 + 0.1 = 1.35, cheaper than 2.0 for two uncached requests).
- 1-hour TTL: breaks even after two reuses (2 + 0.2 = 2.2, cheaper than 3.0 for three uncached requests).
TTL is a sliding window: every cache hit resets the expiration timer, so active conversations don’t expire from under you. Only true idleness beyond the TTL causes eviction.
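If you want to sanity-check that break-even arithmetic yourself, here is a tiny sketch; the prefix size and reuse counts are made-up numbers, and costs are expressed in multiples of the base input-token price rather than dollars:

```python
# Back-of-the-envelope break-even check, in units of the base input price (1x).
PREFIX_TOKENS = 10_000  # made-up prefix size

def cached_cost(reuses: int, write_multiplier: float) -> float:
    # one cache write, then `reuses` cache reads at 0.1x
    return PREFIX_TOKENS * (write_multiplier + 0.1 * reuses)

def uncached_cost(reuses: int) -> float:
    # every request pays full price for the same prefix
    return PREFIX_TOKENS * (1 + reuses)

for reuses in (1, 2, 5):
    print(f"{reuses} reuse(s): "
          f"5-min {cached_cost(reuses, 1.25):.0f} vs {uncached_cost(reuses):.0f} uncached | "
          f"1-hour {cached_cost(reuses, 2.0):.0f} vs {uncached_cost(reuses):.0f} uncached")
# After 1 reuse the 5-min cache is already cheaper (13500 vs 20000);
# the 1-hour cache catches up at 2 reuses (22000 vs 30000).
```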
Good fit
- The same long system prompt called many times (agents, chatbots)
- Multi-turn conversations (every prior turn becomes reusable prefix)
- Batch processing of one document (asking 50 questions about one contract)
- RAG, where stable retrieved chunks form the prefix
Bad fit
- Every prompt differs from the first character onward
- The whole thing is short and never crosses the per-model minimum (below)
The three hard requirements
All three are mandatory.

1. Explicit `cache_control` marker
content cannot be a plain string. It must be a content block array, with cache_control attached to the block you want cached:
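A minimal sketch of what that looks like with the official `anthropic` Python SDK; the model ID and `system_prompt.txt` are placeholders, so substitute whatever you actually use:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = open("system_prompt.txt").read()  # placeholder: your stable instructions

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,              # the long, reusable prefix
            "cache_control": {"type": "ephemeral"},  # everything up to this block gets cached
        }
    ],
    messages=[{"role": "user", "content": "First question about the cached material"}],
)
```

The same marker can also sit on a content block inside `messages`, which is what the document-Q&A example further down does.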
2. Length must clear the per-model minimum
If the content is shorter than the model’s minimum, it won’t be cached even with the marker (no error, just silently skipped). Verified against Anthropic’s official docs:

| Model | Minimum tokens |
|---|---|
| Claude Sonnet 4.5 | 1,024 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Opus 4.5 / 4.6 / 4.7 | 4,096 |
| Claude Haiku 4.5 | 4,096 |
3. Prefix must match byte-for-byte
Caching is prefix-based: from the start of the request up to the `cache_control` marker, the byte stream must be identical to the previous request. Any single character change (whitespace, JSON key ordering, a timestamp) counts as a new prefix and triggers a fresh write instead of a hit.
Practical rule: stable stuff up front, volatile stuff at the back.
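As a concrete illustration of that rule, here is a hedged sketch of a request body; `STABLE_INSTRUCTIONS`, the model ID, and the file path are placeholders:

```python
from datetime import datetime, timezone

STABLE_INSTRUCTIONS = open("system_prompt.txt").read()  # byte-for-byte identical on every call

request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model ID
    "max_tokens": 512,
    "system": [{
        "type": "text",
        "text": STABLE_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"},  # the cached prefix ends here
    }],
    "messages": [{
        "role": "user",
        # Volatile material (timestamps, user IDs, the actual question) stays after the marker,
        # so it can change freely without invalidating the cached prefix.
        "content": f"Current UTC time: {datetime.now(timezone.utc).isoformat()}\n\nWhat changed since yesterday?",
    }],
}
```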
Minimal runnable example
Send two requests using the same long document but different questions. The first writes the cache, the second hits it:
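A minimal sketch with the official `anthropic` Python SDK; `contract.txt` and the model ID are placeholders, and the document is assumed to be well above the model’s minimum:

```python
import anthropic

client = anthropic.Anthropic()
long_document = open("contract.txt").read()  # placeholder document, well above the minimum

def ask(question: str):
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model ID
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": long_document,
                    "cache_control": {"type": "ephemeral"},  # cache everything up to this block
                },
                {"type": "text", "text": question},  # the volatile part goes after the marker
            ],
        }],
    )
    u = response.usage
    print(f"{question!r}: write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")
    return response

ask("Summarize the termination clause.")  # first call: cache write
ask("What are the payment terms?")        # second call: cache read
```

On the second call, `cache_read_input_tokens` ≈ the first call’s `cache_creation_input_tokens`: the same prefix is being reused.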
How to tell whether you hit — three usage fields
In every response, `usage` reports:
| Field | Meaning | Billing multiplier |
|---|---|---|
| `input_tokens` | Uncached input tokens | 1× |
| `cache_creation_input_tokens` | Tokens written to cache this call | 1.25× or 2× |
| `cache_read_input_tokens` | Tokens read from cache this call | 0.1× |
If `cache_read_input_tokens` > 0, you’re saving money.
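If you want to turn those three fields into a number, a rough helper (in multiples of the base input price, not dollars, and assuming the 5-minute write multiplier) might look like:

```python
def effective_input_cost(usage, write_multiplier: float = 1.25) -> float:
    """Blended input cost in units of the base input-token price (1x)."""
    return (usage.input_tokens * 1.0
            + (usage.cache_creation_input_tokens or 0) * write_multiplier
            + (usage.cache_read_input_tokens or 0) * 0.1)

def naive_input_cost(usage) -> float:
    """What the same input tokens would cost with no caching at all."""
    return (usage.input_tokens
            + (usage.cache_creation_input_tokens or 0)
            + (usage.cache_read_input_tokens or 0))

# Example: pass `response.usage` from any Messages API response and compare the two.
```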
Most common pitfalls
| Symptom | Cause |
|---|---|
| Write always 0 or field missing | No `cache_control` marker / below the minimum threshold / used OpenAI-compatible format |
| 2nd request still has write > 0 and read = 0 | Prefix changed. Common culprits: `datetime.now()`, UUIDs, rotating user IDs in the prompt; non-deterministic JSON serialization; timestamped system prompts |
| It worked, then a while later writes again | Idle beyond TTL. Use {"type": "ephemeral", "ttl": "1h"} for longer retention |
| Same prompt, different model — no hit | Caches are isolated per model. Switching models = new cache key |
| Recent turns in a long conversation don’t hit | Max 4 cache_control breakpoints per request; each breakpoint only looks back 20 content blocks for prior cache entries |
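One pitfall from the table deserves a concrete fix: if structured data is embedded in the cached prefix, serialize it deterministically so the bytes are identical on every call. A minimal sketch (the data itself is made up):

```python
import json

context_data = {"policy_version": 3, "regions": ["eu", "us"]}  # made-up example data

# sort_keys + fixed separators -> the same dict always serializes to the same string,
# so the cached prefix stays byte-for-byte identical across requests.
stable_json = json.dumps(context_data, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
```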
Advanced: multi-turn conversations
Place `cache_control` on the last content block of the most recent user message. Each new turn auto-extends the cached read range up to the end of the previous turn:
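A hedged sketch of one way to manage this with the official `anthropic` SDK; the model ID and `system_prompt.txt` are placeholders. Each turn moves the marker to the newest user message and strips it from older turns, so the request never exceeds the breakpoint limit noted below:

```python
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = open("system_prompt.txt").read()  # stable prefix, assumed above the minimum
history: list[dict] = []

def send(user_text: str):
    # Keep only one breakpoint inside `messages`: strip markers from older user turns...
    for msg in history:
        if msg["role"] == "user":
            for block in msg["content"]:
                block.pop("cache_control", None)
    # ...and put the marker on the last content block of the newest user message.
    history.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model ID
        max_tokens=512,
        system=[{"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=history,
    )
    history.append({"role": "assistant", "content": response.content})
    u = response.usage
    print(f"write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")
    return response

send("First question")      # writes: system prompt + turn 1
send("Follow-up question")  # reads the prior turns from cache, writes only the new turn
```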
- At most 4 `cache_control` breakpoints per request.
- Each breakpoint’s prefix lookup window is at most 20 content blocks back; anything older than that won’t be considered for a hit. In other words, in very long conversations, marking only the latest turn won’t cover the entire prior history.
On API易 and caching
API易 forwards cache fields end-to-end. The `cache_control` you send is passed through to upstream Claude (AWS Claude or Claude Official) as-is, and the returned `cache_creation_input_tokens` / `cache_read_input_tokens` are passed straight back to you; no special adaptation is needed in your code. To verify it end-to-end:
- On the first request, `usage.cache_creation_input_tokens > 0` (write succeeded).
- Within seconds, send the same prefix again; you should see `usage.cache_read_input_tokens > 0` (hit).
- Your billing dashboard will itemize cache writes and cache reads separately, at the same official multipliers (1.25× / 2× / 0.1×).
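A hedged sketch of pointing the SDK at API易; the exact base URL is an assumption here, so check API易’s own docs for the endpoint to use:

```python
import anthropic

client = anthropic.Anthropic(
    api_key="sk-...",                  # your API易 token from https://api.apiyi.com/token
    base_url="https://api.apiyi.com",  # assumed gateway endpoint; verify against API易's docs
)
# cache_control and the cache_* usage fields then flow through exactly as in the examples above.
```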
Recap
1. Mark it: `cache_control: {"type": "ephemeral"}` on a content block; plain-string content is never cached.
2. Long enough: Sonnet 4.6 ≥ 2,048 tokens; Opus 4.x / Haiku 4.5 ≥ 4,096 tokens; otherwise it is silently skipped.
3. Stable prefix: stable up front, volatile at the back; one character of drift kills the hit.
4. Check usage: only `cache_read_input_tokens > 0` proves you actually saved money.
Related links
- Parent page: Claude API Basics
- Get / manage tokens: https://api.apiyi.com/token
- Anthropic official docs: docs.claude.com/en/docs/build-with-claude/prompt-caching