If you’re using Claude Code, Cline, Cursor, or hand-rolling your own Claude API calls, prompt caching is the single biggest knob for lowering your bill: cached input tokens are billed at just 0.1×, a 90% discount. This page is based on Anthropic’s official documentation
(docs.claude.com/en/docs/build-with-claude/prompt-caching) and adapted to API易’s setup with copy-paste-ready examples.
In one sentence
Mark a long, reused prompt prefix (system instructions / a long document / few-shot examples) with `cache_control`. The server stores it; on the next request with the same prefix, it skips reprocessing, making those tokens roughly 10× cheaper and the response faster. The cache expires after a period of inactivity.
Why bother — read the multipliers
Relative to the model’s base input token price (1×):
| Type | Price | Notes |
|---|---|---|
| Plain input | 1× | Whatever isn’t cached, full price |
| Cache write (5-min TTL) | 1.25× | First-time write costs 25% more |
| Cache write (1-hour TTL) | 2× | Pay more to store longer |
| Cache read (hit) | 0.1× | The whole point. 90% off thereafter |
- 5-min TTL: breaks even after a single reuse (1.25 + 0.1 = 1.35, cheaper than 2.0 for two uncached requests).
- 1-hour TTL: breaks even after two reuses (2 + 0.2 = 2.2, cheaper than 3.0 for three uncached requests).
TTL is a sliding window: every cache hit resets the expiration timer, so active conversations don’t expire from under you. Only true idleness beyond the TTL causes eviction.
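If you want to sanity-check that break-even arithmetic yourself, here is a tiny sketch; the prefix size and reuse counts are made-up numbers, and costs are expressed in multiples of the base input-token price rather than dollars:

```python
# Back-of-the-envelope break-even check, in units of the base input price (1x).
PREFIX_TOKENS = 10_000  # made-up prefix size

def cached_cost(reuses: int, write_multiplier: float) -> float:
    # one cache write, then `reuses` cache reads at 0.1x
    return PREFIX_TOKENS * (write_multiplier + 0.1 * reuses)

def uncached_cost(reuses: int) -> float:
    # every request pays full price for the same prefix
    return PREFIX_TOKENS * (1 + reuses)

for reuses in (1, 2, 5):
    print(f"{reuses} reuse(s): "
          f"5-min {cached_cost(reuses, 1.25):.0f} vs {uncached_cost(reuses):.0f} uncached | "
          f"1-hour {cached_cost(reuses, 2.0):.0f} vs {uncached_cost(reuses):.0f} uncached")
# After 1 reuse the 5-min cache is already cheaper (13500 vs 20000);
# the 1-hour cache catches up at 2 reuses (22000 vs 30000).
```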
Good fit
- The same long system prompt called many times (agents, chatbots)
- Multi-turn conversations (every prior turn becomes reusable prefix)
- Batch processing of one document (asking 50 questions about one contract)
- RAG, where stable retrieved chunks form the prefix
Bad fit
- Every prompt differs from the first character onward
- The whole thing is short and never crosses the per-model minimum (below)
The three hard requirements
All three are mandatory.

1. Explicit `cache_control` marker
content cannot be a plain string. It must be a content block array, with cache_control attached to the block you want cached:
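A minimal sketch of what that looks like with the official `anthropic` Python SDK; the model ID and `system_prompt.txt` are placeholders, so substitute whatever you actually use:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = open("system_prompt.txt").read()  # placeholder: your stable instructions

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,              # the long, reusable prefix
            "cache_control": {"type": "ephemeral"},  # everything up to this block gets cached
        }
    ],
    messages=[{"role": "user", "content": "First question about the cached material"}],
)
```

The same marker can also sit on a content block inside `messages`, which is what the document-Q&A example further down does.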
2. Length must clear the per-model minimum
If the content is shorter than the model’s minimum, it won’t be cached even with the marker (no error, just silently skipped). Verified against Anthropic’s official docs:

| Model | Minimum tokens |
|---|---|
| Claude Sonnet 4.5 | 1,024 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Opus 4.5 / 4.6 / 4.7 | 4,096 |
| Claude Haiku 4.5 | 4,096 |
3. Prefix must match byte-for-byte
Caching is prefix-based: from the start of the request up to the `cache_control` marker, the byte stream must be identical to the previous request. Any single character change (whitespace, JSON key ordering, a timestamp) counts as a new prefix and triggers a fresh write instead of a hit.
Practical rule: stable stuff up front, volatile stuff at the back.
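As a concrete illustration of that rule, here is a hedged sketch of a request body; `STABLE_INSTRUCTIONS`, the model ID, and the file path are placeholders:

```python
from datetime import datetime, timezone

STABLE_INSTRUCTIONS = open("system_prompt.txt").read()  # byte-for-byte identical on every call

request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model ID
    "max_tokens": 512,
    "system": [{
        "type": "text",
        "text": STABLE_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"},  # the cached prefix ends here
    }],
    "messages": [{
        "role": "user",
        # Volatile material (timestamps, user IDs, the actual question) stays after the marker,
        # so it can change freely without invalidating the cached prefix.
        "content": f"Current UTC time: {datetime.now(timezone.utc).isoformat()}\n\nWhat changed since yesterday?",
    }],
}
```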
Minimal runnable example
Send two requests using the same long document but different questions. The first writes the cache, the second hits it:
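A minimal sketch with the official `anthropic` Python SDK; `contract.txt` and the model ID are placeholders, and the document is assumed to be well above the model’s minimum:

```python
import anthropic

client = anthropic.Anthropic()
long_document = open("contract.txt").read()  # placeholder document, well above the minimum

def ask(question: str):
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model ID
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": long_document,
                    "cache_control": {"type": "ephemeral"},  # cache everything up to this block
                },
                {"type": "text", "text": question},  # the volatile part goes after the marker
            ],
        }],
    )
    u = response.usage
    print(f"{question!r}: write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")
    return response

ask("Summarize the termination clause.")  # first call: cache write
ask("What are the payment terms?")        # second call: cache read
```

On the second call, `cache_read_input_tokens` ≈ the first call’s `cache_creation_input_tokens`: the same prefix is being reused.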
How to tell whether you hit — three usage fields
In every response, `usage` reports:
| Field | Meaning | Billing multiplier |
|---|---|---|
| `input_tokens` | Uncached input tokens | 1× |
| `cache_creation_input_tokens` | Tokens written to cache this call | 1.25× or 2× |
| `cache_read_input_tokens` | Tokens read from cache this call | 0.1× |
If `cache_read_input_tokens` > 0, you’re saving money.
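If you want to turn those three fields into a number, a rough helper (in multiples of the base input price, not dollars, and assuming the 5-minute write multiplier) might look like:

```python
def effective_input_cost(usage, write_multiplier: float = 1.25) -> float:
    """Blended input cost in units of the base input-token price (1x)."""
    return (usage.input_tokens * 1.0
            + (usage.cache_creation_input_tokens or 0) * write_multiplier
            + (usage.cache_read_input_tokens or 0) * 0.1)

def naive_input_cost(usage) -> float:
    """What the same input tokens would cost with no caching at all."""
    return (usage.input_tokens
            + (usage.cache_creation_input_tokens or 0)
            + (usage.cache_read_input_tokens or 0))

# Example: pass `response.usage` from any Messages API response and compare the two.
```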
Most common pitfalls
| Symptom | Cause |
|---|---|
| Write always 0 or field missing | No `cache_control` marker / below the minimum threshold / used OpenAI-compatible format |
| 2nd request still has write > 0 and read = 0 | Prefix changed. Common culprits: `datetime.now()`, UUIDs, rotating user IDs in the prompt; non-deterministic JSON serialization; timestamped system prompts |
| It worked, then a while later writes again | Idle beyond TTL. Use {"type": "ephemeral", "ttl": "1h"} for longer retention |
| Same prompt, different model — no hit | Caches are isolated per model. Switching models = new cache key |
| Recent turns in a long conversation don’t hit | Max 4 cache_control breakpoints per request; each breakpoint only looks back 20 content blocks for prior cache entries |
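One pitfall from the table deserves a concrete fix: if structured data is embedded in the cached prefix, serialize it deterministically so the bytes are identical on every call. A minimal sketch (the data itself is made up):

```python
import json

context_data = {"policy_version": 3, "regions": ["eu", "us"]}  # made-up example data

# sort_keys + fixed separators -> the same dict always serializes to the same string,
# so the cached prefix stays byte-for-byte identical across requests.
stable_json = json.dumps(context_data, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
```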
Advanced: multi-turn conversations
Place `cache_control` on the last content block of the most recent user message. Each new turn auto-extends the cached read range up to the end of the previous turn:
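A hedged sketch of one way to manage this with the official `anthropic` SDK; the model ID and `system_prompt.txt` are placeholders. Each turn moves the marker to the newest user message and strips it from older turns, so the request never exceeds the breakpoint limit noted below:

```python
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = open("system_prompt.txt").read()  # stable prefix, assumed above the minimum
history: list[dict] = []

def send(user_text: str):
    # Keep only one breakpoint inside `messages`: strip markers from older user turns...
    for msg in history:
        if msg["role"] == "user":
            for block in msg["content"]:
                block.pop("cache_control", None)
    # ...and put the marker on the last content block of the newest user message.
    history.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model ID
        max_tokens=512,
        system=[{"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=history,
    )
    history.append({"role": "assistant", "content": response.content})
    u = response.usage
    print(f"write={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")
    return response

send("First question")      # writes: system prompt + turn 1
send("Follow-up question")  # reads the prior turns from cache, writes only the new turn
```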
- At most 4 `cache_control` breakpoints per request.
- Each breakpoint’s prefix lookup window is at most 20 content blocks back; anything older than that won’t be considered for a hit. In other words, in very long conversations, marking only the latest turn won’t cover the entire prior history.
On API易 and caching
API易 forwards cache fields end-to-end. The `cache_control` you send is passed through to upstream Claude (AWS Claude or Claude Official) as-is, and the returned `cache_creation_input_tokens` / `cache_read_input_tokens` are passed straight back to you; no special adaptation is needed in your code. To verify it end-to-end:
- On the first request, `usage.cache_creation_input_tokens > 0` (write succeeded).
- Within seconds, send the same prefix again; you should see `usage.cache_read_input_tokens > 0` (hit).
- Your billing dashboard will itemize cache writes and cache reads separately, at the same official multipliers (1.25× / 2× / 0.1×).
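A hedged sketch of pointing the SDK at API易; the exact base URL is an assumption here, so check API易’s own docs for the endpoint to use:

```python
import anthropic

client = anthropic.Anthropic(
    api_key="sk-...",                  # your API易 token from https://api.apiyi.com/token
    base_url="https://api.apiyi.com",  # assumed gateway endpoint; verify against API易's docs
)
# cache_control and the cache_* usage fields then flow through exactly as in the examples above.
```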
Recap
1. Mark it: `cache_control: {"type": "ephemeral"}` on a content block; plain-string content is never cached.
2. Long enough: Sonnet 4.6 ≥ 2,048 tokens; Opus 4.x / Haiku 4.5 ≥ 4,096 tokens; otherwise it is silently skipped.
3. Stable prefix: stable up front, volatile at the back; one character of drift kills the hit.
4. Check usage: only `cache_read_input_tokens > 0` proves you actually saved money.
Related links
- Parent page: Claude API Basics
- Get / manage tokens: https://api.apiyi.com/token
- Anthropic official docs: docs.claude.com/en/docs/build-with-claude/prompt-caching