cached_content_token_count field is returned untouched — zero code changes.
The headline first: Gemini caching exists, but don’t count on it. Implicit cache behavior is controlled upstream, and real-world hit rates clearly trail OpenAI and Claude. Treat it as a nice-to-have bonus and always estimate costs at the uncached price.
This page is based on the official Google documentation (ai.google.dev/gemini-api/docs/caching, as of June 2026).
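The budgeting advice above comes down to simple arithmetic. The following is a minimal sketch, not an official pricing formula: the function name `input_cost`, the price of 1.00 per million tokens, and the flat 90% discount are all illustrative assumptions (the source only says "up to 90% off").

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_million: float, cache_discount: float = 0.90) -> float:
    """Estimate input cost when `cached_tokens` of the prompt hit the cache.

    `cache_discount` is an assumed flat discount; real discounts may be lower.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    # Cached tokens are billed at the remaining fraction of the normal price.
    billable = uncached + cached_tokens * (1.0 - cache_discount)
    return billable * price_per_million / 1_000_000


# Plan with the uncached figure; treat any cache hit as a bonus.
budget = input_cost(10_000, 0, price_per_million=1.00)
best_case = input_cost(10_000, 8_192, price_per_million=1.00)
print(f"budget={budget:.6f} best_case={best_case:.6f}")
```

Budgeting on `budget` rather than `best_case` keeps the estimate safe when hits do not materialize.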
The Mechanism in One Sentence
When a request’s opening segment (prefix) matches a recent request and meets the minimum length, the upstream reuses its cache automatically: the matched portion bills at the discounted rate (officially up to 90% off), with no markers needed.
Trigger Conditions
| Condition | Requirement |
|---|---|
| Minimum prefix length | Gemini 3 / 3.1 / 3.5 series: 4096 tokens; 2.5 series: 2048 tokens |
| Stable prefix | Byte-for-byte identical from the first character; dynamic content (timestamps, random IDs) breaks the match |
| Time window | Caches expire after idle time; back-to-back requests hit more reliably |
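The minimum-length condition in the table can be checked before sending a request. This is a rough sketch under stated assumptions: the series labels used as dictionary keys and the helper name `meets_minimum_prefix` are hypothetical, and the token count must come from your own tokenizer or a token-counting call.

```python
# Minimum prefix lengths copied from the table above.
# The string keys are illustrative labels, not official model identifiers.
MIN_PREFIX_TOKENS = {"3": 4096, "3.1": 4096, "3.5": 4096, "2.5": 2048}


def meets_minimum_prefix(series: str, prefix_tokens: int) -> bool:
    """Return True if a stable prefix is long enough to be cache-eligible."""
    if series not in MIN_PREFIX_TOKENS:
        raise ValueError(f"unknown series: {series!r}")
    return prefix_tokens >= MIN_PREFIX_TOKENS[series]


print(meets_minimum_prefix("3", 4096))    # long enough for the 3 series
print(meets_minimum_prefix("2.5", 1500))  # too short for the 2.5 series
```

Passing this check only makes a hit possible; the stable-prefix and time-window conditions still apply.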
How to Confirm a Hit
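Check usage_metadata.cached_content_token_count in the response; a value greater than zero means part of the prompt was served from cache. In raw REST (JSON) responses, the same field appears as usageMetadata.cachedContentTokenCount.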
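As a hedged illustration of that check, the sketch below reads the cached-token count from a usage mapping and reports what share of the prompt was served from cache. The helper names and the sample payload are made up for illustration; only the field names come from the API, and the code accepts both the snake_case and camelCase spellings.

```python
def _first_present(usage: dict, *keys: str) -> int:
    """Return the first non-null value among `keys`, or 0 if none is set."""
    for key in keys:
        value = usage.get(key)
        if value is not None:
            return int(value)
    return 0


def cached_token_stats(usage: dict) -> tuple[int, float]:
    """Return (cached_tokens, hit_ratio) from a usage-metadata mapping.

    A missing or null cached count is treated as zero, i.e. no hit.
    """
    cached = _first_present(usage, "cached_content_token_count", "cachedContentTokenCount")
    prompt = _first_present(usage, "prompt_token_count", "promptTokenCount")
    ratio = cached / prompt if prompt else 0.0
    return cached, ratio


# Illustrative payload, not a captured response.
sample_usage = {"promptTokenCount": 5000, "cachedContentTokenCount": 4096}
cached, ratio = cached_token_stats(sample_usage)
print(f"cached={cached} hit_ratio={ratio:.2%}")
```

Logging this ratio per request is a simple way to measure the real hit rate on your own traffic rather than assuming one.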
Raising Your Odds
The playbook is the same as OpenAI’s (full explanation in the OpenAI Cache Billing Guide):
- Stable content first: long system instructions, documents, few-shot examples up front; user input and timestamps last
- Make the prefix long: a prefix under 4096 tokens (Gemini 3 series) or 2048 tokens (2.5 series) never hits
- Cluster reuse in time: send batch jobs back to back, don’t space them out
- Multi-turn chats are naturally append-only prefixes and hit more easily
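The "stable content first" rule can be illustrated with a small sketch. The helper names `build_prompt` and `shared_prefix_chars` and the sample strings are hypothetical. Shared characters are used here only as a rough proxy for shared tokens, since the upstream matches on its own tokenization.

```python
import os


def build_prompt(stable_parts: list[str], user_input: str, timestamp: str) -> str:
    """Put stable content first and dynamic content (timestamp, input) last."""
    stable = "\n\n".join(stable_parts)
    return f"{stable}\n\n[{timestamp}] {user_input}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


stable_parts = [
    "SYSTEM: You are a contract reviewer.",
    "DOCUMENT: (long reference text goes here)",
    "EXAMPLE: question -> answer",
]

# Recommended order: the whole stable block is shared between requests.
good_a = build_prompt(stable_parts, "Summarize clause 4.", "2026-06-01T10:00:00Z")
good_b = build_prompt(stable_parts, "List the parties.", "2026-06-01T10:00:05Z")

# Anti-pattern: a leading timestamp breaks the match almost immediately.
bad_a = "[2026-06-01T10:00:00Z] " + "\n\n".join(stable_parts)
bad_b = "[2026-06-02T09:30:00Z] " + "\n\n".join(stable_parts)

print(shared_prefix_chars(good_a, good_b), shared_prefix_chars(bad_a, bad_b))
```

In the recommended order the two requests share at least the entire stable block, while the timestamp-first variant shares only the first few characters.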
Explicit Caching (cachedContents)
Google also offers an explicit caching API (cachedContents — create a TTL’d cache object and reference it). That is a stateful server-side resource and is not currently supported on the APIYI channel; use implicit caching.
Versus Other Channels
| | Gemini | OpenAI | Claude |
|---|---|---|---|
| Trigger | Implicit, automatic | Fully automatic | Manual markers |
| Minimum threshold | 4096 (3 series) / 2048 (2.5 series) | 1024 | 1024–4096 |
| Hit discount | Officially up to 90% off | 0.1× | 0.1× |
| Hit reliability | ⚠️ Best-effort, mediocre | ✅ Stable | ✅ Stable |
| Hit field | cached_content_token_count | cached_tokens | cache_read_input_tokens |
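If one codebase talks to all three channels, the "Hit field" row can be folded into a single lookup. The following is only a sketch: the channel keys and the function name `cached_tokens_for` are assumptions, and the function searches nested mappings because the exact nesting of the usage object differs between providers.

```python
# Hit-field name per channel, taken from the "Hit field" row above.
# The channel keys are illustrative labels.
HIT_FIELD = {
    "gemini": "cached_content_token_count",
    "openai": "cached_tokens",
    "claude": "cache_read_input_tokens",
}


def cached_tokens_for(channel: str, usage: dict) -> int:
    """Find the channel's hit field anywhere in a (possibly nested) usage dict.

    Returns 0 when the field is absent or null, which means no cache hit.
    """
    field = HIT_FIELD[channel]
    stack = [usage]
    while stack:
        node = stack.pop()
        if node.get(field) is not None:
            return int(node[field])
        # Descend into nested mappings such as per-category token details.
        stack.extend(v for v in node.values() if isinstance(v, dict))
    return 0


# Illustrative payloads, not captured responses.
print(cached_tokens_for("gemini", {"prompt_token_count": 6000, "cached_content_token_count": 4096}))
print(cached_tokens_for("openai", {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1024}}))
print(cached_tokens_for("claude", {"input_tokens": 300, "cache_read_input_tokens": 0}))
```

Normalizing to one integer makes it easy to compare hit rates across channels in the same dashboard.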
Related Links
- This group: Native Calls · Multimodal & Code Execution · Function Calling
- Other channels: OpenAI Cache Billing · Claude Cache Billing
- Official Google docs: ai.google.dev/gemini-api/docs/caching