Overview
Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch. Benefits:
- Up to ~90% cost reduction on cached input tokens (cache hits)
- Lower latency on requests with large static prefixes
Supported Models
Prompt caching is available on Claude models, including these families (examples):

| Model family | Example model IDs |
|---|---|
| Claude 3.5 Sonnet v2 | claude-3-5-sonnet-20241022 |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 |
| Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 (and :thinking variants) |
| Claude Sonnet 4 | claude-sonnet-4-20250514 (and :thinking variants) |
| Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 (and :thinking variants) |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Claude Opus 4 | claude-opus-4-20250514 (and :thinking variants) |
| Claude Opus 4.1 | claude-opus-4-1-20250805 (and :thinking variants) |
| Claude Opus 4.5 | claude-opus-4-5-20251101 (and :thinking variants) |
| Claude Opus 4.6 | claude-opus-4-6 (and :thinking variants) |
These models are also available under the `anthropic/` model prefix (for example `anthropic/claude-sonnet-4.5`, `anthropic/claude-opus-4.6:thinking`).
Minimum Cacheable Tokens
If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.

| Model | Minimum cacheable tokens |
|---|---|
| Claude Opus 4.5 | 4,096 |
| Claude Haiku 4.5 | 4,096 |
| Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7 | 1,024 |
| Claude Haiku 3.5 | 2,048 |
How To Enable Prompt Caching
Prompt caching works on `POST /api/v1/chat/completions`.
You can enable it in 3 ways.
Method A: prompt_caching / promptCaching helper
Add a top-level helper object:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | — | Enable prompt caching |
| ttl | "5m" or "1h" | "5m" | Cache time-to-live |
| cut_after_message_index | integer | — | Zero-based index; cache all messages up to and including this index |
| stickyProvider | boolean | false | When true, avoid failover to preserve cache consistency (see stickyProvider) |
If cut_after_message_index is omitted, NanoGPT selects a cache boundary automatically. Set it explicitly for full control.
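A minimal sketch using Python's requests library; the base URL, API key, and prompt text are placeholders, and the model ID is one of the examples from the table above:

```python
import requests

BASE_URL = "https://nano-gpt.com"  # assumed base URL; substitute your own if it differs
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "claude-sonnet-4-5-20250929",
    "messages": [
        {"role": "system", "content": "LONG_STATIC_SYSTEM_PROMPT"},
        {"role": "user", "content": "Summarize section 3 of the attached manual."},
    ],
    # Method A: top-level prompt caching helper
    "prompt_caching": {
        "enabled": True,
        "ttl": "1h",
        "cut_after_message_index": 0,  # cache up to and including message 0 (the system prompt)
    },
}

response = requests.post(f"{BASE_URL}/api/v1/chat/completions", json=payload, headers=HEADERS)
print(response.json().get("usage", {}))
```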
Method B: explicit cache_control markers (advanced)
Attach cache_control directly to content blocks you want cached:
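For example, using Anthropic-style content blocks (a sketch; the cache_control fields are the relevant part, the rest mirrors the earlier payload):

```python
payload = {
    "model": "claude-sonnet-4-5-20250929",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "LONG_STATIC_SYSTEM_PROMPT",
                    # The prefix up to and including this block is cached.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize section 3 of the attached manual."},
    ],
}
```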
Method C: anthropic-beta header
You can enable caching via headers (especially useful if you are sending explicit cache_control blocks):
- 5 minutes: `anthropic-beta: prompt-caching-2024-07-31`
- 1 hour: `anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11`
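A sketch of sending the header alongside a payload carrying explicit cache_control blocks (reuses the payload and BASE_URL from the earlier sketches):

```python
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    # 1-hour TTL; drop the second value to keep the default 5-minute TTL.
    "anthropic-beta": "prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11",
}

response = requests.post(f"{BASE_URL}/api/v1/chat/completions", json=payload, headers=headers)
```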
NanoGPT Web UI
In the NanoGPT web UI, supported Claude models show a prompt caching toggle where you can choose a cache duration.
Costs and Savings
Prompt caching changes the effective price of cached input tokens:
Cache creation (first request)
| TTL | Multiplier on cached input tokens |
|---|---|
| 5 minutes | 1.25x |
| 1 hour | 2.0x |
Cache reads (subsequent requests)
| TTL | Multiplier on cached input tokens |
|---|---|
| 5 minutes or 1 hour | 0.1x (about 90% discount) |
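As a worked example, assume a hypothetical base input price of $3.00 per million tokens and a 100,000-token cached prefix:

```python
base_price_per_mtok = 3.00      # hypothetical base input price (USD per million tokens)
cached_tokens = 100_000

write_5m = cached_tokens / 1e6 * base_price_per_mtok * 1.25   # first request, 5-minute TTL -> $0.375
write_1h = cached_tokens / 1e6 * base_price_per_mtok * 2.0    # first request, 1-hour TTL   -> $0.60
read_hit = cached_tokens / 1e6 * base_price_per_mtok * 0.1    # each cache hit              -> $0.03
```

Because each hit saves 0.9x of the base price while the write surcharge is 0.25x (5-minute TTL) or 1.0x (1-hour TTL), the cache pays for itself after the first hit at 5 minutes and after roughly two hits at 1 hour.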
Structuring Prompts for Cache Hits
Cache hits require the cached prefix to be byte-identical across requests. Best practices (see the sketch after this list):
- Put static content first (system prompt, reference docs, tool definitions).
- Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
- Put dynamic content after the cache boundary (typically the final user message).
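A minimal sketch of message construction that follows these rules (the helper and variable names are illustrative, not part of the API):

```python
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant.\n"
    "Reference manual:\n"
    "...large, never-changing reference text..."
)  # must stay byte-identical between requests


def build_messages(history: list[dict], new_question: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # static prefix: cacheable
        *history,                                              # earlier turns, unchanged
        {"role": "user", "content": new_question},             # dynamic content after the boundary
    ]


messages = build_messages([], "What does error E42 mean?")
```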
TTL and Invalidation
Supported TTL values: "5m" (5 minutes) and "1h" (1 hour).
- On a cache hit, the TTL timer resets.
- If a cache entry is not referenced again within its TTL, it expires.
- Any change to the cached prefix (even a single character) results in a cache miss and a new cache creation.
Usage Fields (How To Verify Cache Hits)
When caching is active, responses include cache-related usage fields:
- cache_creation_input_tokens: tokens written to cache on this request
- cache_read_input_tokens: tokens read from cache on this request (a cache hit when > 0)
- prompt_tokens_details.cached_tokens: OpenAI-style cached token count
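A sketch of checking these fields (response here is the requests.Response object from the earlier sketches):

```python
usage = response.json().get("usage", {})
created = usage.get("cache_creation_input_tokens", 0)
read = usage.get("cache_read_input_tokens", 0)

if read > 0:
    print(f"Cache hit: {read} tokens read, {created} tokens of write overhead")
else:
    print(f"Cache miss (or first request): {created} tokens written to cache")
```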
Limitations and Caveats
- Prompt caching is only supported on Claude models. Other models ignore caching helpers/markers.
- A maximum of 4 cache breakpoints (`cache_control`) is supported per request.
- Some models may report `prompt_tokens` differently across generations; use `cache_creation_input_tokens` and `cache_read_input_tokens` for the authoritative cached token counts.
- On cache hits, it is normal to sometimes see a small non-zero `cache_creation_input_tokens` value (small per-request overhead). It does not necessarily mean a cache miss.
Cache Consistency with stickyProvider
If a later request is routed differently, it can miss the cache even if the prefix is identical. If cache consistency is more important than availability, set:
- stickyProvider: false (default): the request may succeed even if routing changes, but you might rebuild caches.
- stickyProvider: true: if a fallback would be required, the request returns 503 instead.
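A sketch of opting into strict cache consistency and handling the resulting 503 (reuses the payload, BASE_URL, and HEADERS from the earlier sketches):

```python
payload["prompt_caching"] = {
    "enabled": True,
    "ttl": "5m",
    "stickyProvider": True,  # prefer failing over silently rebuilding the cache elsewhere
}

response = requests.post(f"{BASE_URL}/api/v1/chat/completions", json=payload, headers=HEADERS)
if response.status_code == 503:
    # The provider holding the cache was unavailable; retry later, or fall back
    # to stickyProvider=False and accept a cache rebuild.
    pass
```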