Overview

Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch. Benefits:
  • Up to ~90% cost reduction on cached input tokens (cache hits)
  • Lower latency on requests with large static prefixes
Caching is opt-in and configured per request.

Supported Models

Prompt caching is available on Claude models, including these families (examples):
Model family           Example model IDs
Claude 3.5 Sonnet v2   claude-3-5-sonnet-20241022
Claude 3.5 Haiku       claude-3-5-haiku-20241022
Claude 3.7 Sonnet      claude-3-7-sonnet-20250219 (and :thinking variants)
Claude Sonnet 4        claude-sonnet-4-20250514 (and :thinking variants)
Claude Sonnet 4.5      claude-sonnet-4-5-20250929 (and :thinking variants)
Claude Haiku 4.5       claude-haiku-4-5-20251001
Claude Opus 4          claude-opus-4-20250514 (and :thinking variants)
Claude Opus 4.1        claude-opus-4-1-20250805 (and :thinking variants)
Claude Opus 4.5        claude-opus-4-5-20251101 (and :thinking variants)
Claude Opus 4.6        claude-opus-4-6 (and :thinking variants)
All of the above are also supported via the anthropic/ model prefix (for example anthropic/claude-sonnet-4.5, anthropic/claude-opus-4.6:thinking).

Minimum Cacheable Tokens

If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.
Model                                                                Minimum cacheable tokens
Claude Opus 4.5                                                      4,096
Claude Haiku 4.5                                                     4,096
Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7  1,024
Claude Haiku 3.5                                                     2,048
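Exact counts depend on the tokenizer, but a rough pre-check can catch prefixes that are clearly too small to cache. The sketch below uses an approximate 4-characters-per-token heuristic and a hand-maintained lookup table; both are illustrative assumptions, and the authoritative signal is always cache_creation_input_tokens in the response usage.
# Rough pre-check: is this static prefix likely large enough to create a cache entry?
# The ~4 chars/token ratio and the MIN_CACHEABLE mapping are assumptions for illustration.
MIN_CACHEABLE = {
    "anthropic/claude-opus-4.5": 4096,
    "anthropic/claude-haiku-4.5": 4096,
    "anthropic/claude-sonnet-4.5": 1024,
    "anthropic/claude-3-5-haiku-20241022": 2048,
}

def likely_cacheable(static_prefix: str, model: str) -> bool:
    estimated_tokens = len(static_prefix) / 4  # crude heuristic, not a real tokenizer
    return estimated_tokens >= MIN_CACHEABLE.get(model, 1024)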

How To Enable Prompt Caching

Prompt caching works on POST /api/v1/chat/completions. You can enable it in three ways.

Method A: prompt_caching / promptCaching helper

Add a top-level helper object:
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    { "role": "system", "content": "Your large static content..." },
    { "role": "user", "content": "Summarize the key points." }
  ],
  "prompt_caching": {
    "enabled": true,
    "ttl": "5m",
    "cut_after_message_index": 0
  }
}
Parameters:
Parameter                Type            Default  Description
enabled                  boolean         -        Enable prompt caching
ttl                      "5m" or "1h"    "5m"     Cache time-to-live
cut_after_message_index  integer         -        Zero-based index; cache all messages up to and including this index
stickyProvider           boolean         false    When true, avoid failover to preserve cache consistency (see stickyProvider)
If cut_after_message_index is omitted, NanoGPT selects a cache boundary automatically. Set it explicitly for full control.
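A minimal Python sketch of a Method A request is shown below. The base URL and the NANOGPT_API_KEY environment variable are assumptions; substitute the endpoint and credentials for your account.
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL
API_KEY = os.environ["NANOGPT_API_KEY"]   # assumed env var

LARGE_STATIC_PROMPT = "Your large static content..."  # must meet the model's minimum cacheable tokens

payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [
        {"role": "system", "content": LARGE_STATIC_PROMPT},
        {"role": "user", "content": "Summarize the key points."},
    ],
    "prompt_caching": {
        "enabled": True,
        "ttl": "5m",
        "cut_after_message_index": 0,  # cache everything up to and including the system message
    },
}

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["usage"])  # check cache_creation_input_tokens / cache_read_input_tokens
On the first call you should see a non-zero cache_creation_input_tokens; repeating the identical prefix within the TTL should report cache_read_input_tokens instead.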

Method B: explicit cache_control markers (advanced)

Attach cache_control directly to content blocks you want cached:
{
  "model": "anthropic/claude-opus-4.5",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "Your long reference document...",
          "cache_control": { "type": "ephemeral", "ttl": "5m" }
        }
      ]
    },
    { "role": "user", "content": "Live question goes here" }
  ]
}
When using explicit markers, include:
anthropic-beta: prompt-caching-2024-07-31
For 1-hour TTL, also include:
anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11
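The same request can be made from Python as sketched below; the base URL and API-key handling are assumptions as in the earlier example, and the anthropic-beta header matches the 5-minute TTL used in the cache_control block.
import os
import requests

payload = {
    "model": "anthropic/claude-opus-4.5",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your long reference document...",
                    "cache_control": {"type": "ephemeral", "ttl": "5m"},  # explicit cache breakpoint
                }
            ],
        },
        {"role": "user", "content": "Live question goes here"},
    ],
}

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",  # assumed base URL
    headers={
        "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",  # assumed env var
        "anthropic-beta": "prompt-caching-2024-07-31",  # add extended-cache-ttl-2025-04-11 for "1h"
    },
    json=payload,
    timeout=120,
)
resp.raise_for_status()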

Method C: anthropic-beta header

You can enable caching via headers (especially useful if you are sending explicit cache_control blocks):
  • 5 minutes:
    • anthropic-beta: prompt-caching-2024-07-31
  • 1 hour:
    • anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11

NanoGPT Web UI

In the NanoGPT web UI, supported Claude models show a prompt caching toggle where you can choose a cache duration.

Costs and Savings

Prompt caching changes the effective price of cached input tokens:

Cache creation (first request)

TTL        Multiplier on cached input tokens
5 minutes  1.25x
1 hour     2.0x

Cache reads (subsequent requests)

TTL                  Multiplier on cached input tokens
5 minutes or 1 hour  0.1x (about 90% discount)
The first cached request costs more than an uncached request (cache-creation surcharge), but cache hits are much cheaper. Break-even typically comes at 2 requests with the same cached prefix for the 5-minute TTL, or 3 requests for the 1-hour TTL.
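The break-even figures follow directly from the multipliers, as the sketch below shows (it ignores uncached input and output tokens, which are billed normally):
def cached_cost(n_requests: int, creation_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Relative cost of n requests sharing one cached prefix, measured in units of
    the prefix's uncached per-request price. Output tokens are ignored."""
    return creation_multiplier + read_multiplier * (n_requests - 1)

# 5-minute TTL: 1.25 + 0.1*(n-1) vs n * 1.0 uncached
assert cached_cost(2, 1.25) < 2   # 1.35 < 2.0 -> cheaper from the 2nd request
# 1-hour TTL: 2.0 + 0.1*(n-1) vs n * 1.0 uncached
assert cached_cost(2, 2.0) > 2    # 2.10 > 2.0 -> not yet
assert cached_cost(3, 2.0) < 3    # 2.20 < 3.0 -> cheaper from the 3rd request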

Structuring Prompts for Cache Hits

Cache hits require the cached prefix to be byte-identical across requests. Best practices (see the sketch after this list):
  • Put static content first (system prompt, reference docs, tool definitions).
  • Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
  • Put dynamic content after the cache boundary (typically the final user message).
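Here is a sketch of that structure, where only the trailing user message varies between requests; the helper function and constant names are illustrative, not part of the API.
# Static prefix: byte-identical on every request, so it can be cached.
STATIC_MESSAGES = [
    {"role": "system", "content": "Reference manual v3 (unchanging text)..."},
]

def build_request(user_question: str) -> dict:
    """Only the final user message changes; everything before the cache
    boundary stays byte-identical across requests."""
    return {
        "model": "anthropic/claude-sonnet-4.5",
        "messages": STATIC_MESSAGES + [{"role": "user", "content": user_question}],
        "prompt_caching": {
            "enabled": True,
            "ttl": "5m",
            "cut_after_message_index": len(STATIC_MESSAGES) - 1,  # cache up to the last static message
        },
    }

# Anti-pattern: a timestamp in the system prompt changes the prefix on every call
# and guarantees cache misses:
#   {"role": "system", "content": f"Reference manual v3 (fetched {now})..."}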

TTL and Invalidation

Supported TTL values:
  • "5m" (5 minutes)
  • "1h" (1 hour)
Behavior:
  • On a cache hit, the TTL timer resets.
  • If a cache entry is not read again within its TTL, it expires.
  • Any change to the cached prefix (even a single character) results in a cache miss and a new cache creation (see the sketch after this list).
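One way to catch accidental prefix drift before it costs you a cache rebuild is to fingerprint the portion of the request you expect to be cached; the helper below is an illustrative sketch, not part of the API.
import hashlib
import json

def prefix_fingerprint(messages: list, cut_after_message_index: int) -> str:
    """Hash the messages that are meant to be cached. If this value changes
    between requests, the next call will miss the cache and pay the
    cache-creation surcharge again."""
    cached_part = messages[: cut_after_message_index + 1]
    serialized = json.dumps(cached_part, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()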

Usage Fields (How To Verify Cache Hits)

When caching is active, responses include cache-related usage fields:
  • cache_creation_input_tokens: tokens written to cache on this request
  • cache_read_input_tokens: tokens read from cache on this request (cache hit when > 0)
  • prompt_tokens_details.cached_tokens: OpenAI-style cached token count
Example (cache hit):
{
  "usage": {
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 4269,
    "prompt_tokens_details": { "cached_tokens": 4269 }
  }
}
For streaming requests, the same usage fields appear in the final SSE chunk when usage reporting is enabled for the stream.
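For example, a small helper can classify each response from these fields; the tolerance for a small non-zero creation count on hits reflects the overhead noted under Limitations below.
def classify_cache_usage(usage: dict) -> str:
    """Interpret the cache-related usage fields of a chat completion response."""
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if read > 0:
        # A small non-zero 'created' alongside a large 'read' is normal overhead, not a miss.
        return f"cache hit: {read} tokens read, {created} written"
    if created > 0:
        return f"cache created: {created} tokens written, none read"
    return "no caching: prefix below minimum, caching disabled, or unsupported model"

print(classify_cache_usage({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 4269}))
# -> cache hit: 4269 tokens read, 0 written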

Limitations and Caveats

  • Prompt caching is only supported on Claude models. Other models ignore caching helpers/markers.
  • A maximum of 4 cache breakpoints (cache_control markers) is supported per request (see the sketch after this list).
  • Some models may report prompt_tokens differently across generations; use cache_creation_input_tokens and cache_read_input_tokens for the authoritative cached token counts.
  • On cache hits, it is normal to sometimes see a small non-zero cache_creation_input_tokens value (small per-request overhead). It does not necessarily mean a cache miss.
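As an illustration of the breakpoint limit, the payload sketch below uses three cache_control markers (core instructions, a reference document, and stable conversation history), which stays within the limit of 4; it would be sent with the anthropic-beta header from Method B. The document texts are placeholders.
# Three explicit cache breakpoints, within the per-request limit of 4.
payload = {
    "model": "anthropic/claude-opus-4.5",
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "Core instructions...",
                 "cache_control": {"type": "ephemeral", "ttl": "5m"}},   # breakpoint 1
                {"type": "text", "text": "Large reference document...",
                 "cache_control": {"type": "ephemeral", "ttl": "5m"}},   # breakpoint 2
            ],
        },
        {"role": "user", "content": "Earlier question..."},
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "Earlier answer...",
                 "cache_control": {"type": "ephemeral", "ttl": "5m"}},   # breakpoint 3: caches history up to here
            ],
        },
        {"role": "user", "content": "New question for this request"},
    ],
}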

Cache Consistency with stickyProvider

If a later request is routed to a different provider, it can miss the cache even though the prefix is identical. If cache consistency matters more to you than availability, set:
{
  "prompt_caching": { "enabled": true, "ttl": "5m", "stickyProvider": true }
}
Behavior:
  • stickyProvider: false (default): the request may succeed even if routing changes, but you might have to rebuild the cache.
  • stickyProvider: true: if a fallback to another provider would be required, the request fails with a 503 instead (see the sketch below).
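A sketch of handling that trade-off in client code is shown below; the base URL, retry policy, and function name are assumptions, not part of the API.
import time
import requests

def post_with_sticky_cache(payload: dict, headers: dict, retries: int = 3) -> requests.Response:
    """With stickyProvider enabled, a 503 means the cached provider was unavailable,
    not a permanent failure; a short retry often recovers the cache hit."""
    for attempt in range(retries):
        resp = requests.post(
            "https://nano-gpt.com/api/v1/chat/completions",  # assumed base URL
            headers=headers,
            json=payload,
            timeout=120,
        )
        if resp.status_code != 503:
            return resp
        time.sleep(2 ** attempt)  # back off, then retry the same cached route
    # Last resort: accept a possible cache rebuild instead of failing the request.
    payload["prompt_caching"]["stickyProvider"] = False
    return requests.post(
        "https://nano-gpt.com/api/v1/chat/completions",
        headers=headers,
        json=payload,
        timeout=120,
    )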