## Documentation Index
Fetch the complete documentation index at: https://docs.nano-gpt.com/llms.txt
Use this file to discover all available pages before exploring further.
## Overview
Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch.
Benefits:
- Up to ~90% cost reduction on cached input tokens (cache hits)
- Lower latency on requests with large static prefixes
NanoGPT supports two caching modes:
- Implicit caching (default): For providers/models that support provider-native prompt reuse (including OpenAI, Gemini, and many open-source provider routes), caching is applied automatically when eligible. No extra request fields are required.
- Explicit prompt caching (opt-in): Claude models support explicit cache controls (the body-level `promptCaching`/`prompt_caching`/`cache_control` helper, or inline `cache_control` markers) when you want deterministic cache boundaries and TTL control.
## Supported Models
### Implicit caching (automatic)
NanoGPT automatically uses implicit caching on providers/models that support it, including OpenAI and Gemini model families plus many open-source provider/model routes.
No cache-control flags are required for this mode.
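As a concrete illustration, an ordinary request is sufficient. The sketch below assumes a base URL of `https://nano-gpt.com/api/v1`, bearer-token auth via a `NANOGPT_API_KEY` environment variable, and an illustrative model ID; none of these specifics are confirmed by this page.

```python
import os

import requests

# Assumed base URL and auth scheme; adjust for your account.
BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = os.environ["NANOGPT_API_KEY"]

# A plain chat completion: on implicit-caching routes (e.g. OpenAI or
# Gemini model families) no cache fields or flags are needed at all.
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o",  # illustrative implicit-caching model ID
        "messages": [
            {"role": "system", "content": "Large static system prompt..."},
            {"role": "user", "content": "First question"},
        ],
    },
    timeout=60,
)
print(resp.json().get("usage", {}))
```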
### Explicit prompt caching controls (Claude)
Explicit prompt-caching controls are available on Claude models, including these families (examples):
| Model family | Example model IDs |
|---|---|
| Claude 3.5 Sonnet v2 | claude-3-5-sonnet-20241022 |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 |
| Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 (and :thinking variants) |
| Claude Sonnet 4 | claude-sonnet-4-20250514 (and :thinking variants) |
| Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 (and :thinking variants) |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Claude Opus 4 | claude-opus-4-20250514 (and :thinking variants) |
| Claude Opus 4.1 | claude-opus-4-1-20250805 (and :thinking variants) |
| Claude Opus 4.5 | claude-opus-4-5-20251101 (and :thinking variants) |
| Claude Opus 4.6 | claude-opus-4-6 (and :thinking variants) |
All of the above are also supported via the `anthropic/` model prefix (for example `anthropic/claude-sonnet-4.5`, `anthropic/claude-opus-4.6:thinking`).
### Claude minimum cacheable tokens
If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.
| Model | Minimum cacheable tokens |
|---|---|
| Claude Opus 4.5 | 4,096 |
| Claude Haiku 4.5 | 4,096 |
| Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7 | 1,024 |
| Claude Haiku 3.5 | 2,048 |
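Before relying on a cache write, you can sanity-check your prefix size against these minimums. The sketch below uses a rough ~4 characters per token heuristic and illustrative lookup keys; it is an approximation, not NanoGPT's tokenizer.

```python
# Minimum cacheable tokens per model family (from the table above);
# the dictionary keys are illustrative shorthand, not exact model IDs.
CLAUDE_MIN_CACHEABLE = {
    "opus-4.5": 4096,
    "haiku-4.5": 4096,
    "sonnet-4.5": 1024,
    "haiku-3.5": 2048,
}

def likely_cacheable(prefix_text: str, model_key: str) -> bool:
    """Estimate (~4 chars/token) whether a prefix clears the cache minimum."""
    estimated_tokens = len(prefix_text) / 4
    return estimated_tokens >= CLAUDE_MIN_CACHEABLE[model_key]

print(likely_cacheable("x" * 20_000, "sonnet-4.5"))  # True (~5,000 tokens)
```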
## How To Enable Explicit Prompt Caching (Claude)
Prompt caching works on `POST /api/v1/chat/completions`. You can enable it in three ways.
### Option 1: body-level helper (`promptCaching` / `prompt_caching` / `cache_control`)
Add a top-level helper object:
```json
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    { "role": "system", "content": "Your large static content..." },
    { "role": "user", "content": "Summarize the key points." }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "5m",
    "cutAfterMessageIndex": 0
  }
}
```
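Sent over HTTP, the same body might look like the sketch below. The page only specifies the `POST /api/v1/chat/completions` path, so the host and bearer-token auth here are assumptions.

```python
import os

import requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",  # assumed host
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
    json={
        "model": "anthropic/claude-sonnet-4.5",
        "messages": [
            {"role": "system", "content": "Your large static content..."},
            {"role": "user", "content": "Summarize the key points."},
        ],
        # Body-level helper: cache everything up to and including message 0.
        "promptCaching": {"enabled": True, "ttl": "5m", "cutAfterMessageIndex": 0},
    },
    timeout=60,
)
usage = resp.json()["usage"]
print(usage.get("cache_creation_input_tokens"), usage.get("cache_read_input_tokens"))
```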
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | — | Enable prompt caching |
| `ttl` | `"5m"` or `"1h"` | `"5m"` | Cache time-to-live |
| `cutAfterMessageIndex` / `cut_after_message_index` | integer | — | Zero-based index; cache all messages up to and including this index |
| `stickyProvider` | boolean | `false` | When `true`, avoid failover to preserve cache consistency (see stickyProvider) |
| `explicitCacheControl` / `explicit_cache_control` | boolean | `false` | When `true`, only refresh TTLs on existing inline `cache_control` blocks and do not auto-add cache breakpoints |
#### `explicitCacheControl` (boolean, default `false`)
When `true`, the system only refreshes TTLs on `cache_control` blocks you already placed in your request. No additional cache breakpoints are added automatically.
This is useful when you use inline `cache_control` markers (Option 2) but also want body-level settings like `ttl` or `stickyProvider` to apply. Without this flag, the system may add its own cache breakpoints on top of yours.
Also accepts the snake_case alias `explicit_cache_control`.
Aliases are accepted:
- `promptCaching`
- `prompt_caching`
- `cache_control` (body-level helper alias)
Passing `true` instead of an object defaults to:
```json
{ "enabled": true, "ttl": "5m" }
```
If `cutAfterMessageIndex` is omitted, NanoGPT selects cache boundaries automatically.
### Option 2: inline `cache_control` markers
Attach `cache_control` directly to the content blocks you want cached:
```json
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "Your long reference document...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Live question goes here" }
  ]
}
```
### Combining inline markers with body-level settings
```json
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful coding assistant with access to a large codebase...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    {
      "role": "user",
      "content": "Summarize the auth module"
    }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "1h",
    "explicitCacheControl": true
  }
}
```
In this example, the system prompt's `cache_control` marker is preserved and its TTL is set to `1h`. The user message does not receive an auto-generated cache breakpoint.
### Option 3: Anthropic-compatible header
The Anthropic-compatible beta header is supported:
`anthropic-beta: prompt-caching-2024-07-31`
For Claude 1-hour TTL requests using Anthropic-native routing, also include the extended-TTL flag:
`anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11`
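Put together, a header-driven request might look like the following sketch; the host and bearer-token auth are assumptions, as in the earlier examples.

```python
import os

import requests

headers = {
    "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
    # Anthropic-compatible beta flags; the extended-cache-ttl flag is only
    # needed for 1-hour TTL requests on Anthropic-native routing.
    "anthropic-beta": "prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11",
}
body = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "Your long reference document...",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Live question goes here"},
    ],
}
resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",  # assumed host
    headers=headers,
    json=body,
    timeout=60,
)
```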
## Controlling What Gets Cached
Note: `explicitCacheControl` and `cutAfterMessageIndex` serve different purposes. `cutAfterMessageIndex` tells the system where to auto-place cache breakpoints. `explicitCacheControl` tells the system not to auto-place any and only refresh what you already marked. If both are set, `explicitCacheControl` takes precedence.
### `cutAfterMessageIndex`
Override automatic cache breakpoints by setting the last cached message index:
```json
{
  "promptCaching": {
    "enabled": true,
    "cutAfterMessageIndex": 4
  }
}
```
Messages at indices 0..4 are cached; later messages are not.
You can also set this via request header:
`x-prompt-caching-cut-after: 4`
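For example, in Python (auth as assumed in the earlier sketches):

```python
import os

# Header equivalent of "promptCaching": { "cutAfterMessageIndex": 4 }.
headers = {
    "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
    "x-prompt-caching-cut-after": "4",
}
```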
### Cache block limit
A maximum of 4 `cache_control` breakpoints are allowed per request (across the system prompt, tools, and messages). If more are present, the oldest breakpoints are pruned automatically.
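If you place inline markers yourself, a quick pre-flight count avoids depending on automatic pruning. This helper is purely illustrative and only inspects message content blocks (breakpoints on tools or a top-level system prompt would need similar checks).

```python
def count_cache_breakpoints(messages: list[dict]) -> int:
    """Count inline cache_control markers across message content blocks."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            total += sum(1 for block in content if "cache_control" in block)
    return total

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "Doc...", "cache_control": {"type": "ephemeral"}},
        ],
    },
    {"role": "user", "content": "Question"},
]
assert count_cache_breakpoints(messages) <= 4  # per-request breakpoint limit
```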
## Forcing a Cache Write
There is no separate “force write” flag.
Enable prompt caching and send the request. The first eligible request writes the cache automatically (if provider thresholds and availability allow it); repeated requests with the same cached prefix read from the cache.
## Usage Fields (How To Verify Cache Hits)
When caching is active (implicit or explicit), responses can include:
- `cache_creation_input_tokens`: tokens written to cache on this request
- `cache_read_input_tokens`: tokens read from cache on this request (cache hit when > 0)
- `prompt_tokens_details.cached_tokens`: OpenAI-style cached token count
Example:
```json
{
  "usage": {
    "prompt_tokens": 8500,
    "completion_tokens": 200,
    "cache_creation_input_tokens": 8000,
    "cache_read_input_tokens": 0
  }
}
```
When present, `x_nanogpt_pricing` includes cache pricing breakdown fields such as `cacheCreationInputTokens`, `cacheReadInputTokens`, `cacheTTL`, and `cacheCost`.
For streaming requests, the final SSE chunk includes the same usage fields when usage is included.
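To verify caching end to end, you can send the same prefix twice and watch the usage fields flip from a write to a read. Host and auth are assumed as in the earlier sketches.

```python
import os

import requests

def ask(question: str) -> dict:
    """Send one request with a fixed cached prefix; return its usage block."""
    resp = requests.post(
        "https://nano-gpt.com/api/v1/chat/completions",  # assumed host
        headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
        json={
            "model": "anthropic/claude-sonnet-4.5",
            "messages": [
                {"role": "system", "content": "Same large static prefix..."},
                {"role": "user", "content": question},
            ],
            "promptCaching": {"enabled": True, "ttl": "5m", "cutAfterMessageIndex": 0},
        },
        timeout=60,
    )
    return resp.json()["usage"]

first = ask("First question")    # expect cache_creation_input_tokens > 0
second = ask("Second question")  # expect cache_read_input_tokens > 0 (cache hit)
print(first.get("cache_creation_input_tokens"), second.get("cache_read_input_tokens"))
```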
## Pricing
Cache writes and reads are billed differently by provider. Implicit-caching providers apply their cache pricing automatically when eligible. Explicit Claude caching uses the TTL settings below.
### Gemini Pro models (implicit caching, provider-native)
| Token type | Rate (per 1M tokens) | Notes |
|---|---|---|
| Regular input | $2.00 | — |
| Cache write surcharge | +$0.375 | Added on top of input cost |
| Cache read | $0.20 | 90% cheaper than input |
Example: writing 10,000 cached tokens costs (10,000 × $2.00/M) + (10,000 × $0.375/M) = $0.02 + $0.00375 = $0.02375. Reading 10,000 cached tokens costs 10,000 × $0.20/M = $0.002.
### Gemini Flash models (implicit caching, provider-native)
| Token type | Rate (per 1M tokens) | Notes |
|---|---|---|
| Regular input | Varies by model | — |
| Cache write surcharge | +$0.083 | Added on top of input cost |
| Cache read | 10% of input rate | 90% cheaper than input |
For Gemini 2.0 models, cache reads are 25% of the base input rate (75% cheaper), not 10%.
### Claude models (explicit caching)
| TTL | Creation multiplier on cached input tokens | Read multiplier |
|---|---|---|
| `5m` | 1.25x | 0.1x |
| `1h` | 2.0x | 0.1x |
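In code, the multipliers work out as follows. The $3.00 per 1M base input rate is a made-up figure for illustration only; substitute the actual rate for your model.

```python
# Claude explicit-caching multipliers from the table above.
MULTIPLIERS = {"5m": {"write": 1.25, "read": 0.1},
               "1h": {"write": 2.0, "read": 0.1}}
BASE_INPUT_RATE = 3.00  # hypothetical $ per 1M input tokens

def cache_cost(tokens: int, ttl: str, op: str) -> float:
    """Dollar cost of writing ('write') or reading ('read') cached tokens."""
    return tokens / 1_000_000 * BASE_INPUT_RATE * MULTIPLIERS[ttl][op]

print(cache_cost(10_000, "5m", "write"))  # 10k x $3.00/M x 1.25 = $0.0375
print(cache_cost(10_000, "5m", "read"))   # 10k x $3.00/M x 0.1  = $0.003
```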
## TTL Options (Explicit Claude Controls)
| TTL | Duration | Description |
|---|---|---|
| `"5m"` | 5 minutes | Default. Suitable for interactive sessions. |
| `"1h"` | 1 hour | Extended. Useful for batch processing or long-running sessions. |
## Structuring Prompts for Cache Hits
Cache hits require the cached prefix to be byte-identical across requests.
Best practices:
- Put static content first (system prompt, reference docs, tool definitions).
- Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
- Put dynamic content after the cache boundary (typically the latest user message).
Behavior:
- On cache hit, TTL resets.
- If TTL expires without reuse, cache expires.
- Any prefix change (even one character) causes a cache miss and new cache creation.
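One way to keep the prefix byte-identical in practice is to isolate all dynamic values in the final user turn, as in this illustrative sketch:

```python
import datetime

# The cached prefix: static content only, identical bytes on every request.
STATIC_SYSTEM_PROMPT = "You are an assistant. Reference document: ..."

def build_messages(user_question: str) -> list[dict]:
    """Static prefix first; all dynamic content after the cache boundary."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # Dynamic data (e.g. today's date) belongs in the user turn, after
        # the cache boundary, so it never invalidates the cached prefix.
        {"role": "user", "content": f"[{datetime.date.today()}] {user_question}"},
    ]
```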
## Cache Consistency with `stickyProvider` (Explicit Caching)
Each provider keeps its own cache. If a request fails over to another provider, the previous cache may be unavailable.
If cache consistency matters more than availability, set:
```json
{
  "promptCaching": { "enabled": true, "ttl": "5m", "stickyProvider": true }
}
```
Behavior:
- `stickyProvider: false` (default): the request may succeed even if routing changes, but you might rebuild caches.
- `stickyProvider: true`: if a fallback would be required, the request returns 503 instead.
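If you prefer availability with a best-effort cache, one pattern is to retry a sticky 503 without stickiness, accepting a possible cache rebuild. Host and auth are assumed as before.

```python
import os

import requests

URL = "https://nano-gpt.com/api/v1/chat/completions"  # assumed host
HEADERS = {"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"}

def post_with_sticky_fallback(body: dict) -> requests.Response:
    """Try sticky routing first; on 503, retry once without stickiness
    (accepting that the cache may be rebuilt on another provider)."""
    body["promptCaching"] = {"enabled": True, "ttl": "5m", "stickyProvider": True}
    resp = requests.post(URL, headers=HEADERS, json=body, timeout=60)
    if resp.status_code == 503:  # sticky provider unavailable
        body["promptCaching"]["stickyProvider"] = False
        resp = requests.post(URL, headers=HEADERS, json=body, timeout=60)
    return resp
```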
## NanoGPT Web UI
In the NanoGPT web UI, models with explicit prompt-caching controls show a prompt caching toggle where you can choose cache duration.
## Limitations and Caveats
- Provider-side minimum token thresholds still apply before a cache entry is created.
- A maximum of 4 cache breakpoints (`cache_control`) are supported per request.
- Some models report aggregate prompt usage differently; use `cache_creation_input_tokens` and `cache_read_input_tokens` for authoritative cached token counts.
- On cache hits, a small non-zero `cache_creation_input_tokens` value can appear due to per-request overhead and does not necessarily indicate a cache miss.
- Implicit caching behavior (eligibility, TTL behavior, and exact discounts) is provider-dependent.