Documentation Index

Fetch the complete documentation index at: https://docs.nano-gpt.com/llms.txt

Use this file to discover all available pages before exploring further.

Overview

Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch. Benefits:
  • Up to ~90% cost reduction on cached input tokens (cache hits)
  • Lower latency on requests with large static prefixes
NanoGPT supports two caching modes:
  • Implicit caching (default): For providers/models that support provider-native prompt reuse (including OpenAI, Gemini, and many open-source provider routes), caching is applied automatically when eligible. No extra request fields are required.
  • Explicit prompt caching (opt-in): Claude models accept explicit cache controls (a body-level promptCaching/prompt_caching/cache_control helper, or inline cache_control markers) when you want deterministic cache boundaries and TTL control.

Supported Models

Implicit caching (automatic)

NanoGPT automatically uses implicit caching on providers/models that support it, including OpenAI and Gemini model families plus many open-source provider/model routes. No cache-control flags are required for this mode.

Explicit prompt caching controls (Claude)

Explicit prompt-caching controls are available on Claude models, including these families (examples):
Model family         | Example model IDs
---------------------|------------------
Claude 3.5 Sonnet v2 | claude-3-5-sonnet-20241022
Claude 3.5 Haiku     | claude-3-5-haiku-20241022
Claude 3.7 Sonnet    | claude-3-7-sonnet-20250219 (and :thinking variants)
Claude Sonnet 4      | claude-sonnet-4-20250514 (and :thinking variants)
Claude Sonnet 4.5    | claude-sonnet-4-5-20250929 (and :thinking variants)
Claude Haiku 4.5     | claude-haiku-4-5-20251001
Claude Opus 4        | claude-opus-4-20250514 (and :thinking variants)
Claude Opus 4.1      | claude-opus-4-1-20250805 (and :thinking variants)
Claude Opus 4.5      | claude-opus-4-5-20251101 (and :thinking variants)
Claude Opus 4.6      | claude-opus-4-6 (and :thinking variants)
All of the above are also supported via the anthropic/ model prefix (for example anthropic/claude-sonnet-4.5, anthropic/claude-opus-4.6:thinking).

Claude minimum cacheable tokens

If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.
Model                                                               | Minimum cacheable tokens
--------------------------------------------------------------------|-------------------------
Claude Opus 4.5                                                     | 4,096
Claude Haiku 4.5                                                    | 4,096
Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7 | 1,024
Claude Haiku 3.5                                                    | 2,048

How To Enable Explicit Prompt Caching (Claude)

Prompt caching works on POST /api/v1/chat/completions. You can enable it in any of three ways.

Option 1: body-level helper (promptCaching / prompt_caching / cache_control)

Add a top-level helper object:
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    { "role": "system", "content": "Your large static content..." },
    { "role": "user", "content": "Summarize the key points." }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "5m",
    "cutAfterMessageIndex": 0
  }
}
Parameters:
Parameter                                      | Type         | Default | Description
-----------------------------------------------|--------------|---------|------------
enabled                                        | boolean      | —       | Enable prompt caching
ttl                                            | "5m" or "1h" | "5m"    | Cache time-to-live
cutAfterMessageIndex / cut_after_message_index | integer      | —       | Zero-based index; cache all messages up to and including this index
stickyProvider                                 | boolean      | false   | When true, avoid failover to preserve cache consistency (see stickyProvider)
explicitCacheControl / explicit_cache_control  | boolean      | false   | When true, only refresh TTLs on existing inline cache_control blocks; do not auto-add cache breakpoints
explicitCacheControl is useful when you place inline cache_control markers yourself (Option 2) but still want body-level settings such as ttl or stickyProvider to apply: with the flag set, the system only refreshes TTLs on the cache_control blocks already present in your request and adds no breakpoints of its own. Without it, the system may add its own cache breakpoints on top of yours. The snake_case alias explicit_cache_control is also accepted.
The body-level helper itself accepts these aliases:
  • promptCaching
  • prompt_caching
  • cache_control (body-level helper alias)
Passing true instead of an object defaults to:
{ "enabled": true, "ttl": "5m" }
If cutAfterMessageIndex is omitted, NanoGPT selects cache boundaries automatically.
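As a sketch, the Option 1 body can be assembled like this; build_cached_request is an illustrative helper (not part of any SDK), and the field names follow the parameter list above:

```python
import json

def build_cached_request(system_text, user_text, ttl="5m", cut_after=0):
    """Return a chat-completions body with the body-level caching helper."""
    return {
        "model": "anthropic/claude-sonnet-4.5",
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "promptCaching": {
            "enabled": True,
            "ttl": ttl,                         # "5m" (default) or "1h"
            "cutAfterMessageIndex": cut_after,  # cache messages 0..cut_after
        },
    }

body = build_cached_request("Your large static content...",
                            "Summarize the key points.")
print(json.dumps(body, indent=2))
```

POST the resulting body to /api/v1/chat/completions with your usual HTTP client.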

Option 2: inline cache_control markers

Attach cache_control directly to content blocks you want cached:
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "Your long reference document...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Live question goes here" }
  ]
}
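A small sketch of building the same structure programmatically; mark_cacheable is an illustrative helper, while the cache_control shape ({"type": "ephemeral"}) comes from the example above:

```python
def mark_cacheable(text):
    """Wrap text in a content block carrying an inline cache_control marker."""
    return [{
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral"},  # cache boundary ends here
    }]

messages = [
    {"role": "system", "content": mark_cacheable("Your long reference document...")},
    {"role": "user", "content": "Live question goes here"},  # not cached
]
```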

Combining inline markers with body-level settings

{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful coding assistant with access to a large codebase...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    {
      "role": "user",
      "content": "Summarize the auth module"
    }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "1h",
    "explicitCacheControl": true
  }
}
In this example, the system prompt’s cache_control marker is preserved and its TTL is set to 1h. The user message does not receive an auto-generated cache breakpoint.

Option 3: anthropic-beta header (Claude-compatible)

The Anthropic-compatible header is supported:
anthropic-beta: prompt-caching-2024-07-31
For Claude 1-hour TTL requests using Anthropic-native routing, also include:
anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11
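A sketch of assembling these headers; the anthropic-beta values come from the lines above, while the Bearer-token Authorization header is an assumption about your client setup:

```python
def caching_headers(api_key, ttl="5m"):
    """Build request headers for Option 3 (anthropic-beta caching)."""
    beta = "prompt-caching-2024-07-31"
    if ttl == "1h":
        # 1-hour TTL on Anthropic-native routing also needs the extended-TTL beta
        beta += ",extended-cache-ttl-2025-04-11"
    return {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
        "anthropic-beta": beta,
    }
```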

Controlling What Gets Cached

Note: explicitCacheControl and cutAfterMessageIndex serve different purposes. cutAfterMessageIndex tells the system where to auto-place cache breakpoints. explicitCacheControl tells the system not to auto-place any and only refresh what you already marked. If both are set, explicitCacheControl takes precedence.

cutAfterMessageIndex

Override automatic cache breakpoints by setting the last cached message index:
{
  "promptCaching": {
    "enabled": true,
    "cutAfterMessageIndex": 4
  }
}
Messages at indices 0..4 are cached; later messages are not. You can also set this via request header:
x-prompt-caching-cut-after: 4
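One common policy is to cache everything except the newest user turn. As a sketch (pick_cut_index is an illustrative helper; the header name comes from the line above):

```python
def pick_cut_index(messages):
    """Zero-based index of the last message to include in the cached prefix."""
    return max(len(messages) - 2, 0)  # everything except the final message

conversation = [
    {"role": "system", "content": "Static instructions..."},
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "content": "reply 1"},
    {"role": "user", "content": "turn 2"},
    {"role": "assistant", "content": "reply 2"},
    {"role": "user", "content": "turn 3 (not cached)"},
]
headers = {"x-prompt-caching-cut-after": str(pick_cut_index(conversation))}
```

For this six-message conversation the header value is 4, matching the example above: indices 0..4 are cached and the live question is not.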

Cache block limit

A maximum of 4 cache_control breakpoints are allowed per request (across system prompt, tools, and messages). If more are present, the oldest breakpoints are pruned automatically.
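The server prunes excess breakpoints for you, but pruning client-side keeps the boundaries predictable. A sketch (prune_breakpoints is illustrative; it mirrors the oldest-first pruning described above):

```python
def prune_breakpoints(messages, limit=4):
    """Strip the oldest inline cache_control markers beyond the limit, in place."""
    marked = [block
              for msg in messages
              for block in (msg["content"] if isinstance(msg["content"], list) else [])
              if isinstance(block, dict) and "cache_control" in block]
    # remove markers from the front (oldest) until at most `limit` remain
    for block in marked[:max(len(marked) - limit, 0)]:
        del block["cache_control"]
    return messages
```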

Forcing a Cache Write

There is no separate “force write” flag. Enable prompt caching and send the request. The first eligible request writes cache automatically (if provider thresholds/availability allow it). Repeated requests with the same cached prefix read from cache.

Usage Fields (How To Verify Cache Hits)

When caching is active (implicit or explicit), responses can include:
  • cache_creation_input_tokens: tokens written to cache on this request
  • cache_read_input_tokens: tokens read from cache on this request (cache hit when > 0)
  • prompt_tokens_details.cached_tokens: OpenAI-style cached token count
Example:
{
  "usage": {
    "prompt_tokens": 8500,
    "completion_tokens": 200,
    "cache_creation_input_tokens": 8000,
    "cache_read_input_tokens": 0
  }
}
When present, x_nanogpt_pricing includes cache pricing breakdown fields such as cacheCreationInputTokens, cacheReadInputTokens, cacheTTL, and cacheCost. For streaming requests, the final SSE chunk includes the same usage fields when usage is included.
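A minimal sketch for classifying a response from these fields (classify_cache is an illustrative helper; the field names come from the list above):

```python
def classify_cache(usage):
    """Classify a response from its usage block: cache hit, write, or neither."""
    # check reads first: a small non-zero creation count can appear on hits too
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"    # cached prefix was reused
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"  # cache entry created on this request
    return "none"       # caching not applied, or prefix below the minimum

print(classify_cache({"prompt_tokens": 8500, "completion_tokens": 200,
                      "cache_creation_input_tokens": 8000,
                      "cache_read_input_tokens": 0}))  # "write"
```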

Pricing

Cache writes and reads are billed differently by provider. Implicit-caching providers apply their cache pricing automatically when eligible. Explicit Claude caching uses the TTL settings below.

Gemini Pro models (implicit caching, provider-native)

Token type            | Rate (per 1M tokens) | Notes
----------------------|----------------------|------
Regular input         | $2.00                |
Cache write surcharge | +$0.375              | Added on top of input cost
Cache read            | $0.20                | 90% cheaper than input
Example: writing 10,000 cached tokens costs (10k × $2.00/M) + (10k × $0.375/M) = $0.02375. Reading 10,000 cached tokens costs 10k × $0.20/M = $0.002.
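The arithmetic above can be reproduced as a small sketch (gemini_pro_cache_cost is illustrative; the rates are the per-1M-token figures from the table):

```python
def gemini_pro_cache_cost(tokens, first_request):
    """Dollar cost of caching `tokens` on a Gemini Pro model (rates per 1M tokens)."""
    input_rate, write_surcharge, read_rate = 2.00, 0.375, 0.20
    if first_request:
        # cache write: normal input cost plus the write surcharge
        return tokens * (input_rate + write_surcharge) / 1_000_000
    return tokens * read_rate / 1_000_000  # cache read

print(gemini_pro_cache_cost(10_000, True))   # 0.02375
print(gemini_pro_cache_cost(10_000, False))  # 0.002
```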

Gemini Flash models (implicit caching, provider-native)

Token type            | Rate (per 1M tokens) | Notes
----------------------|----------------------|------
Regular input         | Varies by model      |
Cache write surcharge | +$0.083              | Added on top of input cost
Cache read            | 10% of input rate    | 90% cheaper than input
For Gemini 2.0 models, cache reads are 25% of the base input rate (75% cheaper), not 10%.

Claude models (explicit caching)

TTL | Creation multiplier on cached input tokens | Read multiplier
----|--------------------------------------------|----------------
5m  | 1.25x                                      | 0.1x
1h  | 2.0x                                       | 0.1x
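As a sketch applying these multipliers (claude_cached_cost is illustrative; the base input rate is whatever the model's per-1M-token price is, and the $3.00/M used in the example is a hypothetical figure, not a quoted price):

```python
def claude_cached_cost(tokens, base_rate_per_mtok, ttl="5m", op="read"):
    """Dollar cost of cached tokens, given the model's base input rate per 1M tokens."""
    write_mult = {"5m": 1.25, "1h": 2.0}[ttl]
    mult = write_mult if op == "write" else 0.1  # reads are 0.1x for either TTL
    return tokens * base_rate_per_mtok * mult / 1_000_000

# hypothetical $3.00/M base rate: writing 8,000 tokens at 5m TTL, then reading them
write_cost = claude_cached_cost(8_000, 3.00, ttl="5m", op="write")
read_cost = claude_cached_cost(8_000, 3.00)
```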

TTL Options (Explicit Claude Controls)

TTL  | Duration  | Description
-----|-----------|------------
"5m" | 5 minutes | Default. Suitable for interactive sessions.
"1h" | 1 hour    | Extended. Useful for batch processing or long-running sessions.

Structuring Prompts for Cache Hits

Cache hits require the cached prefix to be byte-identical across requests. Best practices:
  • Put static content first (system prompt, reference docs, tool definitions).
  • Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
  • Put dynamic content after the cache boundary (typically the latest user message).
Behavior:
  • On a cache hit, the TTL resets.
  • If the TTL elapses without reuse, the cache entry expires.
  • Any prefix change (even one character) causes a cache miss and a new cache write.
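The best practices above can be sketched as a message builder that keeps the cached prefix byte-identical across requests (build_messages and the request_id placement are illustrative):

```python
# Static content never changes between requests, so the prefix stays byte-identical.
STATIC_SYSTEM = "Reference docs, tool definitions, and instructions..."

def build_messages(question, request_id):
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # cacheable prefix
        # per-request values (IDs, timestamps) go AFTER the cache boundary,
        # so they cannot invalidate the cached prefix
        {"role": "user", "content": f"[req {request_id}] {question}"},
    ]
```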

Cache Consistency with stickyProvider (Explicit Caching)

Each provider keeps its own cache. If a request fails over to another provider, the previous cache may be unavailable. If cache consistency matters more than availability, set:
{
  "promptCaching": { "enabled": true, "ttl": "5m", "stickyProvider": true }
}
Behavior:
  • stickyProvider: false (default): the request may succeed even if routing changes, but you might rebuild caches.
  • stickyProvider: true: if a fallback would be required, the request returns 503 instead.
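One way to handle the 503 is to retry once without stickiness, accepting a possible cache rebuild. A sketch (send_with_sticky_fallback is illustrative; send_request stands in for your HTTP call and is assumed to return a (status_code, response) pair):

```python
def send_with_sticky_fallback(body, send_request):
    """Try with stickyProvider first; on 503, retry allowing provider fallback."""
    body.setdefault("promptCaching", {})["stickyProvider"] = True
    status, resp = send_request(body)
    if status == 503:  # sticky routing unavailable on this request
        body["promptCaching"]["stickyProvider"] = False
        status, resp = send_request(body)  # accept a possible cache rebuild
    return status, resp
```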

NanoGPT Web UI

In the NanoGPT web UI, models with explicit prompt-caching controls show a prompt caching toggle where you can choose cache duration.

Limitations and Caveats

  • Provider-side minimum token thresholds still apply before a cache entry is created.
  • A maximum of 4 cache breakpoints (cache_control) are supported per request.
  • Some models report aggregate prompt usage differently; use cache_creation_input_tokens and cache_read_input_tokens for authoritative cached token counts.
  • On cache hits, a small non-zero cache_creation_input_tokens can appear due to per-request overhead and does not necessarily indicate a cache miss.
  • Implicit caching behavior (eligibility, TTL behavior, and exact discounts) is provider-dependent.