Overview
The NanoGPT API offers multiple ways to generate text, including OpenAI-compatible endpoints and our legacy options. This guide covers all available text generation methods. If you are using a TEE-backed model (e.g., prefixed with `TEE/`), you can also verify the enclave attestation and signatures for your chat completions. See the TEE Model Verification guide for more details.
Provider Selection
Pay-as-you-go requests can select an upstream provider for supported open-source models using the `X-Provider` header or saved preferences. Subscription endpoints ignore provider selection. See Provider Selection.
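A minimal sketch of pinning an upstream provider on a pay-as-you-go request. The base URL, API-key environment variable, provider value, and model id are placeholders, not values taken from this guide:

```python
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL; adjust if yours differs

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",  # placeholder env var
        "X-Provider": "example-provider",  # placeholder; use a provider from the Provider Selection docs
    },
    json={
        "model": "example-open-source-model",  # placeholder open-source model id
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```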
OpenAI Compatible Endpoints
Chat Completions (v1/chat/completions)
This endpoint mimics OpenAI’s chat completions API:
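A minimal non-streaming sketch; the base URL, API-key environment variable, and model id are placeholders:

```python
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
    json={
        "model": "example-model",  # placeholder model id
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a haiku about the sea."},
        ],
        "stream": False,  # set to True for server-sent event streaming
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, an existing OpenAI client library can also be pointed at it by overriding the client’s base URL; the raw HTTP call above is shown only to make the request shape explicit.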
Responses (v1/responses)
Use the OpenAI Responses-compatible endpoint for stateful threading (`previous_response_id`), background processing, and Responses-style streaming events. See the dedicated docs at /api-reference/endpoint/responses.
Text Completions (v1/completions)
This endpoint mimics OpenAI’s legacy text completions API:
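A minimal sketch using the completions-style `prompt` field; the base URL, API-key environment variable, and model id are placeholders:

```python
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL

response = requests.post(
    f"{BASE_URL}/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
    json={
        "model": "example-completion-model",  # placeholder model id
        "prompt": "Once upon a time",
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["text"])
```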
Legacy Text Completions
An older, non-OpenAI compatible text completion endpoint is also available.
Prompt Caching (Claude Models)
Claude caching follows Anthropic’s Messages schema: `cache_control` lives on the message content blocks you want to reuse. NanoGPT simply forwards those markers to Anthropic, so you decide where each cache breakpoint sits. The first invocation costs 1.25× (5-minute TTL) or 2× (1-hour TTL) the normal input token price; cached replays discount the same tokens by roughly 90%.
Note: NanoGPT’s automatic failover system ensures high availability but may occasionally cause cache misses. If you’re seeing unexpected cache misses in your usage logs, see the “Cache Consistency with stickyProvider” section below.
The prompt_caching / promptCaching helper accepts these options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | — | Enable prompt caching |
| `ttl` | string | `"5m"` | Cache time-to-live: `"5m"` or `"1h"` |
| `cut_after_message_index` | integer | — | Zero-based index; cache all messages up to and including this index |
| `stickyProvider` | boolean | `false` | New: when `true`, disables automatic failover to preserve cache consistency. Returns a 503 error instead of switching services. |
Explicit cache_control markers
- `cache_control` belongs to the individual content blocks (`system`, `user`, tool definitions, etc.). Each marker caches the entire prefix up to and including that block, matching Anthropic’s behavior.
- Supported TTLs are `5m` and `1h`. Omit `ttl` to use the default `5m` window.
- Set the `anthropic-beta: prompt-caching-2024-07-31` header on any request that contains cache markers; Anthropic rejects cache requests without the beta flag.
- Check `usage.prompt_tokens_details.cached_tokens` in NanoGPT’s response to confirm what was billed at the discounted rate.
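A sketch that applies the rules above: an explicit `cache_control` marker on a Messages-style content block, the beta header, and a check of the cached-token count. The base URL, API-key environment variable, and Claude model id are placeholders:

```python
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL
LONG_CONTEXT = "..."  # large, stable prefix worth caching (documents, tool definitions, etc.)

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",
        "anthropic-beta": "prompt-caching-2024-07-31",  # required when cache markers are present
    },
    json={
        "model": "example-claude-model",  # placeholder Claude model id
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": LONG_CONTEXT,
                        # Caches the entire prefix up to and including this block for 1 hour
                        "cache_control": {"type": "ephemeral", "ttl": "1h"},
                    },
                    {"type": "text", "text": "Summarize the document above."},
                ],
            }
        ],
    },
)
usage = response.json()["usage"]
# Tokens billed at the discounted cached rate
print(usage["prompt_tokens_details"]["cached_tokens"])
```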
Using the prompt_caching helper
If you prefer not to duplicate cache_control entries manually, NanoGPT accepts a helper object that tags the leading prefix for you.
cut_after_message_index is zero-based and points at the last message in the static prefix. NanoGPT will attach a cache_control block with your TTL to each message up to that index before forwarding the request to Anthropic—no additional heuristics are applied. If you need different cache durations or non-contiguous breakpoints, fall back to explicit cache_control markers in your messages array.
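A sketch of the helper in a request body, assuming the snake_case `prompt_caching` spelling (the camelCase `promptCaching` form is equivalent); the base URL, API-key environment variable, and model id are placeholders:

```python
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL

payload = {
    "model": "example-claude-model",  # placeholder Claude model id
    "messages": [
        {"role": "system", "content": "<long, stable policy document>"},    # index 0
        {"role": "user", "content": "<earlier conversation context>"},      # index 1
        {"role": "user", "content": "New question that changes each call"}, # index 2
    ],
    # NanoGPT attaches a cache_control block with this TTL to messages 0 and 1
    # (everything up to and including cut_after_message_index) before forwarding.
    "prompt_caching": {
        "enabled": True,
        "ttl": "1h",
        "cut_after_message_index": 1,
    },
}

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
    json=payload,
)
print(response.json()["usage"]["prompt_tokens_details"]["cached_tokens"])
```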
Cache Consistency with stickyProvider
NanoGPT automatically fails over to backup services when the primary service is temporarily unavailable. While this ensures high availability, it can break your prompt cache because each backend service maintains its own separate cache.
If cache consistency is more important than availability for your use case, you can enable the stickyProvider option:
- `stickyProvider: false` (default): If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you’ll pay full price for that request and need to rebuild the cache).
- `stickyProvider: true`: If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers.
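A sketch of opting in and retrying the 503 yourself so the warm cache is reused once the primary service recovers; the base URL, API-key environment variable, and model id are placeholders:

```python
import os
import time
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL

payload = {
    "model": "example-claude-model",  # placeholder Claude model id
    "messages": [{"role": "user", "content": "<cached prefix + new question>"}],
    "prompt_caching": {
        "enabled": True,
        "ttl": "5m",
        "cut_after_message_index": 0,
        "stickyProvider": True,  # fail with 503 instead of switching services
    },
}

response = None
for attempt in range(5):
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
        json=payload,
    )
    if response.status_code != 503:
        break
    time.sleep(2 ** attempt)  # back off while the primary service recovers

response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```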
Use `stickyProvider: true` when:
- You have very large cached contexts where cache misses are expensive
- You prefer to retry failed requests yourself rather than pay for cache rebuilds
- Cost predictability is more important than request success rate
Use `stickyProvider: false` (the default) when:
- You prefer requests to always succeed when possible
- Occasional cache misses are acceptable
- You’re using shorter contexts where cache rebuilds are inexpensive
Chat Completions with Web Search
Enable real-time web access for any model by appending special suffixes:
Web Search Options
- `:online`: Standard search with 10 results ($0.006 per request)
- `:online/linkup-deep`: Deep iterative search ($0.06 per request)
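A sketch assuming the suffix is appended directly to the model id in an otherwise ordinary chat completions request; the base URL, API-key environment variable, and base model id are placeholders:

```python
import os
import requests

BASE_URL = "https://nano-gpt.com/api/v1"  # assumed base URL

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},
    json={
        # :online enables standard web search; use :online/linkup-deep for deep iterative search
        "model": "example-model:online",  # placeholder base model id plus suffix
        "messages": [
            {"role": "user", "content": "What are today's top technology headlines?"}
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```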