Creates a chat completion for the provided messages
A subscription variant of the endpoint is available at https://nano-gpt.com/api/subscription/v1/chat/completions (swap /api/v1 for /api/subscription/v1).

The /api/v1/chat/completions endpoint supports OpenAI-compatible function calling. You can describe callable functions in the tools array, control when the model may invoke them, and continue the conversation by echoing tool role messages that reference the assistant's chosen call.
- tools (optional array): Each entry must be { "type": "function", "function": { "name": string, "description"?: string, "parameters"?: JSON-Schema object } }. Only function tools are accepted. The serialized tools payload is limited to 200 KB (override via TOOL_SPEC_MAX_BYTES); violating the shape or size yields a 400 with tool_spec_too_large, invalid_tool_spec, or invalid_tool_spec_parse.
- tool_choice (optional string or object): Defaults to auto. Set "none" to guarantee no tool calls (the server also drops the tools payload upstream), "required" to force the next response to be a tool call, or { "type": "function", "function": { "name": "your_function" } } to pin the exact function.
- parallel_tool_calls (optional boolean): When true, the flag is forwarded to providers that support issuing multiple tool calls in a single turn. Models that ignore the flag fall back to sequential calls.
- messages[].tool_calls (assistant role): Persist the tool call metadata returned by the model so future turns can see which functions were invoked. Each item uses the OpenAI shape { id, type: "function", function: { name, arguments } }.
- messages[] with role: "tool": Respond to the model by sending { "role": "tool", "tool_call_id": "<assistant tool_calls id>", "content": "<JSON or text payload>" }. The server drops any tool response that references an unknown tool_call_id, so keep the IDs in sync.

If you send tool_choice: "none" together with a tools array, the request is accepted but the tools are omitted before hitting the model; invalid schemas or oversize payloads return the error codes above. Responses follow the standard OpenAI tool_calls schema, so consumers can reuse their existing parsing logic without changes.
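A minimal sketch of a tool-calling round trip. The get_weather function, its schema, and the NANOGPT_API_KEY environment variable are illustrative assumptions, not part of the API:

```python
import os, json, requests

API_URL = "https://nano-gpt.com/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"}  # assumed env var

# Describe a hypothetical callable function using the OpenAI tools shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
resp = requests.post(API_URL, headers=HEADERS, json={
    "model": "chatgpt-4o-latest",
    "messages": messages,
    "tools": tools,
    "tool_choice": "auto",
}).json()

assistant = resp["choices"][0]["message"]
if assistant.get("tool_calls"):
    call = assistant["tool_calls"][0]
    # Run your own implementation, then echo the result back with the same tool_call_id.
    result = {"city": "Oslo", "temp_c": 7}  # placeholder for a real lookup
    messages += [assistant, {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    }]
    final = requests.post(API_URL, headers=HEADERS, json={
        "model": "chatgpt-4o-latest",
        "messages": messages,
        "tools": tools,
    }).json()
    print(final["choices"][0]["message"]["content"])
```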
The /api/v1/chat/completions endpoint accepts a full set of sampling and decoding knobs. All fields are optional; omit any you want to leave at provider defaults.
| Parameter | Range/Default | Description |
|---|---|---|
| temperature | 0–2 (default 0.7) | Classic randomness control; higher values explore more. |
| top_p | 0–1 (default 1) | Nucleus sampling that trims to the smallest set above top_p cumulative probability. |
| top_k | 1+ | Sample only from the top-k tokens each step. |
| top_a | provider default | Blends temperature and nucleus behavior; set only if a model calls for it. |
| min_p | 0–1 | Require each candidate token to exceed a probability floor. |
| tfs | 0–1 | Tail free sampling; 1 disables. |
| eta_cutoff / epsilon_cutoff | provider default | Drop tokens once they fall below the tail thresholds. |
| typical_p | 0–1 | Entropy-based nucleus sampling; keeps tokens whose surprise matches expected entropy. |
| mirostat_mode | 0/1/2 | Enable Mirostat sampling; set tau/eta when active. |
| mirostat_tau / mirostat_eta | provider default | Target entropy and learning rate for Mirostat. |
| Parameter | Range/Default | Description |
|---|---|---|
| max_tokens | 1+ (default 4000) | Upper bound on generated tokens. |
| min_tokens | 0+ (default 0) | Minimum completion length when provider supports it. |
| stop | string or string[] | Stop sequences passed upstream. |
| stop_token_ids | int[] | Stop generation on specific token IDs (limited provider support). |
| include_stop_str_in_output | boolean (default false) | Keep the stop sequence in the final text where supported. |
| ignore_eos | boolean (default false) | Continue even if the model predicts EOS internally. |
| Parameter | Range/Default | Description |
|---|---|---|
| frequency_penalty | -2 – 2 (default 0) | Penalize tokens proportional to prior frequency. |
| presence_penalty | -2 – 2 (default 0) | Penalize tokens based on whether they appeared at all. |
| repetition_penalty | -2 – 2 | Provider-agnostic repetition modifier; >1 discourages repeats. |
| no_repeat_ngram_size | 0+ | Forbid repeating n-grams of the given size (limited support). |
| custom_token_bans | int[] | Fully block listed token IDs. |
| Parameter | Range/Default | Description |
|---|---|---|
| logit_bias | object | Map token IDs to additive logits (OpenAI-compatible). |
| logprobs | boolean or int | Return token-level logprobs where supported. |
| prompt_logprobs | boolean | Request logprobs on the prompt when available. |
| seed | integer | Make completions repeatable where the provider allows it. |
These controls can be combined (for example temperature + top_p + top_k), but overly narrow settings may lead to early stops.
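A minimal sketch combining several samplers in one request; the exact values and the NANOGPT_API_KEY environment variable are illustrative, not recommendations:

```python
import os, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},  # assumed env var
    json={
        "model": "chatgpt-4o-latest",
        "messages": [{"role": "user", "content": "Write a haiku about fjords."}],
        "temperature": 0.8,        # explore a bit more than the 0.7 default
        "top_p": 0.9,              # nucleus sampling
        "top_k": 40,               # cap the candidate pool per step
        "max_tokens": 200,
        "frequency_penalty": 0.3,  # discourage verbatim repeats
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```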
Web search can be enabled by appending a suffix to the model value or via the linkup object in the request body. If linkup.enabled is true, it takes precedence over any model suffix.

Append one of these suffixes to the model value:

- :online (default web search, standard depth)
- :online/linkup (Linkup, standard)
- :online/linkup-deep (Linkup, deep)
- :online/tavily (Tavily, standard)
- :online/tavily-deep (Tavily, deep)
- :online/exa-fast (Exa, fast)
- :online/exa-auto (Exa, auto)
- :online/exa-neural (Exa, neural)
- :online/exa-deep (Exa, deep)
- :online/kagi (Kagi, standard, search)
- :online/kagi-web (Kagi, standard, web)
- :online/kagi-news (Kagi, standard, news)
- :online/kagi-search (Kagi, deep, search)

:online without an explicit provider uses the default web search backend.
Alternatively, send a linkup object in the request body. This works with or without a model suffix and controls web search across all providers.
linkup fields:
- enabled (boolean, required to activate web search)
- provider (string): linkup | tavily | exa | kagi
- depth (string): standard or deep; Exa additionally accepts fast, auto, neural, deep (use standard if you want auto)
- search_context_size or searchContextSize (string): low | medium | high (default: medium)
- kagiSource or kagi_source (string, Kagi only): web | news | search
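A minimal sketch enabling web search through the linkup object; the chosen values and the NANOGPT_API_KEY environment variable are illustrative:

```python
import os, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},  # assumed env var
    json={
        "model": "chatgpt-4o-latest",             # no :online suffix needed when linkup.enabled is true
        "messages": [{"role": "user", "content": "Summarize today's top AI news."}],
        "linkup": {
            "enabled": True,                      # required to activate web search
            "provider": "tavily",                 # linkup | tavily | exa | kagi
            "depth": "standard",
            "search_context_size": "medium",
        },
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```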
{"type":"image_url","image_url":{"url":"https://..."}}{"type":"image_url","image_url":{"url":"data:image/png;base64,...."}}image/png, image/jpeg, image/jpg, image/webp.) are auto‑normalized into structured parts server‑side....BASE64... with your image bytes.
Streaming responses emit data: { ... } lines until a final terminator. Usage metrics appear only when requested: set stream_options.include_usage to true for streaming responses, or send "include_usage": true on non-streaming calls.
Note: Prompt-caching helpers implicitly force include_usage, so cached requests still receive usage data without extra flags.
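A minimal sketch requesting usage on a non-streaming call (for streams, set stream_options.include_usage instead); the NANOGPT_API_KEY environment variable is an assumption:

```python
import os, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},  # assumed env var
    json={
        "model": "chatgpt-4o-latest",
        "messages": [{"role": "user", "content": "Hello!"}],
        "include_usage": True,  # non-streaming form; streams use stream_options.include_usage
    },
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data.get("usage"))  # prompt/completion token counts
```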
Prompt caching follows the same model as Anthropic's /v1/messages: you must place cache_control objects on the content blocks you want the model to reuse, or instruct NanoGPT to do it for you via the prompt_caching helper.
Note: NanoGPT’s automatic failover system ensures high availability but may occasionally cause cache misses. If you’re seeing unexpected cache misses in your usage logs, see the “Cache Consistency with stickyProvider” section below.
The prompt_caching / promptCaching helper accepts these options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | — | Enable prompt caching |
| ttl | string | "5m" | Cache time-to-live: "5m" or "1h" |
| cut_after_message_index | integer | — | Zero-based index; cache all messages up to and including this index |
| stickyProvider | boolean | false | New: When true, disable automatic failover to preserve cache consistency. Returns 503 error instead of switching services. |
- Each cache_control marker caches the full prefix up to that block. Place them on every static chunk (system messages, tool definitions, large contexts) you plan to reuse.
- Two TTLs are available: 5m (1.25× one-time billing) and 1h (2× one-time). Replays show the discounted tokens inside usage.prompt_tokens_details.cached_tokens.
- The anthropic-beta: prompt-caching-2024-07-31 header is mandatory on requests that include caching metadata.
- cut_after_message_index is zero-based and NanoGPT does not guess: everything at or before that index is cached, everything after is not. Switch back to explicit cache_control blocks if you need multiple cache breakpoints or mixed TTLs in the same payload.
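A sketch using the prompt_caching helper to cache a long static system prompt; the model choice and index are illustrative:

```python
import os, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",  # assumed env var
        "anthropic-beta": "prompt-caching-2024-07-31",               # required with caching metadata
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "messages": [
            {"role": "system", "content": "...large static instructions..."},
            {"role": "user", "content": "First question about the document."},
        ],
        "prompt_caching": {
            "enabled": True,
            "ttl": "1h",                   # "5m" (default) or "1h"
            "cut_after_message_index": 0,  # cache everything up to and including message 0
        },
    },
)
print(resp.json().get("usage"))  # cached_tokens appears under prompt_tokens_details on replays
```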
Cache Consistency with stickyProvider

The stickyProvider option controls what happens when the primary service fails:
- stickyProvider: false (default) — If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you'll pay full price for that request and need to rebuild the cache).
- stickyProvider: true — If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers. A request using this mode is sketched below.
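A sketch opting into sticky routing to protect the cache; retry handling of the 503 is up to the caller and the helper function name here is illustrative:

```python
import os, time, requests

def cached_completion(messages):
    """Retry on 503 rather than letting the request fail over and lose the cache."""
    body = {
        "model": "claude-3-5-sonnet-20241022",
        "messages": messages,
        "prompt_caching": {
            "enabled": True,
            "cut_after_message_index": 0,
            "stickyProvider": True,  # return 503 instead of failing over
        },
    }
    headers = {
        "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",  # assumed env var
        "anthropic-beta": "prompt-caching-2024-07-31",
    }
    for attempt in range(3):
        resp = requests.post("https://nano-gpt.com/api/v1/chat/completions",
                             headers=headers, json=body)
        if resp.status_code != 503:
            return resp.json()
        time.sleep(2 ** attempt)  # primary provider is down; wait and retry
    raise RuntimeError("Primary provider unavailable; cache preserved but no completion.")
```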
To get usage data back, make sure include_usage is true in the payload or that prompt caching is enabled.

Memory

- Append :memory to any model name, or set memory: true in the request body.
- Suffixes compose; for example, :online:memory enables both web search and memory.
- Set an expiration with :memory-<days> (1..365) or the header memory_expiration_days: <days>; the header takes precedence.
- See also model_context_limit below.
model_context_limit (number or numeric string)

Thinking model suffixes (for example :thinking or -thinking:8192) are normalized before dispatch, but the response contract remains the same.
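A sketch enabling memory with a 30-day expiration via the model suffix; the body flag memory: true works as well, and the NANOGPT_API_KEY environment variable is an assumption:

```python
import os, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}",  # assumed env var
        # Alternatively set expiration here; the header takes precedence over the suffix:
        # "memory_expiration_days": "30",
    },
    json={
        "model": "chatgpt-4o-latest:memory-30",  # remember this conversation for 30 days
        "messages": [{"role": "user", "content": "Remember that my favorite color is teal."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```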
Three base URLs expose reasoning output in different shapes:

- https://nano-gpt.com/api/v1/chat/completions — default endpoint that streams internal thoughts through choices[0].delta.reasoning (and repeats them in message.reasoning on completion). Recommended for apps like SillyTavern that understand the modern response shape.
- https://nano-gpt.com/api/v1legacy/chat/completions — legacy contract that swaps the field name to choices[0].delta.reasoning_content / message.reasoning_content for older OpenAI-compatible clients. Use this for LiteLLM's OpenAI adapter to avoid downstream parsing errors.
- https://nano-gpt.com/api/v1thinking/chat/completions — reasoning-aware models write everything into the normal choices[0].delta.content stream so clients that ignore reasoning fields still see the full conversation transcript. This is the preferred base URL for JanitorAI.

Streaming responses place the assistant reply in choices[0].delta.content and the thought process in choices[0].delta.reasoning (plus optional delta.reasoning_details). Reasoning deltas are dispatched before or alongside regular content, letting you render both panes in real-time.
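A sketch of consuming the default endpoint's stream and separating reasoning from reply text; the terminator check follows the usual OpenAI-style convention and the model name is one of the examples from this page:

```python
import os, json, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},  # assumed env var
    json={
        "model": "claude-3-7-sonnet-thinking:8192",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":          # assumed OpenAI-style stream terminator
        break
    chunk = json.loads(payload)
    if not chunk.get("choices"):
        continue
    delta = chunk["choices"][0].get("delta", {})
    if delta.get("reasoning"):
        print("[thinking]", delta["reasoning"], end="", flush=True)
    if delta.get("content"):
        print(delta["content"], end="", flush=True)
```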
choices[0].message.content contains the assistant reply and choices[0].message.reasoning (plus reasoning_details when available) contains the full chain-of-thought. Non-streaming requests reuse the same formatter, so the reasoning block is present as a dedicated field.
reasoning: { "exclude": true } to strip the reasoning payload from both streaming deltas and the final message. With this flag set, delta.reasoning and message.reasoning are omitted entirely.
reasoning_effort

| Value | Description |
|---|---|
| none | Disables reasoning entirely |
| minimal | Allocates ~10% of max_tokens for reasoning |
| low | Allocates ~20% of max_tokens for reasoning |
| medium | Allocates ~50% of max_tokens for reasoning (default when reasoning is enabled) |
| high | Allocates ~80% of max_tokens for reasoning |
The reasoning_effort parameter can be passed at the top level:
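For example (a sketch; the model name and prompt are illustrative):

```python
import os, requests

resp = requests.post(
    "https://nano-gpt.com/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['NANOGPT_API_KEY']}"},  # assumed env var
    json={
        "model": "claude-3-7-sonnet-thinking:8192",
        "messages": [{"role": "user", "content": "Plan a three-day trip to Bergen."}],
        "max_tokens": 4000,
        "reasoning_effort": "low",  # ~20% of max_tokens reserved for reasoning
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```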
It can also be nested inside the reasoning object:
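Equivalent nested form, as a sketch; the nested key name effort is an assumption based on the top-level parameter name, not confirmed by this page:

```python
body = {
    "model": "claude-3-7-sonnet-thinking:8192",
    "messages": [{"role": "user", "content": "Plan a three-day trip to Bergen."}],
    "max_tokens": 4000,
    "reasoning": {
        "effort": "low",      # assumed nested equivalent of reasoning_effort
        # "exclude": True,    # optionally strip reasoning from the response
    },
}
```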
Passing reasoning_effort to models that don't support reasoning has no effect (the parameter is ignored).
:reasoning-exclude suffix

Append :reasoning-exclude to the model name to hide reasoning output.
{ "reasoning": { "exclude": true } }:reasoning-exclude suffix is stripped before the request is routed; other suffixes remain active:reasoning-exclude composes safely with the other routing suffixes you already use:
- :thinking (and variants like …-thinking:8192)
- :online and :online/linkup-deep
- :memory and :memory-<days>

Examples:

- claude-3-7-sonnet-thinking:8192:reasoning-exclude
- gpt-4o:online:reasoning-exclude
- claude-3-5-sonnet-20241022:memory-30:online/linkup-deep:reasoning-exclude

Clients that expect the legacy reasoning_content field can opt in per request. Set reasoning.delta_field to "reasoning_content", or use the top-level shorthands reasoning_delta_field / reasoning_content_compat if updating nested objects is difficult. When the toggle is active, every streaming and non-streaming response exposes reasoning_content instead of reasoning, and the modern key is omitted. The compatibility pass is skipped if reasoning.exclude is true, because no reasoning payload is emitted. If you cannot change the request payload, target https://nano-gpt.com/api/v1legacy/chat/completions instead; the legacy endpoint keeps reasoning_content without extra flags. LiteLLM's OpenAI adapter should point here to maintain compatibility. For clients that ignore reasoning-specific fields entirely, use https://nano-gpt.com/api/v1thinking/chat/completions so the full text appears in the standard content stream; this is the correct choice for JanitorAI.
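A sketch opting into the legacy field name on the default endpoint; the delta_field value is quoted from the text above and the rest of the body is illustrative:

```python
body = {
    "model": "claude-3-7-sonnet-thinking:8192",
    "messages": [{"role": "user", "content": "Explain quicksort."}],
    "stream": True,
    "reasoning": {"delta_field": "reasoning_content"},  # responses expose reasoning_content, not reasoning
    # Top-level shorthand alternative if nested objects are awkward to set:
    # "reasoning_delta_field": "reasoning_content",
}
```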
Some models (phala/*) require byte-for-byte SSE passthrough for signature verification. For those models, streaming cannot be filtered; the suffix has no effect on the streaming bytes.

youtube_transcripts (boolean), default false (opt-in). Set youtube_transcripts to true (the string "true" is also accepted) to fetch transcripts. Only set it to true when you want the system to retrieve and bill for transcripts.
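A sketch enabling transcript fetching for a linked video; the URL is a placeholder:

```python
body = {
    "model": "chatgpt-4o-latest",
    "messages": [{
        "role": "user",
        "content": "Summarize this talk: https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    }],
    "youtube_transcripts": True,  # opt in; transcript retrieval is billed
}
```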
URL scraping is requested separately with scraping: true; YouTube transcripts do not require scraping: true.

Web search provider scores:

| Provider | Score |
|---|---|
| LinkUp Deep Search | 90.10% |
| Exa | 90.04% |
| Perplexity Sonar Pro | 86% |
| LinkUp Standard Search | 85% |
| Perplexity Sonar | 77% |
| Tavily | 73% |
Select a provider with the model suffix or the linkup object.

Authorization: Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Parameters for chat completion
model
The model to use for completion. Append ':online' for web search ($0.005/request) or ':online/linkup-deep' for deep web search ($0.05/request). Examples:
"chatgpt-4o-latest"
"chatgpt-4o-latest:online"
"chatgpt-4o-latest:online/linkup-deep"
"claude-3-5-sonnet-20241022:online"
messages
Array of message objects with role and content

stream
Whether to stream the response

temperature
Classic randomness control. Accepts any decimal between 0-2. Lower numbers bias toward deterministic responses, higher values explore more aggressively
Range: 0 <= x <= 2

max_tokens
Upper bound on generated tokens
Range: x >= 1

top_p
Nucleus sampling. When set below 1.0, trims candidate tokens to the smallest set whose cumulative probability exceeds top_p. Works well as an alternative to tweaking temperature
Range: 0 <= x <= 1

frequency_penalty
Penalizes tokens proportionally to how often they appeared previously. Negative values encourage repetition; positive values discourage it
Range: -2 <= x <= 2

presence_penalty
Penalizes tokens based on whether they appeared at all. Good for keeping the model on topic without outright banning words
Range: -2 <= x <= 2

repetition_penalty
Provider-agnostic repetition modifier (distinct from OpenAI penalties). Values >1 discourage repetition
Range: -2 <= x <= 2

top_k
Caps sampling to the top-k highest probability tokens per step

top_a
Combines top-p and temperature behavior; leave unset unless a model description explicitly calls for it

min_p
Ensures each candidate token probability exceeds a floor (0-1). Helpful for stopping models from collapsing into low-entropy loops
Range: 0 <= x <= 1

tfs
Tail free sampling. Values between 0-1 let you shave the long tail of the distribution; 1.0 disables the feature
Range: 0 <= x <= 1

eta_cutoff
Cut probabilities as soon as they fall below the specified tail threshold

epsilon_cutoff
Cut probabilities as soon as they fall below the specified tail threshold

typical_p
Typical sampling (aka entropy-based nucleus). Works like top_p but preserves tokens whose surprise matches the expected entropy
Range: 0 <= x <= 1

mirostat_mode
Enables Mirostat sampling for models that support it. Set to 1 or 2 to activate
Allowed values: 0, 1, 2

mirostat_tau
Mirostat target entropy parameter. Used when mirostat_mode is enabled

mirostat_eta
Mirostat learning rate parameter. Used when mirostat_mode is enabled

min_tokens
For providers that support it, enforces a minimum completion length before stop conditions fire
Range: x >= 0

stop
Stop sequences. Accepts string or array of strings. Values are passed directly to upstream providers

stop_token_ids
Numeric array that lets callers stop generation on specific token IDs. Not supported by many providers

include_stop_str_in_output
When true, keeps the stop sequence in the final text. Not supported by many providers

ignore_eos
Allows completions to continue even if the model predicts EOS internally. Useful for long creative writing runs

no_repeat_ngram_size
Extension that forbids repeating n-grams of the given size. Not supported by many providers
Range: x >= 0

custom_token_bans
List of token IDs to fully block

logit_bias
Object mapping token IDs to additive logits. Works just like OpenAI's version

logprobs
When true or a number, forwards the request to providers that support returning token-level log probabilities

prompt_logprobs
Requests logprobs on the prompt itself when the upstream API allows it

seed
Numeric seed. Wherever supported, passes the value to make completions repeatable

cut_after_message_index
Helper to tag the leading messages for Claude prompt caching. NanoGPT injects cache_control blocks on each message up to the specified index before forwarding to Anthropic.