Chat Completion
Creates a chat completion for the provided messages
Documentation Index
Fetch the complete documentation index at: https://docs.nano-gpt.com/llms.txt
Use this file to discover all available pages before exploring further.
https://nano-gpt.com/api/subscription/v1/chat/completions (swap /api/v1 for /api/subscription/v1).X-Provider or body provider explicitly selects or constrains providers for the request and is always billed pay-as-you-go at the selected provider’s price, including provider-selection markup. The body provider field accepts either the existing string form or a structured routing object with order, only, ignore, sort, max_price, allow_fallbacks, and require_parameters. Provider-selection-capable models also support routing preference suffixes such as :fast and :cheap. For subscription users, explicit provider selection bypasses subscription coverage for that request; X-Billing-Mode: paygo is only needed when forcing pay-as-you-go without an explicit provider or when saved provider preferences should apply to subscription-included traffic. See Provider Selection, Model Suffixes, and Pay-As-You-Go Billing Override.X-X402: true header. See X-402 Micropayments for details.Page map
Use the jump list below to navigate the long-form reference quickly.Sampling & decoding
Sampling & decoding
Structured outputs
Structured outputs
Web search
Web search
Images & caching
Images & caching
Memory & reasoning
Memory & reasoning
Tool calling
The/api/v1/chat/completions endpoint supports OpenAI-compatible function calling. You can describe callable functions in the tools array, control when the model may invoke them, and continue the conversation by echoing tool role messages that reference the assistant’s chosen call.
Request parameters
tools(optional array): Each entry must be{ "type": "function", "function": { "name": string, "description"?: string, "parameters"?: JSON-Schema object } }. Onlyfunctiontools are accepted. The serializedtoolspayload is limited to 200 KB (overrides viaTOOL_SPEC_MAX_BYTES); violating the shape or size yields a 400 withtool_spec_too_large,invalid_tool_spec, orinvalid_tool_spec_parse.tool_choice(optional string or object): Defaults toauto. Set"none"to guarantee no tool calls (the server also drops thetoolspayload upstream),"required"to force the next response to be a tool call, or{ "type": "function", "function": { "name": "your_function" } }to pin the exact function.parallel_tool_calls(optional boolean): Whentruethe flag is forwarded to providers that support issuing multiple tool calls in a single turn. Models that ignore the flag fall back to sequential calls.messages[].tool_calls(assistant role): Persist the tool call metadata returned by the model so future turns can see which functions were invoked. Each item uses the OpenAI shape{ id, type: "function", function: { name, arguments } }.messages[]withrole: "tool": Respond to the model by sending{ "role": "tool", "tool_call_id": "<assistant tool_calls id>", "content": "<JSON or text payload>" }. The server drops any tool response that references an unknowntool_call_id, so keep the IDs in sync.- Validation behavior: If you send
tool_choice: "none"with atoolsarray the request is accepted but the tools are omitted before hitting the model; invalid schemas or oversize payloads return the error codes above.
Example request
Example assistant/tool turn
tool_calls schema, so consumers can reuse their existing parsing logic without changes.
Overview
The Chat Completion endpoint provides OpenAI-compatible chat completions.Provider Routing Suffixes
Provider-selection-capable models support routing preference suffixes such as:fast and :cheap. See Provider Selection > Per-Request Routing Preference for the full list and billing rules, or Model Suffixes for all suffix composition rules.
Provider Routing Object
Theprovider request body field accepts either a provider ID string or a structured object for routing controls:
Sampling & Decoding Controls
The/api/v1/chat/completions endpoint accepts a full set of sampling and decoding knobs. All fields are optional; omit any you want to leave at provider defaults.
Temperature & Nucleus
| Parameter | Range/Default | Description |
|---|---|---|
temperature | 0–2 (provider default) | Classic randomness control; higher values explore more. If omitted, NanoGPT does not force a value and the routed provider/model default applies. |
top_p | 0–1 (default 1) | Nucleus sampling that trims to the smallest set above top_p cumulative probability. |
top_k | 1+ | Sample only from the top-k tokens each step. |
top_a | provider default | Blends temperature and nucleus behavior; set only if a model calls for it. |
min_p | 0–1 | Require each candidate token to exceed a probability floor. |
tfs | 0–1 | Tail free sampling; 1 disables. |
eta_cutoff / epsilon_cutoff | provider default | Drop tokens once they fall below the tail thresholds. |
typical_p | 0–1 | Entropy-based nucleus sampling; keeps tokens whose surprise matches expected entropy. |
mirostat_mode | 0/1/2 | Enable Mirostat sampling; set tau/eta when active. |
mirostat_tau / mirostat_eta | provider default | Target entropy and learning rate for Mirostat. |
Length & Stopping
| Parameter | Range/Default | Description |
|---|---|---|
max_tokens | 1+ (provider default) | Upper bound on generated tokens. If omitted, NanoGPT does not enforce an explicit default and the routed provider/model default applies. |
min_tokens | 0+ (default 0) | Minimum completion length when provider supports it. |
stop | string or string[] | Stop sequences passed upstream. |
stop_token_ids | int[] | Stop generation on specific token IDs (limited provider support). |
include_stop_str_in_output | boolean (default false) | Keep the stop sequence in the final text where supported. |
ignore_eos | boolean (default false) | Continue even if the model predicts EOS internally. |
Penalties & Repetition Guards
| Parameter | Range/Default | Description |
|---|---|---|
frequency_penalty | -2 – 2 (default 0) | Penalize tokens proportional to prior frequency. |
presence_penalty | -2 – 2 (default 0) | Penalize tokens based on whether they appeared at all. |
repetition_penalty | -2 – 2 | Provider-agnostic repetition modifier; >1 discourages repeats. |
no_repeat_ngram_size | 0+ | Forbid repeating n-grams of the given size (limited support). |
custom_token_bans | int[] | Fully block listed token IDs. |
Logit Shaping & Determinism
| Parameter | Range/Default | Description |
|---|---|---|
logit_bias | object | Map token IDs to additive logits (OpenAI-compatible). |
logprobs | boolean or int | Return token-level logprobs where supported. |
prompt_logprobs | boolean | Request logprobs on the prompt when available. |
seed | integer | Make completions repeatable where the provider allows it. |
Usage notes
- Parameters can be combined (e.g.,
temperature+top_p+top_k), but overly narrow settings may lead to early stops. - Invalid ranges yield a 400 before reaching the provider.
- Provider defaults apply to any omitted field.
Example request
Structured Outputs (response_format)
The/api/v1/chat/completions endpoint supports OpenAI-compatible structured outputs via the response_format parameter. This ensures the model returns valid JSON matching your specified schema.
Supported Formats
| Type | Description |
|---|---|
json_object | Forces the model to return valid JSON |
json_schema | Forces the model to return JSON matching a specific schema |
text | Default text output (no constraint) |
JSON Object Mode
Request valid JSON output without a specific schema:JSON Schema Mode (Structured Outputs)
Request JSON that conforms to a specific schema:Schema Requirements
When usingstrict: true:
- All properties must be listed in
required - Set
additionalProperties: false - NanoGPT automatically transforms optional properties to be nullable for OpenAI compatibility
Supported Models
JSON schema mode works with most models including:- OpenAI models (GPT-5.1, GPT-5.2, etc.)
- Anthropic Claude models
- Google Gemini models
- Many open-source models
Example Request
Example Response
Usage with Vercel AI SDK
Theresponse_format parameter is compatible with Vercel AI SDK’s generateObject:
Usage Notes
- Works with both streaming and non-streaming requests
- The
namefield injson_schemais required and should describe the output - Response content is a JSON string; parse it with
JSON.parse()in your application - Some provider-specific limitations may apply; if you encounter issues with a specific model, try an alternative
Web Search
Enable web search in two ways: model suffixes or awebSearch object in the request body. The legacy linkup object is still supported as an alias. If webSearch.enabled (or linkup.enabled) is true, it takes precedence over any model suffix.
OpenAI native web search: GPT-5+ / o1 / o3 / o4 models use OpenAI’s built-in web search automatically. No suffix is required; you can still set webSearch.search_context_size and webSearch.user_location. To force a different provider, specify a provider or suffix.
If you need full direct control over the search call itself (query, outputType, date/domain filters, or structured schema output), use Direct Web Search API (POST /api/web).
| Use case | Recommended endpoint |
|---|---|
| Model should answer with web context in one call | POST /api/v1/chat/completions |
| You need raw/structured web payload control | POST /api/web |
Option A: model suffixes
Append one of these to yourmodel value:
:online(default web search, standard depth):online/linkup(Linkup, standard):online/linkup-deep(Linkup, deep):online/tavily(Tavily, standard):online/tavily-deep(Tavily, deep):online/brave(Brave, standard):online/brave-deep(Brave, deep):online/exa-fast(Exa, fast):online/exa-auto(Exa, auto):online/exa-neural(Exa, neural):online/exa-deep(Exa, deep):online/exa-instant(Exa, instant):online/exa-deep-reasoning(Exa, deep-reasoning):online/kagi(Kagi, standard, search):online/kagi-web(Kagi, standard, web):online/kagi-news(Kagi, standard, news):online/kagi-search(Kagi, deep, search):online/perplexity(Perplexity, standard):online/perplexity-deep(Perplexity, deep):online/valyu(Valyu, standard, all sources):online/valyu-deep(Valyu, deep, all sources):online/valyu-web(Valyu, standard, web only):online/valyu-web-deep(Valyu, deep, web only)
:online without an explicit provider uses the default web search backend (Linkup).
Option B: request body configuration (recommended)
Send awebSearch object in the request body. The legacy linkup object is accepted as an alias. This works with or without a model suffix and controls web search across all providers.
webSearch fields:
enabled(boolean, required to activate web search)provider(string):linkup|tavily|brave|exa|kagi|perplexity|valyudepth(string):- Linkup/Tavily/Brave/Perplexity/Valyu:
standardordeep - Exa:
fast,auto,neural,deep,instant,deep-reasoning(usestandardif you wantauto) - Kagi:
standardordeep(searchsource only)
- Linkup/Tavily/Brave/Perplexity/Valyu:
search_context_sizeorsearchContextSize(string, OpenAI native):low|medium|high(default:medium)user_locationoruserLocation(object, OpenAI native):{ type: "approximate", country, city, region }searchType(string, Valyu only):all|webkagiSourceorkagi_source(string, Kagi only):web|news|search
Provider-specific options (set inside webSearch)
Perplexity
searchDomainFilter max 20 entries; searchLanguageFilter max 10 entries (ISO 639-1).
Valyu
countryCode uses a 2-letter ISO country code.
Tavily
Exa
OpenAI native (GPT-5.2)
Examples
Pricing by provider
| Provider | Standard | Deep | Notes |
|---|---|---|---|
| Linkup | $0.006 | $0.06 | Default provider |
| Tavily | $0.008 | $0.016 | Good value, free tier available |
| Exa | $0.005 base | + $0.001/page | For contents retrieval |
| Kagi Web/News | $0.002 | N/A | Cheapest for enrichment |
| Kagi Search | $0.025 | N/A | Full search mode |
| Perplexity | $0.005 | N/A | Flat rate |
| Valyu | ~$0.0015/result | Variable | Dynamic pricing |
| Brave | $0.005 | $0.005 | Flat rate |
| OpenAI Native | $0.01 + tokens | N/A | Per-call fee + model token costs |
Bring your own key (BYOK)
BYOK lets you route requests through your own upstream provider credentials.- Configure keys once: https://nano-gpt.com/byok
- Opt in per request via
x-use-byok: trueorbyok.enabled: true - Optionally force the provider via
x-byok-providerorbyok.provider - BYOK usage includes a 5% platform fee (your provider bills you directly for usage)
Web search BYOK
Web search BYOK availability is provider-dependent and can change over time. See the BYOK reference for the current support matrix.Advanced behavior (optional)
- Provider routing: For GPT-5+ / o1 / o3 / o4 models,
:onlinewithout an explicit provider uses OpenAI native web search. If you setwebSearch.provideror use an explicit:online/<provider>suffix, that provider is used instead. - Model suffix normalization:
:online(and provider/depth suffixes) are stripped from the model name before routing to the base model; the suffix only controls search behavior. - Query formation (non-OpenAI providers): The search query is derived from your latest user message and may include the previous user message if the latest is short. If you need full control over the query or raw results, use the Web Search endpoint (
/api/web). scraping: trueURL handling: When enabled, NanoGPT scans messages for publichttp(s)URLs, ignores local/private URLs, de-duplicates, and caps at 5. If no eligible URLs are found, scraping is skipped. Inline scraping in chat is billed at $0.0015 per successfully scraped URL. For explicit URL lists and the standalone endpoint price ($0.001 per URL), use/scrape-urls.
Image Input
Send images using the OpenAI‑compatible chat format. Provide image parts alongside text in themessages array.
Supported Forms
- Remote URL:
{"type":"image_url","image_url":{"url":"https://..."}} - Base64 data URL:
{"type":"image_url","image_url":{"url":"data:image/png;base64,...."}}
- Prefer HTTPS URLs; some upstreams reject non‑HTTPS. If in doubt, use base64 data URLs.
- Accepted mime types:
image/png,image/jpeg,image/jpg,image/webp. - Inline markdown images in plain text (e.g.,
) are auto‑normalized into structured parts server‑side.
Message Shape
cURL — Image URL (non‑streaming)
cURL — Base64 Data URL (non‑streaming)
Embed your image as a data URL. Replace...BASE64... with your image bytes.
cURL — Streaming SSE
See also: Streaming Protocol (SSE).data: { ... } lines until a final terminator. Usage metrics appear only when requested: set stream_options.include_usage to true for streaming responses, or send "include_usage": true on non-streaming calls.
Note: Prompt-caching helpers implicitly force include_usage, so cached requests still receive usage data without extra flags.
Caching (Implicit and Explicit Controls)
For the full guide (supported models, thresholds, pricing, and usage fields), see Prompt Caching. NanoGPT automatically applies implicit caching on providers/models that support it (including OpenAI, Gemini, and many open-source provider/model routes), so most requests do not need caching flags. Set top-levelcaching: true when you want NanoGPT to route the request to any available provider that supports prompt/input caching. This is capability-based routing: you do not need to choose a provider. If no cache-capable provider is available for the model, the request fails rather than silently using a non-caching provider.
Use explicit prompt-caching controls (prompt_caching, promptCaching, and body-level cache_control alias, plus inline cache_control) when you need Claude-specific cache boundaries, TTL selection, or prompt_caching.stickyProvider consistency control. Top-level caching: true does not add Anthropic-style cache_control markers or configure cache TTLs.
Cache-Capable Provider Routing
Top-levelcaching: true is provider routing, not prompt-cache annotation. It requires the routed provider to be marked as prompt-caching capable for the requested provider-selection model.
caching: true also enables sticky provider routing. After the first successful matching request, NanoGPT will try to use the same provider for later matching requests from the same API key or session, improving the chance of provider-side cache hits. This does not guarantee that a request will be served from cache.
To require a cache-capable provider without sticky routing, set stickyprovider: false:
| Parameter | Type | Default | Description |
|---|---|---|---|
caching | boolean | false | Require a cache-capable provider for this request. If none is usable for the model, the request fails. |
stickyprovider | boolean | true when caching: true | Prefer the previously recorded provider for later matching cache-capable requests. Set false to restore non-sticky cache-capable routing. |
stickyProvider | boolean | Alias | CamelCase alias for top-level stickyprovider. Use stickyprovider in examples. |
caching: true, routing works as follows:
- Filter to providers that are available, not excluded by preferences, and marked as prompt-caching capable.
- If stickiness is enabled, prefer the previously recorded provider for the same cache-relevant request shape when still usable.
- Otherwise choose the cheapest cache-capable provider by base input + output price.
- Use cache write/read pricing only as tie-breakers.
prompt_caching / promptCaching helper accepts these options:
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | boolean | — | Enable prompt caching |
ttl | string | "5m" | Cache time-to-live: "5m" or "1h" |
cut_after_message_index / cutAfterMessageIndex | integer | — | Zero-based index; cache all messages up to and including this index |
stickyProvider | boolean | false | When true, disable automatic failover to preserve explicit prompt-cache consistency. Returns 503 error instead of switching services. |
- Each
cache_controlmarker caches the full prefix up to that block. Place them on every static chunk (system messages, tool definitions, large contexts) you plan to reuse. - Explicit TTL controls are
5mand1hfor Claude caching flows. See Prompt Caching. anthropic-beta: prompt-caching-2024-07-31is supported for compatibility (and required for Anthropic-native Claude caching flows).- For implicit-caching providers, no explicit
cache_controlmarkers are required.
cut_after_message_index is zero-based. If omitted, NanoGPT will select a cache boundary automatically; set it explicitly if you need full control. Switch back to explicit cache_control blocks if you need multiple cache breakpoints or mixed TTLs in the same payload.
Explicit Prompt Cache Consistency
NanoGPT automatically fails over to backup services when the primary service is temporarily unavailable. While this ensures high availability, it can break your prompt cache because each backend service maintains its own separate cache. If cache consistency is more important than availability for your use case, you can enable thestickyProvider option:
stickyProvider: false(default) — If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you’ll pay full price for that request and need to rebuild the cache).stickyProvider: true— If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers.
stickyProvider: true:
- You have very large cached contexts where cache misses are expensive
- You prefer to retry failed requests yourself rather than pay for cache rebuilds
- Cost predictability is more important than request success rate
stickyProvider: false (default):
- You prefer requests to always succeed when possible
- Occasional cache misses are acceptable
- You’re using shorter contexts where cache rebuilds are inexpensive
Troubleshooting
- 400 unsupported image: ensure the image is a valid PNG/JPEG/WebP, not a tiny 1×1 pixel, and either HTTPS URL or a base64 data URL.
- 503 after fallbacks: try a different model, verify API key/session, and prefer base64 data URL for local or protected assets.
- Missing usage events: confirm
include_usageistruein the payload or that prompt caching is enabled.
Context Memory
Enable unlimited-length conversations with lossless, hierarchical memory.- Append
:memoryto any model name - Or send header
memory: true - Can be combined with web search:
:online:memory - Retention: default 30 days; configure via
:memory-<days>(1..365) or headermemory_expiration_days: <days>; header takes precedence
Custom Context Size Override
When Context Memory is enabled, you can override the model-derived context size used for the memory compression step withmodel_context_limit.
- Parameter:
model_context_limit(number or numeric string) - Default: Derived from the selected model’s context size
- Minimum: Values below 10,000 are clamped internally
- Scope: Only affects memory compression; does not change the target model’s own window
Reasoning Streams
The Chat Completions endpoint separates the model’s visible answer from its internal reasoning. By default, reasoning is included and delivered alongside normal content so that clients can decide whether to display it.:thinking is model-specific and only works when that exact ID (or a documented alias) exists.
-thinking is a legacy alias pattern for some model families only, not universal.
Do not assume -thinking works for arbitrary model IDs. Always check GET /api/v1/models for exact valid IDs.
See also: Extended Thinking (Reasoning).
Endpoint variants
Choose the base path that matches how your client consumes reasoning streams:https://nano-gpt.com/api/v1/chat/completions— default endpoint that streams internal thoughts throughchoices[0].delta.reasoning(and repeats them inmessage.reasoningon completion). Recommended for apps like SillyTavern that understand the modern response shape.https://nano-gpt.com/api/v1legacy/chat/completions— legacy contract that swaps the field name tochoices[0].delta.reasoning_content/message.reasoning_contentfor older OpenAI-compatible clients. Use this for LiteLLM’s OpenAI adapter to avoid downstream parsing errors.https://nano-gpt.com/api/v1thinking/chat/completions— reasoning-aware models write everything into the normalchoices[0].delta.contentstream so clients that ignore reasoning fields still see the full conversation transcript. This is the preferred base URL for JanitorAI.
Streaming payload format
Server-Sent Event (SSE) streams emit the answer inchoices[0].delta.content and the thought process in choices[0].delta.reasoning (plus optional delta.reasoning_details). Reasoning deltas are dispatched before or alongside regular content, letting you render both panes in real-time.
choices[0].message.content contains the assistant reply and choices[0].message.reasoning (plus reasoning_details when available) contains the full chain-of-thought. Non-streaming requests reuse the same formatter, so the reasoning block is present as a dedicated field.
Showing or hiding reasoning
Sendreasoning: { "exclude": true } to strip the reasoning payload from both streaming deltas and the final message. With this flag set, delta.reasoning and message.reasoning are omitted entirely.
Reasoning Effort
reasoning_effort (or reasoning.effort) controls reasoning depth and also acts as an explicit reasoning-mode signal.
Any value other than "none" is treated as a request to enable reasoning/thinking behavior.
Use "none" to explicitly disable reasoning behavior.
Parameter: reasoning_effort
| Value | Description |
|---|---|
none | Explicitly disables reasoning |
minimal | Lowest reasoning depth |
low | Low reasoning depth |
medium | Medium reasoning depth |
high | High reasoning depth |
xhigh | Maximum reasoning depth |
Usage
Thereasoning_effort parameter can be passed at the top level:
reasoning object:
reasoning_effort is authoritative for Chat Completions request shaping.
Combining effort with exclude
reasoning.exclude controls output visibility only. It hides reasoning fields/blocks, but does not inherently disable reasoning compute.
If an effort level is set to a non-none value, reasoning can still run while hidden.
Model suffix: :reasoning-exclude
You can toggle the filter without altering your JSON body by appending :reasoning-exclude to the model name.
- Equivalent to sending
{ "reasoning": { "exclude": true } } - Only the
:reasoning-excludesuffix is stripped before the request is routed; other suffixes remain active - Works for streaming and non-streaming responses on both Chat Completions and Text Completions
Combine with other suffixes
:reasoning-exclude composes safely with the other routing suffixes you already use:
:thinking(when that exact model ID exists).-thinkingvariants are legacy aliases for some families only.:onlineand:online/linkup-deep:memoryand:memory-<days>
anthropic/claude-sonnet-4.6:thinking:8192:reasoning-excludeopenai/gpt-5.2:online:reasoning-excludeanthropic/claude-opus-4.6:memory-30:online/linkup-deep:reasoning-excludezai-org/glm-5:fast:reasoning-excludezai-org/glm-5:cheap:reasoning-exclude
Legacy delta field compatibility
Older clients that expect the legacyreasoning_content field can opt in per request. Set reasoning.delta_field to "reasoning_content", or use the top-level shorthands reasoning_delta_field / reasoning_content_compat if updating nested objects is difficult. When the toggle is active, every streaming and non-streaming response exposes reasoning_content instead of reasoning, and the modern key is omitted. The compatibility pass is skipped if reasoning.exclude is true, because no reasoning payload is emitted. If you cannot change the request payload, target https://nano-gpt.com/api/v1legacy/chat/completions instead—the legacy endpoint keeps reasoning_content without extra flags. LiteLLM’s OpenAI adapter should point here to maintain compatibility. For clients that ignore reasoning-specific fields entirely, use https://nano-gpt.com/api/v1thinking/chat/completions so the full text appears in the standard content stream; this is the correct choice for JanitorAI.
Notes and limitations
- GPU-TEE models (
phala/*) require byte-for-byte SSE passthrough for signature verification. For those models, streaming cannot be filtered; the suffix has no effect on the streaming bytes. - When assistant content is an array (e.g., vision/text parts), only text parts are filtered; images and tool/metadata content are untouched.
Service tiers (flex and priority)
Setservice_tier to request a non-default capacity tier on providers that support service tiers:
autoor omitted: use NanoGPT’s normal routing and the provider default.default: request the provider’s standard tier where the provider accepts an explicit default value.flex: request lower-cost, variable-capacity processing where supported.priority: request higher-cost priority processing where supported.
- When
service_tieris"flex"or"priority", NanoGPT prefers routing to providers that support the requested tier. - Service tier availability is model- and provider-specific. Model pages show which tiers are supported.
- Not all providers support service tiers, so tiered requests may be routed differently than default requests.
- Header provider overrides (like
X-Provider) and explicit provider selection are honored for pricing and x402 estimates. - Provider-native web search can force routing; tier pricing follows that routing.
- If you explicitly force a provider that does not support service tiers, the requested tier may be ignored by the upstream provider, or routing and pricing may differ from the default route.
- Flex requests are billed at flex rates where applicable.
- Priority requests are billed at priority rates where applicable.
- High-context pricing may also apply for models and providers with separate high-context SKUs, such as
es2kpricing for GPT-5.5/GPT-5.4 where available.
- Responses now include a top-level
service_tierfield when it is provided on the request.
Example: flex tier
Example: priority tier
YouTube Transcripts
Automatically fetch and prepend YouTube video transcripts when the latest user message contains YouTube links.Defaults
- Parameter:
youtube_transcripts(boolean) - Default:
false(opt-in) - Opt-in: set
youtube_transcriptstotrue(string"true"is also accepted) to fetch transcripts - Limit: Up to 3 YouTube URLs processed per request
- Higher volume: Use the standalone
POST /api/youtube-transcribeendpoint for up to 10 URLs per request - Injection: Transcripts are added as a system message before your messages
- Billing: $0.01 per transcript fetched
Enable automatic transcripts
By default, YouTube links are ignored. Setyoutube_transcripts to true when you want the system to retrieve and bill for transcripts.
Notes
- Web scraping is separate. To scrape non‑YouTube URLs, set
scraping: true. YouTube transcripts do not requirescraping: true. - When not requested, YouTube links are ignored for transcript fetching and are not billed.
- If your balance is insufficient when enabled, the request may be blocked with a 402.
Performance Benchmarks
LinkUp achieves state-of-the-art performance on OpenAI’s SimpleQA benchmark:| Provider | Score |
|---|---|
| LinkUp Deep Search | 90.10% |
| Exa | 90.04% |
| Perplexity Sonar Pro | 86% |
| LinkUp Standard Search | 85% |
| Perplexity Sonar | 77% |
| Tavily | 73% |
Important Notes
- Web search increases input token count, which affects total cost
- Models gain access to real-time information published less than a minute ago
- Internet connectivity can provide up to 10x improvement in factuality
- All models support web search - append a suffix or send a
webSearchobject (linkupis supported as an alias)
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Headers
Optional explicit provider override for supported open-source models (case-insensitive). Explicit provider selection is billed pay-as-you-go at the selected provider's price, including provider-selection markup; for subscription users it bypasses subscription coverage for that request.
Optional billing override to force pay-as-you-go without an explicit provider, or to apply saved provider preferences to subscription-included traffic (e.g., paygo). Header name is case-insensitive.
Body
Parameters for chat completion
The model to use for completion. The model value may include supported model suffixes, including web search (':online', ':online/'), memory (':memory', ':memory-'), reasoning visibility (':reasoning-exclude'), thinking variants where listed by the model catalog (':thinking'), and provider routing preferences for eligible models (':fast', ':cheap', etc.).
"minimax/minimax-m2.7"
"zai-org/glm-5:fast"
"zai-org/glm-5:cheap"
"openai/gpt-5.2:online/exa-instant"
"openai/gpt-5.2:online/exa-deep-reasoning"
Array of message objects with role and content
Billing override to force pay-as-you-go without an explicit provider, or to apply saved provider preferences to subscription-included traffic. Accepted values (case-insensitive): paygo, pay-as-you-go, pay_as_you_go, paid, payg.
Alias for billing_mode.
Optional provider override or structured provider routing controls for provider-selection-capable models. A string explicitly selects one provider. An object can set soft order, hard pins, exclusions, sort preference, price caps, fallback behavior, and parameter-capability requirements.
When true, route to an available provider that is marked as prompt/input-caching capable for the requested provider-selection model. If no usable cache-capable provider exists, the request fails instead of falling back to a non-cache-capable provider. This is provider capability routing only; it does not add cache_control markers or configure cache TTLs.
Top-level routing control for caching: true. When caching is true, defaults to true and prefers the previously recorded provider for later matching requests from the same API key or session when still usable. Set false to require a cache-capable provider without sticky routing.
CamelCase alias for top-level stickyprovider. Distinct from prompt_caching.stickyProvider, which controls explicit prompt-cache failover behavior.
Whether to stream the response
Optional service tier: "auto", "default", "flex", or "priority". Use "flex" for lower-cost variable-capacity processing or "priority" for higher-cost priority processing where supported by the routed model/provider.
auto, default, flex, priority Classic randomness control. Accepts any decimal between 0-2. If omitted, NanoGPT does not force a value and the routed provider/model default applies
0 <= x <= 2Upper bound on generated tokens. If omitted, NanoGPT does not enforce an explicit default and the routed provider/model default applies
x >= 1Nucleus sampling. When set below 1.0, trims candidate tokens to the smallest set whose cumulative probability exceeds top_p. Works well as an alternative to tweaking temperature
0 <= x <= 1Penalizes tokens proportionally to how often they appeared previously. Negative values encourage repetition; positive values discourage it
-2 <= x <= 2Penalizes tokens based on whether they appeared at all. Good for keeping the model on topic without outright banning words
-2 <= x <= 2Provider-agnostic repetition modifier (distinct from OpenAI penalties). Values >1 discourage repetition
-2 <= x <= 2Caps sampling to the top-k highest probability tokens per step
Combines top-p and temperature behavior; leave unset unless a model description explicitly calls for it
Ensures each candidate token probability exceeds a floor (0-1). Helpful for stopping models from collapsing into low-entropy loops
0 <= x <= 1Tail free sampling. Values between 0-1 let you shave the long tail of the distribution; 1.0 disables the feature
0 <= x <= 1Cut probabilities as soon as they fall below the specified tail threshold
Cut probabilities as soon as they fall below the specified tail threshold
Typical sampling (aka entropy-based nucleus). Works like top_p but preserves tokens whose surprise matches the expected entropy
0 <= x <= 1Enables Mirostat sampling for models that support it. Set to 1 or 2 to activate
0, 1, 2 Mirostat target entropy parameter. Used when mirostat_mode is enabled
Mirostat learning rate parameter. Used when mirostat_mode is enabled
For providers that support it, enforces a minimum completion length before stop conditions fire
x >= 0Stop sequences. Accepts string or array of strings. Values are passed directly to upstream providers
Numeric array that lets callers stop generation on specific token IDs. Not supported by many providers
When true, keeps the stop sequence in the final text. Not supported by many providers
Allows completions to continue even if the model predicts EOS internally. Useful for long creative writing runs
Extension that forbids repeating n-grams of the given size. Not supported by many providers
x >= 0List of token IDs to fully block
Object mapping token IDs to additive logits. Works just like OpenAI's version
When true or a number, forwards the request to providers that support returning token-level log probabilities
Requests logprobs on the prompt itself when the upstream API allows it
Numeric seed. Wherever supported, passes the value to make completions repeatable
Helper to tag leading messages for explicit prompt-caching control (primarily Claude flows). Providers with implicit caching support (including OpenAI, Gemini, and many open-source provider routes) do not require this helper. NanoGPT injects cache_control blocks on each message up to the specified index before forwarding upstream. If cut_after_message_index is omitted, NanoGPT selects a cache boundary automatically.
Controls reasoning depth and acts as an explicit reasoning-mode signal. Any value other than "none" requests reasoning/thinking behavior. Use "none" to explicitly disable reasoning.
none, minimal, low, medium, high, xhigh Reasoning configuration. exclude controls output visibility (hides reasoning fields/blocks) and does not inherently disable reasoning compute. effort controls depth, and delta_field switches to legacy reasoning_content fields.
Shorthand for reasoning.delta_field
reasoning_content Shorthand to force legacy reasoning_content fields in the response
Response
Chat completion response