Overview
The Chat Completion endpoint provides OpenAI-compatible chat completions.

Web Search
All models can access real-time web information by appending special suffixes to the model name:

- `:online`: standard web search ($0.006 per request)
  - Returns 10 search results
  - Perfect for straightforward questions
- `:online/linkup-deep`: deep web search ($0.06 per request)
  - Iteratively searches for comprehensive information
  - Ideal when initial results aren't sufficient
Examples
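For instance, a minimal cURL sketch using the `:online` suffix (the model name, prompt, and `$NANOGPT_API_KEY` placeholder are illustrative):

```bash
# Standard web search: append :online to any model name.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest:online",
    "messages": [
      {"role": "user", "content": "What happened in AI news today?"}
    ]
  }'
```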
Image Input
Send images using the OpenAI‑compatible chat format. Provide image parts alongside text in the `messages` array.
Supported Forms
- Remote URL: `{"type":"image_url","image_url":{"url":"https://..."}}`
- Base64 data URL: `{"type":"image_url","image_url":{"url":"data:image/png;base64,...."}}`
- Prefer HTTPS URLs; some upstreams reject non‑HTTPS. If in doubt, use base64 data URLs.
- Accepted MIME types: `image/png`, `image/jpeg`, `image/jpg`, `image/webp`.
- Inline markdown images in plain text (e.g., `![alt](https://...)`) are auto‑normalized into structured parts server‑side.
Message Shape
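A sketch of the expected request body, mixing a text part with an image part (model name and URL are placeholders):

```json
{
  "model": "chatgpt-4o-latest",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}}
      ]
    }
  ]
}
```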
cURL — Image URL (non‑streaming)
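A sketch of such a request (values illustrative):

```bash
# Non-streaming request with a remote HTTPS image URL.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
      }
    ]
  }'
```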
cURL — Base64 Data URL (non‑streaming)
Embed your image as a data URL. Replace `...BASE64...` with your image bytes.
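A sketch under the same assumptions (PNG payload, placeholder token):

```bash
# Non-streaming request with an inline base64 data URL.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What does this diagram show?"},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,...BASE64..."}}
        ]
      }
    ]
  }'
```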
cURL — Streaming SSE
The response streams `data: { ... }` lines until a final terminator and may include a `usage` object at the end.
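A request sketch; setting `"stream": true` switches the response to SSE:

```bash
# -N disables curl buffering so SSE lines appear as they arrive.
curl -N https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a haiku about the sea."}
    ]
  }'
```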
Troubleshooting
- 400 unsupported image: ensure the image is a valid PNG/JPEG/WebP, not a tiny 1×1 pixel, and is provided as either an HTTPS URL or a base64 data URL.
- 503 after fallbacks: try a different model, verify your API key/session, and prefer a base64 data URL for local or protected assets.
Context Memory
Enable unlimited-length conversations with lossless, hierarchical memory (a request sketch follows this list):
- Append `:memory` to any model name
- Or send header `memory: true`
- Can be combined with web search: `:online:memory`
- Retention: default 30 days; configure via `:memory-<days>` (1..365) or header `memory_expiration_days: <days>`; the header takes precedence
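A sketch using the suffix form (the retention and model are illustrative; the header form would instead send `memory: true` and `memory_expiration_days: 60`):

```bash
# Context Memory via model suffix, with 60-day retention.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest:memory-60",
    "messages": [
      {"role": "user", "content": "Pick up where we left off yesterday."}
    ]
  }'
```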
Custom Context Size Override
When Context Memory is enabled, you can override the model-derived context size used for the memory compression step with `model_context_limit`; a request sketch follows the list below.
- Parameter: `model_context_limit` (number or numeric string)
- Default: derived from the selected model's context size
- Minimum: values below 10,000 are clamped internally
- Scope: only affects memory compression; does not change the target model's own window
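A sketch combining Context Memory with the override (the 32,000-token limit is an arbitrary illustrative value):

```bash
# Override the context size used for the memory compression step.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest:memory",
    "model_context_limit": 32000,
    "messages": [
      {"role": "user", "content": "Summarize everything we have discussed."}
    ]
  }'
```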
Reasoning Streams
The Chat Completions endpoint separates the model's visible answer from its internal reasoning. By default, reasoning is included and delivered alongside normal content so that clients can decide whether to display it. Requests that use the `thinking` model suffix (for example `:thinking` or `-thinking:8192`) are normalized before dispatch, but the response contract remains the same.
Endpoint variants
Choose the base path that matches how your client consumes reasoning streams:

- `https://nano-gpt.com/api/v1/chat/completions`: the default endpoint. Streams internal thoughts through `choices[0].delta.reasoning` (and repeats them in `message.reasoning` on completion). Recommended for apps like SillyTavern that understand the modern response shape.
- `https://nano-gpt.com/api/v1legacy/chat/completions`: legacy contract that swaps the field name to `choices[0].delta.reasoning_content` / `message.reasoning_content` for older OpenAI-compatible clients. Use this for LiteLLM's OpenAI adapter to avoid downstream parsing errors.
- `https://nano-gpt.com/api/v1thinking/chat/completions`: reasoning-aware models write everything into the normal `choices[0].delta.content` stream, so clients that ignore reasoning fields still see the full conversation transcript. This is the preferred base URL for JanitorAI.
Streaming payload format
Server-Sent Event (SSE) streams emit the answer in `choices[0].delta.content` and the thought process in `choices[0].delta.reasoning` (plus optional `delta.reasoning_details`). Reasoning deltas are dispatched before or alongside regular content, letting you render both panes in real time.
In the final response, `choices[0].message.content` contains the assistant reply and `choices[0].message.reasoning` (plus `reasoning_details` when available) contains the full chain-of-thought. Non-streaming requests reuse the same formatter, so the reasoning block is present as a dedicated field.
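Illustrative, abridged payloads showing where the reasoning fields sit (the values are invented for clarity):

Streaming delta:
```json
{"choices": [{"delta": {"reasoning": "First, recall that...", "content": "The answer is 42."}}]}
```

Non-streaming message:
```json
{"choices": [{"message": {"role": "assistant", "content": "The answer is 42.", "reasoning": "First, recall that..."}}]}
```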
Showing or hiding reasoning
Send `reasoning: { "exclude": true }` to strip the reasoning payload from both streaming deltas and the final message. With this flag set, `delta.reasoning` and `message.reasoning` are omitted entirely.
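A request sketch with the flag set (model and prompt are placeholders):

```bash
# Omit reasoning from both streaming deltas and the final message.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest",
    "reasoning": {"exclude": true},
    "messages": [
      {"role": "user", "content": "Explain quantum tunneling briefly."}
    ]
  }'
```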
Model suffix: :reasoning-exclude
You can toggle the filter without altering your JSON body by appending :reasoning-exclude to the model name.
- Equivalent to sending `{ "reasoning": { "exclude": true } }`
- Only the `:reasoning-exclude` suffix is stripped before the request is routed; other suffixes remain active
- Works for streaming and non-streaming responses on both Chat Completions and Text Completions
Combine with other suffixes
:reasoning-exclude composes safely with the other routing suffixes you already use:
- `:thinking` (and variants like `…-thinking:8192`)
- `:online` and `:online/linkup-deep`
- `:memory` and `:memory-<days>`

Examples:
- `claude-3-7-sonnet-thinking:8192:reasoning-exclude`
- `gpt-4o:online:reasoning-exclude`
- `claude-3-5-sonnet-20241022:memory-30:online/linkup-deep:reasoning-exclude`
Legacy delta field compatibility
Older clients that expect the legacy `reasoning_content` field can opt in per request. Set `reasoning.delta_field` to `"reasoning_content"`, or use the top-level shorthands `reasoning_delta_field` / `reasoning_content_compat` if updating nested objects is difficult. When the toggle is active, every streaming and non-streaming response exposes `reasoning_content` instead of `reasoning`, and the modern key is omitted. The compatibility pass is skipped if `reasoning.exclude` is true, because no reasoning payload is emitted.

If you cannot change the request payload, target `https://nano-gpt.com/api/v1legacy/chat/completions` instead; the legacy endpoint keeps `reasoning_content` without extra flags. LiteLLM's OpenAI adapter should point here to maintain compatibility. For clients that ignore reasoning-specific fields entirely, use `https://nano-gpt.com/api/v1thinking/chat/completions` so the full text appears in the standard `content` stream; this is the correct choice for JanitorAI.
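A per-request sketch of the nested toggle (model and prompt are placeholders):

```bash
# Expose reasoning as the legacy reasoning_content field.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest",
    "stream": true,
    "reasoning": {"delta_field": "reasoning_content"},
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }'
```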
Notes and limitations
- GPU-TEE models (`phala/*`) require byte-for-byte SSE passthrough for signature verification. For those models, streaming cannot be filtered; the suffix has no effect on the streaming bytes.
- When assistant content is an array (e.g., vision/text parts), only text parts are filtered; images and tool/metadata content are untouched.
YouTube Transcripts
Automatically fetch and prepend YouTube video transcripts when the latest user message contains YouTube links.

Defaults
- Parameter: `youtube_transcripts` (boolean)
- Default: `true` (backwards compatible)
- Limit: up to 3 YouTube URLs processed per request
- Injection: transcripts are added as a system message before your messages
- Billing: $0.01 per transcript fetched
Disable automatic transcripts
Set `youtube_transcripts` to `false` to skip detection and fetching (no transcript cost applies).
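For example (placeholder video URL):

```bash
# Skip transcript detection and fetching for this request.
curl https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chatgpt-4o-latest",
    "youtube_transcripts": false,
    "messages": [
      {"role": "user", "content": "Summarize https://www.youtube.com/watch?v=VIDEO_ID"}
    ]
  }'
```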
Notes
- Web scraping is separate. To scrape non‑YouTube URLs, set `scraping: true`. YouTube transcripts do not require `scraping: true`.
- When disabled, YouTube links are ignored for transcript fetching and are not billed.
- If your balance is insufficient when enabled, the request may be blocked with a 402.
Performance Benchmarks
LinkUp achieves state-of-the-art performance on OpenAI's SimpleQA benchmark:

| Provider | Score |
|---|---|
| LinkUp Deep Search | 90.10% |
| Exa | 90.04% |
| Perplexity Sonar Pro | 86% |
| LinkUp Standard Search | 85% |
| Perplexity Sonar | 77% |
| Tavily | 73% |
Important Notes
- Web search increases input token count, which affects total cost
- Models gain access to real-time information published less than a minute ago
- Internet connectivity can provide up to 10x improvement in factuality
- All models support web search: simply append the suffix to any model name
Authorizations
Bearer authentication header of the form `Bearer <token>`, where `<token>` is your auth token.
Body
application/json
Parameters for chat completion
- `model` (string): The model to use for completion. Append `:online` for web search ($0.005/request) or `:online/linkup-deep` for deep web search ($0.05/request)
  Examples:
  - `"chatgpt-4o-latest"`
  - `"chatgpt-4o-latest:online"`
  - `"chatgpt-4o-latest:online/linkup-deep"`
  - `"claude-3-5-sonnet-20241022:online"`
- `messages` (array): Array of message objects with role and content
- `stream` (boolean): Whether to stream the response
- `temperature` (number): Controls randomness (0-2)
- `max_tokens` (integer): Maximum number of tokens to generate
- `top_p` (number): Nucleus sampling parameter (0-1)
- `frequency_penalty` (number): Penalty for frequency of tokens (-2 to 2)
- `presence_penalty` (number): Penalty for presence of tokens (-2 to 2)
- `cache_control` (object): Cache control settings for Claude models only