## Overview

Synthesize speech (TTS) or generate music with a single request. The OpenAI-compatible `POST /v1/audio/speech` endpoint returns audio bytes directly in the HTTP response.
Optional chunked streaming is available with `stream: true`. In streaming mode, audio bytes are delivered progressively as they are generated, which reduces time-to-first-byte (TTFB) for real-time playback.

Default behavior is unchanged: omit `stream` (or set it to `false`) to receive one buffered audio file after generation completes.

When you use a music model (for example `Minimax-Music-02`), the `input` field is treated as a music prompt (not text to speak) and `voice` is ignored. For a dedicated music guide and model list, see `api-reference/music-generation.mdx`.
## Endpoint

- Method/Path: `POST https://nano-gpt.com/api/v1/audio/speech`
- Auth: `Authorization: Bearer <API_KEY>`
- Required header: `Content-Type: application/json`
- You may see older examples using `POST https://nano-gpt.com/api/v1/speech`. Prefer `/api/v1/audio/speech` for OpenAI SDK compatibility.
## Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | None | Audio-speech model ID (TTS or music). |
| `input` | string | None | Text to speak (TTS) or a music prompt (music models). |
| `voice` | string | None | Voice preset for TTS. Ignored for music models. If your client requires this field, pass any string (for example `"alloy"`). |
| `response_format` | string | `mp3` | Output format when supported by the selected provider/model. |
| `speed` | number | `1` | Speaking-rate multiplier for TTS (for example 0.5-2.0). Ignored for music models. |
| `instructions` | string | None | Voice/style instructions (supported by some models/providers). |
| `stream` | boolean | `false` | When `true`, returns chunked audio bytes progressively. |
- Some provider-backed models may support additional fields; accepted parameters vary by model.
- For unsupported models, `stream: true` is ignored and the endpoint returns the normal buffered response.
## Streaming Support (TTS Models)

| Model | Provider | Streaming Support |
|---|---|---|
| `tts-1` | OpenAI | Yes |
| `tts-1-hd` | OpenAI | Yes |
| `gpt-4o-mini-tts` | OpenAI | Yes |
| `Elevenlabs-Turbo-V2.5` | ElevenLabs (via FAL) | Yes |
| `Elevenlabs-V3` | ElevenLabs (via FAL) | Yes |

All other models ignore `stream` and return buffered responses.
## Response Behavior
### Non-streaming (default)

- Status: `200 OK`
- Body: complete audio file returned once generation finishes
- Headers: typically include `Content-Length`
### Streaming (`stream: true`)

- Status: `200 OK`
- Body: audio bytes arrive progressively in chunks
- Headers: no `Content-Length`; uses `Transfer-Encoding: chunked`
- Client can start playback/processing as soon as the first chunk arrives
- `Content-Type` by provider:
  - OpenAI TTS models: matches the selected output format (for example `audio/mpeg`, `audio/wav`, `audio/opus`, `audio/flac`, `audio/aac`, `audio/pcm`)
  - ElevenLabs TTS models: always `audio/mpeg`
### Errors

- If an error happens before streaming starts, the API returns the standard OpenAI-style JSON error envelope.
- If streaming fails after bytes have started, the connection is terminated and clients may receive partial/corrupt audio.
- Common error codes: `invalid_model`, `invalid_voice`, `unsupported_format`, `input_too_long`, `rate_limit_exceeded`.
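For reference, a pre-stream failure returns an OpenAI-style envelope shaped roughly like the following (field values here are illustrative, not an exact API transcript):

```json
{
  "error": {
    "message": "Model 'tts-9' not found",
    "type": "invalid_request_error",
    "code": "invalid_model"
  }
}
```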
## Examples
### Non-streaming request (unchanged)
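A minimal buffered request can be sketched as follows; the model and voice values are illustrative, and `$NANOGPT_API_KEY` is a placeholder for your key:

```bash
curl https://nano-gpt.com/api/v1/audio/speech \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello from the speech endpoint.",
    "voice": "alloy",
    "response_format": "mp3"
  }' \
  --output speech.mp3
```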
### Streaming request (cURL)
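The same request with `stream: true` might look like this; `curl -N` disables output buffering so chunks are written as they arrive (values illustrative):

```bash
curl -N https://nano-gpt.com/api/v1/audio/speech \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Streaming lets playback start before generation finishes.",
    "voice": "alloy",
    "stream": true
  }' \
  --output streamed.mp3
```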
### OpenAI Node.js SDK (streaming)
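A sketch using the official `openai` Node package pointed at the compatible base URL. Note two assumptions: `stream` is not part of the SDK's typed speech parameters, so a cast is used to pass it through, and Node 18+ is assumed so the web `ReadableStream` body can be async-iterated:

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://nano-gpt.com/api/v1",
  apiKey: process.env.NANOGPT_API_KEY,
});

async function main(): Promise<void> {
  // `stream` is not in the SDK's typed params for speech; cast to forward it.
  const response = await client.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: "Streaming audio from Node.",
    stream: true,
  } as any);

  // Write audio chunks to disk as they arrive instead of buffering the file.
  const out = fs.createWriteStream("speech.mp3");
  for await (const chunk of response.body as any) {
    out.write(Buffer.from(chunk));
  }
  out.end();
}

main();
```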
### OpenAI Python SDK (streaming)
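A sketch using the official `openai` Python package. `with_streaming_response` reads the body incrementally, and `extra_body` forwards `stream` since it is not a typed parameter of `audio.speech.create` (model/voice values illustrative):

```python
from openai import OpenAI

# Point the official SDK at the OpenAI-compatible base URL.
client = OpenAI(
    base_url="https://nano-gpt.com/api/v1",
    api_key="YOUR_API_KEY",
)

# with_streaming_response avoids buffering the whole file in memory.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Streaming audio from Python.",
    extra_body={"stream": True},  # not a typed SDK parameter; pass via extra_body
) as response:
    with open("speech.mp3", "wb") as f:
        for chunk in response.iter_bytes():
            f.write(chunk)
```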
### Compare TTFB (streaming vs non-streaming)
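One way to observe the TTFB difference is curl's `%{time_starttransfer}` write-out variable, which approximates time to first byte. This sketch fires the same request with and without `stream` (values illustrative):

```bash
URL=https://nano-gpt.com/api/v1/audio/speech
BUFFERED='{"model":"tts-1","input":"Measure time to first byte.","voice":"alloy"}'
STREAMED='{"model":"tts-1","input":"Measure time to first byte.","voice":"alloy","stream":true}'

# time_starttransfer ~= TTFB; expect the streamed request to report a lower value.
for BODY in "$BUFFERED" "$STREAMED"; do
  curl -s -N -o /dev/null -w "TTFB: %{time_starttransfer}s\n" "$URL" \
    -H "Authorization: Bearer $NANOGPT_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$BODY"
done
```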
## Notes & Limits

- Max input length: depends on model; measured in characters or tokens. For short, interactive prompts, keep input under roughly 1-2k characters.
- Typical latency: scales with input length and output format; compressed formats like `mp3` are often faster than `wav`.
- Usage metering: billed by input characters for TTS models; output file size does not affect billing.
## Audio Format Support by Provider

| Provider | Supported `response_format` values |
|---|---|
| OpenAI (`tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`) | `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm` |
| ElevenLabs (`Elevenlabs-Turbo-V2.5`, `Elevenlabs-V3`) | Always returns `audio/mpeg` (MP3); `response_format` is ignored. |
## Voices

- Voice IDs vary by model/provider. See model-specific voices on Text-to-Speech: `api-reference/text-to-speech.mdx`.
- If a voices listing endpoint is available (for example `GET /v1/voices`), it returns available voice IDs and metadata (language coverage, gender/pitch, sample links).
## Errors & Troubleshooting

- `invalid_model`, `invalid_voice`, `unsupported_format`: verify `model`, `voice`, and `response_format`.
- `input_too_long`: reduce length; split long text into chunks and stitch audio client-side.
- `rate_limit_exceeded`: back off exponentially; retry after the window resets.
- Network/client tips: set `Accept` to your preferred audio type and write raw response bytes directly to a file/stream.
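The chunk-and-stitch workaround for `input_too_long` can be sketched with a hypothetical helper that splits text on sentence boundaries; the boundary rule here (splitting on `". "`) is deliberately simplified, and a single sentence longer than `max_chars` will still exceed the limit:

```python
def split_for_tts(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on '. ' boundaries.

    Hypothetical helper for working around input_too_long. Each chunk is sent
    as a separate request and the resulting audio is concatenated client-side.
    """
    sentences = text.replace("\n", " ").split(". ")
    chunks: list[str] = []
    current = ""
    for s in sentences:
        piece = s if s.endswith(".") else s + "."
        # Start a new chunk when appending would exceed the budget.
        if current and len(current) + 1 + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk maps to one request body's `input`; the audio files come back in order and can be concatenated (trivially for raw `pcm`, or with a tool like ffmpeg for compressed formats).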
## Security
- Do not expose API keys in browsers. Proxy via your server.
- Redact PII in logs; avoid logging raw text/audio in production.
- Rate-limit public routes.
## Pricing, Quotas, and Rate Limits
- Billing is based on input character count, not output audio size.
- For streaming requests, billing is recorded after the first audio chunk is confirmed. If the upstream provider fails before any audio is produced, no charge is applied.
- If the client disconnects mid-stream after audio starts, the charge still applies because generation already began upstream.
- Rate limits: per-minute/day caps; contact support to request increases. See `api-reference/miscellaneous/pricing.mdx` and `api-reference/miscellaneous/rate-limits.mdx`.
## Migration from Job-based TTS

Already using the async `POST /tts` + `GET /tts/status` flow?
- When to switch: choose `v1/audio/speech` for short prompts, low latency, and direct playback; keep job-based TTS for long/batch generation and webhook workflows.
- Parameter mapping: `text` -> `input`, `voice` stays `voice`, and output format can be requested with `response_format` when supported.
- Retries/timeouts: `v1/audio/speech` returns inline; implement client-side timeouts and simple retries on 5xx.
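The parameter mapping above can be sketched as a small converter. This is illustrative only: the `format` key on the old payload and the `tts-1`/`alloy` fallbacks are assumptions, not part of the documented job-based schema:

```python
def migrate_tts_payload(old: dict) -> dict:
    """Map a job-based /tts payload onto /v1/audio/speech (sketch).

    Applies the mapping text -> input; voice is carried over unchanged.
    The 'format' source key and the default model/voice are assumptions.
    """
    new = {
        "model": old.get("model", "tts-1"),   # assumed default
        "input": old["text"],                 # text -> input
        "voice": old.get("voice", "alloy"),   # assumed default
    }
    if "format" in old:
        new["response_format"] = old["format"]
    return new
```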
## See Also

- Async/job-based TTS: `api-reference/endpoint/tts.mdx`
- TTS status polling: `api-reference/endpoint/tts-status.mdx`