POST /api/tts

cURL
curl --request POST \
  --url https://nano-gpt.com/api/tts \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "text": "Hello! This is a test of the text-to-speech API.",
  "model": "Kokoro-82m",
  "voice": "af_bella",
  "speed": 1,
  "response_format": "mp3",
  "instructions": "speak with enthusiasm",
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0
}
'

Example Response

{
  "audioUrl": "https://storage.url/audio-file.wav",
  "contentType": "audio/wav",
  "model": "<string>",
  "text": "<string>",
  "voice": "<string>",
  "speed": 123,
  "duration": 123,
  "cost": 123,
  "currency": "<string>"
}

Overview

Convert text into natural-sounding speech using various TTS models. Supports multiple languages, voices, and customization options including speed control and voice instructions. Looking for synchronous, low‑latency TTS that returns audio bytes directly? See api-reference/endpoint/speech.mdx (POST /v1/speech).

Supported Models

  • Kokoro-82m: 44 multilingual voices ($0.001/1k chars)
  • Elevenlabs-Turbo-V2.5: Premium quality with style controls ($0.06/1k chars)
  • tts-1: OpenAI standard quality ($0.015/1k chars)
  • tts-1-hd: OpenAI high definition ($0.030/1k chars)
  • gpt-4o-mini-tts: Ultra-low cost ($0.0006/1k chars)
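Since billing is per 1,000 characters, you can estimate a request's cost up front from the prices listed above. The helper below is an illustrative sketch (the function name and price table are assumptions drawn from this page; actual prices may change):

```python
# Per-1k-character prices from the model list above (illustrative; verify
# against current pricing before relying on these numbers).
PRICE_PER_1K_CHARS = {
    "Kokoro-82m": 0.001,
    "Elevenlabs-Turbo-V2.5": 0.06,
    "tts-1": 0.015,
    "tts-1-hd": 0.030,
    "gpt-4o-mini-tts": 0.0006,
}

def estimate_cost(text: str, model: str) -> float:
    """Return the estimated USD cost of synthesizing `text` with `model`."""
    return len(text) / 1000 * PRICE_PER_1K_CHARS[model]
```

For example, a 2,000-character script on tts-1 costs roughly $0.03, while the same text on gpt-4o-mini-tts costs about $0.0012.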

Basic Usage

import requests

def text_to_speech(text, model="Kokoro-82m", voice=None, **kwargs):
    headers = {
        "x-api-key": "YOUR_API_KEY",
        "Content-Type": "application/json"
    }
    
    payload = {
        "text": text,
        "model": model
    }
    
    if voice:
        payload["voice"] = voice
    
    payload.update(kwargs)
    
    response = requests.post(
        "https://nano-gpt.com/api/tts",
        headers=headers,
        json=payload
    )
    
    if response.status_code == 200:
        content_type = response.headers.get('content-type', '')
        
        if 'application/json' in content_type:
            # JSON response with audio URL
            data = response.json()
            audio_response = requests.get(data['audioUrl'])
            with open('output.wav', 'wb') as f:
                f.write(audio_response.content)
        else:
            # Binary audio data (OpenAI models)
            with open('output.mp3', 'wb') as f:
                f.write(response.content)
        
        return response
    else:
        # Include the response body to make debugging failures easier
        raise Exception(f"Error: {response.status_code} - {response.text}")

# Basic usage
text_to_speech(
    "Hello! Welcome to our service.",
    model="Kokoro-82m",
    voice="af_bella"
)

Async Status and Result Retrieval

Some TTS models run asynchronously. When queued, the API returns HTTP 202 with a ticket containing a runId and model. Use the TTS Status endpoint to poll until the job is complete. Synchronous models return audio immediately and do not require status polling.

Endpoints

  • Submit TTS: POST /api/tts
  • Check TTS Status (async only): GET /api/tts/status?runId=...&model=...

When you see status: “pending”

If your initial POST /api/tts returns HTTP 202 with a body like:
{
  "status": "pending",
  "runId": "98b0d593-fe8d-49b8-89c9-233022232297",
  "model": "Elevenlabs-Turbo-V2.5",
  "charged": true,
  "cost": 0.0050388,
  "paymentSource": "USD",
  "isApiRequest": true
}
…the request is queued. Poll the Status endpoint using the runId and model. If present, include cost, paymentSource, and isApiRequest from the ticket when polling to help with automatic refunds if the upstream provider later rejects content.

cURL — Submit, then Poll

# 1) Submit TTS
curl -X POST https://nano-gpt.com/api/tts \
  -H 'x-api-key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Hello there!",
    "model": "Elevenlabs-Turbo-V2.5",
    "voice": "Rachel",
    "speed": 1.0
  }'

# 2) If response is 202/pending, poll using returned values
curl "https://nano-gpt.com/api/tts/status?runId=98b0d593-fe8d-49b8-89c9-233022232297&model=Elevenlabs-Turbo-V2.5&cost=0.0050388&paymentSource=USD&isApiRequest=true" \
  -H 'x-api-key: YOUR_API_KEY'

# 3) On completion, you'll receive an audioUrl
# {
#   "status": "completed",
#   "audioUrl": "https://.../file.mp3",
#   "contentType": "audio/mpeg",
#   "model": "Elevenlabs-Turbo-V2.5"
# }

Synchronous vs. Asynchronous Models

  • Synchronous models (examples: tts-1, tts-1-hd, gpt-4o-mini-tts, Kokoro-82m) return immediately from POST /api/tts with either binary audio or JSON containing { audioUrl, contentType } depending on the provider.
  • Asynchronous models (examples: Elevenlabs-Turbo-V2.5, Elevenlabs-V3, Elevenlabs-Music-V1) return HTTP 202 with a polling ticket. Use GET /api/tts/status until completed.

Best Practices

  • Poll every 2–3 seconds; stop after 2–3 minutes and show a timeout error.
  • Always include runId and model. If available, include cost, paymentSource, and isApiRequest from the ticket for better error handling and refund automation.
  • On completed, prefer using the audioUrl directly (streaming or download). Cache URLs client‑side if you plan to replay.
  • If you receive CONTENT_POLICY_VIOLATION, do not retry the same content; surface a clear message to the user.
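The practices above can be sketched as a minimal polling loop. This is an illustration, not an official client: the helper name and the failure branch for non-pending statuses are assumptions, while the query parameters follow the status URL shown in the cURL example:

```python
import time
import requests

API_BASE = "https://nano-gpt.com/api"
API_KEY = "YOUR_API_KEY"  # placeholder

def wait_for_tts(ticket, poll_interval=2.5, timeout=180):
    """Poll GET /api/tts/status until a queued TTS job completes.

    `ticket` is the JSON body of a 202 response from POST /api/tts.
    Returns the audioUrl on completion; raises on failure or timeout.
    """
    params = {"runId": ticket["runId"], "model": ticket["model"]}
    # Forward billing fields when present to aid automatic refunds.
    for key in ("cost", "paymentSource", "isApiRequest"):
        if key in ticket:
            params[key] = ticket[key]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{API_BASE}/tts/status",
            headers={"x-api-key": API_KEY},
            params=params,
        )
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "completed":
            return data["audioUrl"]
        if status != "pending":
            # Any other status is assumed terminal (e.g. a content rejection).
            raise RuntimeError(f"TTS job failed: {data}")
        time.sleep(poll_interval)
    raise TimeoutError("TTS job did not complete within the timeout")
```

Pass the entire 202 body as `ticket` so the optional billing fields are forwarded automatically.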

FAQ

  • Why did I get 202/pending? The selected model runs asynchronously; your request was queued and billed after a successful queue submission.
  • Can I cancel a pending TTS? Not currently. Let it complete or time out client‑side.
  • Do all TTS models require polling? No. Only async models. Synchronous models return immediately.

Model-Specific Examples

Kokoro-82m - Multilingual Voices

44 voices across 13 language groups:
# Popular voice examples by category
voices = {
    "american_female": ["af_bella", "af_nova", "af_aoede"],
    "american_male": ["am_adam", "am_onyx", "am_eric"],
    "british_female": ["bf_alice", "bf_emma"],
    "british_male": ["bm_daniel", "bm_george"],
    "japanese_female": ["jf_alpha", "jf_gongitsune"],
    "chinese_female": ["zf_xiaoxiao", "zf_xiaoyi"],
    "french_female": ["ff_siwis"],
    "italian_male": ["im_nicola"]
}

# Generate multilingual samples
samples = [
    {"text": "Hello, welcome!", "voice": "af_bella", "lang": "English"},
    {"text": "Bonjour et bienvenue!", "voice": "ff_siwis", "lang": "French"},
    {"text": "こんにちは!", "voice": "jf_alpha", "lang": "Japanese"},
    {"text": "你好,欢迎!", "voice": "zf_xiaoxiao", "lang": "Chinese"}
]

for sample in samples:
    text_to_speech(
        text=sample["text"],
        model="Kokoro-82m",
        voice=sample["voice"]
    )

Elevenlabs-Turbo-V2.5 - Advanced Voice Controls

Premium quality with style adjustments:
# Stable, consistent voice
text_to_speech(
    text="This is a professional announcement.",
    model="Elevenlabs-Turbo-V2.5",
    voice="Rachel",
    stability=0.9,
    similarity_boost=0.8,
    style=0
)

# Expressive, dynamic voice  
text_to_speech(
    text="This is so exciting!",
    model="Elevenlabs-Turbo-V2.5",
    voice="Rachel",
    stability=0.3,
    similarity_boost=0.7,
    style=0.8,
    speed=1.2
)

# Available voices: Rachel, Adam, Bella, Brian, etc.

OpenAI Models - Multiple Formats & Instructions

# High-definition with voice instructions
text_to_speech(
    text="Welcome to customer service.",
    model="tts-1-hd",
    voice="nova",
    instructions="Speak warmly and professionally like a customer service representative",
    response_format="flac"
)

# Ultra-low cost option
text_to_speech(
    text="This is a cost-effective option.",
    model="gpt-4o-mini-tts",
    voice="alloy",
    instructions="Speak clearly and cheerfully",
    response_format="mp3"
)

# Different format examples
formats = ["mp3", "wav", "opus", "flac", "aac"]
for fmt in formats:
    text_to_speech(
        text=f"This is {fmt.upper()} format.",
        model="tts-1",
        voice="echo",
        response_format=fmt
    )

Response Examples

JSON Response (Most Models)

{
  "audioUrl": "https://storage.url/audio-file.wav",
  "contentType": "audio/wav",
  "model": "Kokoro-82m",
  "text": "Hello world",
  "voice": "af_bella",
  "speed": 1,
  "duration": 2.3,
  "cost": 0.001,
  "currency": "USD"
}

Binary Response (OpenAI Models)

OpenAI models return audio data directly as binary with appropriate headers:
Content-Type: audio/mp3
Content-Length: 123456
[Binary audio data]

Voice Options

Kokoro-82m Voices

  • American Female: af_bella, af_nova, af_aoede, af_jessica, af_sarah
  • American Male: am_adam, am_onyx, am_eric, am_liam
  • British: bf_alice, bf_emma, bm_daniel, bm_george
  • Asian Languages: jf_alpha (Japanese), zf_xiaoxiao (Chinese)
  • European: ff_siwis (French), im_nicola (Italian)

Elevenlabs-Turbo-V2.5 Voices

Rachel, Adam, Bella, Brian, Sarah, Michael, Emily, James, Nicole, and 37 more

OpenAI Voices

alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse

Error Handling

try:
    result = text_to_speech("Hello world!", model="Kokoro-82m")
    print("Success!")
except Exception as e:
    if "400" in str(e):
        print("Bad request - check parameters")
    elif "401" in str(e):
        print("Unauthorized - check API key")
    elif "413" in str(e):
        print("Text too long for model")
    else:
        print(f"Error: {e}")

Common errors:
  • 400: Invalid parameters or missing text
  • 401: Invalid or missing API key
  • 413: Text exceeds model character limit
  • 429: Rate limit exceeded
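A 429 is usually transient, so a retry with exponential backoff is often enough. The wrapper below is a sketch under that assumption (the function name and retry policy are illustrative, not part of the API):

```python
import time
import requests

def post_with_retry(url, headers, payload, max_retries=3):
    """POST, retrying with exponential backoff on 429 responses (sketch)."""
    for attempt in range(max_retries + 1):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Back off 1s, 2s, 4s, ... before retrying.
        time.sleep(2 ** attempt)
```

Do not apply this pattern to 400 or 413 errors; those indicate a problem with the request itself and will fail identically on retry.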

Authorizations

x-api-key
string
header
required

Body

application/json

Text-to-speech generation parameters

text
string
required

The text to convert to speech

Example:

"Hello! This is a test of the text-to-speech API."

model
enum<string>
default:Kokoro-82m

The TTS model to use for generation

Available options:
Kokoro-82m,
Elevenlabs-Turbo-V2.5,
tts-1,
tts-1-hd,
gpt-4o-mini-tts
voice
string

The voice to use for synthesis (available voices depend on selected model)

Example:

"af_bella"

speed
number
default:1

Speech speed multiplier (0.1-5, not supported for gpt-4o-mini-tts)

Required range: 0.1 <= x <= 5
response_format
enum<string>
default:mp3

Audio output format (OpenAI models only)

Available options:
mp3,
opus,
aac,
flac,
wav,
pcm
instructions
string

Voice instructions for fine-tuning (gpt-4o-mini-tts and tts-1-hd only)

Example:

"speak with enthusiasm"

stability
number
default:0.5

Voice stability (Elevenlabs-Turbo-V2.5 only, 0-1)

Required range: 0 <= x <= 1
similarity_boost
number
default:0.75

Voice similarity boost (Elevenlabs-Turbo-V2.5 only, 0-1)

Required range: 0 <= x <= 1
style
number
default:0

Style exaggeration (Elevenlabs-Turbo-V2.5 only, 0-1)

Required range: 0 <= x <= 1

Response

Text-to-speech response. Returns either JSON with audio URL or binary audio data depending on the model.

audioUrl
string<uri>

URL to the generated audio file

Example:

"https://storage.url/audio-file.wav"

contentType
string

MIME type of the audio file

Example:

"audio/wav"

model
string

Model used for generation

text
string

The input text that was synthesized

voice
string

Voice used for synthesis

speed
number

Speed multiplier used

duration
number

Duration of the generated audio in seconds

cost
number

Cost of the generation

currency
string

Currency of the cost