Speech-to-Text Transcription
Transcribe audio (and supported video formats) into text using speech recognition models. Supports multiple languages, diarization (model-dependent), and various formats. Most models return synchronous results; some models (for example Elevenlabs-STT and voice cloning workflows) return asynchronous job IDs.
Overview
The Speech-to-Text transcription endpoint converts audio files into text using speech recognition models. It supports multiple languages, speaker diarization (model-dependent), and various audio formats.
Looking for a drop-in OpenAI-compatible STT endpoint? Use POST /api/v1/audio/transcriptions. See api-reference/endpoint/audio-transcriptions.mdx.
Supported Models
- Whisper-Large-V3: High-accuracy transcription (~$0.0005/min) - Synchronous
- Wizper: Fast and efficient transcription ($0.01/min) - Synchronous
- Elevenlabs-STT: Premium transcription with diarization ($0.03/min) - Asynchronous
- gpt-4o-mini-transcribe: Efficient OpenAI transcription ($0.003/min) - Synchronous
- gpt-4o-mini-transcribe-2025-03-20: Snapshot ($0.003/min) - Synchronous
- gpt-4o-mini-transcribe-2025-12-15: Snapshot ($0.003/min) - Synchronous
- gpt-4o-mini-transcribe-latest: Alias ($0.003/min) - Synchronous
- openai-whisper-with-video: Video-to-text transcription ($0.06/min) - Synchronous
- qwen-voice-clone: Voice cloning ($0.25/run) - Asynchronous
- minimax-voice-clone: Voice cloning ($1.00/run) - Asynchronous
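As a rough aid for budgeting, the list above can be written as a lookup table. This is a minimal sketch, not part of the API: the names `MODELS`, `estimate_cost`, and `is_async` are hypothetical, the prices are copied from the list (the Whisper-Large-V3 figure is approximate there), and actual billing may differ.

```python
# Price, billing unit, and response mode per model, copied from the list above.
# "min" = billed per audio minute, "run" = flat price per request.
MODELS = {
    "Whisper-Large-V3": (0.0005, "min", "sync"),  # listed as approximate (~)
    "Wizper": (0.01, "min", "sync"),
    "Elevenlabs-STT": (0.03, "min", "async"),
    "gpt-4o-mini-transcribe": (0.003, "min", "sync"),
    "gpt-4o-mini-transcribe-2025-03-20": (0.003, "min", "sync"),
    "gpt-4o-mini-transcribe-2025-12-15": (0.003, "min", "sync"),
    "gpt-4o-mini-transcribe-latest": (0.003, "min", "sync"),
    "openai-whisper-with-video": (0.06, "min", "sync"),
    "qwen-voice-clone": (0.25, "run", "async"),
    "minimax-voice-clone": (1.00, "run", "async"),
}


def estimate_cost(model: str, minutes: float = 0.0) -> float:
    """Return an estimated cost in USD for one request."""
    price, unit, _mode = MODELS[model]
    # Per-run models ignore the duration; per-minute models scale with it.
    return price if unit == "run" else price * minutes


def is_async(model: str) -> bool:
    """True if the model is listed as returning an asynchronous job."""
    return MODELS[model][2] == "async"
```

For instance, `estimate_cost("Elevenlabs-STT", 10)` multiplies the listed $0.03/min rate by ten minutes, while `estimate_cost("qwen-voice-clone")` returns the flat per-run price regardless of duration.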
Upload Methods
Direct File Upload (≤3MB)
URL Upload (≤500MB)
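The two size limits above suggest a simple client-side check before sending a request. The sketch below is illustrative only: `choose_upload_method` is a hypothetical helper, and it assumes 1 MB means 1024 × 1024 bytes, which the documentation does not specify.

```python
# Size limits from the two upload methods above.
# Assumption: 1 MB = 1024 * 1024 bytes (the docs do not say which definition applies).
DIRECT_LIMIT_BYTES = 3 * 1024 * 1024
URL_LIMIT_BYTES = 500 * 1024 * 1024


def choose_upload_method(size_bytes: int) -> str:
    """Return "direct" or "url" depending on the file size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= DIRECT_LIMIT_BYTES:
        return "direct"  # small enough for a multipart file upload
    if size_bytes <= URL_LIMIT_BYTES:
        return "url"     # host the file and pass its URL instead
    raise ValueError("file exceeds the 500MB URL upload limit")
```

In practice the size could come from `os.path.getsize(path)`; files above the direct limit need to be hosted somewhere reachable before their URL is submitted.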
Advanced Features - Speaker Diarization
Use Elevenlabs-STT for speaker identification (asynchronous processing).
Language Support
Supports 97+ languages with auto-detection.
Response Examples
Synchronous Response (most models)
Asynchronous Response (Elevenlabs-STT and voice cloning)
Initial response (202)
Authorizations
Body
Audio transcription parameters. Use multipart/form-data for file uploads or application/json for URL-based requests.
Direct file upload (max 3MB). Supported formats: MP3, WAV, M4A, OGG, AAC
URL to audio file (alternative to direct upload, supports up to 500MB)
The STT model to use for transcription
Whisper-Large-V3, Wizper, Elevenlabs-STT, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-03-20, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-mini-transcribe-latest, openai-whisper-with-video, qwen-voice-clone, minimax-voice-clone
Language code for transcription (ISO 639-1 or ISO 639-3). Use 'auto' for auto-detection
"en"
Actual audio duration in minutes for accurate billing
Enable speaker diarization (Elevenlabs-STT only)
true, false
Tag non-speech audio events like [laughter], [applause] (Elevenlabs-STT only)
true, false
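The constraints in the parameter descriptions above (JSON for URL-based requests, diarization and audio-event tagging limited to Elevenlabs-STT) can be checked before a request is sent. This is a sketch under stated assumptions: the field names `url`, `model`, `language`, `diarize`, and `tag_audio_events` are hypothetical placeholders, since the actual parameter names are not shown in this section, and `build_json_body` is not part of any official client.

```python
def build_json_body(model, audio_url, language="auto",
                    diarize=False, tag_audio_events=False):
    """Assemble an application/json body for a URL-based request.

    All field names below are hypothetical; check the real parameter
    names in the API reference before using this.
    """
    # Diarization and audio-event tagging are documented as Elevenlabs-STT only.
    if (diarize or tag_audio_events) and model != "Elevenlabs-STT":
        raise ValueError(
            "diarization and audio-event tagging require Elevenlabs-STT"
        )
    body = {"model": model, "url": audio_url, "language": language}
    # Only include the optional flags when they are switched on.
    if diarize:
        body["diarize"] = True
    if tag_audio_events:
        body["tag_audio_events"] = True
    return body
```

Failing early on the client avoids submitting an asynchronous job that the server would reject or run without the requested feature.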