Transcribe audio (and supported video formats) into text using speech recognition models. Supports multiple languages, speaker diarization (model-dependent), and a range of file formats. Most models return results synchronously; some (for example, Elevenlabs-STT and the voice-cloning workflows) return an asynchronous job ID instead.
POST /api/v1/audio/transcriptions. See api-reference/endpoint/audio-transcriptions.mdx.
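A URL-based request against this endpoint can be sketched as below. This is a minimal illustration, not the canonical client: the host `api.example.com` is a placeholder, and the JSON field names (`url`, `model`, `language`) are assumptions — confirm them against the endpoint reference.

```python
# Sketch of a URL-based transcription request. Field names ("url",
# "model", "language") and the API host are illustrative assumptions.
import json
import urllib.request

API_BASE = "https://api.example.com"  # placeholder; substitute your deployment


def build_transcription_request(audio_url: str, model: str, language: str = "auto"):
    """Build an application/json POST for a URL-based transcription."""
    payload = {
        "url": audio_url,      # URL-based input supports files up to 500 MB
        "model": model,        # e.g. "Whisper-Large-V3"
        "language": language,  # ISO 639-1/639-3 code, or "auto" to auto-detect
    }
    return urllib.request.Request(
        f"{API_BASE}/api/v1/audio/transcriptions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_transcription_request("https://example.com/talk.mp3", "Whisper-Large-V3")
# urllib.request.urlopen(req) would send it; omitted here to keep the sketch offline.
```

Direct file uploads follow the same shape but use `multipart/form-data` instead of a JSON body.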
Audio transcription parameters. Use `multipart/form-data` for file uploads or `application/json` for URL-based requests.
- **File** — direct file upload (max 3 MB). Supported formats: MP3, WAV, M4A, OGG, AAC.
- **URL** — link to an audio file (alternative to direct upload; supports files up to 500 MB).
- **Model** — the STT model to use for transcription. Options: `Whisper-Large-V3`, `Wizper`, `Elevenlabs-STT`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-03-20`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-mini-transcribe-latest`, `openai-whisper-with-video`, `qwen-voice-clone`, `minimax-voice-clone`.
- **Language** — language code for transcription (ISO 639-1 or ISO 639-3), e.g. `"en"`. Use `auto` for auto-detection.
- **Duration** — actual audio duration in minutes, used for accurate billing.
- **Diarization** — enable speaker diarization (Elevenlabs-STT only). Options: `true`, `false`.
- **Audio event tagging** — tag non-speech audio events like `[laughter]`, `[applause]` (Elevenlabs-STT only). Options: `true`, `false`.
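Because some models answer synchronously with a transcript while others return a job ID to poll, client code needs to handle both shapes. A minimal sketch, assuming illustrative field names (`text`, `job_id`) that should be checked against the actual response schema:

```python
# Minimal sketch of handling both response shapes. The field names
# "text" and "job_id" are illustrative assumptions, not a confirmed schema.

def handle_transcription_response(payload: dict) -> str:
    """Return the transcript for a synchronous result, or a poll marker for a job."""
    if "text" in payload:
        # Synchronous models (e.g. Whisper-Large-V3) return the transcript directly.
        return payload["text"]
    if "job_id" in payload:
        # Asynchronous models (e.g. Elevenlabs-STT, voice-cloning workflows)
        # return a job ID; the caller should poll for completion.
        return f"pending:{payload['job_id']}"
    raise ValueError("Unrecognized transcription response shape")


print(handle_transcription_response({"text": "hello world"}))  # prints "hello world"
```

The same dispatch works regardless of which model was requested, so callers do not need to hard-code which models are asynchronous.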