Transcribe audio files into text using state-of-the-art speech recognition models. Supports multiple languages, speaker diarization, and various audio formats. Returns synchronous results for Whisper/Wizper models and asynchronous job IDs for Elevenlabs-STT.
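The synchronous/asynchronous split described above could be handled on the client by branching on the model name. This is a minimal sketch: the helper name `expects_job_id` is hypothetical, and the response shapes themselves are not specified here.

```python
# Models named in this reference. Elevenlabs-STT is the only one
# documented as returning an asynchronous job ID.
SYNC_MODELS = {"Whisper-Large-V3", "Wizper"}
ASYNC_MODELS = {"Elevenlabs-STT"}


def expects_job_id(model: str) -> bool:
    """Return True if the model is documented as asynchronous.

    Hypothetical helper for illustration; not part of the API.
    """
    if model in ASYNC_MODELS:
        return True
    if model in SYNC_MODELS:
        return False
    raise ValueError(f"Unknown STT model: {model!r}")
```

A caller might use this to decide whether to read the transcript straight from the response or to keep the returned job ID for later retrieval.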
Audio transcription parameters. Use multipart/form-data for file uploads or application/json for URL-based requests.
Direct file upload (max 3MB). Supported formats: MP3, WAV, M4A, OGG, AAC
URL to audio file (alternative to direct upload, supports up to 500MB)
The STT model to use for transcription. Allowed values: Whisper-Large-V3, Wizper, Elevenlabs-STT
Language code for transcription (ISO 639-1 or ISO 639-3). Use 'auto' for auto-detection
"en"
Actual audio duration in minutes for accurate billing
Enable speaker diarization (Elevenlabs-STT only). Allowed values: true, false
Tag non-speech audio events like [laughter], [applause] (Elevenlabs-STT only). Allowed values: true, false
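The constraints listed above can be checked on the client before sending a request. The sketch below is illustrative only: the field names (`model`, `language`, `file`, `audio_url`, `diarize`, `tag_audio_events`) are assumptions, since this reference does not name the parameters. Reading "3MB" as 3 MiB and requiring exactly one audio source are also assumptions.

```python
from typing import Any, Dict, Optional

MAX_UPLOAD_BYTES = 3 * 1024 * 1024  # "max 3MB", read here as 3 MiB (assumption)
ALLOWED_EXTENSIONS = {"mp3", "wav", "m4a", "ogg", "aac"}
MODELS = {"Whisper-Large-V3", "Wizper", "Elevenlabs-STT"}


def build_request(
    model: str,
    *,
    file_name: Optional[str] = None,
    file_size: Optional[int] = None,
    audio_url: Optional[str] = None,
    language: str = "auto",
    diarize: bool = False,
    tag_audio_events: bool = False,
) -> Dict[str, Any]:
    """Validate inputs and describe the request to send (field names hypothetical)."""
    if model not in MODELS:
        raise ValueError(f"Unknown STT model: {model!r}")
    # Assumption: exactly one audio source, either an upload or a URL.
    if (file_name is None) == (audio_url is None):
        raise ValueError("Provide exactly one of file_name or audio_url")
    if (diarize or tag_audio_events) and model != "Elevenlabs-STT":
        raise ValueError("diarize and tag_audio_events are Elevenlabs-STT only")

    fields: Dict[str, Any] = {"model": model, "language": language}
    if file_name is not None:
        extension = file_name.rsplit(".", 1)[-1].lower()
        if extension not in ALLOWED_EXTENSIONS:
            raise ValueError(f"Unsupported audio format: {extension!r}")
        if file_size is None or file_size > MAX_UPLOAD_BYTES:
            raise ValueError("Direct uploads need a known size of at most 3MB")
        content_type = "multipart/form-data"
        fields["file"] = file_name
    else:
        content_type = "application/json"
        fields["audio_url"] = audio_url

    # Only include the Elevenlabs-STT-specific options for that model.
    if model == "Elevenlabs-STT":
        fields["diarize"] = diarize
        fields["tag_audio_events"] = tag_audio_events
    return {"content_type": content_type, "fields": fields}
```

For example, `build_request("Wizper", audio_url="https://example.com/a.mp3")` selects `application/json`, while passing `file_name` and `file_size` selects `multipart/form-data`. Files over the upload limit would need to go through the URL route instead.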