Speech-to-Text Transcription
Transcribe audio files into text using state-of-the-art speech recognition models. The endpoint supports multiple languages, speaker diarization, and various audio formats. Whisper and Wizper requests return the transcript synchronously; Elevenlabs-STT requests return a job ID and are processed asynchronously.
Overview
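The Speech-to-Text transcription endpoint accepts audio as a direct file upload or as a URL and returns the transcript as text. The model you choose determines the price per minute, whether speaker diarization is available, and whether the result arrives synchronously or through an asynchronous job.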
Supported Models
- Whisper-Large-V3: OpenAI’s flagship model ($0.01/min) - Synchronous
- Wizper: Fast and efficient model ($0.01/min) - Synchronous
- Elevenlabs-STT: Premium with diarization ($0.03/min) - Asynchronous
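The per-minute prices above can be turned into a rough cost estimate. The following is a minimal sketch, not part of the API: the function name `estimate_cost_usd` is hypothetical, the model keys simply reuse the display names from the list (the identifiers the API expects may differ), and it assumes usage is pro-rated by the second, which this page does not confirm.

```python
from decimal import Decimal

# Per-minute prices and processing modes copied from the model list above.
# Keys are the display names; the API's own model identifiers may differ.
MODELS = {
    "Whisper-Large-V3": {"usd_per_min": Decimal("0.01"), "asynchronous": False},
    "Wizper": {"usd_per_min": Decimal("0.01"), "asynchronous": False},
    "Elevenlabs-STT": {"usd_per_min": Decimal("0.03"), "asynchronous": True},
}


def estimate_cost_usd(model: str, duration_seconds: int) -> Decimal:
    """Estimate the price of one transcription.

    Assumption: usage is pro-rated by the second. The real billing
    granularity (for example, rounding up to whole minutes) may differ.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(duration_seconds) / Decimal(60)
    return MODELS[model]["usd_per_min"] * minutes
```

Under these assumptions, a 10-minute recording costs about $0.30 with Elevenlabs-STT and about $0.10 with either synchronous model.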
Upload Methods
Direct File Upload (≤3MB)
URL Upload (≤500MB)
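The two size limits suggest a simple client-side check before sending a request. The sketch below is illustrative only: `choose_upload` is a hypothetical helper, and it assumes "MB" means decimal megabytes (1,000,000 bytes), which is the stricter reading since this page does not say whether MB or MiB is meant. The content types are the ones named in the Body section of this page.

```python
MB = 1_000_000  # assumption: decimal megabytes; the page does not specify MB vs MiB
DIRECT_UPLOAD_LIMIT = 3 * MB
URL_UPLOAD_LIMIT = 500 * MB


def choose_upload(size_bytes: int) -> tuple[str, str]:
    """Return (upload method, request content type) for audio of the given size.

    Small files are sent directly; larger ones must be referenced by URL.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    if size_bytes <= DIRECT_UPLOAD_LIMIT:
        return ("file", "multipart/form-data")
    if size_bytes <= URL_UPLOAD_LIMIT:
        return ("url", "application/json")
    raise ValueError("audio exceeds the 500MB URL upload limit")
```

A small file can also be submitted by URL; the helper prefers direct upload only because it avoids hosting the file first.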
Advanced Features - Speaker Diarization
Use Elevenlabs-STT for speaker identification. Requests to this model are processed asynchronously.
Language Support
The endpoint supports 97+ languages and can detect the spoken language automatically.
Response Examples
Synchronous Response (Whisper/Wizper)
Asynchronous Response (Elevenlabs-STT)
Initial response (HTTP 202): the request is accepted and a job ID is returned.
Final response (when completed): the finished transcription.
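An asynchronous job is typically consumed by polling until it finishes. The sketch below shows one way to structure that loop without assuming any endpoint: the caller supplies `fetch_job`, a function that retrieves the current job state for a job ID. The field name `status` and the values `"completed"` and `"failed"` are assumptions for illustration; check the actual response schema before relying on them.

```python
import time
from typing import Any, Callable


def wait_for_transcript(
    fetch_job: Callable[[str], dict[str, Any]],
    job_id: str,
    *,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll a job until it completes, fails, or the timeout elapses.

    fetch_job is provided by the caller and performs the actual HTTP
    request; "status", "completed", and "failed" are hypothetical names.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"transcription job {job_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; in production code the defaults are sufficient.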
Authorizations
Body
Audio transcription parameters. Use multipart/form-data for file uploads or application/json for URL-based requests.
The body is of type object.
Response
Synchronous transcription response (Whisper/Wizper models)
The response is of type object.