POST /api/transcribe
curl --request POST \
  --url https://nano-gpt.com/api/transcribe \
  --header 'Content-Type: multipart/form-data' \
  --header 'x-api-key: <api-key>' \
  --form 'audioUrl=<string>' \
  --form model=Whisper-Large-V3 \
  --form language=en \
  --form 'actualDuration=<string>' \
  --form diarize=false \
  --form tagAudioEvents=false
{
  "transcription": "Hello, this is a test transcription.",
  "metadata": {
    "fileName": "<string>",
    "fileSize": 123,
    "chargedDuration": 123,
    "actualDuration": 123,
    "language": "<string>",
    "cost": 123,
    "currency": "USD",
    "model": "<string>"
  }
}

Overview

The Speech-to-Text transcription endpoint converts audio files into text using state-of-the-art speech recognition models. It supports multiple languages, speaker diarization, and various audio formats.

Supported Models

  • Whisper-Large-V3: OpenAI’s flagship model ($0.01/min) - Synchronous
  • Wizper: Fast and efficient model ($0.01/min) - Synchronous
  • Elevenlabs-STT: Premium with diarization ($0.03/min) - Asynchronous
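
Since pricing is per minute of audio, the charge for a clip can be estimated up front. A minimal sketch (the PRICING table simply restates the rates above; estimate_cost is a hypothetical helper, not part of the API):

# Per-minute rates (USD) as listed above
PRICING = {
    "Whisper-Large-V3": 0.01,
    "Wizper": 0.01,
    "Elevenlabs-STT": 0.03,
}

def estimate_cost(model, duration_minutes):
    """Estimate the charge for a clip of the given length."""
    return PRICING[model] * duration_minutes

# A 2.5-minute file on Whisper-Large-V3 comes to $0.025,
# matching the cost field in the response example below
print(estimate_cost("Whisper-Large-V3", 2.5))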

Upload Methods

Direct File Upload (≤3MB)

import requests

def transcribe_file(file_path):
    """Upload a local audio file directly (multipart/form-data, up to 3MB)."""
    headers = {"x-api-key": "YOUR_API_KEY"}

    with open(file_path, 'rb') as audio_file:
        # The file goes in the 'audio' field; model options ride along as form fields
        files = {'audio': ('audio.mp3', audio_file, 'audio/mpeg')}
        data = {
            'model': 'Whisper-Large-V3',
            'language': 'en'
        }

        response = requests.post(
            "https://nano-gpt.com/api/transcribe",
            headers=headers,
            files=files,
            data=data
        )

        return response.json()

result = transcribe_file("meeting.mp3")
print(result['transcription'])
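
Because direct uploads are capped at 3MB, a client can check the file size first and fall back to the URL method (next section) for anything larger. A minimal sketch; can_upload_directly is a hypothetical helper:

import os

MAX_DIRECT_UPLOAD_BYTES = 3 * 1024 * 1024  # 3MB direct-upload limit from above

def can_upload_directly(file_path):
    """True if the file fits under the 3MB direct-upload limit."""
    return os.path.getsize(file_path) <= MAX_DIRECT_UPLOAD_BYTES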

URL Upload (≤500MB)

import requests

def transcribe_url(audio_url):
    """Transcribe a remotely hosted audio file by URL (JSON body, up to 500MB)."""
    headers = {
        "x-api-key": "YOUR_API_KEY",
        "Content-Type": "application/json"
    }

    data = {
        "audioUrl": audio_url,
        "model": "Wizper",
        "language": "auto"  # let the service detect the language
    }

    response = requests.post(
        "https://nano-gpt.com/api/transcribe",
        headers=headers,
        json=data
    )

    return response.json()

result = transcribe_url("https://example.com/audio.mp3")
print(result['transcription'])

Advanced Features - Speaker Diarization

Use Elevenlabs-STT for speaker identification (asynchronous processing):

import requests
import time

def transcribe_with_speakers(audio_url):
    headers = {
        "x-api-key": "YOUR_API_KEY",
        "Content-Type": "application/json"
    }

    # Submit the transcription job
    data = {
        "audioUrl": audio_url,
        "model": "Elevenlabs-STT",
        "diarize": True,
        "tagAudioEvents": True
    }

    response = requests.post(
        "https://nano-gpt.com/api/transcribe",
        headers=headers,
        json=data
    )

    # Asynchronous models return 202 with a job ID; anything else is an error
    if response.status_code != 202:
        raise Exception(f"Job submission failed: {response.status_code} {response.text}")

    job_data = response.json()

    # Poll the status endpoint until the job completes or fails
    status_data = {
        "runId": job_data['runId'],
        "cost": job_data.get('cost'),
        "paymentSource": job_data.get('paymentSource'),
        "isApiRequest": True
    }

    while True:
        status_response = requests.post(
            "https://nano-gpt.com/api/transcribe/status",
            headers=headers,
            json=status_data
        )

        result = status_response.json()
        if result.get('status') == 'completed':
            return result
        elif result.get('status') == 'failed':
            raise Exception(f"Transcription failed: {result.get('error')}")

        time.sleep(5)

result = transcribe_with_speakers("https://example.com/meeting.mp3")

# Access speaker segments
for segment in result['diarization']['segments']:
    print(f"{segment['speaker']}: {segment['text']}")

Language Support

Supports 97+ languages with auto-detection:

# Common language codes
languages = {
    "auto": "Auto-detect",
    "en": "English", 
    "es": "Spanish",
    "fr": "French",
    "de": "German", 
    "zh": "Chinese",
    "ja": "Japanese",
    "ar": "Arabic"
}
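
When "auto" is used, the response metadata includes a language field (see the response examples below), which is assumed here to reflect the detected language. Using the transcribe_url helper from above:

result = transcribe_url("https://example.com/interview.mp3")  # transcribe_url sends language="auto"
# metadata.language is assumed to report the detected language
print(f"Detected language: {result['metadata']['language']}")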

Response Examples

Synchronous Response (Whisper/Wizper)

{
  "transcription": "Hello, this is a test transcription.",
  "metadata": {
    "fileName": "audio.mp3",
    "fileSize": 1234567,
    "chargedDuration": 2.5,
    "actualDuration": 2.5,
    "language": "en",
    "cost": 0.025,
    "currency": "USD",
    "model": "Whisper-Large-V3"
  }
}

Asynchronous Response (Elevenlabs-STT)

Initial response (202):

{
  "runId": "abc123def456",
  "status": "pending",
  "model": "Elevenlabs-STT",
  "cost": 0.075,
  "paymentSource": "USD"
}

Final response (when completed):

{
  "status": "completed",
  "transcription": "Speaker 1: Hello everyone. Speaker 2: Hi there!",
  "metadata": { ... },
  "diarization": {
    "segments": [
      {
        "speaker": "Speaker 1",
        "text": "Hello everyone",
        "start": 0.5,
        "end": 1.5
      }
    ]
  },
  "words": [
    {
      "text": "Hello",
      "start": 0.5,
      "end": 0.9,
      "type": "word",
      "speaker_id": "speaker_0"
    }
  ]
}
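
Each entry in the word-level words array carries a speaker_id, so consecutive words can be regrouped into per-speaker utterances. A minimal sketch, assuming result holds the completed response above; words_by_speaker is a hypothetical helper:

from itertools import groupby

def words_by_speaker(words):
    """Regroup consecutive words by speaker_id into per-speaker utterances."""
    utterances = []
    for speaker_id, group in groupby(words, key=lambda w: w['speaker_id']):
        # Keep only entries of type "word"; other types may be tagged audio events
        tokens = [w['text'] for w in group if w['type'] == 'word']
        if tokens:
            utterances.append((speaker_id, " ".join(tokens)))
    return utterances

for speaker_id, text in words_by_speaker(result['words']):
    print(f"{speaker_id}: {text}")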

Authorizations

x-api-key (string, header, required)

Body

Audio transcription parameters. Use multipart/form-data for file uploads or application/json for URL-based requests.

The body is of type object.

Response

200 (application/json)

Synchronous transcription response (Whisper/Wizper models)

The response is of type object.