Processing Times

Asynchronous Transcription

Asynchronous transcription refers to the transcription of pre-recorded audio/video files.

Our technology differs from other speech-to-text providers in that processing time does not grow linearly with recording length. The ratio of processing time to recording length is often called the Real-Time Factor (RTF).

With other providers, processing is usually slow: 25 minutes to process 1 hour of audio, which gives an RTF of 42% (25/60 ≈ 0.42). Faster providers can sometimes take between 8 and 12 minutes (an RTF of roughly 13% to 20%).

With Gladia, the non-linearity of the processing time can be surprising at first: it can process a short 20-second audio file in 2 seconds (RTF = 2/20 = 10%), a 1-hour audio file in 10 to 25 seconds (RTF = 10/3600 ≈ 0.3% in the best case), and a 2-hour file in 50 seconds (RTF = 50/7200 ≈ 0.7%).
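The RTF arithmetic above can be reproduced with a few lines of Python. This is only a sketch for checking the quoted figures: the function name `real_time_factor` and the example labels are illustrative and are not part of any Gladia SDK.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower means faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the text, as (processing seconds, audio seconds).
examples = {
    "other provider, 1 h audio": (25 * 60, 60 * 60),
    "Gladia, 20 s audio": (2, 20),
    "Gladia, 1 h audio (best case)": (10, 3600),
    "Gladia, 2 h audio": (50, 7200),
}

for label, (processing, audio) in examples.items():
    # The ".1%" format multiplies by 100 and keeps one decimal place.
    print(f"{label}: RTF = {real_time_factor(processing, audio):.1%}")
```

Because the RTF is not constant across file lengths, a figure measured on a short file should not be extrapolated to a long one.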

In all cases, we are faster than the market standards by miles ;-)


The transcription time does not include the time to upload or download the file, the time to convert the file into a different format (where applicable), or the processing time of Audio Intelligence add-ons such as algorithmic speaker diarization, sentiment, emotion, topic detection, moderation, summarization, and chapterization.

Typical Processing Time per Audio Intelligence Module

| Operation | Typical time per hour of audio recording |
|---|---|
| Upload/download | Depends on the file settings (please refer to the supported media formats to find the best tradeoff between transfer and conversion time) |
| Channel-based Diarization | 0 sec (please refer to the diarization guide for further information) |
| Algorithmic-based Diarization | 60 sec (please refer to the diarization guide for further information) |
| Prompt Injection | 0 sec |
| Sentiment Analysis | 500-700 ms |
| Emotion Analysis | 500-700 ms |
| Moderation Classification | 500-700 ms |
| Topic Classification | 500-700 ms |
| Named Entity Recognition | 500-700 ms |
| Direct Translation | 75 ms |
| Noise Reduction | 5 sec |
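As a rough illustration of how these per-hour figures add up, the sketch below estimates the extra processing time for a chosen set of add-ons. It rests on assumptions that the table does not state explicitly: that each figure scales linearly with audio length, and that add-on times simply sum. The dictionary keys and the function name `estimate_addon_overhead` are hypothetical labels, not API parameter names, and upload/download time is left out because it depends on the file.

```python
# Typical extra seconds per hour of audio, copied from the table.
# Each value is a (low, high) pair; single figures use the same value twice.
ADDON_SECONDS_PER_HOUR = {
    "channel_diarization": (0.0, 0.0),
    "algorithmic_diarization": (60.0, 60.0),
    "prompt_injection": (0.0, 0.0),
    "sentiment_analysis": (0.5, 0.7),
    "emotion_analysis": (0.5, 0.7),
    "moderation_classification": (0.5, 0.7),
    "topic_classification": (0.5, 0.7),
    "named_entity_recognition": (0.5, 0.7),
    "direct_translation": (0.075, 0.075),
    "noise_reduction": (5.0, 5.0),
}


def estimate_addon_overhead(audio_hours: float, addons: list[str]) -> tuple[float, float]:
    """Return (low, high) estimated extra seconds for the selected add-ons.

    Assumes linear scaling with audio length and purely additive add-on times.
    """
    low = sum(ADDON_SECONDS_PER_HOUR[name][0] for name in addons) * audio_hours
    high = sum(ADDON_SECONDS_PER_HOUR[name][1] for name in addons) * audio_hours
    return low, high


low, high = estimate_addon_overhead(
    2.0, ["algorithmic_diarization", "sentiment_analysis", "noise_reduction"]
)
print(f"Estimated add-on overhead: {low:.1f} to {high:.1f} seconds")
```

Under these assumptions, algorithmic diarization dominates the overhead, while the classification-style add-ons contribute well under a second per hour of audio.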

Real-Time Streaming Transcription

Our Real-Time Streaming WebSocket API streams text transcriptions back to clients within a few hundred milliseconds (typically 200 to 300ms after the end of the utterance). This latency includes upload, transcription, and callback.