Processing Times
Asynchronous Transcription
Asynchronous transcription refers to the transcription of pre-recorded audio/video files.
Our technology differs from other speech-to-text providers in that processing time does not scale linearly with recording length. The ratio of processing time to recording length is often called the Real-Time Factor (RTF).
Other providers usually take a long time: around 25 minutes to process 1 hour of audio, which is an RTF of about 42% (25/60 ≈ 0.42). Faster providers can sometimes take between 8 and 12 minutes (an RTF of roughly 13% to 20%).
With Gladia, the non-linearity of the processing time can be surprising at first: it can process a short 20-second audio file in 2 seconds (RTF = 2/20 = 10%), a 1-hour audio file in 10 to 25 seconds (RTF = 10/3600 ≈ 0.3% at the fast end), and a 2-hour file in 50 seconds (RTF = 50/7200 ≈ 0.7%).
In all cases, we are faster than the market standards by miles ;-)
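The RTF figures quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `real_time_factor` and the example labels are assumptions made for this example and are not part of any Gladia SDK.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (both in seconds)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# (label, processing time in seconds, audio duration in seconds),
# using the figures quoted in this section.
examples = [
    ("20-second file", 2, 20),
    ("1-hour file, fast end", 10, 3600),
    ("1-hour file, slow end", 25, 3600),
    ("2-hour file", 50, 7200),
    ("other provider, 1-hour file", 25 * 60, 3600),
]

for label, processing, audio in examples:
    print(f"{label}: RTF = {real_time_factor(processing, audio):.1%}")
```

A lower RTF means faster processing relative to the length of the recording.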
The transcription time doesn't include the upload/download time of the file, the time to convert the file into a different format (where applicable), or the processing time of Audio Intelligence add-ons such as algorithmic speaker diarization, sentiment, emotion, topic detection, moderation, summarization, chapterization, etc.
Typical Processing Time per Audio Intelligence Module
Operation | Typical time per hour of audio recording |
---|---|
Upload/download | Depends on the file settings (please refer to the supported Media Format to find the best tradeoff between transfer and conversion time: https://docs.gladia.io/reference/supported-media-formats) |
Transcription | 10-15 sec |
Channel-based Diarization | 0 sec (please refer to the diarization guide for further information) |
Algorithmic-based Diarization | 60 sec (please refer to the diarization guide for further information) |
Prompt Injection | 0 sec |
Summarization | 15 sec |
Chapterization | 15 sec |
Sentiment Analysis | 500-700 ms |
Emotion Analysis | 500-700 ms |
Moderation Classification | 500-700 ms |
Topic Classification | 500-700 ms |
Named Entity Recognition | 500-700 ms |
Direct Translation | 75 ms |
Noise Reduction | 5 sec |
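As a rough illustration of how the table can be used, the sketch below adds up the typical times of the operations enabled for one hour of audio. It makes two assumptions that the table does not state: that the operations run one after another rather than in parallel, and that the listed values simply add up. The dictionary keys and the function name `estimate_seconds_per_hour` are hypothetical names chosen for this example, and upload/download time is left out because it depends on the file.

```python
# Typical (low, high) processing times in seconds per hour of audio,
# copied from the table above. Keys are hypothetical labels.
TYPICAL_SECONDS_PER_HOUR = {
    "transcription": (10.0, 15.0),
    "channel_diarization": (0.0, 0.0),
    "algorithmic_diarization": (60.0, 60.0),
    "prompt_injection": (0.0, 0.0),
    "summarization": (15.0, 15.0),
    "chapterization": (15.0, 15.0),
    "sentiment_analysis": (0.5, 0.7),
    "emotion_analysis": (0.5, 0.7),
    "moderation_classification": (0.5, 0.7),
    "topic_classification": (0.5, 0.7),
    "named_entity_recognition": (0.5, 0.7),
    "direct_translation": (0.075, 0.075),
    "noise_reduction": (5.0, 5.0),
}


def estimate_seconds_per_hour(operations):
    """Sum the typical low and high times of the given operations.

    Assumes sequential execution (an assumption, not documented behavior).
    Raises KeyError for an operation that is not in the table.
    """
    low = sum(TYPICAL_SECONDS_PER_HOUR[op][0] for op in operations)
    high = sum(TYPICAL_SECONDS_PER_HOUR[op][1] for op in operations)
    return low, high


low, high = estimate_seconds_per_hour(
    ["transcription", "algorithmic_diarization", "summarization"]
)
print(f"Estimated processing time: {low:.0f} to {high:.0f} seconds per hour of audio")
```

Because transcription time is not linear in recording length, this kind of estimate should only be read as an order of magnitude for a one-hour recording, not scaled to other durations.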
Real-Time Streaming Transcription
Our Real-Time Streaming WebSocket API streams text transcriptions back to clients within a few hundred milliseconds (typically 200 to 300ms after the end of the utterance). This latency includes upload, transcription, and callback.