Asynchronous transcription refers to the transcription of pre-recorded audio/video files.
Our technology differs from other speech-to-text providers in that processing time does not scale linearly with recording length. The ratio of the two (processing time divided by recording length) is often called the Real-Time Factor (RTF).
Other providers usually take a long time: around 25 minutes to process 1 hour of audio, which is an RTF of 42% (25/60 ≈ 0.42). Faster providers can sometimes take between 8 and 12 minutes (an RTF of roughly 13% to 20%).
With Gladia, the non-linearity of the processing time can be surprising at first: it can process a short audio file of 20 seconds in 2 seconds (RTF = 2/20 = 10%), a 1-hour audio file in 10 to 25 seconds (RTF between 10/3600 ≈ 0.3% and 25/3600 ≈ 0.7%), and a 2-hour file in 50 seconds (RTF = 50/7200 ≈ 0.7%).
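The RTF arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `rtf` and the example labels are made up for this page, and the numbers are the ones quoted in the text, not fresh measurements.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time divided by recording length."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the text: (label, processing time in s, audio length in s).
examples = [
    ("typical provider, 1 h file", 25 * 60, 3600),
    ("Gladia, 20 s file", 2, 20),
    ("Gladia, 1 h file (fast end)", 10, 3600),
    ("Gladia, 1 h file (slow end)", 25, 3600),
    ("Gladia, 2 h file", 50, 7200),
]

for label, processing, audio in examples:
    # The ".1%" format multiplies by 100 and keeps one decimal place.
    print(f"{label}: RTF = {rtf(processing, audio):.1%}")
```

A lower RTF means faster processing relative to the recording length; a value of 100% would mean the file takes as long to process as it takes to play.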
In all cases, we are faster than the market standards by miles ;-)
The transcription time does not include the time to upload or download the file, the time to convert the file into a different format (where applicable), or the processing time of Audio Intelligence add-ons such as algorithmic speaker diarization, sentiment, emotion, topic detection, moderation, summarization, chapterization, etc.
| Operation | Typical time per hour of audio recording |
|---|---|
| Upload/download | Depends on the file settings (please refer to the supported Media Format to find the best tradeoff between transfer and conversion time: https://docs.gladia.io/reference/supported-media-formats) |
| Channel-based Diarization | 0 s (please refer to the diarization guide for further information) |
| Algorithmic-based Diarization | 60 s (please refer to the diarization guide for further information) |
| Named Entity Recognition | 500 ms to 700 ms |
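As a rough sketch, the per-hour figures in the table can be turned into an estimate of add-on overhead for a given file. Two assumptions are made here that the table does not state: overhead scales linearly with audio duration, and add-ons run one after another. The dictionary keys and the function name are hypothetical labels for this example, not Gladia API parameters.

```python
# Typical add-on time per hour of audio, in seconds, as (low, high) bounds.
# Values come from the table; the key names are illustrative only.
ADDON_SECONDS_PER_HOUR = {
    "channel_diarization": (0.0, 0.0),
    "algorithmic_diarization": (60.0, 60.0),
    "named_entity_recognition": (0.5, 0.7),
}


def estimate_addon_overhead(audio_seconds: float, addons: list[str]) -> tuple[float, float]:
    """Return (low, high) estimated add-on time in seconds.

    Assumes linear scaling with duration and sequential execution.
    """
    hours = audio_seconds / 3600
    low = sum(ADDON_SECONDS_PER_HOUR[name][0] for name in addons) * hours
    high = sum(ADDON_SECONDS_PER_HOUR[name][1] for name in addons) * hours
    return low, high


# Example: a 2-hour recording with algorithmic diarization and NER enabled.
low, high = estimate_addon_overhead(
    2 * 3600, ["algorithmic_diarization", "named_entity_recognition"]
)
print(f"Estimated add-on overhead: {low:.1f} s to {high:.1f} s")
```

This estimate covers add-ons only; transcription, transfer, and conversion time come on top of it.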
Our Real-Time Streaming WebSocket API streams text transcriptions back to clients within a few hundred milliseconds (typically 200 to 300ms after the end of the utterance). This latency includes upload, transcription, and callback.