Features
Core features of Gladia’s real-time speech-to-text (STT) API
Language configuration
Single language
If you know the language of the conversation in advance, specify it in the language_config.languages
parameter to ensure the best transcription results.
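For example, a session configuration pinning the language to English might look like this. This is a minimal sketch: the `language_config.languages` parameter is documented above, while the audio fields are illustrative placeholders, so check the API reference for the full schema.

```python
import json

# Minimal real-time session configuration with a single known language.
# Fields other than language_config are illustrative placeholders.
config = {
    "encoding": "wav/pcm",   # assumed audio-format fields; adjust to your stream
    "sample_rate": 16000,
    "language_config": {
        "languages": ["en"],  # a single known language gives the best accuracy
    },
}

payload = json.dumps(config)
```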
If the spoken language is unknown, you can:
- Omit the language_config.languages parameter; the model will automatically detect the language from the first few seconds of audio across all supported languages.
- Specify multiple languages in the language_config.languages parameter; the model will detect the language from the first few seconds of audio within the provided options.
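To narrow detection down to a few expected languages, list them all in the documented `language_config.languages` parameter. A minimal sketch:

```python
# Restrict automatic language detection to a set of candidates:
# the model picks one of these from the first few seconds of audio.
config = {
    "language_config": {
        "languages": ["en", "fr", "es"],
    },
}
```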
Multiple languages (code-switching)
If you expect multiple languages to be spoken during the conversation, enable the language_config.code_switching
parameter. This will allow the model to switch languages dynamically and reflect it in the transcription results.
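A sketch of a code-switching configuration using the documented `language_config.code_switching` parameter (the language list is illustrative):

```python
# Allow the model to switch languages mid-conversation.
config = {
    "language_config": {
        "languages": ["en", "fr"],  # optional: restrict switching to these
        "code_switching": True,     # enable dynamic language switching
    },
}
```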
As with single-language configuration, you can either let the model detect the language from all supported languages or specify a set of options to narrow down the selection.
Whether you use a single- or multi-language configuration, it is recommended to limit the number of candidate languages to avoid incorrect detection. Some languages, such as those from Eastern European countries, sound similar, which may cause the model to confuse them and produce a transcription in the wrong language.
Word-level timestamps
Instead of only getting timestamps for when utterances begin and end, Gladia’s real-time API provides word-level timestamps. These give you the exact timestamp of each word, enabling more detailed analysis and more accurate synchronization with audio and video files.
To enable it, pass the following configuration:
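A sketch of such a configuration. The flag name and its placement under `realtime_processing` are assumptions, not taken from this page, so verify both against the current API reference:

```python
# Session configuration enabling per-word timestamps.
# "realtime_processing.words_accurate_timestamps" is an assumed name --
# check the API reference for the exact parameter.
config = {
    "language_config": {"languages": ["en"]},
    "realtime_processing": {
        "words_accurate_timestamps": True,  # assumed flag name
    },
}
```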
Under each utterance, you’ll find a words
property, like this:
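A transcript message might then carry a shape like the following. The `words` property is documented above; the exact field names inside each word entry are illustrative:

```python
# Illustrative utterance shape: each word carries its own
# start/end timestamps (in seconds) and a confidence score.
utterance = {
    "text": "Hello world",
    "start": 0.44,
    "end": 1.12,
    "words": [
        {"word": "Hello", "start": 0.44, "end": 0.72, "confidence": 0.98},
        {"word": "world", "start": 0.76, "end": 1.12, "confidence": 0.95},
    ],
}
```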
Custom vocabulary
To enhance the precision of words you know will recur often in your transcription, use the custom_vocabulary
feature.
Custom vocabulary has the following limitations:
- Global limit of 10k characters
- No more than 100 entries
- Each element can’t contain more than 5 words
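The limits above can be checked client-side before starting a session. A sketch, assuming `custom_vocabulary` takes a list of strings; where exactly the property sits in the session configuration should be confirmed in the API reference:

```python
# Example custom vocabulary; entries are recurring terms you expect to hear.
vocabulary = ["Gladia", "code-switching", "word-level timestamps"]

# Enforce the documented limits before sending:
assert len(vocabulary) <= 100, "no more than 100 entries"
assert all(len(entry.split()) <= 5 for entry in vocabulary), "max 5 words per entry"
assert sum(len(entry) for entry in vocabulary) <= 10_000, "global 10k character limit"

config = {"custom_vocabulary": vocabulary}  # assumed placement in the config
```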
Multiple channels
If you have multiple channels in your audio stream, specify the count in the configuration:
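For example, for a 2-channel stream (say, caller and agent on separate channels), the count might be declared like this. The property name `channels` and the surrounding audio fields are assumptions; check the API reference:

```python
# Audio configuration for a stream with two separate channels.
# "channels" is an assumed property name for the channel count.
config = {
    "encoding": "wav/pcm",   # illustrative audio fields
    "sample_rate": 16000,
    "channels": 2,           # number of channels in the incoming stream
}
```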
Gladia’s real-time API will automatically split the channels and transcribe them separately.
For each utterance, you’ll get a channel
key corresponding to the channel the utterance came from.
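With a 2-channel stream, utterances might come back tagged as below (the `channel` key is documented above; the overall shape is illustrative), which makes it easy to rebuild a per-speaker transcript:

```python
# Illustrative utterances from a 2-channel stream.
utterances = [
    {"channel": 0, "text": "Hi, how can I help you?"},
    {"channel": 1, "text": "I'd like to check my order status."},
]

# Group the transcript per channel (e.g. caller vs. agent).
by_channel = {}
for u in utterances:
    by_channel.setdefault(u["channel"], []).append(u["text"])
```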
Transcribing an audio stream with multiple channels is billed per channel. For example, an audio stream with 2 channels is billed as double the audio duration, even if the channels are identical.
Attaching custom metadata
You can attach metadata to your real-time transcription session using the custom_metadata
property. This will make it easy to recognize your transcription when you receive data from the GET /v2/live/:id
endpoint. More importantly, you’ll be able to use it as a filter in the GET /v2/live
list endpoint.
For example, you can add the following to your configuration:
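A sketch using the documented `custom_metadata` property; the keys and values below are illustrative, and the stringified metadata must stay under the documented 2000-character limit:

```python
import json

# Attach free-form identifying metadata to the session.
config = {
    "custom_metadata": {
        "user": "Alice",            # illustrative keys
        "conversation_id": "1234",
    },
}

# Stay under the documented 2000-character limit once stringified.
assert len(json.dumps(config["custom_metadata"])) <= 2000
</imports>```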
And use a GET request to filter results by that metadata:
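A sketch of building such a request against the documented GET /v2/live list endpoint. The exact query-parameter syntax for metadata filters is an assumption here, so check the API reference for the supported filter format:

```python
from urllib.parse import urlencode

# Hypothetical filter: list sessions whose custom_metadata matches.
# The query-parameter format is assumed -- verify in the API reference.
base = "https://api.gladia.io/v2/live"
query = urlencode({"custom_metadata": '{"conversation_id": "1234"}'})
url = f"{base}?{query}"
# Send this with your HTTP client of choice, passing your API key header.
```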
custom_metadata
cannot be longer than 2000 characters when stringified.