Speech Recognition
Core feature of the Gladia API
The core functionality of the Gladia API is its Speech Recognition model, designed to convert spoken language into written text. This serves as the basis for all Gladia API offerings.
Do you want to know more about Gladia's latest speech-to-text AI model? Discover our state-of-the-art ASR model, Whisper Zero.
Additional capabilities, like Speaker Diarization, Summarization, Translation, Custom Prompts and more can be integrated seamlessly into the transcription process by including extra parameters in the transcription request.
Sending a transcription request
Once your audio has been processed, here is an example of the output you should get:
Split infinity in a time when less is more, where too much is never enough. There is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet.
Getting the result of a request
You can get your transcription results in 3 different ways:
Transcription job status
The transcription status can have the following values:
Status | Description |
---|---|
queued | Audio waiting to be processed |
processing | Audio file being processed |
done | Transcription successfully completed |
error | An error occurred on your transcription |
Transcriptions can fail for a variety of reasons:
- No audio in the audio file
- Audio URL unreachable
- Issues with your file format
If you get another type of failure (most likely a server failure), resubmit the audio file and another server will take care of processing it.
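The status values above can drive a simple polling loop. The sketch below is a minimal illustration, assuming the job is returned as a dictionary with a `status` key; the `fetch` callable is hypothetical and stands in for whatever HTTP client you use to read the job.

```python
import time
from typing import Callable, Dict

# Statuses from the table above that mean "not finished yet".
PENDING = {"queued", "processing"}


def wait_for_transcription(
    fetch: Callable[[], Dict],
    poll_interval: float = 2.0,
    max_polls: int = 150,
) -> Dict:
    """Poll until the job reaches "done" or "error".

    `fetch` is a hypothetical callable returning the job as a dict
    with at least a "status" key (an assumption about the payload).
    """
    for _ in range(max_polls):
        job = fetch()
        status = job.get("status")
        if status == "done":
            return job
        if status == "error":
            # The caller decides whether to fix the input or resubmit.
            raise RuntimeError(f"Transcription failed: {job}")
        if status not in PENDING:
            raise ValueError(f"Unexpected status: {status!r}")
        time.sleep(poll_interval)
    raise TimeoutError("Transcription did not finish in time")
```

Passing the fetch function in, rather than hard-coding an HTTP call, keeps the loop easy to test and independent of any particular client library.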
Word-level timestamps
Instead of only providing start and end timestamps for each utterance, the Gladia Speech-to-Text API provides word-level timestamps by default. You get the start and end time of each word, which gives you a more precise transcription. This is particularly useful for detailed analysis, since you can pinpoint the moment each word is spoken and synchronize the transcript more accurately with audio or video files.
Under each utterance, you’ll find a words property like this:
// other properties...
"utterances": [
{
"words": [
{
"word": "Split",
"start": 0.21001999999999998,
"end": 0.69015,
"confidence": 1
},
{
"word": " infinity",
"start": 0.91021,
"end": 1.55038,
"confidence": 0.95
},
...
]
}
]
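As an illustration of how these timestamps might be consumed, the sketch below looks up the word being spoken at a given time. It assumes only the `utterances` / `words` shape shown above; the function names are hypothetical.

```python
from typing import Dict, Iterator, List, Optional


def iter_words(utterances: List[Dict]) -> Iterator[Dict]:
    """Yield every word entry from every utterance, in order."""
    for utterance in utterances:
        yield from utterance.get("words", [])


def word_at(utterances: List[Dict], t: float) -> Optional[str]:
    """Return the word spoken at time `t` (seconds), or None in a gap."""
    for word in iter_words(utterances):
        if word["start"] <= t <= word["end"]:
            # Words may carry a leading space, as in " infinity".
            return word["word"].strip()
    return None
```

For example, with the two words shown above, `word_at(utterances, 0.5)` falls inside the first word's interval, while a time between the two words returns `None`.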
Sentences
In addition to getting the transcription split by utterances, you can request that the transcription be semantically segmented into sentences, providing a more human-readable result.
You can get translated sentences by enabling both sentences and translation. You’ll receive the sentences output for the original transcript, and each translation result will also contain the sentences output in the translated language.
{
"sentences": true
}
The result will contain a sentences key (in addition to utterances):
"sentences": {
"success": true,
"is_empty": false,
"results": [
{
"sentence": "Amy, it says you are trained in technology.",
"start": 0.4681999999999999,
"end": 2.45525,
"words": [...],
"confidence": 0.95,
"language": "en",
"speaker": 0,
"channel": 0
},
{
"sentence": "That's very good.",
"start": 2.51546,
"end": 3.5992999999999995,
"words": [...],
"confidence": 0.96,
"language": "en",
"speaker": 0,
"channel": 0
},
...
]
}
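A minimal sketch of how the `sentences` object might be turned into a readable, timestamped transcript. It relies only on the fields shown in the example above; the function name and the output format are illustrative choices.

```python
from typing import Dict, List


def format_sentences(sentences: Dict) -> List[str]:
    """Render each sentence as "[start - end] speaker N: text"."""
    # Nothing to render if the feature failed or produced no output.
    if not sentences.get("success") or sentences.get("is_empty"):
        return []
    lines = []
    for item in sentences.get("results", []):
        lines.append(
            f"[{item['start']:.2f}s - {item['end']:.2f}s] "
            f"speaker {item['speaker']}: {item['sentence']}"
        )
    return lines
```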
Automatic language detection
With automatic language detection, Gladia will identify the dominant language spoken in an audio file and use it during the transcription. This behaviour can be toggled with the detect_language parameter.
Automatic language detection is turned on by default.
{
"audio_url": "YOUR_AUDIO_URL",
"detect_language": true
}
Multiple languages detection
(Code switching)
By enabling code switching, the model will continuously detect the spoken language and switch the transcription language accordingly. This behaviour is recommended for specific scenarios where the language changes multiple times throughout the audio (e.g. a conversation between 2 people, each speaking a different language).
Please note that certain strong accents may cause this mode to transcribe to the wrong language.
To enable it, set enable_code_switching to true in the request body parameters (it defaults to false).
{
"audio_url": "YOUR_AUDIO_URL",
"enable_code_switching": true
}
Guided code switching
When code switching is enabled, you may provide a list of languages to the model, ensuring the model will only detect these languages.
{
"audio_url": "YOUR_AUDIO_URL",
"enable_code_switching": true,
"code_switching_config": {
"languages": ["en", "es", "fr"]
}
}
Manual transcription language
If you already know the dominant language, you can disable language detection by setting detect_language to false and manually set the language with the language key.
To set the language manually, detect_language must be disabled; otherwise the language parameter will be ignored.
{
"audio_url": "YOUR_AUDIO_URL",
"detect_language": false,
"language": "fr"
}
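The three language modes described above (automatic detection, code switching, and a manual language) can be captured in one small helper that builds the matching request fragment. This is a sketch: the helper and its argument names are hypothetical, and treating a manual language and code switching as mutually exclusive is a choice made here, not a documented API rule.

```python
from typing import Dict, List, Optional


def language_options(
    language: Optional[str] = None,
    code_switching: bool = False,
    candidates: Optional[List[str]] = None,
) -> Dict:
    """Build the language-related part of a transcription request body."""
    if language is not None:
        if code_switching:
            # Assumption of this helper, not a documented constraint.
            raise ValueError("Pick either a manual language or code switching")
        # Detection must be off, otherwise `language` is ignored.
        return {"detect_language": False, "language": language}
    if code_switching:
        options: Dict = {"enable_code_switching": True}
        if candidates:
            # Guided code switching: restrict detection to these languages.
            options["code_switching_config"] = {"languages": list(candidates)}
        return options
    # Default behaviour: detect the dominant language automatically.
    return {"detect_language": True}
```

The returned dictionary can be merged into the request body next to `audio_url`, for example `{"audio_url": "YOUR_AUDIO_URL", **language_options(language="fr")}`.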
Export SRT or VTT caption files
You can export completed transcripts in both SRT and VTT format, which can be used for subtitles and captions in videos.
You can use the subtitles feature alongside the translation feature.
You’ll get your subtitles in the original language, and also in the languages you targeted for translation.
{
"audio_url": "YOUR_AUDIO_URL",
"subtitles": true,
"subtitles_config": {
"formats": ["srt", "vtt"]
}
}
The JSON response will include a new subtitles property, which is an array with one item for each format you requested.
With the given example, subtitles will contain 2 items of the following shape:
{
"format": "srt", //format name
"subtitles": "1\n00:00:00,210 --> 00:00:04,711....." // subtitles
}
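Each item carries a format name and the caption text, so saving them to disk is mostly a matter of choosing file names. The sketch below assumes you have already extracted the `subtitles` array from the response; the function name and the `stem` argument are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def save_subtitles(
    subtitles: List[Dict], directory: str, stem: str = "transcript"
) -> List[Path]:
    """Write one caption file per requested format and return the paths."""
    written = []
    for item in subtitles:
        # Use the format name ("srt", "vtt") as the file extension.
        path = Path(directory) / f"{stem}.{item['format']}"
        path.write_text(item["subtitles"], encoding="utf-8")
        written.append(path)
    return written
```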
Context prompt
If you know the context of the audio you’re sending, you can provide it in the context_prompt parameter.
{
"audio_url": "YOUR_AUDIO_URL",
"context_prompt": "A conversation between Sansa Stark and Peter Baelish from the Game of Thrones series."
}
Custom vocabulary
To improve transcription accuracy, especially for words or phrases that recur often in your audio file, you can use the custom_vocabulary feature in the transcription configuration settings.
The custom vocabulary has the following limitations:
- global limit of 10k characters
- no more than 100 elements
- each element should not contain more than 5 words
{
"audio_url": "YOUR_AUDIO_URL",
"custom_vocabulary": ["westeros", "stark", "night's watch"]
}
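The limits listed above can be checked on the client before sending a request. The following is a sketch under two assumptions: the 10k-character limit is interpreted as the total length of all elements, and words are counted by splitting on whitespace. The function name is hypothetical.

```python
from typing import List

MAX_TOTAL_CHARS = 10_000
MAX_ELEMENTS = 100
MAX_WORDS_PER_ELEMENT = 5


def validate_custom_vocabulary(vocabulary: List[str]) -> List[str]:
    """Return a list of human-readable problems; empty means it looks valid."""
    problems = []
    if len(vocabulary) > MAX_ELEMENTS:
        problems.append(f"{len(vocabulary)} elements (max {MAX_ELEMENTS})")
    # Assumption: the global limit applies to the summed element lengths.
    total_chars = sum(len(element) for element in vocabulary)
    if total_chars > MAX_TOTAL_CHARS:
        problems.append(f"{total_chars} characters (max {MAX_TOTAL_CHARS})")
    for element in vocabulary:
        # Assumption: words are separated by whitespace.
        if len(element.split()) > MAX_WORDS_PER_ELEMENT:
            problems.append(f"too many words in {element!r}")
    return problems
```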
Custom spelling
You can customize how certain words, names or phrases are spelled in the final transcript.
To use custom spelling, provide a dictionary through the custom_spelling_config parameter. This dictionary should contain the correct spelling as the key and a list of one or more possible variations as the value.
Custom spelling is useful in scenarios where consistent spelling of specific words is crucial (e.g., technical terms in industry-specific recordings).
The keys in the dictionary are case sensitive, while the values aren’t. Values can contain multiple words.
{
"custom_spelling": true,
"custom_spelling_config": {
"spelling_dictionary": {
"Gorish": ["ghorish", "gaurish", "gaureish"],
"Data Science": ["data-science", "data science"],
".": ["period", "full stop"],
"SQL": ["sequel"]
}
}
}
In this example, the model will ensure that “Gorish” is spelled correctly throughout the transcript, even if it is pronounced in various ways such as “ghorish,” “gaurish,” or “gaureish.”
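To make the key/value semantics concrete, here is a small local illustration of the mapping: variants are matched without regard to case, and the key is written out exactly as given. This is only a sketch of the idea applied to plain text after the fact; it is not how Gladia applies custom spelling internally, and the function name is hypothetical.

```python
import re
from typing import Dict, List


def apply_spelling(text: str, spelling_dictionary: Dict[str, List[str]]) -> str:
    """Replace every variant (case-insensitive) with its key (case kept)."""
    for correct, variants in spelling_dictionary.items():
        # Try longer variants first so a short one cannot pre-empt them.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = r"\b" + re.escape(variant) + r"\b"
            # A function replacement keeps `correct` literal (no escapes).
            text = re.sub(
                pattern, lambda _m, c=correct: c, text, flags=re.IGNORECASE
            )
    return text
```

For instance, applying the "Gorish", "Data Science", and "SQL" entries from the example to "I met Ghorish, who studies data science and sequel." yields "I met Gorish, who studies Data Science and SQL."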
Name consistency
You can ask the model to enforce consistent spelling of names using the name_consistency parameter. This will ensure the same name is spelled in the same manner throughout the transcript, at the cost of a small amount of added processing time.
This is especially useful for scenarios where people’s names may be mentioned multiple times, but these names are not known in advance (e.g. recruitment call recordings). To ensure correct spelling of names which are known in advance, use the custom vocabulary.
{
"audio_url": "YOUR_AUDIO_URL",
"name_consistency": true
}
Dual-channel or Multiple channels transcription
If your audio file has multiple channels with different content on each, the Gladia API transcribes all of them automatically.
In the transcription result, each utterance has a channel key indicating the channel it came from.
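Since every utterance carries a `channel` key, the result can be regrouped per channel with a few lines of code. A minimal sketch, assuming only that key; the function name is hypothetical, and utterances are kept as-is without relying on any other field.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_channel(utterances: List[Dict]) -> Dict[int, List[Dict]]:
    """Group utterances by their "channel" key, preserving their order."""
    grouped: Dict[int, List[Dict]] = defaultdict(list)
    for utterance in utterances:
        grouped[utterance["channel"]].append(utterance)
    return dict(grouped)
```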
Sending audio with 2 different channels (that do not contain the same audio data) is billed twice, as 2 separate audio files. If your audio has multiple channels with the same audio content on each channel, it is billed only once.
TL;DR: We charge for every unique channel in an audio file; duplicate channels are not charged.
Adding custom metadata
You can add metadata to your transcription using the custom_metadata input in your POST request to the /v2/transcription endpoint.
This lets you recognize your transcription when you retrieve its data from the GET /v2/transcription/{id} endpoint and, more importantly, lets you use it as a filter in the GET /v2/transcription list endpoint.
For example, you can add the following when asking for a transcription:
"custom_metadata": {
"internalUserId": 2348739875894375,
"paymentMethod": {
"last4Digits": 4576
},
"internalUserName": "Spencer"
}
And then, use the following GET request to filter results like:
https://api.gladia.io/v2/transcription?custom_metadata={"internalUserId": "2348739875894375"}
or
https://api.gladia.io/v2/transcription?custom_metadata={"paymentMethod": {"last4Digits": 4576}, "internalUserName": "Spencer"}
custom_metadata cannot be longer than 2000 characters when stringified.
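Building the filter URL by hand is error-prone because the JSON value has to be URL-encoded. The sketch below uses only the standard library and the list endpoint shown above. How the API stringifies metadata when measuring the 2000-character limit is not specified here, so compact JSON is used as an assumption; the function name is hypothetical.

```python
import json
from typing import Dict
from urllib.parse import urlencode

BASE_URL = "https://api.gladia.io/v2/transcription"
MAX_METADATA_CHARS = 2000


def metadata_filter_url(custom_metadata: Dict) -> str:
    """Return a list-endpoint URL filtered by the given metadata."""
    # Assumption: compact JSON approximates the "stringified" length.
    encoded = json.dumps(custom_metadata, separators=(",", ":"))
    if len(encoded) > MAX_METADATA_CHARS:
        raise ValueError(
            f"custom_metadata is {len(encoded)} characters "
            f"(max {MAX_METADATA_CHARS})"
        )
    # urlencode percent-encodes braces, quotes, and colons for us.
    return f"{BASE_URL}?{urlencode({'custom_metadata': encoded})}"
```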