Speech Recognition
Core feature of the Gladia API
The core functionality of the Gladia API is its Speech Recognition model, designed to convert spoken language into written text. This serves as the basis for all Gladia API offerings.
Do you want to know more about Gladia's latest speech-to-text AI model? Discover our state-of-the-art ASR model, Whisper Zero, now.
Additional capabilities, like Speaker Diarization, Summarization, Translation, Custom Prompts and more, can be integrated seamlessly into the transcription process by including extra parameters in the transcription request.
Sending a transcription request
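To start a transcription, send a POST request with your audio URL to the `/v2/transcription` endpoint (the same endpoint referenced in the custom metadata section below). Here is a minimal sketch in Python using the `requests` library; the `x-gladia-key` header name and the `id` field read from the response are assumptions to verify against the API reference:

```python
import requests

GLADIA_API_KEY = "YOUR_GLADIA_API_KEY"  # replace with your own key

# Minimal transcription request: only the audio URL is required.
# Extra features (diarization, translation, subtitles, ...) are added
# as additional parameters in the same JSON body.
response = requests.post(
    "https://api.gladia.io/v2/transcription",
    headers={"x-gladia-key": GLADIA_API_KEY},
    json={"audio_url": "YOUR_AUDIO_URL"},
)
response.raise_for_status()

job = response.json()
print(job["id"])  # keep the id to fetch the result later
```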
Once your audio has been processed, here's an example of the output you should get:
```
Split infinity in a time when less is more, where too much is never enough. There is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet.
```
Getting the result of a request
You can get your transcription results in 3 different ways:
Transcription job status
The transcription status can have the following values:
| Status | Description |
| --- | --- |
| `queued` | Audio waiting to be processed |
| `processing` | Audio file being processed |
| `done` | Transcription successfully completed |
| `error` | An error occurred on your transcription |
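As a sketch, assuming the job id returned when the transcription was created, you can poll the `GET /v2/transcription/{id}` endpoint mentioned later on this page until the `status` field reaches `done` or `error`:

```python
import time

import requests

def wait_for_transcription(transcription_id: str, gladia_key: str) -> dict:
    """Poll the transcription until it is done, raising if it failed."""
    url = f"https://api.gladia.io/v2/transcription/{transcription_id}"
    while True:
        result = requests.get(url, headers={"x-gladia-key": gladia_key}).json()
        status = result["status"]  # queued | processing | done | error
        if status == "done":
            return result
        if status == "error":
            raise RuntimeError("Transcription failed; consider resubmitting the audio")
        time.sleep(1)  # still queued or processing, check again shortly
```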
Transcriptions can fail for various reasons:
- No audio in the audio file
- Audio URL unreachable
- Issues with your file format
If you get another type of failure (most likely a server failure), resubmit the audio file and another server will take care of processing it.
Word-level timestamps
Instead of just getting the start and end timestamps of each utterance, the Gladia Speech-to-Text API provides word-level timestamps by default. This gives you the exact timestamp of each word and a more precise transcription. It is particularly useful for detailed analysis, as it allows you to pinpoint the exact moment each word is spoken, facilitating more accurate synchronization with audio or video files.
Under each utterance, you’ll find a `words` property like this:
```json
{
  // other properties...
  "utterances": [
    {
      "words": [
        {
          "word": "Split",
          "start": 0.21001999999999998,
          "end": 0.69015,
          "confidence": 1
        },
        {
          "word": " infinity",
          "start": 0.91021,
          "end": 1.55038,
          "confidence": 0.95
        },
        ...
      ]
    }
  ]
}
```
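For illustration, and assuming the `utterances` structure shown above (its exact position in the full response is abbreviated here as "other properties"), the snippet below flattens the per-word timings into a simple timeline, which is handy for syncing with audio or video:

```python
def word_timeline(utterances: list[dict]) -> list[tuple[float, float, str]]:
    """Flatten utterances into (start, end, word) tuples for alignment."""
    timeline = []
    for utterance in utterances:
        for word in utterance["words"]:
            timeline.append((word["start"], word["end"], word["word"].strip()))
    return timeline

# With the structure shown above:
# word_timeline(payload["utterances"])
# -> [(0.21..., 0.69..., "Split"), (0.91..., 1.55..., "infinity"), ...]
```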
Sentences
In addition to getting the transcription split by utterances, you can request that the transcription be semantically segmented into sentences, providing a more human-readable result.
You can get translated sentences by enabling both `sentences` and `translation`! You’ll receive the sentences output for the original transcript, and each `translation` result will also contain the sentences output in the translated language!
```json
{
  "sentences": true
}
```
The result will contain a `sentences` key (in addition to `utterances`):
"sentences": {
"success": true,
"is_empty": false,
"results": [
{
"sentence": "Amy, it says you are trained in technology.",
"start": 0.4681999999999999,
"end": 2.45525,
"words": [...],
"confidence": 0.95,
"language": "en",
"speaker": 0,
"channel": 0
},
{
"sentence": "That's very good.",
"start": 2.51546,
"end": 3.5992999999999995,
"words": [...],
"confidence": 0.96,
"language": "en",
"speaker": 0,
"channel": 0
},
...
]
}
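As a small example based on the shape above, you can turn the `results` array back into a readable, speaker-labelled transcript:

```python
def readable_transcript(sentences_output: dict) -> str:
    """Join the sentence results into a speaker-labelled transcript."""
    lines = [
        f"Speaker {item['speaker']}: {item['sentence']}"
        for item in sentences_output["results"]
    ]
    return "\n".join(lines)

# With the example above, this yields:
# "Speaker 0: Amy, it says you are trained in technology."
# "Speaker 0: That's very good."
```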
Language behaviour
If you know the language used in the audio, you should use manual language configuration; otherwise, use automatic language detection.
Export SRT or VTT caption files
You can export completed transcripts in both SRT and VTT format, which can be used for subtitles and captions in videos.
You can use the `subtitles` feature alongside the `translation` feature.
You’ll have your subtitles in the original language, and also in the languages you targeted for the translation!
```json
{
  "audio_url": "YOUR_AUDIO_URL",
  "subtitles": true,
  "subtitles_config": {
    "formats": ["srt", "vtt"]
  }
}
```
The JSON response will include a new property, `subtitles`, which is an array containing every format you requested.
With the given example, `subtitles` will contain 2 items of the shape:
```json
{
  "format": "srt", // format name
  "subtitles": "1\n00:00:00,210 --> 00:00:04,711....." // subtitles
}
```
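As a sketch based on the shape above, each entry of the `subtitles` array can be written to its own caption file:

```python
def save_subtitles(subtitles: list[dict], basename: str = "captions") -> None:
    """Write each requested subtitle format (srt, vtt, ...) to its own file."""
    for entry in subtitles:
        with open(f"{basename}.{entry['format']}", "w", encoding="utf-8") as f:
            f.write(entry["subtitles"])

# With both formats requested as in the example above, this writes
# captions.srt and captions.vtt.
```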
Context prompt
If you know the context of the audio you’re sending, you can provide it in the `context_prompt` parameter.
```json
{
  "audio_url": "YOUR_AUDIO_URL",
  "context_prompt": "A conversation between Sansa Stark and Peter Baelish from the Game of Thrones series."
}
```
Custom vocabulary
To enhance the precision of transcription, especially for words or phrases that recur often in your audio file, you can use the `custom_vocabulary` feature in the transcription configuration settings.
The custom vocabulary has a global limit of 10k characters.
```json
{
  "audio_url": "YOUR_AUDIO_URL",
  "custom_vocabulary": ["westeros", "stark", "night's watch"]
}
```
Dual-channel or Multiple channels transcription
If your audio file has multiple channels with different content on each, the Gladia API automatically transcribes them.
In the transcription result, each utterance will have a `channel` key corresponding to the channel the transcription came from.
Sending an audio file with 2 different channels (that do not contain the same audio data) will be billed twice, as 2 different audios. If your audio has multiple channels with the same audio content on each channel, it will only be billed once.
TL;DR: We charge for every unique channel in an audio file; we do not charge for duplicate channels.
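As an illustration, assuming each utterance carries the `channel` key described above together with its `words` array, you can rebuild a per-channel transcript like this:

```python
from collections import defaultdict

def transcript_by_channel(utterances: list[dict]) -> dict[int, str]:
    """Group utterances by channel, rebuilding text from the word-level timestamps."""
    channels = defaultdict(list)
    for utterance in utterances:
        text = "".join(word["word"] for word in utterance["words"])
        channels[utterance["channel"]].append(text.strip())
    return {channel: " ".join(parts) for channel, parts in channels.items()}
```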
Adding custom metadata
You can add metadata to your transcription using the `custom_metadata` input during your POST request on the `/v2/transcription` endpoint.
This will allow you to recognize your transcription when you get its data from the `GET /v2/transcription/{id}` endpoint, and more importantly, it will allow you to use it as a filter in the `GET /v2/transcription` list endpoint.
For example, you can add the following when asking for a transcription:
"custom_metadata": {
"internalUserId": 2348739875894375,
"paymentMethod": {
"last4Digits": 4576
},
"internalUserName": "Spencer"
}
And then, use a GET request like the following to filter results:
https://api.gladia.io/v2/transcription?custom_metadata={"internalUserId": "2348739875894375"}
or
https://api.gladia.io/v2/transcription?custom_metadata={"paymentMethod": {"last4Digits": 4576}, "internalUserName": "Spencer"}
`custom_metadata` cannot be longer than 2000 characters when stringified.