Language configuration
Single language
If you know the language of the conversation in advance, specify it in the language_config.languages parameter to ensure the best transcription results.
If you do not know the language in advance, you can either:
- Omit the language_config.languages parameter; the model will automatically detect the language from the audio across all supported languages.
- Specify multiple languages in the language_config.languages parameter; the model will detect the language from the audio within the provided options.
Multiple languages (code-switching)
If you expect multiple languages to be spoken during the conversation, enable the language_config.code_switching parameter. This will allow the model to switch languages dynamically and reflect it in the transcription results.
As with single-language configuration, you can either let the model detect the language from all supported languages or specify a set of options to narrow down the selection.
It is recommended to limit the number of languages to avoid incorrect
detection, in both single-language and multi-language configurations. Some
languages, such as those from Eastern European countries, have similar sounds,
which may cause the model to confuse them and produce a transcription in the
wrong language.
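The language options above can be sketched as a small payload builder. This is a minimal sketch, assuming the dotted names language_config.languages and language_config.code_switching map to a nested language_config object in the request body; the helper name build_language_config and the language codes "en" and "fr" are illustrative assumptions, not taken from the Gladia reference.

```python
def build_language_config(languages=None, code_switching=False):
    """Build the language part of a transcription request body (assumed shape)."""
    config = {}
    if languages:
        # One entry: the language is known in advance.
        # Several entries: detection is limited to these options.
        config["languages"] = list(languages)
    # Without a "languages" key, detection runs across all supported languages.
    if code_switching:
        config["code_switching"] = True
    return {"language_config": config}


single = build_language_config(["en"])
narrowed = build_language_config(["en", "fr"], code_switching=True)
auto_detect = build_language_config()
```

Keeping the list of languages short follows the recommendation above: fewer candidates leave less room for confusion between similar-sounding languages.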
Enhanced punctuation
This feature is in Alpha.
- It may have restricted access in the future.
- Breaking changes could still be introduced; however, advance notice will be provided.
- Results may vary as we are updating the feature.
Enhanced: “Hello, how are you today? I am doing fine, thanks!”
Enhanced punctuation is enabled by sending the punctuation_enhanced parameter in the transcription request.
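As a rough illustration of sending this parameter, the sketch below adds it to a request body. It assumes punctuation_enhanced is a boolean at the top level of the body; the audio_url field and the example URL are hypothetical placeholders, not confirmed by this page.

```python
def with_enhanced_punctuation(request_body):
    """Return a copy of the request body with enhanced punctuation switched on."""
    updated = dict(request_body)
    updated["punctuation_enhanced"] = True  # assumed to be a boolean flag
    return updated


base_body = {"audio_url": "https://example.com/audio.wav"}  # hypothetical field and URL
punctuated_body = with_enhanced_punctuation(base_body)
```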
Word-level timestamps
Instead of only returning start and end timestamps for each utterance, the Gladia Speech-to-text API provides word-level timestamps by default. You get the exact timestamp of each word, which gives you a more precise transcription. This is particularly useful for detailed analysis, because you can pinpoint the moment each word is spoken and synchronize the transcript more accurately with audio or video files. Under each utterance, you’ll find a words property.
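To show how word-level timestamps can support synchronization, here is a minimal sketch that looks up the word being spoken at a given playback time. The words property under each utterance comes from the text above; the per-word keys word, start and end (in seconds) are assumptions about the response shape, and the sample data is made up.

```python
def word_at(utterances, time_s):
    """Return the word being spoken at time_s (seconds), or None if there is none."""
    for utterance in utterances:
        for word in utterance.get("words", []):
            # Assumed keys: "word", "start", "end".
            if word["start"] <= time_s < word["end"]:
                return word["word"]
    return None


sample_utterances = [
    {
        "words": [
            {"word": "Hello", "start": 0.0, "end": 0.4},
            {"word": "world", "start": 0.5, "end": 0.9},
        ]
    }
]
```

A video player could call word_at on each time update to highlight the current word.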
Sentences
In addition to getting the transcription split by utterances, you can request semantic segmentation of the transcription into sentences, which provides a more human-readable result.
You can get translated sentences by enabling both sentences and translation. You’ll receive the sentences output for the original transcript, and each translation result will also contain the sentences output in the translated language.
The transcription result then includes a sentences key (in addition to utterances).
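The sketch below shows one way to request sentences and then read them back. It assumes sentences and translation are boolean flags in the request body (translation may need further configuration that this page does not describe) and that the sentences and utterances keys sit side by side in the same result object; both helper names are hypothetical.

```python
def request_sentences(request_body, translated=False):
    """Return a copy of the request body with sentence segmentation enabled."""
    updated = dict(request_body)
    updated["sentences"] = True
    if translated:
        updated["translation"] = True  # assumed flag; extra settings may be required
    return updated


def preferred_segments(transcription):
    """Prefer sentence segments when present, otherwise fall back to utterances."""
    return transcription.get("sentences") or transcription.get("utterances", [])
```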
Export SRT or VTT caption files
You can export completed transcripts in both SRT and VTT format, which can be used for subtitles and captions in videos.
You can use the subtitles feature alongside the translation feature. You’ll have your subtitles in the original language, and also in the languages you targeted for the translation.
The subtitles_config object supports the following options:
- formats: Array of subtitle formats to generate (options: “srt”, “vtt”)
- minimum_duration: Minimum duration of a subtitle in seconds (minimum: 0)
- maximum_duration: Maximum duration of a subtitle in seconds (minimum: 1, maximum: 30)
- maximum_characters_per_row: Maximum number of characters per row in a subtitle (minimum: 1)
- maximum_rows_per_caption: Maximum number of rows per caption (minimum: 1, maximum: 5)
- style: Style of the subtitles. Options are:
  - “default”: Standard subtitle style
  - “compliance”: Follows the compliance mode as described in the Library of Congress Recommended Format Statement
The JSON response will include a new subtitles property, which is an array with one item for each format you requested. For example, if you request both “srt” and “vtt”, subtitles will contain 2 items.
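The subtitles_config options and their limits listed above can be checked before sending a request. The following is a minimal sketch of such a check; the option names and ranges come from the list, while the helper build_subtitles_config is hypothetical and the API may enforce additional rules.

```python
ALLOWED_FORMATS = ("srt", "vtt")
ALLOWED_STYLES = ("default", "compliance")

# (minimum, maximum) per numeric option, as documented; None means no upper bound.
NUMERIC_LIMITS = {
    "minimum_duration": (0, None),
    "maximum_duration": (1, 30),
    "maximum_characters_per_row": (1, None),
    "maximum_rows_per_caption": (1, 5),
}


def build_subtitles_config(formats, style=None, **numeric_options):
    """Validate options against the documented limits and return a config dict."""
    if not formats or any(f not in ALLOWED_FORMATS for f in formats):
        raise ValueError(f"formats must be a non-empty subset of {ALLOWED_FORMATS}")
    config = {"formats": list(formats)}
    if style is not None:
        if style not in ALLOWED_STYLES:
            raise ValueError(f"style must be one of {ALLOWED_STYLES}")
        config["style"] = style
    for name, value in numeric_options.items():
        if name not in NUMERIC_LIMITS:
            raise ValueError(f"unknown option: {name}")
        low, high = NUMERIC_LIMITS[name]
        if value < low or (high is not None and value > high):
            raise ValueError(f"{name}={value} is outside the documented range")
        config[name] = value
    return config


subtitles_config = build_subtitles_config(
    ["srt", "vtt"], style="compliance", maximum_duration=10, maximum_rows_per_caption=2
)
```

Validating locally gives an immediate, readable error instead of a rejected request.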
Context prompt
If you know the context of the audio you’re sending, you can provide it in the context_prompt parameter.
Custom vocabulary
To enhance the precision of transcription, especially for words or phrases that recur often in your audio file, you can use the custom_vocabulary feature in the transcription configuration settings.
{"value": "string"}
- default_intensity: [optional] The global intensity of the feature (minimum 0, maximum 1, default 0.5).
- vocabulary.value: [required] The text used as the replacement in the transcription.
- vocabulary.pronunciations: [optional] The pronunciations used in the transcription language, or in vocabulary.language if present.
- vocabulary.intensity: [optional] The intensity of the feature for this particular word (minimum 0, maximum 1, default 0.5).
- vocabulary.language: [optional] The language in which the word is pronounced when sound comparison occurs. Defaults to the transcription language.
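The fields above can be assembled as follows. This is a sketch under assumptions: it treats vocabulary as a list of objects carrying value, pronunciations, intensity and language, treats pronunciations as a list of strings, and does not show the request key under which this object is sent, since this page does not name it. The helper names and the sample entry are hypothetical.

```python
def _check_intensity(name, intensity):
    # Documented range for both intensity settings is 0 to 1.
    if not 0 <= intensity <= 1:
        raise ValueError(f"{name} must be between 0 and 1, got {intensity}")


def vocabulary_entry(value, pronunciations=None, intensity=None, language=None):
    """Build one vocabulary item; only "value" is required."""
    if not value:
        raise ValueError("value is required")
    entry = {"value": value}
    if pronunciations:
        entry["pronunciations"] = list(pronunciations)  # assumed: list of strings
    if intensity is not None:
        _check_intensity("intensity", intensity)
        entry["intensity"] = intensity
    if language:
        entry["language"] = language
    return entry


def build_vocabulary_config(entries, default_intensity=0.5):
    """Combine the global intensity with the list of vocabulary items."""
    _check_intensity("default_intensity", default_intensity)
    return {"default_intensity": default_intensity, "vocabulary": list(entries)}


vocabulary_config = build_vocabulary_config(
    [vocabulary_entry("Gladia", pronunciations=["gladya"], intensity=0.8, language="en")]
)
```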
Custom spelling
You can customize how certain words, names or phrases are spelled in the final transcript.
To use custom spelling, provide a dictionary through the custom_spelling_config parameter. This dictionary should contain the correct spelling as the key and a list of one or more possible variations as the value.
Custom spelling is useful in scenarios where consistent spelling of specific words is crucial (e.g., technical terms in industry-specific recordings).
The keys in the dictionary are case sensitive, while the values aren’t. Values can contain multiple words.
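To make the dictionary semantics concrete, the sketch below applies such a dictionary to a piece of text locally. It is only an illustration of the mapping described above (correct spelling as key, case-insensitive variations as values, multi-word variations allowed), not how Gladia implements the feature; the function name and the sample entries are hypothetical.

```python
import re


def apply_custom_spelling(text, spelling):
    """Replace every variation (matched case-insensitively) with its key."""
    for correct, variations in spelling.items():
        for variation in variations:
            pattern = re.compile(r"\b" + re.escape(variation) + r"\b", re.IGNORECASE)
            # A function is used as the replacement so the key is inserted literally.
            text = pattern.sub(lambda _match, c=correct: c, text)
    return text


spelling_dictionary = {
    "Gladia": ["gladia", "glad ya"],  # a value can contain multiple words
    "SQL": ["sequel"],
}
fixed = apply_custom_spelling("we use glad ya and Sequel", spelling_dictionary)
```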
Name consistency
You can ask the model to enforce consistent spelling of names using the name_consistency parameter. This will ensure the same name is spelled in the same manner throughout the transcript, at the cost of a small amount of added processing time.
This is especially useful for scenarios where people’s names may be mentioned multiple times, but these names are not known in advance (e.g. recruitment call recordings).
To ensure correct spelling of names which are known in advance, use the custom vocabulary.
Dual-channel or Multiple channels transcription
If your audio file has multiple channels with different content on each, the Gladia API automatically transcribes them. In the transcription result, each utterance has a channel key indicating the channel the transcription came from.
Sending audio with 2 different channels (that do not contain the same audio data) will be billed twice, as 2 different audio files.
If your audio has multiple channels but the same audio content on each channel, it will only be billed once.
TL;DR: We charge for every unique channel in an audio file; we do not charge for duplicate channels.
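Because each utterance carries a channel key, a multi-channel result can be split back into one transcript per channel. The sketch below assumes each utterance also has a text key holding its transcription, which this page does not confirm; the sample utterances are made up.

```python
from collections import defaultdict


def group_by_channel(utterances):
    """Collect utterance texts per channel, preserving their original order."""
    grouped = defaultdict(list)
    for utterance in utterances:
        grouped[utterance["channel"]].append(utterance["text"])  # "text" is assumed
    return dict(grouped)


sample_call = [
    {"channel": 0, "text": "Hi, thanks for calling."},
    {"channel": 1, "text": "Hello, I have a question."},
    {"channel": 0, "text": "Sure, go ahead."},
]
per_channel = group_by_channel(sample_call)
```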
Adding custom metadata
You can add metadata to your transcription using the custom_metadata input during your POST request on the /v2/pre-recorded endpoint.
This will allow you to recognize your transcription when you get its data from the GET /v2/pre-recorded/:id endpoint, but more importantly, it will allow you to use it as a filter in the GET /v2/pre-recorded list endpoint.
For example, you can add the following when asking for a transcription:
custom_metadata cannot be longer than 2000 characters when stringified.
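The size limit can be checked before the request is sent. This is a minimal sketch, assuming custom_metadata is a JSON object and that the length of json.dumps output is a fair approximation of the stringified size (the exact serialization used by the API is not described here). The audio_url field, the URL and the metadata keys are hypothetical.

```python
import json

MAX_METADATA_CHARS = 2000  # limit stated above


def attach_custom_metadata(request_body, metadata):
    """Return a copy of request_body with custom_metadata, enforcing the size limit."""
    serialized = json.dumps(metadata)
    if len(serialized) > MAX_METADATA_CHARS:
        raise ValueError(
            f"custom_metadata is {len(serialized)} characters when stringified; "
            f"the limit is {MAX_METADATA_CHARS}"
        )
    updated = dict(request_body)
    updated["custom_metadata"] = metadata
    return updated


body = attach_custom_metadata(
    {"audio_url": "https://example.com/audio.wav"},  # hypothetical field and URL
    {"user_id": 42, "project": "onboarding-calls"},  # hypothetical metadata keys
)
```

Small, flat keys such as an internal user or project identifier are the easiest to filter on later in the list endpoint.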