The core functionality of the Gladia API is its Speech Recognition model, designed to convert spoken language into written text. This serves as the basis for all Gladia API offerings.

Do you want to know more about Gladia latest speech-to-text recognition AI model? Discover our state-of-the-art ASR model Whisper Zero now.

Additional capabilities, like Speaker Diarization, Summarization, Translation, Custom Prompts and more can be integrated seamlessly into the transcription process by including extra parameters in the transcription request.

Word-level timestamps

Instead of just getting utterances start and end timestamps, Gladia Speech-to-text API provides by default the Word-level timestamps feature. It lets you know the exact timestamp for each word and give you a more precise transcription. This feature is particularly useful for detailed analysis, as it allows you to pinpoint the exact moment each word is spoken, facilitating a more accurate synchronization with audio or video files.

Under each utterance, you’ll find a words property like this:

// other properties...
"utterances": [
    {
      "words": [
        {
          "word": "Split",
          "start": 0.21001999999999998,
          "end": 0.69015,
          "confidence": 1
        },
        {
          "word": " infinity",
          "start": 0.91021,
          "end": 1.55038,
          "confidence": 0.95
        },
        ...
      ]
    }
  ]

Sentences

In addition to getting the transcription split by utterances, you can request to semantically segment the transcription to sentences, providing a more human readable result.

You can get translated sentences by enabling both sentences and translation! You’ll receive sentences output for the the original transcript, and also each translation result will contain the sentences output in the translated language!

request data
{
  "sentences": true
}

The result will contain a sentences key (in addition to utterances):

		"sentences": {
			"success": true,
			"is_empty": false,
			"results": [
				{
					"sentence": "Amy, it says you are trained in technology.",
					"start": 0.4681999999999999,
					"end": 2.45525,
					"words": [...],
					"confidence": 0.95,
					"language": "en",
					"speaker": 0,
					"channel": 0
				},
				{
					"sentence": "That's very good.",
					"start": 2.51546,
					"end": 3.5992999999999995,
					"words": [...],
					"confidence": 0.96,
					"language": "en",
					"speaker": 0,
					"channel": 0
				},
        ...
      ]
    }

Automatic language detection

With automatic language detection, Gladia will identify the dominant language spoken in an audio file and use it during the transcription. This behaviour can be toggled with the detect_language parameter.

Automatic language detection is turned on by default.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "detect_language": true
}

Multiple languages detection
(Code switching)

By enabling code switching, the model will continuously detect the spoken language and switch the transcription language accordingly. This behaviour is recommended for specific scenarios where the language is changed multiple times throughout the audio (e.g. a conversation between 2 people, each speaking a different language.),

Please note that certain strong accents may possibly cause this mode to transcribe to the wrong language.

To enable or disable it, set enable_code_switching to true in the request body parameters. (default to false)

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "enable_code_switching": true
}

Guided code switching

When code switching is enabled, you may provide a list of languages to the model, ensuring the model will only detect these languages.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "enable_code_switching": true,
  "code_switching_config": {
    "languages": ["en", "es", "fr"]
  }
}

Manual transcription language

If you already know the dominant language, you can disable language detection by setting detect_language to false and manually set the the language with the language key.

In order to use manual language detect_language must be disabled, otherwise the language parameter will be ignored.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "detect_language": false,
  "language": "fr"
}

Export SRT or VTT caption files

You can export completed transcripts in both SRT and VTT format, which can be used for subtitles and captions in videos.

You can use the subtitles feature alongside the translation feature. You’ll have your subtitles in the original language, and also in languages you targeted for the translation!

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "subtitles": true,
  "subtitles_config": {
    "formats": ["srt", "vtt"],
    "minimum_duration": 1,
    "maximum_duration": 5,
    "maximum_characters_per_row": 42,
    "maximum_rows_per_caption": 2,
    "style": "compliance"
  }
}

The subtitles_config object supports the following options:

  • formats: Array of subtitle formats to generate (options: “srt”, “vtt”)
  • minimum_duration: Minimum duration of a subtitle in seconds (minimum: 0)
  • maximum_duration: Maximum duration of a subtitle in seconds (minimum: 1, maximum: 30)
  • maximum_characters_per_row: Maximum number of characters per row in a subtitle (minimum: 1)
  • maximum_rows_per_caption: Maximum number of rows per caption (minimum: 1, maximum: 5)
  • style: Style of the subtitles. Options are:

The JSON response will include a new property subtitles which is an array of every formats you requested. With the given example, subtitles will contains 2 items of shape:

{
  "format": "srt", //format name
  "subtitles": "1\n00:00:00,210 --> 00:00:04,711....." // subtitles
}

Context prompt

If you know the context of the audio you’re sending, you can provide it in the context_prompt.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "context_prompt": "A conversation between Sansa Stark and Peter Baelish from the Game of Thrones series.",
}

Custom vocabulary

To enhance the precision of transcription, especially for words or phrases that recur often in your audio file, you can utilize the custom_vocabulary feature in the transcription configuration settings.
The custom vocabulary has the following limitation:

  • global limit of 10k characters
  • no more than 100 elements
  • each element should not contain more than 5 words
request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "custom_vocabulary": ["westeros", "stark", "night's watch"]
}

Custom spelling

You can customize how certain words, names or phrases are spelled in the final transcript.
To use custom spelling, provide a dictionary through the custom_spelling_config parameter. This dictionary should contain the correct spelling as the key and a list of one or more possible variations as the value.

Custom spelling is useful in scenarios where consistent spelling of specific words is crucial (e.g., technical terms in industry-specific recordings).

The keys in the dictionary are case sensitive, while the values aren’t. Values can contain multiple words.

request data
{
  "custom_spelling": true,
  "custom_spelling_config": {
    "spelling_dictionary": {
        "Gorish": ["ghorish", "gaurish", "gaureish"],
        "Data Science": ["data-science", "data science"],
        ".": ["period", "full stop"],
        "SQL": ["sequel"]
    }
  }
}

In this example, the model will ensure that “Gorish” is spelled correctly throughout the transcript, even if it is pronounced in various ways such as “ghorish,” “gaurish,” or “gaureish.”

Name consistency

You can ask the model to enforce consistent spelling of names using the name_consistency parameter. This will ensure the same name is spelled in the same manner throughout the transcript, at the cost of a small amount of added processing time.

This is especially useful for scenarios where people’s names may be mentioned multiple times, but these names are not known in advance (e.g. recruitment call recordings). To ensure correct spelling of names which are known in advance, use the custom vocabulary.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "name_consistency": true
}

Dual-channel or Multiple channels transcription

If you have multiples channels in your audio file with different content each, Gladia API automatically transcribe them. In the transcription result, you will get for each utterances a channel key corresponding to the channels the transcription came from.

Sending an audio with 2 different channels (that does not contains the same audio data), will be billed twice as 2 different audios. If your audio has multiple channels but has the same audio content on each channels, it will only billed once.

TLDR: We charge every unique channel in an audio file, we do not charge if channels are duplicates.

Adding custom metadata

You can add metadata to your transcription using the custom_metadata input during your POST request on /v2/pre-recorded endpoint. This will allow you to recognize your transcription when you get its data from the GET /v2/pre-recorded/:id endpoint, but more important, it will allow you to use it as a filter in the GET /v2/pre-recorded list endpoint. For example, you can add the following when asking for a transcription:

"custom_metadata": {
    "internalUserId": 2348739875894375,
    "paymentMethod": {
        "last4Digits": 4576
     },
     "internalUserName": "Spencer"
}

And then, use the following GET request to filter results like:

https://api.gladia.io/v2/pre-recorded?custom_metadata={"internalUserId": "2348739875894375"}

or

https://api.gladia.io/v2/pre-recorded?custom_metadata={"paymentMethod": {"last4Digits": 4576}, "internalUserName": "Spencer"}

custom_metadata cannot be longer than 2000 characters when stringified.