The core functionality of the Gladia API is its Speech Recognition model, designed to convert spoken language into written text. This serves as the basis for all Gladia API offerings.

Do you want to know more about Gladia's latest speech-to-text recognition AI model? Discover our state-of-the-art ASR model, Whisper Zero, now.

Additional capabilities, such as Speaker Diarization, Summarization, Translation, and Custom Prompts, can be integrated seamlessly into the transcription process by including extra parameters in the transcription request.

Language configuration

Single language

If you know the language of the conversation in advance, specify it in the language_config.languages parameter to ensure the best transcription results.

{
  ...rest of the config
  "language_config": {
    "languages": ["en"]
  }
}

If the spoken language is unknown, you can:

  • Omit the language_config.languages parameter; the model will automatically detect the language from the audio across all supported languages.
  • Specify multiple languages in the language_config.languages parameter; the model will detect the language from the audio within the provided options.
{
  ...rest of the config
  // Limits detection to English, French or Spanish
  "language_config": {
    "languages": ["en", "fr", "es"]
  }
}

Multiple languages (code-switching)

If you expect multiple languages to be spoken during the conversation, enable the language_config.code_switching parameter. This will allow the model to switch languages dynamically and reflect it in the transcription results.

As with single-language configuration, you can either let the model detect the language from all supported languages or specify a set of options to narrow down the selection.

{
  ...rest of the config
  "language_config": {
    // Limits detection to English, French or Spanish
    "languages": ["en", "fr", "es"],
    "code_switching": true
  }
}

In both single-language and multiple-language configurations, it is recommended to limit the number of candidate languages to avoid incorrect detection. Some languages, such as those spoken in Eastern European countries, sound similar, which may cause the model to confuse them and produce a transcription in the wrong language.
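With code_switching enabled, each utterance in the result carries the language it was transcribed in. Below is a minimal Python sketch of reading it; the envelope path (result["transcription"]["utterances"]) and field names are assumptions based on the response examples later on this page:

# Minimal sketch: print the detected language of each utterance in a
# code-switched transcription. The envelope path and field names are
# assumed from the response examples shown on this page.
result = {
    "transcription": {
        "utterances": [
            {"language": "en", "text": "Hello everyone."},
            {"language": "fr", "text": "Bonjour à tous."},
        ]
    }
}

for utterance in result["transcription"]["utterances"]:
    print(f'[{utterance["language"]}] {utterance["text"]}')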

Enhanced punctuation

This feature is in Alpha.

  • It may have restricted access in the future.
  • Breaking changes could still be introduced; however, advance notice will be provided.
  • Results may vary as we are updating the feature.

Enhanced punctuation improves both the accuracy and natural flow of punctuation in transcriptions. It ensures precise comma placement, natural sentence breaks, and better handling of quotation marks.

Standard: “hello how are you today I am doing fine thanks”
Enhanced: “Hello, how are you today? I am doing fine, thanks!”

Enhanced punctuation is enabled by sending the punctuation_enhanced parameter in the transcription request:

{
  "audio_url": "YOUR_AUDIO_URL",
  "punctuation_enhanced": true
}
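For reference, here is a minimal Python sketch of sending this request with the requests library; the x-gladia-key header name is an assumption here, so adjust it if your account's authentication differs:

import requests

# Minimal sketch: request a transcription with enhanced punctuation.
# The x-gladia-key header name is an assumption; adapt it to your
# authentication setup if needed.
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": "YOUR_API_KEY"},
    json={
        "audio_url": "YOUR_AUDIO_URL",
        "punctuation_enhanced": True,
    },
)
response.raise_for_status()
print(response.json())  # contains the id used to fetch the result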

Word-level timestamps

Instead of just getting utterance start and end timestamps, the Gladia Speech-to-Text API provides the word-level timestamps feature by default. It gives you the exact timestamp of each word, for a more precise transcription. This feature is particularly useful for detailed analysis, as it allows you to pinpoint the exact moment each word is spoken, facilitating more accurate synchronization with audio or video files.

Under each utterance, you’ll find a words property like this:

// other properties...
"utterances": [
    {
      "words": [
        {
          "word": "Split",
          "start": 0.21001999999999998,
          "end": 0.69015,
          "confidence": 1
        },
        {
          "word": " infinity",
          "start": 0.91021,
          "end": 1.55038,
          "confidence": 0.95
        },
        ...
      ]
    }
  ]
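For example, here is a short Python sketch that flattens these word timings, e.g. to align them with a video timeline; the path to utterances in the response envelope is an assumption based on the excerpt above:

# Sketch: list every word with its start/end time. "result" stands
# for the parsed JSON of a completed transcription; the envelope
# path is assumed.
result = {
    "transcription": {
        "utterances": [
            {
                "words": [
                    {"word": "Split", "start": 0.21, "end": 0.69, "confidence": 1},
                    {"word": " infinity", "start": 0.91, "end": 1.55, "confidence": 0.95},
                ]
            }
        ]
    }
}

for utterance in result["transcription"]["utterances"]:
    for word in utterance["words"]:
        print(f'{word["start"]:6.2f}s -> {word["end"]:6.2f}s  {word["word"].strip()}')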

Sentences

In addition to getting the transcription split by utterances, you can request that the transcription be semantically segmented into sentences, providing a more human-readable result.

You can get translated sentences by enabling both sentences and translation! You’ll receive the sentences output for the original transcript, and each translation result will also contain the sentences output in the translated language!

request data
{
  "sentences": true
}

The result will contain a sentences key (in addition to utterances):

		"sentences": {
			"success": true,
			"is_empty": false,
			"results": [
				{
					"sentence": "Amy, it says you are trained in technology.",
					"start": 0.4681999999999999,
					"end": 2.45525,
					"words": [...],
					"confidence": 0.95,
					"language": "en",
					"speaker": 0,
					"channel": 0
				},
				{
					"sentence": "That's very good.",
					"start": 2.51546,
					"end": 3.5992999999999995,
					"words": [...],
					"confidence": 0.96,
					"language": "en",
					"speaker": 0,
					"channel": 0
				},
        ...
      ]
    }
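A short Python sketch of turning this output into a speaker-labelled transcript (field names are taken from the sample above; the surrounding envelope is assumed):

# Sketch: print one line per sentence, labelled by speaker.
# "sentences_result" stands for the "sentences" object shown above.
sentences_result = {
    "success": True,
    "is_empty": False,
    "results": [
        {"sentence": "Amy, it says you are trained in technology.", "speaker": 0},
        {"sentence": "That's very good.", "speaker": 0},
    ],
}

if sentences_result["success"] and not sentences_result["is_empty"]:
    for item in sentences_result["results"]:
        print(f'Speaker {item["speaker"]}: {item["sentence"]}')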

Export SRT or VTT caption files

You can export completed transcripts in both SRT and VTT formats, which can be used for subtitles and captions in videos.

You can use the subtitles feature alongside the translation feature. You’ll have your subtitles in the original language, and also in the languages you targeted for translation!

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "subtitles": true,
  "subtitles_config": {
    "formats": ["srt", "vtt"],
    "minimum_duration": 1,
    "maximum_duration": 5,
    "maximum_characters_per_row": 42,
    "maximum_rows_per_caption": 2,
    "style": "compliance"
  }
}

The subtitles_config object supports the following options:

  • formats: Array of subtitle formats to generate (options: “srt”, “vtt”)
  • minimum_duration: Minimum duration of a subtitle in seconds (minimum: 0)
  • maximum_duration: Maximum duration of a subtitle in seconds (minimum: 1, maximum: 30)
  • maximum_characters_per_row: Maximum number of characters per row in a subtitle (minimum: 1)
  • maximum_rows_per_caption: Maximum number of rows per caption (minimum: 1, maximum: 5)
  • style: Style of the subtitles (e.g. “compliance”, as used in the example above)

The JSON response will include a new subtitles property, an array with one entry per requested format. With the example above, subtitles will contain 2 items of the shape:

{
  "format": "srt", //format name
  "subtitles": "1\n00:00:00,210 --> 00:00:04,711....." // subtitles
}
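For instance, a minimal Python sketch of saving each returned payload to a caption file (assuming subtitles sits at the top level of the completed result, as the shape above suggests):

# Sketch: write each requested subtitle format to its own file.
# "result" stands for the parsed JSON of a completed transcription.
result = {
    "subtitles": [
        {"format": "srt", "subtitles": "1\n00:00:00,210 --> 00:00:04,711\n..."},
        {"format": "vtt", "subtitles": "WEBVTT\n\n00:00:00.210 --> 00:00:04.711\n..."},
    ]
}

for item in result["subtitles"]:
    with open(f"captions.{item['format']}", "w", encoding="utf-8") as f:
        f.write(item["subtitles"])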

Context prompt

If you know the context of the audio you’re sending, you can provide it in the context_prompt.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "context_prompt": "A conversation between Sansa Stark and Peter Baelish from the Game of Thrones series.",
}

Custom vocabulary

To enhance the precision of transcription, especially for words or phrases that recur often in your audio file, you can utilize the custom_vocabulary feature in the transcription configuration settings.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "custom_vocabulary": true,
  "custom_vocabulary_config": {
    "vocabulary": [
      "Westeros", 
      {"value": "Stark"}, 
      { 
        "value": "Night's Watch",
        "pronunciations": ["Nightz Vatch"],
        "intensity": 0.4,
        "language": "de"
      }
    ],
    "default_intensity": 0.6
  }
}

Using a string in place of an object is equivalent to using {"value": "string"}, as the sketch after the following list illustrates.

  • default_intensity: [optional] The global intensity of the feature (minimum 0, maximum 1, default 0.5).
  • vocabulary.value: [required] The text to use as a replacement in the transcription.
  • vocabulary.pronunciations: [optional] The pronunciations used in the transcription language, or vocabulary.language if present.
  • vocabulary.intensity: [optional] The intensity of the feature for this particular word (minimum 0, maximum 1, default 0.5).
  • vocabulary.language: [optional] The language in which the word will be pronounced when sound comparison occurs. Defaults to the transcription language.
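A small Python sketch of normalizing a mixed list of entries before sending it (normalize_vocabulary is a local helper of ours, not part of the Gladia API):

# Sketch: normalize mixed vocabulary entries to the object form.
# Plain strings become {"value": ...}; objects pass through as-is.
def normalize_vocabulary(entries):
    return [{"value": e} if isinstance(e, str) else e for e in entries]

config = {
    "custom_vocabulary": True,
    "custom_vocabulary_config": {
        "vocabulary": normalize_vocabulary(
            ["Westeros", {"value": "Night's Watch", "intensity": 0.4}]
        ),
        "default_intensity": 0.6,
    },
}
print(config)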

Custom spelling

You can customize how certain words, names or phrases are spelled in the final transcript.
To use custom spelling, provide a dictionary through the custom_spelling_config parameter. This dictionary should contain the correct spelling as the key and a list of one or more possible variations as the value.

Custom spelling is useful in scenarios where consistent spelling of specific words is crucial (e.g., technical terms in industry-specific recordings).

The keys in the dictionary are case-sensitive, while the values aren’t. Values can contain multiple words.

request data
{
  "custom_spelling": true,
  "custom_spelling_config": {
    "spelling_dictionary": {
        "Gorish": ["ghorish", "gaurish", "gaureish"],
        "Data Science": ["data-science", "data science"],
        ".": ["period", "full stop"],
        "SQL": ["sequel"]
    }
  }
}

In this example, the model will ensure that “Gorish” is spelled correctly throughout the transcript, even if it is pronounced in various ways such as “ghorish,” “gaurish,” or “gaureish.”

Name consistency

You can ask the model to enforce consistent spelling of names using the name_consistency parameter. This will ensure the same name is spelled in the same manner throughout the transcript, at the cost of a small amount of added processing time.

This is especially useful for scenarios where people’s names may be mentioned multiple times, but these names are not known in advance (e.g. recruitment call recordings). To ensure correct spelling of names which are known in advance, use the custom vocabulary.

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "name_consistency": true
}

Dual-channel or multi-channel transcription

If your audio file has multiple channels, each with different content, the Gladia API automatically transcribes all of them. In the transcription result, each utterance has a channel key corresponding to the channel the transcription came from.

Sending an audio file with 2 different channels (that do not contain the same audio data) will be billed twice, as 2 different audios. If your audio has multiple channels with the same audio content on each channel, it will only be billed once.

TL;DR: We charge for every unique channel in an audio file; we do not charge for duplicate channels.
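To work with the per-utterance channel key, something like this Python sketch can group a result by channel (envelope path and text field name assumed, as before):

from collections import defaultdict

# Sketch: group utterance texts by the channel they came from.
# "result" stands for the parsed JSON of a completed transcription.
result = {
    "transcription": {
        "utterances": [
            {"channel": 0, "text": "Hello, this is channel zero."},
            {"channel": 1, "text": "And this is channel one."},
        ]
    }
}

by_channel = defaultdict(list)
for utterance in result["transcription"]["utterances"]:
    by_channel[utterance["channel"]].append(utterance["text"])

for channel in sorted(by_channel):
    print(f'Channel {channel}: {" ".join(by_channel[channel])}')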

Adding custom metadata

You can add metadata to your transcription using the custom_metadata input in your POST request to the /v2/pre-recorded endpoint. This will allow you to recognize your transcription when you get its data from the GET /v2/pre-recorded/:id endpoint, but more importantly, it will allow you to use it as a filter in the GET /v2/pre-recorded list endpoint. For example, you can add the following when requesting a transcription:

"custom_metadata": {
    "internalUserId": 2348739875894375,
    "paymentMethod": {
        "last4Digits": 4576
     },
     "internalUserName": "Spencer"
}

And then, use the following GET request to filter results like:

https://api.gladia.io/v2/pre-recorded?custom_metadata={"internalUserId": "2348739875894375"}

or

https://api.gladia.io/v2/pre-recorded?custom_metadata={"paymentMethod": {"last4Digits": 4576}, "internalUserName": "Spencer"}

custom_metadata cannot be longer than 2000 characters when stringified.
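Since the filter is JSON embedded in a query string, it must be URL-encoded in practice. Here is a Python sketch using requests, which handles the encoding via params (the x-gladia-key header name is assumed, as above):

import json
import requests

# Sketch: list transcriptions filtered by custom_metadata.
# Passing the JSON filter through "params" lets requests
# URL-encode it for the query string.
metadata_filter = {"internalUserId": "2348739875894375"}
response = requests.get(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": "YOUR_API_KEY"},  # header name assumed
    params={"custom_metadata": json.dumps(metadata_filter)},
)
response.raise_for_status()
print(response.json())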