The core functionality of the Gladia API is its Speech Recognition model, designed to convert spoken language into written text. This serves as the basis for all Gladia API offerings.

Do you want to know more about Gladia's latest speech-to-text recognition AI model? Discover our state-of-the-art ASR model, Whisper Zero, now.

Additional capabilities, like Speaker Diarization, Summarization, Translation, Custom Prompts, and more, can be integrated seamlessly into the transcription process by including extra parameters in the transcription request.

Sending a transcription request

Don’t forget to replace YOUR_GLADIA_API_TOKEN and YOUR_AUDIO_URL with your own values.
async function makeFetchRequest(url: string, options: any) {
  const response = await fetch(url, options);
  return response.json();
}

async function pollForResult(resultUrl: string, headers: any) {
  while (true) {
    console.log("Polling for results...");
    const pollResponse = await makeFetchRequest(resultUrl, { headers });

    if (pollResponse.status === "done") {
      console.log("- Transcription done: \n ");
      console.log(pollResponse.result.transcription.full_transcript);
      break;
    } else {
      console.log("Transcription status : ", pollResponse.status);
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
  }
}

async function startTranscription() {
  const gladiaKey = "YOUR_GLADIA_API_TOKEN";
  const requestData = {
    audio_url: "YOUR_AUDIO_URL",
  };
  const gladiaUrl = "https://api.gladia.io/v2/transcription/";
  const headers = {
    "x-gladia-key": gladiaKey,
    "Content-Type": "application/json",
  };

  console.log("- Sending initial request to Gladia API...");
  const initialResponse = await makeFetchRequest(gladiaUrl, {
    method: "POST",
    headers,
    body: JSON.stringify(requestData),
  });

  console.log("Initial response with Transcription ID :", initialResponse);

  if (initialResponse.result_url) {
    await pollForResult(initialResponse.result_url, headers);
  }
}

startTranscription();

Once your audio has been processed, here’s an example of the output you should get:

Split infinity in a time when less is more, where too much is never enough. There is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet.

Getting the result of a request

You can get your transcription results in 3 different ways:

Transcription job status

The transcription status can have different values:

Status       Description
queued       Audio waiting to be processed
processing   Audio file being processed
done         Transcription successfully completed
error        An error occurred on your transcription

Transcriptions can fail for a variety of reasons:

  • No audio in the audio file
  • Audio URL unreachable
  • Issues with your file format

If you get another type of failure (most likely a server failure), resubmit the audio file and another server will take care of processing it.
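If you poll for results as in the example above, you can handle the error status explicitly rather than looping forever. Below is a minimal sketch of a variation on the earlier pollForResult function; the exact shape of the error payload is not documented here, so it simply logs the full response when a transcription fails.

async function pollForResultWithErrorHandling(resultUrl: string, headers: any) {
  while (true) {
    const pollResponse = await makeFetchRequest(resultUrl, { headers });

    if (pollResponse.status === "done") {
      // Transcription completed successfully
      return pollResponse.result.transcription.full_transcript;
    }

    if (pollResponse.status === "error") {
      // The exact error fields may vary; log the whole payload for inspection
      console.error("Transcription failed:", pollResponse);
      throw new Error("Transcription failed");
    }

    // Still queued or processing: wait before polling again
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}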

Word-level timestamps

Instead of just getting the start and end timestamps of each utterance, the Gladia Speech-to-Text API provides word-level timestamps by default. It lets you know the exact timestamp of each word, giving you a more precise transcription. This feature is particularly useful for detailed analysis, as it allows you to pinpoint the exact moment each word is spoken, facilitating more accurate synchronization with audio or video files.

Under each utterance, you’ll find a words property like this:

// other properties...
"utterances": [
    {
      "words": [
        {
          "word": "Split",
          "start": 0.21001999999999998,
          "end": 0.69015,
          "confidence": 1
        },
        {
          "word": " infinity",
          "start": 0.91021,
          "end": 1.55038,
          "confidence": 0.95
        },
        ...
      ]
    }
  ]
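For example, once a transcription is done you can walk through each utterance's words array to find when a given word was spoken. The sketch below is illustrative and assumes you pass it the utterances array from the result; note that word strings may carry a leading space, as in the example above.

function findWordTimestamps(utterances: any[], target: string) {
  const matches: { word: string; start: number; end: number }[] = [];
  for (const utterance of utterances) {
    for (const w of utterance.words) {
      // Word strings can carry a leading space, so trim before comparing
      if (w.word.trim().toLowerCase() === target.toLowerCase()) {
        matches.push({ word: w.word.trim(), start: w.start, end: w.end });
      }
    }
  }
  return matches;
}

// Illustrative call: findWordTimestamps(utterances, "infinity")
// → [{ word: "infinity", start: 0.91021, end: 1.55038 }]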

Sentences

In addition to getting the transcription split into utterances, you can request a semantic segmentation of the transcription into sentences, providing a more human-readable result.

You can get translated sentences by enabling both sentences and translation! You’ll receive the sentences output for the original transcript, and each translation result will also contain the sentences output in the translated language.

request data
{
  "sentences": true
}

The result will contain a sentences key (in addition to utterances):

		"sentences": {
			"success": true,
			"is_empty": false,
			"results": [
				{
					"sentence": "Amy, it says you are trained in technology.",
					"start": 0.4681999999999999,
					"end": 2.45525,
					"words": [...],
					"confidence": 0.95,
					"language": "en",
					"speaker": 0,
					"channel": 0
				},
				{
					"sentence": "That's very good.",
					"start": 2.51546,
					"end": 3.5992999999999995,
					"words": [...],
					"confidence": 0.96,
					"language": "en",
					"speaker": 0,
					"channel": 0
				},
        ...
      ]
    }
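As a quick illustration, here is one way to print each sentence with its timing and speaker. This is a sketch that assumes you pass it the sentences object shown above, wherever it sits in your result payload:

function printSentences(sentences: any) {
  if (!sentences?.success || sentences.is_empty) return;

  for (const s of sentences.results) {
    // Timestamps are in seconds, as in the utterances output
    console.log(`[${s.start.toFixed(2)}s - ${s.end.toFixed(2)}s] speaker ${s.speaker}: ${s.sentence}`);
  }
}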

Language behaviour

If you know the language used in the audio, you should set it manually; otherwise, use Automatic Language Detection.
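For illustration only, the request bodies below sketch what each mode could look like. The language and detect_language parameter names are assumptions in this sketch, not confirmed field names, so check the API reference for the exact configuration options.

request data (manual language, parameter name assumed)
{
  "audio_url": "YOUR_AUDIO_URL",
  "language": "en"
}

request data (automatic language detection, parameter name assumed)
{
  "audio_url": "YOUR_AUDIO_URL",
  "detect_language": true
}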

Export SRT or VTT caption files

You can export completed transcripts in both SRT and VTT format, which can be used for subtitles and captions in videos.

You can use the subtitles feature alongside the translation feature. You’ll have your subtitles in the original language, and also in languages you targeted for the translation!

request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "subtitles": true,
  "subtitles_config": {
    "formats": ["srt", "vtt"]
  }
}

The JSON response will include a new subtitles property, which is an array containing one entry per requested format. With the given example, subtitles will contain 2 items of the following shape:

{
  "format": "srt", //format name
  "subtitles": "1\n00:00:00,210 --> 00:00:04,711....." // subtitles
}
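As an example of consuming this array, the following Node.js sketch writes each returned format to a file on disk. It assumes you pass it the subtitles array from the response:

import { writeFileSync } from "node:fs";

// Writes each returned subtitle format to disk, e.g. transcript.srt and transcript.vtt
function saveSubtitles(subtitles: { format: string; subtitles: string }[]) {
  for (const sub of subtitles) {
    writeFileSync(`transcript.${sub.format}`, sub.subtitles);
    console.log(`Saved transcript.${sub.format}`);
  }
}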

Context prompt

  • Context prompt: If you know the context of the audio you’re sending, you can provide it in the context_prompt.
request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "context_prompt": "A conversation between Sansa Stark and Peter Baelish from the Game of Thrones series.",
}

Custom vocabulary

  • Custom vocabulary: To enhance the precision of transcription, especially for words or phrases that recur often in your audio file, you can utilize the custom_vocabulary feature in the transcription configuration settings.
    The custom vocabulary has a global limit of 10k characters.
request data
{
  "audio_url": "YOUR_AUDIO_URL",
  "custom_vocabulary": ["westeros", "stark", "night's watch"]
}

Dual-channel or multiple-channel transcription

If your audio file has multiple channels with different content on each, the Gladia API automatically transcribes all of them. In the transcription result, each utterance will include a channel key indicating which channel the transcription came from.

Sending an audio file with 2 different channels (that do not contain the same audio data) will be billed twice, as 2 different audios. If your audio has multiple channels with the same audio content on each channel, it will only be billed once.

TL;DR: We charge for every unique channel in an audio file; duplicate channels are not charged.
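To work with multi-channel results in code, you could group the returned utterances by their channel key. A minimal sketch, assuming you pass it the utterances array from the result:

function groupUtterancesByChannel(utterances: any[]) {
  const byChannel: Record<number, any[]> = {};
  for (const utterance of utterances) {
    // channel indicates which audio channel the utterance came from
    (byChannel[utterance.channel] ??= []).push(utterance);
  }
  return byChannel;
}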

Adding custom metadata

You can add metadata to your transcription using the custom_metadata input in your POST request to the /v2/transcription endpoint. This will allow you to recognize your transcription when you get its data from the GET /v2/transcription/{id} endpoint, and, more importantly, it will allow you to use it as a filter in the GET /v2/transcription list endpoint. For example, you can add the following when requesting a transcription:

"custom_metadata": {
    "internalUserId": 2348739875894375,
    "paymentMethod": {
        "last4Digits": 4576
     },
     "internalUserName": "Spencer"
}

And then use one of the following GET requests to filter results:

https://api.gladia.io/v2/transcription?custom_metadata={"internalUserId": "2348739875894375"}

or

https://api.gladia.io/v2/transcription?custom_metadata={"paymentMethod": {"last4Digits": 4576}, "internalUserName": "Spencer"}

custom_metadata cannot be longer than 2000 characters when stringified.
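When calling the list endpoint from code, keep in mind that the custom_metadata filter is a JSON object passed as a query-string value, so it should be stringified and URL-encoded. A minimal sketch, reusing makeFetchRequest and the headers from the first example:

async function listTranscriptionsByMetadata(filter: Record<string, unknown>, headers: any) {
  // The filter is passed as a JSON string in the query, so encode it
  const query = encodeURIComponent(JSON.stringify(filter));
  const url = `https://api.gladia.io/v2/transcription?custom_metadata=${query}`;
  return makeFetchRequest(url, { headers });
}

// Illustrative call: listTranscriptionsByMetadata({ internalUserId: "2348739875894375" }, headers)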