Migrate from V1 API

General flow changes

In the first version of the Gladia API, getting a transcription over HTTP meant sending everything (audio file or URL, parameters, etc.) in a single HTTP call and keeping the connection open until the result came back.

This was not ideal: long transcriptions meant long-lived requests, and a connection error could leave you without any result even though the transcription itself had succeeded.

In V2, we addressed this by splitting the process into multiple steps, and we merged the audio and video endpoints:

Upload your file

This step is optional if you are already working with audio URLs.

If you’re working with audio or video files, you’ll need to upload them first using our /upload endpoint with the multipart/form-data content type, since Gladia’s /v2/pre-recorded endpoint now only accepts audio URLs.

More information about this step in the API Reference

curl --request POST \
  --url https://api.gladia.io/v2/upload \
  --header 'Content-Type: multipart/form-data' \
  --header 'x-gladia-key: YOUR_GLADIA_API_TOKEN' \
  --form audio=@/path/to/your/audio/conversation.wav

Example response:

{
  "audio_url": "https://api.gladia.io/file/636c70f6-92c1-4026-a8b6-0dfe3ecf826f",
  "audio_metadata": {
    "id": "636c70f6-92c1-4026-a8b6-0dfe3ecf826f",
    "filename": "conversation.wav",
    "extension": "wav",
    "size": 99515383,
    "audio_duration": 4146.468542,
    "number_of_channels": 2
  }
}

We will now proceed to the next steps using the returned audio_url.
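As a minimal sketch of that hand-off, the helper below pulls `audio_url` out of an /upload response body. The function name `extract_audio_url` is hypothetical and only the `audio_url` field shown in the example response is relied on; treat it as an illustration rather than an official client.

```python
import json


def extract_audio_url(upload_response_text: str) -> str:
    """Return the audio_url field from an /upload response body."""
    payload = json.loads(upload_response_text)
    audio_url = payload.get("audio_url")
    if not audio_url:
        # The next step cannot run without a URL, so fail early.
        raise ValueError("upload response has no audio_url")
    return audio_url


# Trimmed copy of the example response above.
example_response = json.dumps({
    "audio_url": "https://api.gladia.io/file/636c70f6-92c1-4026-a8b6-0dfe3ecf826f",
    "audio_metadata": {"id": "636c70f6-92c1-4026-a8b6-0dfe3ecf826f"},
})
audio_url = extract_audio_url(example_response)
```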

Transcribe

We’ll now make the transcription request to Gladia’s API.

Instead of /audio/text/audio-transcription, we’ll now use /v2/pre-recorded.

Since /v2/pre-recorded does not accept audio files, the Content-Type is no longer multipart/form-data but application/json.

More information about this step in the API Reference

# V1 request, shown for comparison.
# V1 uses Content-Type multipart/form-data; V2 uses application/json.
# The form fields below cover, in order: diarization toggle & config,
# output format, language behaviour, and translation.
curl --request POST \
  --url https://api.gladia.io/audio/text/audio-transcription/ \
  --header 'Content-Type: multipart/form-data' \
  --header 'x-gladia-key: YOUR_GLADIA_API_TOKEN' \
  --form audio_url=http://youraudiofileurl.com/conversation.wav \
  --form toggle_diarization=true \
  --form diarization_min_speakers=1 \
  --form diarization_max_speakers=5 \
  --form diarization_num_speakers=3 \
  --form output_format=srt \
  --form 'language_behaviour=automatic single language' \
  --form toggle_direct_translate=true \
  --form target_translation_language=french

Old V1: The HTTP connection is kept open until you get your transcription result, and there’s no third step.
New V2: You get an instant response from the request with an id and a result_url.
The id is your transcription ID, which you will use to get your transcription result once it’s done. You don’t have to keep any HTTP connection open on your side.
The result_url is returned for convenience. It is a pre-built URL containing your transcription id that you can use to get your result in the next step.
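The V2 request described here can be sketched in Python with the standard library. This is an illustration under stated assumptions: the helper names are hypothetical, the endpoint and header names are taken from the text above, and any extra options passed through are not validated. The request object is only built, not sent.

```python
import json
import urllib.request
from typing import Tuple

API_BASE = "https://api.gladia.io"


def build_transcription_request(api_key: str, audio_url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to /v2/pre-recorded."""
    body = {"audio_url": audio_url, **options}
    return urllib.request.Request(
        url=f"{API_BASE}/v2/pre-recorded",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",  # no longer multipart/form-data
            "x-gladia-key": api_key,
        },
        method="POST",
    )


def read_job_handle(response_text: str) -> Tuple[str, str]:
    """Return (id, result_url) from the instant V2 response."""
    payload = json.loads(response_text)
    return payload["id"], payload["result_url"]
```

Passing the built request to `urllib.request.urlopen` would send it; the connection can be closed as soon as the id and result_url have been read.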

Get the transcription result

On V1 you already get the transcription result in the previous step, so this step is only relevant for V2.

You can get your transcription results in 3 different ways:

Polling

Webhook

Callback URL

Callbacks are HTTP calls that notify you when your transcripts are ready.

Instead of polling, which keeps your server busy with repeated requests, you can use the callback feature to receive the result at an endpoint you specify:

{
  "audio_url": "YOUR_AUDIO_URL",
  "callback": true,
  "callback_config": {
    "url": "https://yourserverurl.com/your/callback/endpoint/",
    "method": "POST"
  }
}

Once the transcription is done, a request will be made to the url you provided in callback_config.url using the HTTP method you provided in callback_config.method. Allowed methods are POST and PUT with the default being POST.

The request body is a JSON object containing the transcription id and an event property that tells you if it’s a success or an error.
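The callback flow can be sketched as two small helpers: one builds the request body shown above, the other reads the notification your endpoint receives. The function names are hypothetical, and the exact event strings are an assumption (the code only assumes the event name ends in "success" or "error"); check the API Reference for the real values.

```python
import json
from typing import Tuple


def build_callback_request_body(audio_url: str, callback_url: str, method: str = "POST") -> dict:
    """Build the JSON body that enables a callback for a transcription."""
    if method not in ("POST", "PUT"):
        # Only POST and PUT are allowed, POST being the default.
        raise ValueError("callback method must be POST or PUT")
    return {
        "audio_url": audio_url,
        "callback": True,
        "callback_config": {"url": callback_url, "method": method},
    }


def parse_callback_notification(raw_body: bytes) -> Tuple[str, bool]:
    """Return (transcription_id, succeeded) from a callback body."""
    payload = json.loads(raw_body)
    event = payload["event"]
    # ASSUMPTION: the event name ends with "success" or "error".
    if event.endswith("success"):
        return payload["id"], True
    if event.endswith("error"):
        return payload["id"], False
    raise ValueError(f"unrecognized callback event: {event!r}")
```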

Transcription Input & Output changes

In addition to the transcription flow changes, input & output also changed. To get the exhaustive documentation of the V2 input/output, please refer to the API Reference part of the documentation.

Input changes

The most efficient way to get the new input list is to check the API Reference. Here’s a quick recap table of the most commonly used parameter changes:

| V1 | V2 |
| --- | --- |
| `toggle_diarization` | `diarization` |
| `language_behaviour` | `detect_language`, `enable_code_switching`, `language` |
| `output_format` | `subtitles` + `subtitles_config` |
| `webhook_url` | `callback_url` |
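To make the table concrete, here is a rough sketch of a translation helper for these four parameters. The function name is hypothetical, and several details are assumptions not stated in the table: the V1 `language_behaviour` values other than "automatic single language", the `formats` key inside `subtitles_config`, and the pass-through of `language` (whose value format may differ between versions). Verify each against the API Reference before relying on it.

```python
from typing import Any, Dict


def _as_bool(value: Any) -> bool:
    """V1 form fields arrive as strings such as "true"; accept both forms."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def migrate_v1_params(v1: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the most common V1 form fields into a V2 JSON body."""
    v2: Dict[str, Any] = {}
    if "audio_url" in v1:
        v2["audio_url"] = v1["audio_url"]
    if "toggle_diarization" in v1:
        v2["diarization"] = _as_bool(v1["toggle_diarization"])

    behaviour = v1.get("language_behaviour")
    if behaviour == "automatic single language":
        v2["detect_language"] = True
        v2["enable_code_switching"] = False
    elif behaviour == "automatic multiple languages":  # ASSUMPTION: V1 value
        v2["detect_language"] = True
        v2["enable_code_switching"] = True
    elif behaviour == "manual":  # ASSUMPTION: V1 value
        v2["detect_language"] = False
        if "language" in v1:
            v2["language"] = v1["language"]  # value format may differ in V2

    if v1.get("output_format") in ("srt", "vtt"):
        v2["subtitles"] = True
        # ASSUMPTION: subtitles_config takes a list under "formats".
        v2["subtitles_config"] = {"formats": [v1["output_format"]]}

    if "webhook_url" in v1:
        v2["callback_url"] = v1["webhook_url"]
    return v2
```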

Output changes

Here is a general changelog for the output part of the transcription’s core features:

{
// "prediction" does not exist anymore; the V2 equivalent is result.transcription.utterances
"prediction": [
  {
    "words": [
      {
        "word": "Split",
        "time_begin": 0.21001999999999998, // v2 -> utterances[i].words[n].start
        "time_end": 0.69015, // v2 -> utterances[i].words[n].end
        "confidence": 1
      },
      {
        "word": " infinity",
        "time_begin": 0.91021,
        "time_end": 1.55038,
        "confidence": 0.95
      },
      // More words ...
    ],
    "language": "en",
    "transcription": "Split infinity in a time when less is more,", // v2 -> utterances[i].text 
    "confidence": 0.84,
    "time_begin": 0.21001999999999998,  // v2 -> utterances[i].start
    "time_end": 4.71123, // v2 -> utterances[i].end
    "speaker": 0, // Not present if diarization is not enabled
    "channel": "channel_0" // v2 -> an integer instead of a string (channel: 0)
  },
  // More transcriptions...
],
// "prediction_raw" does not exist anymore
"prediction_raw": {
  "metadata": { // v2-> result.metadata
    // Most of these metadata fields are not present anymore
    "audio_conversion_time": 1.0006206035614014,
    "vad_time": 15.870725154876709,
    "audio_preprocessing_time": 0.8548638820648193,
    "language_discovery_time": 0.8840088844299316,
    "inference_time": 4.302580833435059,
    "translation_time": 0.0000059604644775390625,
    "formatting_time": 0.00007605552673339844,
    "total_transcription_time": 22.91288471221924, // -> result.metadata.transcription_time 
    "provided_file_metadata": {
      "channels": 1, // -> result.metadata.number_of_channels 
      "sample_rate": 44100, 
      "duration": 20.555465, // -> result.metadata.audio_duration 
      "nb_channels": 1,
      "sample_width": 1,
      "number_similar_channels": 0,
      "original_file_type": "audio"
    },
    "nb_silent_channels": -1,
    "total_speech_duration": 0,
    "summarization_time": 0,
    "chapterization_time": 0
  },
  // "prediction_raw.transcription" does not exist anymore
  "transcription": [
    {
      "words": [
        {
          "word": "Split",
          "time_begin": 0.21001999999999998,
          "time_end": 0.69015,
          "confidence": 1
        },
        {
          "word": " infinity",
          "time_begin": 0.91021,
          "time_end": 1.55038,
          "confidence": 0.95
        },
        // More words ...
      ],
      "language": "en",
      "transcription": "Split infinity in a time when less is more,",
      "confidence": 0.84,
      "time_begin": 0.21001999999999998,
      "time_end": 4.71123,
      "speaker": 0,
      "channel": "channel_0"
    },
  // More transcriptions...
  ]
}
}
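If existing code still expects the V1 `prediction` shape, the field mapping in the comments above can be applied mechanically. The sketch below is one hypothetical way to do that; it assumes the `word`, `confidence`, `language`, and `speaker` keys keep their V1 names in V2, which the changelog above does not state explicitly.

```python
from typing import Any, Dict, List


def v2_utterances_to_v1_prediction(v2_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Rebuild a V1-style "prediction" list from result.transcription.utterances."""
    utterances = v2_response["result"]["transcription"]["utterances"]
    prediction = []
    for utterance in utterances:
        entry = {
            "words": [
                {
                    "word": word.get("word"),  # ASSUMPTION: key unchanged in V2
                    "time_begin": word["start"],
                    "time_end": word["end"],
                    "confidence": word.get("confidence"),
                }
                for word in utterance.get("words", [])
            ],
            "language": utterance.get("language"),
            "transcription": utterance["text"],
            "confidence": utterance.get("confidence"),
            "time_begin": utterance["start"],
            "time_end": utterance["end"],
            # V2 uses an integer channel; V1 used strings such as "channel_0".
            "channel": f"channel_{utterance['channel']}",
        }
        if "speaker" in utterance:  # only present when diarization is enabled
            entry["speaker"] = utterance["speaker"]
        prediction.append(entry)
    return prediction
```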

To dive deeper into V2 of the API, take a look at these sections next:

Introduction

Asynchronous Speech-to-Text

Real-time Speech-to-Text

Audio Intelligence

Limits & Specifications

Guides

Integration
