The Real-time V2 STT API offers more features and significantly improves latency over V1.
Here is a guide to migrating to V2 and benefiting from everything it offers.

Initiating the connection to the WebSocket

In V1, you always connect to the same WebSocket URL (wss://api.gladia.io/audio/text/audio-transcription) and send your configuration through the WebSocket connection.

In V2, you first generate a unique WebSocket URL with a call to our POST /v2/live endpoint and then connect to it.
This URL contains a token unique to this live session. You’ll be able to resume your session if the connection is lost, or hand the URL to a web client without leaking your Gladia API key.
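Here is a minimal sketch of the V2 flow in TypeScript. It assumes the API key is passed in an x-gladia-key header and that the POST /v2/live response exposes the session WebSocket URL in a url field; check the API reference for the exact request and response shapes.

```typescript
// Sketch: create a live session, then connect to its dedicated WebSocket.
const response = await fetch("https://api.gladia.io/v2/live", {
  method: "POST",
  headers: {
    "x-gladia-key": process.env.GLADIA_API_KEY!, // assumed auth header
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    encoding: "wav/pcm",
    sample_rate: 16000,
    bit_depth: 16,
  }),
});
const { url } = await response.json(); // assumed "url" field

// The URL embeds a session token, so no API key is needed here and it can
// safely be handed to a web client or reused to resume the session.
const socket = new WebSocket(url);
```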

Configuration

As V2 offers more features, the configuration has evolved. You can find the full configuration definition in the POST /v2/live API reference page.
Here, we will show you how to migrate your V1 configuration object to its V2 equivalent.

Audio encoding

encoding, bit_depth and sample_rate are still present in V2, but fewer options are supported for now.
As wav and wav/pcm are the same encoding, V2 dropped wav and now defaults to wav/pcm.
amb, mp3, flac, ogg/vorbis, opus, sphere and amr-nb are no longer supported.

We also dropped support for a bit_depth of 64.

If you are using an unsupported encoding or bit_depth, please contact us with your use case and, in the meantime, keep using the V1 API.
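For supported formats, the audio settings map almost directly; only the wav value changes. A sketch with illustrative values:

```typescript
// V1 audio settings
const v1Audio = { encoding: "wav", bit_depth: 16, sample_rate: 16000 };

// V2 equivalent: "wav" is declared as "wav/pcm", the other fields are unchanged
const v2Audio = { encoding: "wav/pcm", bit_depth: 16, sample_rate: 16000 };
```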

Model

Only one model is supported in V2 for now, so omit the model property.

Endpointing and maximum audio duration

endpointing is now declared in seconds instead of milliseconds.
maximum_audio_duration has been renamed to maximum_duration_without_endpointing.
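For example (the values are illustrative):

```typescript
// V1: endpointing expressed in milliseconds
const v1 = { endpointing: 300, maximum_audio_duration: 60 };

// V2: endpointing expressed in seconds, and the duration cap has been renamed
const v2 = { endpointing: 0.3, maximum_duration_without_endpointing: 60 };
```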

Language

Automatic single language

Automatic single language behaviour is the default in both V1 and V2, so you can just omit the language properties from your configuration.

Automatic multiple languages

Manual

Languages are now specified with a 2-letter code, like in the Pre-recorded API.
Check this page for a complete list of codes.
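A sketch of the manual configuration; the V1 field names and the V2 language_config shape shown here are assumptions, so check both API references for the exact property names:

```typescript
// V1: full language name with a manual behaviour (assumed field names)
const v1 = { language_behaviour: "manual", language: "english" };

// V2: 2-letter code, assumed to live in a language_config object
const v2 = { language_config: { languages: ["en"] } };
```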

Frames format

The frames_format parameter has been removed as it is no longer needed.
You can send audio chunks as bytes or base64 and we will detect the format automatically.

Audio enhancer

audio_enhancer has been moved into the pre_processing object.
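For example:

```typescript
// V1: top-level flag
const v1 = { audio_enhancer: true };

// V2: nested under pre_processing
const v2 = { pre_processing: { audio_enhancer: true } };
```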

Word timestamps

word_timestamps has been renamed to words_accurate_timestamps and moved into the realtime_processing object.
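For example:

```typescript
// V1: top-level flag
const v1 = { word_timestamps: true };

// V2: renamed and nested under realtime_processing
const v2 = { realtime_processing: { words_accurate_timestamps: true } };
```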

Other properties

prosody, reinject_context and transcription_hint have no equivalent for now, as their results were not good enough.
They may return in another form in the future.

Full config migration sample
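Putting the previous sections together, here is a sketch of a V1 configuration and its V2 counterpart. The values and the V2 language_config shape are illustrative assumptions; the authoritative schema is the POST /v2/live API reference.

```typescript
// V1: sent as the first message over the WebSocket connection
const v1Config = {
  encoding: "wav",
  bit_depth: 16,
  sample_rate: 16000,
  endpointing: 300,                        // milliseconds
  maximum_audio_duration: 60,
  language_behaviour: "manual",
  language: "english",
  audio_enhancer: true,
  word_timestamps: true,
};

// V2: sent as the JSON body of POST /v2/live
const v2Config = {
  encoding: "wav/pcm",                     // "wav" is now "wav/pcm"
  bit_depth: 16,
  sample_rate: 16000,
  // no model property: only one model is supported for now
  endpointing: 0.3,                        // seconds
  maximum_duration_without_endpointing: 60,
  language_config: { languages: ["en"] },  // assumed shape, see API reference
  pre_processing: { audio_enhancer: true },
  realtime_processing: { words_accurate_timestamps: true },
};
```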

Send audio chunks

If you were sending chunks as bytes, nothing has changed.
If you were sending them as base64, the format of the JSON messages has changed in V2. See the API reference for the full format.
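A sketch of the difference; both message shapes below are assumptions (a V1 "frames" field and a V2 message typed audio_chunk with the payload under data.chunk), so check the API reference for the exact envelope:

```typescript
declare const socket: WebSocket;    // the live session WebSocket
declare const base64Chunk: string;  // a base64-encoded audio chunk

// V1 (assumed shape): base64 data under a "frames" field
socket.send(JSON.stringify({ frames: base64Chunk }));

// V2 (assumed shape): a typed message with the chunk nested under data
socket.send(JSON.stringify({ type: "audio_chunk", data: { chunk: base64Chunk } }));
```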

Transcription message

In V1, we send only two kinds of messages through the WebSocket:

  • the “connected” message
  • the “transcript” messages

In V2, we send a lot more:

  • lifecycle event messages
  • acknowledgment messages
  • addon messages
  • post-processing messages

In V1, to read a transcription message, you usually test that the type field is "final" and/or that the transcription field is not empty.
In V2, you should test that the type field is transcript and that data.is_final is true.

Below is an example of a transcript message in V1 and V2 so you can see the differences.
See the API reference for the full format.
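Here is a trimmed sketch of the two shapes, keeping only the fields discussed above plus an assumed data.utterance.text field carrying the V2 text; the full payloads are in the API reference:

```typescript
// V1 transcript message (trimmed)
const v1Message = {
  type: "final",                 // "partial" or "final"
  transcription: "Hello world",
};

// V2 transcript message (trimmed; assumed nesting, see the API reference)
const v2Message = {
  type: "transcript",
  data: {
    is_final: true,
    utterance: { text: "Hello world" },
  },
};
```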

Also, if you are not interested in the new messages and want the simplicity of the V1 API, you can configure which kinds of messages you want to receive when calling the POST /v2/live endpoint to initiate the session.
With the following configuration, you will only receive final transcript messages:
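A sketch of such a configuration; the messages_config name and its receive_* flags are assumptions, so check the POST /v2/live API reference for the exact fields:

```typescript
const config = {
  // ...audio and language configuration...
  messages_config: {
    receive_partial_transcripts: false,   // skip partial transcripts
    receive_final_transcripts: true,      // keep only final transcripts
    receive_acknowledgments: false,
    receive_lifecycle_events: false,
    receive_post_processing_events: false,
  },
};
```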

End the live session

The format of the message also changed. See the API reference for the full format.
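A sketch of ending a V2 session, assuming a typed stop_recording message (the exact shape is in the API reference):

```typescript
declare const socket: WebSocket;   // the live session WebSocket

// Tell the server that no more audio will be sent, then keep the socket
// open to receive the remaining transcript and post-processing messages.
socket.send(JSON.stringify({ type: "stop_recording" }));
```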