Real-time V2 is the latest real-time speech-to-text API from Gladia. It offers more features and significantly lower latency than V1. Here is a guide on how to migrate to V2 so you can start enjoying all the benefits.

Please make sure you migrate sooner rather than later, as we’re looking to remove support for V1 sometime in the future. Before we do so, however, we’ll of course reach out to those of you who are still on V1.

Initiating the connection to the WebSocket

In V1, you always connect to the same WebSocket URL (wss://api.gladia.io/audio/text/audio-transcription) and send your configuration through the WebSocket connection.

In V2, you first generate a unique WebSocket URL with a call to our POST /v2/live endpoint, and then connect to it. This URL contains a token that is unique to your live session. You’ll be able to resume your session in case of a lost connection, or give the URL to a web client without exposing your Gladia API key.
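
Here is a sketch of the new flow in TypeScript: the session is created with POST /v2/live, then the WebSocket is opened on the URL returned in the response (assumed here to be a url field). The x-gladia-key header is Gladia’s standard authentication header; check the API reference for the exact request and response fields.

```typescript
// Sketch of the V2 connection flow. The request body is a minimal illustrative
// configuration; see the POST /v2/live reference for all options.
const gladiaApiKey = "YOUR_GLADIA_API_KEY"; // keep this server-side

const initResponse = await fetch("https://api.gladia.io/v2/live", {
  method: "POST",
  headers: {
    "x-gladia-key": gladiaApiKey,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ encoding: "wav/pcm", sample_rate: 16_000 }),
});

if (!initResponse.ok) {
  throw new Error(`POST /v2/live failed with status ${initResponse.status}`);
}

// The returned URL embeds a session token, so it can be handed to a web client
// or reused to resume the session without exposing the API key itself.
const { url } = await initResponse.json();

const socket = new WebSocket(url);
socket.addEventListener("open", () => {
  // start streaming audio chunks here
});
```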

Configuration

With V2 offering more features, the configuration comes with some changes. You’ll find the full configuration definition in the POST /v2/live API reference page.

Here, we’ll show you how to migrate your V1 configuration object to the V2 one.

Audio encoding

encoding, bit_depth and sample_rate are still present in V2, but with fewer options for now.

As wav is the same encoding as wav/pcm, V2 has dropped support for wav and defaults to wav/pcm.

amb, mp3, flac, ogg/vorbis, opus, sphere and amr-nb are no longer supported.

bit_depth option 64 is no longer supported.

If you’re using an unsupported encoding or bit_depth, please contact us with your use case. In the meantime, keep using V1.
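
For example, a V1 audio configuration maps onto the same three fields in V2, with only the encoding value changing (the numeric values below are illustrative):

```typescript
// V1 audio settings (illustrative values)
const v1Audio = { encoding: "wav", bit_depth: 16, sample_rate: 16_000 };

// V2 equivalent: same fields, but "wav" becomes "wav/pcm"
const v2Audio = { encoding: "wav/pcm", bit_depth: 16, sample_rate: 16_000 };
```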

Model

Only one model is supported in V2 for now, so omit the model property.

End-pointing and maximum audio duration

endpointing is now declared in seconds instead of milliseconds. maximum_audio_duration has been renamed to maximum_duration_without_endpointing.
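
For example (values are illustrative):

```typescript
// V1: endpointing was given in milliseconds
const v1Endpointing = { endpointing: 300, maximum_audio_duration: 10 };

// V2: endpointing is now in seconds, and maximum_audio_duration is renamed
const v2Endpointing = { endpointing: 0.3, maximum_duration_without_endpointing: 10 };
```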

Language

Automatic single language

Automatic single language behavior is the default in both V1 and V2, so you can just omit those parameters from your configuration.

Automatic multiple languages
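
As a sketch of this case, assuming the V1 language_behaviour field and the language_config.code_switching field from the current POST /v2/live reference (confirm the exact names there):

```typescript
// V1: let the model switch between languages during the session
const v1Language = { language_behaviour: "automatic multiple languages" };

// V2 (assumed field names): enable code switching in language_config
const v2Language = { language_config: { code_switching: true } };
```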

Manual

Languages are now specified with a 2-letter code, as in the API for asynchronous speech-to-text.
See this page for a complete list of codes.
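
As a sketch, assuming the language_config.languages field from the current POST /v2/live reference:

```typescript
// V1: full language name
const v1Language = { language_behaviour: "manual", language: "english" };

// V2 (assumed field names): 2-letter code, as in the asynchronous API
const v2Language = { language_config: { languages: ["en"] } };
```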

Frames format

You can send audio chunks as bytes or base64 and we’ll detect the format automatically. The frames_format parameter is no longer present.

Audio enhancer

audio_enhancer has been moved into the pre_processing object.
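
For example:

```typescript
// V1
const v1 = { audio_enhancer: true };

// V2: nested under pre_processing
const v2 = { pre_processing: { audio_enhancer: true } };
```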

Word timestamps

word_timestamps has been renamed to words_accurate_timestamps and moved into the realtime_processing object.
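
For example:

```typescript
// V1
const v1 = { word_timestamps: true };

// V2: renamed and nested under realtime_processing
const v2 = { realtime_processing: { words_accurate_timestamps: true } };
```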

Other properties

prosody, reinject_context and transcription_hint are not supported for now. They may return in another form in the future.

Full config migration sample
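
Putting the pieces above together, here is a sketch of a V1 configuration and its V2 counterpart. The language_config block relies on the assumption made in the Language section, and every value is illustrative, so double-check the POST /v2/live reference before copying.

```typescript
// V1: sent as a JSON message over the WebSocket once connected (illustrative values)
const v1Config = {
  encoding: "wav",
  bit_depth: 16,
  sample_rate: 16_000,
  endpointing: 300, // milliseconds
  maximum_audio_duration: 10,
  language_behaviour: "manual",
  language: "english",
  frames_format: "base64",
  audio_enhancer: true,
  word_timestamps: true,
};

// V2: sent as the body of POST /v2/live when creating the session
const v2Config = {
  encoding: "wav/pcm",
  bit_depth: 16,
  sample_rate: 16_000,
  endpointing: 0.3, // seconds
  maximum_duration_without_endpointing: 10,
  language_config: { languages: ["en"] }, // assumed field names, see above
  // the model and frames_format properties are simply dropped in V2
  pre_processing: { audio_enhancer: true },
  realtime_processing: { words_accurate_timestamps: true },
};
```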

Send audio chunks

If you were sending chunks as raw bytes, nothing has changed. If you were sending them as base64, the format of the JSON message has changed in V2. See the API reference for the full format.
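
As a sketch (the exact message shapes below are assumptions; the API reference remains the source of truth):

```typescript
declare const socket: WebSocket;
declare const pcmChunk: ArrayBuffer; // raw audio bytes
declare const base64Chunk: string;   // the same bytes, base64-encoded

// Raw bytes: unchanged between V1 and V2
socket.send(pcmChunk);

// Base64 JSON message in V1 (assumed shape)
socket.send(JSON.stringify({ frames: base64Chunk }));

// Base64 JSON message in V2 (assumed shape, see the API reference for the full format)
socket.send(JSON.stringify({ type: "audio_chunk", data: { chunk: base64Chunk } }));
```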

Transcription message

In V1, we send only two kinds of messages through the WebSocket:

  • the “connected” message
  • the “transcript” messages

In V2, we send more:

  • lifecycle event messages
  • acknowledgment messages
  • add-on messages
  • post-processing messages

To read a transcription message in V1, you check that the type field is "final" and/or that the transcription field is not empty.
In V2, you should check that the type field is "transcript" and that data.is_final is true.

The example transcript messages in the API reference show the differences between V1 and V2, along with the full format.
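
Here is a minimal handler sketch: only the type and data.is_final checks come from the description above, while the rest of the payload path (data.utterance.text) is an assumption to verify against the API reference.

```typescript
declare const socket: WebSocket;

socket.addEventListener("message", (event) => {
  const message = JSON.parse(event.data.toString());

  // V1: message.type === "final" and/or a non-empty transcription field
  // V2: a "transcript" message whose data.is_final flag is true
  if (message.type === "transcript" && message.data.is_final) {
    console.log(message.data.utterance?.text); // payload path assumed
  }
});
```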

If you’re not interested in the new messages and simply want the ones you had with V1, you can configure which kinds of messages you want to receive when calling the POST /v2/live endpoint to initiate the session.

With the following configuration, you will only receive final transcript messages:
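
This is a sketch of such a configuration, assuming the messages_config block and its receive_* flags from the current POST /v2/live reference (confirm the exact flag names there):

```typescript
const v2Config = {
  // ...audio and language settings as above...
  messages_config: {
    receive_partial_transcripts: false, // drop intermediate results
    receive_final_transcripts: true,    // keep only final transcript messages
    receive_speech_events: false,
    receive_acknowledgments: false,
    receive_post_processing_events: false,
    receive_lifecycle_events: false,
  },
};
```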

End the live session

The format of the message you send to end the live session has also changed. See the API reference for the full format.
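
As a sketch, assuming the stop_recording message type from the current API reference:

```typescript
declare const socket: WebSocket;

// V2: tell the server that no more audio will be sent; it will then finish
// processing and send the remaining messages before closing the session.
socket.send(JSON.stringify({ type: "stop_recording" }));
```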