Live transcription V2 is the latest real-time speech-to-text API from Gladia. It offers more features and significantly lower latency than V1. Here is a guide on how to migrate to V2, so you can start enjoying all the benefits. Please migrate sooner rather than later, as we're looking to remove support for V1 at some point in the future. Before we do so, however, we'll of course reach out to those of you who are still on V1.
Initiating the connection to the WebSocket
In V1, you always connect to the same WebSocket URL (wss://api.gladia.io/audio/text/audio-transcription) and send your configuration through the WebSocket connection. In V2, you first generate a unique WebSocket URL with a call to our POST /v2/live endpoint, and then connect to it. This URL contains a token that is unique to your live session, so you can resume your session in case of a lost connection, or give the URL to a web client without exposing your Gladia API key.
V1
V2
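The V2 flow can be sketched as follows. The `x-gladia-key` header name and the `url` field in the response are assumptions drawn from Gladia's general documentation, not from this guide, so double-check them against the API reference.

```python
# V1 (sketch): connect straight to the shared WebSocket URL and send the
# configuration as the first message over the socket.
#
#   ws = websocket.create_connection(
#       "wss://api.gladia.io/audio/text/audio-transcription")
#   ws.send(json.dumps({"x_gladia_key": API_KEY, "sample_rate": 16000}))
#
# V2: first POST the configuration to /v2/live, then connect to the
# session-specific URL returned in the response.

import json
import urllib.request

def build_v2_session_request(api_key: str, config: dict) -> urllib.request.Request:
    """Build the POST /v2/live request that creates a live session."""
    return urllib.request.Request(
        "https://api.gladia.io/v2/live",
        data=json.dumps(config).encode(),
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_v2_session_request("YOUR_API_KEY", {"sample_rate": 16000})
# resp = json.load(urllib.request.urlopen(req))
# ws = websocket.create_connection(resp["url"])  # token-scoped session URL
```

Because the session URL carries its own token, you can pass it to a browser client without ever shipping your API key to the front end.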
Configuration
With V2 offering more features, the configuration comes with some changes. You'll find the full configuration definition in the POST /v2/live API reference page. Here, we'll show you how to migrate your V1 configuration object to the V2 one.
Audio encoding
encoding, bit_depth and sample_rate are still present in V2, but with fewer options for now.
As wav is the same encoding as wav/pcm, V2 has dropped support for wav and defaults to wav/pcm.
amb, mp3, flac, ogg/vorbis, opus, sphere and amr-nb are no longer supported.
bit_depth option 64 is no longer supported.
If you're using an unsupported encoding or bit_depth, please contact us with your use case. In the meantime, keep using V1.
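A minimal sketch of the audio-format migration, with illustrative values:

```python
# V2 drops the "wav" alias in favour of the equivalent "wav/pcm" and no
# longer accepts bit_depth 64; other fields carry over unchanged.
v1_config = {"encoding": "wav", "bit_depth": 16, "sample_rate": 16000}

def migrate_audio_format(v1: dict) -> dict:
    encoding = v1["encoding"]
    if encoding == "wav":
        encoding = "wav/pcm"  # same codec, new name
    if v1["bit_depth"] == 64:
        raise ValueError("bit_depth 64 is not supported in V2")
    return {
        "encoding": encoding,
        "bit_depth": v1["bit_depth"],
        "sample_rate": v1["sample_rate"],
    }

v2_config = migrate_audio_format(v1_config)
# → {"encoding": "wav/pcm", "bit_depth": 16, "sample_rate": 16000}
```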
Model
Only one model is supported in V2 for now, so omit the model property.
End-pointing and maximum audio duration
endpointing is now declared in seconds instead of milliseconds.
maximum_audio_duration has been renamed to maximum_duration_without_endpointing.
V1
V2
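The two changes above can be sketched like this (the concrete values are illustrative):

```python
# V1 declares endpointing in milliseconds; V2 uses seconds, and
# maximum_audio_duration is renamed to maximum_duration_without_endpointing.
v1 = {"endpointing": 300, "maximum_audio_duration": 10}

v2 = {
    "endpointing": v1["endpointing"] / 1000,  # 300 ms → 0.3 s
    "maximum_duration_without_endpointing": v1["maximum_audio_duration"],
}
```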
Language
Automatic single language
Automatic single language behavior is the default in both V1 and V2, so you can just omit the language parameters from your configuration.
V1
V2
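In other words (the V1 value string is an assumption, not verbatim from this guide):

```python
# Automatic single language is the default in both versions, so the
# explicit V1 setting simply disappears in V2.
v1 = {"language_behaviour": "automatic single language"}
v2 = {}  # nothing to set: this behaviour is V2's default
```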
Automatic multiple languages
V1
V2
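A sketch of the multi-language case. The nested `language_config` object with a `code_switching` flag is an assumption based on Gladia's V2 docs, as is the V1 value string:

```python
# V1 enabled multi-language detection with a language_behaviour string;
# V2 groups language options under language_config.
v1 = {"language_behaviour": "automatic multiple languages"}
v2 = {"language_config": {"code_switching": True}}
```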
Manual
Languages are now specified with a 2-letter code, as in the API for asynchronous speech-to-text.
See this page for a complete list of codes.
V1
V2
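A sketch of manual language selection. The V1 full-name value and the V2 `language_config.languages` shape are assumptions based on the respective docs, not verbatim from this guide:

```python
# V1 took a full language name; V2 takes a list of 2-letter codes,
# matching the asynchronous speech-to-text API.
v1 = {"language_behaviour": "manual", "language": "english"}
v2 = {"language_config": {"languages": ["en"]}}
```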
Frames format
You can send audio chunks as bytes or base64 and we'll detect the format automatically. The frames_format parameter is no longer present.
Audio enhancer
audio_enhancer has been moved into the pre_processing object.
V1
V2
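The move into `pre_processing` looks like this:

```python
# audio_enhancer is now nested under the pre_processing object.
v1 = {"audio_enhancer": True}
v2 = {"pre_processing": {"audio_enhancer": True}}
```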
Word timestamps
word_timestamps has been renamed to words_accurate_timestamps and moved into the realtime_processing object.
V1
V2
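The rename and move look like this:

```python
# word_timestamps is renamed and nested under realtime_processing.
v1 = {"word_timestamps": True}
v2 = {"realtime_processing": {"words_accurate_timestamps": True}}
```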
Other properties
prosody, reinject_context and transcription_hint are not supported for now.
They may return in another form in the future.
Full config migration sample
V1
V2
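Putting the rules above together, here is a sketch of a full V1 configuration and its V2 equivalent. Values are illustrative, and the nested V2 shapes (`language_config` in particular) are assumptions based on the V2 docs:

```python
v1_config = {
    "encoding": "wav",
    "bit_depth": 16,
    "sample_rate": 16000,
    "model": "fast",                 # V1-only; dropped in V2
    "language_behaviour": "manual",
    "language": "english",
    "frames_format": "bytes",        # V1-only; dropped in V2
    "endpointing": 300,              # milliseconds
    "maximum_audio_duration": 10,
    "audio_enhancer": True,
    "word_timestamps": True,
}

v2_config = {
    "encoding": "wav/pcm",           # "wav" was an alias for wav/pcm
    "bit_depth": 16,
    "sample_rate": 16000,
    "language_config": {"languages": ["en"]},
    "endpointing": 0.3,              # now in seconds
    "maximum_duration_without_endpointing": 10,
    "pre_processing": {"audio_enhancer": True},
    "realtime_processing": {"words_accurate_timestamps": True},
}
```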
Send audio chunks
If you were sending chunks as bytes, nothing has changed. If you were sending them as base64, the format of the JSON messages changed in V2. See the API reference for the full format.
V1
V2
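A sketch of the base64 case. Both JSON envelopes below are assumptions based on the respective API references, not verbatim from this guide, so verify them there:

```python
# Raw bytes are sent unchanged in both versions; only the base64 JSON
# envelope differs.
import base64
import json

chunk = b"\x00\x01\x02\x03"
b64 = base64.b64encode(chunk).decode()

v1_message = json.dumps({"frames": b64})
v2_message = json.dumps({"type": "audio_chunk", "data": {"chunk": b64}})
```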
Transcription message
In V1, we only send two kinds of messages through the WebSocket:
- the "connected" message
- the "transcript" messages
In V2, there are many more kinds of messages:
- lifecycle event messages
- acknowledgment messages
- add-on messages
- post-processing messages
- …
In V1, to detect a final transcript, you check that the type field is "final" and/or that the transcription field is not empty. In V2, you should confirm that the type field is transcript and that data.is_final is true.
Below are examples of transcript messages in V1 and V2, so you can see the differences.
See the API reference for the full format.
V1
V2
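The final-transcript checks can be sketched as follows. The message bodies are simplified; the exact payload shapes (e.g. where the text lives in V2) are assumptions to verify against the API reference:

```python
def is_final_v1(msg: dict) -> bool:
    # V1: type is "final" and the transcription field is non-empty.
    return msg.get("type") == "final" and bool(msg.get("transcription"))

def is_final_v2(msg: dict) -> bool:
    # V2: type is "transcript" and data.is_final is true.
    return msg.get("type") == "transcript" and msg.get("data", {}).get("is_final", False)

v1_msg = {"type": "final", "transcription": "hello world"}
v2_msg = {
    "type": "transcript",
    "data": {"is_final": True, "utterance": {"text": "hello world"}},
}
```

Note that the V2 check must look at both fields: partial transcripts also arrive with `type` set to `transcript`, just with `data.is_final` false.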
End the live session
The format of this message also changed. See the API reference for the full format.
V1
V2
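A sketch of ending a V2 session. The `stop_recording` payload is an assumption based on the V2 API reference, not verbatim from this guide, so confirm it there before relying on it:

```python
import json

# Assumed V2 stop message; after sending it, keep reading the socket
# until the final transcript and post-processing messages arrive.
v2_stop = json.dumps({"type": "stop_recording"})
# ws.send(v2_stop)
```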