Migration guide from V1 to V2
Migrate to the latest version of Gladia’s API
Real-time V2 is the latest real-time speech-to-text API from Gladia. It offers more features and significantly lower latency than V1. Here is a guide on how to migrate to V2, so you can start enjoying all the benefits.
Please make sure you migrate sooner rather than later, as we're looking to remove support for V1 at some point in the future. Before we do so, however, we'll of course reach out to those of you who are still on V1.
Initiating the connection to the WebSocket
In V1, you always connect to the same WebSocket URL (wss://api.gladia.io/audio/text/audio-transcription) and send your configuration through the WebSocket connection.
In V2, you first generate a unique WebSocket URL with a call to our POST /v2/live endpoint, and then connect to it. This URL contains a token that is unique to your live session. You’ll be able to resume your session in case of a lost connection, or give the URL to a web client without exposing your Gladia API key.
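To illustrate the new flow, here is a minimal Python sketch: it creates the session over HTTP, then connects to the WebSocket URL returned for that session. It assumes the response body contains a url field and that the API key is passed in the x-gladia-key header; check the POST /v2/live reference for the exact request and response shapes.

```python
import asyncio
import json

import requests
import websockets

GLADIA_API_KEY = "YOUR_GLADIA_API_KEY"

# 1. Create the live session over HTTP (new in V2).
response = requests.post(
    "https://api.gladia.io/v2/live",
    headers={"x-gladia-key": GLADIA_API_KEY, "Content-Type": "application/json"},
    json={"encoding": "wav/pcm", "sample_rate": 16000},
)
response.raise_for_status()
session = response.json()

async def main():
    # 2. Connect to the unique WebSocket URL of this session.
    #    The URL already carries the session token, so no API key is needed here.
    async with websockets.connect(session["url"]) as ws:
        async for message in ws:
            print(json.loads(message))

asyncio.run(main())
```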
Configuration
With V2 offering more features, the configuration comes with some changes. You’ll find the full configuration definition in the POST /v2/live API reference page.
Here, we’ll show you how to migrate your V1 configuration object to the V2 one.
Audio encoding
encoding, bit_depth and sample_rate are still present in V2, but with fewer options for now.
As wav is the same encoding as wav/pcm, V2 has dropped support for wav and defaults to wav/pcm.
amb, mp3, flac, ogg/vorbis, opus, sphere and amr-nb are no longer supported.
The bit_depth option 64 is no longer supported.
If you're using an unsupported encoding or bit_depth, please contact us with your use case. In the meantime, keep using V1.
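For instance, a V1 configuration using the wav alias would carry over to V2 like this (values are illustrative):

```python
# V1 configuration (sent over the WebSocket)
v1_audio = {
    "encoding": "wav",      # alias of wav/pcm, dropped in V2
    "bit_depth": 16,
    "sample_rate": 16000,
}

# V2 equivalent (sent in the body of POST /v2/live)
v2_audio = {
    "encoding": "wav/pcm",  # V2 default
    "bit_depth": 16,
    "sample_rate": 16000,
}
```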
Model
Only one model is supported in V2 for now, so omit the model property.
End-pointing and maximum audio duration
endpointing is now declared in seconds instead of milliseconds.
maximum_audio_duration has been renamed to maximum_duration_without_endpointing.
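For example (values are illustrative), 300 milliseconds of end-pointing in V1 becomes 0.3 seconds in V2:

```python
# V1
v1_timing = {
    "endpointing": 300,            # milliseconds
    "maximum_audio_duration": 10,
}

# V2
v2_timing = {
    "endpointing": 0.3,                          # seconds
    "maximum_duration_without_endpointing": 10,
}
```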
Language
In V2, the three V1 language modes are configured through the language options of the POST /v2/live request:
- Automatic single language
- Automatic multiple languages
- Manual
See the documentation for a complete list of language codes.
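As a rough sketch of how each mode could be expressed in V2, assuming the request takes a language_config object with languages and code_switching fields (these field names are an assumption; check the POST /v2/live reference for the exact schema):

```python
# Assumed V2 equivalents of the three V1 language modes.

# Automatic single language: let the API detect one language for the session.
v2_single = {"language_config": {"languages": [], "code_switching": False}}

# Automatic multiple languages: allow switching between detected languages.
v2_multiple = {"language_config": {"languages": [], "code_switching": True}}

# Manual: pin the session to a specific language, e.g. English.
v2_manual = {"language_config": {"languages": ["en"], "code_switching": False}}
```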
Frames format
You can send audio chunks as bytes or base64 and we'll detect the format automatically.
The frames_format parameter is no longer present.
Audio enhancer
audio_enhancer has been moved into the pre_processing object.
Word timestamps
word_timestamps has been renamed to words_accurate_timestamps and moved into the realtime_processing object.
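Put together, the two relocated properties would look like this in a V2 request body:

```python
# V1
v1_options = {
    "audio_enhancer": True,
    "word_timestamps": True,
}

# V2
v2_options = {
    "pre_processing": {
        "audio_enhancer": True,
    },
    "realtime_processing": {
        "words_accurate_timestamps": True,
    },
}
```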
Other properties
prosody, reinject_context and transcription_hint are not supported for now.
They may return in another form in the future.
Full config migration sample
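The sketch below combines the changes described above into a single before/after example. Values are illustrative, and the V2 field names should be double-checked against the POST /v2/live reference.

```python
# V1: configuration frame sent right after connecting to the WebSocket
v1_config = {
    "x_gladia_key": "YOUR_GLADIA_API_KEY",
    "encoding": "wav",
    "bit_depth": 16,
    "sample_rate": 16000,
    "model": "fast",               # dropped in V2: only one model is supported
    "frames_format": "base64",     # dropped in V2: the format is auto-detected
    "endpointing": 300,            # milliseconds
    "maximum_audio_duration": 10,
    "audio_enhancer": True,
    "word_timestamps": True,
}

# V2: body of POST /v2/live (the API key goes in the x-gladia-key header instead)
v2_config = {
    "encoding": "wav/pcm",
    "bit_depth": 16,
    "sample_rate": 16000,
    "endpointing": 0.3,                          # seconds
    "maximum_duration_without_endpointing": 10,
    "pre_processing": {"audio_enhancer": True},
    "realtime_processing": {"words_accurate_timestamps": True},
}
```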
Send audio chunks
If you were sending chunks as bytes, nothing has changed. If you were sending them as base64, the format of the JSON messages changed in V2. See the API reference for the full format.
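Here is a sketch of both options; the JSON envelope shown for base64 chunks (the type and data.chunk fields) is an assumption, so verify it against the API reference:

```python
import base64
import json

async def send_chunk(ws, audio_chunk_bytes: bytes, as_base64: bool = False):
    """Send one audio chunk to an open V2 WebSocket connection."""
    if not as_base64:
        # Raw bytes: unchanged from V1.
        await ws.send(audio_chunk_bytes)
    else:
        # Base64: the JSON envelope is new in V2.
        # The "audio_chunk" type and data.chunk field are assumptions; see the API reference.
        await ws.send(json.dumps({
            "type": "audio_chunk",
            "data": {"chunk": base64.b64encode(audio_chunk_bytes).decode("utf-8")},
        }))
```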
Transcription message
In V1, we send only two kinds of messages over the WebSocket:
- the “connected” message
- the “transcript” messages
In V2, we send more:
- lifecycle event messages
- acknowledgment messages
- add-on messages
- post-processing messages
- …
To read a transcription message in V1, you verify that the type field is "final" and/or the transcription field is not empty.
In V2, you should confirm that the type field is "transcript" and that data.is_final is true.
Below are examples of transcript messages in V1 and V2, so you can see the differences. See the API reference for the full format.
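As an abridged illustration (fields other than type, transcription, data.is_final and data.utterance are assumptions; refer to the API reference for the exact payloads):

```python
# V1 transcript message (abridged)
v1_message = {
    "type": "final",
    "transcription": "Hello world",
    "language": "en",
    "time_begin": 0.5,
    "time_end": 1.2,
}

# V2 transcript message (abridged)
v2_message = {
    "type": "transcript",
    "session_id": "the-session-id",
    "data": {
        "is_final": True,
        "utterance": {
            "text": "Hello world",
            "language": "en",
            "start": 0.5,
            "end": 1.2,
        },
    },
}
```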
If you’re not interested in new messages and simply want the ones from the V1 API, you can always configure what kind of messages you want when calling the POST /v2/live endpoint to initiate the session.
With the following configuration, you will only receive final transcript messages:
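A sketch of what that could look like, assuming a messages_config object with receive_* flags; the exact field names must be checked against the POST /v2/live reference:

```python
# Sketch: restrict the WebSocket to final transcript messages only.
# The messages_config field names below are assumptions; verify them in the API reference.
v2_config = {
    # ... audio and processing options ...
    "messages_config": {
        "receive_final_transcripts": True,
        "receive_partial_transcripts": False,
        "receive_speech_events": False,
        "receive_acknowledgments": False,
        "receive_lifecycle_events": False,
    },
}
```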
End the live session
The format of this message also changed. See the API reference for the full format.
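As a hedged sketch, assuming the V2 stop message is a JSON frame with a type field (the stop_recording value shown here must be verified against the API reference):

```python
import json

async def stop_session(ws):
    # Tell the server no more audio will be sent, so it can finalize the session.
    # The exact message type ("stop_recording" here) is an assumption; see the API reference.
    await ws.send(json.dumps({"type": "stop_recording"}))
```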