Gladia Audio Transcription WebSocket API Documentation
Overview
The Gladia Audio Transcription WebSocket API allows developers to connect to a streaming audio transcription endpoint and receive real-time transcription results. This documentation provides the necessary information and guidelines for establishing a WebSocket connection and sending audio data for transcription.
WebSocket Endpoint
The WebSocket endpoint for Gladia Audio Transcription API is:
wss://api.gladia.io/audio/text/audio-transcription
Establishing a WebSocket Connection
To connect to the Gladia Audio Transcription WebSocket API, follow the steps below:
- Create a WebSocket connection to the Gladia Audio Transcription endpoint:
wss://api.gladia.io/audio/text/audio-transcription
- After establishing the connection, send an initial configuration message as a JSON object. This message specifies the transcription options and settings.
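For example, a minimal Python sketch using the third-party websockets package (the key value is a placeholder):

import asyncio
import json
import websockets  # pip install websockets

async def connect_and_configure():
    # Step 1: open the WebSocket connection to the transcription endpoint
    async with websockets.connect("wss://api.gladia.io/audio/text/audio-transcription") as socket:
        # Step 2: the first message must be the JSON configuration object
        await socket.send(json.dumps({"x_gladia_key": "your_gladia_key"}))
        # Audio frames can be sent from this point on

asyncio.run(connect_and_configure())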
Initial Configuration Message
The initial configuration message should be sent as a JSON object and include the following options:
- x_gladia_key (mandatory, string): Your API key obtained from app.gladia.io.
- encoding (optional, string): Specifies the audio encoding format. Valid values are "WAV", "WAV/PCM", "WAV/ALAW", "WAV/ULAW", "AMB", "MP3", "FLAC", "OGG/VORBIS", "OPUS", "SPHERE", and "AMR-NB". Default value is "WAV".
- bit_depth (optional, integer): Specifies the bit depth of the audio submitted for transcription, i.e. the number of bits of information stored for each sample of audio. Available options are 8, 16, 24, 32, and 64. Default value is 16.
- sample_rate (optional, integer): Specifies the sample rate of the audio in Hz. Valid values are 8000, 16000, 32000, and 48000. Default value is 16000.
- language (optional, string): Defines the language to use for the transcription, e.g. "english". By default, the language is detected for each new chunk of data sent.
- transcription_hint (optional, string): Provides a custom vocabulary to the model to improve the accuracy of transcribing context-specific words, technical terms, names, etc. If empty, this argument is ignored.
- endpointing (optional, integer): Specifies the endpointing duration in milliseconds, i.e. the duration of silence after which the utterance is considered finished and a result of type "final" is sent. For more info see Transcription Results below. Default value is 300 ms.
- model_type (optional, string): Specifies which transcription model to use. Possible values are "accurate" and "fast". Default value is "fast". The fast model returns less accurate "partial" frames during an utterance with 400-500 ms latency, and a more accurate "final" frame at the end of each utterance with 700-800 ms latency. The accurate model only returns "final" frames at the end of each utterance, with very high accuracy and a latency of 1000-1100 ms.
- frames_format (optional, string): Specifies the format of the audio frames you send. Possible values are "base64" and "bytes". Default value is "base64".
- reinject_context [experimental] (optional, boolean or string): Enables or disables reinjecting the last 10 sentences to guide the transcription. Depending on the use case and context, turning this on may increase accuracy or degrade it. Default value is false. Note: do not enable this if code switching is required; reinjecting context forces the language to remain consistent throughout the stream, and audio sent in a different language will be transcribed into the first language detected.
- word_timestamps (optional, string): Enables word-level timestamps. It can have two values: 0 to disable it, 1 to enable it. Default value is 0.
Here's an example initial configuration message:
{
  "x_gladia_key": "your_gladia_key",
  "sample_rate": 16000,
  "encoding": "mp3"
}
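A configuration that sets several of the optional fields explicitly might look like the following (a sketch; the values shown are simply the documented defaults, plus the accurate model):

{
  "x_gladia_key": "your_gladia_key",
  "encoding": "WAV",
  "bit_depth": 16,
  "sample_rate": 16000,
  "model_type": "accurate",
  "endpointing": 300,
  "frames_format": "base64"
}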
Sending Audio Data for Transcription
Once the WebSocket connection is established and the initial configuration message is sent, you can start sending audio data for transcription.
Using base64 encoded frames
Send JSON objects as messages to the WebSocket with the following structure:
{
  "frames": "base64"
}
frames (string): Contains the audio data in base64-encoded format.
To send audio data, convert your audio to base64-encoded bytes and include them in the frames field of the JSON object.
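As a sketch in Python, a hypothetical send_chunk helper would look like this (socket is an open WebSocket connection):

import base64
import json

async def send_chunk(socket, chunk: bytes) -> None:
    # Base64-encode the raw audio bytes and wrap them in the expected JSON envelope
    frames = base64.b64encode(chunk).decode("utf-8")
    await socket.send(json.dumps({"frames": frames}))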
Using raw frames (bytes)
If you set frames_format to "bytes" in the configuration message, you need to forward the binary frames as is, without any formatting or wrapping.
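The equivalent sketch for raw frames simply forwards the bytes; with the Python websockets package, sending a bytes object produces a binary frame:

async def send_raw_chunk(socket, chunk: bytes) -> None:
    # With frames_format set to "bytes", raw audio goes out as a binary
    # WebSocket frame: no JSON wrapper, no base64 encoding
    await socket.send(chunk)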
Transcription Results
The Gladia Audio Transcription API will send back transcription results as JSON objects in real-time. Each result object contains the following information:
- type (string): Specifies the type of transcription result. It can have two values: "final" or "partial". A result is of type "final" when more than endpointing milliseconds of silence follow the utterance; otherwise it is "partial". When using the accurate model, only "final" frames are sent.
- transcription (string): Contains the full transcription text since the last "final" result.
- language (string): Represents the detected language for the transcription.
- words (array): Contains the words within the transcription.
With word timestamps enabled, the result can be the following:
{
  "words": [
    {
      "word": "Dall'associazione",
      "time_begin": 0.08019,
      "time_end": 0.58137,
      "confidence": 0.77
    }
  ],
  "transcription": "Dall'associazione",
  "language": "it",
  "time_begin": 0.08019,
  "time_end": 0.58137,
  "type": "final",
  "duration": 0.58137
}
Please note that multiple result objects may be received during the transcription process, depending on the audio duration and WebSocket connection.
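For example, a minimal Python handler (a sketch using only the fields documented above; it assumes word timestamps are enabled) might act only on finalized utterances:

import json

def handle_result(raw_message: str) -> None:
    result = json.loads(raw_message)
    if result.get("type") == "final":
        # "final": the utterance is finished and will not be revised
        print(f"[{result.get('language')}] {result.get('transcription')}")
        for word in result.get("words", []):
            print(f"  {word['word']}: {word['time_begin']}s -> {word['time_end']}s")
    else:
        # "partial": provisional text that later messages may revise
        pass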
Example Workflow
- Establish a WebSocket connection to wss://api.gladia.io/audio/text/audio-transcription.
- Send the initial configuration message as a JSON object to specify the transcription options.
{
  "x_gladia_key": "your_gladia_key",
  "sample_rate": 16000,
  "encoding": "mp3"
}
- Start sending audio data for transcription by converting the audio to base64-encoded bytes and including them in the frames field of a JSON object.
{
  "frames": "base64"
}
- Receive real-time transcription results in JSON format. Each result object will include the type, transcription, language, and words fields.
{
  "type": "partial",
  "transcription": "Hello world.",
  "language": "en"
}
- Continue sending audio data and receiving transcription results until the audio stream is complete.
Limitations
Chunk Size Limitation
Our streaming service is designed to handle data in smaller, manageable chunks. While we aim to provide flexibility in the content you can stream, we have established a maximum chunk duration of 120 seconds for the following reasons:
- Optimal Streaming Performance: Smaller chunks allow for better real-time processing and distribution, reducing the risk of buffering or delays in playback.
- Resource Management: Longer chunks could potentially strain our server resources and impact the overall performance of the service for all users.
- Reliability: Shorter chunks improve the reliability of the streaming service by minimizing the likelihood of errors or data loss during transmission.
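For uncompressed PCM, a chunk's byte size follows directly from the stream parameters, so you can size your chunks to stay well below the 120-second cap. A sketch (the channel count is an assumption; the configuration options above do not include one):

# Sizing audio chunks from the stream parameters (raw PCM audio)
SAMPLE_RATE = 16000   # Hz, matches the sample_rate configuration option
BIT_DEPTH = 16        # bits per sample, matches the bit_depth option
CHANNELS = 1          # assumed mono input
CHUNK_SECONDS = 1     # comfortably below the 120-second maximum

# Bytes of raw PCM per chunk: samples/second * bytes/sample * channels * seconds
chunk_size_bytes = SAMPLE_RATE * (BIT_DEPTH // 8) * CHANNELS * CHUNK_SECONDS
print(chunk_size_bytes)  # 32000 bytes for one second of 16 kHz / 16-bit mono audio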
Thirty Seconds Behavior
If endpointing fails to be activated and more than 30 seconds of audio have been buffered, we force the following behavior:
- a final utterance is forced
- the buffer state is reinitialized
Conclusion
Congratulations! You have successfully learned how to connect to the Gladia Audio Transcription WebSocket API and send audio data for real-time transcription. Make sure to handle WebSocket connection errors, manage audio streaming, and process the received transcription results according to your application's needs.
For further assistance or inquiries, please refer to the Gladia documentation or contact our support team at [email protected].
WebSocket close codes
Close code (uint16) | Codename | Internal | Customizable | Description
---|---|---|---|---
0 - 999 | | Yes | No | Unused
1000 | CLOSE_NORMAL | No | No | Successful operation / regular socket shutdown
1001 | CLOSE_GOING_AWAY | No | No | Client is leaving (e.g. browser tab closing)
1002 | CLOSE_PROTOCOL_ERROR | Yes | No | Endpoint received a malformed frame
1003 | CLOSE_UNSUPPORTED | Yes | No | Endpoint received an unsupported frame (e.g. binary-only endpoint received text frame)
1004 | | Yes | No | Reserved
1005 | CLOSED_NO_STATUS | Yes | No | Expected close status, received none
1006 | CLOSE_ABNORMAL | Yes | No | No close code frame has been received
1007 | Unsupported payload | Yes | No | Endpoint received an inconsistent message (e.g. malformed UTF-8)
1008 | Policy violation | No | No | Generic code used for situations other than 1003 and 1009
1009 | CLOSE_TOO_LARGE | No | No | Endpoint won't process a large frame
1010 | Mandatory extension | No | No | Client wanted an extension which the server did not negotiate
1011 | Server error | No | No | Internal server error while operating
1012 | Service restart | No | No | Server/service is restarting
1013 | Try again later | No | No | Temporary server condition forced blocking of the client's request
1014 | Bad gateway | No | No | Server acting as a gateway received an invalid response
1015 | TLS handshake fail | Yes | No | Transport Layer Security handshake failure
1016 - 1999 | | Yes | No | Reserved for later
2000 - 2999 | | Yes | Yes | Reserved for WebSocket extensions
3000 - 3999 | | No | Yes | Registered first come first served at IANA
4000 - 4999 | | No | Yes | Available for applications
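Client libraries surface these codes when the connection closes. With the Python websockets package, for instance, the received close code can be read from the ConnectionClosed exception (a sketch; attribute names vary across websockets versions):

import websockets

async def receive_loop(socket) -> None:
    try:
        while True:
            print(await socket.recv())
    except websockets.exceptions.ConnectionClosedOK:
        # 1000 / 1001: regular shutdown, nothing to recover
        print("connection closed normally")
    except websockets.exceptions.ConnectionClosedError as exc:
        # Abnormal closure: log the close code and decide whether to reconnect
        print(f"connection closed with code {exc.code}: {exc.reason}")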
Code samples
import asyncio
import websockets
import json
import base64

ERROR_KEY = 'error'
TYPE_KEY = 'type'
TRANSCRIPTION_KEY = 'transcription'
LANGUAGE_KEY = 'language'

# retrieve gladia key
gladiaKey = ''  # replace with your gladia key
if not gladiaKey:
    print('You must provide a gladia key. Go to app.gladia.io')
    exit(1)
else:
    print('using the gladia key : ' + gladiaKey)

# connect to api websocket
gladiaUrl = "wss://api.gladia.io/audio/text/audio-transcription"

async def send_audio(socket):
    # Configure stream with a configuration message
    configuration = {
        "x_gladia_key": gladiaKey,
        # "model_type": "accurate"  # <- slower but more accurate model, useful if you need precise transcriptions (addresses, for example)
    }
    await socket.send(json.dumps(configuration))

    # Once the initial message is sent, send audio data
    file = '../../../data/anna-and-sasha-16000.wav'
    with open(file, 'rb') as f:
        fileSync = f.read()
    base64Frames = base64.b64encode(fileSync).decode('utf-8')

    partSize = 20000  # The size of each part
    numberOfParts = -(-len(base64Frames) // partSize)  # equivalent of math.ceil

    # Split the audio data into parts and send them sequentially
    for i in range(numberOfParts):
        start = i * partSize
        end = min((i + 1) * partSize, len(base64Frames))
        part = base64Frames[start:end]
        # Delay between sending parts (500 milliseconds in this case)
        await asyncio.sleep(0.5)
        message = {
            'frames': part
        }
        await socket.send(json.dumps(message))
    await asyncio.sleep(2)
    print("final closing")

# get ready to receive transcriptions
async def receive_transcription(socket):
    while True:
        response = await socket.recv()
        utterance = json.loads(response)
        if utterance:
            if ERROR_KEY in utterance:
                print(f"{utterance[ERROR_KEY]}")
                break
            else:
                print(f"{utterance[TYPE_KEY]}: ({utterance[LANGUAGE_KEY]}) {utterance[TRANSCRIPTION_KEY]}")
        else:
            print('empty, waiting for next utterance...')

# run both tasks concurrently
async def main():
    async with websockets.connect(gladiaUrl) as socket:
        send_task = asyncio.create_task(send_audio(socket))
        receive_task = asyncio.create_task(receive_transcription(socket))
        await asyncio.gather(send_task, receive_task)

asyncio.run(main())
Working with TypeScript
Streaming a file
import WebSocket from "ws";
import fs from "fs";
import { resolve } from "path";
import { exit } from "process";

const ERROR_KEY = "error";
const TYPE_KEY = "type";
const TRANSCRIPTION_KEY = "transcription";
const LANGUAGE_KEY = "language";

const sleep = (delay: number): Promise<void> =>
  new Promise((f) => setTimeout(f, delay));

// retrieve gladia key
const gladiaKey = process.argv[2];
if (!gladiaKey) {
  console.error("You must provide a gladia key. Go to app.gladia.io");
  exit(1);
} else {
  console.log("using the gladia key : " + gladiaKey);
}

// connect to api websocket
const gladiaUrl = "wss://api.gladia.io/audio/text/audio-transcription";
const socket = new WebSocket(gladiaUrl);

// get ready to receive transcriptions
socket.on("message", (event: any) => {
  if (event) {
    const utterance = JSON.parse(event.toString());
    if (Object.keys(utterance).length !== 0) {
      if (ERROR_KEY in utterance) {
        console.error(`${utterance[ERROR_KEY]}`);
        socket.close();
      } else {
        console.log(
          `${utterance[TYPE_KEY]}: (${utterance[LANGUAGE_KEY]}) ${utterance[TRANSCRIPTION_KEY]}`
        );
      }
    }
  } else {
    console.log("empty ...");
  }
});

socket.on("error", (error: WebSocket.ErrorEvent) => {
  console.log(error.message);
});

socket.on("open", async () => {
  // Configure stream with a configuration message
  const configuration = {
    x_gladia_key: gladiaKey,
    // "model_type": "accurate"
  };
  socket.send(JSON.stringify(configuration));

  // Once the initial message is sent, send audio data
  const file = resolve("../data/anna-and-sasha-16000.wav");
  const fileSync = fs.readFileSync(file);
  const newBuffers = Buffer.from(fileSync);
  // Drop the first 44 bytes (the standard RIFF/WAV header) so only raw PCM is streamed
  const segment = newBuffers.slice(44, newBuffers.byteLength);
  const base64Frames = segment.toString("base64");
  const partSize = 20000; // The size of each part
  const numberOfParts = Math.ceil(base64Frames.length / partSize);

  // Split the audio data into parts and send them sequentially
  for (let i = 0; i < numberOfParts; i++) {
    const start = i * partSize;
    const end = Math.min((i + 1) * partSize, base64Frames.length);
    const part = base64Frames.substring(start, end);
    // Delay between sending parts (500 milliseconds in this case)
    await sleep(500);
    const message = {
      frames: part,
    };
    // console.log(part)
    socket.send(JSON.stringify(message));
  }
  await sleep(2000);
  console.log("final closing");
  socket.close();
});
Streaming a microphone
import { exit } from "process";
import WebSocket from "ws";
import mic from "mic";

const ERROR_KEY = "error";
const TYPE_KEY = "type";
const TRANSCRIPTION_KEY = "transcription";
const LANGUAGE_KEY = "language";
const SAMPLE_RATE = 16_000;

// retrieve gladia key
const gladiaKey = process.argv[2];
if (!gladiaKey) {
  console.error("You must provide a gladia key. Go to app.gladia.io");
  exit(1);
} else {
  console.log("using the gladia key : " + gladiaKey);
}

// connect to api websocket
const gladiaUrl = "wss://api.gladia.io/audio/text/audio-transcription";
const socket = new WebSocket(gladiaUrl);

// get ready to receive transcriptions
socket.on("message", (event: any) => {
  if (event) {
    const utterance = JSON.parse(event.toString());
    if (Object.keys(utterance).length !== 0) {
      if (ERROR_KEY in utterance) {
        console.error(`${utterance[ERROR_KEY]}`);
        socket.close();
      } else {
        console.log(
          `${utterance[TYPE_KEY]}: (${utterance[LANGUAGE_KEY]}) ${utterance[TRANSCRIPTION_KEY]}`
        );
      }
    }
  } else {
    console.log("empty ...");
  }
});

socket.on("error", (error: WebSocket.ErrorEvent) => {
  console.log(error.message);
});

socket.on("open", async () => {
  // Configure stream with a configuration message
  const configuration = {
    x_gladia_key: gladiaKey,
    sample_rate: SAMPLE_RATE,
    // "model_type": "accurate"
  };
  socket.send(JSON.stringify(configuration));

  // create microphone instance
  const microphoneInstance = mic({
    rate: SAMPLE_RATE,
    channels: "1",
  });
  const microphoneInputStream = microphoneInstance.getAudioStream();
  microphoneInputStream.on("data", function (data: any) {
    const base64 = data.toString("base64");
    socket.send(JSON.stringify({ frames: base64 }));
  });
  microphoneInputStream.on("error", function (err: any) {
    console.log("Error in Input Stream: " + err);
  });
  microphoneInstance.start();
});