Getting started

Initiate your real-time session

First, call the endpoint and pass your configuration. It’s important to correctly define the properties encoding, sample_rate, bit_depth and channels as we need them to parse your audio chunks.

  const response = await fetch('https://api.gladia.io/v2/live', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Gladia-Key': '<YOUR_GLADIA_API_KEY>',
    },
    body: JSON.stringify({
      encoding: 'wav/pcm',
      sample_rate: 16000,
      bit_depth: 16,
      channels: 1,
    }),
  });
  if (!response.ok) {
    // Look at the error message
    // It might be a configuration issue
    console.error(`${response.status}: ${(await response.text()) || response.statusText}`);
    process.exit(response.status);
  }

  const {id, url} = await response.json();

You’ll receive a response with a WebSocket URL to connect to. If you loose connection, you can reconnect to that same URL and resume where you left off. Here’s an example of a response:

{
  "id": "636c70f6-92c1-4026-a8b6-0dfe3ecf826f",
  "url": "wss://api.gladia.io/v2/live?token=636c70f6-92c1-4026-a8b6-0dfe3ecf826f"
}

Connect to the WebSocket

Now that you’ve initiated the session and have the URL, you can connect to the WebSocket using your preferred language/framework. Here’s an example in JavaScript:

  import WebSocket from "ws";

  const socket = new WebSocket(url);

  socket.addEventListener("open", function() {
    // Connection is opened. You can start sending audio chunks.
  });

  socket.addEventListener("error", function(error) {
    // An error occurred during the connection.
    // Check the error to understand why
  });

  socket.addEventListener("close", function({code, reason}) {
    // The connection has been closed
    // If the "code" is equal to 1000, it means we closed intentionally the connection (after the end of the session for example).
    // Otherwise, you can reconnect to the same url.
  });

  socket.addEventListener("message", function(event) {
    // All the messages we are sending are in JSON format
    const message = JSON.parse(event.data.toString());
    console.log(message);
  });

Send audio chunks

You can now start sending us your audio chunks through the WebSocket. You can send them directly as binary, or in JSON by encoding your chunk in base64, like this:

  // as binary
  socket.send(buffer);

  // as json
  socket.send(JSON.stringify({
    type: 'audio_chunk',
    data: {
      chunk: buffer.toString("base64"),
    },
  }));

Read messages

During the whole session, we will send various messages through the WebSocket, the callback URL or webhooks. You can specify which kind of messages you want to receive in the initial configuration. See messages_config for WebSocket messages and callback_config for callback messages. Here’s an example of how to read a transcript message received through a WebSocket:

  socket.addEventListener("message", function(event) {
    // All the messages we are sending are in JSON format
    const message = JSON.parse(event.data.toString());
    if (message.type === 'transcript' && message.data.is_final) {
      console.log(`${message.data.id}: ${message.data.utterance.text}`)
    }
  });

Sending multiple audio tracks in real-time

If you have multiple audio sources (like different participants in a conversation) that you need to transcribe simultaneously, you can merge these separate audio tracks into a single multi-channel audio stream and send it over one WebSocket connection.

Merging multiple audio tracks into one multi-channel WebSocket

This approach allows you to consolidate multiple audio tracks from different participants into a single WebSocket connection while maintaining the ability to identify each speaker through their dedicated channel. Benefits:

Reduce the number of WebSocket connections from multiple to just one
Maintain speaker identity through channel mapping
Simplify synchronization of audio streams from multiple participants
Reduce network overhead and connection management complexity

Creating a multi-channel audio stream

To combine multiple audio tracks into a single multi-channel stream, you need to interleave the audio samples. Here’s a TypeScript function that merges multiple audio buffers into a single multi-channel buffer:

  export function interleaveAudio(
    channelsData: Buffer[],
    bitDepth = 16,
  ): Buffer {
    const nbChannels = channelsData.length;
    if (nbChannels === 1) {
      return channelsData[0];
    }

    const bytesPerSample = bitDepth / 8;
    const samplesPerChannel = channelsData[0].byteLength / bytesPerSample;
    const audio = Buffer.alloc(nbChannels * samplesPerChannel * bytesPerSample);

    for (let i = 0; i < samplesPerChannel; i++) {
      for (let j = 0; j < nbChannels; j++) {
        const sample = channelsData[j].subarray(
          i * bytesPerSample,
          (i + 1) * bytesPerSample,
        );
        audio.set(sample, (i * nbChannels + j) * bytesPerSample);
      }
    }

    return audio;
  }

Example use case

Consider a scenario with three participants in a room: Sami, Maxime, and Mark. Instead of opening three separate WebSocket connections (one for each participant), you can merge their audio tracks and send them over a single WebSocket:

Collect audio buffers from each participant
Merge them into a single multi-channel audio stream using the interleaveAudio function
Specify the number of channels in your API configuration (3 in this case)
Send the combined audio over a single WebSocket

  // Collect audio buffers from each participant
  const samiAudio = getSamiAudioBuffer();
  const maximeAudio = getMaximeAudioBuffer();
  const markAudio = getMarkAudioBuffer();
  
  // Merge into a multi-channel audio
  // Channel ordering: [0]=Sami, [1]=Maxime, [2]=Mark
  const channelsData = [samiAudio, maximeAudio, markAudio];
  const mergedAudio = interleaveAudio(channelsData, 16); // 16-bit depth
  
  // Initialize a single WebSocket session with multi-channel config
  const response = await fetch('https://api.gladia.io/v2/live', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Gladia-Key': '<YOUR_GLADIA_API_KEY>',
    },
    body: JSON.stringify({
      encoding: 'wav/pcm',
      sample_rate: 16000,
      bit_depth: 16,
      channels: 3, // Specify the number of channels
    }),
  });
  
  const {url} = await response.json();
  const socket = new WebSocket(url);
  
  // Send the merged audio over a single WebSocket
  socket.addEventListener("open", function() {
    socket.send(mergedAudio);
  });

Understanding the response

When you send a multi-channel audio stream to Gladia, the channel order is preserved in the transcription results. Each transcription message will include a channel field that indicates which audio channel (and thus which participant) the transcription belongs to:

{
  "type": "transcript",
  "session_id": "de70f43f-3041-46e0-892c-8e7f53800a22",
  "created_at": "2025-04-09T08:44:16.471Z",
  "data": {
    "id": "00_00000000",
    "utterance": {
      "text": "Hello, I'm Sami. I'm the first speaker",
      "start": 0.188,
      "end": 2.852,
      "language": "en",
      "channel": 0,  // This indicates the first channel (Sami)
    }
  }
}

{
  "type": "transcript",
  "session_id": "de70f43f-3041-46e0-892c-8e7f53800a22",
  "created_at": "2025-04-09T08:44:19.693Z",
  "data": {
    "id": "01_00000000",
    "utterance": {
      "text": "And this is Maxime, nice to meet you, I am the second speaker.",
      "start": 3.468,
      "end": 6.132,
      "language": "en",
      "channel": 1,  // This indicates the second channel (Maxime)
    }
  }
}

{
  "type": "transcript",
  "session_id": "a587386c-8755-4c67-ad67-d2c304eb8a49",
  "created_at": "2025-04-09T08:56:16.370Z",
  "data": {
    "id": "00_00000002",
    "utterance": {
      "text": "And this is Mark",
      "start": 8.614,
      "end": 10.574,
      "language": "en",
      "channel": 2,  // This indicates the third channel (Mark)
    }
  }
}

The channel numbers directly correspond to the order in which you added the audio tracks to the channelsData array:

Channel 0 → Sami (first in the array)
Channel 1 → Maxime (second in the array)
Channel 2 → Mark (third in the array)

Remember to keep track of channel assignments in your application to properly attribute transcriptions to the correct participants.

As mentioned in the Multiple channels section, transcribing a multi-channel audio stream will be billed based on the total duration multiplied by the number of channels.

Read messages

During the whole session, we will send various messages through the WebSocket, the callback URL or webhooks. You can specify which kind of messages you want to receive in the initial configuration. See for WebSocket messages and for callback messages. Here’s an example of how to read a message received through a WebSocket:

  socket.addEventListener("message", function(event) {
    // All the messages we are sending are in JSON format
    const message = JSON.parse(event.data.toString());
    if (message.type === 'transcript' && message.data.is_final) {
      console.log(`${message.data.id}: ${message.data.utterance.text}`)
    }
  });

Stop the recording

Once you’re done, send us the stop_recording message. We will process remaining audio chunks and start the post-processing phase, in which we put together the final audio file and results with the add-ons you requested. You’ll receive a message at every step of the process in the WebSocket, or in the callback if configured. Once the post-processing is done, the WebSocket is closed with a code 1000.

  socket.send(JSON.stringify({
    type: "stop_recording",
  }));

Instead of sending the stop_recording message, you can also close the WebSocket with the code 1000. We will still do the post-processing in background and send you the messages through the callback you defined.

  socket.close(1000);

Get the final results

If you want to get the complete result, you can call the GET /v2/live/:id endpoint with the id you received from the initial request.

  const response = await fetch(`https://api.gladia.io/v2/live/${id}`, {
    method: 'GET',
    headers: {
      'X-Gladia-Key': '<YOUR_GLADIA_API_KEY>',
    },
  });
  if (!response.ok) {
    // Look at the error message
    // It might be a configuration issue
    console.error(`${response.status}: ${(await response.text()) || response.statusText}`)
    return;
  }

  const result = await response.json();
  console.log(result)

Full sample

You can find a complete sample in our Github repository:

Introduction

Asynchronous Speech-to-Text

Real-time Speech-to-Text

Audio Intelligence

Limits & Specifications

Guides

Integration

Getting started

Initiate your real-time session

Connect to the WebSocket

Send audio chunks

Read messages

Sending multiple audio tracks in real-time

Merging multiple audio tracks into one multi-channel WebSocket

Creating a multi-channel audio stream

Example use case

Understanding the response

Read messages

Stop the recording

Get the final results

Full sample

Introduction

Asynchronous Speech-to-Text

Real-time Speech-to-Text

Audio Intelligence

Limits & Specifications

Guides

Integration

​Initiate your real-time session

​Connect to the WebSocket

​Send audio chunks

​Read messages

​Sending multiple audio tracks in real-time

​Merging multiple audio tracks into one multi-channel WebSocket

​Creating a multi-channel audio stream

​Example use case

​Understanding the response

​Read messages

​Stop the recording

​Get the final results

​Full sample

Initiate your real-time session

Connect to the WebSocket

Send audio chunks

Read messages

Sending multiple audio tracks in real-time

Merging multiple audio tracks into one multi-channel WebSocket

Creating a multi-channel audio stream

Example use case

Understanding the response

Read messages

Stop the recording

Get the final results

Full sample