This document describes how to connect to the Gladia WebSocket API for audio transcription. The API endpoint is available at wss://

To establish a WebSocket connection with the API, follow the steps below:

1. Request an API token from the Gladia team. This token is required to authenticate your requests to the API.

2. Use a WebSocket library of your choice to connect to the endpoint wss://

3. Once the WebSocket connection is established, start sending audio frames to the API for transcription. The audio frames must be sent as base64-encoded bytes.
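
As a minimal sketch of the encoding requirement, assuming the audio is already available as raw bytes, the helper below splits a buffer into fixed-size chunks and base64-encodes each one. The name encode_frames and the 8192-byte chunk size are illustrative assumptions; this document does not specify a required frame size.

```python
import base64
from typing import Iterator


def encode_frames(audio: bytes, chunk_size: int = 8192) -> Iterator[str]:
    """Yield base64-encoded text for consecutive chunks of raw audio bytes.

    chunk_size is an assumed value, not one mandated by the API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(audio), chunk_size):
        chunk = audio[offset:offset + chunk_size]
        # b64encode returns bytes; decode to str so the value fits in JSON.
        yield base64.b64encode(chunk).decode("ascii")
```

Each yielded string can be used as the frames value of one message.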

Each WebSocket message must contain the following fields:

  • x_gladia_key: The API token you received from the Gladia team. This is used for authentication purposes.
  • sample_rate: The sample rate of the audio being sent, in Hertz (Hz).
  • frames: The audio frames, base64-encoded.

Example message:

    {
        "x_gladia_key": "253035b9-c068-467a-9d88-5c95b9bd4ff8",
        "sample_rate": 16000,
        "frames": "W29iamVjdCBPYmplY3Rd"
    }

Note that the frames field in the example message contains the base64 encoding of the string "[object Object]"; it is placeholder data, not real audio.
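
The message format can be sketched in Python as follows. The function name build_message and the "YOUR_API_TOKEN" placeholder are assumptions made for illustration; only the three field names come from this document. Encoding the bytes of "[object Object]" reproduces the frames value shown in the example message.

```python
import base64
import json


def build_message(api_key: str, sample_rate: int, audio_chunk: bytes) -> str:
    """Serialize one audio chunk into a JSON message with the three fields."""
    return json.dumps({
        "x_gladia_key": api_key,
        "sample_rate": sample_rate,
        "frames": base64.b64encode(audio_chunk).decode("ascii"),
    })


# Placeholder key for illustration only; use the token issued to you.
message = build_message("YOUR_API_TOKEN", 16000, b"[object Object]")
print(message)
```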

The API will respond to each WebSocket message with a JSON object containing the transcription results.
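
A minimal sketch of this request/response exchange is shown below, assuming exactly one JSON reply per message as stated above. The function send_and_collect is hypothetical, and ws stands for any connection object with awaitable send and recv methods; a real client would likely also need timeouts and error handling.

```python
import json
from typing import Any, Iterable, List


async def send_and_collect(ws: Any, messages: Iterable[str]) -> List[dict]:
    """Send each message and read one JSON reply per message.

    `ws` is any object exposing awaitable send(str) and recv() methods.
    """
    results = []
    for message in messages:
        await ws.send(message)
        # Assumption: the server answers every message with one JSON object.
        reply = await ws.recv()
        results.append(json.loads(reply))
    return results
```

With the third-party websockets package, for example, the connection yielded by "async with websockets.connect(url) as ws:" provides these two methods, where url is the endpoint given at the top of this document.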

Example response:

    {
        "text": "This is the transcribed text.",
        "confidence": 0.85,
        "words": [
            {
                "text": "This",
                "start_time": 0.0,
                "end_time": 0.3,
                "confidence": 0.9
            },
            {
                "text": "is",
                "start_time": 0.3,
                "end_time": 0.5,
                "confidence": 0.8
            },
            {
                "text": "the",
                "start_time": 0.5,
                "end_time": 0.6,
                "confidence": 0.7
            },
            {
                "text": "transcribed",
                "start_time": 0.6,
                "end_time": 1.2,
                "confidence": 0.85
            },
            {
                "text": "text.",
                "start_time": 1.2,
                "end_time": 1.6,
                "confidence": 0.9
            }
        ]
    }

The response contains the following fields:

  • text: The transcribed text.
  • confidence: The confidence level of the transcription as a whole.
  • words: An array of word objects, one per word in the transcription. Each word object contains the text of the word, its start_time and end_time in the audio (in seconds), and the confidence level for that word.
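
As a sketch of how a client might consume such a response, the code below parses the JSON into small typed records and lists the words whose confidence falls below a threshold. The class names, the function names, and the 0.8 threshold are illustrative assumptions; only the field names are taken from the example response.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start_time: float  # seconds
    end_time: float    # seconds
    confidence: float


@dataclass
class Transcript:
    text: str
    confidence: float
    words: List[Word]


def parse_response(raw: str) -> Transcript:
    """Convert one JSON response string into a Transcript."""
    data = json.loads(raw)
    words = [
        Word(w["text"], w["start_time"], w["end_time"], w["confidence"])
        for w in data.get("words", [])
    ]
    return Transcript(data["text"], data["confidence"], words)


def low_confidence_words(transcript: Transcript, threshold: float = 0.8) -> List[str]:
    """Return the text of every word scored strictly below the threshold."""
    return [w.text for w in transcript.words if w.confidence < threshold]
```

Applied to the example response, low_confidence_words would flag only "the", the one word scored below 0.8.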