The Speaker Diarization feature of the Gladia API enables the identification of different speakers within an audio file and transcribes what each one says. When this feature is activated, the transcript produced will include a series of utterances, with each utterance representing a continuous speech segment from an individual speaker.

If you have an audio setup where 1 speaker = 1 audio channel, we already perform a mechanical diarization by providing channel of each utterance, so you don’t need to enable this feature.

Diarization example

Don’t forget to update YOUR_GLADIA_API_TOKEN and YOUR_AUDIO_URL to match yours.
async function makeFetchRequest(url: string, options: any) {
  const response = await fetch(url, options);
  return response.json();

async function pollForResult(resultUrl: string, headers: any) {
  while (true) {
    console.log("Polling for results...");
    const pollResponse = await makeFetchRequest(resultUrl, { headers });

    if (pollResponse.status === "done") {
      console.log("- Transcription done: \n");
      const utterances = pollResponse.result.transcription.utterances;
      utterances.forEach((utterance: any) => {
        const speaker = utterance.speaker;
        const start = utterance.start.toFixed(2);
        const end = utterance.end.toFixed(2);
        const text = utterance.text;
        console.log(`Speaker ${speaker} (${start}s -> ${end}s): "${text}"`);
    } else {
      console.log("Transcription status : ", pollResponse.status);
      await new Promise((resolve) => setTimeout(resolve, 1000));

async function startTranscription() {
  const gladiaKey = "YOUR_GLADIA_API_TOKEN";
  const requestData = {
    diarization: true,
  const gladiaUrl = "";
  const headers = {
    "x-gladia-key": gladiaKey,
    "Content-Type": "application/json",

  console.log("- Sending initial request to Gladia API...");
  const initialResponse = await makeFetchRequest(gladiaUrl, {
    method: "POST",
    body: JSON.stringify(requestData),

  console.log("Initial response with Transcription ID :", initialResponse);

  if (initialResponse.result_url) {
    await pollForResult(initialResponse.result_url, headers);


In the result, you’ll get, for each utterance, a speaker property that you can use to determine who’s speaking. In the above code, we’re using the speaker parameter alongside with timestamps and utterance text to get an output like this:

Speaker 0 (0.21s -> 4.71s): "First utterance"
Speaker 1 (5.55s -> 7.77s): "Second one"
Speaker 2 (8.53s -> 10.81s): "....."

Set number of speakers

If you already know the number of speakers in your sent audio, you can improve the diarization performance by providing this on the diarization_config parameter:

min_speakersThe minimum number of speaker present in the audio
min_speakersThe maximum number of speaker present in the audio
number_of_speakersThe exact number of speaker present in the audio
request data
  "audio_url": "YOUR_AUDIO_URL",
  "diarization_config": {
    // either set a range
    "min_speakers": 2,
    "max_speakers": 6,
    // or you know your exact number of speaker 
    "number_of_speakers": 5