What are SRT and VTT Caption files?

SRT and VTT are two widely used plain-text formats for video subtitles and captions. SRT stands for SubRip Subtitle, while VTT (also written WebVTT) stands for Web Video Text Tracks. Both formats contain time-stamped text entries that a player displays in sync with the video.

SRT Files

SRT files have the extension ".srt" and contain plain-text entries separated by blank lines. Each entry has three parts:

  • an index number, starting at 1,
  • the timecode line, giving the start and end times in the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and
  • the subtitle text, on one or more lines.
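To make the three-part structure concrete, here is a minimal Python sketch that splits SRT text into entries. It assumes entries are separated by blank lines and that timecodes look like the ones in the API output on this page (HH:MM:SS,mmm --> HH:MM:SS,mmm). The names `SrtEntry` and `parse_srt` are hypothetical helpers written for this illustration, not part of any Gladia SDK.

```python
import re
from dataclasses import dataclass

# Matches an SRT timecode line such as "00:00:01,000 --> 00:00:05,000".
TIMECODE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class SrtEntry:  # hypothetical container for one subtitle entry
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> list[SrtEntry]:
    """Split SRT text into entries; blank lines separate entries."""
    entries = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip incomplete entries
        match = TIMECODE.match(lines[1].strip())
        if match is None:
            continue  # skip entries without a valid timecode line
        parts = match.groups()
        entries.append(
            SrtEntry(
                index=int(lines[0].strip()),
                start=_to_seconds(*parts[:4]),
                end=_to_seconds(*parts[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return entries


sample = "1\n00:00:01,000 --> 00:00:05,000\nHello\n\n2\n00:00:05,000 --> 00:00:10,500\nWorld\n"
for entry in parse_srt(sample):
    print(entry.index, entry.start, entry.end, entry.text)
# prints:
# 1 1.0 5.0 Hello
# 2 5.0 10.5 World
```

Malformed entries are skipped rather than raising an error, which is a design choice for this sketch; a stricter parser might reject the whole file instead.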

Example of SRT file

1
00:00:01,000 --> 00:00:05,000
Hello, this is a fake SRT file.

2
00:00:05,000 --> 00:00:10,000
It is generated by an AI language model called ChatGPT.

3
00:00:10,000 --> 00:00:15,000
This SRT file is not associated with any actual media content.

4
00:00:15,000 --> 00:00:20,000
It is solely created for demonstration purposes.

5
00:00:20,000 --> 00:00:25,000
Thank you for watching!

Calling the API

curl -X 'POST' \
    'https://api.gladia.io/audio/text/audio-transcription/' \
    -H 'accept: application/json' \
    -H 'x-gladia-key: fd5f6819-e2a3-474d-a18b-326f03e1c681' \
    -H 'Content-Type: multipart/form-data' \
    -F "audio_url=http://files.gladia.io/example/audio-transcription/split_infinity.wav" \
    -F "output_format=srt"

Expected Results

{
  "prediction": "1\n00:00:00,900 --> 00:00:02,600\nSplit infinity\n\n2\n00:00:02,120 --> 00:00:05,190\nin a time when less is more\n\n3\n00:00:05,510 --> 00:00:20,390\nWhere too much is never enough, there is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet\n",
  "prediction_raw": {
    "metadata": {
      "total_speech_duration": 19.919999999999998,
      "total_speech_duration_channel_0": 19.919999999999998,
      "audioConversionTime": 0.35996150970458984,
      "vadTime": 0.011051177978515625,
      "inferenceTime": 2.076643466949463,
      "diarizationTime": 0.000002384185791015625,
      "totalTranscriptionTime": 2.4476585388183594,
      "nbSilentChannels": 0,
      "nbSimilarChannels": 0,
      "providedFileMetadata": {
        "nb channels": 1,
        "sample rate": 44100,
        "sample width": 16,
        "original file type": "audio"
      }
    },
    "transcription": "1\n00:00:00,900 --> 00:00:02,600\nSplit infinity\n\n2\n00:00:02,120 --> 00:00:05,190\nin a time when less is more\n\n3\n00:00:05,510 --> 00:00:20,390\nWhere too much is never enough, there is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet\n",
    "chapterization": "not_activated",
    "summarization": "not_activated"
  }
}
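The captions arrive as a single JSON string in the "prediction" field, with entries separated by escaped newlines. The Python sketch below shows one way to turn that field into a caption file on disk. It assumes the response body has the shape shown above; `save_captions` and the output file name are hypothetical, and the response used here is a shortened stand-in rather than real API output.

```python
import json
import tempfile
from pathlib import Path


def save_captions(response_body: str, path: Path) -> str:
    """Write the "prediction" field of a JSON response body to a caption file."""
    payload = json.loads(response_body)
    # json.loads has already turned the "\n" escapes into real newlines.
    captions = payload["prediction"]
    path.write_text(captions, encoding="utf-8")
    return captions


# Shortened stand-in for the response shown above.
body = json.dumps(
    {"prediction": "1\n00:00:00,900 --> 00:00:02,600\nSplit infinity\n"}
)
out_path = Path(tempfile.gettempdir()) / "split_infinity.srt"
text = save_captions(body, out_path)
print(text.splitlines()[2])  # prints: Split infinity
```

The same function works for VTT output, since only the file extension and the content of "prediction" differ.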

VTT Files

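VTT files have the extension ".vtt" and are similar to SRT files but use a slightly different format. A VTT file starts with a WEBVTT header line, and its timecodes use a period instead of a comma before the milliseconds. Each entry consists of a timecode for when the subtitle should appear and the subtitle text, plus optional styling, such as text color and background color, which is defined in STYLE blocks and applied through classes in the subtitle text.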
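Because the two formats are so close, unstyled SRT text can be converted to VTT with very little code. The Python sketch below assumes only the two differences visible in the examples on this page: the WEBVTT header line and a period instead of a comma before the milliseconds. `srt_to_vtt` is a hypothetical helper, and it does not handle styling or other VTT-only features.

```python
import re

# An SRT timestamp (comma before the milliseconds), e.g. "00:00:05,190".
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to a minimal VTT document."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timecode lines are rewritten, so commas in the
            # subtitle text itself are left untouched.
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)
        lines.append(line)
    # The header must be followed by a blank line before the first entry.
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt_sample = "1\n00:00:01,000 --> 00:00:05,000\nHello, world\n"
print(srt_to_vtt(srt_sample))
```

The SRT index numbers are kept as they are; in VTT they act as optional cue identifiers, as in the API output shown further down.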

Example of VTT file

WEBVTT

STYLE
::cue(.red) {
color: #ff0000;
text-shadow: -1px -1px 0 #000, 1px -1px 0 #000, -1px 1px 0 #000, 1px 1px 0 #000;
}
::cue(.bold) {
font-weight: bold;
}

NOTE This is a fake VTT file with color and other options

00:00:01.000 --> 00:00:05.000
<c.red>Hello, this is a <c.bold>fake</c> VTT file.</c>

00:00:05.000 --> 00:00:10.000
<c.bold>It is generated by an AI language model called ChatGPT.</c>

00:00:10.000 --> 00:00:15.000
This VTT file is not associated with any actual media content.

00:00:15.000 --> 00:00:20.000
<c.red.bold>It is solely created for demonstration purposes.</c>

NOTE This VTT file showcases the use of the ::cue() pseudo-element to apply different styles to the captions based on their classes, and the use of the NOTE keyword to add comments.

📘

Gladia does not apply any formatting in its VTT output; the example above only shows what VTT can do that SRT cannot.

Calling the API

curl -X 'POST' \
    'https://api.gladia.io/audio/text/audio-transcription/' \
    -H 'accept: application/json' \
    -H 'x-gladia-key: fd5f6819-e2a3-474d-a18b-326f03e1c681' \
    -H 'Content-Type: multipart/form-data' \
    -F "audio_url=http://files.gladia.io/example/audio-transcription/split_infinity.wav" \
    -F "output_format=vtt"

Expected Results

{
  "prediction": "WEBVTT\n\n1\n00:00:00.090 --> 00:00:02.069\nSplit infinity\n\n2\n00:00:02.129 --> 00:00:05.190\nin a time when less is more\n\n3\n00:00:05.519 --> 00:00:20.399\nWhere too much is never enough, there is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet\n",
  "prediction_raw": {
    "metadata": {
      "total_speech_duration": 19.919999999999998,
      "total_speech_duration_channel_0": 19.919999999999998,
      "audioConversionTime": 0.27975964546203613,
      "vadTime": 0.0074002742767333984,
      "inferenceTime": 1.83445143699646,
      "diarizationTime": 0.0000045299530029296875,
      "totalTranscriptionTime": 2.1216158866882324,
      "nbSilentChannels": 0,
      "nbSimilarChannels": 0,
      "providedFileMetadata": {
        "nb channels": 1,
        "sample rate": 44100,
        "sample width": 16,
        "original file type": "audio"
      }
    },
    "transcription": "WEBVTT\n\n1\n00:00:00.090 --> 00:00:02.069\nSplit infinity\n\n2\n00:00:02.129 --> 00:00:05.190\nin a time when less is more\n\n3\n00:00:05.519 --> 00:00:20.399\nWhere too much is never enough, there is always hope for the future. The future can be read from the past. The past foreshadows the present, and the present hasn't been written yet\n",
    "chapterization": "not_activated",
    "summarization": "not_activated"
  }
}