Because speech-to-text models are trained on general vocabulary, under-represented words such as brand names, proper nouns, or domain-specific terms are often transcribed incorrectly. Custom vocabulary is a post-processing operation that compares phonemes between the transcript and your pronunciation entries. When the phonetic match is close enough, the transcribed text is replaced with your term.
If you already know which text variants the model produces and only need to normalize spelling, use Custom spelling instead. Custom spelling relies on literal matching, not phonemes.

How it works

Custom vocabulary operates at the text level and is based on phoneme similarity. Once the transcription is generated, Gladia converts both the transcribed words and your vocabulary entries into phonemes, then compares them.
The intensity controls how aggressively replacements are applied: a higher intensity replaces words more readily (wider phoneme matching), while a lower intensity requires a closer phoneme match before a replacement is made.
The pronunciations field lets you provide plain-text alternative spellings that reflect how the word actually sounds in speech. These are not phonetic notation: just write the word the way someone might naively spell it based on how it sounds, and Gladia converts these strings to phonemes internally. For example, if your term is “Nietzsche”, you might add ["Niche", "Neechee"] as pronunciations. This widens the phoneme net for that term without having to raise the intensity (which would increase false positives across the board).
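The interaction between intensity and pronunciations can be illustrated with a toy model. This is only a sketch under loud assumptions: Gladia's actual phoneme conversion and scoring are not documented here, so the code substitutes plain string similarity from Python's difflib for phoneme comparison, and the rule "required similarity = 1 - intensity" is hypothetical. The function names are made up for illustration.

```python
from difflib import SequenceMatcher


def sound_key(text: str) -> str:
    """Crude stand-in for phoneme conversion (NOT Gladia's algorithm):
    keep lowercase letters only."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, sound_key(a), sound_key(b)).ratio()


def should_replace(heard: str, entry: dict, default_intensity: float = 0.4) -> bool:
    """Decide whether `heard` would be replaced by entry["value"].

    Assumption for illustration: a higher intensity lowers the
    similarity required for a replacement (threshold = 1 - intensity).
    """
    candidates = [entry["value"], *entry.get("pronunciations", [])]
    intensity = entry.get("intensity", default_intensity)
    best = max(similarity(heard, candidate) for candidate in candidates)
    return best >= 1.0 - intensity


plain = {"value": "Nietzsche"}
with_variants = {"value": "Nietzsche", "pronunciations": ["Niche", "Neechee"]}

# At a low intensity, only the entry with a matching pronunciation fires.
print(should_replace("Neechee", plain, default_intensity=0.1))
print(should_replace("Neechee", with_variants, default_intensity=0.1))
```

In this toy model, adding "Neechee" as a pronunciation lets the entry match at a strict setting, whereas the bare entry would need a much higher intensity, which would also loosen matching for every other word.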

When to use custom vocabulary vs. custom spelling

  • Use Custom spelling when the model outputs a recognizable but wrong form. It applies literal string matching on variants you list (e.g. “data-science” → “Data Science”). List every close variant the model might output.
  • Use Custom vocabulary when the model outputs garbled or sound-alike text. It applies phoneme-based matching on entries you define (e.g. “le vin” / “levine” → “Levain”). Add pronunciations for each spelling the model might produce.

| | Custom spelling | Custom vocabulary |
|---|---|---|
| Matches on | Exact text in the transcript | How words sound |
| Best for | Wrong spelling, punctuation, formatting | Phonetically similar mis-transcriptions |
| You provide | All the words that the model outputs wrongly | value, pronunciations, intensity |

Rule of thumb: start with a transcription run without any custom vocabulary. Look at what the output actually says. If the word appears but is just misspelled, custom spelling is the simpler and safer fix. If the word is completely garbled, that’s when custom vocabulary is the right tool.
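As a first-pass triage of the terms you noted in a baseline run, a small helper can flag the clear-cut cases. This is a hypothetical heuristic, not part of Gladia's API: it suggests custom spelling only when the observed text differs from the target in case, spacing, or punctuation, and defaults to custom vocabulary otherwise. Recognizable misspellings with different letters still need a human decision.

```python
import re


def squash(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def suggest_feature(observed: str, target: str) -> str:
    """Suggest which feature to try first for one mis-transcribed term.

    Heuristic assumption: if only formatting differs, literal matching
    (custom spelling) is enough; anything else is treated as a
    sound-alike candidate for custom vocabulary.
    """
    if squash(observed) == squash(target):
        return "custom_spelling"
    return "custom_vocabulary"


for observed, target in [("data-science", "Data Science"), ("le vin", "Levain")]:
    print(f"{observed!r} -> {target!r}: try {suggest_feature(observed, target)}")
```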

Example configuration

{
  "audio_url": "YOUR_AUDIO_URL",
  "custom_vocabulary": true,
  "custom_vocabulary_config": {
    "vocabulary": [
      "Gladia",
      {"value": "Solaria"},
      {
        "value": "Salesforce",
        "pronunciations": ["sell force", "sale forces"],
        "intensity": 0.5,
        "language": "en"
      }
    ],
    "default_intensity": 0.4
  }
}
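
The same configuration can be assembled programmatically. The sketch below builds the request body shown above and prepares, but does not send, an HTTP request. The endpoint URL and the x-gladia-key header are assumptions that do not appear on this page, so check them against the API reference before use; build_payload and build_request are illustrative helper names.

```python
import json
import urllib.request

# Assumption: endpoint and auth header are not stated on this page.
GLADIA_URL = "https://api.gladia.io/v2/pre-recorded"


def build_payload(audio_url: str, vocabulary: list, default_intensity: float = 0.4) -> dict:
    """Build a request body that mirrors the example configuration."""
    return {
        "audio_url": audio_url,
        "custom_vocabulary": True,
        "custom_vocabulary_config": {
            "vocabulary": vocabulary,
            "default_intensity": default_intensity,
        },
    }


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Prepare a POST request; nothing is sent until urlopen() is called."""
    return urllib.request.Request(
        GLADIA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-gladia-key": api_key},
        method="POST",
    )


vocabulary = [
    "Gladia",
    {"value": "Solaria"},
    {
        "value": "Salesforce",
        "pronunciations": ["sell force", "sale forces"],
        "intensity": 0.5,
        "language": "en",
    },
]
payload = build_payload("YOUR_AUDIO_URL", vocabulary)
request = build_request(payload, "YOUR_API_KEY")
print(json.dumps(payload, indent=2))
# To actually submit: urllib.request.urlopen(request)
```

Serializing with json.dumps also avoids hand-written JSON mistakes such as trailing commas.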

Parameter reference

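vocabulary
(string | object)[]
Terms to match. Each entry is either a plain string or an object with a value and optional pronunciations, intensity, and language fields, as in the example configuration.
default_intensity
number
Global intensity applied to entries that do not set their own. We suggest 0.4–0.6: raise it if terms are missed, lower it if unrelated words get replaced.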
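
A client-side check can catch malformed entries before a request is sent. The validator below is a sketch based only on the fields shown in the example configuration; the 0 to 1 range for intensity is an assumption, and validate_vocabulary is a hypothetical helper, not part of any Gladia SDK.

```python
from numbers import Real


def _valid_intensity(value) -> bool:
    # Assumption: intensities are numbers between 0 and 1 inclusive.
    return not isinstance(value, bool) and isinstance(value, Real) and 0 <= value <= 1


def validate_vocabulary(vocabulary: list, default_intensity: float = 0.4) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if not _valid_intensity(default_intensity):
        problems.append("default_intensity must be a number between 0 and 1")
    for index, entry in enumerate(vocabulary):
        label = f"vocabulary[{index}]"
        if isinstance(entry, str):
            if not entry.strip():
                problems.append(f"{label}: empty string")
            continue
        if not isinstance(entry, dict):
            problems.append(f"{label}: must be a string or an object")
            continue
        value = entry.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{label}: missing or empty value")
        variants = entry.get("pronunciations", [])
        if not isinstance(variants, list) or not all(isinstance(v, str) for v in variants):
            problems.append(f"{label}: pronunciations must be a list of strings")
        if "intensity" in entry and not _valid_intensity(entry["intensity"]):
            problems.append(f"{label}: intensity must be a number between 0 and 1")
    return problems


good = ["Gladia", {"value": "Salesforce", "pronunciations": ["sell force"], "intensity": 0.5}]
bad = [{"pronunciations": ["x"]}, 3, {"value": "A", "intensity": 2}]
print(validate_vocabulary(good))
print(validate_vocabulary(bad))
```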

Tuning tips

  • Start at default_intensity 0.4 and adjust per entry only when needed.
  • Add pronunciations before raising intensity: variants widen matching for that one term without loosening every comparison.
  • Keep lists focused — every transcribed word is compared against every entry; long lists increase false positives.
  • Move stable misspellings to custom spelling when the model already outputs a recognizable (but wrong) form.
  1. Transcribe without custom vocabulary and note mis-transcribed terms.
  2. Route each term: garbled or phonetically wrong output → custom vocabulary; recognizable but misspelled → custom spelling.
  3. Add entries with pronunciations and default_intensity around 0.4–0.6.
  4. Transcribe again — confirm targets appear and scan for false positives.
  5. Refine: lower intensity, tighten pronunciations, or move stubborn terms to custom spelling.
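
Step 4 of the workflow above can be partly automated by comparing the baseline transcript with the tuned one. The sketch below assumes you already have both transcripts as plain strings (how you extract them from the API response is not covered here). It counts how often each target term appears and lists every span that changed between runs so that a person can review them for false positives. The helper names are illustrative.

```python
import re
from difflib import SequenceMatcher


def count_terms(transcript: str, terms: list) -> dict:
    """Count case-insensitive whole-word occurrences of each term."""
    return {
        term: len(re.findall(r"\b" + re.escape(term) + r"\b", transcript, flags=re.IGNORECASE))
        for term in terms
    }


def changed_spans(baseline: str, tuned: str) -> list:
    """Return (before, after) pairs for every word span that differs."""
    a, b = baseline.split(), tuned.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


baseline = "we use sell force and the sales floor daily"
tuned = "we use Salesforce and the Salesforce daily"
terms = ["Salesforce", "Gladia"]

counts = count_terms(tuned, terms)
missing = [term for term in terms if counts[term] == 0]
print("counts:", counts)
print("still missing:", missing)
for before, after in changed_spans(baseline, tuned):
    print(f"review: {before!r} -> {after!r}")
```

In this made-up example the second change ("sales floor" became "Salesforce") is the kind of false positive that suggests lowering the intensity or tightening the pronunciations for that entry.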