Benchmarking at a glance
0. Define your goal
Decide what “good” means for your product before comparing systems.
1. Normalize transcripts
Normalize both references and predictions before computing WER.
2. Compute WER
Measure substitutions, deletions, and insertions on normalized text.
3. Use the right dataset
Benchmark on audio that matches your real traffic and target users.
4. Interpret results carefully
Look beyond one average score and inspect meaningful slices.
0. Define your evaluation goal
Before comparing providers and models, define which aspects of performance matter most for your use case. The examples below show aspects that typically carry more weight in specific speech-to-text applications:
- Accuracy on noisy backgrounds: for contact centers, telephony, and field recordings.
- Speaker diarization quality: for meeting assistants and multi-speaker calls.
- Named entity accuracy: for workflows that extract people, organizations, phone numbers, or addresses.
- Domain-specific vocabulary handling: for medical, legal, or financial transcription.
- Timestamp accuracy: for media workflows that need readable, well-timed captions.
- Filler-word handling: for agentic workflows.
1. Normalize transcripts before computing WER
Normalization removes surface-form differences (casing, abbreviations, numeric rendering) so you compare apples to apples when judging transcription output.

| Reference | Prediction | Why raw WER is wrong |
|---|---|---|
| It's $50 | it is fifty dollars | Contraction and currency formatting differ, but the semantic content is the same. |
| Meet at Point 14 | meet at point fourteen | The normalization should preserve the numbered entity instead of collapsing it into an unrelated form. |
| Mr. Smith joined at 3:00 PM | mister smith joined at 3 pm | Honorific and timestamp formatting differ, but the transcript content is equivalent. |
One existing option is whisper-normalizer. It does not affect numbers, and it applies aggressive lowercasing and punctuation stripping.
Gladia’s recommended approach is gladia-normalization, our open-source library designed for transcript evaluation:
- It's $50 -> it is 50 dollars
- Meet at Point 14 -> meet at point 14
- Mr. Smith joined at 3:00 PM -> mister smith joined at 3 pm
gladia-normalization
Open-source transcript normalization library used before WER computation.
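
To make the idea concrete, here is a minimal sketch of a normalizer in plain Python. It is a simplified stand-in written for illustration, not the gladia-normalization or whisper-normalizer API, and the `CONTRACTIONS` and `ABBREVIATIONS` tables are hypothetical and deliberately tiny. It handles casing, punctuation, and a few contractions and honorifics only; numbers, currency, and timestamps are left to a real normalization library.

```python
import re

# Hypothetical, deliberately small lookup tables (illustration only).
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}
ABBREVIATIONS = {"mr.": "mister", "mrs.": "missus", "dr.": "doctor"}


def normalize(text: str) -> str:
    """Lowercase, expand a few known tokens, and strip punctuation."""
    # Unify curly and straight apostrophes before any lookup.
    text = text.lower().replace("\u2019", "'")
    tokens = []
    for token in text.split():
        token = ABBREVIATIONS.get(token, token)
        token = CONTRACTIONS.get(token, token)
        tokens.append(token)
    text = " ".join(tokens)
    # Replace everything except word characters, whitespace, and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


reference = normalize("Mr. Smith said it's fine.")
prediction = normalize("mister smith said it is fine")
print(reference == prediction)
```

The important property is that the same function is applied to both the reference and the prediction, so neither side is favored by the rules.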
2. Compute WER correctly
Word Error Rate (WER) measures the edit distance between a reference transcript and a predicted transcript at the word level. The standard formula is:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where:
- S = substitutions
- D = deletions
- I = insertions
- N = number of words in the reference transcript

To compare providers fairly:
- Prepare a reference transcript for each audio sample.
- Run each provider on the exact same audio.
- Normalize both the reference and each prediction with the same pipeline.
- Compute WER on the normalized outputs.
- Aggregate results across the full dataset.
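
The computation can be sketched with a word-level edit-distance table. The code below is an illustrative implementation, not the method of any particular provider or library; the names `ErrorCounts`, `count_errors`, and `corpus_wer` are hypothetical. It assumes both strings are already normalized, and it aggregates by summing errors and reference words over all files, which is one common choice (averaging per-file WER is another and gives short files more weight).

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        if self.reference_words == 0:
            raise ValueError("reference transcript is empty")
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def count_errors(reference: str, prediction: str) -> ErrorCounts:
    """Count S, D, I for a minimum-edit word alignment."""
    ref, hyp = reference.split(), prediction.split()
    # best[i][j] = (total edits, S, D, I) for aligning ref[:i] with hyp[:j].
    best = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        best[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        best[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                best[i][j] = best[i - 1][j - 1]  # match, no new edit
                continue
            t, s, d, k = best[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, k)
            t, s, d, k = best[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = best[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            # Tuples compare on total edits first, so this keeps the cheapest path.
            best[i][j] = min(substitution, deletion, insertion)
    _, s, d, k = best[len(ref)][len(hyp)]
    return ErrorCounts(s, d, k, len(ref))


def corpus_wer(pairs) -> float:
    """Aggregate WER over (reference, prediction) pairs."""
    counts = [count_errors(ref, hyp) for ref, hyp in pairs]
    errors = sum(c.substitutions + c.deletions + c.insertions for c in counts)
    words = sum(c.reference_words for c in counts)
    if words == 0:
        raise ValueError("no reference words in the dataset")
    return errors / words


counts = count_errors("the cat sat on the mat", "the cat sit on mat")
print(counts, counts.wer)
```

In this toy pair, "sat" is replaced by "sit" and the second "the" is dropped, so the alignment has one substitution and one deletion against six reference words.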
3. Choose a representative dataset
Start from your evaluation goal: the right dataset depends on the use case and traffic shape you want to measure. A good benchmark dataset should look as close as possible to your real production audio. If the audio in the benchmark does not match what you actually process, the results will not tell you much. When choosing your dataset, make sure it matches your real audio on:
- Language: the target language, accents, and whether speakers switch languages.
- Audio quality: telephony, browser microphone, studio recordings, noisy field audio, overlapping speech, or compressed audio.
- Topics: medical, operational, legal, financial, customer support, or any other domain you care about.
- Important words: numbers, names, acronyms, product names, addresses, or domain-specific terminology.
- Interaction style: single-speaker dictation, calls, meetings, interviews, or long-form recordings.
Examples of mismatches to avoid:
- Benchmarking a call-center use case on clean podcast recordings overestimates real-world performance.
- Benchmarking English-only speech does not capture code-switching traffic.
- Benchmarking short clips can hide failures that appear on long recordings with multiple speakers.
4. Interpret results carefully
Do not stop at a single WER number. Review:
- overall average WER
- median WER and spread across files
- breakdowns by language, domain, or audio condition
- failure modes on proper nouns, acronyms, and numbers
- whether differences are consistent or concentrated in a few hard samples
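
A per-slice summary along these lines takes only a few lines of Python. The sketch below assumes you already have one WER value per file plus metadata fields; the `summarize` helper, the field names, and the sample values are hypothetical placeholders, not measured results.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(results, slice_key):
    """Group per-file WER by a metadata field and report basic statistics."""
    groups = defaultdict(list)
    for row in results:
        groups[row[slice_key]].append(row["wer"])
    return {
        name: {
            "files": len(values),
            "mean": mean(values),
            "median": median(values),
            "worst": max(values),
        }
        for name, values in groups.items()
    }


# Placeholder values for illustration only.
results = [
    {"file": "a.wav", "condition": "clean", "wer": 0.1},
    {"file": "b.wav", "condition": "clean", "wer": 0.3},
    {"file": "c.wav", "condition": "noisy", "wer": 0.5},
]
for name, stats in summarize(results, "condition").items():
    print(name, stats)
```

Comparing the mean with the median and the worst file in each slice shows whether a gap comes from a consistent difference or from a handful of hard samples.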
Common pitfalls
- Comparing providers on different datasets
- Using low-quality or inconsistent ground truth
- Treating punctuation and formatting differences as recognition errors
- Drawing conclusions from too few samples
- Reporting one average score without any slice analysis
- Not inspecting the reference transcript: if it contains text not present in the audio, for example an intro like “this audio is a recording of…”, it will inflate WER across all providers
- Not experimenting with provider configurations: for example, using Gladia’s custom vocabulary to improve proper noun accuracy, then comparing against the ground truth
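
One way to guard against drawing conclusions from too few samples is to put an interval around the difference between two providers. The sketch below uses a paired percentile bootstrap over per-file WER differences; this is a generic statistical technique offered as an assumption about what could help, not a procedure taken from this guide. The function name and its default parameters are hypothetical.

```python
import random
from statistics import mean


def paired_bootstrap_ci(wer_a, wer_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-file WER difference (A - B).

    wer_a and wer_b must hold per-file WER for the same files, in the same order.
    """
    if len(wer_a) != len(wer_b) or not wer_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(wer_a, wer_b)]
    n = len(diffs)
    # Resample files with replacement and record the mean difference each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-file WER values for illustration only.
provider_a = [0.12, 0.30, 0.08, 0.22, 0.15, 0.40]
provider_b = [0.10, 0.33, 0.09, 0.18, 0.15, 0.35]
print(paired_bootstrap_ci(provider_a, provider_b))
```

If the interval comfortably includes zero, the dataset is probably too small or too noisy to rank the two providers.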