Podcasting has grown to be a popular and powerful medium for storytelling, news, and entertainment. Without transcripts, podcasts may be inaccessible to people who are hard-of-hearing, deaf, or deaf-blind. However, ensuring that auto-generated podcast transcripts are readable and accurate is a challenge. The text needs to accurately reflect the meaning of what was spoken and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To evaluate the quality of our ASR output, we compare a small number of human-generated, or reference, transcripts to corresponding ASR transcripts.

The industry standard for measuring transcript accuracy, word error rate (WER), lacks nuance. It equally penalizes all errors in the ASR text—insertions, deletions, and substitutions—regardless of their impact on readability. Furthermore, the reference text is subjective: It is based on what the human transcriber discerns as they listen to the audio.

Building on existing research into better readability metrics, we set ourselves a challenge to develop a more nuanced quantitative assessment of the readability of ASR passages. As shown in Figure 1, our solution is the human evaluation word error rate (HEWER) metric. HEWER focuses on major errors, those that adversely impact readability, such as misspelled proper nouns and certain capitalization and punctuation errors. HEWER ignores minor errors, such as filler words ("um," "yeah," "like") or alternate spellings ("ok" vs. "okay"). We found that for an American English test set of 800 segments sampled from 61 podcast episodes, with an average ASR transcript WER of 9.2%, the HEWER was just 1.4%, indicating that the ASR transcripts were of higher quality and more readable than WER might suggest.

Figure 1: Word error rate (WER) considers all tokens as errors of equal weight, whereas human evaluation word error rate (HEWER) considers only some of those tokens as major errors—errors that change the meaning of the text, affect its readability, or misspell proper nouns—in calculating the metric.

Our findings provide data-driven insights that we hope have laid the groundwork for improving the accessibility of Apple Podcasts for millions of users. In addition, Apple engineering and product teams can use these insights to help connect audiences with more of the content they seek.

Selecting Sample Podcast Segments

We worked with human annotators to identify and classify errors in 800 segments of American English podcasts pulled from manually transcribed episodes with WER of less than 15%. We chose this WER maximum to ensure the ASR transcripts in our evaluation samples:

  • Met the threshold of quality we expect for any transcript shown to an Apple Podcasts audience
  • Required our annotators to spend no more than 5 minutes to classify errors as major or minor

Of the 66 podcast episodes in our initial dataset, 61 met this criterion, representing 32 unique podcast shows. Figure 2 shows the selection process.

Figure 2: Diagram of the podcast segment selection phases, from the initial set of 66 episodes to the 61 episodes and 800 segments included in the evaluation.

For example, one episode in the initial dataset from the podcast show Yo, Is This Racist? titled “Cody’s Marvel dot Ziglar (with Cody Ziglar)” had a WER of 19.2% and was excluded from our evaluation. But we included an episode titled "I'm Not Trying to Put the Plantation on Blast, But..." from the same show, with a WER of 14.5%.

Segments with a relatively higher episode WER were weighted more heavily in the selection process, because such episodes can provide more insights than episodes whose ASR transcripts are nearly flawless. The mean episode WER across all segments was 7.5%, while the average WER of the selected segments was 9.2%. Each audio segment was approximately 30 seconds in duration, providing enough context for annotators to understand the segments without making the task too taxing. We also aimed to select segments that started and ended at a phrase boundary, such as a sentence break or long pause.
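As a simplified illustration of this weighted selection (the segment records, field layout, and sampling call below are hypothetical sketches, not our production pipeline), a WER-weighted draw might look like this:

    import random

    # Hypothetical candidate segments: (segment_id, episode_wer) pairs.
    # Episodes with a WER of 15% or higher have already been filtered out.
    segments = [
        ("seg-001", 0.032),
        ("seg-002", 0.145),
        ("seg-003", 0.088),
        ("seg-004", 0.121),
    ]

    ids = [seg_id for seg_id, _ in segments]
    # Weight each segment by its episode's WER so that noisier episodes,
    # which surface more annotatable errors, are sampled more often.
    weights = [wer for _, wer in segments]

    random.seed(7)
    # random.choices samples with replacement; a real pipeline would draw
    # without replacement, but the weighting idea is the same.
    sampled = random.choices(ids, weights=weights, k=2)
    print(sampled)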

Evaluating Major and Minor Errors in Transcript Samples

WER is a widely used measurement of the performance of speech recognition and machine translation systems. It divides the total number of errors in the auto-generated text by the total number of words in the human-generated (reference) text. Unfortunately, WER scoring gives equal weight to all ASR errors—insertions, substitutions, and deletions—which can be misleading. For example, a passage with a high WER may still be readable or even indistinguishable in semantic content from the reference transcript, depending on the types of errors. 
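Written out, this is the standard formulation of the metric:

    WER = (S + D + I) / N

where S, D, and I are the substitution, deletion, and insertion counts from aligning the ASR output to the reference, and N is the number of words in the reference transcript.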
Previous research on readability has focused on subjective and imprecise metrics. For example, in their paper “A Metric for Evaluating Speech Recognizer Output Based on Human Perception Model,” Nobuyasu Itoh and team devised a scoring rubric on a scale of 0 to 5, with 0 being the highest quality. Participants in their experiment were first presented with auto-generated text without corresponding audio and were asked to assess transcripts based on how easy the transcript was to understand. They then listened to the audio and scored the transcript based on perceived accuracy.

Other readability research—for example, "The Future of Word Error Rate"—has not to our knowledge been implemented across any datasets at scale. To address these limitations, our researchers developed a new metric for measuring readability, HEWER, that builds on the WER scoring system.

The HEWER score provides human-centric insights considering readability nuances. Figure 3 shows three versions of a 30-second sample segment from transcripts of the April 23, 2021, episode, "The Herd," of the podcast show This American Life.

Figure 3: Depending on the type of errors, a passage with a high WER may still be readable or even indistinguishable from the reference transcript. HEWER provides human-centric insights that consider these readability nuances. The HEWER calculation also takes into account errors not considered by WER, such as punctuation or capitalization errors.

Our dataset comprised 30-second audio segments from a superset of 66 podcast episodes, and each segment's corresponding reference and model-generated transcripts. Human annotators started by identifying errors in wording, punctuation, or capitalization in the transcripts, and classifying as "major errors" only those that:

  • Changed the meaning of the text
  • Affected the readability of the text
  • Misspelled proper nouns

WER and HEWER are calculated based on an alignment of the reference and model-generated text. Figure 3 shows each metric’s scoring of the same output. WER counts as errors all words that differ between the reference and model-generated text, but ignores case and punctuation. HEWER, on the other hand, takes both case and punctuation into account; the total number of tokens, shown in the denominator, is therefore larger, because each punctuation mark counts as a token.
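A minimal sketch of the two calculations, assuming the word-level alignment is computed with a standard edit-distance pass and the count of major errors comes from human annotation (the helper names, tokenization rule, and single-sentence example below are our own simplifications, not the exact annotation workflow):

    import re

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Classic WER: word-level edit distance, ignoring case and punctuation."""
        ref = re.sub(r"[^\w\s']", "", reference.lower()).split()
        hyp = re.sub(r"[^\w\s']", "", hypothesis.lower()).split()
        # Dynamic-programming edit distance over the two word sequences.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    def human_evaluation_wer(reference: str, major_error_count: int) -> float:
        """HEWER sketch: annotator-flagged major errors divided by all tokens,
        where each punctuation mark counts as a token (unlike WER)."""
        tokens = re.findall(r"\w+'?\w*|[^\w\s]", reference)
        return major_error_count / len(tokens)

    ref = "Antivirals, as you know, help people who are quarantining."
    hyp = "Antibirals, as you know, help people who are quarantine."
    print(f"WER: {word_error_rate(ref, hyp):.1%}")       # 2 of 9 words differ
    print(f"HEWER: {human_evaluation_wer(ref, 2):.1%}")  # both errors flagged as major

On this toy pair, the sketch prints a WER of 22.2% (2 of 9 words) and a HEWER of 16.7% (2 of 12 tokens); in a full 30-second segment, the same two major errors would yield a far lower HEWER because the denominator is much larger.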

Unlike WER, HEWER ignores minor errors, such as filler words like “uh” only present in the reference transcript, or the use of “you know” in the model-generated text in place of "yeah" in the reference transcript. Additionally, HEWER ignores differences in comma placement that do not affect readability or meaning, as well as missing hyphens. The only major errors in the Figure 3 HEWER sample are “quarantine” in place of “quarantining” and “Antibirals” in place of “Antivirals.”

In this case, WER is somewhat high, at 9.4%. However, that value gives a false impression of the quality of the model-generated transcript, which is actually quite readable. The HEWER value of 2.2% is a better reflection of the human experience of reading the transcript.
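As a rough back-of-the-envelope check (the token count here is inferred from the reported percentages rather than taken from the annotation data), the two major errors noted above at a HEWER of 2.2% imply a denominator of roughly 90 tokens:

    HEWER ≈ 2 major errors / 90 tokens ≈ 2.2%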

Conclusion

Given the rigidity and limitations of WER, the established industry standard for measuring ASR accuracy, we built on existing research to create HEWER, a more nuanced quantitative assessment of the readability of ASR passages. We applied this new metric to a dataset of sample segments from auto-generated transcripts of podcast episodes to glean insights into transcript readability and help ensure the greatest accessibility and best possible experience for all Apple Podcasts audiences and creators.

Acknowledgments

Many people contributed to this research, including Nilab Hessabi, Sol Kim, Filipe Minho, Issey Masuda Mora, Samir Patel, Alejandro Woodward Riquelme, João Pinto Carrilho Do Rosario, Clara Bonnin Rossello, Tal Singer, Eda Wang, Anne Wootton, Regan Xu, and Phil Zepeda.

Apple Resources

Apple Newsroom. 2024. “Apple Introduces Transcripts for Apple Podcasts.” [link.]

Apple Podcasts. n.d. “Endless Topics. Endlessly Engaging.” [link.]

External References

Glass, Ira, host. 2021. “The Herd.” This American Life. Podcast 736, April 23, 58:56. [link.]

Hughes, John. 2022. “The Future of Word Error Rate (WER).” Speechmatics. [link.]

Itoh, Nobuyasu, Gakuto Kurata, Ryuki Tachibana, and Masafumi Nishimura. 2015. “A Metric for Evaluating Speech Recognizer Output based on Human-Perception Model.” In 16th Annual Conference of the International Speech Communication Association (Interspeech 2015). Speech Beyond Speech: Towards a Better Understanding of the Most Important Biosignal, 1285–88. [link.]
