View publication

Audio descriptions make videos accessible to those who cannot see them by describing visual content in audio. Producing audio descriptions is challenging due to the synchronous nature of the audio description that must fit into gaps of other video content. An experienced audio description author will produce content that fits narration necessary to understand, enjoy, or experience the video content into the time available. This can be especially tricky for novices to do well. In this paper, we introduce a tool, Rescribe, that helps authors create and refine their audio descriptions. Using Rescribe, authors first create a draft of all the content they would like to include in the audio description. Rescribe then uses a dynamic programming approach to optimize between the length of the audio description, available automatic shortening approaches, and source track lengthening approaches. Authors can iteratively visualize and refine the audio descriptions produced by Rescribe, working in concert with the tool. We evaluate the effectiveness of Rescribe through interviews with blind and visually impaired audio description users who preferred Rescribe-edited descriptions to extended descriptions. In addition, we invite novice users to create audio descriptions with Rescribe and another tool, finding that users produce audio descriptions with fewer placement errors using Rescribe.

Related readings and updates.

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-independent models. In this paper, we investigate…
See paper details

Improving Neural Network Acoustic Models by Cross-bandwidth and Cross-lingual Initialization

Users expect Siri speech recognition to work well regardless of language, device, acoustic environment, or communication channel bandwidth. Like many other supervised machine learning tasks, achieving such high accuracy usually requires large amounts of labeled data. Whenever we launch Siri in a new language, or extend support to different audio channel bandwidths, we face the challenge of having enough data to train our acoustic models. In this article, we discuss transfer learning techniques that leverage data from acoustic models already in production. We show that the representations are transferable not only across languages but also across audio channel bandwidths. As a case study, we focus on recognizing narrowband audio over 8 kHz Bluetooth headsets in new Siri languages. Our techniques help to improve significantly Siri’s accuracy on the day we introduce a new language.

See article details