Modality Dropout for Improved Performance-Driven Talking Faces
Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, Sachin Kajareker
We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model relies on audio and visual information during training. Our trained model runs in real time on resource-limited hardware (for example, a smartphone), is user agnostic, and does not depend on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate the improvement of audiovisual-driven animation over the equivalent video-only approach, and the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing modality dropout, viewers prefer the audiovisual-driven animation in 51 percent of the test sequences, compared with only 18 percent for the video-driven animation. After introducing modality dropout, viewer preference for the audiovisual-driven animation increases to 74 percent, while preference for the video-only animation decreases to 8 percent.
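To make the modality dropout idea concrete, the sketch below shows one way such batches could be constructed: each training example keeps only audio, only video, or both, according to tunable probabilities. This is a minimal illustration in PyTorch, not the authors' implementation; the function name, the probability values, and the choice to zero out a dropped modality (rather than, say, masking it elsewhere in the pipeline) are assumptions made here for clarity.

```python
import torch

def apply_modality_dropout(audio_feats, video_feats,
                           p_audio_only=0.25, p_video_only=0.25):
    """Illustrative modality dropout: randomly suppress one modality per
    training example so the model cannot rely on either stream alone.

    audio_feats, video_feats: tensors of shape (batch, time, feat_dim).
    p_audio_only: probability of keeping only audio (video zeroed).
    p_video_only: probability of keeping only video (audio zeroed).
    With the remaining probability, both modalities are kept.
    All names and probabilities are hypothetical, not from the paper.
    """
    batch_size = audio_feats.shape[0]
    draw = torch.rand(batch_size)

    # Boolean masks selecting which examples lose a modality.
    drop_video = draw < p_audio_only
    drop_audio = (draw >= p_audio_only) & (draw < p_audio_only + p_video_only)

    audio_out = audio_feats.clone()
    video_out = video_feats.clone()
    audio_out[drop_audio] = 0.0  # audio-dropped examples see only video
    video_out[drop_video] = 0.0  # video-dropped examples see only audio
    return audio_out, video_out
```

Raising `p_video_only` would push the model to depend more on the visual stream, and vice versa, which is the kind of control over modality reliance the abstract describes.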