Apple Machine Learning Journal

Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis

Vol. 1, Issue 4 August 2017 by Siri Team

Siri is a personal assistant that communicates using speech synthesis. Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri’s personality to shine through. This article presents more details about the deep learning based technology behind Siri’s voice.

View the article "Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis"

Inverse Text Normalization as a Labeling Problem

Vol. 1, Issue 3 August 2017 by Siri Team

Siri displays entities like dates, times, addresses and currency amounts in a nicely formatted way. This is the result of the application of a process called inverse text normalization (ITN) to the output of a core speech recognition component. To understand the important role ITN plays, consider that, without it, Siri would display “October twenty third twenty sixteen” instead of “October 23, 2016”. In this work, we show that ITN can be formulated as a labelling problem, allowing for the application of a statistical model that is relatively simple, compact, fast to train, and fast to apply. We demonstrate that this approach represents a practical path to a data-driven ITN system.

View the article "Inverse Text Normalization as a Labeling Problem"

Improving Neural Network Acoustic Models by Cross-bandwidth and Cross-lingual Initialization

Vol. 1, Issue 2 August 2017 by Siri Team

Users expect Siri speech recognition to work well regardless of language, device, acoustic environment, or communication channel bandwidth. Like many other supervised machine learning tasks, achieving such high accuracy usually requires large amounts of labeled data. Whenever we launch Siri in a new language, or extend support to different audio channel bandwidths, we face the challenge of having enough data to train our acoustic models. In this article, we discuss transfer learning techniques that leverage data from acoustic models already in production. We show that the representations are transferable not only across languages but also across audio channel bandwidths. As a case study, we focus on recognizing narrowband audio over 8 kHz Bluetooth headsets in new Siri languages. Our techniques help to improve significantly Siri’s accuracy on the day we introduce a new language.

View the article "Improving Neural Network Acoustic Models by Cross-bandwidth and Cross-lingual Initialization"

Improving the Realism of Synthetic Images

Vol. 1, Issue 1 July 2017

Most successful examples of neural nets today are trained with supervision. However, to achieve high accuracy, the training sets need to be large, diverse, and accurately annotated, which is costly. An alternative to labelling huge amounts of data is to use synthetic images from a simulator. This is cheap as there is no labeling cost, but the synthetic images may not be realistic enough, resulting in poor generalization on real test images. To help close this performance gap, we’ve developed a method for refining synthetic images to make them look more realistic. We show that training models on these refined images leads to significant improvements in accuracy on various machine learning tasks.

View the article "Improving the Realism of Synthetic Images"


July 2017

Welcome to the Apple Machine Learning Journal. Here, you can read posts written by Apple engineers about their work using machine learning technologies to help build innovative products for millions of people around the world. If you’re a machine learning researcher or student, an engineer or developer, we’d love to hear your questions and feedback. Write us at