A voice replicator is a powerful tool for people at risk of losing their ability to speak, including those with a recent diagnosis of amyotrophic lateral sclerosis (ALS) or other conditions that can progressively impact speaking ability. First introduced in May 2023 and made available on iOS 17 in September 2023, Personal Voice is a tool that creates a synthesized voice for such users to speak in FaceTime, phone calls, assistive communication apps, and in-person conversations.
To start, the user reads aloud a randomized set of text prompts to record 150 sentences on the latest iPhone, iPad, or Mac. The voice audio is then tuned with machine learning techniques overnight, directly on the device, while the device is charging, locked, and connected to Wi-Fi. The Wi-Fi connection is needed only to download the pretrained asset. By the next day, the person can type what they want to say using the Live Speech text-to-speech (TTS) feature, as illustrated in Figure 1, and be heard in conversation in a voice that sounds like theirs. Because model training and inference are done entirely on-device, users can take advantage of Personal Voice whenever they want, and keep their information both private and secure.
In this research highlight, we discuss the three machine learning approaches behind Personal Voice:
- Personal Voice TTS system
- Voice model pretraining and fine-tuning
- On-device speech recording enhancement
Personal Voice TTS System
The first machine learning approach we will discuss is a typical neural TTS system, which takes in text and provides speech output. A TTS system includes three major components:
- Text processing: Converts graphemes (written text) to phonemes, a written notation that represents distinct units of sound (such as the h of hat and the c of cat in English)
- Acoustic model: Converts phonemes to acoustic features (for example, to the Mel spectrum, a frequency representation of sound, engineered to represent the range of human speech)
- Vocoder model: Converts acoustic features to speech waveforms, providing a representation of the audio signal over time
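The three components above can be read as a chain of functions, each consuming the previous stage's output. The following is a minimal sketch of that data flow only, not the Personal Voice implementation: the two-word `LEXICON`, the stubbed `acoustic_model` and `vocoder`, and the parameters `frames_per_phoneme`, `n_mels`, and `hop_length` are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List

Phonemes = List[str]
MelFrames = List[List[float]]  # one list of Mel-band values per frame
Waveform = List[float]         # audio samples over time

# Hypothetical toy lexicon (ARPAbet-style symbols); a real system uses a
# full pronunciation dictionary plus a model for unknown words.
LEXICON: Dict[str, Phonemes] = {
    "hat": ["HH", "AE", "T"],
    "cat": ["K", "AE", "T"],
}


def text_processing(text: str) -> Phonemes:
    """Stage 1: graphemes -> phonemes, here by dictionary lookup."""
    phonemes: Phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def acoustic_model(phonemes: Phonemes,
                   frames_per_phoneme: int = 2,
                   n_mels: int = 4) -> MelFrames:
    """Stage 2 (stub): phonemes -> acoustic features.

    A trained model would predict the values; this stub only produces
    zero-filled frames of the right shape.
    """
    n_frames = len(phonemes) * frames_per_phoneme
    return [[0.0] * n_mels for _ in range(n_frames)]


def vocoder(mel: MelFrames, hop_length: int = 3) -> Waveform:
    """Stage 3 (stub): acoustic features -> waveform.

    Each frame expands to `hop_length` audio samples (silence here).
    """
    return [0.0] * (len(mel) * hop_length)


def synthesize(text: str) -> Waveform:
    """Run the three stages in order."""
    return vocoder(acoustic_model(text_processing(text)))


if __name__ == "__main__":
    print(text_processing("hat cat"))
    print(len(synthesize("hat cat")), "samples")
```

The point of the sketch is the interface between stages: only the acoustic model and the vocoder carry speaker-specific information, which is why those are the two candidates for fine-tuning.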
To develop Personal Voice, Apple researchers worked with the Open SLR LibriTTS dataset. The cleaned dataset includes 300 hours of speech from 1,000 speakers with very different speaking styles or accents. Personal Voice must produce speech output that others can recognize as the voice of the target speaker. Both the acoustic model and vocoder model are speaker-dependent in a typical TTS system. To clone the target speaker’s voice, we fine-tuned the acoustic model with on-device training. For the vocoder model, we considered both a universal model and on-device adaptation. Our team found that fine-tuning only the acoustic model and using a universal vocoder often generates poorer voice quality: unusual prosody, audio glitches, and noise were more prevalent when tested against unseen speakers. Fine-tuning both models, as seen in Figure 2, requires extra training time on device but results in better overall quality.
Listening tests showed that fine-tuning both models achieves the best voice quality and similarity to the target speaker’s voice, as measured by mean opinion score (MOS) and voice similarity (VS) score, respectively. The MOS is 0.43 higher than the universal vocoder version on average. In addition, fine-tuning can reduce the actual model size enough to achieve real-time speech synthesis for a faster and more satisfying conversation experience.
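For readers unfamiliar with the metric, a mean opinion score is the average of listener ratings for a system. The sketch below assumes the conventional 1-to-5 rating scale, which the text above does not state, and uses invented ratings, so the resulting gap does not reproduce the 0.43 figure reported here.

```python
from statistics import mean
from typing import Sequence


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Average listener rating, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    return float(mean(ratings))


# Hypothetical listener ratings for two system variants.
universal_vocoder = [3, 4, 3, 4, 3, 4, 4, 3]
fine_tuned_vocoder = [4, 4, 4, 4, 3, 4, 5, 4]

mos_gap = (mean_opinion_score(fine_tuned_vocoder)
           - mean_opinion_score(universal_vocoder))

if __name__ == "__main__":
    print(f"MOS gap: {mos_gap:.2f}")
```

With these made-up ratings the two scores are 3.5 and 4.0. A real listening test would also report how many listeners and utterances were used, and a confidence interval on the difference.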
Voice Model Pretraining and Fine-Tuning
The next machine learning approaches we will discuss are voice model pretraining and fine-tuning. The models contain two parts:
- Modified FastSpeech2-based acoustic model
- WaveRNN-based vocoder model
The acoustic model follows an architecture similar to FastSpeech2. However, we add speaker ID as part of the decoder input to learn general voice information during the pretraining stage. Further, our team uses dilated convolution layers for decoding instead of transformer-based layers. This results in faster training and inference, as well as reduced memory consumption, making the models shippable on iPhone and iPad.
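To make the dilated convolution idea concrete, here is a small pure-Python sketch of a 1-D dilated convolution and of the receptive field that a stack of such layers covers. It is a generic illustration of the operation, not the decoder used in Personal Voice; the kernel size and dilation schedule in the usage example are assumptions.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float],
                   kernel: Sequence[float],
                   dilation: int = 1) -> List[float]:
    """1-D dilated convolution without padding.

    Uses the cross-correlation form common in deep learning: output[t] is
    the weighted sum of inputs at t, t + dilation, t + 2 * dilation, ...
    """
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    return [
        sum(kernel[k] * signal[t + k * dilation] for k in range(len(kernel)))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input steps seen by one output of a stack of conv layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # With dilation 2, each output skips every other input sample.
    print(dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1, 1], dilation=2))
    # Four layers with doubling dilation cover 31 input steps.
    print(receptive_field(3, [1, 2, 4, 8]))
```

Doubling the dilation at each layer grows the receptive field exponentially with depth while the per-layer cost stays fixed, which is one reason such stacks can be cheaper than attention over long sequences.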
We use a general pretraining and fine-tuning strategy for Personal Voice. Both the acoustic and vocoder models are pretrained with the same Open SLR LibriTTS dataset.
During the fine-tuning stage with target-speaker data, we fine-tune only the decoder and variance adapters of the acoustic model. The variance adapters predict the target speaker's phoneme-wise duration, pitch, and energy. For the vocoder model, however, we do a full model adaptation, in which all parameters are fine-tuned. Moreover, the entire fine-tuning stage (and the Personal Voice TTS system) runs on the user's Apple device, not on a server. To speed up on-device training, we use full bfloat16 precision with fp32 accumulation for vocoder model fine-tuning, with a batch size of 32. Each batch contains 10 ms audio samples.
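The value of pairing bfloat16 with fp32 accumulation can be seen in a small numeric simulation. The sketch below emulates bfloat16 rounding in software by keeping the top 16 bits of a float32 (round to nearest, ties to even) and compares a running sum held in fp32 against one held in bfloat16. It is an illustration of the number formats only, and makes no claim about how the on-device training kernels are written; the helper names are hypothetical.

```python
import struct
from typing import Sequence


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x: float) -> float:
    """Round a finite value to bfloat16 (nearest, ties to even).

    bfloat16 is the top 16 bits of a float32: same 8-bit exponent,
    but only 7 explicit mantissa bits. Finite inputs only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_bf16(a: Sequence[float], b: Sequence[float],
             accumulate_in_fp32: bool = True) -> float:
    """Dot product with bfloat16 inputs and a choice of accumulator type."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_bfloat16(x) * to_bfloat16(y)
        total = acc + prod
        acc = to_float32(total) if accumulate_in_fp32 else to_bfloat16(total)
    return acc


if __name__ == "__main__":
    ones = [1.0] * 1000
    print(dot_bf16(ones, ones, accumulate_in_fp32=True))   # exact sum
    print(dot_bf16(ones, ones, accumulate_in_fp32=False))  # stalls early
```

With a bfloat16 accumulator the sum of 1,000 ones stops growing at 256, because 257 is not representable with 8 significant bits and rounds back to 256. The fp32 accumulator returns 1000. Long reductions such as gradient sums behave the same way, which is why the accumulator is typically kept in higher precision even when inputs and weights are stored in bfloat16.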
On-Device Speech Recording Enhancement
The final machine learning approach we will discuss is on-device speech recording enhancement. Those who use the Personal Voice feature can record their voice samples wherever they choose. As a result, those recordings might include unwanted sounds, such as traffic noise or other people’s voices nearby. In our research, we found that the quality of the generated or synthesized voice is highly related to the quality of the user’s recordings. Hence, we apply speech augmentation to the target-speaker data to achieve the best voice quality.
Our speech augmentation contains four major components as seen in Figure 3:
- Sound pressure level (SPL) and signal-to-noise ratio (SNR) filtering: Screens out very noisy recordings that are difficult to enhance
- Voice isolation: Removes general noise and leaves only speech
- Mel spectrum augmentation: Model-based solution that provides a cleaner Mel spectrum with better audio fidelity
- Audio recovery: Model-based solution to recover the audio signal from the enhanced Mel spectrum
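The first step in the list, SNR filtering, can be sketched as follows. The text does not say how SNR is estimated or what threshold is used, so this example assumes that speech and noise segments have already been separated (for instance, by a voice activity detector) and uses a hypothetical 20 dB cutoff.

```python
import math
from typing import Sequence


def power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a segment."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def keep_recording(speech: Sequence[float],
                   noise: Sequence[float],
                   min_snr_db: float = 20.0) -> bool:
    """Screen out recordings whose SNR falls below the (assumed) threshold."""
    return snr_db(speech, noise) >= min_snr_db


if __name__ == "__main__":
    speech = [1.0, -1.0] * 4
    quiet_room = [0.01, -0.01] * 4  # about 40 dB below the speech
    busy_street = [0.5, -0.5] * 4   # about 6 dB below the speech
    print(keep_recording(speech, quiet_room))
    print(keep_recording(speech, busy_street))
```

Recordings that pass this screen would then go through the remaining three steps; those that fail are the "very noisy recordings that are difficult to enhance" mentioned above.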
Our Mel spectrum augmentation model is based on U-Net and is trained with a noisy Mel spectrum as the input and a clean Mel spectrum as the output. The audio recovery model is a simple Chunked Autoregressive GAN (CarGAN) model that converts a clean Mel spectrum to an audio signal.
With the speech augmentation flow, we found that the generated voice quality improved significantly, especially with the real-world iPhone-recorded data that we collected from internal and external speakers. The MOS is 0.25 higher than that of the baseline flow, which does not have audio augmentation.
Figure 4 shows the final results of quality evaluation for Personal Voice, in both mean opinion score and voice similarity score.
In this research highlight, we cover the technical details behind the Personal Voice feature, which accessibility users can use to create their own voice overnight fully on device, and use with real-time speech synthesis to talk with others. Our hope is that people at risk of losing their ability to speak, such as those with ALS or other conditions that can diminish their ability to speak, may benefit greatly from the Personal Voice feature.
Many people contributed to this work, including Dipjyoti Paul, Jiangchuan Li, Luke Chang, Petko Petkov, Pierre Su, Shifas Padinjaru Veettil, and Ye Tian.
References
Apple Developer. 2023. “Extended Speech Synthesis with Personal and Custom Voices.” [link.]
Apple Newsroom. 2023. “Apple Introduces New Features for Cognitive Accessibility, Along with Live Speech, Personal Voice, and Point and Speak in Magnifier.” [link.]
Apple Support. 2023. “Create a Personal Voice on your iPhone, iPad, or Mac.” [link.]
Apple YouTube. 2023. “Personal Voice on iPhone - The Lost Voice.” [link.]
Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, et al. 2018. “Efficient Neural Audio Synthesis.” [link.]
Morrison, Max, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio. 2022. “Chunked Autoregressive GAN for Conditional Waveform Synthesis.” March. [link.]
Open SLR. n.d. “LibriTTS Corpus.” [link.]
Ren, Yi, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. “FastSpeech 2: Fast and High-Quality End-To-End Text to Speech.” March. [link.]
Silva-Rodríguez, J., M. F. Dolz, M. Ferrer, A. Castelló, V. Naranjo, and G. Piñero. 2021. “Acoustic Echo Cancellation Using Residual U-Nets.” September. [link.]