A voice replicator is a powerful tool for people at risk of losing their ability to speak, including those with a recent diagnosis of amyotrophic lateral sclerosis (ALS) or other conditions that can progressively impact speaking ability. First introduced in May 2023 and made available on iOS 17 in September 2023, Personal Voice is a tool that creates a synthesized voice for such users to speak in FaceTime, phone calls, assistive communication apps, and in-person conversations.

To start, the user reads aloud a randomized set of text prompts to record 150 sentences on the latest iPhone, iPad, or Mac. The voice audio is then tuned with machine learning techniques overnight, directly on the device, while the device is charging, locked, and connected to Wi-Fi; the Wi-Fi connection is needed only to download the pretrained asset. By the next day, the person can type what they want to say using the Live Speech text-to-speech (TTS) feature, as illustrated in Figure 1, and be heard in conversation in a voice that sounds like theirs. Because model training and inference are done entirely on-device, users can take advantage of Personal Voice whenever they want, and keep their information both private and secure.

In this research highlight, we discuss the three machine learning approaches behind Personal Voice:

  • Personal Voice TTS system
  • Voice model pretraining and fine-tuning
  • On-device speech recording enhancement

Personal Voice TTS System

The first machine learning approach we will discuss is a typical neural TTS system, which takes in text and provides speech output. A TTS system includes three major components:

  • Text processing: Converts graphemes (written text) to phonemes, a written notation that represents distinct units of sound (such as the h of hat and the c of cat in English)
  • Acoustic model: Converts phonemes to acoustic features (for example, to the Mel spectrum, a frequency representation of sound, engineered to represent the range of human speech)
  • Vocoder model: Converts acoustic features to speech waveforms, providing a representation of the audio signal over time
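
To make the division of labor concrete, the following is a minimal Python sketch of how these three stages compose. The component callables (`g2p`, `acoustic_model`, `vocoder`) are illustrative placeholders, not the modules used in the shipped system.

```python
import numpy as np

class TextToSpeechPipeline:
    """Minimal sketch of a three-stage neural TTS pipeline.

    The components are assumed callables supplied by the caller; their
    names are illustrative, not those of the production system.
    """

    def __init__(self, g2p, acoustic_model, vocoder):
        self.g2p = g2p                        # text processing: graphemes -> phonemes
        self.acoustic_model = acoustic_model  # acoustic model: phonemes -> Mel spectrum
        self.vocoder = vocoder                # vocoder: Mel spectrum -> waveform

    def synthesize(self, text: str, speaker_id: int) -> np.ndarray:
        # 1. Text processing: convert written text to a phoneme sequence.
        phonemes = self.g2p(text)
        # 2. Acoustic model: predict Mel-spectrum frames for the target speaker.
        mel = self.acoustic_model(phonemes, speaker_id=speaker_id)
        # 3. Vocoder: convert acoustic features to a speech waveform.
        return self.vocoder(mel)
```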

To develop Personal Voice, Apple researchers worked with the Open SLR LibriTTS dataset. The cleaned dataset includes 300 hours of speech from 1,000 speakers with very different speaking styles and accents. Personal Voice must produce speech output that others can recognize as the voice of the target speaker. In a typical TTS system, both the acoustic model and the vocoder model are speaker-dependent. To clone the target speaker’s voice, we fine-tuned the acoustic model with on-device training. For the vocoder model, we considered both a universal model and on-device adaptation. Our team found that fine-tuning only the acoustic model and using a universal vocoder often generates poorer voice quality: unusual prosody, audio glitches, and noise were more prevalent when tested against unseen speakers. Fine-tuning both models, as seen in Figure 2, requires extra training time on device but results in better overall quality.

Figure 2: The Personal Voice text-to-speech system. The system takes phonemes as input; a FastSpeech2 model converts the phonemes to the target speaker’s Mel spectrum, and a WaveRNN model converts the Mel spectrum to the output speech waveform.

Listening tests showed that fine-tuning both models achieves the best voice quality and similarity to the target speaker’s voice, as measured by mean opinion score (MOS) and voice similarity (VS) score, respectively. On average, the MOS is 0.43 higher than that of the universal-vocoder version. In addition, fine-tuning lets us reduce the model size enough to achieve real-time speech synthesis, for a faster and more satisfying conversation experience.

Voice Model Pretraining and Fine-Tuning

The next machine learning approaches we will discuss are voice model pretraining and fine-tuning. The voice model contains two parts:

  • Modified FastSpeech2-based acoustic model
  • WaveRNN-based vocoder model

The acoustic model follows an architecture similar to FastSpeech2. However, we add speaker ID as part of the decoder input to learn general voice information during the pretraining stage. Further, our team uses dilated convolution layers for decoding instead of transformer-based layers. This results in faster training and inference, as well as reduced memory consumption, making the models shippable on iPhone and iPad.
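
As a rough illustration of the decoder design, the PyTorch sketch below stacks dilated 1-D convolutions so the receptive field grows exponentially with depth while each layer's cost stays linear in the number of frames. The hidden size, kernel size, and dilation schedule are assumptions made for illustration, not the production configuration.

```python
import torch
import torch.nn as nn

class DilatedConvDecoder(nn.Module):
    """Sketch of a dilated-convolution Mel decoder.

    Hidden size, kernel size, and dilation schedule are illustrative
    assumptions, not the configuration used in Personal Voice.
    """

    def __init__(self, hidden: int = 256, n_mels: int = 80, n_layers: int = 6):
        super().__init__()
        layers = []
        for i in range(n_layers):
            dilation = 2 ** i  # receptive field doubles with each layer
            layers += [
                nn.Conv1d(hidden, hidden, kernel_size=3,
                          dilation=dilation, padding=dilation),
                nn.ReLU(),
            ]
        self.conv_stack = nn.Sequential(*layers)
        self.mel_proj = nn.Conv1d(hidden, n_mels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden, frames) -- encoder output combined with the
        # speaker embedding, as described above.
        return self.mel_proj(self.conv_stack(x))  # (batch, n_mels, frames)
```

Unlike self-attention, whose cost grows quadratically with sequence length, this stack keeps compute and memory proportional to the number of frames, which is what makes it attractive for on-device training and inference.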

We use a general pretraining and fine-tuning strategy for Personal Voice. Both the acoustic and vocoder models are pretrained with the same Open SLR LibriTTS dataset.

During the fine-tuning stage with target-speaker data, we fine-tune only the acoustic model's decoder and variance adapters. The variance adapters are used to predict the target speaker's phoneme-wise duration, pitch, and energy. For the vocoder model, however, we do a full model adaptation in which all parameters are fine-tuned. Moreover, the entire fine-tuning stage (and the Personal Voice TTS system) runs on the user's Apple device, not on a server. To speed up on-device training, we use full bfloat16 precision with fp32 accumulation for vocoder model fine-tuning, with a batch size of 32; each batch contains 10 ms audio samples.
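
The sketch below illustrates these two fine-tuning regimes in PyTorch: all acoustic-model parameters are frozen except the decoder and variance adapters, while every vocoder parameter is adapted under bfloat16 autocast with float32 master weights, one common way to approximate bfloat16 compute with fp32 accumulation. Attribute names such as `decoder` and `variance_adapters`, and the CPU autocast device, are assumptions for illustration; they are not the production training stack.

```python
import torch

def setup_finetuning(acoustic_model: torch.nn.Module, vocoder: torch.nn.Module):
    """Freeze all acoustic-model parameters except the decoder and variance
    adapters; leave the whole vocoder trainable (full model adaptation).

    `decoder` and `variance_adapters` are assumed attribute names used
    only for illustration.
    """
    for p in acoustic_model.parameters():
        p.requires_grad = False
    for module in (acoustic_model.decoder, acoustic_model.variance_adapters):
        for p in module.parameters():
            p.requires_grad = True
    for p in vocoder.parameters():
        p.requires_grad = True

    acoustic_params = [p for p in acoustic_model.parameters() if p.requires_grad]
    return acoustic_params, list(vocoder.parameters())


def vocoder_train_step(vocoder, optimizer, batch, loss_fn):
    """One vocoder fine-tuning step: bfloat16 compute, float32 parameters."""
    optimizer.zero_grad()
    mel, audio = batch  # e.g., 32 short audio chunks per batch
    # The forward pass runs in bfloat16; parameters and optimizer state stay
    # in float32, so updates are accumulated at full precision.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        pred = vocoder(mel)
        loss = loss_fn(pred, audio)
    loss.backward()
    optimizer.step()
    return loss.detach()
```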

On-Device Speech Recording Enhancement

The final machine learning approach we will discuss is on-device speech recording enhancement. Those who use the Personal Voice feature can record their voice samples wherever they choose. As a result, those recordings might include unwanted sounds, such as traffic noise or other people’s voices nearby. In our research, we found that the quality of the generated or synthesized voice is highly related to the quality of the user’s recordings. Hence, we apply speech augmentation to the target-speaker data to achieve the best voice quality.

Our speech augmentation contains four major components as seen in Figure 3:

  1. Sound pressure level (SPL) and signal-to-noise ratio (SNR) filtering: Screens out very noisy recordings that are difficult to enhance
  2. Voice isolation: Removes general noise and leaves only speech
  3. Mel spectrum augmentation: Model-based solution that provides a cleaner Mel spectrum with better audio fidelity
  4. Audio recovery: Model-based solution to recover the audio signal from the enhanced Mel spectrum
Figure 3: The speech augmentation flow. The flow takes noisy speech as input and outputs enhanced speech. It contains four components (from top to bottom): SPL/SNR filtering, voice isolation, a U-Net-like Mel spectrum enhancement model, and a CarGAN audio recovery model.
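
Read end to end, the augmentation flow is a sequential filter over each recording, which the Python sketch below makes explicit. The SPL/SNR thresholds and the component callables are illustrative placeholders; the production values and models are not published at this level of detail.

```python
import numpy as np

# Illustrative thresholds only; the production values are not published.
MIN_SPL_DB = 40.0
MIN_SNR_DB = 10.0

def enhance_recording(noisy_audio: np.ndarray, estimate_spl, estimate_snr,
                      voice_isolation, to_mel, mel_enhancer, audio_recovery):
    """Sketch of the four-stage speech augmentation flow in Figure 3.

    All callables are assumed placeholders for the SPL/SNR estimators, the
    voice isolation front end, the U-Net-like Mel enhancement model, and
    the CarGAN audio recovery model.
    """
    # 1. SPL/SNR filtering: screen out recordings too noisy to enhance.
    if estimate_spl(noisy_audio) < MIN_SPL_DB or estimate_snr(noisy_audio) < MIN_SNR_DB:
        return None  # recording rejected; the user is asked to re-record

    # 2. Voice isolation: remove general noise and keep only speech.
    speech = voice_isolation(noisy_audio)

    # 3. Mel spectrum augmentation: produce a cleaner Mel spectrum.
    clean_mel = mel_enhancer(to_mel(speech))

    # 4. Audio recovery: reconstruct the enhanced waveform from the Mel.
    return audio_recovery(clean_mel)
```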

Our Mel spectrum augmentation model is based on U-Net and is trained with a noisy Mel spectrum as the input and a clean Mel spectrum as the output. The audio recovery model is a simple Chunked Autoregressive GAN (CarGAN) that converts a clean Mel spectrum to an audio signal.
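
For intuition, a U-Net-style Mel denoiser can be as small as an encoder-decoder with one skip connection, trained to map noisy Mel frames to clean ones. The sketch below is a minimal, illustrative version; the channel counts and depth are assumptions, not the production architecture.

```python
import torch
import torch.nn as nn

class MelUNet(nn.Module):
    """Minimal U-Net-style Mel denoiser sketch: noisy Mel in, clean Mel out.

    Channel counts and depth are illustrative assumptions only.
    """

    def __init__(self, ch: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * ch, ch, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, noisy_mel: torch.Tensor) -> torch.Tensor:
        # noisy_mel: (batch, 1, n_mels, frames)
        e = self.enc(noisy_mel)
        u = self.up(self.down(e))
        u = u[..., : e.shape[-2], : e.shape[-1]]  # crop to match encoder size
        d = self.dec(torch.cat([u, e], dim=1))    # skip connection
        return self.out(d)                        # predicted clean Mel
```

Training such a model would pair noisy and clean Mel spectrograms of the same utterances and minimize, for example, an L1 reconstruction loss.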

With the speech augmentation flow, we found that the generated voice quality improved significantly, especially on the real-world iPhone recordings we collected from internal and external speakers. The MOS is 0.25 higher than that of a baseline flow without audio augmentation.

Figure 4 shows the final results of quality evaluation for Personal Voice, in both mean opinion score and voice similarity score.

Figure 4: The evaluation set contains 44 adult English speakers randomly selected from different US cities; each speaker has 10 minutes of audio recorded on an iPhone. The left chart shows MOS (mean opinion score): our Personal Voice system achieves 3.68, compared with 3.85 for the original recordings. The right chart shows the voice similarity score: our system achieves 3.8, which indicates similarity close to the “somewhat same” level (4). The MOS ranges from 1 (bad) to 5 (excellent). The VS score ranges from 1 (definitely different) to 5 (definitely same).

Conclusion

In this research highlight, we covered the technical details behind the Personal Voice feature, which users can employ to create their own voice overnight, fully on device, and then use with real-time speech synthesis to talk with others. Our hope is that people at risk of losing their ability to speak, such as those with ALS or other conditions that can progressively diminish speaking ability, will benefit greatly from the Personal Voice feature.

Acknowledgments

Many people contributed to this work, including Dipjyoti Paul, Jiangchuan Li, Luke Chang, Petko Petkov, Pierre Su, Shifas Padinjaru Veettil, and Ye Tian.

Apple Resources

Apple Developer. 2023. “Extended Speech Synthesis with Personal and Custom Voices.” [link.]

Apple Newsroom. 2023. “Apple Introduces New Features for Cognitive Accessibility, Along with Live Speech, Personal Voice, and Point and Speak in Magnifier.” [link.]

Apple Support. 2023. “Create a Personal Voice on your iPhone, iPad, or Mac.” [link.]

Apple YouTube. 2023. “Personal Voice on iPhone - The Lost Voice.” [link.]

References

Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, et al. 2018. “Efficient Neural Audio Synthesis.” [link.]

Morrison, Max, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio. 2022. “Chunked Autoregressive GAN for Conditional Waveform Synthesis.” March. [link.]

Open SLR. n.d. “LibriTTS Corpus.” [link.]

Ren, Yi, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. “FastSpeech 2: Fast and High-Quality End-To-End Text to Speech.” March. [link.]

Silva-Rodríguez, J., M. F. Dolz, M. Ferrer, A. Castelló, V. Naranjo, and G. Piñero. 2021. “Acoustic Echo Cancellation Using Residual U-Nets.” September. [link.]
