
Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals.

Compared to speech, the heterogeneity and structure of musical sound textures pose new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to a lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time shifts, such as a convolutional neural network.
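
To make the argument concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) shows that a small time shift of a sinusoid leaves its STFT magnitude essentially unchanged while altering the phase in every frame. A model conditioned only on such shift-invariant information has no basis for choosing a consistent absolute phase, and the resulting frame-to-frame drift is what breaks horizontal phase coherence.

```python
import numpy as np

# Toy illustration (not from the paper): a sinusoid and a slightly
# time-shifted copy have nearly identical STFT magnitudes but clearly
# different phases, so a shift-invariant model that predicts a
# time-domain waveform from magnitude-like features cannot resolve
# the absolute phase of each frame.
sr, f0, n_fft, hop = 16000, 440.0, 1024, 256
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f0 * t)
x_shifted = np.roll(x, 7)          # arbitrary shift of 7 samples

def stft(sig):
    # Hann-windowed frames, real FFT per frame.
    frames = [sig[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(sig) - n_fft, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

S, S_shifted = stft(x), stft(x_shifted)
print(np.max(np.abs(np.abs(S) - np.abs(S_shifted))))      # small: magnitudes barely change
print(np.max(np.abs(np.angle(S) - np.angle(S_shifted))))  # large: phases differ
```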

We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space consisting of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Measured with a novel harmonic error metric, our method improves the reconstruction of sustained notes and chords by 60% and 10%, respectively, compared to existing models.
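
As a rough sketch of what such a target space could look like in practice (the function name, parameters, and finite-difference approximation below are our own assumptions, not the paper's definition), one can pair the STFT magnitude with numerical estimates of the phase gradient along the time and frequency axes.

```python
import numpy as np
import librosa

def phase_gradient_target(y, n_fft=1024, hop_length=256):
    """Magnitude spectrum plus finite-difference phase gradients.

    A rough sketch of a shift-invariant target space; the exact
    formulation in the paper may differ.
    """
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    mag = np.abs(S)                       # magnitude spectrum

    # Unwrap the phase before differencing so the estimates are not
    # dominated by 2*pi jumps. Axis 0 is frequency, axis 1 is time.
    phase = np.angle(S)
    ph_t = np.unwrap(phase, axis=1)
    ph_f = np.unwrap(phase, axis=0)

    # Phase gradient along time (related to instantaneous frequency)
    # and along frequency (related to local group delay). Prepending
    # the first frame/bin keeps the shapes aligned with `mag`.
    dphase_dt = np.diff(ph_t, axis=1, prepend=ph_t[:, :1])
    dphase_df = np.diff(ph_f, axis=0, prepend=ph_f[:1, :])
    return mag, dphase_dt, dphase_df
```

Unlike raw phase, these quantities change little under a global time shift of the waveform: the magnitude and the time-direction derivative are unchanged, and the frequency-direction derivative is only offset by a constant, which is what makes such a target space far less sensitive to the absolute alignment of the input.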



The NSynth dataset is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The Freesound Loop Dataset is published by António Ramires et al. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
