Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector Based Pseudo-Labels

Authors: Zak Aldeneh, Takuya Higuchi, Jee-weon Jung†, Li-Wei Chen†, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe†, Tatiana Likhomanenko, Barry Theobald


Iterative self-training, or iterative pseudo-labeling (IPL), in which the improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach for improving the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for the unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.

† Carnegie Mellon University
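
The IPL loop described in the abstract (cluster embeddings into pseudo-labels, train a model on them, re-extract embeddings, repeat) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `initial_embed` is a hypothetical stand-in for an i-vector extractor, `train_encoder` is a nearest-class-mean stand-in for a neural speaker encoder, plain k-means is used for clustering, and the data are synthetic.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns one cluster index (pseudo-label) per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def initial_embed(features):
    """Hypothetical weak initial model (stand-in for an i-vector extractor):
    it only sees the first two feature dimensions."""
    return [f[:2] for f in features]


def train_encoder(features, labels, k):
    """Stand-in for training an encoder on pseudo-labels: store one mean per
    pseudo-class and embed an input as its negative distances to those means."""
    means = []
    for c in range(k):
        members = [f for f, l in zip(features, labels) if l == c]
        if not members:  # empty pseudo-class: fall back to the global mean
            members = features
        means.append([sum(x) / len(members) for x in zip(*members)])
    return lambda f: [-math.dist(f, m) for m in means]


def speaker_ipl(features, k, rounds=3):
    """Iterative pseudo-labeling: each round's model labels the next round."""
    labels = kmeans(initial_embed(features), k)  # bootstrap pseudo-labels
    for _ in range(rounds):
        encoder = train_encoder(features, labels, k)
        embeddings = [encoder(f) for f in features]
        labels = kmeans(embeddings, k)           # refreshed pseudo-labels
    return labels


def toy_data(num_speakers=3, per_speaker=20, dim=4, seed=1):
    """Synthetic 'utterance features': one Gaussian blob per speaker."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_speakers):
        center = [rng.uniform(-5, 5) for _ in range(dim)]
        data += [[c + rng.gauss(0, 0.5) for c in center]
                 for _ in range(per_speaker)]
    return data


features = toy_data()
pseudo_labels = speaker_ipl(features, k=3)
print(len(pseudo_labels), sorted(set(pseudo_labels)))
```

The sketch only mirrors the control flow of IPL. The components the abstract lists as studied (initial model, encoder, augmentations, number of clusters, clustering algorithm) correspond here to `initial_embed`, `train_encoder`, `k`, and `kmeans`; augmentations are omitted.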


