
This paper was accepted at the Federated Learning in the Age of Foundation Models workshop at NeurIPS 2023.

While automatic speech recognition (ASR) has witnessed remarkable achievements in recent years, it has not garnered widespread attention in the federated learning (FL) and differential privacy (DP) communities. Yet ASR is a well-suited benchmark for FL and DP because (i) speaker information provides a natural data split across users; (ii) data is heterogeneous across speakers, close to practical settings; (iii) it involves an interplay between acoustic and language modeling; and (iv) it is a sequence-to-sequence task. Recent production-ready state-of-the-art ASR models are large conformer and transformer models, whose optimization is known to pose challenges even in central training. While the main trends and benchmarks in FL and DP focus on small models, we show the necessity of disentangling optimization from model size: the behaviour of FL and DP for large models differs from that for small models. We speculate that FL with DP is harder for small models because their optimization problem is harder even in central training. In this paper, we analyze the key FL parameters (optimizers, training from scratch or from a seed model pre-trained centrally, cohort size, data heterogeneity) and propose the first benchmark of FL with DP in the context of large ASR models. We examine the applicability of prior results and present an overview of observed departures from the trends in prior work and from training different ASR models. Through this work, we provide researchers and practitioners in FL and DP with valuable insights into the fundamental differences that may arise when applying FL and DP research to large-scale ASR training.
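To make the FL parameters studied here concrete, below is a minimal sketch of one federated round with a server-side adaptive optimizer (FedAdam-style aggregation), assuming PyTorch. The function names, the hyperparameter values, and the assumption that the client model returns its own training loss are all illustrative, not the paper's exact configuration.

```python
# Minimal sketch of one federated round with client local training and a
# server-side adaptive optimizer. All names and defaults are illustrative.
import copy
import torch

def federated_round(server_model, cohort_loaders, local_epochs=10, local_lr=0.01):
    """Run local SGD on each sampled client and return the averaged update."""
    global_params = {k: v.detach().clone() for k, v in server_model.named_parameters()}
    deltas = []
    for loader in cohort_loaders:
        client = copy.deepcopy(server_model)
        opt = torch.optim.SGD(client.parameters(), lr=local_lr)
        for _ in range(local_epochs):
            for batch, target in loader:
                opt.zero_grad()
                loss = client(batch, target)  # assume the model returns its training loss (e.g. CTC)
                loss.backward()
                opt.step()
        client_params = dict(client.named_parameters())
        # Client update = difference between local and global weights.
        deltas.append({k: (client_params[k] - global_params[k]).detach()
                       for k in global_params})
    # Average client updates into a single pseudo-gradient.
    return {k: torch.stack([d[k] for d in deltas]).mean(0) for k in global_params}

def server_step(server_model, avg_delta, server_opt):
    """Apply the averaged update as a (negative) gradient via the server optimizer."""
    server_opt.zero_grad()
    for name, p in server_model.named_parameters():
        p.grad = -avg_delta[name]
    server_opt.step()
```

Choosing `server_opt = torch.optim.Adam(server_model.parameters(), lr=1e-3)` gives an adaptive (FedAdam-style) server update, while plain SGD with learning rate 1.0 recovers vanilla FedAvg; the choice of server optimizer is one of the key parameters compared in the paper.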

Figure 1: Comparison of word error rates (WERs) between central training and federated learning (FL), and the impact of the cohort size $L$ and seed models for large transformer models (~250M parameters) trained on Common Voice (CV): English (left) and French/German (right). We set a practical $T=2$k total central steps and 10 local epochs. FL models can achieve nearly optimal performance when training both from scratch and from a seed model pre-trained centrally, even on out-of-domain data, e.g. on LibriSpeech (LS).
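The outer training loop behind this setup can be sketched as follows, reusing the `federated_round`/`server_step` sketches above. Sampling a cohort of $L$ speakers per round over $T$ rounds, optionally warm-starting from a centrally pre-trained seed checkpoint; `make_loader`, `speaker_ids`, and the checkpoint path are assumed helpers for illustration.

```python
# Illustrative outer loop: T federated rounds, cohort of L speakers per round.
import random
import torch

def train_fl(server_model, speaker_ids, make_loader, T=2000, L=128, seed_ckpt=None):
    if seed_ckpt is not None:
        # Warm-start from a centrally pre-trained (possibly out-of-domain) seed model.
        server_model.load_state_dict(torch.load(seed_ckpt))
    server_opt = torch.optim.Adam(server_model.parameters(), lr=1e-3)
    for _ in range(T):
        cohort = random.sample(speaker_ids, L)          # natural per-speaker data split
        loaders = [make_loader(s) for s in cohort]
        avg_delta = federated_round(server_model, loaders, local_epochs=10)
        server_step(server_model, avg_delta, server_opt)
    return server_model
```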
Figure 2: Results for federated learning with differential privacy (DP), with a model pre-trained on LibriSpeech ($\sim$100 hours) used as central data and afterwards fine-tuned on Common Voice ($\sim$1.6k hours) used as client data. We set $\delta=10^{-9}$ and report the $\epsilon$ for which $(\epsilon, \delta)$-DP holds for a given cohort size and population, using the moments accountant. For scaled-up cohort sizes and populations where it is practically intractable to run model training (due to the dataset size), we extrapolate $(\epsilon, \delta)$-DP assuming the training dynamics remain unchanged and thus a similar word error rate could be obtained.
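As a rough illustration of the DP setup, the sketch below adds per-user clipping and Gaussian noise to the aggregated client updates, and tracks the resulting $(\epsilon, \delta)$ with the opacus library's RDP accountant as a stand-in for the moments accountant. The clip norm, noise multiplier, and the cohort/population/round counts are illustrative values, not the paper's configuration.

```python
# Sketch of DP aggregation for one round: clip each client update to norm C,
# average, and add Gaussian noise calibrated to C and the cohort size.
import torch

def dp_aggregate(deltas, clip_norm=1.0, noise_multiplier=1.0):
    clipped = []
    for d in deltas:
        flat = torch.cat([v.flatten() for v in d.values()])
        scale = min(1.0, clip_norm / (float(flat.norm()) + 1e-12))  # per-user L2 clipping
        clipped.append({k: v * scale for k, v in d.items()})
    cohort_size = len(deltas)
    noisy = {}
    for k in clipped[0]:
        mean = torch.stack([c[k] for c in clipped]).mean(0)
        # For a fixed noise multiplier, the effective noise shrinks with cohort size.
        noisy[k] = mean + torch.randn_like(mean) * noise_multiplier * clip_norm / cohort_size
    return noisy

# Privacy accounting over T rounds with a cohort of L users sampled from a
# population of N (RDP accounting as a stand-in for the moments accountant).
from opacus.accountants import RDPAccountant

accountant = RDPAccountant()
T, L, N = 2000, 1024, 1_000_000            # illustrative values
for _ in range(T):
    accountant.step(noise_multiplier=1.0, sample_rate=L / N)
print(accountant.get_epsilon(delta=1e-9))  # epsilon for which (eps, 1e-9)-DP holds
```

This also makes the extrapolation in the caption concrete: with the training dynamics (and thus the noise multiplier and number of rounds) held fixed, re-running only the accounting step with larger $L$ and $N$ yields the reported $(\epsilon, \delta)$ values.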

Related readings and updates.

Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR

In this paper, we start by training End-to-End Automatic Speech Recognition (ASR) models using Federated Learning (FL) and examining the fundamental considerations that can be pivotal in minimizing the performance gap in terms of word error rate between models trained using FL versus their centralized counterpart. Specifically, we study the effect of (i) adaptive optimizers, (ii) loss characteristics via altering Connectionist Temporal…

Population Expansion for Training Language Models with Private Federated Learning

Federated learning (FL) combined with differential privacy (DP) offers machine learning (ML) training with distributed devices and with a formal privacy guarantee. With a large population of devices, FL with DP produces a performant model in a timely manner. However, for applications with a smaller population, not only does the model utility degrade as the DP noise is inversely proportional to population, but also the training latency increases…