Mean Estimation with User-level Privacy under Data Heterogeneity
Authors: Rachel Cummings*, Vitaly Feldman*, Audra McMillan*, Kunal Talwar*
A key challenge in many modern data analysis tasks is that user data is heterogeneous. Different users may possess vastly different numbers of data points. More importantly, it cannot be assumed that all users sample from the same underlying distribution. This is true, for example, in language data, where different speech styles result in data heterogeneity. In this work we propose a simple model of heterogeneous user data in which users differ in both the distribution and the quantity of their data, and we provide a method for estimating the population-level mean while preserving user-level differential privacy. We demonstrate asymptotic optimality of our estimator and also prove general lower bounds on the error achievable in our problem. In particular, while the optimal non-private estimator can be shown to be linear, we show that privacy constrains us to use a non-linear estimator.
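To make the setting concrete, here is a minimal sketch of a standard user-level DP baseline for mean estimation: reduce each user to a single clipped mean, average those, and add Gaussian noise calibrated to the resulting sensitivity. This is an illustrative baseline only, not the estimator from the paper; the clipping threshold `clip` and privacy parameters are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_user_mean(user_data, clip=1.0, eps=1.0, delta=1e-5):
    """User-level DP mean estimation, baseline version (not the paper's
    estimator): each user contributes one clipped mean, so changing any
    single user's entire dataset moves the average by at most 2*clip/n,
    and Gaussian noise calibrated to that sensitivity gives
    (eps, delta)-differential privacy at the user level.
    """
    n = len(user_data)
    user_means = [np.clip(np.mean(x), -clip, clip) for x in user_data]
    avg = np.mean(user_means)
    sensitivity = 2.0 * clip / n  # worst-case swap of one user's clipped mean
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return avg + rng.normal(0.0, sigma)

# heterogeneous users: varying sample counts and slightly shifted means
users = [rng.normal(0.3 + 0.05 * rng.standard_normal(), 1.0,
                    size=rng.integers(5, 50))
         for _ in range(500)]
est = dp_user_mean(users, clip=1.0, eps=1.0)
```

Note that this baseline is linear in the clipped user means; the abstract's point is that under heterogeneity the optimal private estimator cannot take such a simple linear form.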
*=Equal Contributors
User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates
January 29, 2024 · Research areas: Methods and Algorithms, Privacy · Conference: AISTATS
We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime or requires a number of users that grows polynomially with the dimensionality of the problem. We develop new algorithms for user-level DP-SCO that obtain optimal rates, run in polynomial time, and require a number of users that…
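For intuition on what user-level DP-SCO involves, the following is a toy sketch in the spirit of DP gradient descent, not the paper's algorithm: at each step every user is summarized by one clipped average gradient, so the per-step sensitivity is in the number of users rather than the number of samples. The least-squares objective, step size, and noise scale are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def user_level_dp_gd(users, dim, steps=200, lr=0.5, clip=1.0, noise_mult=1.0):
    """Toy user-level DP gradient descent for least squares (illustrative
    sketch only): clipping each user's mean gradient bounds the per-step
    L2 sensitivity of the averaged update by 2*clip/n, and Gaussian
    noise scaled by noise_mult is added before each step.
    """
    w = np.zeros(dim)
    n = len(users)
    for _ in range(steps):
        grads = []
        for X, y in users:
            g = X.T @ (X @ w - y) / len(y)           # user's mean gradient
            g = g * min(1.0, clip / np.linalg.norm(g))  # user-level clipping
            grads.append(g)
        noise = rng.normal(0.0, noise_mult * 2.0 * clip / n, size=dim)
        w -= lr * (np.mean(grads, axis=0) + noise)
    return w

# synthetic problem: 100 users, 20 samples each, shared true parameter
w_true = np.array([1.0, -1.0, 0.5])
users = []
for _ in range(100):
    X = rng.normal(size=(20, 3))
    y = X @ w_true + 0.1 * rng.normal(size=20)
    users.append((X, y))
w_hat = user_level_dp_gd(users, dim=3)
```

Because each user holds many samples, their mean gradient concentrates, which is the intuition behind why user-level algorithms can get away with far fewer users than sample-level analyses suggest.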
Subspace Recovery from Heterogeneous Data with Non-isotropic Noise
November 10, 2022 · Research areas: Methods and Algorithms, Privacy · Conference: NeurIPS
*= Equal Contributions
Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from users, each contributing data samples from a -dimensional distribution with mean …
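As background for the problem the abstract sets up, here is a minimal PCA baseline: take the top eigenvectors of the empirical second-moment matrix. This shows only the easy isotropic-noise case; the paper's subject is precisely that non-isotropic, user-dependent noise biases this naive estimate. The dimensions and noise level below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def top_subspace(samples, k=1):
    """Plain PCA baseline: top-k eigenvectors of the empirical
    second-moment matrix. Valid when the noise is isotropic; with
    non-isotropic per-user noise (the paper's setting) this estimate
    is biased toward high-noise directions and needs correction.
    """
    M = samples.T @ samples / len(samples)
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return vecs[:, -k:]                  # columns span the estimated subspace

# rank-1 signal along a known direction u, plus isotropic noise
u = np.zeros(5)
u[0] = 1.0
signal = rng.normal(0.0, 2.0, size=(2000, 1)) * u
samples = signal + 0.3 * rng.normal(size=(2000, 5))
v = top_subspace(samples)[:, 0]
```

With isotropic noise the recovered direction `v` aligns closely with the true direction `u`; making this work when each user's noise covariance differs is the harder problem the paper addresses.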