Apple sponsored the Neural Information Processing Systems (NeurIPS) conference, which was held virtually from December 6 to 12. NeurIPS is a global conference focused on fostering the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects.

Learn more about NeurIPS

Visit our NeurIPS 2020 virtual booth

Conference Accepted Papers

Collegial Ensembles

Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, Oren Golan

Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size increase is a product of a better conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a single model. We show that the optimization dynamics of CE simplify dramatically when the number of models in the ensemble is large, resembling the dynamics of wide models, yet scale much more favorably. We use recent theoretical results on the finite width corrections of the NTK to perform efficient architecture search in a space of finite width CE that aims to either minimize capacity, or maximize trainability under a set of constraints. The resulting ensembles can be efficiently implemented in practical architectures using group convolutions and block diagonal layers. Finally, we show how our framework can be used to analytically derive optimal group convolution modules originally found using expensive grid searches, without having to train a single model.
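The remark about block diagonal layers can be illustrated with a small sketch. It is not the paper's implementation: the three-layer ReLU members, the averaging rule, and the helper names (member_forward, merge) are assumptions chosen for illustration. It shows that averaging several independent members gives the same output as a single wider network whose hidden-to-hidden weight matrix is block diagonal.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def member_forward(params, x):
    """Forward pass of a small ReLU network: input -> hidden -> hidden -> scalar."""
    W1, W2, w3 = params
    h = relu(matvec(W2, relu(matvec(W1, x))))
    return sum(a * b for a, b in zip(w3, h))

def ensemble_forward(members, x):
    """Aggregate independent members by averaging (an assumed aggregation rule)."""
    return sum(member_forward(p, x) for p in members) / len(members)

def merge(members):
    """Fold all members into one wide network with a block-diagonal middle layer."""
    n = len(members)
    h = len(members[0][1])
    W1 = [row for (m1, _, _) in members for row in m1]  # first layers stacked
    W2 = []
    for k, (_, m2, _) in enumerate(members):
        for row in m2:
            # Each member's hidden weights occupy their own diagonal block.
            W2.append([0.0] * (k * h) + list(row) + [0.0] * ((n - k - 1) * h))
    w3 = [a / n for (_, _, m3) in members for a in m3]  # averaging folded into the readout
    return W1, W2, w3

def random_member(rng, d, h):
    W1 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(h)]
    W2 = [[rng.gauss(0, 1) for _ in range(h)] for _ in range(h)]
    w3 = [rng.gauss(0, 1) for _ in range(h)]
    return W1, W2, w3

rng = random.Random(0)
members = [random_member(rng, d=3, h=4) for _ in range(5)]
x = [0.5, -1.0, 2.0]
ce_out = ensemble_forward(members, x)
merged_out = member_forward(merge(members), x)
```

The two outputs agree up to floating-point rounding, which is why an ensemble of this kind can be trained as a single model.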

Faster Differentially Private Samplers via Rényi Divergence Analysis of Discretized Langevin MCMC

Arun Ganesh, Kunal Talwar

Various differentially private algorithms instantiate the exponential mechanism, and require sampling from the distribution exp(−f) for a suitable function f. When the domain of the distribution is high-dimensional, this sampling can be computationally challenging. Using heuristic sampling schemes such as Gibbs sampling does not necessarily lead to provable privacy. When f is convex, techniques from log-concave sampling lead to polynomial-time algorithms, albeit with large polynomials. Langevin dynamics-based algorithms offer much faster alternatives under some distance measures such as statistical distance. In this work, we establish rapid convergence for these algorithms under distance measures more suitable for differential privacy. For smooth, strongly-convex f, we give the first results proving convergence in Rényi divergence. This gives us fast differentially private algorithms for such f. Our techniques are simple and generic and apply also to underdamped Langevin dynamics.
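For readers unfamiliar with discretized Langevin dynamics, the following is a minimal sketch of the basic (overdamped, unadjusted) update for sampling approximately from exp(−f). The potential f(x) = x²/2, the step size, and the number of steps are assumptions chosen for illustration, and the sketch says nothing about the privacy analysis in the paper.

```python
import math
import random

def grad_f(x):
    """Gradient of the illustrative potential f(x) = x^2 / 2 (smooth and strongly convex)."""
    return x

def langevin_sample(grad, x0, step, n_steps, rng):
    """Run the discretized Langevin update and return the final iterate."""
    x = x0
    for _ in range(n_steps):
        # Gradient step on f plus Gaussian noise scaled by sqrt(2 * step).
        x = x - step * grad(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [langevin_sample(grad_f, x0=3.0, step=0.05, n_steps=200, rng=rng)
           for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For this choice of f the target exp(−f) is a standard normal, so the sample mean should be close to 0 and the sample variance close to 1, with a small bias introduced by the discretization.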

On the Error Resistance of Hinge-Loss Minimization

Kunal Talwar

Commonly used classification algorithms in machine learning, such as support vector machines, minimize a convex surrogate loss on training examples. In practice, these algorithms are surprisingly robust to errors in the training data. In this work, we identify a set of conditions on the data under which such surrogate loss minimization algorithms provably learn the correct classifier. This allows us to establish, in a unified framework, the robustness of these algorithms under various models on data as well as error. In particular, we show that if the data is linearly classifiable with a slightly non-trivial margin (i.e., a margin at least C/√d for d-dimensional unit vectors), and the class-conditional distributions are near isotropic and logconcave, then surrogate loss minimization has negligible error on the uncorrupted data even when a constant fraction of examples are adversarially mislabeled.
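A small experiment can give some intuition for this setting. The sketch below is only an illustration under simplifying assumptions: Gaussian data in five dimensions, a fixed margin of 0.2, randomly (rather than adversarially) flipped labels, and plain subgradient descent on a lightly regularized hinge loss. None of these choices come from the paper.

```python
import random

def make_data(rng, n, d, margin, flip_fraction):
    """Gaussian points labeled by the sign of the first coordinate, with some labels flipped."""
    xs, clean, noisy = [], [], []
    while len(xs) < n:
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        if abs(x[0]) < margin:  # keep a margin around the true separator
            continue
        y = 1 if x[0] > 0 else -1
        xs.append(x)
        clean.append(y)
        noisy.append(-y if rng.random() < flip_fraction else y)
    return xs, clean, noisy

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def train_hinge(xs, ys, lam=0.01, step=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized average hinge loss."""
    d, n = len(xs[0]), len(xs)
    w = [0.0] * d
    for _ in range(epochs):
        g = [lam * wj for wj in w]
        for x, y in zip(xs, ys):
            if y * dot(w, x) < 1.0:  # hinge term is active: subgradient is -y * x
                for j in range(d):
                    g[j] -= y * x[j] / n
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

def accuracy(w, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if y * dot(w, x) > 0) / len(xs)

rng = random.Random(0)
xs, clean, noisy = make_data(rng, n=500, d=5, margin=0.2, flip_fraction=0.1)
w = train_hinge(xs, noisy)                # train on the corrupted labels
clean_accuracy = accuracy(w, xs, clean)   # evaluate against the uncorrupted labels
```

Training uses only the noisy labels, while clean_accuracy measures agreement with the uncorrupted ones, which is the quantity the abstract refers to.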

Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, Kunal Talwar

Uniform stability is a notion of algorithmic stability that bounds the worst-case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. (2016) provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important progress in the understanding of the generalization properties of SGD and to several applications to differentially private convex optimization for smooth losses.
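One way to get a feel for the quantity that uniform stability bounds is to run SGD on two datasets that differ in a single point, using the same sampling order, and compare the resulting models. The sketch below does this for the nonsmooth absolute loss. The data, step size, and number of steps are assumptions for illustration, and a single pair of runs only probes one instance, not the worst case.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sgd_abs_loss(data, order, step):
    """SGD on the nonsmooth convex loss |w.x - y|, visiting examples in the given order."""
    w = [0.0] * len(data[0][0])
    for i in order:
        x, y = data[i]
        r = dot(w, x) - y
        s = (r > 0) - (r < 0)  # a subgradient of |r| (taken as 0 at the kink)
        w = [wj - step * s * xj for wj, xj in zip(w, x)]
    return w

rng = random.Random(0)
n, d, step = 50, 3, 0.05
data = [([rng.uniform(-1, 1) for _ in range(d)], rng.uniform(-1, 1)) for _ in range(n)]

# Neighboring dataset: identical except that the first example is replaced.
neighbor = list(data)
neighbor[0] = ([rng.uniform(-1, 1) for _ in range(d)], rng.uniform(-1, 1))

order = [rng.randrange(n) for _ in range(200)]  # shared randomness for both runs
w_a = sgd_abs_loss(data, order, step)
w_b = sgd_abs_loss(neighbor, order, step)
gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_a, w_b)))
```

Here gap is the Euclidean distance between the two trained models. Because each subgradient has norm at most √d on this data, it can never exceed 2 · step · (number of steps) · √d, and stability analyses aim for much tighter bounds than this trivial one.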

Stochastic Optimization with Laggard Data Pipelines

Naman Agarwal, Rohan Anil, Tomer Koren, Kunal Talwar, Cyril Zhang

State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We provide the first convergence analyses of "data-echoed" extensions of common optimization methods, showing that they exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
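The idea of data echoing can be sketched in a few lines: each time a fresh batch arrives, the optimizer takes several gradient steps on it instead of one. The one-dimensional least-squares problem, the batch size, the step size, and the echo factor below are assumptions for illustration and are unrelated to the paper's analysis or experiments.

```python
import random

def make_batch(rng, size, true_w=3.0, noise=0.1):
    """Draw a fresh minibatch for the model y = true_w * x + noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    return [(x, true_w * x + rng.gauss(0.0, noise)) for x in xs]

def grad(w, batch):
    """Gradient of the mean squared error over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(n_batches, echo, step=0.1, batch_size=32, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_batches):
        batch = make_batch(rng, batch_size)  # stands in for the slow upstream fetch
        for _ in range(echo):                # reuse the same batch `echo` times
            w -= step * grad(w, batch)
    return w

w_plain = train(n_batches=20, echo=1)   # one step per fresh batch
w_echoed = train(n_batches=20, echo=4)  # four steps per fresh batch
```

Both runs consume the same 20 batches, so the comparison isolates the effect of echoing: in this toy problem the echoed run ends closer to the true weight of 3.0 for the same amount of fresh data.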

What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Vitaly Feldman, Chiyuan Zhang

Deep learning algorithms are well-known to have a propensity for fitting the training data very well and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is, they have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation, or even an approach for obtaining such evidence, was given.

In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in (Feldman, 2019).
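A subsampled memorization estimate of this kind can be sketched as follows: train many models on random subsets of the data, then, for each example, compare how often it is classified correctly when it was in the training subset with how often when it was held out. The toy one-dimensional dataset, the 1-nearest-neighbor learner, and the function names are assumptions for illustration. The experiments in the paper use deep networks on standard benchmarks.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def estimate_memorization(data, subset_size, n_trials, rng):
    """Accuracy on example i when it is in the training subset minus when it is held out."""
    n = len(data)
    hits = {True: [0] * n, False: [0] * n}
    counts = {True: [0] * n, False: [0] * n}
    for _ in range(n_trials):
        chosen = set(rng.sample(range(n), subset_size))
        train = [data[j] for j in sorted(chosen)]
        for i, (x, y) in enumerate(data):
            included = i in chosen
            counts[included][i] += 1
            hits[included][i] += int(predict_1nn(train, x) == y)
    scores = []
    for i in range(n):
        if counts[True][i] == 0 or counts[False][i] == 0:
            scores.append(None)  # not enough trials to estimate this example
        else:
            scores.append(hits[True][i] / counts[True][i]
                          - hits[False][i] / counts[False][i])
    return scores

# Two well-separated classes plus one atypical example.
data = [(float(x), 0) for x in range(5)] + [(float(x), 1) for x in range(10, 15)]
data.append((5.5, 1))  # sits next to class 0 but carries label 1
mem = estimate_memorization(data, subset_size=8, n_trials=300, rng=random.Random(0))
```

In this toy setup the atypical last example gets an estimate near 1, since it is only classified correctly when it is itself in the training subset, while a typical example such as the first one gets an estimate near 0.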

Talks and Workshops

Machine Learning for Mobile Health Workshop

This workshop, held on December 12, aimed to assemble researchers from the key areas in mobile health to better address the challenges currently facing the widespread use of mobile health technologies. We presented our accepted workshop paper, which discusses synthetic data generated using a realistic ECG simulator and a structured noise model.

Machine Learning for Mobile Health Workshop Accepted Paper:
Representing and Denoising Wearable ECG Recordings
Jeffrey Chan, Andrew C. Miller, Emily Fox

Privacy Preserving Machine Learning (PPML) Workshop

This workshop, held on December 11, focused on privacy-preserving techniques for machine learning and disclosure in large-scale data analysis, in both the distributed and centralized settings. We presented our accepted workshop papers, which consider a sequential setting in which a single dataset of individuals is used to perform adaptively chosen analyses, and show how random shuffling of input data amplifies differential privacy guarantees.

PPML Workshop Accepted Papers:
Individual Privacy Accounting via a Renyi Filter
Vitaly Feldman, Tijana Zrnic

A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling
Vitaly Feldman, Audra McMillan, Kunal Talwar

Offline Reinforcement Learning Workshop

This workshop, held on December 12, brought attention to offline reinforcement learning and facilitated discussions of algorithmic challenges, solutions, and real-world applications.

Offline Reinforcement Learning Workshop Accepted Paper:
Uncertainty Weighted Offline Reinforcement Learning
Yue Wu, Shuangfei Zhai, Nitish Srivastava, Josh Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh

Expo Day

Accelerated Training with ML Compute on M1-Powered Mac
Apple presented a talk on Accelerated Training with ML Compute on M1-Powered Mac during NeurIPS Expo Day on December 6.

In November, Apple announced Mac powered by the M1 chip, featuring a powerful machine learning accelerator and high-performance GPU. ML Compute, a new framework available in macOS Big Sur, enables developers to accelerate the training of neural networks using the CPU and GPU.

In this talk, we discussed how we use ML Compute to speed up the training of ML models on M1-powered Mac with popular deep learning frameworks such as TensorFlow. We showed how to replace the TensorFlow ops in graph and eager mode with an ML Compute graph. We also presented the performance and watt improvements when training neural networks on Mac with M1. Finally, we examined how unified memory and other memory optimizations on M1-powered Mac allow us to minimize the memory footprint when training neural networks. Learn more about how we leverage ML Compute for Accelerated Training on Mac.

Affinity Group Workshops

Apple sponsored the Black in AI, LatinX in AI, Queer in AI, and Women in Machine Learning workshops throughout the week.

At the Women in Machine Learning workshop on December 9, we gave a talk on how our on-device machine learning powers intelligent experiences on Apple products.

Learn more about Apple’s company-wide inclusion and diversity efforts
