Apple sponsored the Neural Information Processing Systems (NeurIPS) conference, which was held virtually from December 6 to 12. NeurIPS is a global conference focused on fostering the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects.
Visit our NeurIPS 2020 virtual booth
Conference Accepted Papers
Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, Oren Golan
Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size increase is a product of a better conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a single model. We show that the optimization dynamics of CE simplify dramatically when the number of models in the ensemble is large, resembling the dynamics of wide models, yet scale much more favorably. We use recent theoretical results on the finite width corrections of the NTK to perform efficient architecture search in a space of finite width CE that aims to either minimize capacity, or maximize trainability under a set of constraints. The resulting ensembles can be efficiently implemented in practical architectures using group convolutions and block diagonal layers. Finally, we show how our framework can be used to analytically derive optimal group convolution modules originally found using expensive grid searches, without having to train a single model.
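The block diagonal implementation mentioned above rests on a simple identity, illustrated here as a minimal sketch rather than the paper's code. It uses plain Python lists, and the weights, hidden vectors, and the helper names `matvec` and `block_diagonal` are hypothetical. Applying each ensemble member's weight matrix to its own hidden vector gives the same result as applying one block diagonal matrix to the concatenated hidden vectors.

```python
def matvec(matrix, vector):
    # Dense matrix-vector product on plain Python lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def block_diagonal(blocks):
    # Place each member's weight matrix on the diagonal of one larger matrix.
    total_cols = sum(len(block[0]) for block in blocks)
    rows = []
    offset = 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            padding_right = total_cols - offset - width
            rows.append([0.0] * offset + list(row) + [0.0] * padding_right)
        offset += width
    return rows

# Two hypothetical ensemble members, each with its own 2x2 layer and hidden state.
member_weights = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[-1.0, 0.5], [2.0, 2.0]],
]
member_hidden = [[1.0, 1.0], [2.0, -2.0]]

# Run each member separately, then concatenate the outputs.
separate = [y for w, h in zip(member_weights, member_hidden) for y in matvec(w, h)]

# Run all members at once through a single block diagonal layer.
concatenated = [v for h in member_hidden for v in h]
fused = matvec(block_diagonal(member_weights), concatenated)
```

Both `separate` and `fused` hold the same four values, which is why independent members can be trained together as a single model with structured layers.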
Arun Ganesh, Kunal Talwar
Various differentially private algorithms instantiate the exponential mechanism, and require sampling from the distribution exp(−f) for a suitable function f. When the domain of the distribution is high-dimensional, this sampling can be computationally challenging. Using heuristic sampling schemes such as Gibbs sampling does not necessarily lead to provable privacy. When f is convex, techniques from log-concave sampling lead to polynomial-time algorithms, albeit with large polynomials. Langevin dynamics-based algorithms offer much faster alternatives under some distance measures such as statistical distance. In this work, we establish rapid convergence for these algorithms under distance measures more suitable for differential privacy. For smooth, strongly-convex f, we give the first results proving convergence in Rényi divergence. This gives us fast differentially private algorithms for such f. Our techniques are simple and generic and apply also to underdamped Langevin dynamics.
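To make the sampling step above concrete, here is a minimal sketch of discretized (overdamped) Langevin dynamics targeting exp(−f). The potential f(x) = ||x||²/2, the step size, and the step count are illustrative assumptions, not values from the paper, and the sketch makes no privacy claim on its own: the paper's contribution is the convergence analysis, which is not reproduced here.

```python
import math
import random

def grad_f(x):
    # Gradient of the illustrative potential f(x) = ||x||^2 / 2 (an assumption),
    # for which exp(-f) is a standard Gaussian.
    return list(x)

def langevin_sample(grad, dim, step_size, num_steps, rng):
    """Run discretized overdamped Langevin dynamics and return the final iterate."""
    x = [0.0] * dim
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(num_steps):
        g = grad(x)
        # Gradient step on f plus Gaussian noise scaled by sqrt(2 * step_size).
        x = [xi - step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x

rng = random.Random(0)
samples = [langevin_sample(grad_f, dim=2, step_size=0.05, num_steps=200, rng=rng)
           for _ in range(2000)]

# Empirical mean and variance of the first coordinate across independent chains.
mean0 = sum(s[0] for s in samples) / len(samples)
var0 = sum((s[0] - mean0) ** 2 for s in samples) / len(samples)
```

For this Gaussian target the empirical mean should sit near 0 and the variance near 1, with a small bias from the finite step size.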
Commonly used classification algorithms in machine learning, such as support vector machines, minimize a convex surrogate loss on training examples. In practice, these algorithms are surprisingly robust to errors in the training data. In this work, we identify a set of conditions on the data under which such surrogate loss minimization algorithms provably learn the correct classifier. This allows us to establish, in a unified framework, the robustness of these algorithms under various models on data as well as error. In particular, we show that if the data is linearly classifiable with a slightly non-trivial margin (i.e. a margin at least C/√d for d-dimensional unit vectors), and the class-conditional distributions are near isotropic and logconcave, then surrogate loss minimization has negligible error on the uncorrupted data even when a constant fraction of examples are adversarially mislabeled.
Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, Kunal Talwar
Uniform stability is a notion of algorithmic stability that bounds the worst-case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. (2016) provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important progress in understanding the generalization properties of SGD and to several applications to differentially private convex optimization for smooth losses.
Naman Agarwal, Rohan Anil, Tomer Koren, Kunal Talwar, Cyril Zhang
State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We provide the first convergence analyses of "data-echoed" extensions of common optimization methods, showing that they exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
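The data echoing idea described above can be sketched on a toy problem. This is an illustrative sketch only, not the paper's algorithm or analysis. The one-dimensional least-squares task, the learning rate, and the names `make_batches` and `sgd_with_echoing` are all hypothetical. The only change from plain SGD is the inner loop, which takes `echo_factor` gradient steps on the current batch before consuming a fresh one.

```python
import random

def make_batches(num_batches, batch_size, rng):
    # Hypothetical upstream pipeline: noisy samples of y = 3x.
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1.0, 1.0)
            batch.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
        yield batch

def batch_gradient(w, batch):
    # Gradient of the mean of 0.5 * (w * x - y)^2 over the batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sgd_with_echoing(batches, echo_factor, learning_rate):
    w = 0.0
    for batch in batches:
        # Reuse the same batch echo_factor times while "waiting" for the next one.
        for _ in range(echo_factor):
            w -= learning_rate * batch_gradient(w, batch)
    return w

# Same data stream in both runs (same seed); only the echo factor differs.
w_plain = sgd_with_echoing(make_batches(20, 32, random.Random(0)), 1, 0.5)
w_echo = sgd_with_echoing(make_batches(20, 32, random.Random(0)), 4, 0.5)
```

Both runs consume the same 20 batches; the echoed run simply performs four times as many parameter updates per batch fetched, which is the trade-off the convergence analysis addresses.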
Vitaly Feldman, Chiyuan Zhang
Deep learning algorithms are well known to have a propensity for fitting the training data very well, and they often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is, they have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation, or even an approach for obtaining such evidence, was given.
In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in (Feldman, 2019).
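One way to read the subsampled memorization estimate described above is sketched below, under stated assumptions: train on many random subsets, then compare how often a training example is classified correctly when it is in the subset versus when it is left out. The 1-nearest-neighbor learner, the toy dataset, and the function names are hypothetical stand-ins chosen so the sketch runs quickly; the paper's experiments use deep networks on standard benchmarks.

```python
import random

def nearest_neighbor_predict(train_points, x):
    # Toy learner (an assumption for illustration): 1-nearest-neighbor on scalars.
    nearest = min(train_points, key=lambda point: abs(point[0] - x))
    return nearest[1]

def estimate_memorization(data, num_trials, subset_fraction, rng):
    n = len(data)
    subset_size = int(subset_fraction * n)
    hits_in, count_in = [0] * n, [0] * n
    hits_out, count_out = [0] * n, [0] * n
    for _ in range(num_trials):
        # "Train" on a random subset of the data.
        chosen = set(rng.sample(range(n), subset_size))
        train_points = [data[j] for j in chosen]
        for i, (x, y) in enumerate(data):
            correct = int(nearest_neighbor_predict(train_points, x) == y)
            if i in chosen:
                hits_in[i] += correct
                count_in[i] += 1
            else:
                hits_out[i] += correct
                count_out[i] += 1
    # Accuracy on example i when included, minus accuracy when excluded.
    return [hits_in[i] / max(count_in[i], 1) - hits_out[i] / max(count_out[i], 1)
            for i in range(n)]

# Two well-separated classes plus one atypical point labeled 1 inside class 0.
data = [(float(x), 0) for x in range(10)]
data += [(float(x), 1) for x in range(20, 30)]
data += [(5.5, 1)]

memorization = estimate_memorization(data, num_trials=200, subset_fraction=0.7,
                                     rng=random.Random(0))
```

In this toy setting the atypical point (the last entry) is only classified correctly when it is part of the training subset, so its estimate is high, while a typical point deep inside its own class gets an estimate near zero.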
Talks and Workshops
This workshop, held on December 12, aimed to assemble researchers from the key areas in mobile health to better address the challenges currently facing the widespread use of mobile health technologies. We presented our accepted workshop paper, which discusses synthetic data generated using a realistic ECG simulator and a structured noise model.
Machine Learning for Mobile Health Workshop Accepted Paper:
Representing and Denoising Wearable ECG Recordings
Jeffrey Chan, Andrew C. Miller, Emily Fox
This workshop, held on December 11, focused on privacy-preserving techniques for machine learning and disclosure in large-scale data analysis, in both the distributed and centralized settings. We presented our accepted workshop papers, which discuss privacy accounting in a sequential setting in which a single dataset of individuals is used to perform adaptively chosen analyses, and how random shuffling of input data amplifies differential privacy guarantees.
PPML Workshop Accepted Papers:
Individual Privacy Accounting via a Renyi Filter
Vitaly Feldman, Tijana Zrnic
A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling
Vitaly Feldman, Audra McMillan, Kunal Talwar
This workshop, held on December 12, brought attention to offline reinforcement learning and facilitated discussions of algorithmic challenges, solutions, and real-world applications.
Offline Reinforcement Learning Workshop Accepted Paper:
Uncertainty Weighted Offline Reinforcement Learning
Yue Wu, Shuangfei Zhai, Nitish Srivastava, Josh Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh
Accelerated Training with ML Compute on M1-Powered Mac
Apple presented a talk on Accelerated Training with ML Compute on M1-Powered Mac during NeurIPS Expo Day on December 6.
In November, Apple announced Mac powered by the M1 chip, featuring a powerful machine learning accelerator and high-performance GPU. ML Compute, a new framework available in macOS Big Sur, enables developers to accelerate the training of neural networks using the CPU and GPU.
In this talk, we discussed how we use ML Compute to speed up the training of ML models on M1-powered Mac with popular deep learning frameworks such as TensorFlow. We showed how to replace the TensorFlow ops in graph and eager mode with an ML Compute graph. We also presented the performance and power-efficiency improvements when training neural networks on Mac with M1. Finally, we examined how unified memory and other memory optimizations on M1-powered Mac allow us to minimize the memory footprint when training neural networks. Learn more about how we leverage ML Compute for Accelerated Training on Mac.
Affinity Group Workshops
At the Women in Machine Learning workshop on December 9, we gave a talk on how our on-device machine learning powers intelligent experiences on Apple products.