The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon
Authors: Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Josh Susskind
This paper was accepted to the “Has it Trained Yet?” (HITY) workshop at NeurIPS 2022.
The grokking phenomenon reported by Power et al. refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly that plagues adaptive optimizers at extremely late stages of training, which we refer to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism is a cyclic phase transition between stable and unstable training regimes, which can be easily monitored through the cyclic behavior of the norm of the last layer's weights. We empirically observe that without explicit regularization, grokking almost exclusively happens at the onset of Slingshots and is absent without them. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any optimization theory we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of its origin.
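The monitoring signal described above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the authors' code): it computes the Frobenius norm of a last-layer weight matrix and flags steps where the norm jumps sharply relative to the previous step, a crude proxy for the stable-to-unstable transitions; the `spike_ratio` threshold is an arbitrary illustrative choice.

```python
import math

def frobenius_norm(weight_matrix):
    """Frobenius norm of a weight matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in weight_matrix for w in row))

def detect_slingshots(norm_history, spike_ratio=1.5):
    """Flag steps where the last-layer weight norm jumps sharply
    relative to the previous step. `spike_ratio` is an arbitrary
    illustrative threshold, not a value from the paper."""
    flagged = []
    for t in range(1, len(norm_history)):
        if norm_history[t] > spike_ratio * norm_history[t - 1]:
            flagged.append(t)
    return flagged

# Toy norm trajectory: slow growth punctuated by two abrupt jumps.
history = [1.0, 1.05, 1.1, 2.5, 2.6, 2.7, 5.9, 6.0]
print(detect_slingshots(history))  # -> [3, 6]
```

In practice one would log the norm of the final classification layer at each optimizer step and inspect the resulting trajectory for the cyclic pattern the paper describes.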