Paper, February 2026

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

Authors: Bruno Mlodozeniec†**, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi

Hyperparameter tuning can dramatically affect the training stability and final performance of large-scale models. Recent work on neural network parameterisations, such as μP, has enabled the transfer of optimal global hyperparameters across model sizes. This work proposes the empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large one. We extend it in two key ways. First, to handle scaling along the most important axes, we propose the Complete(d) Parameterisation, which unifies scaling in width and depth (using an adaptation of CompleteP) with scaling in batch size and training duration. Second, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
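As a rough illustration of the "tune small, transfer large" practice described above, the sketch below rescales a base learning rate when the model width grows. It assumes the commonly cited μP-style rule for Adam, in which the learning rate of hidden weight matrices shrinks in proportion to 1/width while embedding-like parameters keep their base value. It is not the Complete(d) Parameterisation, whose scaling rules are not given in this abstract, and the names `BaseHParams` and `transfer_to_width` as well as all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseHParams:
    """Hyperparameters tuned once on a small proxy model (hypothetical container)."""
    base_width: int
    hidden_lr: float  # Adam learning rate for hidden weight matrices
    embed_lr: float   # Adam learning rate for embedding-like parameters


def transfer_to_width(base: BaseHParams, width: int) -> dict:
    """Return learning rates for a wider model without re-tuning.

    Assumption: a muP-style rule for Adam, where the hidden-matrix learning
    rate scales as base_width / width and the embedding learning rate is
    left unchanged. Depth, batch size and duration are ignored here.
    """
    ratio = base.base_width / width
    return {
        "hidden_lr": base.hidden_lr * ratio,
        "embed_lr": base.embed_lr,
    }


# Illustrative values only: tune at width 256, reuse at width 4096.
base = BaseHParams(base_width=256, hidden_lr=3e-3, embed_lr=3e-2)
large = transfer_to_width(base, width=4096)
```

The point of the design is that only `base` is ever searched over; the larger model's values are derived from it by a fixed rule.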

† University of Cambridge
** Work done while at Apple

Figure 1: We optimise hyperparameters (learning rate, initialisation scale, Adam ε, momenta, and weight decay) at a small scale of 50M parameters and 1.6B tokens with an evolutionary strategy. These hyperparameters (HPs) can be optimised either globally, with a shared value across the entire model, or per-module (with 13 module types, some additionally tuned per depth). The per-module approach leads to better results at the 50M scale: optimal global HPs require 2.3× longer training to achieve the same performance. Crucially, our new parameterisation, Complete(d)P, enables direct transfer (without subsequent tuning) to a ~14000× larger FLOP budget.
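The difference between global and per-module search can be sketched with a toy example. The code below runs a simple (1+λ) evolutionary strategy over log10 learning rates, either shifting one shared value or mutating each module independently. Everything in it is an assumption made for illustration: the four module names, the quadratic `toy_loss` that stands in for a training run, the `TOY_OPTIMUM` values, and the choice of (1+λ) selection. The figure does not specify which evolutionary strategy the authors use.

```python
import random

# Hypothetical module types; the paper tunes 13.
MODULES = ["embedding", "attention", "mlp", "unembedding"]

# Hypothetical per-module optimal log10 learning rates for the toy loss.
TOY_OPTIMUM = {"embedding": -1.5, "attention": -3.0, "mlp": -2.5, "unembedding": -3.5}


def toy_loss(log_lrs):
    """Stand-in for a training run: a quadratic bowl around TOY_OPTIMUM."""
    return sum((log_lrs[m] - TOY_OPTIMUM[m]) ** 2 for m in MODULES)


def mutate(parent, per_module, sigma, rng):
    """Perturb every module independently, or apply one shared shift."""
    if per_module:
        return {m: v + rng.gauss(0.0, sigma) for m, v in parent.items()}
    shift = rng.gauss(0.0, sigma)
    return {m: v + shift for m, v in parent.items()}


def evolve(per_module, generations=200, population=8, sigma=0.3, seed=0):
    """(1+lambda) evolutionary strategy: keep the parent unless a child beats it."""
    rng = random.Random(seed)
    parent = {m: -3.0 for m in MODULES}  # same starting value for every module
    best = toy_loss(parent)
    for _ in range(generations):
        children = [mutate(parent, per_module, sigma, rng) for _ in range(population)]
        candidate = min(children, key=toy_loss)
        candidate_loss = toy_loss(candidate)
        if candidate_loss < best:
            parent, best = candidate, candidate_loss
    return parent, best


_, global_loss = evolve(per_module=False)
_, module_loss = evolve(per_module=True)
```

In this toy setting the global search can do no better than placing the shared value at the mean of the per-module optima, so its loss has a floor that the per-module search is free to go below. The cost is a search space whose dimension grows with the number of modules.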

