
This paper was accepted at the Mathematics of Modern Machine Learning (M3L) Workshop at NeurIPS 2024.

We investigate the unreasonable effectiveness of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions by showing that CFG interacts differently with DDPM and DDIM, and that neither sampler with CFG generates the gamma-powered distribution p(x|c)^γ p(x)^(1−γ). We then clarify the behavior of CFG by showing that it is a kind of Predictor-Corrector (PC) method that alternates between denoising and sharpening, which we call Predictor-Corrector Guidance (PCG). We show that in the SDE limit, DDPM-CFG is equivalent to PCG with a DDIM predictor applied to the conditional distribution and a Langevin dynamics corrector applied to a gamma-powered distribution. While the standard PC corrector applies to the conditional distribution and improves sampling accuracy, our corrector sharpens the distribution.
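To make the corrector half of this picture concrete, here is a minimal sketch under toy assumptions that are ours, not the paper's: a 1-D setting where the unconditional density is N(0, 1) and the conditional density is N(1, 0.5), so both scores are known in closed form. The CFG-combined score s(x) + γ(s(x|c) − s(x)) is exactly the score of the gamma-powered density p(x|c)^γ p(x)^(1−γ), so running unadjusted Langevin dynamics with it at a single, noise-free level should draw samples from that sharpened density. The function names, step size, and step count are hypothetical choices for illustration; the sketch omits the noise schedule and the DDIM predictor, so it does not reproduce DDPM-CFG or PCG themselves.

```python
import math
import random
from typing import Tuple

# Toy 1-D assumption (illustrative only, not from the paper):
# unconditional p(x) = N(0, 1), conditional p(x|c) = N(MU_C, VAR_C).
MU_C, VAR_C = 1.0, 0.5


def score_uncond(x: float) -> float:
    """Score d/dx log N(x; 0, 1)."""
    return -x


def score_cond(x: float) -> float:
    """Score d/dx log N(x; MU_C, VAR_C)."""
    return -(x - MU_C) / VAR_C


def cfg_score(x: float, gamma: float) -> float:
    """CFG-combined score s(x) + gamma * (s(x|c) - s(x)).

    Algebraically this is the score of p(x|c)^gamma * p(x)^(1 - gamma).
    """
    return score_uncond(x) + gamma * (score_cond(x) - score_uncond(x))


def gamma_powered_params(gamma: float) -> Tuple[float, float]:
    """Closed-form mean and variance of the gamma-powered Gaussian (toy case)."""
    # Precisions add when Gaussian densities are multiplied/powered.
    precision = gamma / VAR_C + (1.0 - gamma) * 1.0
    if precision <= 0:
        raise ValueError("gamma-powered density is not normalizable")
    mean = (gamma * MU_C / VAR_C) / precision
    return mean, 1.0 / precision


def langevin_corrector(x: float, gamma: float, step: float, n_steps: int,
                       rng: random.Random) -> float:
    """Unadjusted Langevin dynamics driven by the CFG score (hypothetical helper)."""
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * cfg_score(x, gamma) + math.sqrt(2.0 * step) * noise
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    gamma = 2.0
    samples = [langevin_corrector(rng.gauss(0.0, 1.0), gamma, 0.01, 500, rng)
               for _ in range(2000)]
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    print("empirical mean/var:", round(m, 3), round(v, 3))
    print("gamma-powered mean/var:", gamma_powered_params(gamma))
```

With γ = 2 the gamma-powered density in this toy case has mean 4/3 and variance 1/3, narrower than the conditional N(1, 0.5); that narrowing is the sharpening effect described above. The empirical variance should land slightly above 1/3 because unadjusted Langevin dynamics carries a small discretization bias at a finite step size.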

Related readings and updates.

Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization

Learning with identical train and test distributions has been extensively investigated both practically and theoretically. Much remains to be understood, however, in statistical learning under distribution shifts. This paper focuses on a distribution shift setting where train and test distributions can be related by classes of (data) transformation maps. We initiate a theoretical study for this framework, investigating learning scenarios where…

Controllable Music Production with Diffusion Models and Guidance Gradients

This paper was accepted at the Diffusion Models workshop at NeurIPS 2023. We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1 kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer…