Learning Unmasking Policies for Diffusion Language Models

Authors: Metod Jazbec*†, Theo X. Olausson*‡, Louis Béthune, Pierre Ablin, Michael Kirchhof, João Monteiro, Victor Turrisi, Jason Ramapuram, Marco Cuturi

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.
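To make the confidence-thresholding baseline mentioned in the abstract concrete, here is a minimal sketch of one unmasking step. It is not the authors' implementation: the function name `confidence_threshold_unmask`, the default threshold of 0.9, and the fallback of unmasking the single most confident position when nothing clears the threshold are all assumptions made for illustration. The `threshold` argument is the kind of manually tuned knob that a learned policy would replace.

```python
from typing import List, Sequence


def confidence_threshold_unmask(
    confidences: Sequence[float],
    is_masked: Sequence[bool],
    threshold: float = 0.9,  # hypothetical default; tuned by hand in practice
) -> List[int]:
    """Pick which masked positions to unmask at one diffusion step.

    confidences[i] is assumed to be the model's top-token probability at
    position i; is_masked[i] says whether position i is still masked.
    """
    masked_positions = [i for i, masked in enumerate(is_masked) if masked]
    if not masked_positions:
        return []  # nothing left to unmask

    # Unmask every masked position whose confidence clears the threshold.
    chosen = [i for i in masked_positions if confidences[i] >= threshold]

    if not chosen:
        # Assumed fallback: always unmask at least one token (the most
        # confident one) so that sampling makes progress at every step.
        chosen = [max(masked_positions, key=lambda i: confidences[i])]
    return chosen


# Example: position 2 is already unmasked; only position 0 clears 0.9.
step = confidence_threshold_unmask(
    confidences=[0.95, 0.50, 0.99, 0.20],
    is_masked=[True, True, False, True],
)
```

In the MDP view described in the abstract, a function like this plays the role of a fixed policy: it maps the dLLM's token confidences (the observation) to an unmasking decision (the action), and the dLLM then produces the next set of confidences.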

* Equal Contributors
† University of Amsterdam
‡ Massachusetts Institute of Technology
** Work done while at Apple
