On Inductive Biases That Enable Generalization of Diffusion Transformers
Authors: Jie An†**, De Wang, Pengsheng Guo, Jiebo Luo†, Alexander Schwing
Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that leads to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modulate the generalization of a DiT by restricting its attention windows: injecting local attention windows into a DiT improves generalization. Furthermore, we empirically find that both the placement and the effective size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT improves both generalization and generation quality when less training data is available.
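The abstract's key intervention, restricting a DiT's attention to local windows, can be illustrated with a minimal sketch. The PyTorch snippet below masks attention scores so each token attends only to neighbors within `window` positions; it is a 1-D illustration under assumed shapes, not the paper's implementation, which operates on 2-D image-token grids and also studies where in the network such windows should be placed.

```python
import torch

def local_window_mask(num_tokens: int, window: int) -> torch.Tensor:
    # True where token i may attend to token j, i.e. |i - j| <= window.
    idx = torch.arange(num_tokens)
    return (idx[None, :] - idx[:, None]).abs() <= window

def windowed_attention(q, k, v, window: int):
    # q, k, v: (batch, heads, tokens, dim). Standard scaled dot-product
    # attention, with out-of-window positions set to -inf before softmax.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = local_window_mask(q.shape[-2], window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

Shrinking `window` strengthens the locality bias; the paper's finding is that tuning this effective attention size, and choosing which layers receive it, is what governs generalization.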
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
December 11, 2025 · research area: Computer Vision
In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures, including PixArt-style and MMDiT variants, and compare them with a standard DiT variant that directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of standard…
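The "standard DiT" conditioning described above amounts to joint self-attention over a single sequence of concatenated text and noisy-latent tokens. The sketch below is a hypothetical single block under assumed shapes; the class name and dimensions are illustrative, not DiT-Air's actual architecture.

```python
import torch
import torch.nn as nn

class ConcatDiTBlockSketch(nn.Module):
    """Hypothetical block: jointly attend over concatenated text tokens and
    noisy image-latent tokens, then return only the image stream."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, noise_tokens):
        # text_tokens: (B, T_text, D); noise_tokens: (B, T_img, D)
        x = torch.cat([text_tokens, noise_tokens], dim=1)  # one joint sequence
        out, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = x + out
        # Keep only the image tokens for the denoising prediction.
        return x[:, text_tokens.shape[1]:]
```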
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
April 15, 2025 · research areas: Computer Vision, Speech and Natural Language Processing · conference: ICLR
Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generation, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT…
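Expert-choice routing inverts the usual token-choice MoE: each expert selects its top-`capacity` tokens by router score, so compute can concentrate on harder tokens. A minimal sketch, assuming a linear router and toy linear "experts"; all names, shapes, and the capacity value are illustrative, not EC-DIT's actual implementation.

```python
import torch
import torch.nn as nn

def expert_choice_route(x, router, experts, capacity):
    # x: (num_tokens, dim). Column-wise top-k over token-expert affinities:
    # each expert picks its own top-`capacity` tokens, rather than each
    # token picking experts as in token-choice routing.
    scores = router(x).softmax(dim=-1)        # (num_tokens, num_experts)
    gate, idx = scores.topk(capacity, dim=0)  # (capacity, num_experts)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        chosen = idx[:, e]                    # tokens routed to expert e
        out.index_add_(0, chosen, gate[:, e].unsqueeze(-1) * expert(x[chosen]))
    return out

# Toy usage: 16 tokens of width 8, 4 experts, capacity 4 tokens per expert.
x = torch.randn(16, 8)
router = nn.Linear(8, 4, bias=False)
experts = nn.ModuleList(nn.Linear(8, 8) for _ in range(4))
y = expert_choice_route(x, router, experts, capacity=4)
```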