
Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational cost grows with depth, width, and the number of input tokens, and they require patch-based approximations to operate even on latent input sequences. In this paper, we address these issues with a novel approach to improving the efficiency and scalability of image generation models: we adopt state space models (SSMs) as the core component, deviating from the widely adopted transformer-based and U-Net architectures. We introduce a class of SSM-based models that significantly reduces forward-pass complexity while maintaining comparable performance and operating on exact input sequences without patch-based approximations. Through extensive experiments and rigorous evaluation, we demonstrate that our approach reduces the Gflops used by the model without sacrificing the quality of the generated images. Our findings suggest that state space models can be an effective alternative to the attention mechanisms in transformer-based architectures, offering a more efficient solution for large-scale image generation tasks.
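To illustrate why an SSM scales better than attention in sequence length, here is a minimal sketch of a diagonal linear state space layer. This is not the paper's architecture; the recurrence h_t = A·h_{t-1} + B·x_t, y_t = C·h_t and all parameter shapes are illustrative assumptions. The key property is that the scan costs O(L) in the sequence length L, versus the O(L²) pairwise interactions of self-attention.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a diagonal SSM over a length-L sequence of d-dim inputs.

    x: (L, d) input sequence; A, B, C: (n, d) diagonal parameters
    (an independent state of size n per channel). Returns (L, d).
    """
    L, d = x.shape
    n = A.shape[0]
    h = np.zeros((n, d))            # hidden state, one n-vector per channel
    y = np.empty_like(x)
    for t in range(L):
        h = A * h + B * x[t]        # elementwise recurrence: O(n*d) per step
        y[t] = (C * h).sum(axis=0)  # read each channel's state back out
    return y

rng = np.random.default_rng(0)
L, d, n = 64, 8, 16
x = rng.standard_normal((L, d))
A = np.full((n, d), 0.9)            # stable decay (|A| < 1)
B = rng.standard_normal((n, d)) * 0.1
C = rng.standard_normal((n, d)) * 0.1
y = ssm_scan(x, A, B, C)
print(y.shape)  # (64, 8)
```

Because each step touches only the fixed-size state, the same layer handles a sequence of any length at constant memory, which is what lets an SSM backbone consume an exact (unpatchified) latent sequence.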

Related readings and updates.

HumMUSS: Human Motion Understanding using State Space Models

Understanding human motion from video is crucial for applications such as pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on Transformer-based architectures, these approaches have limitations in practical scenarios. They are notably slower when processing a continuous stream of video frames in real time and do not adapt to new frame rates. Given these challenges, we propose an attention…

Deploying Attention-Based Vision Transformers to Apple Neural Engine

Motivated by the effective implementation of transformer architectures in natural language processing, machine learning researchers introduced the concept of a vision transformer (ViT) in 2021. This innovative approach serves as an alternative to convolutional neural networks (CNNs) for computer vision applications, as detailed in the paper An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.
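The "16x16 words" in the paper's title refers to the patchify step: the image is cut into fixed-size patches, each flattened into a token vector. A minimal sketch of that step (shapes follow the ViT paper's standard 224x224 input; the `patchify` helper is hypothetical):

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into (H/p * W/p, p*p*C) flat patch tokens."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by patch size"
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # group the two patch-grid axes
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

Each of the 196 tokens is then linearly projected and fed to a standard transformer encoder, exactly as word embeddings would be in NLP.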
