Multimodal Autoregressive Pre-Training of Large Vision Encoders
Authors: Enrico Fini*, Mustafa Shukor*, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Louis Béthune, Zhe Gan, Victor Turrisi, Alexander Toshev, Marcin Eichner, Yinfei Yang, Moin Nabi, Josh Susskind, Alaaeldin El-Nouby*
A dominant paradigm in large multimodal models is to pair a large language decoder with a vision encoder. While it is well known how to pre-train and tune language decoders for multimodal tasks, it is less clear how the vision encoder should be pre-trained. A de facto standard is to pre-train the vision encoder with a discriminative objective, such as a contrastive loss. This causes a mismatch between pre-training and the generative autoregressive downstream task. At the same time, following their success in the language domain, autoregressive image models have been shown to pre-train strong and scalable vision encoders. This paper presents AIMv2, a family of large, strong vision encoders pre-trained with a multimodal autoregressive objective. Thanks to a multimodal decoder that generates both raw image patches and text tokens, our models excel not only at multimodal tasks but also at visual recognition benchmarks such as localization, grounding, and classification. In addition, we show that AIMv2 models are efficient to train, outperforming the current state of the art with significantly fewer samples seen during pre-training.
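To make the objective above concrete, the following is a minimal, illustrative sketch of a multimodal autoregressive loss in PyTorch: a causal decoder over the concatenated patch and text sequence regresses the next raw patch and predicts the next text token. The module names, dimensions, ordering, and equal loss weighting are assumptions for illustration, not the released implementation.

```python
# Minimal sketch (not the released AIMv2 code) of a multimodal autoregressive
# objective: a causal decoder over [image patches ; text tokens] regresses the
# next raw patch (MSE) and predicts the next text token (cross-entropy).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalAutoregressiveSketch(nn.Module):
    def __init__(self, patch_dim=768, d_model=512, vocab_size=32000, n_layers=4, n_heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)       # stands in for the vision encoder
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # used with a causal mask
        self.patch_head = nn.Linear(d_model, patch_dim)        # regresses raw patch values
        self.text_head = nn.Linear(d_model, vocab_size)        # predicts next text token

    def forward(self, patches, text_ids):
        # patches: (B, N, patch_dim) flattened raw patches; text_ids: (B, T)
        x = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        seq_len = x.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        h = self.decoder(x, mask=causal)

        n = patches.size(1)
        # next-patch regression: position i predicts patch i + 1
        patch_loss = F.mse_loss(self.patch_head(h[:, : n - 1]), patches[:, 1:])
        # next-token prediction: the last patch position and the text positions predict text tokens
        text_logits = self.text_head(h[:, n - 1 : -1])
        text_loss = F.cross_entropy(
            text_logits.reshape(-1, text_logits.size(-1)), text_ids.reshape(-1)
        )
        return patch_loss + text_loss  # equal weighting is an assumption


# toy usage with random inputs
model = MultimodalAutoregressiveSketch()
loss = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 8)))
loss.backward()
```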
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful…
This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the…
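The feature-quality claim in (1) is typically measured by probing frozen pre-trained features on downstream tasks. Below is a minimal, illustrative linear-probe sketch in PyTorch; the encoder, data loader, and hyperparameters are placeholders, not the papers' actual evaluation protocol.

```python
# Minimal sketch of a linear probe on frozen visual features: train only a
# linear classifier on top of a frozen, pre-trained encoder. The encoder,
# loader, and hyperparameters are placeholders for illustration.
import torch
import torch.nn as nn


def linear_probe(encoder, loader, feat_dim, num_classes, epochs=10, lr=1e-3, device="cpu"):
    encoder.eval().to(device)                      # the encoder stays frozen
    probe = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():                  # no gradients through the encoder
                feats = encoder(images)
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```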