MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Authors: Pavan Kumar Anasosalu Vasu*, Hadi Pouransari*, Fartash Faghri*, Raviteja Vemulapalli, Oncel Tuzel
*Equal Contributors
Contrastive pretraining of image-text foundation models, such as CLIP, has demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models use large transformer-based encoders with significant memory and latency overhead, which poses challenges for deployment on mobile devices. In this work, we introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance, along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3× faster while more accurate compared to the previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on a ViT-B/16 image backbone and achieving a +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover, we show that the proposed approach achieves 10×-1000× improved learning efficiency when compared with non-reinforced CLIP training.
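To make the idea concrete, below is a minimal, hypothetical PyTorch-style sketch of one reinforced training step, not the paper's implementation: the student is trained with the standard CLIP contrastive loss on real captions plus distillation losses against teacher-ensemble embeddings and synthetic captions that were precomputed and stored in the reinforced dataset, so no teacher forward pass is needed at train time. Names such as `reinforced_training_step`, `teacher_img_emb`, and `syn_caption_tokens` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, logit_scale):
    """Standard symmetric InfoNCE loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def distillation_loss(student_img, student_txt, teacher_img, teacher_txt, logit_scale):
    """KL divergence between teacher and student image-text similarity matrices."""
    s = logit_scale * F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).t()
    t = logit_scale * F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).t()
    return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1), reduction="batchmean")


def reinforced_training_step(student, batch, lambda_distill=1.0):
    """One step: contrastive loss on real captions plus distillation against
    teacher-ensemble embeddings stored in the reinforced dataset (assumed
    batch fields; no teacher model is run here)."""
    img_emb = student.encode_image(batch["image"])
    txt_emb = student.encode_text(batch["caption_tokens"])        # ground-truth caption
    syn_emb = student.encode_text(batch["syn_caption_tokens"])    # captioner-generated caption
    scale = student.logit_scale.exp()

    loss = clip_contrastive_loss(img_emb, txt_emb, scale)
    loss += lambda_distill * distillation_loss(
        img_emb, txt_emb, batch["teacher_img_emb"], batch["teacher_txt_emb"], scale)
    loss += lambda_distill * distillation_loss(
        img_emb, syn_emb, batch["teacher_img_emb"], batch["teacher_syn_emb"], scale)
    return loss
```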
MobileCLIP2: Improving Multi-Modal Reinforced Training
September 22, 2025. Research areas: Computer Vision, Methods and Algorithms. Transactions on Machine Learning Research (TMLR).
This paper received Featured Certification from Transactions on Machine Learning Research (TMLR) 2025.
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models with 3-15 ms latency and 50-150 M parameters and state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency, lightweight architectures and a novel multi-modal…
CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
May 27, 2025. Research area: Computer Vision.
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various…
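As a rough illustration of the sparse-upcycling idea (not CLIP-UP's actual recipe), the sketch below initializes every expert of an MoE feed-forward layer as a copy of a pre-trained dense FFN and adds a freshly initialized top-k router, so the upcycled model starts from the dense model's behavior. `UpcycledMoE` and its arguments are hypothetical names.

```python
import copy
import torch
import torch.nn as nn


class UpcycledMoE(nn.Module):
    """MoE FFN initialized by "upcycling" a pre-trained dense FFN (sketch)."""

    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        # Each expert starts as an exact copy of the pre-trained dense FFN.
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # router is trained from scratch
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        gates = self.router(x).softmax(dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Dispatch each token to its top-k experts and mix their outputs.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

In a transformer, such a layer would typically replace the dense FFN inside each (or every other) block, after which only the sparse model is fine-tuned; that replacement schedule is an assumption here, not a detail taken from the CLIP-UP abstract.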