*Equal Contributors

Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods utilizing large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for noisy caption rewriting. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Employing this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to a +25.2% gain on COCO and Flickr30k retrieval tasks under the 12M setting. In terms of data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed in vanilla CLIP and 11% of that in ALIGN. We also note that the VeCap data is complementary to other well-curated datasets suited for zero-shot classification tasks. When combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on zero-shot ImageNet classification for an H/14 model.
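To make the mixed training scheme concrete, below is a minimal sketch of how one might sample captions per image during training, assuming each sample carries both its original AltText and an LLM-rewritten VeCap caption. The field names (alttext, vecap_caption) and the sampling probability vecap_prob are illustrative assumptions, not the paper's exact implementation.

```python
import random
from typing import Dict, List

def sample_caption(example: Dict[str, str], vecap_prob: float = 0.5) -> str:
    """Randomly pick either the original AltText or the VeCap caption for one
    image, so both caption sources contribute to training diversity.
    NOTE: vecap_prob = 0.5 is an assumed value for illustration only."""
    if random.random() < vecap_prob:
        return example["vecap_caption"]
    return example["alttext"]

# Hypothetical web-crawled samples: noisy AltText paired with a rewritten caption.
batch: List[Dict[str, str]] = [
    {"alttext": "IMG_1234.jpg",
     "vecap_caption": "A brown dog running along a sandy beach at sunset."},
    {"alttext": "best deals on shoes",
     "vecap_caption": "A pair of white sneakers displayed on a wooden shelf."},
]

# One caption is drawn per image for the current training step.
captions = [sample_caption(ex) for ex in batch]
print(captions)
```

The intent of such per-sample mixing is that the model still sees the raw AltText distribution (which can carry useful signals like names and brands) while the visually enriched captions supply denser, better-aligned descriptions.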

Related readings and updates.

Revisit Large-Scale Image–Caption Data in Pre-training Multimodal Foundation Models

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. Notably, the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still unclear. Additionally, different multimodal foundation models may have distinct preferences for specific caption formats, while efforts to study the optimal captions for each foundation…
See paper details

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

*Equal Contributors

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models…
See paper details