Paper, March 2024

VeCLIP: Improving CLIP Training via Visual-enriched Captions

Authors: Jeff Lai*, Haotian Zhang*, Bowen Zhang, Wentao Wu, Felix Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao

*Equal Contributors

Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts make precise image-text alignment difficult. Existing methods that use large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for noisy caption rewriting. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside the newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Employing this cost-effective pipeline, we scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to a +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed in the vanilla CLIP and 11% in ALIGN. We also note that the VeCap data is complementary to other well-curated datasets that are good for zero-shot classification tasks. When combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on ImageNet zero-shot for an H/14 model.
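The mixed training scheme is only named in the abstract, not specified. One plausible reading is that each training sample keeps both its original AltText and its VeCap, and one of the two is drawn at random whenever the sample is used. The sketch below illustrates that reading only; the `Sample` fields, the `pick_caption` and `mixed_batch` helpers, and the `p_vecap` mixing probability are hypothetical names and values, not taken from the paper or its code release.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record layout: one image with both caption variants.
    image_path: str
    alt_text: str  # original web-crawled AltText
    vecap: str     # visual-enriched caption produced by the rewriting pipeline


def pick_caption(sample: Sample, rng: random.Random, p_vecap: float = 0.5) -> str:
    """Return the VeCap with probability p_vecap, otherwise the AltText.

    p_vecap = 0.5 is an assumed default, not a value reported in the abstract.
    """
    return sample.vecap if rng.random() < p_vecap else sample.alt_text


def mixed_batch(samples, rng: random.Random, p_vecap: float = 0.5):
    """Build (image_path, caption) pairs, mixing caption sources per sample."""
    return [(s.image_path, pick_caption(s, rng, p_vecap)) for s in samples]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [
        Sample("img_0.jpg", "IMG_0042.jpg", "a brown dog running on green grass"),
        Sample("img_1.jpg", "best deals 2023", "a red bicycle leaning against a wall"),
    ]
    for path, caption in mixed_batch(data, rng):
        print(path, "->", caption)
```

Drawing the caption per sample at load time means the same image can be paired with either text across epochs, which is one way to retain the diversity of AltTexts while still exposing the model to the enriched captions.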
