In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract entities from the text, and uses a CLIP model to select the correct entities as labels of the paired image. The approach is simple and can readily scale to billions of image-text pairs mined from the web, through which we have created a dataset with 2 million distinct entities. We study new training approaches on the collected dataset with its large-scale entity labels, including supervised pre-training, contrastive pre-training, and multi-task learning. Experiments show that supervised pre-training with large-scale entity labels is very effective for image retrieval tasks, and that multi-task training can further improve the performance. The final model, named MOFI, achieves 83.59% mAP on the challenging GPR1200 dataset, compared to the previous state of the art of 67.33% from OpenAI's CLIP model. Further experiments on zero-shot and linear-probe image classification tasks also show that our MOFI model outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the new dataset for learning general-purpose image representations.
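As a concrete illustration of this labeling pipeline, the sketch below extracts candidate entities from a noisy alt-text with an off-the-shelf NER model and keeps the candidates that CLIP scores as similar to the image. The specific checkpoints, the `label_image` helper, and the similarity threshold are illustrative assumptions, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor

# Hypothetical checkpoints for illustration; the paper does not specify
# which NER or CLIP models were used.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def label_image(image: Image.Image, alt_text: str, threshold: float = 0.25):
    """Return entity labels for `image`, mined from its noisy `alt_text`."""
    # Step 1: extract candidate entities from the paired text with NER.
    candidates = [e["word"] for e in ner(alt_text)]
    if not candidates:
        return []
    # Step 2: score each candidate against the image with CLIP and keep
    # entities whose image-text cosine similarity clears the threshold.
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=candidates, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(**img_inputs)
        txt_emb = clip.get_text_features(**txt_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (txt_emb @ img_emb.T).squeeze(-1)  # one score per candidate
    return [c for c, s in zip(candidates, sims.tolist()) if s >= threshold]

# Example: label_image(Image.open("photo.jpg"), "The Golden Gate Bridge at sunset")
```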
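For the multi-task setup, a minimal sketch of how a supervised entity-classification objective can be combined with a CLIP-style contrastive objective is shown below. The mixing weight `alpha`, the temperature, and the plain full-softmax classification loss are assumptions for illustration; with millions of entity classes, a real system would likely need a sampled or approximate softmax.

```python
import torch
import torch.nn.functional as F

def multitask_loss(image_emb, text_emb, entity_logits, entity_labels,
                   temperature=0.07, alpha=0.5):
    """Mix a supervised entity-classification loss with a CLIP-style
    contrastive loss. `alpha` and `temperature` are illustrative
    hyperparameters, not values reported in the paper."""
    # Supervised branch: predict the entity label of each image.
    cls_loss = F.cross_entropy(entity_logits, entity_labels)

    # Contrastive branch: align each image with its paired text
    # (symmetric InfoNCE over image->text and text->image directions).
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    con_loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    return alpha * cls_loss + (1 - alpha) * con_loss
```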

Related readings and updates.

Entity Disambiguation via Fusion Entity Decoding

Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly…

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

*Equal Contributors

Contrastive pretraining of image-text foundation models, such as CLIP, has demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models use large transformer-based encoders with significant memory and latency overhead, which poses challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models…