MOFI: Learning Image Representation from Noisy Entity Annotated Images
Authors: Wentao Wu, Aleksei Timofeev, Chen Chen, Bowen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathan Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang
In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract entities from the text, and uses a CLIP model to select the right entities as labels for the paired image. The approach is simple and can be readily scaled up to billions of image-text pairs mined from the web, through which we have created a dataset with 2 million distinct entities. We study new training approaches on the collected dataset with large-scale entity labels, including supervised pre-training, contrastive pre-training, and multi-task learning. Experiments show that supervised pre-training with large-scale entity labels is very effective for image retrieval tasks, and that multi-task training further improves performance. The final model, named MOFI, achieves 83.59% mAP on the challenging GPR1200 dataset, compared to the previous state-of-the-art 67.33% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification tasks also show that our MOFI model outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the new dataset for learning general-purpose image representations.
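The entity-selection step described above can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so the cosine-similarity scoring, the `min_sim` threshold, and the function names below are all assumptions made for illustration. The short vectors are toy stand-ins for CLIP image and text embeddings, and the candidate entities are assumed to have already been extracted from the caption by a named entity recognition model.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_entity(
    image_emb: Sequence[float],
    candidate_embs: Dict[str, Sequence[float]],
    min_sim: float = 0.2,  # hypothetical cutoff, not taken from the paper
) -> Optional[str]:
    """Return the candidate entity most similar to the image, or None.

    Assumption: a candidate is kept only if its similarity to the image
    exceeds `min_sim`; otherwise the image receives no entity label.
    """
    best_name, best_sim = None, min_sim
    for name, emb in candidate_embs.items():
        sim = cosine(image_emb, emb)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name


# Toy embeddings standing in for CLIP outputs (illustrative values only).
image_emb = [0.9, 0.1, 0.0]
candidates = {
    "Golden Gate Bridge": [1.0, 0.0, 0.0],
    "San Francisco": [0.6, 0.8, 0.0],
    "Monday": [0.0, 0.0, 1.0],
}

selected = select_entity(image_emb, candidates)
print(selected)
```

In a real pipeline the toy vectors would be replaced by embeddings from a CLIP image encoder and text encoder, and the threshold would need to be tuned on held-out data.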