This paper has been accepted to the UniReps Workshop at NeurIPS 2023.

Contrastive language-image pretraining (CLIP) has become the standard approach for training vision-language models. Although CLIP visual features serve well as global image representations, they fall short on tasks that require object localization, pixel-level image understanding, or 3D perception. Multi-task training is a popular remedy, but collecting a large-scale annotated multi-task dataset is costly. Training on separate task-specific datasets is also challenging from an optimization and training perspective, since it requires aligning gradients and knowledge coming from different input distributions and tasks. To overcome these shortcomings, we study pseudo-labeling with task-specific experts to improve CLIP features for more challenging downstream tasks. In our approach, we leverage multiple existing open-source pretrained models as experts and use them to pseudo-label an uncurated, web-scale image-caption dataset. We then train CLIP with the contrastive loss together with task-specific losses on the pseudo-labels, applied through lightweight heads attached to the vision backbone.
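To make the training setup concrete, below is a minimal sketch of how contrastive CLIP training can be combined with task-specific losses on expert pseudo-labels through lightweight heads on the vision backbone. This is not the authors' implementation: the module names, feature shapes, and the choice of tasks (semantic segmentation and depth) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): CLIP-style training where lightweight
# task heads consume vision-backbone features and are supervised by pseudo-labels
# produced offline by expert models. Names, dimensions, and the segmentation/depth
# tasks are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPWithTaskHeads(nn.Module):
    def __init__(self, vision_encoder, text_encoder, embed_dim=512, num_seg_classes=133):
        super().__init__()
        self.vision_encoder = vision_encoder   # assumed to return (global_feat, patch_feats)
        self.text_encoder = text_encoder       # assumed to return text embeddings
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), as in CLIP
        # Lightweight heads attached to the vision backbone's patch features.
        self.seg_head = nn.Conv2d(embed_dim, num_seg_classes, kernel_size=1)
        self.depth_head = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, images, texts):
        global_feat, patch_feats = self.vision_encoder(images)  # patch_feats: (B, C, H, W)
        text_feat = self.text_encoder(texts)
        return global_feat, text_feat, self.seg_head(patch_feats), self.depth_head(patch_feats)


def training_step(model, images, texts, seg_pseudo, depth_pseudo, w_seg=1.0, w_depth=1.0):
    img_f, txt_f, seg_logits, depth_pred = model(images, texts)

    # Standard CLIP contrastive (InfoNCE) loss between image and text embeddings.
    img_f = F.normalize(img_f, dim=-1)
    txt_f = F.normalize(txt_f, dim=-1)
    logits = model.logit_scale.exp() * img_f @ txt_f.t()
    targets = torch.arange(images.size(0), device=images.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Task-specific losses against expert pseudo-labels.
    loss_seg = F.cross_entropy(seg_logits, seg_pseudo)           # seg_pseudo: (B, H, W) class ids
    loss_depth = F.l1_loss(depth_pred.squeeze(1), depth_pseudo)  # depth_pseudo: (B, H, W) depth maps

    return loss_clip + w_seg * loss_seg + w_depth * loss_depth
```

The loss weights (w_seg, w_depth) and the head architectures are hyperparameters; the sketch only conveys that the pseudo-label supervision is added on top of, rather than in place of, the contrastive objective.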

Related readings and updates.

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source…

Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data

Self-training has been shown to help address data scarcity in many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds it to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed setup: joint transcription and translation of speech, which suffers from a lack of sufficient parallel data. We…
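For readers unfamiliar with self-training, the loop described above can be summarized by a short, generic sketch. The callables train_fn and label_fn and the confidence filter below are hypothetical placeholders, not the paper's actual pipeline.

```python
# Generic pseudo-labeling (self-training) loop: label unsupervised data with the
# current model, keep confident predictions, and add them to the training pool.
# train_fn / label_fn are user-supplied placeholders (hypothetical, not the paper's code).
from typing import Any, Callable, Iterable, List, Tuple


def self_train(
    labeled: List[Tuple[Any, Any]],
    unlabeled: Iterable[Any],
    train_fn: Callable[[List[Tuple[Any, Any]]], Any],    # trains and returns a model
    label_fn: Callable[[Any, Any], Tuple[Any, float]],   # (model, x) -> (label, confidence)
    rounds: int = 3,
    min_conf: float = 0.9,
) -> Any:
    model = train_fn(labeled)                            # supervised seed model
    for _ in range(rounds):
        # Label the unsupervised pool and keep only confident pseudo-labels.
        pseudo = []
        for x in unlabeled:
            y, conf = label_fn(model, x)
            if conf >= min_conf:
                pseudo.append((x, y))
        model = train_fn(labeled + pseudo)               # retrain on the enlarged pool
    return model
```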