
This paper has been accepted at the UniReps Workshop at NeurIPS 2023.

Contrastive language-image pretraining (CLIP) has become the standard approach for training vision-language models. Although CLIP visual features serve well as global image representations, they fall short on tasks that require object localization, pixel-level image understanding, or 3D perception. Multi-task training is a popular remedy, but collecting a large-scale annotated multi-task dataset incurs significant cost. Training on separate task-specific datasets is also challenging from an optimization and training perspective, since gradients and knowledge coming from different input distributions and tasks must be aligned. To overcome these shortcomings, we study pseudo-labeling with task-specific experts to improve CLIP features for more challenging downstream tasks. In our approach, we leverage multiple existing open-source pretrained models to pseudo-label an uncurated, web-scale image-caption dataset with the experts. We then train CLIP with a contrastive loss together with task-specific losses on the pseudo-labels, computed through lightweight heads attached to the vision backbone.
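A minimal sketch of this combined objective, assuming a PyTorch-style setup with hypothetical segmentation and depth heads, might look as follows; the module names, head shapes, and loss weights are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Sketch (not the paper's code) of training a CLIP-style backbone with a
# contrastive loss plus task losses on expert pseudo-labels via lightweight heads.
# Module names, feature shapes, and loss weights are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over image/text embeddings (the standard CLIP objective)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class MultiTaskCLIP(nn.Module):
    def __init__(self, vision_backbone, text_encoder, feat_dim=768, num_seg_classes=133):
        super().__init__()
        self.vision_backbone = vision_backbone   # dense image features [B, feat_dim, H, W]
        self.text_encoder = text_encoder         # caption embeddings [B, feat_dim]
        # Lightweight heads attached to the vision backbone (hypothetical tasks/shapes).
        self.seg_head = nn.Conv2d(feat_dim, num_seg_classes, kernel_size=1)  # segmentation pseudo-labels
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)              # depth pseudo-labels

    def forward(self, images, texts):
        feats = self.vision_backbone(images)     # [B, feat_dim, H, W]
        image_emb = feats.mean(dim=(2, 3))       # pooled global embedding for the contrastive loss
        text_emb = self.text_encoder(texts)
        return image_emb, text_emb, self.seg_head(feats), self.depth_head(feats)


def training_step(model, batch, seg_weight=1.0, depth_weight=1.0):
    """One step: contrastive loss on image-caption pairs + task losses on pseudo-labels."""
    image_emb, text_emb, seg_logits, depth_pred = model(batch["images"], batch["texts"])
    loss = clip_contrastive_loss(image_emb, text_emb)
    loss = loss + seg_weight * F.cross_entropy(seg_logits, batch["pseudo_seg"])
    loss = loss + depth_weight * F.l1_loss(depth_pred, batch["pseudo_depth"])
    return loss
```

Because all task supervision comes from pseudo-labels generated on the same web-scale image-caption corpus, every batch can carry both the caption and the expert targets, avoiding the gradient-alignment issues of mixing separate task-specific datasets.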

Related readings and updates.

Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data

Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds that to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an absence of sufficient parallel data resources. We…

Continuous Soft Pseudo-Labeling in ASR

This paper was accepted at the workshop "I Can’t Believe It’s Not Better: Understanding Deep Learning Through Empirical Falsification." Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end…