View publication

Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data?", and propose a simple task-oriented knowledge transfer approach as a highly effective solution to this problem. Our experimental results on five target tasks show that the proposed approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, supervised ImageNet pretraining, and self-supervised DINO pretraining by up to 11.6%, 22.1%, 13.7%, and 29.8%, respectively. Furthermore, the proposed approach also demonstrates up to 9x, 4x and 15x reduction in pretraining compute cost when compared to task-agnostic VFM distillation, ImageNet pretraining and DINO pretraining, respectively, while outperforming them. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and introduce a retrieval-augmented knowledge transfer strategy that uses web-scale image retrieval to curate effective transfer sets.

Related readings and updates.

Memory-Retaining Finetuning via Distillation

This paper was accepted at the Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML) Workshop at NeurIPS 2024. Large language models (LLMs) pretrained on large corpora of internet text possess much of the world's knowledge. Following pretraining, one often needs to conduct continued pretraining on certain capabilities, such as math and coding, or "posttraining" (a.k.a., alignment) techniques to make the models follow users'…
See paper details

Merge Vision Foundation Models via Multi-Task Distillation

As the repository of publicly available pre-trained vision foundation models (VFMs) — such as CLIP, DINOv2, and SAM — grows, users face challenges in storage, memory, and computational efficiency when deploying multiple models concurrently. To address these concerns, we introduce a unique approach that merges the capabilities of multiple VFMs into a single efficient multi-task model. Our method, termed "joint distillation," seamlessly integrates…
See paper details