Merge Vision Foundation Models via Multi-Task Distillation
Authors: Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari
As the pool of publicly available pre-trained vision foundation models (VFMs) such as CLIP, DINOv2, and SAM grows, deploying several of them concurrently strains storage, memory, and compute. To address this, we introduce an approach that merges the capabilities of multiple VFMs into a single, efficient multi-task model. Our method, termed "joint distillation," integrates teacher-student learning with self-distillation; it operates on unlabeled image data alone and requires substantially less compute than traditional multi-task learning. As a practical demonstration, we merge CLIP and SAM and show that the resulting model, SAM-CLIP, not only retains the foundational strengths of both parent models but also exhibits synergistic new capabilities, such as text-prompted zero-shot segmentation. Given the increasing availability of VFMs, our methodology promises significant value in streamlining model deployment and operations.
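To make the recipe concrete, below is a minimal PyTorch sketch of what one multi-task distillation step might look like: a shared student backbone with one lightweight head per teacher is trained to match frozen CLIP and SAM image features on unlabeled images. Everything here is an illustrative assumption rather than the paper's actual implementation: the names (MergedStudent, distill_step), the toy backbone and dummy teachers, the embedding dimensions, and the cosine-distance loss; the self-distillation component mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical embedding dimensions; real CLIP/SAM encoders vary by variant.
CLIP_DIM, SAM_DIM, STUDENT_DIM = 512, 256, 768

class MergedStudent(nn.Module):
    """Shared backbone with one lightweight head per teacher."""
    def __init__(self):
        super().__init__()
        # Toy stand-in backbone; a real student would be a full image
        # encoder (e.g., a ViT initialized from one of the teachers).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, STUDENT_DIM),
        )
        self.clip_head = nn.Linear(STUDENT_DIM, CLIP_DIM)
        self.sam_head = nn.Linear(STUDENT_DIM, SAM_DIM)

    def forward(self, x):
        z = self.backbone(x)
        return self.clip_head(z), self.sam_head(z)

def distill_step(student, clip_teacher, sam_teacher, images, optimizer,
                 w_clip=1.0, w_sam=1.0):
    """One multi-task distillation step on a batch of unlabeled images."""
    with torch.no_grad():  # teachers stay frozen throughout
        t_clip = clip_teacher(images)
        t_sam = sam_teacher(images)
    s_clip, s_sam = student(images)
    # Match each head to its teacher; cosine distance is one plausible choice.
    loss = (w_clip * (1.0 - F.cosine_similarity(s_clip, t_clip).mean())
            + w_sam * (1.0 - F.cosine_similarity(s_sam, t_sam).mean()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy frozen "teachers" standing in for the real CLIP/SAM image encoders.
clip_teacher = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(3, CLIP_DIM)).eval()
sam_teacher = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                            nn.Linear(3, SAM_DIM)).eval()

student = MergedStudent()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
images = torch.randn(8, 3, 224, 224)  # stand-in for an unlabeled batch
print(distill_step(student, clip_teacher, sam_teacher, images, optimizer))
```

The key structural point the sketch illustrates is that only unlabeled images and frozen teachers are needed: each distillation target is produced by a teacher forward pass, so no task labels or multi-task annotation effort enter the loop.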