This paper was accepted at the Workshop on Distribution Shifts at NeurIPS 2023.

Large-scale model training has become increasingly expensive. In an ever-changing world where petabytes of new data are generated every day, we want to be able to train models continually. In this paper, we create a benchmark for continual large-scale training of CLIP models where the data distribution varies only with time. In contrast to the traditional continual learning literature, there is no hard separation into tasks: we assume an infinite stream of data in a canonical format that exhibits natural distribution shifts as time passes. We create multiple such benchmarks for CLIP training based on standard benchmarks such as DataComp and YFCC15M. We propose various evaluations and demonstrate that models trained on data up to a certain year lose performance on rapidly changing categories of data. We propose simple learning-rate schedules and training with replay buffers to reduce the gap in forward transfer. We demonstrate that a simple baseline that continues training from the last checkpoint while replaying old data can be competitive with an Oracle that receives all data seen so far in one pass and trains with a large compute budget.
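The "continue from the last checkpoint and replay old data" baseline described above can be summarized as a short training-loop sketch. The snippet below is a minimal, hypothetical illustration in PyTorch, not the authors' code: the model is assumed to return the CLIP contrastive loss directly, and names such as `continual_step`, `replay_pool`, and `replay_fraction` are placeholders for whatever data pipeline and scheduler a real setup would use.

```python
# Minimal sketch (assumptions labeled below), not the paper's implementation.
import random
import torch
from torch.utils.data import DataLoader, ConcatDataset, Subset

def continual_step(model, optimizer, scheduler, new_data, replay_pool,
                   replay_fraction=0.5, batch_size=256, device="cuda"):
    """One time step of the replay baseline: resume from the latest checkpoint
    (the `model` passed in), mix fresh data with a random sample of past data,
    and fine-tune on the mixture."""
    # Size the replay sample so it makes up roughly `replay_fraction` of each step's data.
    n_replay = int(len(new_data) * replay_fraction / (1.0 - replay_fraction))
    replay_idx = random.sample(range(len(replay_pool)), min(n_replay, len(replay_pool)))
    mixture = ConcatDataset([new_data, Subset(replay_pool, replay_idx)])

    loader = DataLoader(mixture, batch_size=batch_size, shuffle=True, num_workers=8)
    model.train()
    for images, texts in loader:
        images, texts = images.to(device), texts.to(device)
        loss = model(images, texts)   # assumed to return the CLIP contrastive loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()              # e.g. a simple (possibly restarted) cosine schedule
    return model
```

In this sketch, the compute advantage over the Oracle comes from only ever touching the new chunk plus a bounded replay sample per time step, rather than re-training on all accumulated data from scratch.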

Related readings and updates.

DataComp: In Search of the Next Generation of Multimodal Datasets

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our…

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

This paper was accepted at the UniReps Workshop at NeurIPS 2023. The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple…