This paper was accepted at the Workshop on Distribution Shifts at NeurIPS 2023.

Large-scale training of models has become increasingly expensive. In an ever-changing world where petabytes of new data are generated every day, we want to be able to continually train models. In this paper, we create a benchmark for continual large-scale training of CLIP models where the data distribution varies only by time. In contrast to the traditional continual learning literature, there is no hard separation of tasks: we assume an infinite stream of data in a canonical format arrives and exhibits natural distribution shifts as time passes. We create multiple such benchmarks for CLIP training based on standard benchmarks such as DataComp and YFCC15M. We propose various evaluations and demonstrate that models trained on data only up to a certain year lose performance on categories of rapidly changing data. We propose simple learning-rate schedules and training with replay buffers to reduce the gap in forward transfer. We demonstrate that a simple baseline that continues training from the last checkpoint and replays old data can be competitive with an Oracle that trains on all data seen so far in a single pass with a large compute budget.
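The "continue and replay" baseline described above can be summarized with a short sketch. The snippet below is only an illustrative outline under assumed names, not the paper's implementation: train_one_pass, replay_fraction, and the yearly data stream are hypothetical placeholders.

```python
import random

def train_one_pass(checkpoint, batch, budget):
    """Placeholder for one pass of standard CLIP contrastive training
    under a fixed compute budget; returns the updated checkpoint."""
    # In practice this would run the usual image-text contrastive loss
    # with a chosen learning-rate schedule (e.g., cosine with warmup).
    return checkpoint

def continual_train(yearly_data, replay_fraction=0.25, budget=1.0):
    """Continue training from the last checkpoint on each time step's data,
    mixing in a replay buffer of samples kept from earlier steps."""
    checkpoint = None       # could also start from a pretrained model
    replay_buffer = []      # samples retained from previous time steps

    for year, new_data in yearly_data:
        new_data = list(new_data)

        # Mix the new data with a fraction of replayed old data.
        n_replay = min(int(replay_fraction * len(new_data)), len(replay_buffer))
        batch = new_data + random.sample(replay_buffer, n_replay)
        random.shuffle(batch)

        # Warm-start from the previous checkpoint instead of retraining
        # from scratch, keeping compute roughly constant per time step.
        checkpoint = train_one_pass(checkpoint, batch, budget)

        # Retain a subset of this step's data for future replay.
        keep = min(int(replay_fraction * len(new_data)), len(new_data))
        replay_buffer.extend(random.sample(new_data, keep))

    return checkpoint
```

By contrast, the Oracle in the paper's setting trains on all accumulated data in one pass with a large budget at each step, so its cost grows with time while the baseline's stays roughly constant.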

Related readings and updates.

TiC-CLIP: Continual Training of CLIP Models

Keeping large foundation models up to date with the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps…

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

This paper was accepted at the UniReps Workshop at NeurIPS 2023. The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple…