This paper was accepted at the workshop on Distribution Shifts at NeurIPS 2023.

Large-scale training of models has become increasingly expensive. In an ever-changing world where petabytes of new data are generated every day, we want to be able to train models continually. In this paper, we create a benchmark for continual large-scale training of CLIP models where the data distribution varies only by time. In contrast to the traditional continual learning literature, there is no hard separation of tasks; i.e., we assume an infinite stream of data arrives in a canonical format and exhibits natural distribution shifts as time passes. We create multiple such benchmarks for CLIP training based on standard benchmarks such as DataComp and YFCC15M. We propose various evaluations and demonstrate that models trained on data up to a certain year lose performance on certain categories of rapidly changing data. We propose simple learning rate schedules and training with replay buffers to reduce the gap in forward transfer. We demonstrate that a simple baseline that continues training from the last checkpoint and replays old data can be competitive with an Oracle that receives all data up to the present in one pass and trains with a large budget.
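The replay idea mentioned above can be illustrated with a minimal sketch. It assumes data is grouped by time step (for example, by year) and that each training round mixes new samples with samples drawn uniformly from all earlier steps. The function name `build_training_pool`, the `replay_fraction` parameter, and the uniform sampling rule are hypothetical choices made for illustration; they are not the protocol used in the paper.

```python
import random
from typing import Dict, List, Sequence


def build_training_pool(
    data_by_step: Dict[int, Sequence[str]],
    current_step: int,
    pool_size: int,
    replay_fraction: float = 0.5,  # assumed knob, not taken from the paper
    seed: int = 0,
) -> List[str]:
    """Mix samples from `current_step` with replayed samples from earlier steps."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")

    rng = random.Random(seed)
    new_data = list(data_by_step[current_step])
    # Flatten everything that arrived strictly before the current step.
    old_data = [
        item
        for step, items in sorted(data_by_step.items())
        if step < current_step
        for item in items
    ]

    # With no history (the first step), the pool is filled with new data only.
    n_replay = int(round(pool_size * replay_fraction)) if old_data else 0
    n_replay = min(n_replay, len(old_data))
    n_new = min(pool_size - n_replay, len(new_data))

    pool = rng.sample(old_data, n_replay) + rng.sample(new_data, n_new)
    rng.shuffle(pool)
    return pool


if __name__ == "__main__":
    # Toy stream: ten sample identifiers per year.
    stream = {y: [f"{y}/img_{i}" for i in range(10)] for y in (2014, 2015, 2016)}
    print(build_training_pool(stream, current_step=2016, pool_size=8))
```

In a continual setup, a model would resume from the previous checkpoint and train on the pool returned for each new step; how much old data to replay is a budget trade-off that this sketch leaves as a free parameter.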

Related readings and updates.

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead, which poses challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models…

TiC-CLIP: Continual Training of CLIP Models

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps…