DataComp-LM: In Search of the Next Generation of Training Sets for Language Models
Authors: Jeffrey Li, Alex Fang, Hadi Pour Ansari, Fartash Faghri, Alaaeldin Mohamed Elnouby Ali, Alexander Toshev, Vaishaal Shankar, Georgios Smyrnis, Matt Jordan, Maor Ivgi, Alex Dimakis, Hanlin Zhang, Hritik Bansal, Igor Vasiljevic, Jean Mercat, Jenia Jitsev, Kushal Arora, Mayee Chen, Niklas Muennighoff, Luca Soldaini, Pang Wei Koh, Reinhard Heckel, Rui Xin, Samir Gadre, Rulin Shao, Sarah Pratt, Saurabh Garg, Sedrick Keh, Suchin Gururangan, Sunny Sanyal, Yonatan Bitton, Thomas Kollar, Mitchell Wortsman, Etash Guha, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Joshua Gardner, Marianna Nezhurina, Achal Dave, Yair Carmon, Ludwig Schmidt