Natural language processing (NLP) remains one of the most rapidly evolving fields in AI, as new research continues to advance large language models (LLMs), systems for speech recognition and generation, language agents, and more. This technology is essential to many of today’s AI experiences, including Apple Intelligence and Siri, and fundamental research in NLP will be foundational to future AI.

Apple recently hosted the Workshop on Natural Language and Interactive Systems, bringing together Apple and members of the academic research community for a two-day event on recent advances in NLP. The workshop centered on three key research themes: Spoken Language Interactive Systems, LLM Training and Alignment, and Language Agents. It explored novel approaches to enabling intuitive interactions with devices through natural language understanding and generation, and throughout the event, talks and discussions underscored the importance of privacy, security, performance, and efficiency in this rapidly changing landscape.

In this post, we share recordings of selected talks and a recap of the publications discussed at the workshop.

Workshop Resources

Published Work Presented at the Workshop

2 OLMo 2 Furious by Pete Walsh (Allen Institute for AI), Luca Soldaini (Allen Institute for AI), Dirk Groeneveld (Allen Institute for AI), Kyle Lo (Allen Institute for AI), Shane Arora (Allen Institute for AI), Akshita Bhagia (Allen Institute for AI), Yuling Gu (Allen Institute for AI), Shengyi Huang (Allen Institute for AI), Matt Jordan (Allen Institute for AI), Nathan Lambert (Allen Institute for AI), Dustin Schwenk (Allen Institute for AI), Oyvind Tafjord (Allen Institute for AI), Taira Anderson (Allen Institute for AI), David Atkinson (Allen Institute for AI), Faeze Brahman (Allen Institute for AI), Christopher Clark (Allen Institute for AI), Pradeep Dasigi (Allen Institute for AI), Nouha Dziri (Allen Institute for AI), Michal Guerquin (Allen Institute for AI), Hamish Ivison (Allen Institute for AI, University of Washington), Pang Wei Koh (Allen Institute for AI, University of Washington), Jiacheng Liu (Allen Institute for AI, University of Washington), Saumya Malik (Allen Institute for AI), William Merrill (Allen Institute for AI, New York University), Lester James V. Miranda (Allen Institute for AI), Jacob Morrison (Allen Institute for AI), Tyler Murray (Allen Institute for AI), Crystal Nam (Allen Institute for AI), Valentina Pyatkin (Allen Institute for AI, University of Washington), Aman Rangapur (Allen Institute for AI), Michael Schmitz (Allen Institute for AI), Sam Skjonsberg (Allen Institute for AI), David Wadden (Allen Institute for AI), Christopher Wilhelm (Allen Institute for AI), Michael Wilson (Allen Institute for AI), Luke Zettlemoyer (University of Washington), Ali Farhadi (Allen Institute for AI, University of Washington), Noah A. Smith (Allen Institute for AI, University of Washington), and Hannaneh Hajishirzi (Allen Institute for AI, University of Washington)

Adaptable Logical Control for Large Language Models by Honghua Zhang (UCLA), Po-Nien Kung (UCLA), Masahiro Yoshida (UCLA, Sony Group Corporation), Guy Van den Broeck (UCLA), and Nanyun Peng (UCLA)

Agent S: An Open Agentic Framework that Uses Computers Like a Human by Saaket Agashe (UC Santa Cruz), Jiuzhou Han (Monash University), Shuyu Gan (UC Santa Cruz), Jiachen Yang (Tianjin University), Ang Li, and Xin Eric Wang (UC Santa Cruz)

AI models collapse when trained on recursively generated data by Ilia Shumailov (University of Oxford), Zakhar Shumaylov (University of Cambridge), Yiren Zhao (Imperial College London), Nicolas Papernot (University of Toronto, Vector Institute), Ross Anderson (University of Cambridge, University of Edinburgh), and Yarin Gal (University of Oxford)

DASB - Discrete Audio and Speech Benchmark by Pooneh Mousavi (Concordia University, Mila - Quebec AI Institute), Luca Della Libera (Concordia University, Mila - Quebec AI Institute), Jarod Duret (Avignon Université), Artem Ploujnikov (Université de Montréal, Université Laval, Mila - Quebec AI Institute), Cem Subakan (Université Laval, Mila - Quebec AI Institute, Concordia University), and Mirco Ravanelli (Concordia University, Mila - Quebec AI Institute, Université de Montréal)

Detecting hallucinations in large language models using semantic entropy by Sebastian Farquhar (University of Oxford), Jannik Kossen (University of Oxford), Lorenz Kuhn (University of Oxford), and Yarin Gal (University of Oxford)

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation by Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, and Lijie Wen

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding by Fabian David Schmidt (University of Würzburg), Ivan Vulic (University of Cambridge), Goran Glavaš (University of Würzburg), and David Ifeoluwa Adelani (Mila, McGill University)

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks by Luca Della Libera (Concordia University, Mila-Quebec AI Institute), Francesco Paissan (Fondazione Bruno Kessler, Mila-Quebec AI Institute), Cem Subakan (Université Laval, Concordia University, Mila-Quebec AI Institute), and Mirco Ravanelli (Concordia University, Mila-Quebec AI Institute)

Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models by Hyunwoo Kim, Melanie Sclar (University of Washington), Tan Zhi-Xuan (MIT), Lance Ying (MIT, Harvard University), Sydney Levine (Allen Institute for AI), Yang Liu (Amazon), Joshua B. Tenenbaum (MIT), and Yejin Choi (Stanford University)

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models by David Ifeoluwa Adelani (Mila, McGill University, Canada CIFAR AI Chair), Jessica Ojo (Mila, McGill University, Lelapa AI), Israel Abebe Azime (Saarland University), Jian Yun Zhuang (University of Toronto), Jesujoba O. Alabi (Saarland University), Xuanli He (University College London), Millicent Ochieng (Microsoft Research Africa), Sara Hooker (Cohere For AI), Andiswa Bukula (SADiLaR), En-Shiun Annie Lee (Ontario Tech University), Chiamaka Chukwuneke (Lancaster University), Happy Buzaaba (Princeton University), Blessing Sibanda (Masakhane NLP), Godson Kalipe (Masakhane NLP), Jonathan Mukiibi (Makerere University), Salomon Kabongo (Leibniz Universität Hannover), Foutse Yuehgoh (Le CNAM), Mmasibidi Setaka, Lolwethu Ndolela (Masakhane NLP), Nkiruka Odu (Masakhane NLP), Rooweither Mabuya (SADiLaR), Shamsuddeen Hassan Muhammad (Imperial College London), Salomey Osei (Universidad de Deusto), Sokhar Samb (DAUST), Tadesse Kebede Guge (Haramaya University), Tombekai Vangoni Sherman, and Pontus Stenetorp (University College London)

Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values? by Hua Shen (University of Washington), Nicholas Clark (University of Washington), and Tanushree Mitra (University of Washington)

Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization by Yen-Ju Lu, Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Jen-Hao Rick Chang, Yin Xia, Xiang Kong, Qi Zhu, Simon Wang, Oncel Tuzel, and Raviteja Vemulapalli

Reinforcement Learning for Long-Horizon Interactive LLM Agents by Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl

Retrospective Learning from Interactions by Zizhao Chen (Cornell University), Mustafa Omer Gul (Cornell University), Yiwei Chen (Cornell University), Gloria Geng (Cornell University), Anne Wu (Cornell University), and Yoav Artzi (Cornell University)

s1: Simple Test-Time Scaling by Niklas Muennighoff (Stanford University, Allen Institute for AI, Contextual AI), Zitong Yang (Stanford University), Weijia Shi (University of Washington, Allen Institute for AI), Xiang Lisa Li (Stanford University), Li Fei-Fei (Stanford University), Hannaneh Hajishirzi (University of Washington, Allen Institute for AI), Luke Zettlemoyer (University of Washington), Percy Liang (Stanford University), Emmanuel Candès (Stanford University), and Tatsunori Hashimoto (Stanford University)

Scaling Diffusion Language Models via Adaptation from Autoregressive Models by Shansan Gong (The University of Hong Kong), Shivam Agarwal (University of Illinois at Urbana-Champaign), Yizhe Zhang, Jiacheng Ye (The University of Hong Kong), Lin Zheng (The University of Hong Kong), Mukai Li (The University of Hong Kong), Chenxin An (The University of Hong Kong), Peilin Zhao (Tencent AI Lab), Wei Bi (Tencent AI Lab), Jiawei Han, Hao Peng (University of Illinois at Urbana-Champaign), and Lingpeng Kong (The University of Hong Kong)

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects by David Ifeoluwa Adelani (University College London, Masakhane), Hannah Liu (University of Toronto), Xiaoyu Shen (Amazon Alexa), Nikita Vassilyev (University of Toronto), Jesujoba O. Alabi (Masakhane, Saarland University), Yanke Mao (University of Toronto), Haonan Gao (University of Toronto), and En-Shiun Annie Lee (University of Toronto, University of Ontario Institute of Technology)

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore by Sewon Min (University of Washington), Suchin Gururangan (University of Washington), Eric Wallace (UC Berkeley), Weijia Shi (University of Washington), Hannaneh Hajishirzi (University of Washington, Allen Institute for AI), Noah A. Smith (University of Washington, Allen Institute for AI), and Luke Zettlemoyer (University of Washington)

Speculative Streaming: Fast LLM Inference Without Auxiliary Models by Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, and Mahyar Najibi

The Role of Prosody in Spoken Question Answering by Jie Chi, Maureen de Seyssel, and Natalie Schluter

TiC-LM: A Multi-Year Benchmark for Continual Pretraining of Language Models by Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, and Fartash Faghri

Tulu 3: Pushing Frontiers in Open Language Model Post-Training by Nathan Lambert (Allen Institute for AI), Jacob Morrison (Allen Institute for AI), Valentina Pyatkin (Allen Institute for AI, University of Washington), Shengyi Huang (Allen Institute for AI), Hamish Ivison (Allen Institute for AI, University of Washington), Faeze Brahman (Allen Institute for AI), Lester James V. Miranda (Allen Institute for AI), Alisa Liu (University of Washington), Nouha Dziri (Allen Institute for AI), Xinxi Lyu (Allen Institute for AI), Yuling Gu (Allen Institute for AI), Saumya Malik (Allen Institute for AI), Victoria Graf (University of Washington), Jena D. Hwang (Allen Institute for AI), Jiangjiang Yang (Allen Institute for AI), Ronan Le Bras (Allen Institute for AI), Oyvind Tafjord (Allen Institute for AI), Chris Wilhelm (Allen Institute for AI), Luca Soldaini (Allen Institute for AI), Noah A. Smith (Allen Institute for AI, University of Washington), Yizhong Wang (Allen Institute for AI, University of Washington), Pradeep Dasigi (Allen Institute for AI), and Hannaneh Hajishirzi (Allen Institute for AI, University of Washington)

ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs by Hua Shen (University of Washington), Tiffany Knearem (Google), Reshmi Ghosh (Microsoft), Yu-Ju Yang (University of Illinois at Urbana-Champaign), Tanushree Mitra (University of Washington), Nicholas Clark (University of Washington), and Yun Huang (University of Illinois at Urbana-Champaign)

Other Resources

Agent S: an open agentic framework that uses computers like a human, Xin Eric Wang (UC Santa Cruz)

DASB Benchmark, Mirco Ravanelli (Concordia University, Mila - Quebec AI Institute, Université de Montréal)

Acknowledgments

Many people contributed to this workshop including Irina Belousova, Samy Bengio, Pete Boothroyd, Meng Cao, Kevin Chen, Jie Chi, Fartash Faghri, Tatiana Likhomanenko, Rin Metcalf, Stephen Pulman, Rita Ramos, Natalie Schluter, David Q. Sun, Raviteja Vemulapalli, David Winarsky, and Yizhe Zhang.
