Reasoning and planning are the bedrock of intelligent AI systems, enabling them to interact, adapt, and ultimately operate independently. At Apple, understanding and advancing reasoning capabilities in AI systems has long been an area of active research, resulting in numerous publications that both explore new techniques to advance the frontier of reasoning and further the field's understanding of the capabilities (and limitations) of current approaches.

Last year, Apple hosted the Workshop on Reasoning and Planning, bringing together Apple researchers and members of the broader research community for a two-day event focused on advancing the state of the art in this area. The workshop focused on three key areas: Reasoning and Planning, Applications to Agents, and Model Development.

The presentations and discussions of these topics explored how reasoning and planning models process and complete complex tasks based on simple instructions. Workshop participants discussed questions such as:

  • How can the reasoning process be grounded in various modalities and embodiments?
  • What is the interplay between search/test-time compute and the capabilities of foundation models?
  • How can multiple reasoning systems collaborate?

The group also explored architectures that leverage memory and adaptation and how to plan and reason in a trustworthy, safe, and efficient manner. In addition, workshop attendees discussed model development for reasoning systems using environments and simulators, along with data generation techniques specifically designed for scalable training and reliable benchmarking. Together, progress in these research areas will enable the creation of adaptable, efficient, and intelligent systems that are capable of tackling complex, dynamic challenges.

In this post, we share recordings of selected talks and a recap of the publications discussed at the workshop.

Workshop Resources

Published Work Presented at the Workshop

AbstRaL: Augmenting LLMs’ Reasoning by Reinforcing Abstract Thinking by Silin Gao (work done while at Apple), Antoine Bosselut (EPFL), Samy Bengio, Emmanuel Abbe

Adaptable Logical Control for Large Language Models by Honghua Zhang (UCLA), Po-Nien Kung (UCLA), Masahiro Yoshida, Guy Van den Broeck (UCLA), Nanyun Peng (UCLA)

Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment by Annie S. Chen (Stanford University), Govind Chada (Stanford University), Laura Smith (UC Berkeley), Archit Sharma (Stanford University), Zipeng Fu (Stanford University), Sergey Levine (UC Berkeley), Chelsea Finn (Stanford University)

Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models by Annie S. Chen (Stanford University), Alec M. Lessing (Stanford University), Andy Tang (Stanford University), Govind Chada (Stanford University), Laura Smith (UC Berkeley), Sergey Levine (UC Berkeley), Chelsea Finn (Stanford University)

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making by Manling Li (Stanford University, Northwestern University), Shiyu Zhao (Stanford University), Qineng Wang (Stanford University, Northwestern University), Kangrui Wang (Stanford University, Northwestern University), Yu Zhou (Stanford University), Sanjana Srivastava (Stanford University), Cem Gokmen (Stanford University), Tony Lee (Stanford University), Li Erran Li (Amazon), Ruohan Zhang (Stanford University), Weiyu Liu (Stanford University), Percy Liang (Stanford University), Li Fei-Fei (Stanford University), Jiayuan Mao (MIT), Jiajun Wu (Stanford University)

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs by Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms by Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons by Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev

Grounding Multimodal Large Language Models in Actions by Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Language Models by Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad by Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity by Parshin Shojaee (work done during an internship at Apple), Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Large Language Models as Generalizable Policies for Embodied Tasks by Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Rin Metcalf Susa, Natalie Mackraz, Devon Hjelm, Alexander Toshev

Learning Adaptive Parallel Reasoning with Language Models by Jiayi Pan (UC Berkeley), Xiuyu Li (UC Berkeley), Long Lian (UC Berkeley), Charlie Snell (UC Berkeley), Yifei Zhou (UC Berkeley), Adam Yala (UC Berkeley, UCSF), Trevor Darrell (UC Berkeley), Kurt Keutzer (UC Berkeley), Alane Suhr (UC Berkeley)

Learning to Reason without External Rewards by Xuandong Zhao (UC Berkeley), Zhewei Kang (UC Berkeley), Aosong Feng (Yale University), Sergey Levine (UC Berkeley), Dawn Song (UC Berkeley)

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving by Aniket Didolkar (Mila, University of Montreal), Anirudh Goyal (Mila, University of Montreal), Nan Rosemary Ke (Google DeepMind), Siyuan Guo (The University of Cambridge), Michal Valko (Google DeepMind), Timothy Lillicrap (Google DeepMind), Danilo Rezende (Google DeepMind), Yoshua Bengio (Mila, University of Montreal), Michael Mozer (Google DeepMind), Sanjeev Arora (Princeton University)

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge by Boyu Gou (The Ohio State University), Zanming Huang (The Ohio State University), Yuting Ning (The Ohio State University), Yu Gu (The Ohio State University), Michael Lin (The Ohio State University), Weijian Qi (The Ohio State University), Andrei Kopanev (The Ohio State University), Botao Yu (The Ohio State University), Bernal Jiménez Gutiérrez (The Ohio State University), Yiheng Shu (The Ohio State University), Chan Hee Song (The Ohio State University), Jiaman Wu (The Ohio State University), Shijie Chen (The Ohio State University), Hanane Nour Moussa (The Ohio State University), Tianshu Zhang (The Ohio State University), Jian Xie (The Ohio State University), Yifei Li (The Ohio State University), Tianci Xue (The Ohio State University), Zeyi Liao (The Ohio State University), Kai Zhang (The Ohio State University), Boyuan Zheng (The Ohio State University), Zhaowei Cai (Amazon AGI), Viktor Rozgic (Amazon AGI), Morteza Ziyadi (Amazon AGI), Huan Sun (The Ohio State University), Yu Su (The Ohio State University)

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains by Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, George Horrell, Xiaoming Wang, Jiulong Shan, Meng Cao, Ruoming Pang, Zirui Wang

OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization by Yiyou Sun (UC Berkeley), Shawn Hu (dmodel.ai), Georgia Zhou (UC Berkeley), Ken Zheng (UC Berkeley), Hannaneh Hajishirzi (Ai2, University of Washington), Nouha Dziri (Ai2), Dawn Song (UC Berkeley)

On the Modeling Capabilities of Large Language Models for Sequential Decision Making by Martin Klissarov, Devon Hjelm, Alexander Toshev, Bogdan Mazoure

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments by Tianbao Xie (The University of Hong Kong), Danyang Zhang (The University of Hong Kong), Jixuan Chen (The University of Hong Kong), Xiaochuan Li (The University of Hong Kong), Siheng Zhao (The University of Hong Kong), Ruisheng Cao (The University of Hong Kong), Toh Jing Hua (The University of Hong Kong), Zhoujun Cheng (The University of Hong Kong), Dongchan Shin (The University of Hong Kong), Fangyu Lei (The University of Hong Kong), Yitao Liu (The University of Hong Kong), Yiheng Xu (The University of Hong Kong), Shuyan Zhou (Carnegie Mellon University), Silvio Savarese (Salesforce Research), Caiming Xiong (Salesforce Research), Victor Zhong (University of Waterloo), Tao Yu (The University of Hong Kong)

Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting by Xiong-Hui Chen (Nanjing University), Ziyan Wang (King’s College London), Yali Du (King’s College London), Shengyi Jiang (The University of Hong Kong), Meng Fang (University of Liverpool), Yang Yu (Nanjing University), Jun Wang (University College London)

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning by Zihan Wang (Northwestern University), Kangrui Wang (Northwestern University), Qineng Wang (Northwestern University), Pingyue Zhang (Northwestern University), Linjie Li (University of Washington), Zhengyuan Yang (Microsoft), Xing Jin (University of British Columbia), Kefan Yu (Northwestern University), Minh Nhat Nguyen (Singapore Management University), Licheng Liu (Northwestern University), Eli Gottlieb (Northwestern University), Yiping Lu (Northwestern University), Kyunghyun Cho (New York University), Jiajun Wu (Stanford University), Li Fei-Fei (Stanford University), Lijuan Wang (Microsoft), Yejin Choi (Stanford University), Manling Li (Northwestern University)

Reinforcement Learning for Long-Horizon Interactive LLM Agents by Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, Philipp Krähenbühl

TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation by Gwen Yidou Weng (UCLA), Benjie Wang (UCLA), Guy Van den Broeck (UCLA)

Training Software Engineering Agents and Verifiers with SWE-Gym by Jiayi Pan (UC Berkeley), Xingyao Wang (UIUC), Graham Neubig (CMU), Navdeep Jaitly (Apple), Heng Ji (UIUC), Alane Suhr (UC Berkeley), Yizhe Zhang (Apple)

Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets by Tianjian Li (Johns Hopkins University), Haoran Xu (Johns Hopkins University), Weiting Tan (Johns Hopkins University), Kenton Murray (Johns Hopkins University), Daniel Khashabi (Johns Hopkins University)

Yell At Your Robot: Improving On-the-Fly from Language Corrections by Lucy Xiaoyang Shi (Stanford University), Zheyuan Hu (University of California, Berkeley), Tony Z. Zhao (Stanford University), Archit Sharma (Stanford University), Karl Pertsch (Stanford University, UC Berkeley), Jianlan Luo (UC Berkeley), Sergey Levine (UC Berkeley), Chelsea Finn (Stanford University)

Acknowledgments

Many people contributed to this workshop including Akshay Aggarwal, Harsh Agrawal, Felix Bai, Samy Bengio, Meng Cao, Mehrdad Farajtabar, Philipp Krähenbühl, Iman Mirzadeh, Yanchao Sun, Alexander Toshev, Simon Wang, Guoli Yin, and Keen You.
