Progress in natural language processing enables more intuitive ways of interacting with technology. For example, many of Apple’s products and services, including Siri and search, use natural language understanding and generation to enable a fluent and seamless interface experience for users. Natural language is a rapidly moving area of machine learning research, and includes work on large-scale data curation across multiple languages, novel architectures and algorithms, and new evaluation regimes, all of which involve important issues of privacy and security, as well as of performance and efficiency.

To discuss this quickly changing landscape of research, Apple hosted a research workshop on the topic of natural language understanding, bringing together Apple and members of the academic research community for a multi-day event focused on recent developments in large language models (LLMs).

In this article we share highlights from workshop discussions and recordings of selected workshop talks.

Optimization

LLMs have proven capable of carrying out a wide range of tasks, and they are now commonly used across many domains and applications. Consequently, there is growing interest in more efficient models, and several talks discussed potential directions for developing these.

Two presentations described alternative architectures to attention-based transformer models:

  • Sasha Rush of Cornell University, in the talk “SSMs and the Foundation Model Design Space”, described State-Space Models (SSMs), a promising and rapidly developing direction showing competitive accuracy and scaling properties. SSM architectures also open up new model design possibilities (for example, byte-level LLMs and distillation of LLMs into SSMs to speed up inference), as well as the potential for non-autoregressive foundation models.

  • Yoon Kim of the Massachusetts Institute of Technology presented Recurrent Neural Network (RNN) architectures with matrix-valued hidden states and input-dependent linear transition dynamics. This approach allows for more efficient training and higher throughput than transformers, as described in the paper Gated Linear Attention Transformers with Hardware-Efficient Training.
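
The recurrence underlying such linear-attention models can be sketched in a few lines. The following toy, pure-Python illustration (not the paper's implementation, which uses learned input-dependent gates and hardware-efficient chunked kernels) updates a matrix-valued state S with a gated rank-one write, S ← g ⊙ S + k vᵀ, and reads it out with the query, o = Sᵀ q, so per-token cost stays constant rather than growing with sequence length:

```python
# Toy sketch of one gated linear attention step (illustrative only).
# The hidden state is a d x d matrix updated with a gate; in the real
# architecture the gate g is computed from the input at each step.

def gla_step(S, q, k, v, g):
    """One recurrent step: S <- g * S + k v^T, output o = S^T q."""
    d = len(q)
    # Gated update of the matrix-valued state.
    S = [[g[i][j] * S[i][j] + k[i] * v[j] for j in range(d)]
         for i in range(d)]
    # Read out: o_j = sum_i S[i][j] * q[i]
    o = [sum(S[i][j] * q[i] for i in range(d)) for j in range(d)]
    return S, o

# Process a short sequence with a constant gate for simplicity.
d = 2
S = [[0.0] * d for _ in range(d)]
gate = [[0.9] * d for _ in range(d)]          # input-dependent in practice
seq = [([1.0, 0.0], [1.0, 0.0], [2.0, 3.0]),  # (q, k, v) per token
       ([0.0, 1.0], [0.0, 1.0], [5.0, 7.0])]
for q, k, v in seq:
    S, o = gla_step(S, q, k, v, gate)
print(o)
```

Because the state has fixed size, inference memory and per-token compute are independent of context length, which is the source of the throughput advantage over softmax attention.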

Yejin Choi, then of the University of Washington and now affiliated with Stanford and NVIDIA, presented the talk “Impossible Possibilities with Small-Scale Language Models”, arguing that specialized distilled models for particular applications can still outperform general-purpose LLMs, and that quality matters as much as scale.

Another approach focused on LLM inference optimization on devices with limited memory. Apple’s Mehrdad Farajtabar showed how sparsity awareness, context-adaptive loading, and a hardware-oriented design speed up model inference by 4-5x on CPUs and 20-25x on GPUs, as detailed in the papers LLM in a Flash: Efficient Large Language Model Inference with Limited Memory and ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models.
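
The core idea behind exploiting activation sparsity can be illustrated with a toy feed-forward layer: when the up-projection passes through a ReLU, many hidden units are exactly zero, so the corresponding rows of the down-projection contribute nothing and need never be touched (or, on device, loaded from flash at all). A minimal sketch, far simpler than the predictive loading systems the papers describe:

```python
# Toy sketch of exploiting ReLU activation sparsity in an FFN layer
# (illustrative; the papers describe sparsity prediction and
# flash-memory-aware weight loading on top of this basic idea).

def ffn_dense(x, W_up, W_down):
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in W_up]
    return [sum(h[i] * W_down[i][j] for i in range(len(h)))
            for j in range(len(x))]

def ffn_sparse(x, W_up, W_down):
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in W_up]
    active = [i for i, hi in enumerate(h) if hi > 0.0]
    # Only rows of W_down for active neurons are touched; inactive
    # neurons are skipped entirely without changing the result.
    return [sum(h[i] * W_down[i][j] for i in active)
            for j in range(len(x))]

x = [1.0, -2.0]
W_up = [[1.0, 1.0], [0.5, -1.0], [-1.0, 0.0]]   # 3 hidden units
W_down = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(ffn_dense(x, W_up, W_down) == ffn_sparse(x, W_up, W_down))
```

With typical ReLU models, a large fraction of hidden units are inactive on any given input, so the skipped work (and skipped weight traffic) dominates the layer's cost.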

Reasoning and Planning

Standard training methods enable LLMs to carry out tasks like question answering, translation, and summarization when provided with suitable prompts. More complex requests, however, may need to be broken down into smaller units to be handled separately. Recent work in reasoning and planning has focused on improving the ability of LLMs to decompose and tackle complex tasks via planning, using strategies such as Chain of Thought (CoT) or Self-reflection, which show significant improvements in performance across various fields, including mathematics, science, information retrieval, tool utilization, and agency. In the talk “Assessing and Improving Planning Capabilities of Language Models”, Apple’s Navdeep Jaitly presented a series of experiments on this topic, including:

  • assessing LLM planning strategies in generating problem-solving steps using the Twenty Questions game, revealing their strategic behavior in multi-turn interactions;

  • a hybrid approach where a smaller LLM generates strategies, and a larger LLM executes individual steps, effectively distilling planning capabilities without losing problem-solving effectiveness;

  • a latent diffusion language model that generates an overall paragraph embedding before individual tokens, enhancing coherence and context. This idea is detailed in the paper PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model.
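
As a concrete illustration of the simplest of these prompting strategies, a minimal chain-of-thought prompt just instructs the model to produce intermediate reasoning steps before its final answer. The sketch below is a generic example, not a prompt from any of the talks:

```python
# Minimal chain-of-thought prompt sketch (generic illustration; no
# specific model or API is assumed).

question = "A train travels 60 km in 40 minutes. What is its speed in km/h?"

cot_prompt = (
    "Answer the question. Think step by step, then give the final answer "
    "on a line starting with 'Answer:'.\n\n"
    f"Question: {question}\n"
)
# A model following the prompt might reply:
#   40 minutes is 2/3 of an hour. 60 km / (2/3 h) = 90 km/h.
#   Answer: 90 km/h
print(cot_prompt)
```

Eliciting the intermediate steps is what lets downstream systems inspect, score, or aggregate a model's reasoning rather than only its final answer.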

Another potential performance improvement was offered by William Wang of the University of California, Santa Barbara, in a talk titled “Insights on Aggregating Reasoning Paths in Pretrained Language Models”. The talk suggested that an LLM’s reasoning ability arises from text-based connections between concepts (“reasoning paths”) seen in training data, and showed that aggregating reasoning paths can improve LLM performance. To learn more, see Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation.

Yu Su of Ohio State University highlighted a new trend in LLM-based planning research: the role of LLMs as agents employing external tools to extend their capabilities. As explained in the paper LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, effective tool use may require the ability to simulate trial applications so that LLMs can learn from their errors.

Subbarao Kambhampati of Arizona State University struck a cautionary note, arguing that, despite frequent claims to the contrary, the current generation of LLMs do not themselves plan in any straightforward sense, but can help planning within a more extended framework. This argument is explained in the paper, LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks.

Multilingual Models

In a satellite session involving researchers from the Asia-Pacific region, many promising methods for advancing multilingual understanding in a computationally and data-efficient manner were presented and discussed. A key focus was on adapting predominantly English-pretrained models to other languages.

Yen Yu of Apple presented the talk “Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages”, also given at an ACL 2024 workshop. The talk demonstrated how minimal fine-tuning can teach an LLM to comprehend a previously unseen language, an approach that shows promise for developing LLMs for low-resource languages efficiently and economically. To learn more, see the paper of the same title.

Naoaki Okazaki of the Tokyo Institute of Technology presented the talk “Developing Japanese LLMs with Continual Pre-Training”, which follows a similar strategy of adapting pre-existing models to Japanese. The talk covered many topics applicable to the multilingual setting, including strategies to augment a base model’s vocabulary with more target-language tokens, both to avoid unnecessary “out of vocabulary” failures and to improve inference speed. Okazaki demonstrated that continual pre-training atop existing models can retain the capabilities of the base model while augmenting it with Japanese language ability, knowledge, and cultural understanding. Since instruction tuning is typically the final step of training, performing this additional pre-training after instruction tuning has finished can be problematic. Okazaki therefore also demonstrated a method for efficiently reapplying instruction vectors from the base model in a black-box manner, avoiding the need to rerun instruction tuning.
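
The vocabulary-augmentation step can be sketched simply: new target-language tokens are appended to the vocabulary, and each new embedding is initialized from the embeddings of the subword pieces the base tokenizer would otherwise use for that string. This is a toy, pure-Python illustration of one common initialization scheme, not necessarily the talk's exact method:

```python
# Sketch of augmenting a base model's vocabulary with new target-language
# tokens (illustrative; real implementations operate on a trained
# tokenizer and the model's embedding matrix).

def add_tokens(vocab, embeddings, new_tokens, piece_ids):
    """Append new tokens; initialize each new embedding as the mean of
    the embeddings of the base tokenizer's subword pieces for it."""
    for tok in new_tokens:
        vocab[tok] = len(vocab)
        pieces = piece_ids[tok]
        dim = len(embeddings[0])
        mean = [sum(embeddings[p][d] for p in pieces) / len(pieces)
                for d in range(dim)]
        embeddings.append(mean)
    return vocab, embeddings

vocab = {"to": 0, "kyo": 1}
emb = [[1.0, 3.0], [3.0, 5.0]]
vocab, emb = add_tokens(vocab, emb, ["tokyo"], {"tokyo": [0, 1]})
print(vocab["tokyo"], emb[2])
```

Encoding a common word as one token instead of several both sidesteps awkward subword splits and reduces the number of decoding steps, which is where the inference speedup comes from.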

Alignment

As LLMs become more widely deployed in production applications, advances in alignment to ensure reliable and safe output from these models are increasingly important, and several alignment problems were explored at the workshop.

Hadas Kotek of Apple and Hadas Orgad of Technion discussed gender bias in the talk “Exploring Safety Issues in LLMs: Behavioral and Model Perspectives”. They described various test sets for eliciting gender bias and showed that most current models display such bias, not only in their results on the test items but also in their explanations of the choices they make when determining, for example, likely antecedents for gender-specific pronouns. They also described how to train classifier probes on transformer hidden states that can predict the correctness of subsequent output, a method for reducing the number of unreliable outputs, or at least providing appropriate warnings. For more information, see the related paper Gender Bias in LLMs.
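
A probe of this kind is just a small classifier trained on hidden states, with labels indicating whether the eventual output was correct. The following sketch trains a tiny logistic-regression probe on made-up two-dimensional "hidden states" (real probes operate on the model's full hidden vectors, and this is not the talk's actual setup):

```python
# Sketch of a linear probe that predicts output reliability from hidden
# states (toy data and a minimal logistic-regression trainer).
import math

def train_probe(states, labels, lr=0.5, steps=500):
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(steps):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            g = p - y                      # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

# Toy "hidden states": the first feature correlates with correctness.
states = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.0]]
labels = [1, 1, 0, 0]
w, b = train_probe(states, labels)
print(predict(w, b, [1.0, 0.0]) > 0.5)   # flagged as likely reliable
```

At deployment, outputs whose probe score falls below a threshold can be suppressed or accompanied by a warning.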

Accurate alignment with human interests requires data that exemplifies both inappropriate and appropriate behavior. David Q. Sun of Apple described the design and construction of a dataset of controversial questions, such as “Are we born good or evil?”, as explored in the paper DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues, which also compares the performance of several current models on this data.

Tatsunori Hashimoto of Stanford University, in the talk “Statistical methods for trustworthy language modeling”, addressed hallucination, or lack of factuality, in LLM outputs, an issue that can have serious practical (and even medical or legal) consequences. Hashimoto demonstrated how probabilistic methods can help models estimate their confidence levels, even without achieving 100% certainty.

Hashimoto also used classical statistical methods to detect contamination of widely used benchmarks and test sets in language model training data, revealing that many community benchmarks are highly correlated, regardless of contamination. He advocated for developing a diverse collection of new evaluation sets, as well as rethinking protocols in current community use to routinely include audits of test set contamination.
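
One family of such statistical audits can be sketched as a permutation test: if a model assigns markedly higher likelihood to a benchmark's examples in their canonical published order than to random reorderings, the ordering itself may have been memorized during training. The sketch below illustrates the general idea with a stand-in scorer; the actual tests discussed differ in detail:

```python
# Sketch of a permutation-style contamination check (illustration of the
# general idea only; the scorer here is a toy stand-in for a model's
# log-likelihood of an example at a given position).
import random

def order_score(examples, scorer):
    """Total log-likelihood of examples presented in the given order."""
    return sum(scorer(ex, pos) for pos, ex in enumerate(examples))

def contamination_p_value(examples, scorer, n_perms=200, seed=0):
    rng = random.Random(seed)
    canonical = order_score(examples, scorer)
    at_least = 1  # the canonical ordering counts against itself
    for _ in range(n_perms):
        perm = examples[:]
        rng.shuffle(perm)
        if order_score(perm, scorer) >= canonical:
            at_least += 1
    return at_least / (n_perms + 1)

# Toy "memorized" scorer: prefers each example at its published position.
examples = list(range(10))
memorized = lambda ex, pos: -abs(ex - pos)   # peaks in canonical order
p = contamination_p_value(examples, memorized)
print(p < 0.05)   # canonical order looks suspiciously likely
```

A small p-value says the canonical ordering is an outlier among reorderings, which for exchangeable test sets is evidence the model saw the benchmark during training.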

Security

In addition to these issues of alignment, jailbreak and prompt injection threats pose another challenge for current LLMs. Chaowei Xiao, of the University of Wisconsin and NVIDIA, showed how different defense strategies can be employed to mitigate these safety and security risks, as well as how to expand these strategies to agent-based systems, as described in the paper AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.

As we continue to develop more natural and intuitive ways of interacting with personal devices, LLMs and related technologies in natural language understanding and generation are a key focus in academia and industry, as this workshop demonstrates.

Workshop Resources

Related Work

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models by Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao

DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues by David Q. Sun, Artem Abzaliev, Hadas Kotek, Zidi Xiu, Christopher Klein, and Jason D. Williams

Gated Linear Attention Transformers with Hardware-Efficient Training by Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim

Gender Bias in LLMs by Hadas Kotek, Rikker Dockum, and David Q. Sun

LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks by Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy

LLM in a Flash: Efficient Large Language Model Inference with Limited Memory by Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error by Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, and Yu Su

PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model by Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Josh Susskind, and Navdeep Jaitly

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models by Iman Mirzadeh, Keivan Alizadeh Vahid, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar

Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages by Zhuoyuan Mao and Yen Yu

Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation by Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, and William Yang Wang

Acknowledgements

Many people contributed to this workshop, including Alex Acero, Kisuh Ahn, Ravi Anantha, Mehrdad Farajtabar, Navdeep Jaitly, Dain Kaplan, Hadas Kotek, Shiyu Li, Tatiana Likhomanenko, Kieran Liu, Joel Moniz, Ruoming Pang, Stephen Pulman, David Q. Sun, Yinfei Yang, Yen Yu, Yizhe Zhang, and Charlie Zhou.
