Event | July 20, 2023

Apple Natural Language Understanding Workshop 2023

Earlier this year, Apple hosted the Natural Language Understanding workshop. This two-day hybrid event brought together Apple and members of the academic research community for talks and discussions on the state of the art in natural language understanding.

In this post, we share highlights from workshop discussions and recordings of select workshop talks.

NLU Workshop Videos

STAIR: Learning Sparse Text and Image Representation in Grounded…
Presented by Yinfei Yang

Building Language Models with Modularity
Presented by Noah Smith

Model-Aided Human Annotation at Scale
Presented by Hadas Kotek

Prompting for a Conversation: How to Control a Dialog Model?
Presented by Yimai Fang

Towards Practical Use of Large Pre-Trained Language Models:…
Presented by Chris Manning

Grounded Dialogue Generation with Cross-encoding Re-ranker…
Presented by Helen Meng

Balancing Privacy and Conversational Systems

Many workshop attendees noted that preserving privacy can be especially challenging in foundation models, which can memorize training data and directly use personal information. Dr. Danqi Chen, workshop attendee and assistant professor of Computer Science at Princeton University, discussed a popular proposed solution: augmenting parametric language models with a k-nearest-neighbor retrieval component. This solution is described in the paper Generalization Through Memorization: Nearest Neighbor Language Models, co-authored by workshop attendee and Professor at the University of Washington, Luke Zettlemoyer. Chen showed there is a higher risk of privacy leakage when using a private datastore in such a configuration than when using the datastore to fine-tune the parametric language model. Chen and her group describe this in further detail in their paper, Privacy Implications of Retrieval-Based Language Models. Retrieval-based solutions are intended to prevent internal memorization of personal information by limiting the core model and forcing it to rely on external information sources that could be properly anonymized in use.
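As a rough illustration of the retrieval-augmented setup discussed above, the sketch below interpolates a parametric model's next-token distribution with a distribution built from the k nearest entries in a datastore, in the spirit of the nearest neighbor language model paper. The datastore contents, the two-dimensional context vectors, the interpolation weight `lam`, and all function names are assumptions made for illustration, not details from the workshop.

```python
import math
from collections import defaultdict


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_distribution(query, datastore, k):
    """Turn the k nearest (key, next_token) pairs into a token distribution.

    Each neighbor votes for its stored next token with weight exp(-distance).
    """
    neighbors = sorted(datastore, key=lambda item: squared_l2(query, item[0]))[:k]
    weights = defaultdict(float)
    for key, token in neighbors:
        weights[token] += math.exp(-squared_l2(query, key))
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Hypothetical datastore: (context vector, token that followed that context).
datastore = [
    ([0.0, 1.0], "Paris"),
    ([0.1, 0.9], "Paris"),
    ([1.0, 0.0], "London"),
]
# Hypothetical output of the parametric model for the current context.
p_lm = {"Paris": 0.4, "London": 0.4, "Rome": 0.2}

p_knn = knn_distribution([0.0, 0.9], datastore, k=2)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

Because the retrieved tokens come straight from the datastore, anything stored there can surface in the output, which gives some intuition for why a private datastore needs particular care in this configuration.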

Applying Foundation Models to Production Systems

Foundation models contain so much data that they require large computing clusters for processing. Making these models more compact will make it possible to run them on smaller computing devices (such as phones), some of which preserve users’ privacy by storing their data only on the device.

A variety of techniques for model compression were covered in workshop attendee Zettlemoyer's paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale and in Computer Science Ph.D. candidate at Cornell Tech Rajiv Movva's paper Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks. These techniques include:

Pruning out weights that can be omitted without degrading performance
Reducing the precision of encoding, typically by reducing the model weights from 32-bit to 16-bit or smaller, which reduces memory use and can reduce inference time
Distilling knowledge, in which a smaller student network learns to copy the behavior of a larger teacher network
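The three techniques listed above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the method of either paper: `prune_by_magnitude` uses simple magnitude pruning, `quantize_absmax_int8` uses symmetric absolute-maximum scaling to 8-bit integers, and `distillation_loss` is a cross-entropy between temperature-softened teacher and student distributions. All function names, the example weights, and the temperature are hypothetical.

```python
import math


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in q_weights]


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened outputs."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# Hypothetical weights for a single tiny layer.
weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.03]
pruned = prune_by_magnitude(weights, fraction=0.5)
q_weights, scale = quantize_absmax_int8(weights)
restored = dequantize(q_weights, scale)
```

Storing 8-bit integers plus one shared scale takes roughly a quarter of the memory of 32-bit floats, at the cost of a rounding error of at most half the scale per weight. The distillation loss is smallest when the student's softened outputs match the teacher's, which is the sense in which the student learns to copy the teacher.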

Researchers also face challenges with foundation models’ consistency, hallucination (generating false statements or adding extraneous imagined details), and unsafe outputs. Research by workshop attendee Pascale Fung and team, Survey of Hallucination in Natural Language Generation, discusses such unsafe outputs. For example, if asked to summarize a passage containing facts like the date of the first vaccine for Ebola (2019), a foundation model might instead claim it was 2021, or it might claim that 2021 was the date China started COVID-19 vaccine trials. Neither of these is accurate, but the foundation model has no ability to determine truth; it can only measure language probability. Such hallucinations might be offensive or harmful. Similarly, foundation models might give two different and inconsistent answers to the same question on separate occasions or in different contexts.

One common theme in the workshop was the idea of grounding agents (conversational assistants or chatbots) in retrieved facts, and of building an ecosystem of auxiliary models and systems to act as safeguards.

Using Multimodal Information in Conversational Systems

Researchers are currently applying multimodal context (such as prior interactions, information on the screen, gestures, gaze, and visual cues) to reduce ambiguity in conversational understanding. Workshop attendee and Apple machine learning researcher Hadas Kotek, who cowrote the paper Generating Natural Questions from Images for Multimodal Assistants, gave some examples in which such context might provide referents for interpreting phrases like “today,” “she,” “the lowest one,” or “that picture.”

In addition, workshop attendee and Professor of Systems Design Methodologies at the University of Washington, Mari Ostendorf, described other aspects of context important for natural dialogue: external knowledge, dialogue history, and prosodic and visual signals, not all of which can currently be handled successfully within a dialogue state tracking or slot-filling approach. She sketched one possible approach to including domain and task contextual knowledge via finite-state-machine-based graphs describing, for example, possible sequences of events in buyer-seller negotiation dialogues. To contextualize knowledge retrieval and language generation, she advocated for efficient representations of the dialogue history, including a structured dialogue state. Learn more about these aspects of context in Ostendorf's papers In-context Learning for Few-shot Dialogue State Tracking and CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning.
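One way to picture the finite-state-machine idea above is a small graph of allowed dialogue-act transitions that is tracked alongside a structured dialogue state. The sketch below is hypothetical: the phase names, the dialogue acts, and the state fields are invented for a buyer-seller negotiation and are not taken from Ostendorf's work.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical negotiation graph: phase -> {dialogue act: next phase}.
TRANSITIONS = {
    "start": {"greet": "opened"},
    "opened": {"offer": "offered"},
    "offered": {"counter": "offered", "accept": "agreed", "reject": "ended"},
    "agreed": {},
    "ended": {},
}


@dataclass
class DialogueState:
    """Structured summary of the dialogue so far."""

    phase: str = "start"
    last_price: Optional[float] = None
    history: list = field(default_factory=list)

    def allowed_acts(self):
        """Dialogue acts the graph permits from the current phase."""
        return sorted(TRANSITIONS[self.phase])

    def step(self, speaker, act, price=None):
        """Apply one turn, rejecting acts the graph does not allow."""
        if act not in TRANSITIONS[self.phase]:
            raise ValueError(f"{act!r} is not allowed in phase {self.phase!r}")
        self.phase = TRANSITIONS[self.phase][act]
        if price is not None:
            self.last_price = price
        self.history.append((speaker, act, price))
        return self


# A short example negotiation.
state = DialogueState()
state.step("buyer", "greet")
state.step("seller", "offer", price=100.0)
state.step("buyer", "counter", price=80.0)
state.step("seller", "accept")
```

In this sketch the compact state (`phase` and `last_price`) stands in for the full turn history, which is the kind of efficient representation that could be passed to retrieval or generation components instead of the raw transcript.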

Workshop participants also gave talks on how to learn meaningful representations of language and vision in tandem, such as workshop attendee and Apple machine learning researcher Yinfei Yang's talk titled “STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.” Mixed representations are essential for systems that use text to search for images, as discussed by Yang and team in Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision, and for systems that describe images in text, as discussed by workshop attendee Kotek in the paper Generating Natural Questions from Images for Multimodal Assistants.

Using Foundation Models to Solve Data Synthesis Problems

Foundation models have demonstrated the capability to generate high-quality synthetic data with little or no graded data to learn from. Using synthetic data in place of manually labeled data reduces the need to show annotators any data that might contain personal information, helping to preserve privacy.

However, generating synthetic data creates some challenges. Researchers are still not clear on how to measure and ensure the quality — that is, the factual accuracy, naturalness, or similarity to human speech or writing — and diversity of the output data.

This is especially challenging for data generation over multiple turns, including conversational and task-based interactions. Research shows foundation models can lose factual accuracy and hallucinate information not present in the conversational context over longer interactions.

Researchers can tackle these challenges in several ways. Some promising methods being considered for future research use foundation models for review and analysis — applying the models to view the same problem multiple times, in different roles. Other methods involve some amount of human annotation or preference selection. Thus, the main open challenge here is to find ways to maximize the impact of human input.

Workshop Resources

Recovering Private Text in Federated Learning of Language Models by Samyak Gupta, Yangsibo Huang, Zexuan Zhong, Tianyu Gao, Kai Li, and Danqi Chen

Survey of Hallucination in Natural Language Generation by Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Wenliang Dai, Andrea Madotto, and Pascale Fung

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig

Generalization through Memorization: Nearest Neighbor Language Models by Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis

Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks by Rajiv Movva, Jinhao Lei, Shayne Longpre, Ajay Gupta, and Chris DuBois

Generating Natural Questions from Images for Multimodal Assistants by Alkesh Patel, Akanksha Bindal, Hadas Kotek, Christopher Klein, and Jason Williams

Training Language Models with Memory Augmentation by Zexuan Zhong, Tao Lei, and Danqi Chen

Acknowledgements

Christopher Klein, David Q. Sun, Dhivya Piraviperumal, Hadas Kotek, Irina Belousova, Jason Williams, Kieran Liu, Matthew Henderson, Murat Akbacak, Stephen Pulman, Tatiana Likhomanenko, Thomas Voice, Yimai Fang, and Yinfei Yang.
