Earlier this year, Apple hosted the Natural Language Understanding workshop. This two-day hybrid event brought together Apple and members of the academic research community for talks and discussions on the state of the art in natural language understanding.
In this post, we share highlights from workshop discussions and recordings of select workshop talks.
Balancing Privacy and Conversational Systems
Many workshop attendees noted that preserving privacy can be especially challenging in foundation models, which can memorize training data and directly use personal information. Workshop attendee and assistant professor of Computer Science at Princeton University, Dr. Danqi Chen, discussed a popular proposed solution: augmenting parametric language models with a k-nearest-neighbor retrieval component. This solution is described in the paper Generalization Through Memorization: Nearest Neighbor Language Models, co-authored by workshop attendee and Professor at the University of Washington, Luke Zettlemoyer. Such solutions would prevent internal memorization of personal information by limiting the core model and forcing it to rely on external information sources that could be properly anonymized in use. However, Chen showed there is a higher risk of privacy leakage when using a private datastore in such a configuration than when using the datastore to fine-tune the parametric language model. Chen and her group describe this in further detail in their paper, Privacy Implications of Retrieval-Based Language Models.
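To make the retrieval-augmented setup above concrete, here is a minimal sketch of the interpolation step used in nearest-neighbor language models: the next-token distribution from the parametric model is mixed with a distribution built from the k closest entries in an external datastore. The toy vectors, tokens, function names, and the values of `k`, `lam`, and `temperature` are illustrative assumptions, not details taken from the workshop talks.

```python
import math


def knn_distribution(query, datastore, k, temperature=1.0):
    """Build a next-token distribution from the k nearest datastore entries.

    datastore is a list of (key_vector, next_token) pairs (toy data here).
    """
    # Squared Euclidean distance from the query to every stored key.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    ]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    # Softmax over negative distances: closer neighbors get more weight.
    weights = [math.exp(-dist / temperature) for dist, _ in nearest]
    total = sum(weights)
    probs = {}
    for weight, (_, token) in zip(weights, nearest):
        probs[token] = probs.get(token, 0.0) + weight / total
    return probs


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    tokens = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
        for t in tokens
    }


# Hypothetical context vectors and next tokens.
datastore = [([0.0, 0.0], "paris"), ([0.1, 0.0], "paris"), ([5.0, 5.0], "rome")]
p_knn = knn_distribution([0.0, 0.1], datastore, k=2)
p_lm = {"paris": 0.5, "rome": 0.5}  # stand-in for the parametric model's output
print(interpolate(p_lm, p_knn, lam=0.25))
```

Because the datastore sits outside the model weights, it can in principle be swapped, filtered, or anonymized without retraining. The same property means that whatever is stored there can surface directly in the output, which is the leakage risk discussed above.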
Applying Foundation Models to Production Systems
Foundation models contain so much data that they require large computing clusters for processing. Making these models more compact will make it possible to run them on smaller computing devices (such as phones), some of which preserve users’ privacy by storing their data only on the device.
A variety of model compression techniques are covered in workshop attendee Zettlemoyer's paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale and in the paper Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks by Rajiv Movva, a Computer Science Ph.D. candidate at Cornell Tech. These techniques include:
- Pruning out weights that can be omitted without degrading performance
- Reducing the precision of encoding, typically by reducing the model weights from 32-bit to 16-bit or smaller, which reduces memory use and inference time
- Distilling knowledge, in which a smaller student network learns to copy the behavior of a larger teacher network
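The three techniques in the list above can be sketched on toy data as follows. This is a simplified illustration under stated assumptions: magnitude-based pruning, absmax 8-bit quantization, and a temperature-softened KL distillation loss are common textbook variants chosen for brevity, not necessarily the methods used in the papers cited above, and all function names and values are hypothetical.

```python
import math


def magnitude_prune(weights, sparsity):
    """Pruning: zero out the given fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Reduced precision: absmax mapping of floats onto integers in [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in quantized]


def softmax(logits, temperature=1.0):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Distillation: KL divergence from softened teacher to student outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


weights = [0.5, -0.02, 1.27, 0.01, -0.9, 0.3]  # toy weight vector
pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale = quantize_int8(pruned)
print(pruned, codes, dequantize(codes, scale))
print(distillation_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0]))
```

In practice these steps are applied to full weight tensors and are usually followed by evaluation or further training, since each one trades some accuracy for a smaller or faster model.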
Researchers also face challenges with foundation models’ consistency, hallucination (the generation of false statements or the addition of extraneous imagined details), and unsafe outputs. Research by workshop attendee Pascale Fung and team, Survey of Hallucination in Natural Language Generation, discusses such unsafe outputs. For example, if asked to summarize a passage containing facts like the date of the first vaccine for Ebola (2019), a foundation model might instead claim it was 2021, or it might claim that 2021 was the date China started COVID-19 vaccine trials. Neither of these is accurate, but the foundation model has no ability to determine truth; it can only measure language probability. Such hallucinations might be offensive or harmful. Similarly, foundation models might give two different and inconsistent answers to the same question on separate occasions or in different contexts.
One common theme in the workshop was the idea of grounding agents — conversational assistants or chatbots — in retrieving facts and building an ecosystem of auxiliary models and systems to act as safeguards.
Using Multimodal Information in Conversational Systems
Researchers are currently applying multimodal context (such as prior interactions, information on the screen, gestures, gaze, and visual cues) to reduce ambiguity in conversational understanding. Workshop attendee and Apple machine learning researcher Hadas Kotek, who cowrote the paper Generating Natural Questions from Images for Multimodal Assistants, gave some examples in which such context might provide referents for interpreting phrases like “today,” “she,” “the lowest one,” or “that picture.”
In addition, workshop attendee and Professor of Systems Design Methodologies at the University of Washington, Mari Ostendorf, described other aspects of context important for natural dialogue: external knowledge, dialogue history, and prosodic and visual signals, not all of which can currently be handled successfully within a dialogue state tracking or slot-filling approach. She sketched one possible approach to including domain and task contextual knowledge via finite-state-machine-based graphs describing, for example, possible sequences of events in buyer-seller negotiation dialogues. To contextualize knowledge retrieval and language generation, she advocated for efficient representations of the dialogue history, including a structured dialogue state. Learn more about these aspects of context in Ostendorf's papers In-context Learning for Few-shot Dialogue State Tracking and CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning.
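As a rough illustration of the finite-state-machine idea described above, the sketch below encodes which dialogue acts may follow which in a buyer-seller negotiation, then checks whether a sequence of acts is a valid path through the graph. The act names and transitions are invented for this example and are not taken from Ostendorf's talk or papers.

```python
# Hypothetical dialogue acts and allowed transitions, invented for illustration.
TRANSITIONS = {
    "start": {"greet"},
    "greet": {"inquire", "offer"},
    "inquire": {"inform", "offer"},
    "inform": {"inquire", "offer"},
    "offer": {"counter", "accept", "reject"},
    "counter": {"counter", "accept", "reject"},
    "accept": set(),  # terminal state
    "reject": set(),  # terminal state
}


def allowed_next(state):
    """Return the dialogue acts that may follow the given state."""
    return TRANSITIONS[state]


def is_valid_dialogue(acts):
    """Check that a sequence of acts follows an allowed path from 'start'."""
    state = "start"
    for act in acts:
        if act not in TRANSITIONS[state]:
            return False
        state = act
    return True


print(is_valid_dialogue(["greet", "offer", "counter", "accept"]))
print(is_valid_dialogue(["greet", "accept"]))
print(sorted(allowed_next("offer")))
```

A graph like this could constrain or rescore what a dialogue system expects next, for instance by treating an "accept" that arrives before any "offer" as unlikely.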
Workshop participants also gave talks on how to learn meaningful representations of language and vision in tandem, such as the talk by workshop attendee and Apple machine learning researcher Yinfei Yang titled “STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.” Mixed representations are essential for systems that use text to search for images, as discussed by Yang and team in Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision, and for systems that describe images in text, as discussed by workshop attendee Kotek in the paper Generating Natural Questions from Images for Multimodal Assistants.
Using Foundation Models to Solve Data Synthesis Problems
Foundation models have demonstrated the capability to generate high-quality synthetic data with little or no graded data to learn from. Using synthetic data in place of manually labeled data reduces the need to show annotators any data that might contain personal information, helping to preserve privacy.
However, generating synthetic data creates some challenges. Researchers are still not clear on how to measure and ensure the quality — that is, the factual accuracy, naturalness, or similarity to human speech or writing — and diversity of the output data.
This is especially challenging for data generation over multiple turns, including conversational and task-based interactions. Research shows foundation models can lose factual accuracy and hallucinate information not present in the conversational context over longer interactions.
Researchers can tackle these challenges in several ways. Some promising methods being considered for future research use foundation models for review and analysis — applying the models to view the same problem multiple times, in different roles. Other methods involve some amount of human annotation or preference selection. Thus, the main open challenge here is to find ways to maximize the impact of human input.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig
Christopher Klein, David Q. Sun, Dhivya Piraviperumal, Hadas Kotek, Irina Belousova, Jason Williams, Kieran Liu, Matthew Henderson, Murat Akbacak, Stephen Pulman, Tatiana Likhomanenko, Thomas Voice, Yimai Fang, and Yinfei Yang.