Apple Workshop on Human-Centered Machine Learning 2024

A human-centered approach to machine learning (HCML) involves designing machine learning (ML) and AI technology that prioritizes the needs and values of the people using it. This leads to AI that complements and enhances human capabilities, rather than replacing them. Research in HCML includes developing transparent and interpretable machine learning systems that help people feel safer using AI, as well as strategies for predicting and preventing potentially negative societal impacts of the technology. The human-centered approach to ML aligns with our focus on responsible AI development, which includes empowering users with intelligent tools, representing our users, designing with care, and protecting privacy.
To continue to advance the state of the art in HCML, Apple brought together experts from the broader research community for a Workshop on Human-Centered Machine Learning. Speakers and attendees included both Apple and academic researchers, and discussions focused on topics such as wearables and ubiquitous computing, the implications of foundation models, and accessibility, all through the lens of privacy and safety.
In this post we share highlights from workshop discussions and recordings of select workshop talks.
Engineering Better UIs via Collaboration with Screen-Aware Foundation Models
Presented by Kevin Moran
UI Understanding
Presented by Jeff Nichols
AI-Resilient Interfaces
Presented by Elena Glassman
Tiny but Powerful: Human-Centered Research to Support Efficient On-Device ML
Presented by Mary Beth Kery
Speech Technology for People with Speech Disabilities
Presented by Colin Lea and Dianna Yee
AI-Powered AR Accessibility
Presented by Jon Froehlich
Vision-Based Hand Gesture Customization from a Single Demonstration
Presented by Cori Park
Creating Superhearing: Augmenting Human Auditory Perception with AI
Presented by Shyam Gollakota
Foundation models offer many opportunities for creating improved user interfaces that go beyond chat bots and other language-centric tasks, and many novel use cases for foundation models were discussed throughout the workshop.
In the talk, "Engineering Better UIs via Collaboration with Screen-Aware Foundation Models," Kevin Moran of the University of Central Florida shared how foundation models can improve productivity for software developers, with a particular focus on the use of models for building user interfaces. Large-scale UI datasets, such as Rico and WebUI, and screen-aware foundation models, such as Ferret UI, are important building blocks for this type of work.
Moran shared an example of his work in bug reporting, which showed that foundation models can help identify the user steps that reproduce a bug, both by assisting the end user in providing comprehensive and accurate information and by connecting problems in the user interface to the code that is causing them. Moran shared that this work not only improved productivity overall, but could also ultimately lead to fewer bugs for users.
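As a rough illustration of this kind of pipeline (not the published approach), the sketch below asks a language model to turn a free-form bug report into structured reproduction steps. The complete function and the prompt format are hypothetical placeholders for whatever LLM API and prompting strategy a team might use.

```python
# Hypothetical sketch: extracting reproduction steps from a free-form bug report.
# `complete` is a placeholder for any text-completion API; swap in a real client.
import json

def complete(prompt: str) -> str:
    """Placeholder LLM call; replace with a real text-completion client."""
    raise NotImplementedError("Wire this up to the LLM of your choice.")

PROMPT_TEMPLATE = """You are helping triage a UI bug report.
Bug report:
{report}

List the user steps needed to reproduce the bug as a JSON array of short
imperative strings (e.g., ["Open Settings", "Tap Profile"]). Output JSON only.
"""

def extract_repro_steps(report: str) -> list[str]:
    """Ask the model for structured reproduction steps and parse its answer."""
    raw = complete(PROMPT_TEMPLATE.format(report=report))
    return json.loads(raw)
```

In a real system, the parsed steps could then be matched against UI screens or automated tests to help link the report to the offending code.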
In the talk, "UI Understanding," Jeff Nichols of Apple presented work with the long-term goal to give machines human-level abilities to interact with user interfaces. Nichols showed how foundation models are advancing the field in four areas:
In all of this work, the foundation model is an integral part of the system, but operates in the background through calls from other software, rather than in the foreground as a chat bot might.
Hari Subramonyam of Stanford University presented several principles and concepts that can help designers and developers think about the use of foundation models in user interfaces. A particular challenge is that the models themselves are extremely open-ended, making it difficult for users to anticipate and control the output of a model based on the prompt they provide. This open-endedness outpaces users' understanding of the foundation model, which in turn creates a "Gulf of Envisioning" in which users are not sure what prompt will produce the desired output. To learn more, see the paper Bridging the Gulf of Envisioning: Cognitive Challenges in Prompt Based Interactions with LLMs.
By addressing the challenges of user understanding through better interface design, and by leveraging large-scale UI datasets, foundation models can boost productivity, reduce errors, and automate complex tasks, offering transformative potential for intelligent, adaptable user interfaces that enhance both user experience and development productivity. All of the talks illustrated how critical it is to deeply understand, and to help users understand, the interactive properties of foundation models. While there are many exciting applications for foundation models in user interfaces, much work remains.
Foundation models are trained on increasingly large datasets and are reaching complexities far beyond what users can fully understand. During the workshop, researchers shared ways to help users better understand the inner workings of modern AI systems, as well as methods for evaluating how models may behave when deployed.
In the talk "AI-Resilient Interfaces," Elena Glassman of Harvard University argued that while AI is powerful, it can make choices that result in objective errors, contextually inappropriate outputs, and disliked options. She proposed AI-resilient interfaces that help people be resilient to the AI choices that are not right, or not right for them. These interfaces help users notice and have the context to appropriately judge AI choices.
For example, in a non-resilient interface, a summary of an article may omit details that are critical for a reader. A resilient interface could instead emphasize the most important aspect of a text and de-emphasize, but not hide, context. This interface is resilient to specific choices of the AI since the reader can still choose to read the de-emphasized context. In this talk, Glassman defined key aspects of AI-resilient interfaces, illustrated with examples from recent tools that support users' mental modeling of, iterative prototyping with, and leveraging of LLMs for particular tasks (e.g., ChainForge and Positional Diction Clustering). She concluded by emphasizing that a well-designed, AI-resilient interface would improve AI safety, usability, and utility.
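A minimal sketch of the de-emphasize-but-don't-hide idea, assuming some importance model is available: model-selected sentences are rendered emphasized while the rest stay visible but dimmed. The pick_key_sentences heuristic and the terminal styling below are illustrative assumptions, not the technique from the tools Glassman described.

```python
# Illustrative sketch: render a document so model-selected sentences stand out
# while the remaining context stays visible (dimmed), never hidden.

DIM, BOLD, RESET = "\033[2m", "\033[1m", "\033[0m"

def pick_key_sentences(sentences: list[str], k: int = 2) -> set[int]:
    """Placeholder 'importance' model: here, simply the k longest sentences.
    In practice this is where an LLM or summarizer would plug in."""
    ranked = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    return set(ranked[:k])

def render_resilient(text: str) -> str:
    """Emphasize key sentences, de-emphasize (but keep) the rest."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    keep = pick_key_sentences(sentences)
    parts = []
    for i, sentence in enumerate(sentences):
        style = BOLD if i in keep else DIM
        parts.append(f"{style}{sentence}.{RESET}")
    return " ".join(parts)

if __name__ == "__main__":
    article = ("The update changes the default privacy settings. "
               "Most users will not notice any difference. "
               "Location history is now off unless explicitly enabled.")
    print(render_resilient(article))
```

The key design property is that nothing is removed: a reader who distrusts the model's selection can still read the dimmed context.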
Arvind Satyanarayan of the Massachusetts Institute of Technology suggested a shift in AI evaluation beyond the Turing Test’s narrow focus on mimicking human behavior. While valuable, current evaluative approaches are unable to help us assess an increasingly pressing concern: are AI-driven systems empowering or disempowering people?
Satyanarayan advocated for a conceptual shift that redefines intelligence as agency: the capacity to meaningfully act, rather than the capacity to perform a task. He demonstrated how to operationalize this definition of agency through a case study on using generative AI to improve the accessibility of web-based data visualizations with hierarchical and semantically meaningful textual descriptions. He echoed Glassman’s emphasis on the importance of developing AI that respects user autonomy and offers nuanced control over data comprehension and interpretation. To learn more, see the related papers Intelligence as Agency: Evaluating the Capacity of Generative AI to Empower or Constrain Human Action and VisText: A Benchmark for Semantically Rich Chart Captioning.
In the talk, “Tiny but Powerful: Human-Centered Research to Support Efficient On-Device ML,” Mary Beth Kery of Apple shared research on deploying machine learning models on-device, using interactive tools to compress models and track the resulting improvements. This research offers pragmatic considerations, covering the design process, trade-offs, and technical strategies that go into creating efficient models. With the tools Kery presented, ML practitioners can analyze and interact with models and experiment with model optimizations on hardware. This work emphasizes the importance of tools for translating ML research into intelligent, on-device ML experiences.
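As a generic illustration of the kind of trade-off bookkeeping such tooling supports (not the tools Kery presented), this sketch applies PyTorch dynamic quantization to a small model and reports the change in serialized size; real on-device work would also profile latency, memory, and accuracy on target hardware.

```python
# Sketch: measure the size impact of a simple compression step (dynamic
# quantization of Linear layers). Illustrative only; the model and sizes
# here are arbitrary stand-ins.
import io
import torch
import torch.nn as nn

def serialized_size_bytes(model: nn.Module) -> int:
    """Serialize the model's weights in memory and return the byte count."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

before = serialized_size_bytes(model)
after = serialized_size_bytes(quantized)
print(f"fp32: {before / 1024:.1f} KiB -> int8 dynamic: {after / 1024:.1f} KiB")
```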
Foundation models and recent AI advances have the potential to address longstanding "grand challenges" in accessibility. During the workshop, researchers presented work to enhance accessibility for sign language and speech, address accessibility needs in the physical world, and empower disabled creators to both consume and produce digital content.
In the talk, "Speech Technology for People with Speech Disabilities," Colin Lea and Dianna Yee of Apple combined human-centered inquiry into how people with speech disabilities experience and want to use speech technology, with technical machine learning innovations to improve accessibility of speech recognition. The talk covered three projects:
In the talk, "AI-Powered AR Accessibility," Jon Froehlich of the University of Washington focused on how AI and new interactive technologies can make the real world more accessible for people with disabilities. First, with Project Sidewalk, Froehlich's research group combined online map imagery such as Street View and satellite data with crowdsourced data collection and AI to scale up sidewalk accessibility data. With 1.7 million sidewalk accessibility labels to date (e.g., curb ramps, uneven surfaces, missing sidewalks) and deployments in 21 cities across four continents, Project Sidewalk has been used to transform and help fund sidewalk accessibility programs, to create new interactive visualizations of sidewalk infrastructure, and to help train AI models. Froehlich also covered a range of projects leveraging advances in foundation models and augmented reality hardware, including:
Learn more about the ongoing projects at the Makeability Lab.
Amy Pavel of the University of Texas, Austin presented work on accessible creativity support with generative AI. Pavel argued that a primary challenge with accessibility for blind and low vision users is that user interface design patterns have been optimized over the last 50 years to prioritize visual access. Pavel proposed that redesigning interfaces centered on other modalities of access can yield new and counterintuitive design patterns that should be useful more broadly.
Pavel also covered a series of projects that explore this approach to interface redesign in a variety of contexts. One project, called the "GenAssist Technical Pipeline", explores image generation for blind and low vision users by proactively providing a structured summary and comparison of multiple images for the user to explore, based on questions the model expects a blind user may have. This same body of work also yields general requirements for accessible creativity support tools for blind and low vision creators that can inform future tool development. Read more about this work in GenAssist: Making Image Generation Accessible and AVscript: Accessible Video Editing with Audio-Visual Scripts.
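A rough sketch of the pattern behind this kind of proactive description, assuming access to some visual question answering model: ask each candidate image the same questions a blind or low vision user is likely to have, then tabulate the answers so the candidates can be compared question by question. The answer_visual_question stub and the question list are hypothetical, not the GenAssist pipeline itself.

```python
# Illustrative sketch: build a structured, comparable description of several
# candidate generated images by asking each one the same questions.
# `answer_visual_question` is a hypothetical stand-in for a VQA/captioning model.

QUESTIONS = [
    "What is the main subject?",
    "What is in the background?",
    "What is the overall style and color palette?",
]

def answer_visual_question(image_path: str, question: str) -> str:
    """Placeholder for a vision-language model call."""
    raise NotImplementedError("Replace with a real VQA or captioning model.")

def compare_candidates(image_paths: list[str]) -> dict[str, dict[str, str]]:
    """Return {image: {question: answer}} so a screen reader can walk the
    comparison question by question instead of image by image."""
    return {
        path: {q: answer_visual_question(path, q) for q in QUESTIONS}
        for path in image_paths
    }
```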
Wearables and ubiquitous computing are rapidly emerging as essential components of the broader machine learning landscape. These technologies enable the seamless collection of continuous, real-time data, which is crucial for developing intelligent, context-aware systems with the potential for improved personalization for users and enhanced human-computer interaction.
In the talk, "Vision-Based Hand Gesture Customization from a Single Demonstration," Cori Park of Apple focused on recent work on Vision-based Gesture Customization using meta-learning. As the field transitions into the era of Mixed Reality (XR), hand gestures have become increasingly prevalent in interactions with XR headsets and other devices equipped with vision capabilities. Despite continued progress in this field, gesture customization is often underexplored. Customization is crucial because it enables users to define and demonstrate gestures that are more natural, memorable, and accessible. Defining new custom gestures generally requires extensive effort from the user to provide training samples and evaluate performance. In the talk, Park explained implementing meta-learning using a transformer architecture trained on 2D skeletal points of hands. The method was tested on various hand gestures, including static, dynamic, one-handed, and two-handed gestures, as well as multiple views, such as egocentric and allocentric. Although meta-learning has been applied to various classification problems, this work is the first to enable a vision-based gesture customization methodology from a single demonstration using this technique.
Thomas Plötz of the Georgia Institute of Technology noted that despite significant advancements in AI, particularly in Human Activity Recognition (HAR) on wearables, challenges remain. One major issue is the scarcity of large, labeled activity datasets, which limits the effectiveness of supervised learning methods. To address this challenge, researchers have explored more flexible and cost-effective methods for acquiring labeled data, such as generating virtual sensor data from textual descriptions of activities. This work is detailed in IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition and On the Benefit of Generative Foundation Models for Human Activity Recognition.
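The sketch below only outlines the general shape of such a text-to-virtual-sensor pipeline, with every stage left as a clearly labeled placeholder; the stage boundaries and function names are assumptions for illustration, not the papers' implementation.

```python
# Hedged sketch of a "generate virtual labeled sensor data from text" pipeline.
# Every stage is a placeholder stub; the stage boundaries are illustrative
# assumptions, not code from the cited work.
from typing import List

def generate_activity_descriptions(activity: str, n: int) -> List[str]:
    """Placeholder: a language model writes n varied descriptions of the activity."""
    raise NotImplementedError

def text_to_motion(description: str):
    """Placeholder: a text-to-motion model turns a description into a 3D motion sequence."""
    raise NotImplementedError

def motion_to_virtual_imu(motion, sensor_location: str = "wrist"):
    """Placeholder: derive accelerometer/gyroscope traces at a chosen body location."""
    raise NotImplementedError

def build_virtual_dataset(activities: List[str], per_activity: int = 10):
    """Assemble (virtual IMU signal, label) pairs to train or augment a HAR model."""
    dataset = []
    for activity in activities:
        for description in generate_activity_descriptions(activity, per_activity):
            motion = text_to_motion(description)
            dataset.append((motion_to_virtual_imu(motion), activity))
    return dataset
```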
In the talk, "Creating Superhearing: Augmenting human auditory perception with AI," Shyam Gollakota of the University of Washington presented innovative concepts for enhancing human hearing through the use of AI, aiming to enable what he refers to as "superhearing" abilities. The talk covered three primary methods of sound augmentation that can be implemented using standard noise-canceling headphones:
Gollakota emphasized that these superhearing abilities have great potential to bring transformative experiences to headset and hearing aid users in the near future.
Human-centered machine learning, and the impacts of foundation models, trustworthiness, wearables, and accessibility on end users, remain a key focus of research in academia and industry, and we look forward to continuing to participate in this area of research.
AidUI: Toward Automated Recognition of Dark Patterns in User Interfaces by SM Hasan Mansur (George Mason University), Sabiha Salma (George Mason University), Damilola Awofisayo (Duke University), Kevin Moran (University of Central Florida)
An AI-Resilient Text Rendering Technique for Reading and Skimming Documents by Ziwei Gu (Harvard University), Ian Arawjo (Harvard University), Kenneth Li (Harvard University), Jonathan Kummerfeld (University of Sydney), Elena Glassman (Harvard University)
AVscript: Accessible Video Editing with Audio-Visual Scripts by Mina Huh (The University of Texas at Austin), Saelyne Yang (KAIST), Yi-Hao Peng (Carnegie Mellon University), Xiang 'Anthony' Chen (University of California, Los Angeles), Young-Ho Kim (NAVER AI Lab), Amy Pavel (The University of Texas at Austin)
Bridging the Gulf of Envisioning: Cognitive Challenges in Prompt Based Interactions with LLMs by Hari Subramonyam (Stanford University), Roy Pea (Stanford University), Christopher Lawrence Pondoc (Stanford University), Maneesh Agrawala (Stanford University), Colleen Seifert (University of Michigan)
ClearBuds: wireless binaural earbuds for learning-based speech enhancement by Ishan Chatterjee (University of Washington), Maruchi Kim (University of Washington), Vivek Jayaram (University of Washington), Shyamnath Gollakota (University of Washington), Ira Kemelmacher (University of Washington), Shwetak Patel (University of Washington), Steven M. Seitz (University of Washington)
GenAssist: Making Image Generation Accessible by Mina Huh (The University of Texas at Austin), Yi-Hao Peng (Carnegie Mellon University), Amy Pavel (The University of Texas at Austin)
Hypernetworks for Personalizing ASR to Atypical Speech by Max Müller-Eberstein (IT University of Copenhagen), Dianna Yee (Apple), Karren Yang (Apple), Gautam Varma Mantena (Apple), Colin Lea (Apple)
IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition by Zikang Leng (Georgia Institute of Technology), Amitrajit Bhattacharjee (Georgia Institute of Technology), Hrudhai Rajasekhar (Georgia Institute of Technology), Lizhe Zhang (Georgia Institute of Technology), Elizabeth Bruda (Georgia Institute of Technology), Hyeokhyen Kwon (Emory University), Thomas Plötz (Georgia Institute of Technology)
Intelligence as Agency: Evaluating the Capacity of Generative AI to Empower or Constrain Human Action by Arvind Satyanarayan (MIT) and Graham M. Jones (MIT)
Look Once to Hear: Target Speech Hearing with Noisy Examples by Bandhav Veluri (University of Washington), Malek Itani (University of Washington), Tuochao Chen (University of Washington), Takuya Yoshioka (AssemblyAI), Shyamnath Gollakota (University of Washington)
Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences by Fred Hohman (Apple), Mary Beth Kery (Apple), Donghao Ren (Apple), Dominik Moritz (Apple)
On the Benefit of Generative Foundation Models for Human Activity Recognition by Zikang Leng (Georgia Institute of Technology), Hyeokhyen Kwon (Emory University), Thomas Plötz (Georgia Institute of Technology)
On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization by Junayed Mahmud (University of Central Florida), Nadeeshan De Silva (William & Mary), Safwat Ali Khan (George Mason University), Seyed Hooman Mostafavi (George Mason University), SM Hasan Mansur (George Mason University), Oscar Chaparro (William & Mary), Andrian Marcus (George Mason University), Kevin Moran (University of Central Florida)
Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables by Bandhav Veluri (University of Washington), Malek Itani (University of Washington), Justin Chan (University of Washington), Takuya Yoshioka (Microsoft), Shyamnath Gollakota (University of Washington)
Vision-Based Hand Gesture Customization from a Single Demonstration by Soroush Shahi (Apple), Cori Tymoszek Park (Apple), Richard Kang (Apple), Asaf Liberman (Apple), Oron Levy (Apple), Jun Gong (Apple), Abdelkareem Bedri (Apple), Gierad Laput (Apple)
VisText: A Benchmark for Semantically Rich Chart Captioning by Benny J. Tang (MIT), Angie Boggiest (MIT), Arvind Satyanarayan (MIT)
Many people contributed to this workshop, including Kareem Bedri, Jeffrey P. Bigham, Leah Findlater, Mary Beth Kery, Gierad Laput, Colin Lea, Halden Lin, Dominik Moritz, Jeff Nichols, Cori Park, Griffin Smith, Amanda Swearngin, and Dianna Yee.