
We describe a method for selecting relevant new training data for the LSTM-based domain selection component of our personal assistant system. Adding more annotated training data for any ML system typically improves accuracy, but only if it provides examples not already adequately covered in the existing data. However, obtaining, selecting, and labeling relevant data is expensive. This work presents a simple technique that automatically identifies new helpful examples suitable for human annotation. Our experimental results show that the proposed method, compared with random-selection and entropy-based methods, leads to higher accuracy improvements given a fixed annotation budget. Although developed and tested in the setting of a commercial intelligent assistant, the technique is of wider applicability.
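The paper's own selection criterion is not spelled out in this summary, so the sketch below only illustrates the two baselines it is compared against: random selection and entropy-based uncertainty sampling over a domain classifier's softmax outputs. Names such as `select_for_annotation` and `domain_probs` are illustrative assumptions, not the authors' API.

```python
import numpy as np

def entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of each row of a (num_examples, num_domains) probability matrix."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_for_annotation(domain_probs: np.ndarray, budget: int,
                          strategy: str = "entropy") -> np.ndarray:
    """Pick indices of unlabeled examples to send for human annotation.

    domain_probs: softmax outputs of a domain classifier (e.g. an LSTM),
                  shape (num_examples, num_domains).
    budget:       number of examples the annotation budget allows.
    strategy:     'entropy' keeps the most uncertain examples;
                  'random' is the random-selection baseline.
    """
    if strategy == "entropy":
        scores = entropy(domain_probs)
        return np.argsort(scores)[::-1][:budget]   # highest-entropy examples first
    if strategy == "random":
        rng = np.random.default_rng(0)
        return rng.choice(len(domain_probs), size=budget, replace=False)
    raise ValueError(f"unknown strategy: {strategy}")

# Example: 5 unlabeled utterances scored over 3 domains, budget of 2 annotations.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low annotation value
    [0.40, 0.35, 0.25],   # uncertain -> high annotation value
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most uncertain
    [0.90, 0.05, 0.05],
])
print(select_for_annotation(probs, budget=2))      # -> [3 1]
```

The proposed method is reported to outperform both of these baselines at a fixed annotation budget; the code above is only a reference point for what "random-selection and entropy-based methods" denote.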

Related readings and updates.

Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

This paper was accepted at the Data-Centric AI Workshop at the NeurIPS 2021 conference. As the adoption of deep learning techniques in industrial applications grows with increasing speed and scale, successful deployment of deep learning models often hinges on the availability, volume, and quality of annotated data. In this paper, we tackle the problems of efficient data labeling and annotation verification under the human-in-the-loop setting. We…

Improving Human-Labeled Data through Dynamic Automatic Conflict Resolution

This paper develops and implements a scalable methodology for (a) estimating the noisiness of labels produced by a typical crowdsourcing semantic annotation task, and (b) reducing the resulting error of the labeling process by as much as 20-30% in comparison to other common labeling strategies. Importantly, this new approach to the labeling process, which we name Dynamic Automatic Conflict Resolution (DACR), does not require a ground truth…