The two most common ways to activate intelligent voice assistants (IVAs) are button presses and trigger phrases. This paper describes a new way to invoke IVAs on smartwatches: simply raise your hand and speak naturally. To achieve this experience, we designed an accurate, low-power detector that works on a wide range of environments and activity scenarios with minimal impact to battery life, memory footprint, and processor utilization.
The raise to speak (RTS) detector consists of four main components: an on-device gesture convolutional neural network (CNN) that uses accelerometer data to detect specific poses; an on-device speech CNN to detect proximal human speech; a policy model to combine signals from the motion and speech detector; and an off-device false trigger mitigation (FTM) system to reduce unintentional invocations trigged by the on-device detector. Majority of the components of the detector run on-device to preserve user privacy.
The RTS detector was released in watchOS 5.0 and is running on millions of devices worldwide