Learning with Privacy at Scale
AuthorsDifferential Privacy Team
AuthorsDifferential Privacy Team
Understanding how people use their devices often helps in improving the user experience. However, accessing the data that provides such insights — for example, what users type on their keyboards and the websites they visit — can compromise user privacy. We develop a system architecture that enables learning at scale by leveraging local differential privacy, combined with existing privacy best practices. We design efficient and scalable local differentially private algorithms and provide rigorous analyses to demonstrate the tradeoffs among utility, privacy, server computation, and device bandwidth. Understanding the balance among these factors leads us to a successful practical deployment using local differential privacy. This deployment scales to hundreds of millions of users across a variety of use cases, such as identifying popular emojis, popular health data types, and media playback preferences in Safari. We provide additional details about our system in the full version.
Gaining insight into the overall user population is crucial to improving the user experience. The data needed to derive such insights is personal and sensitive, and must be kept private. In addition to privacy concerns, practical deployments of learning systems using this data must also consider resource overhead, computation costs, and communication costs. In this article, we give an overview of a system architecture that combines differential privacy and privacy best practices to learn from a user population.
Differential privacy  provides a mathematically rigorous definition of privacy and is one of the strongest guarantees of privacy available. It is rooted in the idea that carefully calibrated noise can mask a user’s data. When many people submit data, the noise that has been added averages out and meaningful information emerges.
Within the differential privacy framework, there are two settings: central and local. In our system, we choose not to collect raw data on the server which is required for central differential privacy; hence, we adopt local differential privacy, which is a superior form of privacy . Local differential privacy has the advantage that the data is randomized before being sent from the device, so the server never sees or receives raw data.
Our system is designed to be opt-in and transparent. No data is recorded or transmitted before the user explicitly chooses to report usage information. Data is privatized on the user's device using event-level differential privacy  in the local model where an event might be, for example, a user typing an emoji. Additionally, we restrict the number of transmitted privatized events per use case. The transmission to the server occurs over an encrypted channel once per day, with no device identifiers. The records arrive on a restricted-access server where IP identifiers are immediately discarded, and any association between multiple records is also discarded. At this point, we cannot distinguish, for example, if an emoji record and a Safari web domain record came from the same user. The records are processed to compute statistics. These aggregate statistics are then shared internally with the relevant teams at Apple.
We focus on the problem of estimating frequencies of elements --- for example, emojis and web domains. In estimating frequencies of elements, we consider two subproblems. In the first, we compute the histogram from a known dictionary of elements. In the second, the dictionary is unknown and we want to obtain a list of the most frequent elements in a dataset.
Our system architecture consists of device-side and server-side data processing. On the device, the privatization stage ensures raw data is made differentially private. The restricted-access server does data processing that can be further divided into the ingestion and aggregation stages. We explain each stage in detail below.
Users have the option, in System Preferences on macOS or in Settings on iOS, to share privatized records for analytics. For users who do not opt in, the system remains inactive. For users who do opt in, we define a per-event privacy parameter,
Whenever an event is generated on device, the data is immediately privatized via
On iOS, the reports are visible in Settings
The privatized records are first stripped of their IP addresses prior to entering the ingestor. The ingestor then collects the data from all users and processes them in a batch. The batch process removes metadata, such as the timestamps of privatized records received, and separates these records based on their use case. The ingestor also randomly permutes the ordering of privatized records within each use case before forwarding the output to the next stage.
The aggregator takes the private records from the ingestor and for each use case generates a differentially private histogram according to the algorithms described in the section below. The data from multiple use cases is never combined when computing statistics. In these histograms, only domain elements whose counts are above a prescribed threshold
We now describe three local differentially private algorithms in the following sections.
The Private Count Mean Sketch algorithm (CMS) aggregates records submitted by devices and outputs a histogram of counts over a dictionary of domain elements, while preserving local differential privacy. This happens in two stages: client-side processing followed by server-side aggregation.
We illustrate the process with an example. Suppose a user visits the web domain
The server-side algorithm constructs a sketch matrix
As records arrive on the server, the algorithm adds the privatized vector to the vector at row
To compute the frequency for a web domain
We describe in the full version of this article how increasing the device bandwidth will lead to more accurate counts in CMS. However, this leads to higher transmissions costs to users. We wanted to have a minimal impact on accuracy while reducing transmission cost. This led us to design the Private Hadamard Count Mean Sketch algorithm (HCMS), which has the advantage that the device can send one bit with a small loss in accuracy. With HCMS, it is possible to achieve reasonably accurate counts without having users pay a high transmission cost. We quantify the accuracy that we obtain with HCMS in the full version.
We now present HCMS with an example. Suppose a user visits the web domain
Similar to CMS, the server-side algorithm uses a data structure, sketch matrix
The previous algorithms assume there is some known dictionary of domain elements through which the server can enumerate in order to determine the corresponding counts. However,
in certain use cases, the domain is massive and enumerating over the full space is computationally prohibitive. For example, in discovering frequently typed new words, even if we restrict ourselves to 10-letter case-sensitive English words, this approach would require the server to loop over at least
Instead, we develop an algorithm called Sequence Fragment Puzzle (SFP) and present it in the setting of discovering new words. We take advantage of the fact that given a popular string, any substring of that string is also at least as popular. On the device, we use the client-side CMS algorithm to privatize the typed word. Additionally, we select a substring of the word and concatenate it with an 8-bit hash of the word. We refer to the small hash as the puzzle piece and the substring concatenated with the hash as the fragment. The fragment is privatized using CMS and also transmitted to the server along with the privatized word. For example, if the word is
Using sketches for fragments, the server-side algorithm obtains a histogram over all possible fragments for each substring location. The puzzle piece allows the server to correlate fragments from the same word since all fragments from a word will have the same puzzle piece. Then, restricting itself to the most popular fragments, the server algorithm determines a list of candidate strings by concatenating popular fragments whose puzzle pieces match. The set of candidate strings forms a dictionary of reasonable size and lets us use the CMS algorithm on the full word.
We present three use cases below to illustrate how our algorithms are used to enhance product features while protecting user privacy.
Given the popularity of emojis across our user base, we want to determine which specific emojis are most used by our customers and the relative distribution of these characters. To that end, we deploy our algorithms to understand the distribution of emojis used across keyboard locales. For this use case, we set the parameters for CMS to be
The data shows many differences across keyboard locales. In Figure 6, we observe snapshots from two locales: English and French. Using this data, we can improve our predictive emoji QuickType across locales.
Some websites are exceedingly resource-intensive, and we wish to identify these sites in order to ensure a better user experience. We consider two types of domains: those that cause high memory usage and those that cause excessive energy drain from CPU usage. In iOS 11 and macOS High Sierra, Safari can automatically detect these exceptional domains and report them using differential privacy.
Using our algorithms, we are able to determine which domains have high resource consumption.
For this use case, we set the parameters for HCMS to be
We want to learn words that are not present in the lexicons included on the device in order to improve auto-correction. To discover new words, we deploy the Sequence Fragment Puzzle (SFP) algorithm described above.
The algorithm produces results spanning several languages, including English, French, and Spanish. The learned words for the English keyboard, for example, can be divided into multiple categories: abbreviations like wyd, wbu, idc; popular expressions like bruh, hun, bae, and tryna, seasonal or trending words like Mayweather, McGregor, Despacito, Moana, and Leia; and foreign words like dia, queso, aqui, and jai. Using the data, we are constantly updating our on-device lexicons to improve the keyboard experience.
Another category of words discovered are known words without the trailing e (lov or th) or w (kno). If users accidentally press the left-most prediction cell above the keyboard, which contains the literal string typed thus far, a space will be added to their current word instead of the character they intended to type. This is a key insight that we were able to learn due to our local differentially private algorithm.
In this article, we have presented a novel learning system architecture, which leverages local differential privacy and combines it with privacy best practices. To scale our system to millions of users and a variety of use cases, we have developed novel local differentially private algorithms -- CMS, HCMS, and SFP -- for both the known and unknown dictionary settings. In our full paper, we have provided analytic expressions for the tradeoffs among various factors, including privacy, utility, server computation overhead, and device bandwidth. Our utility theorems give a principled way to choose algorithmic parameters to minimize the transmission cost for the users without lowering accuracy. Without such expressions, it is difficult to evaluate the impact on accuracy if, for example, transmission cost is reduced without running costly iterations. Further, to keep transmission costs to an absolute minimum, our HCMS algorithm can obtain accurate counts when each user sends only a single privatized bit. We believe that our paper is one of the first to demonstrate the successful deployment of local differential privacy, , in a real-world setting across multiple use cases. We have shown that we could find popular abbreviations and slang words typed, popular emojis, popular health data types while satisfying local differential privacy. Further, we can identify websites that are consuming too much energy and memory, and websites where users want Auto-play. This information has been used to improve features for the benefit of the user experience.
We hope that our work will serve a role in bridging the gap between theory and practice of private systems. We also believe our work will continue to support research in a broad set of large-scale learning problems while preserving users' privacy.
 C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. TCC, 2006.
 C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.
 C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential Privacy Under Continual Observation. Proceedings of the Forty-second ACM Symposium on Theory of Computing, 2010.
 G. Fanti, V. Pihur, and Ú. Erlingsson. Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries. PoPETS, 2016.
 Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren. Heavy Hitter Estimation Over Set-valued Data with Local Differential Privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
 Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Proceedings of the 21st ACM Conference on Computer and Communications Security, 2014.
Earlier this year, Apple hosted the Workshop on Privacy-Preserving Machine Learning (PPML). This virtual event brought Apple and members of the academic research communities together to discuss the state of the art in the field of privacy-preserving machine learning through a series of talks and discussions over two days.