Entering text on your iPhone, discovering news articles you might enjoy, finding out answers to questions you may have, and many other language-related tasks depend upon robust natural language processing (NLP) models. Word embeddings are a category of NLP models that mathematically map words to numerical vectors. This capability makes it fairly straightforward to find numerically similar vectors or vector clusters, then reverse the mapping to get relevant linguistic information. Such models are at the heart of familiar apps like News, search, Siri, keyboards, and Maps. In this article, we explore whether we can improve word predictions for the QuickType keyboard using global semantic context.


You shall know a word by the company it keeps.

John Rupert Firth, Linguist

Today, most techniques for training word embeddings capture the local context of a given word in a sentence as a window containing a relatively small number of words (say, 5) before and after the word in question—“the company it keeps” nearby. For example, the word “self-evident” in the U.S. Declaration of Independence has a local context given by “hold these truths to be” on the left and “that all men are created” on the right.

In this article, we describe an extension of this approach to one that captures instead the entire semantic fabric of the document—for example, the entire Declaration of Independence. Can this global semantic context result in better language models? Let’s first take a look at the current use of word embeddings.

Word Embeddings

Continuous-space word embeddings trained in an unsupervised manner are useful for a variety of natural language processing (NLP) tasks, like information retrieval [1], document classification [2], question answering [3], and connectionist language modeling [4]. The most elementary embedding is based on 1-of-N encoding, where every word in an underlying vocabulary of size N is represented by a sparse vector of dimension N (with 1 at the index of the word and 0 elsewhere). More complex embeddings map words into dense vectors in a lower-dimensional continuous vector space. In the course of this dimensionality reduction, the mapping advantageously captures a certain amount of semantic, syntactic, and pragmatic information about the words. Computing distances between vectors then produces similarity measures on words that wouldn’t be possible to compute directly on the individual words themselves.
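To make the contrast between sparse and dense representations concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the two-dimensional dense vectors are invented for illustration and don't come from any trained model. The sketch shows that 1-of-N vectors for distinct words always have zero similarity, while dense vectors can express graded similarity.

```python
import math

vocab = ["king", "queen", "apple"]  # toy vocabulary, N = 3 (assumption)

def one_hot(word):
    """1-of-N encoding: 1 at the word's index, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical dense embeddings (dimension 2), hand-picked for illustration.
dense = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Distinct one-hot vectors are orthogonal, so they carry no similarity signal.
sparse_sim = cosine(one_hot("king"), one_hot("queen"))

# Dense vectors can place related words close together and unrelated ones apart.
dense_sim_close = cosine(dense["king"], dense["queen"])
dense_sim_far = cosine(dense["king"], dense["apple"])
```

With these toy values, the sparse similarity is exactly zero, while the dense vectors rank "queen" far closer to "king" than "apple" is.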

There are two basic classes of dimension-reducing word embeddings:

  • Representations derived from the local context of the word (for example, the previous L and next L words, where L is typically a small integer).
  • Representations that exploit the global context surrounding the word (for example, the entire document in which it appears)
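The difference between the two kinds of context can be sketched in a few lines of Python. The helper name `local_context` and the whitespace tokenization are assumptions made for this illustration. The local context is a slice of L words on each side, whereas the global context is simply every token in the document.

```python
def local_context(tokens, index, window):
    """Return the `window` words before and after position `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right

# A fragment of the U.S. Declaration of Independence, naively tokenized.
tokens = ("we hold these truths to be self-evident "
          "that all men are created equal").split()

# Local context: L = 5 words on each side of "self-evident".
left, right = local_context(tokens, tokens.index("self-evident"), window=5)

# Global context: the whole text chunk, regardless of distance from the word.
global_context = tokens
```

Here `left` is "hold these truths to be" and `right` is "that all men are created", matching the example given earlier in the article.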

Methods that leverage local context include:

  • Prediction-based approaches using neural network architectures [5], like continuous bag-of-words and skip-gram models
  • The projection layer technique from connectionist language modeling [6]
  • Bottleneck representations using an autoencoder configuration [7]

Methods that leverage global context include:

  • Global matrix factorization approaches such as latent semantic mapping (LSM), which uses word-document co-occurrence counts [8]
  • Log-linear regression modeling like Global Vectors for Word Representation (GloVe), which uses word-word co-occurrence counts [9]
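The two kinds of count statistics named in the list above can be sketched as follows. The toy documents are invented, and the unweighted symmetric window of two words is an assumption made for brevity (GloVe itself weights co-occurrences by distance). Neither LSM's matrix factorization nor GloVe's regression step is shown, only the counting that feeds them.

```python
from collections import Counter

# Toy corpus of three tokenized "documents" (assumption, for illustration).
docs = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
    "apple pie recipe".split(),
]

# Word-document counts: how often each word occurs in each document.
word_doc = Counter()
for doc_id, doc in enumerate(docs):
    for word in doc:
        word_doc[(word, doc_id)] += 1

def word_word_counts(documents, window=2):
    """Word-word counts: how often two words fall within `window` positions."""
    counts = Counter()
    for doc in documents:
        for i, word in enumerate(doc):
            lo = max(0, i - window)
            hi = min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, doc[j])] += 1
    return counts

word_word = word_word_counts(docs)
```

Note that the word-document counts keep track of which document a word came from, which is what ties the resulting representation to document-level semantics.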

Global co-occurrence count methods like LSM lead to word representations that can be considered genuine semantic embeddings, because they expose statistical information that captures semantic concepts conveyed within entire documents. In contrast, typical prediction-based solutions using neural networks only encapsulate semantic relationships to the extent that they manifest themselves within a local window centered around each word (which is all that’s used in the prediction). Thus, the embeddings that result from such solutions have inherently limited expressive power when it comes to global semantic information.

Despite this limitation, researchers are increasingly adopting neural network-based embeddings. Continuous bag-of-words and skip-gram (linear) models, in particular, are popular because of their ability to convey word analogies of the type “king is to queen as man is to woman.” Because LSM-based methods largely fail to render such word analogies, the prevailing wisdom is that LSM-based methods yield a suboptimal space structure, attributed to a lack of precise meaning in individual dimensions of the vector space.
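The word-analogy property mentioned above is usually tested with vector arithmetic. The following sketch uses hand-built two-dimensional vectors (hypothetical values chosen so that one axis loosely tracks "royalty" and the other "gender"), not embeddings from any trained model, to show the mechanics of the test.

```python
import math

# Hypothetical 2-D vectors, constructed by hand for illustration only.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, as is customary in analogy tests.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("man", "woman", "king")
```

With these toy vectors the offset woman − man, added to king, lands on "queen". Real embeddings only approximate this behavior.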

This realization caused us to explore whether alternatives based on deep learning could perform better than our existing LSM-style word embeddings. We investigated, in particular, whether we could achieve more powerful semantic embeddings by using a different type of neural network architecture.

In the next section, we discuss the challenges involved in using these embeddings for better prediction accuracy in the Apple QuickType keyboards. We go over the bidirectional long short-term memory (bi-LSTM) architecture we adopted. Then we describe the semantic embeddings obtained with this architecture, and their use for neural language modeling. Finally, we discuss the insights that we gained and our perspectives on future development.

Neural Architecture

One of the best-known frameworks for creating word embeddings is word2vec. Rather than rely on it, we decided to investigate a neural architecture that could provide global semantic embeddings. We chose a bi-LSTM recurrent neural network (RNN) architecture. This approach gives the model access to prior, current, and future input, which allows it to capture global context.

We designed the architecture so that it can accept up to a full document as input. As output, it provides the semantic category associated with that document, as shown in Figure 1. The resulting embeddings don’t merely capture the local context of a word—they capture the entire semantic fabric of the input.

This architecture addresses two major obstacles. The first obstacle is the restriction to a local context contained within a finite-sized window centered around each word for predicting the output label. By switching to a bidirectional recurrent implementation for the embedding computation, this architecture can, in principle, accommodate a left and right context of infinite length. This opens up the possibility of handling not just a sentence, but an entire paragraph, or even a full document.

Figure 1. RNN Architecture for Global Semantic Embeddings

The second obstacle relates to the prediction target itself. All neural network solutions to date predict either a word in context or the local context itself, which doesn’t adequately reflect global semantic information. Instead, the architecture shown in Figure 1 predicts a global semantic category directly from the input text. To simplify the generation of semantic labels, we found that it’s expedient to derive suitable clustered categories, for example, from initial word-document embeddings obtained using LSM [8].

Let’s take a closer look at how the network operates. Assume that the current text chunk under consideration (whether a sentence, paragraph, or entire document) is composed of T words x(t), 1 ≤ t ≤ T, and is globally associated with semantic category z. We provide this text chunk as input to our bidirectional RNN architecture.

Each word x(t) in the input text is encoded using 1-of-N encoding; thus, x(t) is sparse and its dimension is equal to N. The left context vector h(t − 1), of dimension H, contains the internal representation of the left context from the output values in the hidden layer from the previous time step. The right context vector g(t + 1), also of dimension H, similarly contains the right context output values in the hidden layer from the future time step. The network computes the output values for the hidden nodes at the current time step as:

$$h(t) = F\{X \cdot x(t) + W \cdot h(t-1)\}$$

$$g(t) = F\{Y \cdot x(t) + V \cdot g(t+1)\}$$

where:


  • F{·} denotes an activation function like sigmoid, hyperbolic tangent, or rectified linear unit. (Alternatively, a purely linear activation function could conceivably be used to ensure good performance on word analogies.)
  • s(t) denotes the state of the network, which is the concatenation of left and right context hidden nodes: s(t) = [g(t) h(t)], of dimension 2H. View this state as a continuous-space representation of the word x(t) in a vector space of dimension 2H.
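The two recurrences and the concatenated state s(t) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the article: the sizes N and H are toy values, the weights are random instead of trained, F is taken to be tanh, the initial contexts are zero vectors, and plain recurrent units stand in for the LSTM cells.

```python
import math
import random

random.seed(0)
N, H = 6, 3  # toy vocabulary size and hidden size (assumed values)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

X, Y = rand_matrix(H, N), rand_matrix(H, N)  # input weights
W, V = rand_matrix(H, H), rand_matrix(H, H)  # recurrent weights

def step(inp, rec, word_id, context):
    """One recurrence: F{inp . x(t) + rec . context}, with F = tanh.

    Because x(t) is 1-of-N, inp . x(t) is just column `word_id` of `inp`.
    """
    return [math.tanh(inp[i][word_id]
                      + sum(rec[i][j] * context[j] for j in range(H)))
            for i in range(H)]

def bidirectional_states(word_ids):
    """Return s(t) = [g(t) h(t)] for every position t."""
    T = len(word_ids)
    h, g = [None] * T, [None] * T
    context = [0.0] * H            # h(0): empty left context (assumption)
    for t in range(T):             # left-to-right pass
        context = h[t] = step(X, W, word_ids[t], context)
    context = [0.0] * H            # g(T + 1): empty right context (assumption)
    for t in reversed(range(T)):   # right-to-left pass
        context = g[t] = step(Y, V, word_ids[t], context)
    return [g[t] + h[t] for t in range(T)]  # list concatenation, length 2H

states = bidirectional_states([2, 0, 5, 1])
```

Changing the last word of the input changes the g half of the first word's state but leaves its h half untouched, which illustrates how the right-to-left pass carries future context back to earlier positions.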

The output of the network is the semantic category associated with the input text. At each time step, the output label z corresponding to the current word is thus encoded as 1-of-K and computed as:

$$z = G\{Z \cdot [g(t)\ h(t)]\}$$

where:


  • G{·} denotes the softmax activation function
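As a small worked instance of the output equation, the sketch below applies a softmax to Z · s(t). The matrix Z, the state vector, and the sizes (K = 3 categories, 2H = 4) are hypothetical values picked by hand, not trained parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_category(Z, state):
    """Compute z = G{Z . s(t)} and return (probabilities, most likely index)."""
    scores = [sum(w * s for w, s in zip(row, state)) for row in Z]
    probs = softmax(scores)
    return probs, probs.index(max(probs))

# Hypothetical output weights: K = 3 rows, 2H = 4 columns.
Z = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
state = [0.2, -0.1, 0.9, 0.5]  # a made-up state s(t) = [g(t) h(t)]

probs, category = predict_category(Z, state)
```

The pre-softmax scores here are 0.2, −0.1, and 1.4, so the third category (index 2) receives the highest probability.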

When we train the network, we assume that a set of semantic category annotations is available. As mentioned previously, these annotations could come from initial word-document embeddings obtained using LSM.

To avoid the typical vanishing gradient problem, we implement the hidden nodes as LSTM or gated recurrent unit (GRU) cells. In addition, you can extend the single hidden layer shown in Figure 1 to an arbitrarily complex, deeper network as necessary. For example, two RNN or LSTM networks stacked on top of one another have given good results in a number of applications, like language identification.

Semantic Word Embeddings

For each word in the underlying vocabulary, the word representation we are looking for is the state of the trained network associated with this word given every text chunk (sentence, paragraph, or document) in which it appears. When the word occurs in only one piece of text, we use the corresponding vector (of dimension 2H) as is. If a word occurs in multiple texts, we average all associated vectors accordingly.
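The averaging step described above can be sketched as follows. The function name `average_embeddings` and the sample state vectors are hypothetical. In practice the vectors would be the states s(t) produced by the trained network for each occurrence of a word.

```python
from collections import defaultdict

def average_embeddings(occurrences):
    """Average the state vectors observed for each word.

    `occurrences` is an iterable of (word, state_vector) pairs, one pair per
    occurrence of the word in any text chunk.
    """
    sums = {}
    counts = defaultdict(int)
    for word, state in occurrences:
        if word not in sums:
            sums[word] = list(state)
        else:
            sums[word] = [a + b for a, b in zip(sums[word], state)]
        counts[word] += 1
    return {w: [v / counts[w] for v in vec] for w, vec in sums.items()}

# Made-up states of dimension 2H = 4: "bank" occurs twice, "river" once.
occurrences = [
    ("bank", [1.0, 0.0, 2.0, 0.0]),
    ("bank", [3.0, 2.0, 0.0, 0.0]),
    ("river", [0.5, 0.5, 0.5, 0.5]),
]
embeddings = average_embeddings(occurrences)
```

A word seen once keeps its single state vector unchanged, while a word seen several times gets the element-wise mean.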

As with LSM, words strongly associated with a particular semantic category tend to be aligned with a small number of dimensions representative of this semantic category. Conversely, function words (i.e., words like “the”, “at”, “you”) tend to congregate near the origin of the vector space.

Note that the initial set of semantic categories can be subsequently refined by iterating on the successive semantic embeddings obtained and the semantic granularity you want to achieve.

Out-Of-Vocabulary (OOV) words need special treatment because they fall outside of the 1-of-N encoding paradigm. One possibility is to allocate an extra dimension to OOV, leading to a 1-of-(N + 1) scheme on the input. The rest of the network remains unchanged. In that case, a single vector of dimension 2H ends up being the representation of any OOV word.
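The 1-of-(N + 1) scheme can be sketched in a few lines. The three-word vocabulary and the choice to put the OOV slot last are assumptions for illustration.

```python
vocab = ["the", "cat", "sat"]            # toy vocabulary, N = 3 (assumption)
index = {word: i for i, word in enumerate(vocab)}
OOV_INDEX = len(vocab)                   # the extra (N + 1)-th dimension

def encode(word):
    """1-of-(N + 1) encoding: unknown words all map to the last slot."""
    vec = [0] * (len(vocab) + 1)
    vec[index.get(word, OOV_INDEX)] = 1
    return vec

known = encode("cat")
unknown = encode("zebra")
```

Because every unknown word activates the same input dimension, all OOV words end up sharing a single representation downstream.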

Neural Language Modeling

In our initial experiments, we trained semantic embeddings on a dedicated paragraph corpus of about 70 million words, with the dimension 2H set to 256. However, we found that, even with length normalization, it was difficult to cope with the considerable variability in the length of each input paragraph. In addition, the mismatch between training word embeddings on paragraph data and training language models on sentence data seemed to outweigh the benefits of injecting global semantic information. This is in line with our earlier observation that any embedding is best trained on the exact type of data on which it will be used. In general, data-driven word embeddings aren't very portable across tasks.

We thus decided to train the embeddings on a subset of the (sentence) corpus used to train our QuickType predictive typing neural language model. We then measured the quality of the embeddings in terms of perplexity on our standard language modeling test sets, as summarized in Table 1. The notation “1-of-N” refers to our standard sparse embedding, while “word2vec” refers to standard word2vec embeddings trained on a large amount of data.
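The article doesn't spell out how perplexity is computed. The sketch below uses the standard definition, the exponential of the average negative log-probability that the model assigns to each test word. The probability lists are made-up inputs, not outputs of any of the models discussed here.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-(1/T) * sum of log p(word_t | history))."""
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that spreads probability uniformly over 4 choices at every step
# has perplexity 4: it is "as confused as" a fair 4-way guess.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Assigning higher probability to the observed words lowers perplexity.
better_ppl = perplexity([0.5, 0.25, 0.5, 0.25])
```

Lower values are better. A perplexity of 185, for instance, means the model is on average as uncertain as a uniform choice among 185 words.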

Table 1. Perplexity results on language modeling test set (size of training corpus varies as indicated)

| Embedding type | 1-of-*N* (*N* = 25K) | word2vec (dim = 256) | bi-LSTM (dim = 256) |
| --- | --- | --- | --- |
| Size of training corpus | 5 billion words | 70 million words | 10 million words |
| Test set perplexity | 189 | 179 | 185 |

We were surprised to see that bi-LSTM embeddings trained on a small amount of data perform at approximately the same level as word2vec embeddings trained on a substantially larger (7x more) amount of data. Similarly, the bi-LSTM embeddings performed as well as one-hot vectors trained on a much larger (500x more) amount of data. Thus, the proposed approach is well suited to applications where collecting a large amount of training data is tedious or expensive.


We set out to assess the potential benefits of incorporating global semantic information into neural language models. Because these models are inherently optimized for local prediction, they don’t properly leverage global content that may well prove relevant. Accordingly, tapping into global semantic information is generally beneficial for neural language modeling. As we discovered, however, this approach requires addressing the length mismatch between training word embeddings on paragraph data and training language models on sentence data.

We think the best way to address this problem is to modify the objective criterion used during language model training, so we can simultaneously train both the embeddings and the language model on the same paragraph data. We’re currently experimenting with a multitask objective to simultaneously predict both the semantic category (to train the semantic embeddings) and the next word (to train the neural language model). We are encouraged by our initial efforts and will report any promising results.
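One way such a multitask objective could look is a weighted sum of two cross-entropy terms, one per prediction target. This is a sketch under assumptions: the article doesn't specify how the two terms are combined, so the interpolation weight `weight` and the function names are hypothetical, and the probability vectors are made-up values standing in for network outputs.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])

def multitask_loss(category_probs, category,
                   next_word_probs, next_word, weight=0.5):
    """Weighted sum of the category loss and the next-word loss.

    `weight` is a hypothetical interpolation factor between the two tasks.
    """
    category_loss = cross_entropy(category_probs, category)
    next_word_loss = cross_entropy(next_word_probs, next_word)
    return weight * category_loss + (1.0 - weight) * next_word_loss

# Made-up softmax outputs: 3 semantic categories, 3-word vocabulary.
loss = multitask_loss([0.7, 0.2, 0.1], 0, [0.1, 0.6, 0.3], 1)
```

Setting `weight` to 1.0 recovers pure semantic-category training, while 0.0 recovers ordinary next-word language model training.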

In summary, using bi-LSTM RNNs to train global semantic word embeddings can indeed lead to improved accuracy in neural language modeling. It can also substantially reduce the amount of data needed for training.

Looking ahead, we think that a multitask training framework has the potential to deliver the best of both worlds by leveraging both local and global semantic information.


[1] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[2] F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, vol. 34, pp. 1–47, March 2002.

[3] S. Tellex, B. Katz, J. Lin, A. Fernandes, and G. Marton. Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 41–47, July–August 2003.

[4] H. Schwenk. Continuous Space Language Models. Computer Speech and Language, vol. 21, issue 3, pp. 492–518, July 2007.

[5] T. Mikolov, W.T. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT 2013, June 2013.

[6] M. Sundermeyer, R. Schlüter, and H. Ney. LSTM Neural Networks for Language Modeling. InterSpeech 2012, pp. 194–197, September 2012.

[7] C.Y. Liou, W.C. Cheng, J.W. Liou, and D.R. Liou. Autoencoder for Words. Neurocomputing, vol. 139, pp. 84–96, September 2014.

[8] J.R. Bellegarda. Exploiting Latent Semantic Information in Statistical Language Modeling. Proceedings of the IEEE, vol. 88, issue 8, pp. 1279–1296, August 2000.

[9] J. Pennington, R. Socher, and C.D. Manning. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, October 2014.
