Siri displays entities like dates, times, addresses and currency amounts in a nicely formatted way. This is the result of applying a process called inverse text normalization (ITN) to the output of a core speech recognition component. To understand the important role ITN plays, consider that, without it, Siri would display “October twenty third twenty sixteen” instead of “October 23, 2016”. In this work, we show that ITN can be formulated as a labeling problem, allowing for the application of a statistical model that is relatively simple, compact, fast to train, and fast to apply. We demonstrate that this approach represents a practical path to a data-driven ITN system.
In most speech recognition systems, a core speech recognizer produces a spoken-form token sequence which is converted to written form through a process called inverse text normalization (ITN). ITN includes formatting entities like numbers, dates, times, and addresses. Table 1 shows examples of spoken-form input and written-form output.
|Spoken Form||Written Form|
|one forty one Dorchester Avenue Salem Massachusetts||141 Dorchester Ave., Salem, MA|
|set an alarm for five thirty p.m.||Set an alarm for 5:30 PM|
|add an appointment on September sixteenth twenty seventeen||Add an appointment on September 16, 2017|
|twenty percent of fifteen dollars seventy three||20% of $15.73|
|what is two hundred seven point three plus six||What is 207.3+6|
Our goal is to build a data-driven system for ITN. In thinking about how we'd formulate the problem to allow us to apply a statistical model, we could consider naively tokenizing the written-form output by segmenting at spaces. If we were to do that, the output token at a particular position would not necessarily correspond to the input token at that position. Indeed, the number of output tokens would not always be equal to the number of input tokens. At first glance, this problem appears to require a fairly unconstrained sequence-to-sequence model, like those often applied for machine translation.
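The mismatch described above is easy to see on two of the pairs from Table 1. The snippet below is only an illustration of naive space-based tokenization; the variable names are ours.

```python
# Two spoken-form / written-form pairs taken from Table 1.
pairs = [
    ("set an alarm for five thirty p.m.", "Set an alarm for 5:30 PM"),
    ("twenty percent of fifteen dollars seventy three", "20% of $15.73"),
]

# Naive tokenization: split both sides at spaces and compare lengths.
counts = [(len(spoken.split()), len(written.split())) for spoken, written in pairs]

for (spoken, written), (n_in, n_out) in zip(pairs, counts):
    print(f"{n_in} input tokens -> {n_out} output tokens: {written!r}")
```

Neither pair has equal token counts on both sides, so a position-by-position mapping over space-delimited output tokens does not work.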
In fact, it is possible to train a sequence-to-sequence model for this task, but such models are complex. They can be data hungry, and training and inference can be computationally expensive. In addition, the lack of constraint in such models can lead to particularly egregious and jarring errors when applied. (See [1], where a sequence-to-sequence model is trained for a related task.) Our preliminary experiments with this sort of model did not yield promising results.
In this article, we show that with a few tables and simple grammars, we can turn the ITN modeling problem into one of labeling spoken-form input tokens. By doing so, we are able to apply simpler and more compact models that achieve good performance with practical amounts of training data.
We begin with two observations:
- There is usually a clear correspondence between spoken-form tokens and segments of the written-form output string.
- Most of the time, the segments in the written-form output are in the same order as their corresponding spoken-form tokens.
Given those two observations, we formulate ITN as the following three-step process:
- Label Assignment: Assign a label to each spoken-form input token. A label specifies edits to perform to the spoken-form token string in order to obtain its corresponding written-form segment. A label may also indicate the start or end of a region where a post-processing grammar should be applied.
- Label Application: Apply the edits specified by the labels. The result will be a written-form string, possibly with regions marked for post-processing.
- Post-processing: Apply the appropriate post-processing grammar to any regions marked for post-processing.
The following two tables give examples of the spoken-form token sequences with assigned labels, output after label application, and final written-form output after application of post-processing grammars. For simplicity, we ignore the fields that have default values.
|Spoken Form||Label||Label Applied||Written Form|
|Spoken Form||Label||Label Applied||Written Form|
A label is a data structure consisting of the following fields:
- Rewrite. Determines if and how the spoken-form token string should be re-written. Built-in values for this field include Capitalize and Downcase. The default value represents leaving the spoken-form token string as it is.
- Prepend. Determines what string, if any, should be prepended to the spoken-form token string. The default value represents prepending nothing.
- Append. Determines what string, if any, should be appended to the spoken-form token string. The default value represents appending nothing.
- Space. Determines whether or not a space should be inserted before the spoken-form token string. The default value represents inserting a space.
- Post Start. Determines whether this token marks the start of a region that needs to be post-processed and, if so, the post-processing grammar that should apply.
- Post End. Determines whether this token marks the end of a region that needs to be post-processed and, if so, the post-processing grammar that should apply.
For each field, each value corresponds to a string-to-string transformation. We encode each transformation as a finite-state transducer (FST), a finite-state automaton with both input and output symbols. Applying a label to an input token consists of applying, in sequence, the FST for the value of each field.
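A minimal sketch of the label structure and of label application, using plain Python functions as stand-ins for the FSTs. The rewrite options `Cardinal` and `Abbrev`, their table entries, the `<Name>…</Name>` region markers, and all function names are assumptions made for illustration. Only the six field names and the `Capitalize` and `Downcase` options come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical rewrite options; the real rewrite tables are not reproduced here.
REWRITES: Dict[str, Callable[[str], str]] = {
    "Default": lambda s: s,                                        # leave token as is
    "Capitalize": str.capitalize,
    "Downcase": str.lower,
    "Cardinal": lambda s: {"five": "5", "thirty": "30"}.get(s, s),  # assumed entries
    "Abbrev": lambda s: {"p.m.": "PM"}.get(s, s),                   # assumed entries
}

@dataclass(frozen=True)
class Label:
    rewrite: str = "Default"
    prepend: str = ""
    append: str = ""
    space: bool = True                 # default: insert a space before the token
    post_start: Optional[str] = None   # name of a post-processing grammar, if any
    post_end: Optional[str] = None

def apply_label(token: str, label: Label) -> str:
    """Apply each field's transformation in sequence to one token."""
    out = REWRITES[label.rewrite](token)
    out = label.prepend + out + label.append
    if label.post_start:
        out = f"<{label.post_start}>" + out    # assumed marker syntax
    if label.post_end:
        out = out + f"</{label.post_end}>"
    return (" " if label.space else "") + out

def apply_labels(tokens: List[str], labels: List[Label]) -> str:
    return "".join(apply_label(t, l) for t, l in zip(tokens, labels))

tokens = "set an alarm for five thirty p.m.".split()
labels = [
    Label(rewrite="Capitalize", space=False),
    Label(), Label(), Label(),
    Label(rewrite="Cardinal"),
    Label(rewrite="Cardinal", prepend=":", space=False),
    Label(rewrite="Abbrev"),
]
written = apply_labels(tokens, labels)
print(written)  # Set an alarm for 5:30 PM
```

Because every field has a default, most tokens in a typical utterance carry the all-default label, and only a handful of tokens need non-trivial edits.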
We generate the FSTs for the Rewrite, Prepend, and Append fields from tables, as shown in the next two tables.
|Option||Spoken Form||Written Form|
For most of the ITN transformations we observe, label assignment and application are sufficient to produce the correct written-form output. However, a few phenomena require additional post-processing. First, in currency expressions, we use a post-processing grammar to reorder the currency symbol and major currency amount. Second, we use a post-processing grammar to properly format relative time expressions like "twenty five minutes to four." Finally, we use post-processing grammars for formatting Roman numerals in spoken-form token sequences like "roman numeral five hundred seven." In this last case, the label mechanism would be sufficient, but it is straightforward to write a language-independent grammar that converts numbers to Roman numerals. Table 5 gives an example for each of these cases. Post-processing grammars are compiled into FSTs.
|Applying Label Edits||Written Form|
|You owe me \||You owe me $3.50|
|Meet me at \||Meet me at 3:35|
|Super Bowl \||Super Bowl LI|
If we are to train a model to predict labels, we need a way to obtain label sequences from spoken-form/written-form pairs. We do this using an FST-based approach. First, we construct the following FSTs:
- E: An expander FST. This takes a spoken-form token sequence as input and produces an output FST that allows all possible labels to be prepended to each token.
- A1, A2, ..., AF: A set of applier FSTs, one for each field. These take as input a token sequence with labels prepended and apply the option specified for field f for each token.
- R: A renderer FST. This takes as input a token sequence with labels prepended after labels have been applied, and then strips the labels. The output is the written form, possibly with regions marked for post-processing.
- P: A post-processor FST. This takes as input a written-form string, possibly with regions marked for post-processing, and does the appropriate post-processing. The output is the final written-form output.
We apply the series of operations shown in the following equations to an FST, S, constructed from the spoken-form input. Note that the Proj_in operation copies each arc's input label to its output label, and Proj_out does the opposite. W is an FST constructed from the written-form output. Label sequences compatible with the spoken-form/written-form pair can easily be extracted from S_labeled. For the vast majority of cases in our data, there is only a single label sequence. In cases with multiple compatible label sequences, we choose one by using a label bigram model trained from unambiguous cases.
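The equations themselves are not reproduced in this text. One composition chain that is consistent with the definitions of E, A1 through AF, R, P, S and W is sketched below. The intermediate names and the exact placement of the two projections are assumptions, not necessarily the original formulation.

```latex
\begin{align}
S_{\mathrm{exp}} &= \mathrm{Proj}_{\mathrm{out}}(S \circ E) \\
S_{\mathrm{app}} &= S_{\mathrm{exp}} \circ A_1 \circ A_2 \circ \cdots \circ A_F \circ R \circ P \\
S_{\mathrm{labeled}} &= \mathrm{Proj}_{\mathrm{in}}(S_{\mathrm{app}} \circ W)
\end{align}
```

Under this reading, the first line yields an acceptor over all labeled versions of the spoken-form tokens, the second maps each labeled sequence to the written form it would produce, and the third keeps only those labeled sequences whose written form matches W.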
Example: Cardinal Numbers
Let's take a look at how we can cast ITN as a labeling problem specifically for cardinal numbers. Table 6 gives the relevant lines from the rewrite table. Note that the options with a PopD suffix, where D is a digit, represent turning a magnitude into a sequence of zeros with commas in the appropriate places and then removing the last D zeros (and any interspersed commas).
|Option||Spoken Form||Written Form|
To format cardinal numbers, these rewrite options are combined with the No-space option. Table 7 demonstrates how this works for a non-trivial cardinal number.
|Spoken Form||Label||Written Form|
The rewrite options are designed such that many different numbers will share the same label sequence. This is illustrated in Table 8. Also note that the label sequence at the end of Table 8 matches the label sequence in Table 7. This sharing of label sequences across many numbers simplifies the modeling task.
|Spoken Form A||Spoken Form B||Label||Written Form A||Written Form B|
|one||two||RewriteCardinal_SpaceNo||1||2|
An alternative approach, also supported in this framework, would be to format cardinals using a post-processing grammar. This would require writing a complete number grammar. The advantage of the approach we describe here is that only the atomic spoken-form token operations need to be specified. We can rely on the label prediction model to learn the intricacies of the grammar.
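The PopD rewrite options described in this example can be sketched as follows. This is one reading of the description (render the magnitude as zeros with commas, then drop the last D zeros along with any commas between them). The function names and the `MAGNITUDE_ZEROS` table are hypothetical.

```python
# Assumed table: number of zeros each magnitude word contributes.
MAGNITUDE_ZEROS = {"hundred": 2, "thousand": 3, "million": 6, "billion": 9}

def magnitude_digits(zeros: int) -> str:
    """Render zeros with a comma before each group of three, e.g. 6 -> ',000,000'."""
    out = []
    for i in range(zeros):          # build right to left
        out.append("0")
        if (i + 1) % 3 == 0:
            out.append(",")
    return "".join(reversed(out))

def pop_zeros(rendered: str, d: int) -> str:
    """Remove the last d zeros and any commas interspersed among them."""
    chars = list(rendered)
    removed = 0
    while removed < d and chars:
        if chars.pop() == "0":
            removed += 1
    return "".join(chars)

def rewrite_magnitude(word: str, d: int) -> str:
    """Hypothetical 'MagnitudePopD' rewrite for a magnitude word."""
    return pop_zeros(magnitude_digits(MAGNITUDE_ZEROS[word]), d)

# "two thousand five": the digit 5 fills the last zero, so thousand uses Pop1.
two_thousand_five = "2" + rewrite_magnitude("thousand", 1) + "5"
# "one million two thousand": "2,000" fills four zeros, so million uses Pop4.
one_million_two_thousand = (
    "1" + rewrite_magnitude("million", 4) + "2" + rewrite_magnitude("thousand", 0)
)
print(two_thousand_five)          # 2,005
print(one_million_two_thousand)   # 1,002,000
```

The value of D depends only on how many digit positions the following tokens fill, which is why numbers of the same shape share a label sequence.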
Modeling and Results
Using this approach to cast ITN as a labeling problem, we build a system using a bi-directional LSTM [2], [3] as the label prediction model. We wish to estimate the performance of the system in the scenario where a practical amount of training data (in the form of spoken-form/written-form pairs) is available. In lieu of hand-transcribed data, we train and evaluate the system using the output of a very accurate rule-based system, with the belief that the results will generalize to the hand-transcribed data scenario.
Figure 1 gives the accuracies of our system on randomly selected Siri utterances when trained on 500K, 1M, and 5M utterances. If we zoom in on utterances where ITN performs a significant transformation, including those with dates, times and currency expressions, accuracy exceeds 90% for nearly all entity types when training with 1M utterances.
Given these results, if you are starting from scratch to develop an ITN component, we believe this approach represents a practical path to a system that can be deployed at scale.
In our production scenarios, we begin with reasonably accurate rule-based systems, and we have access to very large amounts of real spoken-form data. We found that by using the rule-based systems to generate written-form outputs, selecting our training data carefully, and performing some manual correction of the data, we can build systems with this approach that achieve better accuracies than our rule-based systems. These new systems will allow us to achieve further improvements through additional curation of the training data, a process requiring less specialized knowledge than rule writing. In addition, these new systems can capture transformations that occur in long and varied contexts. Such transformations are difficult to implement with rules.
You can find a more thorough description of the modeling approach that we took for label prediction, as well as a more extensive performance evaluation, in [4].
[1] R. Sproat and N. Jaitly, RNN Approaches to Text Normalization: A Challenge, arXiv preprint arXiv:1611.00068, 2016. https://arxiv.org/abs/1611.00068
[2] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[3] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4] E. Pusateri, B.R. Ambati, E. Brooks, O. Platek, D. McAllaster, V. Nagesha, A Mostly Data-driven Approach to Inverse Text Normalization, Interspeech, 2017.