Siri displays entities like dates, times, addresses, and currency amounts in a nicely formatted way. This formatting is the result of applying a process called inverse text normalization (ITN) to the output of a core speech recognition component. To understand the important role ITN plays, consider that, without it, Siri would display “October twenty third twenty sixteen” instead of “October 23, 2016”. In this work, we show that ITN can be formulated as a labeling problem, allowing for the application of a statistical model that is relatively simple, compact, fast to train, and fast to apply. We demonstrate that this approach represents a practical path to a data-driven ITN system.

## Introduction

In most speech recognition systems, a core speech recognizer produces a spoken-form token sequence which is converted to written form through a process called inverse text normalization (ITN). ITN includes formatting entities like numbers, dates, times, and addresses. Table 1 shows examples of spoken-form input and written-form output.

*Table 1. Examples of spoken-form input and written-form output.*

| Spoken Form | Written Form |
| --- | --- |
| one forty one Dorchester Avenue Salem Massachusetts | 141 Dorchester Ave., Salem, MA |
| set an alarm for five thirty p.m. | Set an alarm for 5:30 PM |
| add an appointment on September sixteenth twenty seventeen | Add an appointment on September 16, 2017 |

## Label Inference

If we are to train a model to predict labels, we need a way to obtain label sequences from spoken-form/written-form pairs. We do this using an FST-based approach. First, we construct the following FSTs:

1. E: An expander FST. This takes a spoken-form token sequence as input and produces an output FST that allows all possible labels to be prepended to each token.
2. A1, A2, ..., AF: A set of applier FSTs, one for each field. Each takes as input a token sequence with labels prepended and applies the option specified for field *f* to each token.
3. R: A renderer FST. This takes as input a labeled token sequence after the appliers have run, and strips the labels. The output is the written form, possibly with regions marked for post-processing.
4. P: A post-processor FST. This takes as input a written-form string, possibly with regions marked for post-processing, and performs the appropriate post-processing. The output is the final written form.

We apply the series of operations shown in the following equation to an FST, S, constructed from the spoken-form input. Note that the $\mathrm{Proj_{in}}$ operation copies each arc's input label to its output label, $\mathrm{Proj_{out}}$ does the opposite, and W is an FST constructed from the written-form output:

$$S_{\mathrm{labeled}} = \mathrm{Proj_{in}}\left(\mathrm{Proj_{out}}(S \circ E) \circ A_1 \circ A_2 \circ \cdots \circ A_F \circ R \circ P \circ W\right)$$

Label sequences compatible with the spoken-form/written-form pair can easily be extracted from $S_{\mathrm{labeled}}$. For the vast majority of cases in our data, there is only a single compatible label sequence. In cases with multiple compatible label sequences, we choose one using a label bigram model trained on unambiguous cases.
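For intuition, this label inference procedure can be imitated without FSTs in a few lines of Python: enumerate every candidate label sequence, render each one, and keep those that reproduce the written form. The sketch below does exactly that; its tiny label inventory and rewrite entries are illustrative stand-ins of our own, not the production tables.

```python
from itertools import product

# Toy label inventory: label -> (token rewrites, prepend a space?).
# None for the rewrites means "emit the token verbatim". These four
# labels are illustrative stand-ins, not the production inventory.
LABELS = {
    "Verbatim":                                (None, True),
    "RewriteCardinal":                         ({"one": "1"}, True),
    "RewriteCardinal_SpaceNo":                 ({"one": "1"}, False),
    "RewriteCardinalDecadeFirstDigit_SpaceNo": ({"forty": "4"}, False),
}

def render(tokens, labels):
    """Render a labeled token sequence; None if a label's rewrite
    option does not cover its token."""
    pieces = []
    for token, label in zip(tokens, labels):
        rewrites, space = LABELS[label]
        written = token if rewrites is None else rewrites.get(token)
        if written is None:
            return None
        pieces.append((" " if space else "") + written)
    return "".join(pieces).strip()

def infer_label_sequences(spoken, written):
    """Brute-force equivalent of extracting label paths from S_labeled."""
    tokens = spoken.split()
    return [seq for seq in product(LABELS, repeat=len(tokens))
            if render(tokens, seq) == written]

# "one forty one" -> "141", as in the address of Table 1.
for seq in infer_label_sequences("one forty one", "141"):
    print(seq)
```

Even this toy prints two compatible sequences, differing only in whether the first token's label carries the space option; that is exactly the kind of ambiguity the label bigram model resolves.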

## Example: Cardinal Numbers

Let's take a look at how we can cast ITN as a labeling problem, specifically for cardinal numbers. Table 6 gives the relevant lines from the rewrite table. Note that the options with a Pop*D* suffix, where *D* is a digit, represent turning a magnitude into a sequence of zeros with commas in the appropriate places and then removing the last *D* zeros (and any interspersed commas).

*Table 6. Lines from the rewrite table relevant to cardinal numbers (abridged).*

| Option | Spoken Form | Written Form |
| --- | --- | --- |
| … | | |
| Cardinal | one | 1 |
| Cardinal | two | 2 |
| … | | |
| CardinalTwoDigit | ten | 10 |
| CardinalTwoDigit | eleven | 11 |
| … | | |
| CardinalDecade | twenty | 20 |
| CardinalDecade | thirty | 30 |
| … | | |
| CardinalDecadeFirstDigit | twenty | 2 |
| CardinalDecadeFirstDigit | thirty | 3 |
| … | | |
| CardinalMagnitudePop0 | hundred | 00 |
| CardinalMagnitudePop0 | thousand | 000 |
| CardinalMagnitudePop0 | million | 000,000 |
| … | | |
| CardinalMagnitudePop1 | million | 000,00 |
| CardinalMagnitudePop1 | billion | 000,000,00 |
| … | | |
| CardinalMagnitudePop2 | million | 000,0 |
| CardinalMagnitudePop2 | billion | 000,000,0 |
| … | | |
| CardinalMagnitudePop5 | million | 0 |
| CardinalMagnitudePop5 | billion | 000,0 |
| … | | |
| CardinalMagnitudePop11 | trillion | 0 |
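The Pop*D* written forms are mechanical enough that they can be generated rather than listed by hand: render the magnitude as zeros with thousands separators, then strip *D* zeros (and interspersed commas) from the right. A minimal sketch, with a function name of our own choosing:

```python
def magnitude_pop(num_zeros: int, d: int) -> str:
    """Written form of a CardinalMagnitudePop<d> option for a magnitude
    of 10**num_zeros: zeros with commas every three digits, minus the
    last `d` zeros and any commas passed along the way."""
    # Format with separators, then remove the leading "1,":
    # e.g. num_zeros=6 -> "1,000,000" -> "000,000".
    with_commas = "{:,}".format(10 ** num_zeros)[1:].lstrip(",")
    kept, to_drop = [], d
    for ch in reversed(with_commas):
        if to_drop > 0:
            if ch == "0":
                to_drop -= 1
            continue  # interspersed commas are dropped along with the zeros
        kept.append(ch)
    return "".join(reversed(kept)).rstrip(",")

# Reproduces the magnitude rows of Table 6:
assert magnitude_pop(6, 0) == "000,000"   # CardinalMagnitudePop0, million
assert magnitude_pop(6, 1) == "000,00"    # CardinalMagnitudePop1, million
assert magnitude_pop(9, 5) == "000,0"     # CardinalMagnitudePop5, billion
assert magnitude_pop(12, 11) == "0"       # CardinalMagnitudePop11, trillion
```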

To format cardinal numbers, these rewrite options are combined with the No-space option (the _SpaceNo suffix in the labels below). Table 7 demonstrates how this works for a non-trivial cardinal number.

*Table 7. The label sequence that formats "twenty five thousand six hundred and one" as "25,601".*

| Spoken Form | Label | Written Form |
| --- | --- | --- |
| twenty | RewriteCardinalDecadeFirstDigit_SpaceNo | 2 |
| five | RewriteCardinal_AppendComma_SpaceNo | 5, |
| thousand | RewriteDelete_SpaceNo | |
| six | RewriteCardinal_SpaceNo | 6 |
| hundred | RewriteCardinalMagnitudePop1_SpaceNo | 0 |
| and | RewriteDelete_SpaceNo | |
| one | RewriteCardinal_SpaceNo | 1 |
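As a toy illustration of how the labels in Table 7 drive formatting, the following sketch (in the same illustrative style as above, with option tables abbreviated to just the entries this example needs) replays the table's label sequence:

```python
# Abbreviated rewrite options: option name -> {spoken token: written form};
# None means the token is deleted. The fuller inventory is in Table 6.
OPTIONS = {
    "RewriteCardinal": {"one": "1", "two": "2", "five": "5",
                        "six": "6", "seven": "7"},
    "RewriteCardinalDecadeFirstDigit": {"twenty": "2", "thirty": "3"},
    "RewriteCardinalMagnitudePop1": {"hundred": "0"},
    "RewriteCardinalMagnitudePop5": {"million": "0", "billion": "000,0"},
    "RewriteDelete": None,
}

def render(tokens, labels):
    """Apply one label per token and concatenate the pieces. Every label
    in this example carries SpaceNo, so no separators are inserted."""
    pieces = []
    for token, label in zip(tokens, labels):
        parts = label.split("_")
        table = OPTIONS[parts[0]]
        piece = "" if table is None else table[token]
        if "AppendComma" in parts:
            piece += ","
        pieces.append(piece)
    return "".join(pieces)

TABLE7_LABELS = [
    "RewriteCardinalDecadeFirstDigit_SpaceNo",
    "RewriteCardinal_AppendComma_SpaceNo",
    "RewriteDelete_SpaceNo",
    "RewriteCardinal_SpaceNo",
    "RewriteCardinalMagnitudePop1_SpaceNo",
    "RewriteDelete_SpaceNo",
    "RewriteCardinal_SpaceNo",
]
print(render("twenty five thousand six hundred and one".split(),
             TABLE7_LABELS))   # -> 25,601
```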

The rewrite options are designed so that many different numbers share the same label sequence. This is illustrated in Table 8. Note also that the last seven labels in Table 8 match the label sequence in Table 7. This sharing of label sequences across many numbers simplifies the modeling task.

*Table 8. Two different numbers that share a single label sequence.*

| Spoken Form A | Spoken Form B | Label | Written Form A | Written Form B |
| --- | --- | --- | --- | --- |
| one | two | RewriteCardinal_AppendComma_SpaceNo | 1, | 2, |
| million | billion | RewriteCardinalMagnitudePop5_SpaceNo | 0 | 000,0 |
| twenty | thirty | RewriteCardinalDecadeFirstDigit_SpaceNo | 2 | 3 |
| five | six | RewriteCardinal_AppendComma_SpaceNo | 5, | 6, |
| thousand | thousand | RewriteDelete_SpaceNo | | |
| six | seven | RewriteCardinal_SpaceNo | 6 | 7 |
| hundred | hundred | RewriteCardinalMagnitudePop1_SpaceNo | 0 | 0 |
| and | and | RewriteDelete_SpaceNo | | |
| one | two | RewriteCardinal_SpaceNo | 1 | 2 |
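Continuing the sketch above (reusing `render` and `TABLE7_LABELS`), the Table 8 sequence is just two labels prepended to the Table 7 sequence, and replaying the single sequence over both spoken forms yields both written forms:

```python
# The Table 8 sequence extends the Table 7 sequence by two labels.
TABLE8_LABELS = [
    "RewriteCardinal_AppendComma_SpaceNo",
    "RewriteCardinalMagnitudePop5_SpaceNo",
] + TABLE7_LABELS

spoken_a = "one million twenty five thousand six hundred and one".split()
spoken_b = "two billion thirty six thousand seven hundred and two".split()
print(render(spoken_a, TABLE8_LABELS))   # -> 1,025,601
print(render(spoken_b, TABLE8_LABELS))   # -> 2,000,036,702
```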

An alternative approach, also supported in this framework, would be to format cardinals using a post-processing grammar. This would require writing a complete number grammar. The advantage of the approach we describe here is that only the atomic spoken-form token operations need to be specified. We can rely on the label prediction model to learn the intricacies of the grammar.

## Modeling and Results

Using this approach to cast ITN as a labeling problem, we build a system using a bi-directional LSTM [2], [3] as the label prediction model. We wish to estimate the performance of the system in the scenario where a practical amount of training data (in the form of spoken-form/written-form pairs) is available. In lieu of hand-transcribed data, we train and evaluate the system using the output of a very accurate rule-based system, with the belief that the results will generalize to the hand-transcribed data scenario.
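We don't spell out the network configuration here (see [4] for details), but as a rough illustration, a bidirectional LSTM tagger of this kind can be written in a few lines of PyTorch. All sizes and names below are our own placeholder choices, not the production configuration.

```python
import torch
import torch.nn as nn

class LabelTagger(nn.Module):
    """Bi-directional LSTM that predicts one ITN label per spoken token."""
    def __init__(self, vocab_size, num_labels, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim.
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):               # (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden)                 # (batch, seq_len, num_labels)

# Training pairs come straight from label inference: spoken-form token ids
# in, one inferred label id per token out.
model = LabelTagger(vocab_size=10_000, num_labels=500)
tokens = torch.randint(0, 10_000, (1, 7))       # one 7-token utterance
targets = torch.randint(0, 500, (1, 7))
loss = nn.CrossEntropyLoss()(model(tokens).flatten(0, 1), targets.flatten())
loss.backward()
```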

Figure 1 gives the accuracies of our system on randomly selected Siri utterances when trained on 500K, 1M, and 5M utterances. If we zoom in on utterances where ITN performs a significant transformation, including those with dates, times, and currency expressions, accuracy exceeds 90% for nearly all entity types when training with 1M utterances.

Given these results, if you are starting from scratch to develop an ITN component, we believe this approach represents a practical path to a system that can be deployed at scale.

In our production scenarios, we begin with reasonably accurate rule-based systems, and we have access to very large amounts of real spoken-form data. We found that by using the rule-based systems to generate written-form outputs, selecting our training data carefully, and performing some manual correction of the data, we can build systems with this approach that achieve better accuracies than our rule-based systems. These new systems will allow us to achieve further improvements through additional curation of the training data, a process requiring less specialized knowledge than rule writing. In addition, these new systems can capture transformations that occur in long and varied contexts. Such transformations are difficult to implement with rules.

You can find a more thorough description of the modeling approach we took for label prediction, as well as a more extensive performance evaluation, in [4].

## References

[1] R. Sproat and N. Jaitly, RNN Approaches to Text Normalization: A Challenge, arXiv preprint arXiv:1611.00068, 2016. https://arxiv.org/abs/1611.00068

[2] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[3] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[4] E. Pusateri, B. R. Ambati, E. Brooks, O. Platek, D. McAllaster, and V. Nagesha, A Mostly Data-driven Approach to Inverse Text Normalization, in Proc. Interspeech, 2017.