Siri displays entities like dates, times, addresses and currency amounts in a nicely formatted way. This is the result of the application of a process called inverse text normalization (ITN) to the output of a core speech recognition component. To understand the important role ITN plays, consider that, without it, Siri would display “October twenty third twenty sixteen” instead of “October 23, 2016”. In this work, we show that ITN can be formulated as a labelling problem, allowing for the application of a statistical model that is relatively simple, compact, fast to train, and fast to apply. We demonstrate that this approach represents a practical path to a data-driven ITN system.


In most speech recognition systems, a core speech recognizer produces a spoken-form token sequence which is converted to written form through a process called inverse text normalization (ITN). ITN includes formatting entities like numbers, dates, times, and addresses. Table 1 shows examples of spoken-form input and written-form output.

Table 1. Examples of spoken-form input and written-form output
Spoken Form | Written Form
one forty one Dorchester Avenue Salem Massachusetts | 141 Dorchester Ave., Salem, MA
set an alarm for five thirty p.m. | Set an alarm for 5:30 PM
add an appointment on September sixteenth twenty seventeen | Add an appointment on September 16, 2017
twenty percent of fifteen dollars seventy three | 20% of $15.73
what is two hundred seven point three plus six | What is 207.3+6

Our goal is to build a data-driven system for ITN. In thinking about how we'd formulate the problem to allow us to apply a statistical model, we could consider naively tokenizing the written-form output by segmenting at spaces. If we were to do that, the output token at a particular position would not necessarily correspond to the input token at that position. Indeed, the number of output tokens would not always be equal to the number of input tokens. At first glance, this problem appears to require a fairly unconstrained sequence-to-sequence model, like those often applied for machine translation.

In fact, it is possible to train a sequence-to-sequence model for this task. But such models are complex: they can be data hungry, and training and inference can be computationally expensive. In addition, the lack of constraint in such models can lead to particularly egregious and jarring errors when they are applied. (See [1], where a sequence-to-sequence model is trained for a related task.) Our preliminary experiments with this sort of model did not yield promising results.

In this article, we show that with a few tables and simple grammars, we can turn the ITN modeling problem into one of labeling spoken-form input tokens. By doing so, we are able to apply simpler and more compact models that achieve good performance with practical amounts of training data.


We begin with two observations:

  1. There is usually a clear correspondence between spoken-form tokens and segments of the written-form output string.
  2. Most of the time, the segments in the written-form output are in the same order as their corresponding spoken-form tokens.

Given those two observations, we formulate ITN as the following three step process:

  1. Label Assignment: Assign a label to each spoken-form input token. A label specifies the edits to perform on the spoken-form token string in order to obtain its corresponding written-form segment. A label may also indicate the start or end of a region where a post-processing grammar should be applied.
  2. Label Application: Apply the edits specified by the labels. The result will be a written-form string, possibly with regions marked for post-processing.
  3. Post-processing: Apply the appropriate post-processing grammar to any regions marked for post-processing.

The following two tables give examples of the spoken-form token sequences with assigned labels, output after label application, and final written-form output after application of post-processing grammars. For simplicity, we ignore the fields that have default values.

Table 2A. The spoken-form token sequence with corresponding labels for "February 20, 2017"
Spoken Form | Label | Label Applied | Written Form
February | Default | February | February
twentieth | RewriteOrdinalAsCardinalDecade_AppendComma | 20, | 20,
twenty | RewriteCardinalDecade | 20 | 20
seventeen | RewriteCardinalTeen_SpaceNo | 17 | 17
Table 2B. The spoken-form token sequence with corresponding labels for "20% of $205"
Spoken Form | Label | Label Applied | Written Form
twenty | RewriteCardinalDecade | 20 | 20
percent | RewritePercentSign_SpaceNo | % | %
of | Default | of | of
two | RewriteCardinal_StartMajorCurrency | <MajorCurrency>2 |
hundred | RewriteMagnitudePop1_SpaceNo | 0 |
five | RewriteCardinal_SpaceNo | 5 |
dollars | RewriteCurrencySymbol_SpaceNo_EndMajorCurrency | $<\MajorCurrency> | $205


A label is a data structure consisting of the following fields:

  1. Rewrite. Determines if and how the spoken-form token string should be re-written. Built-in values for this field include Capitalize and Downcase. The default value represents leaving the spoken-form token string as it is.
  2. Prepend. Determines what string, if any, should be prepended to the spoken-form token string. The default value represents prepending nothing.
  3. Append. Determines what string, if any, should be appended to the spoken-form token string. The default value represents appending nothing.
  4. Space. Determines whether or not a space should be inserted before the spoken-form token string. The default value represents inserting a space.
  5. Post Start. Determines whether this token starts a region that needs to be post-processed and, if so, which post-processing grammar should apply.
  6. Post End. Determines whether this token ends a region that needs to be post-processed and, if so, which post-processing grammar should apply.
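To make the label structure concrete, here is a minimal Python sketch. The field names follow the list above; the tiny rewrite table and the specific option strings are illustrative stand-ins, not the production inventory:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative excerpt of a rewrite table, keyed by (option, spoken form).
# The real system generates FSTs from much larger tables (see Table 3).
REWRITE_TABLE = {
    ("OrdinalAsCardinalDecade", "twentieth"): "20",
    ("CardinalDecade", "twenty"): "20",
    ("CardinalTeen", "seventeen"): "17",
}

@dataclass
class Label:
    rewrite: Optional[str] = None  # rewrite option name; None = leave token as is
    prepend: str = ""              # string prepended to the token
    append: str = ""               # string appended to the token
    space: bool = True             # insert a space before the token?

def apply_labels(tokens, labels):
    """Apply each token's label edits and concatenate the segments."""
    out = []
    for i, (token, label) in enumerate(zip(tokens, labels)):
        segment = token
        if label.rewrite is not None:
            segment = REWRITE_TABLE[(label.rewrite, token)]
        segment = label.prepend + segment + label.append
        if i > 0 and label.space:
            out.append(" ")
        out.append(segment)
    return "".join(out)

# The example from Table 2A: "February twentieth twenty seventeen"
tokens = ["February", "twentieth", "twenty", "seventeen"]
labels = [
    Label(),                                               # Default
    Label(rewrite="OrdinalAsCardinalDecade", append=","),  # -> "20,"
    Label(rewrite="CardinalDecade"),                       # -> "20"
    Label(rewrite="CardinalTeen", space=False),            # -> "17", no space
]
print(apply_labels(tokens, labels))  # February 20, 2017
```

In the full system each field value is an FST rather than a Python branch, but the edit semantics are the same.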

For each field, each value corresponds to a string-to-string transformation. We use a finite-state transducer (FST), a finite-state automaton with both input and output symbols, to encode each transformation. Applying a label to an input token consists of applying the FST for the value of each field, one field at a time, in order.

We generate the FSTs for the Rewrite, Prepend, and Append fields from tables, as shown in the next two tables.

Table 3. Excerpts from the rewrite table
Option | Spoken Form | Written Form
Magnitude | hundred | 00
AbbreviateMeasure | pounds | lbs.
CurrencySymbol | pounds | £
Table 4. The append table
Option | Appended String
Period | .
Comma | ,
Colon | :
Hyphen | -
Slash | /
Square | 2
Cube | 3


For most of the ITN transformations we observe, label assignment and application are sufficient to produce the correct written-form output. However, a few phenomena require additional post-processing. First, in currency expressions, we use a post-processing grammar to reorder the currency symbol and the major currency amount. Second, we use a post-processing grammar to properly format relative time expressions like "twenty five minutes to four." Finally, we use post-processing grammars to format Roman numerals in spoken-form token sequences like "roman numeral five hundred seven." In this last case, the label mechanism alone would be sufficient, but it is straightforward to write a language-independent grammar that converts numbers to Roman numerals. Table 5 gives an example of each of these cases. Post-processing grammars are compiled into FSTs.

Table 5. Examples of applying post-processing grammar
After Applying Label Edits | Written Form
You owe me \3$.50 | You owe me $3.50
Meet me at \25 minutes to 4 | Meet me at 3:35
Super Bowl \51 | Super Bowl LI
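As an illustration of the first case, the currency post-processor only needs to move the symbol to the front of the marked region. Here is a minimal regex sketch, assuming the <MajorCurrency>…<\MajorCurrency> region markers from Table 2B; the real post-processors are grammars compiled into FSTs, not regexes:

```python
import re

def reorder_major_currency(text):
    """Move the trailing currency symbol in a marked region to the front:
    '<MajorCurrency>205 $<\\MajorCurrency>' -> '$205'."""
    return re.sub(
        r"<MajorCurrency>(.*?)\s*([$£€])<\\MajorCurrency>",
        r"\2\1",
        text,
    )

print(reorder_major_currency("20% of <MajorCurrency>205 $<\\MajorCurrency>"))
# 20% of $205
```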

Label Inference

If we are to train a model to predict labels, we need a way to obtain label sequences from (spoken form, written form) pairs. We do this using an FST-based approach. First, we construct the following FSTs:

  1. E: An expander FST. This takes a spoken-form token sequence as input and produces an FST that allows all possible labels to be prepended to each token.
  2. A_1, A_2, ..., A_F: A set of applier FSTs, one for each of the F label fields. Applier A_f takes as input a token sequence with labels prepended and applies the option specified for field f to each token.
  3. R: A renderer FST. This takes as input a token sequence with labels prepended, after the labels have been applied, and strips the labels. The output is the written form, possibly with regions marked for post-processing.
  4. P: A post-processor FST. This takes as input a written-form string, possibly with regions marked for post-processing, and performs the appropriate post-processing. The output is the final written-form output.

We apply the series of operations shown in the following equations to an FST, S, constructed from the spoken-form input. Note that the Proj_in operation copies each arc's input label to its output label, and Proj_out does the opposite. W is an FST constructed from the written-form output. Label sequences compatible with the (spoken form, written form) pair can easily be extracted from S_labeled. For the vast majority of cases in our data, there is only a single compatible label sequence. In cases with multiple compatible label sequences, we choose one using a label bigram model trained on unambiguous cases.

S_exp = Proj_out(S ∘ E)
W_proc = S_exp ∘ A_Rewrite ∘ ... ∘ A_PostEnd
S_labeled = Proj_in(W_proc ∘ R ∘ P ∘ W)
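The FST composition above efficiently intersects every possible label assignment with the target written form. For intuition only, the same search can be brute-forced on a toy example; the candidate label sets and the "five thirty" → "5:30" case below are illustrative, and the real system composes FSTs rather than enumerating sequences:

```python
from itertools import product

# Each candidate label is a (rewritten_form, prepend, space_before) triple.
# These candidate sets are invented for the example.
CANDIDATES = {
    "five":   [("five", "", True), ("5", "", True)],
    "thirty": [("thirty", "", True), ("30", "", True), ("30", ":", False)],
}

def render(tokens, labels):
    out = []
    for i, (form, prepend, space) in enumerate(labels):
        if i > 0 and space:
            out.append(" ")
        out.append(prepend + form)
    return "".join(out)

def infer_labels(tokens, written):
    """Return all label sequences whose rendering matches the written form."""
    return [seq for seq in product(*(CANDIDATES[t] for t in tokens))
            if render(tokens, seq) == written]

print(infer_labels(["five", "thirty"], "5:30"))
# [(('5', '', True), ('30', ':', False))]
```

As in the FST formulation, an ambiguous pair would simply yield more than one surviving sequence, at which point a label bigram model can break the tie.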

Example: Cardinal Numbers

Let's take a look at how we can cast ITN as a labeling problem specifically for cardinal numbers. Table 6 gives the relevant lines from the rewrite table. Note that the options with a PopD suffix, where D is a digit, represent turning a magnitude into a sequence of zeros with commas in the appropriate places and then removing the last D zeros (and any interspersed commas).

Table 6. Excerpt of the rewrite table relevant for formatting cardinals
Option | Spoken Form | Written Form
Cardinal | one | 1
Cardinal | two | 2
CardinalTwoDigit | ten | 10
CardinalTwoDigit | eleven | 11
CardinalDecade | twenty | 20
CardinalDecade | thirty | 30
CardinalDecadeFirstDigit | twenty | 2
CardinalMagnitudePop0 | million | 000,000
CardinalMagnitudePop1 | billion | 000,000,00
CardinalMagnitudePop2 | million | 000,0
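The PopD behavior can be sketched directly. Assuming magnitudes map to comma-grouped strings of zeros as in Table 6, popping D digits removes the last D zeros along with any commas in the way (the function and table names here are illustrative):

```python
MAGNITUDE_ZEROS = {"hundred": 2, "thousand": 3, "million": 6, "billion": 9}

def zeros_with_commas(n):
    """A string of n zeros with a comma every three digits from the right."""
    s = ""
    for i in range(n):
        if i and i % 3 == 0:
            s = "," + s
        s = "0" + s
    return s

def magnitude_pop(word, d):
    """Rewrite a magnitude word as zeros, then pop the last d zeros
    (and any interspersed commas)."""
    s = zeros_with_commas(MAGNITUDE_ZEROS[word])
    popped = 0
    while popped < d:
        if s.endswith(","):
            s = s[:-1]  # commas between popped zeros are removed too
        else:
            s = s[:-1]
            popped += 1
    return s.rstrip(",")

print(magnitude_pop("billion", 1))  # 000,000,00
print(magnitude_pop("million", 2))  # 000,0
```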

To format cardinal numbers, these rewrite options are combined with the SpaceNo option. Table 7 demonstrates how this works for a non-trivial cardinal number.

Table 7. Spoken form, labels and written form for a cardinal number
Spoken Form | Label | Written Form
twenty | RewriteCardinalDecadeFirstDigit_SpaceNo | 2
five | RewriteCardinal_AppendComma_SpaceNo | 5,
thousand | RewriteDelete_SpaceNo |
six | RewriteCardinal_SpaceNo | 6
hundred | RewriteCardinalMagnitudePop1_SpaceNo | 0
and | RewriteDelete |
one | RewriteCardinal_SpaceNo | 1
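Putting the labels of Table 7 together, the following sketch applies the label sequence for "twenty five thousand six hundred and one"; the per-token rewrite results are hardcoded for brevity:

```python
# One (rewritten_segment, append, space_before) triple per token; None marks
# a deleted token (RewriteDelete), which contributes nothing to the output.
LABELED = [
    ("twenty",   ("2", "", False)),   # RewriteCardinalDecadeFirstDigit_SpaceNo
    ("five",     ("5", ",", False)),  # RewriteCardinal_AppendComma_SpaceNo
    ("thousand", None),               # RewriteDelete_SpaceNo
    ("six",      ("6", "", False)),   # RewriteCardinal_SpaceNo
    ("hundred",  ("0", "", False)),   # RewriteCardinalMagnitudePop1_SpaceNo
    ("and",      None),               # RewriteDelete
    ("one",      ("1", "", False)),   # RewriteCardinal_SpaceNo
]

def render(labeled):
    out = []
    for token, edit in labeled:
        if edit is None:
            continue  # deleted tokens vanish from the written form
        segment, append, space = edit
        if out and space:
            out.append(" ")
        out.append(segment + append)
    return "".join(out)

print(render(LABELED))  # 25,601
```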

The rewrite options are designed such that many different numbers will share the same label sequence. This is illustrated in Table 8. Also note that the label sequence at the end of Table 8 matches the label sequence in Table 7. This sharing of label sequences across many numbers simplifies the modeling task.

Table 8. Spoken form, labels and written form for two cardinal numbers with the same label sequence
Spoken Form A | Spoken Form B | Label | Written Form A | Written Form B
one | two | RewriteCardinal_AppendComma_SpaceNo | 1, | 2,
million | billion | RewriteCardinalMagnitudePop5_SpaceNo | 0 | 000,0
twenty | thirty | RewriteCardinalDecadeFirstDigit_SpaceNo | 2 | 3
five | six | RewriteCardinal_AppendComma_SpaceNo | 5, | 6,
thousand | thousand | RewriteDelete_SpaceNo | |
six | seven | RewriteCardinal_SpaceNo | 6 | 7
hundred | hundred | RewriteCardinalMagnitudePop1_SpaceNo | 0 | 0
and | and | RewriteDelete_SpaceNo | |

An alternative approach, also supported in this framework, would be to format cardinals using a post-processing grammar. This would require writing a complete number grammar. The advantage of the approach we describe here is that only the atomic spoken-form token operations need to be specified. We can rely on the label prediction model to learn the intricacies of the grammar.

Modeling and Results

Using this approach to cast ITN as a labeling problem, we build a system using a bi-directional LSTM [2], [3] as the label prediction model. We wish to estimate the performance of the system in the scenario where a practical amount of training data (in the form of spoken form, written form pairs) is available. In lieu of hand-transcribed data, we train and evaluate the system using the output of a very accurate rule-based system, with the belief that the results will generalize to the hand-transcribed data scenario.
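As a sketch of the label prediction model, here is a bi-directional LSTM tagger in PyTorch; the layer sizes, vocabulary size, and label inventory size are illustrative, not the production configuration:

```python
import torch
import torch.nn as nn

class LabelTagger(nn.Module):
    """Bi-directional LSTM that predicts one ITN label per input token."""
    def __init__(self, vocab_size, num_labels, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        # (batch, time, 2 * hidden_dim) from the two LSTM directions
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.proj(hidden)  # per-token label logits

model = LabelTagger(vocab_size=10000, num_labels=200)
logits = model(torch.randint(0, 10000, (1, 7)))  # one 7-token utterance
print(logits.shape)  # torch.Size([1, 7, 200])
```

Training reduces to standard sequence labeling: a cross-entropy loss between these per-token logits and the label sequences obtained from the inference procedure above.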

Figure 1 gives the accuracies of our system on randomly selected Siri utterances when trained on 500K, 1M, and 5M utterances. If we zoom in on utterances where ITN performs a significant transformation, including those with dates, times, and currency expressions, accuracy exceeds 90% for nearly all entity types when training with 1M utterances.

Figure 1. Accuracy of the models trained using 500K, 1M, and 5M utterances. Accuracy increases with the number of training utterances: 99.46%, 99.62%, and 99.85%, respectively.

Given these results, if you are starting from scratch to develop an ITN component, we believe this approach represents a practical path to a system that can be deployed at scale.

In our production scenarios, we begin with reasonably accurate rule-based systems, and we have access to very large amounts of real spoken-form data. We found that by using the rule-based systems to generate written-form outputs, selecting our training data carefully, and performing some manual correction of the data, we can build systems with this approach that achieve better accuracies than our rule-based systems. These new systems will allow us to achieve further improvements through additional curation of the training data, a process requiring less specialized knowledge than rule writing. In addition, these new systems can capture transformations that occur in long and varied contexts. Such transformations are difficult to implement with rules.

You can find a more thorough description of the modeling approach we took for label prediction, as well as a more extensive performance evaluation, in [4].


[1] R. Sproat and N. Jaitly, RNN Approaches to Text Normalization: A Challenge, arXiv preprint arXiv:1611.00068, 2016.

[2] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[3] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[4] E. Pusateri, B. R. Ambati, E. Brooks, O. Platek, D. McAllaster, and V. Nagesha, A Mostly Data-driven Approach to Inverse Text Normalization, Interspeech, 2017.
