August 2, 2017

Inverse Text Normalization as a Labeling Problem

Authors: Siri Team

Siri displays entities like dates, times, addresses, and currency amounts in a nicely formatted way. This is the result of applying a process called inverse text normalization (ITN) to the output of a core speech recognition component. To understand the important role ITN plays, consider that, without it, Siri would display “October twenty third twenty sixteen” instead of “October 23, 2016”. In this work, we show that ITN can be formulated as a labeling problem, allowing for the application of a statistical model that is relatively simple, compact, fast to train, and fast to apply. We demonstrate that this approach represents a practical path to a data-driven ITN system.

Introduction

In most speech recognition systems, a core speech recognizer produces a spoken-form token sequence which is converted to written form through a process called inverse text normalization (ITN). ITN includes formatting entities like numbers, dates, times, and addresses. Table 1 shows examples of spoken-form input and written-form output.

Table 1. Examples of spoken-form input and written-form output

Spoken Form	Written Form
one forty one Dorchester Avenue Salem Massachusetts	141 Dorchester Ave., Salem, MA
set an alarm for five thirty p.m.	Set an alarm for 5:30 PM
add an appointment on September sixteenth twenty seventeen	Add an appointment on September 16, 2017
twenty percent of fifteen dollars seventy three	20% of $15.73
what is two hundred seven point three plus six	What is 207.3+6

Our goal is to build a data-driven system for ITN. In thinking about how we’d formulate the problem to allow us to apply a statistical model, we could consider naively tokenizing the written-form output by segmenting at spaces. If we were to do that, the output token at a particular position would not necessarily correspond to the input token at that position. Indeed, the number of output tokens would not always be equal to the number of input tokens. At first glance, this problem appears to require a fairly unconstrained sequence-to-sequence model, like those often applied for machine translation.

In fact, it is possible to train a sequence-to-sequence model for this task. However, such models are complex: they can be data hungry, and both training and inference can be computationally expensive. In addition, the lack of constraint in such models can lead to particularly egregious and jarring errors when applied. (See [1], where a sequence-to-sequence model is trained for a related task.) Our preliminary experiments with this sort of model did not yield promising results.

In this article, we show that with a few tables and simple grammars, we can turn the ITN modeling problem into one of labeling spoken-form input tokens. By doing so, we are able to apply simpler and more compact models that achieve good performance with practical amounts of training data.

Approach

We begin with two observations:

There is usually a clear correspondence between spoken-form tokens and segments of the written-form output string.
Most of the time, the segments in the written-form output are in the same order as their corresponding spoken-form tokens.

Given those two observations, we formulate ITN as the following three-step process:

Label Assignment: Assign a label to each spoken-form input token. A label specifies edits to perform to the spoken-form token string in order to obtain its corresponding written-form segment. A label may also indicate the start or end of a region where a post-processing grammar should be applied.
Label Application: Apply the edits specified by the labels. The result will be a written-form string, possibly with regions marked for post-processing.
Post-processing: Apply the appropriate post-processing grammar to any regions marked for post-processing.

The following two tables give examples of the spoken-form token sequences with assigned labels, output after label application, and final written-form output after application of post-processing grammars. For simplicity, we ignore the fields that have default values.

Table 2A. The spoken-form token sequence with corresponding labels for "February 20, 2017"

Spoken Form	Label	Label Applied	Written Form
February	Default	February	February
twentieth	RewriteOrdinalAsCardinalDecade_AppendComma	20,	20,
twenty	RewriteCardinalDecade	20	20
seventeen	RewriteCardinalTeen_SpaceNo	17	17

Table 2B. The spoken-form token sequence with corresponding labels for "20% of $205"

Spoken Form	Label	Label Applied	Written Form
twenty	RewriteCardinalDecade	20	20
percent	RewritePercentSign_SpaceNo	%	%
of	Default	of	of
two	RewriteCardinal_StartMajorCurrency	<MajorCurrency>2
hundred	RewriteMagnitudePop1_SpaceNo	0
five	RewriteCardinal_SpaceNo	5
dollars	RewriteCurrencySymbol_SpaceNo_EndMajorCurrency	$<\MajorCurrency>	$205

Labels

A label is a data structure consisting of the following fields:

Rewrite. Determines if and how the spoken-form token string should be re-written. Built-in values for this field include Capitalize and Downcase. The default value represents leaving the spoken-form token string as it is.
Prepend. Determines what string, if any, should be prepended to the spoken-form token string. The default value represents prepending nothing.
Append. Determines what string, if any, should be appended to the spoken-form token string. The default value represents appending nothing.
Space. Determines whether or not a space should be inserted before the spoken-form token string. The default value represents inserting a space.
Post Start. Determines if this represents the start of a region that needs to be post-processed and, if so, the post-processing grammar that should apply.
Post End. Determines if this represents the end of a region that needs to be post-processed and, if so, the post-processing grammar that should apply.

For each field, each value corresponds to a string-to-string transformation. We use a finite-state transducer (FST), a finite-state automaton with both input and output symbols, to encode each transformation. Applying a label to an input token consists of applying, in sequence, the FST for the value in each field.
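To make the label structure concrete, here is a minimal Python sketch of a label and its application. It uses plain dictionaries and string operations in place of FSTs, so it illustrates the idea rather than the implementation described here. The names `Label`, `apply_label`, and `render`, the `<Name>…</Name>` region-marker syntax, the `Delete` handling, and the order in which fields are applied are all assumptions made for illustration. The table entries are a small subset chosen to cover the examples in Tables 2A and 2B.

```python
from dataclasses import dataclass

# Small illustrative excerpts; the real tables are much larger.
REWRITE_TABLE = {
    ("Cardinal", "two"): "2",
    ("Cardinal", "five"): "5",
    ("CardinalDecade", "twenty"): "20",
    ("CardinalTeen", "seventeen"): "17",
    ("OrdinalAsCardinalDecade", "twentieth"): "20",
    ("MagnitudePop1", "hundred"): "0",
    ("PercentSign", "percent"): "%",
    ("CurrencySymbol", "dollars"): "$",
}
APPEND_TABLE = {"Period": ".", "Comma": ",", "Colon": ":"}
PREPEND_TABLE = {}  # the prepend table is not shown in the article; left empty here


@dataclass(frozen=True)
class Label:
    rewrite: str = "Default"   # rewrite option, or Default / Capitalize / Downcase / Delete
    prepend: str = "Default"   # key into PREPEND_TABLE
    append: str = "Default"    # key into APPEND_TABLE
    space: bool = True         # insert a space before the token (the default)
    post_start: str = ""       # grammar name if a post-processing region starts here
    post_end: str = ""         # grammar name if a post-processing region ends here


def apply_label(token: str, label: Label) -> str:
    """Apply each field of the label in turn to one spoken-form token."""
    if label.rewrite == "Default":
        text = token
    elif label.rewrite == "Capitalize":
        text = token.capitalize()
    elif label.rewrite == "Downcase":
        text = token.lower()
    elif label.rewrite == "Delete":
        text = ""
    else:
        text = REWRITE_TABLE[(label.rewrite, token)]
    if label.prepend != "Default":
        text = PREPEND_TABLE[label.prepend] + text
    if label.append != "Default":
        text = text + APPEND_TABLE[label.append]
    if label.post_start:
        text = f"<{label.post_start}>" + text   # hypothetical marker syntax
    if label.post_end:
        text = text + f"</{label.post_end}>"    # hypothetical marker syntax
    return (" " if label.space else "") + text


def render(tokens, labels) -> str:
    """Concatenate the per-token segments and drop the leading space."""
    return "".join(apply_label(t, l) for t, l in zip(tokens, labels)).lstrip(" ")


# Table 2A: "February twentieth twenty seventeen"
date_tokens = ["February", "twentieth", "twenty", "seventeen"]
date_labels = [
    Label(),
    Label(rewrite="OrdinalAsCardinalDecade", append="Comma"),
    Label(rewrite="CardinalDecade"),
    Label(rewrite="CardinalTeen", space=False),
]
print(render(date_tokens, date_labels))  # February 20, 2017

# Table 2B: the currency region is only marked here, not yet reordered.
money_tokens = ["twenty", "percent", "of", "two", "hundred", "five", "dollars"]
money_labels = [
    Label(rewrite="CardinalDecade"),
    Label(rewrite="PercentSign", space=False),
    Label(),
    Label(rewrite="Cardinal", post_start="MajorCurrency"),
    Label(rewrite="MagnitudePop1", space=False),
    Label(rewrite="Cardinal", space=False),
    Label(rewrite="CurrencySymbol", space=False, post_end="MajorCurrency"),
]
print(render(money_tokens, money_labels))
```

Because every field has a default, a label only needs to name the fields that differ from it, which mirrors how the tables above omit default-valued fields.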

We generate the FSTs for the Rewrite, Prepend, and Append fields from tables, as shown in the next two tables.

Table 3. Excerpts from the rewrite table

Option	Spoken Form	Written Form
...
Cardinal	one	1
Cardinal	two	2
...
Magnitude	hundred	00
MagnitudePop1	hundred	0
...
MagnitudePop1	million	000,00
MagnitudePop2	million	000,0
...
AbbreviateMeasure	pounds	lbs.
AbbreviateMeasure	zettabytes	ZB
...
CurrencySymbol	pounds	£
...

Table 4. The append table

Option	String
Period	.
Comma	,
Colon	:
Hyphen	-
Slash	/
Square	²
Cube	³

Post-processing

For most of the ITN transformations we observe, label assignment and application would be sufficient to produce the correct written-form output. However, a few phenomena require additional post-processing. First, in currency expressions, we use a post-processing grammar to reorder the currency symbol and major currency amount. Second, we use a post-processing grammar to properly format relative time expressions like “twenty five minutes to four.” Finally, we use post-processing grammars for formatting roman numerals in spoken-form token sequences like “roman numeral five hundred seven.” In this last case, the label mechanism would be sufficient, but it is straightforward to write a language-independent grammar that converts numbers to roman numerals. Table 5 gives an example for each of these cases. Post-processing grammars are compiled into FSTs.

Table 5. Examples of applying post-processing grammar

Applying Label Edits	Written Form
You owe me \3$.50	You owe me $3.50
Meet me at \25 minutes to 4	Meet me at 3:35
Super Bowl \51	Super Bowl LI
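The three post-processing cases can be sketched in Python as follows. This is an illustration under stated assumptions, not the grammars used in the system, which are compiled into FSTs. The `<Name>…</Name>` region-marker syntax, the grammar names `RelativeTime` and `RomanNumeral`, and all function names are hypothetical; regular expressions stand in for the grammars.

```python
import re

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def major_currency(region: str) -> str:
    """Move a trailing currency symbol in front of the amount: '3$' -> '$3'."""
    m = re.fullmatch(r"([\d,]+)(\D)", region)
    return m.group(2) + m.group(1) if m else region


def relative_time(region: str) -> str:
    """Turn 'M minutes to H' into a clock time: '25 minutes to 4' -> '3:35'."""
    m = re.fullmatch(r"(\d+) minutes to (\d+)", region)
    if not m:
        return region
    minutes, hour = int(m.group(1)), int(m.group(2))
    previous_hour = 12 if hour == 1 else hour - 1
    return f"{previous_hour}:{60 - minutes:02d}"


# Grammar names other than MajorCurrency are assumptions for this sketch.
GRAMMARS = {
    "MajorCurrency": major_currency,
    "RelativeTime": relative_time,
    "RomanNumeral": lambda region: to_roman(int(region)),
}

REGION = re.compile(r"<(\w+)>(.*?)</\1>")  # hypothetical marker syntax


def postprocess(text: str) -> str:
    """Rewrite each marked region with its grammar and strip the markers."""
    return REGION.sub(lambda m: GRAMMARS[m.group(1)](m.group(2)), text)


print(postprocess("You owe me <MajorCurrency>3$</MajorCurrency>.50"))          # You owe me $3.50
print(postprocess("Meet me at <RelativeTime>25 minutes to 4</RelativeTime>"))  # Meet me at 3:35
print(postprocess("Super Bowl <RomanNumeral>51</RomanNumeral>"))               # Super Bowl LI
```

Text outside a marked region passes through unchanged, so the post-processor only has to handle the few phenomena that labels alone cannot express.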

Label Inference

If we are to train a model to predict labels, we need a way to obtain label sequences from (spoken form, written form) pairs. We do this using an FST-based approach. First, we construct the following FSTs:

E: An expander FST. This takes a spoken-form token sequence as input and produces an output FST that allows all possible labels to be prepended to each token.
A₁, A₂, …, A_F: A set of applier FSTs, one for each of the F label fields. Applier A_f takes as input a token sequence with labels prepended and, for each token, applies the option that its label specifies for field f.
R: A renderer FST. This takes as input a token sequence with labels prepended after labels have been applied, and then strips the labels. The output is the written form, possibly with regions marked for post-processing.
P: A post-processor FST. This takes as input a written form string, possibly with regions marked for post-processing and does the appropriate post processing. The output is the final written-form output.

We apply the series of operations shown in the following equations to an FST, S, constructed from the spoken-form input. Note that the Proj_in operation copies each arc input label to its output label and Proj_out does the opposite. W is an FST constructed from the written-form output. Label sequences compatible with the spoken form/written form pair can easily be extracted from S_labeled. For a vast majority of cases in our data, there is only a single label sequence. In cases with multiple compatible label sequences, we choose one by using a label bigram model trained from unambiguous cases.

S_{exp} = {Proj}_{out}(S \circ E)

W_{proc} = S_{exp} \circ A_{Rewrite} \circ \dots \circ A_{PostEnd}

S_{labeled} = {Proj}_{in}(W_{proc} \circ R \circ P \circ W)
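The effect of these compositions can be imitated on toy inputs with a brute-force search. The sketch below deliberately swaps FST composition for exhaustive enumeration in plain Python: it tries every label for every token (the role of E), applies it (the role of the appliers and R), and keeps the sequences whose rendering equals the written form (the role of composing with W). It omits post-processing, and the label inventory, table entries, and function names are assumptions chosen for illustration. Enumeration grows exponentially with sentence length, so this only works for very short inputs.

```python
from itertools import product

# Illustrative excerpts only.
REWRITE_TABLE = {
    ("Cardinal", "five"): "5",
    ("CardinalDecade", "thirty"): "30",
    ("CardinalDecadeFirstDigit", "thirty"): "3",
}
REWRITES = ["Default", "Delete", "Cardinal", "CardinalDecade", "CardinalDecadeFirstDigit"]
APPEND_TABLE = {"Default": "", "Comma": ",", "Colon": ":"}


def candidate_segments(token):
    """Yield (label, segment) for every label that is applicable to the token."""
    for rewrite in REWRITES:
        if rewrite == "Default":
            text = token
        elif rewrite == "Delete":
            text = ""
        elif (rewrite, token) in REWRITE_TABLE:
            text = REWRITE_TABLE[(rewrite, token)]
        else:
            continue  # this rewrite option has no entry for this token
        for append, suffix in APPEND_TABLE.items():
            for space in (True, False):
                label = (rewrite, append, space)
                yield label, (" " if space else "") + text + suffix


def infer_labels(spoken: str, written: str):
    """Return every label sequence that turns the spoken form into the written form."""
    options = [list(candidate_segments(token)) for token in spoken.split()]
    matches = []
    for combo in product(*options):
        rendered = "".join(segment for _, segment in combo).lstrip(" ")
        if rendered == written:
            matches.append([label for label, _ in combo])
    return matches


for sequence in infer_labels("at five thirty", "at 5:30"):
    print(sequence)
```

In this sketch the leading space is stripped when rendering, so the Space value of the first token does not affect the output and more than one sequence can match. That is a small instance of the ambiguity that a label bigram model is used to resolve.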

Example: Cardinal Numbers

Let’s take a look at how we can cast ITN as a labeling problem specifically for cardinal numbers. Table 6 gives the relevant lines from the rewrite table. Note that the options with a PopD suffix, where D is a number, represent turning a magnitude into a sequence of zeros with commas in the appropriate places and then removing the last D zeros (and any interspersed commas).

Table 6. Excerpt of the rewrite table relevant for formatting cardinals

Option	Spoken Form	Written Form
...
Cardinal	one	1
Cardinal	two	2
...
CardinalTwoDigit	ten	10
CardinalTwoDigit	eleven	11
...
CardinalDecade	twenty	20
CardinalDecade	thirty	30
...
CardinalDecadeFirstDigit	twenty	2
CardinalDecadeFirstDigit	thirty	3
...
CardinalMagnitudePop0	hundred	00
CardinalMagnitudePop0	thousand	000
CardinalMagnitudePop0	million	000,000
...
CardinalMagnitudePop1	million	000,00
CardinalMagnitudePop1	billion	000,000,00
...
CardinalMagnitudePop2	million	000,0
CardinalMagnitudePop2	billion	000,000,0
...
CardinalMagnitudePop5	million	0
CardinalMagnitudePop5	billion	000,0
...
CardinalMagnitudePop11	trillion	0
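The PopD entries in Table 6 follow a regular pattern, which the short Python sketch below reproduces. The function name `magnitude_pop` and the `MAGNITUDE_ZEROS` mapping are assumptions for illustration. The handling of a comma left at the very end (for example when exactly three zeros are removed from "million") is also an assumption, since the table excerpt does not show that case.

```python
# Number of zeros contributed by each magnitude word.
MAGNITUDE_ZEROS = {"hundred": 2, "thousand": 3, "million": 6, "billion": 9, "trillion": 12}


def magnitude_pop(word: str, d: int) -> str:
    """Written form for a magnitude word under the CardinalMagnitudePop<d> option."""
    zeros = MAGNITUDE_ZEROS[word]
    if not 0 <= d < zeros:
        raise ValueError(f"cannot remove {d} zeros from {word!r}")
    # Format 10**zeros with thousands separators, then drop the leading "1".
    out = f"{10 ** zeros:,}"[1:].lstrip(",")
    for _ in range(d):
        # Drop any comma exposed at the end, then remove one zero.
        out = out.rstrip(",")[:-1]
    # Assumption: a comma left dangling at the end is dropped as well.
    return out.rstrip(",")


print(magnitude_pop("million", 0))    # 000,000
print(magnitude_pop("million", 1))    # 000,00
print(magnitude_pop("billion", 5))    # 000,0
print(magnitude_pop("trillion", 11))  # 0
```

Generating the entries this way means the rewrite table for magnitudes does not have to be written out by hand for every value of D.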

To format cardinal numbers, these rewrite options are combined with the No-space option. Table 7 demonstrates how this works for a non-trivial cardinal number.

Table 7. Spoken form, labels and written form for a cardinal number

Spoken Form	Label	Written Form
twenty	RewriteCardinalDecadeFirstDigit_SpaceNo	2
five	RewriteCardinal_AppendComma_SpaceNo	5,
thousand	RewriteDelete_SpaceNo
six	RewriteCardinal_SpaceNo	6
hundred	RewriteCardinalMagnitudePop1_SpaceNo	0
and	RewriteDelete_SpaceNo
one	RewriteCardinal_SpaceNo	1

The rewrite options are designed such that many different numbers will share the same label sequence. This is illustrated in Table 8. Also note that the label sequence at the end of Table 8 matches the label sequence in Table 7. This sharing of label sequences across many numbers simplifies the modeling task.

Table 8. Spoken form, labels and written form for two cardinal numbers with the same label sequence

Spoken Form A	Spoken Form B	Label	Written Form A	Written Form B
one	two	RewriteCardinal_AppendComma_SpaceNo	1,	2,
million	billion	RewriteCardinalMagnitudePop5_SpaceNo	0	000,0
twenty	thirty	RewriteCardinalDecadeFirstDigit_SpaceNo	2	3
five	six	RewriteCardinal_AppendComma_SpaceNo	5,	6,
thousand	thousand	RewriteDelete_SpaceNo
six	seven	RewriteCardinal_SpaceNo	6	7
hundred	hundred	RewriteCardinalMagnitudePop1_SpaceNo	0	0
and	and	RewriteDelete_SpaceNo
one	two	RewriteCardinal_SpaceNo	1	2
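The sharing shown in Table 8 can be checked with a few lines of Python. This is a sketch only: it parses the label names used in the tables by splitting on underscores, which is an assumption about how such names could be decoded, and it includes only the table rows needed for these two numbers. The names `apply_label` and `render` are hypothetical.

```python
# Rewrite-table excerpts needed for the two numbers in Table 8.
REWRITE_TABLE = {
    "Cardinal": {"one": "1", "two": "2", "five": "5", "six": "6", "seven": "7"},
    "CardinalDecadeFirstDigit": {"twenty": "2", "thirty": "3"},
    "CardinalMagnitudePop1": {"hundred": "0"},
    "CardinalMagnitudePop5": {"million": "0", "billion": "000,0"},
}
APPEND_TABLE = {"Comma": ","}

# The single label sequence from Table 8.
LABELS = [
    "RewriteCardinal_AppendComma_SpaceNo",
    "RewriteCardinalMagnitudePop5_SpaceNo",
    "RewriteCardinalDecadeFirstDigit_SpaceNo",
    "RewriteCardinal_AppendComma_SpaceNo",
    "RewriteDelete_SpaceNo",
    "RewriteCardinal_SpaceNo",
    "RewriteCardinalMagnitudePop1_SpaceNo",
    "RewriteDelete_SpaceNo",
    "RewriteCardinal_SpaceNo",
]


def apply_label(token: str, label: str) -> str:
    """Decode a label name such as 'RewriteCardinal_AppendComma_SpaceNo' and apply it."""
    text, space = token, " "
    for part in label.split("_"):
        if part == "SpaceNo":
            space = ""
        elif part == "RewriteDelete":
            text = ""
        elif part.startswith("Rewrite"):
            text = REWRITE_TABLE[part[len("Rewrite"):]][token]
        elif part.startswith("Append"):
            text += APPEND_TABLE[part[len("Append"):]]
    return space + text


def render(spoken: str, labels) -> str:
    """Apply one label per token and concatenate the resulting segments."""
    tokens = spoken.split()
    return "".join(apply_label(t, l) for t, l in zip(tokens, labels)).lstrip(" ")


print(render("one million twenty five thousand six hundred and one", LABELS))
# 1,025,601
print(render("two billion thirty six thousand seven hundred and two", LABELS))
# 2,000,036,702
```

The same nine labels format both numbers; only the table lookups differ, which is what keeps the label inventory and the prediction task small.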

An alternative approach, also supported in this framework, would be to format cardinals using a post-processing grammar. This would require writing a complete number grammar. The advantage of the approach we describe here is that only the atomic spoken-form token operations need to be specified. We can rely on the label prediction model to learn the intricacies of the grammar.

Modeling and Results

Using this approach to cast ITN as a labeling problem, we build a system using a bi-directional LSTM [2], [3] as the label prediction model. We wish to estimate the performance of the system in the scenario where a practical amount of training data, in the form of (spoken form, written form) pairs, is available. In lieu of hand-transcribed data, we train and evaluate the system using the output of a very accurate rule-based system, with the belief that the results will generalize to the hand-transcribed data scenario.

Figure 1 gives the accuracies of our system on randomly selected Siri utterances when trained on 500K, 1M, and 5M utterances. If we zoom in on utterances where ITN performs a significant transformation, including those with dates, times, and currency expressions, accuracy exceeds 90% for nearly all entity types when training with 1M utterances.

Figure 1. Accuracy of the models trained using 500K, 1M and 5M utterances. Accuracy increases with the number of training utterances: 99.46%, 99.62%, and 99.85%, respectively.

Given these results, if you are starting from scratch to develop an ITN component, we believe this approach represents a practical path to a system that can be deployed at scale.

In our production scenarios, we begin with reasonably accurate rule-based systems, and we have access to very large amounts of real spoken-form data. We found that by using the rule-based systems to generate written-form outputs, selecting our training data carefully, and performing some manual correction of the data, we can build systems with this approach that achieve better accuracies than our rule-based systems. These new systems will allow us to achieve further improvements through additional curation of the training data, a process requiring less specialized knowledge than rule writing. In addition, these new systems can capture transformations that occur in long and varied contexts. Such transformations are difficult to implement with rules.

You can find a more thorough description of the modeling approach that we took for label prediction, as well as more extensive performance evaluation in [4].

References

[1] R. Sproat and N. Jaitly, RNN Approaches to Text Normalization: A Challenge, arXiv preprint arXiv:1611.00068, 2016. https://arxiv.org/abs/1611.00068

[2] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[3] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[4] E. Pusateri, B.R. Ambati, E. Brooks, O. Platek, D. McAllaster, V. Nagesha, A Mostly Data-driven Approach to Inverse Text Normalization, Interspeech, 2017.
