
SPEECH RECOGNITION WITH WEIGHTED FINITE-STATE TRANSDUCERS

Mehryar Mohri 1,3
1 Courant Institute, 251 Mercer Street, New York, NY 10012
[email protected]

Fernando Pereira 2
2 University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104
[email protected]

Michael Riley 3
3 Google Research, 76 Ninth Avenue, New York, NY 10011
[email protected]

ABSTRACT

This chapter describes a general representation and algorithmic framework for speech recognition based on weighted finite-state transducers. These transducers provide a common and natural representation for major components of speech recognition systems, including hidden Markov models (HMMs), context-dependency models, pronunciation dictionaries, statistical grammars, and word or phone lattices. General algorithms for building and optimizing transducer models are presented, including composition for combining models, weighted determinization and minimization for optimizing time and space requirements, and a weight pushing algorithm for redistributing transition weights optimally for speech recognition. The application of these methods to large-vocabulary recognition tasks is explained in detail, and experimental results are given, in particular for the North American Business News (NAB) task, in which these methods were used to combine HMMs, full cross-word triphones, a lexicon of forty thousand words, and a large trigram grammar into a single weighted transducer that is only somewhat larger than the trigram word grammar and that runs NAB in real-time on a very simple decoder. Another example demonstrates that the same methods can be used to optimize lattices for second-pass recognition.

1. INTRODUCTION

Much of current large-vocabulary speech recognition is based on models such as hidden Markov models (HMMs), lexicons, or n-gram statistical language models that can be represented by weighted finite-state transducers. Even when richer models are used, for instance context-free grammars for spoken-dialog applications, they are often restricted, for efficiency reasons, to regular subsets, either by design or by approximation [Pereira and Wright, 1997, Nederhof, 2000, Mohri and Nederhof, 2001].

A finite-state transducer is a finite automaton whose state transitions are labeled with both input and output symbols. Therefore, a path through the transducer encodes a mapping from an input symbol sequence, or string, to an output string. A weighted transducer puts weights on transitions in addition to the input and output symbols. Weights may encode probabilities, durations, penalties, or any other quantity that accumulates along paths to compute the overall weight of mapping an input string to an output string. Weighted transducers are thus a natural choice to represent the probabilistic finite-state models prevalent in speech processing.

We present a detailed view of the use of weighted finite-state transducers in speech recognition [Mohri et al., 2000, Pereira and Riley, 1997, Mohri, 1997, Mohri et al., 1996, Mohri and Riley, 1998, Mohri et al., 1998, Mohri and Riley, 1999]. We show that common methods for combining and optimizing probabilistic models in speech processing can be generalized and efficiently implemented by translation to mathematically well-defined operations on weighted transducers. Furthermore, new optimization opportunities arise from viewing all symbolic levels of speech recognition modeling as weighted transducers. Thus, weighted finite-state transducers define a common framework with shared algorithms for the representation and use of the models in speech recognition that has important algorithmic and software engineering benefits.

We begin with an overview in Section 2, which informally introduces weighted finite-state transducers and algorithms, and motivates the methods by showing how they are applied to speech recognition. This section may suffice for those only interested in a brief tour of these methods. In the subsequent two sections, we give a more detailed and precise account. Section 3 gives formal definitions of weighted finite-state transducer concepts and corresponding algorithm descriptions. Section 4 gives a detailed description of how to apply these methods to large-vocabulary speech recognition and shows performance results. These sections are appropriate for those who wish to understand the algorithms more fully or wish to replicate the results.

2. OVERVIEW

We start with an informal overview of weighted automata and transducers, outline some of the key algorithms that support the ASR applications described in this chapter (composition, determinization, and minimization), and describe their application to speech recognition.

2.1. Weighted Acceptors

Weighted finite automata (or weighted acceptors) are used widely in automatic speech recognition (ASR). Figure 1 gives simple, familiar examples of weighted automata as used in ASR. The automaton in Figure 1(a) is a toy finite-state language model. The legal word strings are specified by the words along each complete path, and their probabilities by the product of the corresponding transition probabilities. The automaton in Figure 1(b) gives the possible pronunciations of one word, data, used in the language model. Each legal pronunciation is the phone string along a complete path, and its probability is given by the product of the corresponding transition probabilities. Finally, the automaton in Figure 1(c) encodes a typical left-to-right, three-distribution-HMM structure for one phone, with the labels along a complete path specifying legal strings of acoustic distributions for that phone.

These automata consist of a set of states, an initial state, a set of final states (with final weights), and a set of transitions between states. Each transition has a source state, a destination state, a label and a weight. Such automata are called weighted finite-state acceptors (WFSA), since they accept or recognize each string that can be read along a path from the start state to a final state. Each accepted string is assigned a weight, namely the accumulated weights along accepting paths for that string, including final weights. An acceptor as a whole represents a set of strings, namely those that it accepts. As a weighted acceptor, it also associates to each accepted string the accumulated weight of its accepting paths.
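To make this concrete, the following sketch (illustrative Python, not the library described later in this chapter) stores a weighted acceptor as a start state, final weights, and labeled, weighted transitions, and scores a string over the tropical semiring by summing weights along each accepting path and taking the minimum; the toy automaton is loosely modeled on the language model of Figure 1(a).

```python
# Minimal weighted finite-state acceptor over the tropical semiring
# (weights are negative log probabilities; the path operation is +, path combination is min).
# Illustrative sketch only; names and data layout are assumptions.
import math

INF = float("inf")

class WFSA:
    def __init__(self, start, finals, transitions):
        self.start = start                   # initial state
        self.finals = dict(finals)           # state -> final weight
        self.arcs = {}                       # state -> list of (label, weight, next state)
        for src, label, weight, dst in transitions:
            self.arcs.setdefault(src, []).append((label, weight, dst))

    def weight(self, string):
        """Return the tropical weight assigned to `string` (INF if it is rejected)."""
        best = INF
        stack = [(self.start, 0, 0.0)]       # (state, position in string, accumulated weight)
        while stack:
            state, pos, w = stack.pop()
            if pos == len(string) and state in self.finals:
                best = min(best, w + self.finals[state])      # combine accepting paths with min
            if pos < len(string):
                for label, wt, dst in self.arcs.get(state, []):
                    if label == string[pos]:
                        stack.append((dst, pos + 1, w + wt))  # accumulate weights along the path
        return best

# Toy language model loosely modeled on Figure 1(a), probabilities converted to -log.
lm = WFSA(start=0, finals={4: 0.0}, transitions=[
    (0, "using", -math.log(1.0), 1),
    (1, "data", -math.log(0.66), 2),
    (1, "intuition", -math.log(0.33), 2),
    (2, "is", -math.log(1.0), 3),
    (3, "better", -math.log(0.7), 4),
    (3, "worse", -math.log(0.3), 4),
])
print(lm.weight(("using", "data", "is", "better")))   # ≈ 0.772
```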

Speech recognition architectures commonly give the run-time decoder the task of combining and optimizing automata such as those in Figure 1. The decoder finds word pronunciations in its lexicon and substitutes them into the grammar. Phonetic tree representations may be used at this point to reduce path redundancy and thus improve search efficiency, especially for large vocabulary recognition [Ortmanns et al., 1996]. The decoder then identifies the correct context-dependent models to use for each phone in context, and finally substitutes them to create an HMM-level transducer. The software that performs these operations is usually tied to particular model topologies. For example, the context-dependent models might have to be triphonic, the grammar might be restricted to trigrams, and the alternative pronunciations might have to be enumerated in the lexicon. In addition, these automata combinations and optimizations are applied in a pre-programmed order to a pre-specified number of levels.

2.2. Weighted Transducers

Our approach uses finite-state transducers, rather than acceptors, to represent the n-gram grammars, pronunciation dictionaries, context-dependency specifications, HMM topology, word, phone or HMM segmentations, lattices and n-best output lists encountered in ASR. The transducer representation provides general methods for combining models and optimizing them, leading to both simple and flexible ASR decoder designs.

Figure 1: Weighted finite-state acceptor examples. By convention, the states are represented by circles and marked with their unique number. The initial state is represented by a bold circle, final states by double circles. The label l and weight w of a transition are marked on the corresponding directed arc by l/w. When explicitly shown, the final weight w of a final state f is marked by f/w.

A weighted finite-state transducer (WFST) is quite similar to a weighted acceptor except that it has an input label, an output label and a weight on each of its transitions. The examples in Figure 2 encode (a superset of) the information in the WFSAs of Figure 1(a)-(b) as WFSTs. Figure 2(a) represents the same language model as Figure 1(a) by giving each transition identical input and output labels. This adds no new information, but is a convenient way we use often to treat acceptors and transducers uniformly.

Figure 2(b) represents a toy pronunciation lexicon as a mapping from phone strings to words in the lexicon, in this example data and dew, with probabilities representing the likelihoods of alternative pronunciations. It transduces a phone string that can be read along a path from the start state to a final state to a word string with a particular weight. The word corresponding to a pronunciation is output by the transition that consumes the first phone for that pronunciation. The transitions that consume the remaining phones output no further symbols, indicated by the null symbol ε as the transition's output label. In general, an ε input label marks a transition that consumes no input, and an ε output label marks a transition that produces no output.

This transducer has more information than the WFSA in Figure 1(b). Since words are encoded by the output label, it is possible to combine the pronunciation transducers for more than one word without losing word identity. Similarly, HMM structures of the form given in Figure 1(c) can be combined into a single transducer that preserves phone model identity. This illustrates the key advantage of a transducer over an acceptor: the transducer can represent a relationship between two levels of representation, for instance between phones and words or between HMMs and context-independent phones. More precisely, a transducer specifies a binary relation between strings: two strings are in the relation when there is a path from an initial to a final state in the transducer that has the first string as the sequence of input labels along the path, and the second string as the sequence of output labels along the path (ε symbols are left out in both input and output). In general, this is a relation rather than a function since the same input string might be transduced to different strings along two distinct paths. For a weighted transducer, each string pair is also associated with a weight.
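The following toy sketch (hypothetical Python, with arcs loosely modeled on the lexicon of Figure 2(b)) illustrates this relational view: ε output labels produce no output, and transducing a phone string yields the word it spells together with the product of the transition probabilities along the path.

```python
# Toy weighted transducer loosely modeled on Figure 2(b): phone strings to words,
# with "eps" output labels on all but the first phone of each pronunciation.
# Illustrative sketch only; state numbering is an assumption.
EPS = "eps"

# (source, input label, output label, weight, destination); state 4 is final here.
ARCS = [
    (0, "d", "data", 1.0, 1),
    (1, "ey", EPS, 0.5, 2),
    (1, "ae", EPS, 0.5, 2),
    (2, "t", EPS, 0.3, 3),
    (2, "dx", EPS, 0.7, 3),
    (3, "ax", EPS, 1.0, 4),
    (0, "d", "dew", 1.0, 5),
    (5, "uw", EPS, 1.0, 4),
]
FINAL = {4}

def transduce(phones):
    """Return (word string, weight) pairs for all accepting paths over `phones`."""
    results = []
    stack = [(0, 0, [], 1.0)]            # (state, position, output labels so far, weight)
    while stack:
        state, pos, out, w = stack.pop()
        if pos == len(phones) and state in FINAL:
            results.append((" ".join(out), w))
        if pos < len(phones):
            for src, ilab, olab, wt, dst in ARCS:
                if src == state and ilab == phones[pos]:
                    new_out = out + [olab] if olab != EPS else out
                    stack.append((dst, pos + 1, new_out, w * wt))
    return results

print(transduce(["d", "ey", "dx", "ax"]))   # [('data', 0.35)]
```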

Figure 2: Weighted finite-state transducer examples. These are similar to the weighted acceptors in Figure 1 except output labels are introduced on each transition. The input label i, the output label o, and weight w of a transition are marked on the corresponding directed arc by i : o/w.

We rely on a common set of weighted transducer operations to combine, optimize, search and prune them [Mohri et al., 2000]. Each operation implements a single, well-defined function that has its foundations in the mathematical theory of rational power series [Salomaa and Soittola, 1978, Berstel and Reutenauer, 1988, Kuich and Salomaa, 1986]. Many of those operations are the weighted transducer generalizations of classical algorithms for unweighted acceptors. We have brought together those and a variety of auxiliary operations in a comprehensive weighted finite-state machine software library (FsmLib) available for non-commercial use from the AT&T Labs – Research Web site [Mohri et al., 2000].

Basic union, concatenation, and Kleene closure operations combine transducers in parallel, in series, and with arbitrary repetition, respectively. Other operations convert transducers to acceptors by projecting onto the input or output label set, find the best or the n best paths in a weighted transducer, remove unreachable states and transitions, and sort acyclic automata topologically.

Where possible, we provided lazy (also called on-demand) implementations of algorithms. Any finite-state object fsm can be accessed with the three main methods fsm.start(), fsm.final(state), and fsm.transitions(state) that return the start state, the final weight of a state, and the transitions leaving a state, respectively. This interface can be implemented for concrete automata in an obvious way: the methods simply return the requested information from a stored representation of the automaton. However, the interface can also be given lazy implementations. For example, the lazy union of two automata returns a new lazy fsm object. When the object is first constructed, the lazy implementation just initializes internal book-keeping data. It is only when the interface methods request the start state, the final weights, or the transitions (and their destination states) leaving a state, that this information is actually computed, and optionally cached inside the object for later reuse. This approach has the advantage that if only a part of the result of an operation is needed (for example in a pruned search), then the unused part is never computed, saving time and space. We refer the interested reader to the library documentation and an overview of the library [Mohri et al., 2000] for further details on lazy finite-state objects.
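A minimal sketch of such an interface and of a lazy union built on it is given below (hypothetical Python; the method names mirror the description above but are not FsmLib's actual API).

```python
# Sketch of the fsm.start() / fsm.final(state) / fsm.transitions(state) interface
# and a lazy union built on top of it. Names are illustrative assumptions.

class ConcreteFsm:
    def __init__(self, start, finals, arcs):
        self._start, self._finals, self._arcs = start, finals, arcs
    def start(self):
        return self._start
    def final(self, state):
        return self._finals.get(state)           # None means the state is not final
    def transitions(self, state):
        return self._arcs.get(state, [])         # list of (label, weight, next state)

class LazyUnion:
    """Union of two automata via a new super-initial state with eps-transitions."""
    def __init__(self, a, b):
        self._a, self._b = a, b                  # only book-keeping; nothing computed yet
        self._cache = {}
    def start(self):
        return ("u",)                            # new super-initial state
    def final(self, state):
        if state == ("u",):
            return None
        which, q = state
        return (self._a if which == 0 else self._b).final(q)
    def transitions(self, state):
        if state in self._cache:                 # computed at most once, then reused
            return self._cache[state]
        if state == ("u",):
            arcs = [("eps", 0.0, (0, self._a.start())),
                    ("eps", 0.0, (1, self._b.start()))]
        else:
            which, q = state
            fsm = self._a if which == 0 else self._b
            arcs = [(l, w, (which, n)) for (l, w, n) in fsm.transitions(q)]
        self._cache[state] = arcs
        return arcs

a = ConcreteFsm(0, {1: 0.0}, {0: [("x", 0.5, 1)]})
b = ConcreteFsm(0, {1: 0.0}, {0: [("y", 0.2, 1)]})
u = LazyUnion(a, b)
print(u.transitions(u.start()))   # computed on first request, then cached
```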

We now discuss the key transducer operations that are used in our speech applications for model combination, redundant path removal, and size reduction, respectively.

2.3. Composition

Figure 3: Example of transducer composition.

Composition is the transducer operation for combining different levels of representation. For instance, a pronunciation lexicon can be composed with a word-level grammar to produce a phone-to-word transducer whose word strings are restricted to the grammar. A variety of ASR transducer combination techniques, both context-independent and context-dependent, are conveniently and efficiently implemented with composition.

As previously noted, a transducer represents a binary relation between strings. The composition of two transducers represents their relational composition. In particular, the composition T = T1 ◦ T2 of two transducers T1 and T2 has exactly one path mapping string u to string w for each pair of paths, the first in T1 mapping u to some string v and the second in T2 mapping v to w. The weight of a path in T is computed from the weights of the two corresponding paths in T1 and T2 with the same operation that computes the weight of a path from the weights of its transitions. If the transition weights represent probabilities, that operation is the product. If instead the weights represent log probabilities or negative log probabilities, as is common in ASR for numerical stability, the operation is the sum. More generally, the weight operations for a weighted transducer can be specified by a semiring [Salomaa and Soittola, 1978, Berstel and Reutenauer, 1988, Kuich and Salomaa, 1986], as discussed in more detail in Section 3.

The weighted composition algorithm generalizes the classical state-pair construction for finite automata intersection [Hopcroft and Ullman, 1979] to weighted acceptors and transducers. The states of the composition T are pairs of a T1 state and a T2 state. T satisfies the following conditions: (1) its initial state is the pair of the initial state of T1 and the initial state of T2; (2) its final states are pairs of a final state of T1 and a final state of T2, and (3) there is a transition t from (q1, q2) to (r1, r2) for each pair of transitions t1 from q1 to r1 and t2 from q2 to r2 such that the output label of t1 matches the input label of t2. The transition t takes its input label from t1, its output label from t2, and its weight is the combination of the weights of t1 and t2 done with the same operation that combines weights along a path. Since this computation is local (it involves only the transitions leaving the two states being paired), it can be given a lazy implementation in which the composition is generated only as needed by other operations on the composed automaton. Transitions with ε-labels in T1 or T2 must be treated specially as discussed in Section 3. Figure 3 shows two simple transducers, Figure 3(a) and Figure 3(b), and the result of their composition, Figure 3(c). The weight of a path in the resulting transducer is the sum of the weights of the matching paths in T1 and T2 (as when the weights represent negative log probabilities).

Since we represent weighted acceptors by weighted transducers in which the input and output labels of each transition are identical, the intersection of two weighted acceptors is just the composition of the corresponding transducers.

2.4. Determinization

In a deterministic automaton, each state has at most one transition with any given input label and there are no input ε-labels. Figure 4(a) gives an example of a non-deterministic weighted acceptor: at state 0, for instance, there are two transitions with the same label a. The automaton in Figure 4(b), on the other hand, is deterministic.

The key advantage of a deterministic automaton over equivalent nondeterministic ones is its irredundancy: it contains at most one path matching any given input string, thereby reducing the time and space needed to process the string. This is particularly important in ASR due to pronunciation lexicon redundancy in large vocabulary tasks. The familiar tree lexicon in ASR is a deterministic pronunciation representation [Ortmanns et al., 1996].

To benefit from determinism, we use a determinization algorithm that transforms a non-deterministic weighted automaton into an equivalent deterministic automaton. Two weighted acceptors are equivalent if they associate the same weight to each input string; weights may be distributed differently along the paths of two equivalent acceptors. Two weighted transducers are equivalent if they associate the same output string and weights to each input string; the distribution of the weight or output labels along paths need not be the same in the two transducers.

If we apply the weighted determinization algorithm to the union of a set of chain automata, each representing a single word pronunciation, we obtain a tree-shaped automaton. However, the result of this algorithm on more general automata may not be a tree, and in fact may be much more compact than a tree. The algorithm can produce results for a broad class of automata with cycles, which have no tree representation.

Weighted determinization generalizes the classical subset method for determinizing unweighted finite automata [Aho et al., 1986]. Unlike in the unweighted case, not all weighted automata can be determinized. Conditions for determinizability are discussed in Section 3.3. Fortunately, most weighted automata used in speech processing can be either determinized directly or easily made determinizable with simple transformations, as we discuss in Sections 3.3 and 4.1. In particular, any acyclic weighted automaton can always be determinized.

To eliminate redundant paths, weighted determinization needs to calculate the combined weight of all paths with the same labeling. When each path represents a disjoint event with probability given by its weight, the combined weight, representing the probability of the common labeling for that set of paths, would be the sum of the weights of the paths. Alternatively, we may just want to keep the most probable path, as is done in shortest path algorithms, leading to the so-called Viterbi approximation. When weights are negative log probabilities, these two alternatives correspond respectively to log summation and taking the minimum. In the general case, we use one operation, denoted ⊗, for combining weights along paths and for composition, and a second operation, denoted ⊕, to combine identically labeled paths. Some common choices of (⊕, ⊗) include (max, +), (+, ∗), (min, +), and (−log(e^−x + e^−y), +). In speech applications, the first two are appropriate for probabilities, the last two for the corresponding negative log probabilities. More generally, as we will see in Section 3, many of the weighted automata algorithms apply when the two operations define an appropriate semiring. The choices (min, +) and (−log(e^−x + e^−y), +) are called the tropical and log semirings, respectively.

Our discussion and examples of determinization and, later, minimization will be illustrated with weighted acceptors. The string semiring, whose two operations are longest common prefix and concatenation, can be used to treat the output strings as weights. By this method, the transducer case can be handled as well; see Mohri [1997] for details.

Figure 4: Determinization of weighted automata. (a) Weighted automaton A over the tropical semiring. (b) Equivalent weighted automaton B obtained by determinization of A.

We will now work through an example of determinization with weights in the tropical semiring. Figure 4(b) shows the weighted determinization of automaton A from Figure 4(a). In general, the determinization of a weighted automaton is equivalent to the original, that is, it associates the same weight to each input string. For example, there are two paths corresponding to the input string ab in A, with weights {1 + 3 = 4, 2 + 3 = 5}. The minimum 4 is also the weight associated by B to the string ab.

In the classical subset construction for determinizing unweighted automata, all the states reachable by a given input from the initial state are placed in the same subset. In the weighted case, transitions with the same input label can have different weights, but only the minimum of those weights is needed and the leftover weights must be kept track of. Thus, the subsets in weighted determinization contain pairs (q, w) of a state q of the original automaton and a leftover weight w.

The initial subset is {(i, 0)}, where i is the initial state of the original automaton. For example, in Figure 4, the initial subset for automaton B is {(0, 0)}. Each new subset S is processed in turn. For each element a of the input alphabet Σ labeling at least one transition leaving a state of S, a new transition t leaving S is constructed in the result automaton. The input label of t is a and its weight is the minimum of the sums w + l where w is s's leftover weight and l is the weight of an a-transition leaving a state s in S. The destination state of t is the subset S′ containing those pairs (q′, w′) in which q′ is a state reached by a transition labeled with a from a state of S and w′ is the appropriate leftover weight.

For example, in Figure 4, the transition leaving (0, 0) in B labeled with a is obtained from the two transitions labeled with a leaving state 0 in A: its weight is the minimum of the weights of those two transitions, and its destination state is the subset S′ = {(1, 1 − 1 = 0), (2, 2 − 1 = 1)}. The algorithm is described in more detail in Section 3.3.

It is clear that the transitions leaving a given state in the determinization of an automaton can be computed from the subset for the state and the transitions leaving the states in the subset, as is the case for the classical non-deterministic finite automata (NFA) determinization algorithm. In other words, the weighted determinization algorithm is local like the composition algorithm, and can thus be given a lazy implementation that creates states and transitions only as needed.
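The construction can be sketched as follows for the tropical semiring (illustrative Python, not the chapter's implementation); run on the automaton of Figure 4(a), it reproduces the subset and transition weight shown in Figure 4(b).

```python
# Weighted determinization over the tropical semiring (min, +): a sketch.
# Input: initial state, final weights, arcs as {state: [(label, weight, next), ...]}.
# Output states are weighted subsets: tuples of (state, leftover weight) pairs.
from collections import deque

def determinize_tropical(initial, finals, arcs):
    start = ((initial, 0.0),)
    det_arcs = {}                 # subset -> list of (label, weight, destination subset)
    det_finals = {}
    queue = deque([start])
    seen = {start}
    while queue:
        subset = queue.popleft()
        # the subset is final if any member state is final; combine final weights with min
        fin = [v + finals[q] for (q, v) in subset if q in finals]
        if fin:
            det_finals[subset] = min(fin)
        # group outgoing transitions of the member states by input label
        by_label = {}
        for (q, v) in subset:
            for (label, w, nxt) in arcs.get(q, []):
                by_label.setdefault(label, []).append((v + w, nxt))
        for label, pairs in sorted(by_label.items()):
            w_min = min(w for (w, _) in pairs)               # weight of the new transition
            dest = {}
            for (w, nxt) in pairs:                           # leftover weights of destinations
                dest[nxt] = min(dest.get(nxt, float("inf")), w - w_min)
            dest_subset = tuple(sorted(dest.items()))
            det_arcs.setdefault(subset, []).append((label, w_min, dest_subset))
            if dest_subset not in seen:
                seen.add(dest_subset)
                queue.append(dest_subset)
    return start, det_finals, det_arcs

# Automaton of Figure 4(a): two a-transitions from state 0, b self-loops at 1 and 2.
arcs = {0: [("a", 1, 1), ("a", 2, 2)],
        1: [("b", 3, 1), ("c", 5, 3)],
        2: [("b", 3, 2), ("d", 6, 3)]}
start, finals, det = determinize_tropical(0, {3: 0}, arcs)
print(det[start])   # [('a', 1.0, ((1, 0.0), (2, 1.0)))], cf. Figure 4(b)
```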

2.5. Minimization

Given a deterministic automaton, we can reduce its size by minimization, which can save both space and time. Any deterministic unweighted automaton can be minimized using classical algorithms [Aho et al., 1974, Revuz, 1992]. In the same way, any deterministic weighted automaton A can be minimized using our minimization algorithm, which extends the classical algorithm [Mohri, 1997]. The resulting weighted automaton B is equivalent to the automaton A, and has the least number of states and the least number of transitions among all deterministic weighted automata equivalent to A.

As we will see in Section 3.5, weighted minimization is quite efficient, indeed as efficient as classical deterministic finite automata (DFA) minimization.

Figure 5: Weight pushing and minimization. (a) Deterministic weighted automaton A. (b) Equivalent weighted automaton B obtained by weight pushing in the tropical semiring. (c) Minimal weighted automaton equivalent to A.

We can view the deterministic weighted automaton A in Figure 5(a) as an unweighted automaton by interpreting each pair (a, w) of a label a and a weight w as a single label. We can then apply the standard DFA minimization algorithm to this automaton. However, since the pairs for different transitions are all distinct, classical minimization would have no effect on A.

The size of A can still be reduced by using true weighted minimization. This algorithm works in two steps: the first step pushes weight among transitions, and the second applies the classical minimization algorithm to the result with each distinct label-weight pair viewed as a distinct symbol, as described above. Weight pushing is useful not only as a first step of minimization but also to redistribute weight among transitions to improve search, especially pruned search. The algorithm is described in more detail in Section 3.4, and analyzed in [Mohri, 2002]. Its applications to speech recognition are discussed in [Mohri and Riley, 2001].

Pushing is a special case of reweighting. We describe reweighting in the case of the tropical semiring; similar definitions can be given for other semirings. A (non-trivial) weighted automaton can be reweighted in an infinite number of ways that produce equivalent automata. To see how, let i be A's initial state and assume for convenience A has a single final state f.1 Let V : Q → R be an arbitrary potential function on states. Update each weight w[t] of the transition t from state p[t] to n[t] as follows:

    w[t] ← w[t] + (V(n[t]) − V(p[t])),                           (1)

and the final weight ρ(f) as follows:

    ρ(f) ← ρ(f) + (V(i) − V(f)).                                 (2)

It is easy to see that with this reweighting, each potential internal to any successful path from the initial state to the final state is added and then subtracted, making the overall change in path weight:

    (V(f) − V(i)) + (V(i) − V(f)) = 0.                           (3)

1 Any automaton can be transformed into an equivalent automaton with a single final state by adding a super-final state, making all previously final states non-final, and adding from each previously final state f with weight ρ(f) an ε-transition with the weight ρ(f) to the super-final state.

Thus, reweighting does not affect the total weight of a successful path and the resulting automaton is equivalent to the original.
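A quick numeric check of equations (1)-(3), with an arbitrarily chosen potential (a sketch; the states and weights are made up for illustration):

```python
# A single successful path i -> q -> f with transition weights 2 and 5 and final weight 1,
# reweighted with an arbitrary potential V; the total path weight is unchanged.
V = {"i": 0.0, "q": 7.0, "f": 3.0}          # any values work
w_iq, w_qf, rho_f = 2.0, 5.0, 1.0

w_iq_new  = w_iq + (V["q"] - V["i"])        # equation (1)
w_qf_new  = w_qf + (V["f"] - V["q"])        # equation (1)
rho_f_new = rho_f + (V["i"] - V["f"])       # equation (2)

print(w_iq + w_qf + rho_f)                  # 8.0
print(w_iq_new + w_qf_new + rho_f_new)      # 8.0 (the internal potentials cancel)
```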

To push the weight in A towards the initial state as much as possible, a specific potential function is chosen, the one that assigns to each state the lowest path weight from that state to the final state. After pushing, the lowest cost path (excluding the final weight) from every state to the final state will thus be 0.

Figure 5(b) shows the result of pushing for the input A. Thanks to pushing, the size of the automaton can then be reduced using classical minimization. Figure 5(c) illustrates the result of the final step of the algorithm. No approximation or heuristic is used: the resulting automaton C is equivalent to A.

2.6. Speech Recognition Transducers

Figure 6: Context-dependent triphone transducer transition: (a) non-deterministic, (b) deterministic.

As an illustration of these methods applied to speech recognition, we describe how to construct a single, statically-compiled and optimized recognition transducer that maps from context-dependent phones to words. This is an attractive choice for tasks that have fixed acoustic, lexical, and grammatical models since the static transducer can be searched simply and efficiently with no recognition-time overhead for model combination and optimization.

Consider the pronunciation lexicon in Figure 2(b). Suppose we form the union of this transducer with the pronunciation transducers for the remaining words in the grammar G of Figure 2(a) by creating a new super-initial state and connecting an ε-transition from that state to the former start states of each pronunciation transducer. We then take its Kleene closure by connecting an ε-transition from each final state to the initial state. The resulting pronunciation lexicon L would pair any word string from that vocabulary with its corresponding pronunciations. Thus,

    L ◦ G                                                        (4)

gives a transducer that maps from phones to word strings restricted to G.

We used composition here to implement a context-independent substitution. However, a major advantage of transducers in speech recognition is that they generalize naturally the notion of context-independent substitution of a label to the context-dependent case. In particular, the application of the familiar triphone models in ASR to the context-independent transducer, producing a context-dependent transducer, can be performed simply with composition.

To do so, we first construct a context-dependency transducer that maps from context-independent phones to context-dependent triphones. This transducer has a state for every pair of phones and a transition for every context-dependent model. In particular, if ae/k_t represents the triphonic model for ae with left context k and right context t,2 then there is a transition in the context-dependency transducer from state (k, ae) to state (ae, t) with output label ae/k_t. For the input label on this transition, we could choose the center phone ae as depicted in Figure 6(a). This will correctly implement the transduction; but the transducer will be non-deterministic. Alternately, we can choose the right phone t as depicted in Figure 6(b). This will also correctly implement the transduction, but the result will be deterministic. To see why these are correct, realize that when we enter a state, we have read (in the deterministic case) or must read (in the non-deterministic case) the two phones that label the state. Therefore, the source state and destination state of a transition determine the triphone context. In Section 4, we give the full details of the triphonic context-dependency transducer construction and further demonstrate its correctness.
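A toy sketch of the deterministic variant of this construction (illustrative Python; word boundaries and the auxiliary symbols introduced later are ignored): every phone triple (l, c, r) yields a transition from state (l, c) to state (c, r) with input label r and output label c/l_r.

```python
# Deterministic triphone context-dependency transitions, as in Figure 6(b):
# entering state (c, r) means the last two phones read are c and r, so the
# source and destination states of a transition determine the triphone context.
# Toy sketch only: boundary handling and auxiliary symbols are omitted.
from itertools import product

def context_dependency_arcs(phones):
    arcs = []
    for l, c, r in product(phones, repeat=3):
        src = (l, c)
        dst = (c, r)
        ilabel = r                       # read the right phone (deterministic variant)
        olabel = f"{c}/{l}_{r}"          # triphone model symbol, e.g. "ae/k_t"
        arcs.append((src, ilabel, olabel, dst))
    return arcs

for arc in context_dependency_arcs(["k", "ae", "t"]):
    if arc[0] == ("k", "ae"):
        print(arc)
# prints, among others: (('k', 'ae'), 't', 'ae/k_t', ('ae', 't'))
```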

The above context-dependency transducer maps from context-independent phones to context-dependent triphones. We can invert the relation by interchanging the transducer's input and output labels to create a new transducer that maps from context-dependent triphones to context-independent phones. We do this inversion so we can left compose it with our context-independent recognition transducer L ◦ G. If we let C represent a context-dependency transducer from context-dependent phones to context-independent phones, then:

    C ◦ (L ◦ G)                                                  (5)

gives a transducer that maps from context-dependent phones to word strings restricted to the grammar G.

2 This use of / to indicate "in the context of" in a triphone symbol offers a potential ambiguity with our use of / to separate a transition's weight from its input and output symbols. However, since context-dependency transducers are never weighted in this chapter, the confusion is not a problem in what follows, so we chose to stay with the notation of previous work rather than changing it to avoid the potential ambiguity.

To complete our example, we optimize this transducer. Given our discussion of the benefits of determinization and minimization, we might try to apply those operations directly to the composed transducer:

    N = min(det(C ◦ (L ◦ G))).                                   (6)

This assumes the recognition transducer can be determinized, which will be true if each of the component transducers can be determinized. If the context-dependency C is constructed as we have described and if the grammar G is an n-gram language model, then they will be determinizable. However, L may not be determinizable. In particular, if L has ambiguities, namely homophones (two distinct words that have the same pronunciation), then it cannot be determinized as is. However, we can introduce auxiliary phone symbols at word ends to disambiguate homophones, creating a transformed lexicon L̃. We also need to create a modified context-dependency transducer C̃ that additionally pairs the context-independent auxiliary symbols found in the lexicon with new context-dependent auxiliary symbols (which are later rewritten to epsilons after all optimizations). We leave the details to Section 4. The following expression specifies the optimized transducer:

    N = min(det(C̃ ◦ (L̃ ◦ G))).                                  (7)

In Section 4, we give illustrative experimental results with a fully-composed, optimized (and factored) recognition transducer that maps from context-dependent units to words for the North American Business News (NAB) DARPA task. This transducer runs about 18× faster than its unoptimized version and has only about 1.4 times as many transitions as its word-level grammar. We have found similar results with DARPA Broadcast News and Switchboard.

3. ALGORITHMS

We now describe in detail the weighted automata and transducer algorithms introduced informally in Section 2 that are relevant to the design of speech recognition systems. We start with definitions and notation used in specifying and describing the algorithms.

3.1. Preliminaries

As noted earlier, all of our algorithms work with weights that are combined with operations satisfying the semiring conditions. A semiring (K, ⊕, ⊗, 0, 1) is specified by a set of values K, two binary operations ⊕ and ⊗, and two designated values 0 and 1. The operation ⊕ is associative, commutative, and has 0 as identity. The operation ⊗ is associative, has identity 1, distributes with respect to ⊕, and has 0 as annihilator: for all a ∈ K, a ⊗ 0 = 0 ⊗ a = 0. If ⊗ is also commutative, we say that the semiring is commutative. All the semirings we discuss in the rest of this chapter are commutative.

Real numbers with addition and multiplication satisfy the semiring conditions, but of course they also satisfy several other important conditions (for example, having additive inverses), which are not required for our transducer algorithms. Table 1 lists some familiar (commutative) semirings. In addition to the Boolean semiring, and the probability semiring used to combine probabilities, two semirings often used in text and speech processing applications are the log semiring, which is isomorphic to the probability semiring via the negative-log mapping, and the tropical semiring, which is derived from the log semiring using the Viterbi approximation.

A semiring (K, ⊕, ⊗, 0, 1) is weakly left-divisible if for any x and y in K such that x ⊕ y ≠ 0, there exists at least one z such that x = (x ⊕ y) ⊗ z. The ⊗-operation is cancellative if z is unique and we can write: z = (x ⊕ y)^−1 ⊗ x. A semiring is zero-sum-free if for any x and y in K, x ⊕ y = 0 implies x = y = 0.

For example, the tropical semiring is weakly left-divisible with z = x − min(x, y), which also shows that ⊗ for this semiring is cancellative. The probability semiring is also weakly left-divisible with z = x/(x + y). Finally, the tropical semiring, the probability semiring, and the log semiring are zero-sum-free.

For any x ∈ K, let x^n denote

    x^n = x ⊗ · · · ⊗ x  (n factors).                            (8)

When the infinite sum ⊕_{n=0}^{+∞} x^n is well defined and in K, the closure of an element x ∈ K is defined as x* = ⊕_{n=0}^{+∞} x^n. A semiring is closed when infinite sums such as the one above are well defined and if associativity, commutativity, and distributivity apply to countable sums (Lehmann [1977] and Mohri [2002] give precise definitions). The Boolean and tropical semirings are closed, while the probability and log semirings are not.

Table 1: Semiring examples. ⊕_log is defined by: x ⊕_log y = −log(e^−x + e^−y).

    SEMIRING      SET               ⊕       ⊗    0     1
    Boolean       {0, 1}            ∨       ∧    0     1
    Probability   R+                +       ×    0     1
    Log           R ∪ {−∞, +∞}      ⊕_log   +    +∞    0
    Tropical      R ∪ {−∞, +∞}      min     +    +∞    0
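For concreteness, the semirings of Table 1 can be written down directly as their (⊕, ⊗, 0, 1) components (an illustrative Python sketch; the names are not from any particular library):

```python
import math
from collections import namedtuple

# A semiring as its two operations and designated values; a sketch, not a library API.
Semiring = namedtuple("Semiring", ["plus", "times", "zero", "one"])

INF = float("inf")

boolean     = Semiring(lambda x, y: x or y,  lambda x, y: x and y, False, True)
probability = Semiring(lambda x, y: x + y,   lambda x, y: x * y,   0.0,   1.0)
log         = Semiring(lambda x, y: -math.log(math.exp(-x) + math.exp(-y)),
                       lambda x, y: x + y,   INF,   0.0)
tropical    = Semiring(lambda x, y: min(x, y),
                       lambda x, y: x + y,   INF,   0.0)

def path_weight(semiring, weights):
    """The ⊗-product of the weights along a path."""
    w = semiring.one
    for x in weights:
        w = semiring.times(w, x)
    return w

print(path_weight(tropical, [1, 3]))      # 4.0: cf. one of the paths for "ab" in Figure 4(a)
print(tropical.plus(4, 5))                # 4: Viterbi combination of two such paths
print(round(log.plus(4, 5), 2))           # 3.69: full log-sum combination of the same paths
```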

A weighted finite-state transducer T = (A, B, Q, I, F, E, λ, ρ) over a semiring K is specified by a finite input alphabet A, a finite output alphabet B, a finite set of states Q, a set of initial states I ⊆ Q, a set of final states F ⊆ Q, a finite set of transitions E ⊆ Q × (A ∪ {ε}) × (B ∪ {ε}) × K × Q, an initial state weight assignment λ : I → K, and a final state weight assignment ρ : F → K. E[q] denotes the set of transitions leaving state q ∈ Q. |T| denotes the sum of the number of states and transitions of T.

Weighted automata (or weighted acceptors) are defined in a similar way by simply omitting the input or output labels. The projection operations Π1(T) and Π2(T) obtain a weighted automaton from a weighted transducer T by omitting respectively the input or the output labels of T.

Given a transition e ∈ E, p[e] denotes its origin or previous state, n[e] its destination or next state, i[e] its input label, o[e] its output label, and w[e] its weight. A path π = e1 · · · ek is a sequence of consecutive transitions: n[ei−1] = p[ei], i = 2, . . . , k. The path π is a cycle if p[e1] = n[ek]. An ε-cycle is a cycle in which the input and output labels of all transitions are ε.

The functions n, p, and w on transitions can be extended to paths by setting n[π] = n[ek] and p[π] = p[e1], and by defining the weight of a path as the ⊗-product of the weights of its constituent transitions: w[π] = w[e1] ⊗ · · · ⊗ w[ek]. More generally, w is extended to any finite set of paths R by setting w[R] = ⊕_{π∈R} w[π]; if the semiring is closed, this is defined even for infinite R. We denote by P(q, q′) the set of paths from q to q′ and by P(q, x, y, q′) the set of paths from q to q′ with input label x ∈ A* and output label y ∈ B*. For an acceptor, we denote by P(q, x, q′) the set of paths with input label x. These definitions can be extended to subsets R, R′ ⊆ Q by P(R, R′) = ∪_{q∈R, q′∈R′} P(q, q′), P(R, x, y, R′) = ∪_{q∈R, q′∈R′} P(q, x, y, q′), and, for an acceptor, P(R, x, R′) = ∪_{q∈R, q′∈R′} P(q, x, q′). A transducer T is regulated if the weight associated by T to any pair of input-output strings (x, y), given by

    T(x, y) = ⊕_{π∈P(I,x,y,F)} λ[p[π]] ⊗ w[π] ⊗ ρ[n[π]],         (9)

is well defined and in K. If P(I, x, y, F) = ∅, then T(x, y) = 0. A weighted transducer without ε-cycles is regulated, as is any weighted transducer over a closed semiring. Similarly, for a regulated acceptor, we define

    T(x) = ⊕_{π∈P(I,x,F)} λ[p[π]] ⊗ w[π] ⊗ ρ[n[π]].              (10)

The transducer T is trim if every state occurs in some path π ∈ P(I, F). In other words, a trim transducer has no useless states. The same definition applies to acceptors.

3.2. Composition

As we outlined in Section 2.3, composition is the core operation for relating multiple levels of representation in ASR. More generally, composition is the fundamental algorithm used to create complex weighted transducers from simpler ones [Salomaa and Soittola, 1978, Kuich and Salomaa, 1986], and generalizes the composition algorithm for unweighted finite-state transducers [Eilenberg, 1974-1976, Berstel, 1979]. Let K be a commutative semiring and let T1 and T2 be two weighted transducers defined over K such that the input alphabet B of T2 coincides with the output alphabet of T1. Assume that the infinite sum ⊕_{z∈B*} T1(x, z) ⊗ T2(z, y) is well defined and in K for all (x, y) ∈ A* × C*, where A is the input alphabet of T1 and C is the output alphabet of T2. This will be the case if K is closed, or if T1 has no ε-input cycles or T2 has no ε-output cycles. Then, the result of the composition of T1 and T2 is a weighted transducer denoted by T1 ◦ T2 and specified for all x, y by:

    (T1 ◦ T2)(x, y) = ⊕_{z∈B*} T1(x, z) ⊗ T2(z, y).              (11)

WEIGHTED-COMPOSITION(T1, T2)
 1  Q ← I1 × I2
 2  S ← I1 × I2
 3  while S ≠ ∅ do
 4      (q1, q2) ← HEAD(S)
 5      DEQUEUE(S)
 6      if (q1, q2) ∈ I1 × I2 then
 7          I ← I ∪ {(q1, q2)}
 8          λ(q1, q2) ← λ1(q1) ⊗ λ2(q2)
 9      if (q1, q2) ∈ F1 × F2 then
10          F ← F ∪ {(q1, q2)}
11          ρ(q1, q2) ← ρ1(q1) ⊗ ρ2(q2)
12      for each (e1, e2) ∈ E[q1] × E[q2] such that o[e1] = i[e2] do
13          if (n[e1], n[e2]) ∉ Q then
14              Q ← Q ∪ {(n[e1], n[e2])}
15              ENQUEUE(S, (n[e1], n[e2]))
16          E ← E ∪ {((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2]))}
17  return T

Figure 7: Pseudocode of the composition algorithm.

There is a general and efficient composition algorithm for weighted transducers [Salomaa and Soittola, 1978, Kuich and Salomaa, 1986]. States in the composition T1 ◦ T2 of two weighted transducers T1 and T2 are identified with pairs of a state of T1 and a state of T2. Leaving aside transitions with ε inputs or outputs, the following rule specifies how to compute a transition of T1 ◦ T2 from appropriate transitions of T1 and T2:

    (q1, a, b, w1, r1) and (q2, b, c, w2, r2) =⇒ ((q1, q2), a, c, w1 ⊗ w2, (r1, r2)).        (12)

Figure 7 gives the pseudocode of the algorithm in the ε-free case.

The algorithm takes as input two weighted transducers

    T1 = (A, B, Q1, I1, F1, E1, λ1, ρ1) and T2 = (B, C, Q2, I2, F2, E2, λ2, ρ2),             (13)

and outputs a weighted finite-state transducer T = (A, C, Q, I, F, E, λ, ρ) implementing the composition of T1 and T2. E, I, and F are all initialized to the empty set and grown as needed.

The algorithm uses a queue S containing the set of pairs of states yet to be examined. The queue discipline of S is arbitrary, and does not affect the termination of the algorithm. The state set Q is initially the set of pairs of initial states of the original transducers, as is S (lines 1-2). Each time through the loop in lines 3-16, a new pair of states (q1, q2) is extracted from S (lines 4-5). The initial weight of (q1, q2) is computed by ⊗-multiplying the initial weights of q1 and q2 when they are both initial states (lines 6-8). Similar steps are followed for final states (lines 9-11). Then, for each pair of matching transitions (e1, e2), a new transition is created according to the rule specified earlier (line 16). If the destination state (n[e1], n[e2]) has not been found previously, it is added to Q and inserted in S (lines 14-15).

In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving a state q2, thus the space and time complexity of composition is quadratic: O(|T1||T2|). However, a lazy implementation of composition can be used to construct just the part of the composed transducer that is needed.
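A runnable sketch of the ε-free algorithm of Figure 7, specialized to the tropical semiring so that ⊗ is ordinary addition (illustrative Python; the data layout is an assumption, not the chapter's library), applied to the transducers of Figure 3:

```python
# ε-free weighted composition over the tropical semiring, following Figure 7.
# A transducer is (initials, finals, arcs) with arcs[q] = [(in, out, weight, next), ...];
# initials and finals map states to weights. A sketch, not a library implementation.
from collections import deque

def compose(t1, t2):
    I1, F1, E1 = t1
    I2, F2, E2 = t2
    initials, finals, arcs = {}, {}, {}
    queue = deque()
    seen = set()
    for q1 in I1:
        for q2 in I2:
            initials[(q1, q2)] = I1[q1] + I2[q2]       # λ(q1, q2) = λ1(q1) ⊗ λ2(q2)
            queue.append((q1, q2))
            seen.add((q1, q2))
    while queue:
        q1, q2 = queue.popleft()
        if q1 in F1 and q2 in F2:
            finals[(q1, q2)] = F1[q1] + F2[q2]         # ρ(q1, q2) = ρ1(q1) ⊗ ρ2(q2)
        for (i1, o1, w1, n1) in E1.get(q1, []):
            for (i2, o2, w2, n2) in E2.get(q2, []):
                if o1 == i2:                           # output of e1 matches input of e2
                    dest = (n1, n2)
                    arcs.setdefault((q1, q2), []).append((i1, o2, w1 + w2, dest))
                    if dest not in seen:
                        seen.add(dest)
                        queue.append(dest)
    return initials, finals, arcs

# Transducers of Figure 3(a) and 3(b); states of the result are pairs, as in Figure 3(c).
t1 = ({0: 0.0}, {3: 0.6},
      {0: [("a", "b", 0.1, 1), ("b", "a", 0.2, 2)],
       1: [("c", "a", 0.3, 1), ("a", "a", 0.4, 3)],
       2: [("b", "b", 0.5, 3)]})
t2 = ({0: 0.0}, {2: 0.7},
      {0: [("b", "c", 0.3, 1)],
       1: [("a", "b", 0.4, 2)],
       2: [("a", "b", 0.6, 2)]})
initials, finals, arcs = compose(t1, t2)
print(arcs[(0, 0)])               # [('a', 'c', 0.4, (1, 1))], up to floating-point rounding
print(round(finals[(3, 2)], 2))   # 1.3, cf. Figure 3(c)
```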

More care is needed when T1 has output ε labels and T2 input ε labels. Indeed, as illustrated by Figure 8, a straightforward generalization of the ε-free case would generate redundant ε-paths and, in the case of non-idempotent semirings, would lead to an incorrect result. The weight of the matching paths of the original transducers would be counted p times, where p is the number of redundant paths in the composition.

To solve this problem, all but one ε-path must be filtered out of the composition. Figure 8 indicates in boldface one possible choice for that path, which in this case is the shortest. Remarkably, that filtering mechanism can be encoded as a finite-state transducer.

Let T̃1 and T̃2 be the weighted transducers obtained from T1 and T2 respectively by replacing the output ε labels of T1 with ε2 and the input ε labels of T2 with ε1. Consider the filter finite-state transducer F represented in Figure 9. Then T̃1 ◦ F ◦ T̃2 = T1 ◦ T2. Since the two compositions in T̃1 ◦ F ◦ T̃2 do not involve ε labels, the ε-free composition already described can be used to compute the resulting transducer.

Intersection (or Hadamard product) of weighted automata and composition of finite-state transducers are both special cases of composition of weighted transducers. Intersection corresponds to the case where input and output labels of transitions are identical, and composition of unweighted transducers is obtained simply by omitting the weights.

Figure 9: Filter for composition F.

3.3. Determinization

We now describe the generic determinization algorithm for weighted automata that we used informally when working through the example in Section 2.4. This algorithm is a generalization of the classical subset construction for NFAs (unweighted nondeterministic finite automata). The determinization of unweighted or weighted finite-state transducers can both be viewed as special instances of the generic algorithm presented here but, for simplicity, we will focus on the weighted acceptor case.

A weighted automaton is deterministic (also known as subsequential) if it has a unique initial state and if no two transitions leaving any state share the same input label. The determinization algorithm we now present applies to weighted automata over a cancellative weakly left-divisible semiring that satisfies a mild technical condition.3 Figure 10 gives pseudocode for the algorithm.

A weighted subset p of Q is a set of pairs (q, x) ∈ Q × K. Q[p] is the set of states q in p, E[Q[p]] is the set of transitions leaving those states, and i[E[Q[p]]] the set of input labels of those transitions.

The states of the result automaton are weighted subsets of the states of the original automaton. A state r of the result automaton that can be reached from the start state by path π is the weighted set of pairs (q, x) ∈ Q × K such that q can be reached from an initial state of the original automaton by a path σ with i[σ] = i[π] and λ[p[σ]] ⊗ w[σ] = λ[p[π]] ⊗ w[π] ⊗ x. Thus, x can be viewed as the residual weight at state q. The algorithm takes as input a weighted automaton A = (A, Q, I, F, E, λ, ρ) and, when it terminates, yields an equivalent deterministic weighted automaton A′ = (A, Q′, I′, F′, E′, λ′, ρ′).

3 If x ∈ P(I, Q), then w[P(I, x, Q)] ≠ 0, which is satisfied by trim automata over the tropical semiring or any other zero-sum-free semiring.

Figure 8: Redundant ε-paths. A straightforward generalization of the ε-free case could generate all the paths from (1, 1) to (3, 2) when composing the two simple transducers on the right-hand side.

The algorithm uses a queue S containing the set of states of the resulting automaton A′, yet to be examined. The sets Q′, I′, F′, and E′ are initially empty. The queue discipline for S can be arbitrarily chosen and does not affect the termination of the algorithm. The initial state set of A′ is I′ = {i′} where i′ is the weighted set of the initial states of A with the respective initial weights. Its initial weight is 1 (lines 1-2). S originally contains only the subset i′ (line 3). Each time through the loop in lines 4-16, a new weighted subset p′ is dequeued from S (lines 5-6). For each x labeling at least one of the transitions leaving a state p in the weighted subset p′, a new transition with input label x is constructed. The weight w′ associated to that transition is the sum of the weights of all transitions in E[Q[p′]] labeled with x, pre-⊗-multiplied by the residual weight v at each state p (line 8). The destination state of the transition is the subset containing all the states q reached by transitions in E[Q[p′]] labeled with x. The weight of each state q of the subset is obtained by taking the ⊕-sum of the residual weights of the states p ⊗-times the weight of the transition from p leading to q, and by dividing that by w′. The new subset q′ is inserted in the queue S when it is a new state (line 16). If any of the states in the subset q′ is final, q′ is made a final state and its final weight is obtained by summing the final weights of all the final states in q′, pre-⊗-multiplied by their residual weight v (lines 14-15).

The worst case complexity of determinization is exponential even in the unweighted case. However, in many practical cases such as for weighted automata used in large-vocabulary speech recognition, this blow-up does not occur. It is also important to notice that just like composition, determinization has a natural lazy implementation in which only the transitions required by an application are expanded in the result automaton.

Unlike in the unweighted case, determinization does not halt on all input weighted automata. We say that a weighted automaton A is determinizable if the determinization algorithm halts for the input A. With a determinizable input, the algorithm outputs an equivalent deterministic weighted automaton.

The twins property for weighted automata characterizes determinizable weighted automata under some general conditions [Mohri, 1997]. Let A be a weighted automaton over a weakly left-divisible semiring K. Two states q and q′ of A are said to be siblings if there are strings x and y in A* such that both q and q′ can be reached from I by paths labeled with x and there is a cycle at q and a cycle at q′ both labeled with y. When K is a commutative and cancellative semiring, two sibling states are said to be twins when for every string y:

    w[P(q, y, q)] = w[P(q′, y, q′)].                             (14)


WEIGHTED-DETERMINIZATION(A)

 1  i′ ← {(i, λ(i)) : i ∈ I}
 2  λ′(i′) ← 1
 3  S ← {i′}
 4  while S ≠ ∅ do
 5      p′ ← HEAD(S)
 6      DEQUEUE(S)
 7      for each x ∈ i[E[Q[p′]]] do
 8          w′ ← ⊕{v ⊗ w : (p, v) ∈ p′, (p, x, w, q) ∈ E}
 9          q′ ← {(q, ⊕{w′⁻¹ ⊗ (v ⊗ w) : (p, v) ∈ p′, (p, x, w, q) ∈ E}) :
                    q = n[e], i[e] = x, e ∈ E[Q[p′]]}
10          E′ ← E′ ∪ {(p′, x, w′, q′)}
11          if q′ ∉ Q′ then
12              Q′ ← Q′ ∪ {q′}
13              if Q[q′] ∩ F ≠ ∅ then
14                  F′ ← F′ ∪ {q′}
15                  ρ′(q′) ← ⊕{v ⊗ ρ(q) : (q, v) ∈ q′, q ∈ F}
16              ENQUEUE(S, q′)
17  return A′

Figure 10: Pseudocode of the weighted determinization algorithm [Mohri, 1997].
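To make the subset construction above concrete, the following is a minimal sketch of weighted determinization specialized to the tropical (min, +) semiring, where ⊕ is min, ⊗ is +, and w′⁻¹ ⊗ (v ⊗ w) becomes (v + w) − w′. The dictionary-based automaton encoding and the function name are illustrative assumptions, not an existing library interface.

    # A minimal sketch of WEIGHTED-DETERMINIZATION over the tropical semiring.
    from collections import deque

    def determinize_tropical(initial, transitions, finals):
        """initial: dict state -> initial weight
        transitions: dict state -> list of (label, weight, next_state)
        finals: dict state -> final weight
        Returns (initial subset, new transitions, new finals); subsets are
        frozensets of (state, residual weight) pairs."""
        i0 = frozenset((q, w) for q, w in initial.items())
        queue, seen = deque([i0]), {i0}
        new_transitions, new_finals = {}, {}
        while queue:
            subset = queue.popleft()
            # group the outgoing transitions of the subset by input label
            by_label = {}
            for p, v in subset:
                for x, w, q in transitions.get(p, []):
                    by_label.setdefault(x, []).append((v + w, q))
            for x, arcs in by_label.items():
                w_prime = min(vw for vw, _ in arcs)          # line 8: ⊕ = min
                residuals = {}                               # line 9: residual weights
                for vw, q in arcs:
                    residuals[q] = min(residuals.get(q, float("inf")), vw - w_prime)
                q_prime = frozenset(residuals.items())
                new_transitions.setdefault(subset, []).append((x, w_prime, q_prime))
                if q_prime not in seen:                      # lines 11-16
                    seen.add(q_prime)
                    fw = [v + finals[q] for q, v in q_prime if q in finals]
                    if fw:
                        new_finals[q_prime] = min(fw)        # lines 14-15
                    queue.append(q_prime)
        return i0, new_transitions, new_finals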

A has the twins property if any two sibling states of A are twins. Figure 11 shows a weighted automaton over the tropical semiring that does not have the twins property: states 1 and 2 can be reached by paths labeled with a from the initial state and have cycles with the same label b, but the weights of these cycles (3 and 4) are different.

The following theorems relate the twins property and weighted determinization [Mohri, 1997].

Theorem 1 Let A be a weighted automaton over the tropical semiring. If A has the twins property, then A is determinizable.

Theorem 2 Let A be a trim unambiguous weighted automaton over the tropical semiring.4 Then A is determinizable iff it has the twins property.

There is an efficient algorithm for testing the twins property for weighted automata [Allauzen and Mohri, 2003]. Note that any acyclic weighted automaton over a zero-sum-free semiring has the twins property and is determinizable.

4 A weighted automaton is said to be unambiguous if for any x ∈ Σ∗ it admits at most one accepting path labeled with x.

Figure 11: Non-determinizable weighted automaton over the tropical semiring. States 1 and 2 are non-twin siblings.

The pre-determinization algorithm can be used to make an arbitrary weighted transducer over the tropical semiring determinizable by inserting transitions labeled with special symbols [Allauzen and Mohri, 2004]. The algorithm makes use of a general twins property [Allauzen and Mohri, 2003] to insert new transitions when needed to guarantee that the resulting transducer has the twins property and thus is determinizable.


3.4. Weight Pushing

As discussed in Section 2.5, weight pushing is necessary in weighted minimization and is also very useful for improving search. Weight pushing can also be used to test the equivalence of two weighted automata. Weight pushing is possible because redistributing the total weight of a successful path along that path does not change the path's total weight, and therefore preserves the definition of the automaton as a weighted set (a weighted relation for a transducer).

Let A be a weighted automaton over a zero-sum-free and weakly left-divisible semiring K. For any state q ∈ Q, assume that the following sum is well defined and in K:

d[q] = ⊕_{π ∈ P(q,F)} ( w[π] ⊗ ρ[n[π]] ).   (15)

The value d[q] is the shortest-distance from q to F [Mohri, 2002]. This is well defined for all q ∈ Q when K is a closed semiring. The weight-pushing algorithm consists of computing each shortest-distance d[q] and of reweighting the transition weights, initial weights, and final weights in the following way:

w[e] ← d[p[e]]⁻¹ ⊗ w[e] ⊗ d[n[e]]   if d[p[e]] ≠ 0,
λ[i] ← λ[i] ⊗ d[i],
ρ[f] ← d[f]⁻¹ ⊗ ρ[f]   if d[f] ≠ 0,      (16)

where e is any transition, i any initial state, and f any final state. Each of these operations can be done in constant time. Therefore, reweighting can be done in linear time O(T⊗ |A|), where T⊗ denotes the worst cost of an ⊗-operation. The complexity of the shortest-distances computation depends on the semiring [Mohri, 2002]. For the tropical semiring, d[q] can be computed using a standard shortest-distance algorithm. The complexity is linear for acyclic automata, O(|Q| + (T⊕ + T⊗)|E|), where T⊕ denotes the worst cost of an ⊕-operation. For general weighted automata over the tropical semiring, the complexity is O(|E| + |Q| log |Q|).

For semirings like the probability semiring, a generalization of the Floyd-Warshall algorithm for computing all-pairs shortest-distances can be used [Mohri, 2002]. Its complexity is Θ(|Q|³ (T⊕ + T⊗ + T∗)), where T∗ denotes the worst cost of the closure operation. The space complexity of these algorithms is Θ(|Q|²). Therefore, the Floyd-Warshall algorithm is impractical for automata with several hundred million states or transitions, which arise in large-vocabulary ASR. An approximate version of a generic single-source shortest-distance algorithm can be used instead to compute d[q] efficiently [Mohri, 2002].

Figure 12: Weighted automaton C obtained from A of Figure 5(a) by weight pushing in the probability semiring.

Speaking informally, the algorithm pushes the weight on each path as much as possible towards the initial states. Figures 5(a)-(b) show weight pushing on the tropical semiring, while Figure 12 shows weight pushing for the same automaton but on the probability semiring.
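A minimal sketch of the reweighting of equation (16) in the tropical (min, +) semiring follows: d[q] is computed by simple Bellman-Ford-style relaxation, and every transition, initial, and final weight is then reweighted. The encoding of automata as plain dictionaries is an illustrative assumption, not the paper's implementation.

    # A minimal sketch of weight pushing in the tropical semiring.
    def push_tropical(initial, transitions, finals):
        """initial: dict state -> initial weight
        transitions: dict state -> list of (label, weight, next_state)
        finals: dict state -> final weight
        Returns pushed (initial, transitions, finals)."""
        states = set(initial) | set(finals) | set(transitions)
        for arcs in transitions.values():
            states |= {q for _, _, q in arcs}
        INF = float("inf")
        # d[q]: shortest distance from q to a final state (eq. 15, tropical case)
        d = {q: finals.get(q, INF) for q in states}
        for _ in range(len(states)):                   # Bellman-Ford relaxation
            for p, arcs in transitions.items():
                for _, w, q in arcs:
                    if w + d[q] < d[p]:
                        d[p] = w + d[q]
        # reweighting of eq. (16): ⊗ is +, so the inverse is negation
        pushed_trans = {
            p: [(x, (w + d[q] - d[p]) if d[p] < INF else w, q) for x, w, q in arcs]
            for p, arcs in transitions.items()
        }
        pushed_init = {i: li + d[i] for i, li in initial.items()}
        pushed_finals = {f: (rf - d[f]) if d[f] < INF else rf for f, rf in finals.items()}
        return pushed_init, pushed_trans, pushed_finals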

Note that if d[q] = 0, then, since K is zero-sum-free, the weight of all paths from q to F is 0. Let A be a weighted automaton over the semiring K. Assume that K is closed and that the shortest-distances d[q] are all well defined and in K − {0}. In both cases, we can use distributivity over the infinite sums defining shortest distances. Let e′ (π′) denote the transition e (path π) after application of the weight pushing algorithm. e′ (π′) differs from e (resp. π) only by its weight. Let λ′ denote the new initial weight function, and ρ′ the new final weight function. Then, the following proposition holds [Mohri, 1997, 2005].

Proposition 1 Let B = (A, Q, I, F, E′, λ′, ρ′) be the result of the weight pushing algorithm applied to the weighted automaton A; then

1. the weight of a successful path π is unchanged after weight pushing:

   λ′[p[π′]] ⊗ w[π′] ⊗ ρ′[n[π′]] = λ[p[π]] ⊗ w[π] ⊗ ρ[n[π]].   (17)


2. the weighted automaton B is stochastic, that is,

   ∀q ∈ Q,  ⊕_{e′ ∈ E′[q]} w[e′] = 1.   (18)

These two properties of weight pushing are illustrated by Figures 5(a)-(b) and 12: the total weight of a successful path is unchanged after pushing; at each state of the weighted automaton of Figure 5(b), the minimum weight of the outgoing transitions is 0, and at each state of the weighted automaton of Figure 12, the weights of outgoing transitions sum to 1.

3.5. Minimization

Finally, we discuss in more detail the minimization algorithm introduced in Section 2.5. A deterministic weighted automaton is said to be minimal if there is no other deterministic weighted automaton with a smaller number of states that represents the same mapping from strings to weights. It can be shown that the minimal deterministic weighted automaton also has the minimal number of transitions among all equivalent deterministic weighted automata [Mohri, 1997].

Two states of a deterministic weighted automaton are said to be equivalent if exactly the same set of strings label the paths from these states to a final state, and the total weight of the paths for each string, including the final weight of the final state, is the same. Thus, two equivalent states of a deterministic weighted automaton can be merged without affecting the function realized by that automaton. A weighted automaton is minimal when it is not possible to create two distinct equivalent states after any pushing of the weights along its paths.

As outlined in Section 2.5, the general minimization algorithm for weighted automata consists of first applying the weight pushing algorithm to normalize the distribution of the weights along the paths of the input automaton, and then of treating each pair (label, weight) as a single label and applying classical (unweighted) automata minimization [Mohri, 1997]. The minimization of both unweighted and weighted finite-state transducers can also be viewed as instances of the algorithm presented here, but, for simplicity, we will not discuss them further. The following theorem holds [Mohri, 1997].

Theorem 3 Let A be a deterministic weighted automaton over a semiring K. Assume that the conditions of application of the weight pushing algorithm hold. Then the execution of the following steps:

1. weight pushing,

2. (unweighted) automata minimization,

leads to a minimal weighted automaton equivalent to A.

The complexity of automata minimization is linear in the case of acyclic automata, O(|Q| + |E|), and is O(|E| log |Q|) in the general case. In view of the complexity results of the previous section, for the tropical semiring, the time complexity of the weighted minimization algorithm is linear, O(|Q| + |E|), in the acyclic case and O(|E| log |Q|) in the general case.
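As a concrete illustration of Theorem 3, the sketch below minimizes an acyclic deterministic automaton over the tropical semiring: it first pushes weights with the push_tropical sketch above, then treats each (label, weight) pair as a single symbol and merges states with identical futures, a signature-based stand-in (in the spirit of Revuz's acyclic minimization) for general unweighted minimization. The encoding and helper names are assumptions of these sketches, not the paper's implementation.

    # A minimal sketch of weighted minimization for acyclic automata:
    # weight pushing followed by merging states with identical (label, weight) futures.
    def minimize_acyclic(initial, transitions, finals):
        init, trans, fin = push_tropical(initial, transitions, finals)

        signature_to_class = {}   # canonical signature -> class id
        state_class = {}          # state -> class id

        def classify(q):
            if q in state_class:
                return state_class[q]
            # signature: final weight plus sorted (label, weight, class-of-target) arcs
            sig = (fin.get(q, None),
                   tuple(sorted((x, w, classify(nq)) for x, w, nq in trans.get(q, []))))
            cls = signature_to_class.setdefault(sig, len(signature_to_class))
            state_class[q] = cls
            return cls

        for q in list(init):
            classify(q)
        # rebuild the automaton over the equivalence classes
        min_trans = {}
        for q, arcs in trans.items():
            if q in state_class:
                min_trans[state_class[q]] = [(x, w, state_class[nq]) for x, w, nq in arcs]
        min_init = {state_class[q]: w for q, w in init.items()}
        min_fin = {state_class[q]: w for q, w in fin.items() if q in state_class}
        return min_init, min_trans, min_fin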

Figure 5 illustrates the algorithm in the tropical semiring. Automaton A cannot be further minimized using the classical unweighted automata minimization since no two states are equivalent in that machine. After weight pushing, automaton B has two states, 1 and 2, that can be merged by unweighted automaton minimization.

Figure 13 illustrates the minimization of an automaton defined over the probability semiring. Unlike the unweighted case, a minimal weighted automaton is not unique, but all minimal weighted automata have the same graph topology and only differ in the weight distribution along each path. The weighted automata B′ and C′ are both minimal and equivalent to A′. B′ is obtained from A′ using the algorithm described above in the probability semiring and it is thus a stochastic weighted automaton in the probability semiring.

For a deterministic weighted automaton, the ⊕ operation can be arbitrarily chosen without affecting the mapping from strings to weights defined by the automaton, because a deterministic weighted automaton has at most one path labeled by any given string. Thus, in the algorithm described in Theorem 3, the weight pushing step can be executed in any semiring K′ whose multiplicative operation matches that of K. The minimal weighted automaton obtained by pushing the weights in K′ is also minimal in K, since it can be interpreted as a (deterministic) weighted automaton over K.


Figure 13: Minimization of weighted automata. (a) Weighted automaton A′ over the probability semiring. (b) Minimal weighted automaton B′ equivalent to A′. (c) Minimal weighted automaton C′ equivalent to A′.

In particular, A′ can be interpreted as a weighted automaton over the semiring (R+, max, ×, 0, 1). The application of the weighted minimization algorithm to A′ in this semiring leads to the minimal weighted automaton C′ of Figure 13(c). C′ is also a stochastic weighted automaton in the sense that, at any state, the maximum weight of all outgoing transitions is one.

In the particular case of a weighted automaton over the probability semiring, it may be preferable to use weight pushing in the (max, ×)-semiring since the complexity of the algorithm is then equivalent to that of classical single-source shortest-paths algorithms.5 The corresponding algorithm is a special instance of a generic shortest-distance algorithm [Mohri, 2002].

5 This preference assumes the resulting distribution of weights along paths is not important. As discussed in the next section, the weight distribution that results from pushing in the (+, ×) semiring has advantages when the resulting automaton is used in a pruned search.

4. APPLICATIONS TO SPEECH RECOGNITION

We now describe the details of the application of weighted finite-state transducer representations and algorithms to speech recognition as introduced in Section 2.

4.1. Speech Recognition Transducers

As described in Section 2, we will represent various models in speech recognition as weighted finite-state transducers. Four principal models are the word-level grammar G, the pronunciation lexicon L, the context-dependency transducer C, and the HMM transducer H. We will now discuss the construction of each of these transducers. Since these will be combined by composition and optimized by determinization, we ensure they are efficient to compose and allow weighted determinization.

The word-level grammar G, whether hand-crafted or learned from data, is typically a finite-state model in speech recognition. Hand-crafted finite-state models can be specified by regular expressions, rules, or directly as automata. Stochastic n-gram models, common in large-vocabulary speech recognition, can be represented compactly by finite-state models. For example, a bigram grammar has a state for every word wi and a transition from state w1 to state w2 for every bigram w1w2 that is seen in the training corpus. The transition is labeled with w2 and has weight − log(p(w2|w1)), the negative log of the estimated transition probability. The weight of a bigram w1w3 that is not seen in the training data can be estimated, for example, by backing off to the unigram. That is, it has weight − log(β(w1) p(w3)), where p(w3) is the estimated w3 unigram probability and β(w1) is the w1 backoff weight [Katz, 1987]. The unseen bigram could be represented as a transition from state w1 to w3 in the bigram automaton just as a seen bigram. However, this would result in O(|V|²) transitions in the automaton, where |V| is the vocabulary size. A simple approximation, with the introduction of a backoff state b, avoids this. In this model, an unseen w1w3 bigram is represented as two transitions: an ε-transition from state w1 to state b with weight − log(β(w1)) and a transition from state b to state w3 with label w3 and weight − log(p(w3)). This configuration is depicted in Figure 14. This is an approximation since seen bigrams may also be read as backed-off unigrams. However, since the seen bigram typically has higher probability than its backed-off unigram, it is usually a good approximation. A similar construction is used for higher-order n-grams.

Figure 14: Word bigram transducer model: a seen bigram w1w2 is represented as a w2-transition from state w1 to state w2; an unseen bigram w1w3 is represented as an ε-transition from state w1 to backoff state b and a w3-transition from state b to state w3.
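The backoff construction of Figure 14 can be sketched directly from bigram counts. The sketch below uses a simple absolute-discounting estimate purely for illustration (the experiments reported later use Katz back-off), and the arc-list encoding and function name are assumptions rather than the paper's construction.

    # A minimal sketch of the backoff bigram automaton of Figure 14.
    import math
    from collections import Counter, defaultdict

    def build_backoff_bigram(sentences, discount=0.5):
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            unigrams.update(s)
            bigrams.update(zip(s, s[1:]))
        total = sum(unigrams.values())
        p_uni = {w: c / total for w, c in unigrams.items()}

        arcs = []                        # (source_state, label, weight, dest_state)
        backoff_mass = defaultdict(float)
        for (w1, w2), c in bigrams.items():
            p = (c - discount) / unigrams[w1]         # discounted bigram probability
            backoff_mass[w1] += discount / unigrams[w1]
            arcs.append((w1, w2, -math.log(p), w2))   # seen bigram: w2-arc from w1 to w2
        for w1, beta in backoff_mass.items():
            arcs.append((w1, "<eps>", -math.log(beta), "<backoff>"))   # ε-arc to b
        for w, p in p_uni.items():
            arcs.append(("<backoff>", w, -math.log(p), w))             # unigram arc from b
        return arcs

    # Example: arcs = build_backoff_bigram([["jim", "read"], ["jill", "read"]])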

These grammars present no particular issues for composition. However, the backoff ε-transitions introduce non-determinism in the n-gram model. If fully determinized without ε-transitions, O(|V|²) transitions would result. However, we can treat the backoff ε labels as regular symbols during determinization, avoiding the explosion in the number of transitions.

As described in Section 2, we represent the pronunciation lexicon L as the Kleene closure of the union of individual word pronunciations as in Figure 2(b). In order for this transducer to efficiently compose with G, the output (word) labels must be placed on the initial transitions of the words; other locations would lead to delays in the composition matching, which could consume significant time and space.

In general, transducer L is not determinizable. This is clear in the presence of homophones. But, even without homophones, it may not be determinizable because the first word of the output string might not be known before the entire phone string is scanned. Such unbounded output delays make L non-determinizable.

To make it possible to determinize L, we introduce an auxiliary phone symbol, denoted #0, marking the end of the phonetic transcription of each word. Other auxiliary symbols #1, ..., #k−1 are used when necessary to distinguish homophones, as in the following example:

r eh d #0   read
r eh d #1   red.

At most P auxiliary phones are needed, where P is the maximum degree of homophony. The pronunciation dictionary transducer with these auxiliary symbols added is denoted by L̃. Allauzen et al. [2004b] describe more general alternatives to the direct construction of L̃. In that work, so long as L correctly defines the pronunciation transduction, it can be transformed algorithmically to something quite similar to L̃, regardless of the initial disposition of the output labels or the presence of homophony.
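Adding the auxiliary phones is straightforward; the following minimal sketch appends #0 to every pronunciation and #1, #2, ... to further homophones, using an assumed list-of-pairs encoding of the lexicon.

    # A minimal sketch of auxiliary-symbol insertion for homophone disambiguation.
    from collections import defaultdict

    def add_auxiliary_symbols(lexicon):
        """lexicon: list of (word, pronunciation) pairs, where a pronunciation is
        a tuple of phones, e.g. ("r", "eh", "d"). Returns the same pairs with a
        distinct auxiliary phone appended to each pronunciation."""
        by_pron = defaultdict(list)
        for word, pron in lexicon:
            by_pron[tuple(pron)].append(word)
        marked = []
        for pron, words in by_pron.items():
            for k, word in enumerate(words):            # homophones get #0, #1, ...
                marked.append((word, tuple(pron) + (f"#{k}",)))
        return marked

    # Example: add_auxiliary_symbols([("read", ("r", "eh", "d")),
    #                                 ("red",  ("r", "eh", "d"))])
    # -> [("read", ("r", "eh", "d", "#0")), ("red", ("r", "eh", "d", "#1"))]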

As introduced in Section 2, we can represent the mapping from context-independent phones to context-dependent units with a finite-state transducer, with Figure 6 giving a transition of that transducer. Figure 15 gives complete context-dependency transducers where just two hypothetical phones x and y are shown for simplicity. The transducer in Figure 15(a) is non-deterministic, while the one in Figure 15(b) is deterministic. For illustration purposes, we will describe the non-deterministic version since it is somewhat simpler. As in Section 2, we denote the context-dependent units as phone/left context_right context. Each state in Figure 15(a) encodes the knowledge of the previous and next phones. State labels in the figure are pairs (a, b) of the past a and the future b, with ε representing the start or end of a phone string and ∗ an unspecified future. For instance, it is easy to see that the phone string xyx is mapped by the transducer to x/ε_y y/x_x x/y_ε via the unique state sequence (ε, ∗)(x, y)(y, x)(x, ε). More generally, when there are n context-independent phones, this triphonic construction gives a transducer with O(n²) states and O(n³) transitions. A tetraphonic construction would give a transducer with O(n³) states and O(n⁴) transitions.
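The mapping computed by this triphonic transducer on a single phone string can be sketched directly, without building the full O(n²)-state machine; the function below is an illustrative assumption that simply rewrites each phone as phone/left-context_right-context, with "eps" standing in for the ε boundary marker.

    # A minimal sketch of the string-level mapping of the triphone transducer.
    def to_triphones(phones):
        out = []
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else "eps"
            right = phones[i + 1] if i + 1 < len(phones) else "eps"
            out.append(f"{p}/{left}_{right}")
        return out

    # Example: to_triphones(["x", "y", "x"]) -> ["x/eps_y", "y/x_x", "x/y_eps"]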

Figure 15: Context-dependent triphone transducers: (a) non-deterministic, (b) deterministic.

Figure 16: Context-dependent composition examples: (a) context-independent 'string', (b) context-dependency applied to (a), (c) context-independent automaton, (d) context-dependency applied to (c).

The following simple example shows the use of this context-dependency transducer. A context-independent string can be represented by the obvious single-path acceptor as in Figure 16(a). This can then be composed with the context-dependency transducer in Figure 15.6 The result is the transducer in Figure 16(b), which has a single path labeled with the context-independent labels on the input side and the corresponding context-dependent labels on the output side.

The context-dependency transducer can be composed with more complex transducers than the trivial one in Figure 16(a). For example, composing the context-dependency transducer with the transducer in Figure 16(c) results in the transducer in Figure 16(d). By definition of relational composition, this must correctly replace the context-independent units with the appropriate context-dependent units on all of its paths. Therefore, composition provides a convenient and general mechanism for applying context-dependency to ASR transducers.

The non-determinism of the transducer in Figure 15(a) introduces a single symbol matching delay in the composition with the lexicon. The deterministic transducer in Figure 15(b) composes without a matching delay, which makes it the better choice in applications. However, it introduces a single-phone shift between a context-independent phone and its corresponding context-dependent unit in the result. This shift requires the introduction of a final subsequential symbol $ to pad out the context. In practice, this might be mapped to a silence phone or an ε-transition.

If we let C represent a context-dependency transducer from context-dependent phones to context-independent phones, then

C ◦ L ◦ G

gives a transducer that maps from context-dependent phones to word strings restricted to the grammar G. Note that C is the inverse of a transducer such as in Figure 15; that is, the input and output labels have been exchanged on all transitions. For notational convenience, we adopt this form of the context-dependency transducer when we use it in recognition cascades.

6 Before composition, we promote the acceptor in Figure 16(a) to the corresponding transducer with identical input and output labels.

For correctness, the context-dependency transducer C must also accept all paths containing the auxiliary symbols added to L to make it determinizable. For determinizations at the context-dependent phone level and distribution level, each auxiliary phone must be mapped to a distinct context-dependent-level symbol. Thus, self-loops are added at each state of C mapping each auxiliary phone to a new auxiliary context-dependent phone. The augmented context-dependency transducer is denoted by C̃.

As we did for the pronunciation lexicon, we can represent the HMM set as H, the closure of the union of the individual HMMs (see Figure 1(c)). Note that we do not explicitly represent the HMM-state self-loops in H. Instead, we simulate those in the run-time decoder. With H in hand,

H ◦ C ◦ L ◦ G

gives a transducer that maps from distributions to word strings restricted to G.

Each auxiliary context-dependent phone in C̃ must be mapped to a new distinct distribution name. Self-loops are added at the initial state of H with auxiliary distribution name input labels and auxiliary context-dependent phone output labels to allow for this mapping. The modified HMM model is denoted by H̃.

We thus can use composition to combine all levels of our ASR transducers into an integrated transducer in a convenient, efficient, and general manner. When these automata are statically provided, we can apply the optimizations discussed in the next section to reduce decoding time and space requirements. If the transducer needs to be modified dynamically, for example by adding the results of a database lookup to the lexicon and grammar in an extended dialogue, we adopt a hybrid approach that optimizes the fixed parts of the transducer and uses lazy composition to combine them with the dynamic portions during recognition [Riley et al., 1997, Mohri and Pereira, 1998].

4.2. Transducer Standardization

To optimize an integrated transducer, we use three additional steps: (a) determinization, (b) minimization, and (c) factoring.


4.2.1. Determinization

We use weighted transducer determinization at each step of the composition of each pair of transducers. The main purpose of determinization is to eliminate redundant paths in the composed transducer, thereby substantially reducing recognition time. In addition, its use in intermediate steps of the construction also helps to improve the efficiency of composition and to reduce transducer size.

First, L is composed with G and determinized, yielding det(L ◦ G). The benefit of this determinization is the reduction of the number of alternative transitions at each state to at most the number of distinct phones at that state, while the original transducer may have as many as V outgoing transitions at some states, where V is the vocabulary size. For large tasks in which the vocabulary has 10⁵ to 10⁶ words, the advantages of this optimization are clear.

C is then composed with the resulting transducer and determinized. Similarly, H is composed with the context-dependent transducer and determinized. This last determinization increases sharing among HMM models that start with the same distributions. At each state of the resulting integrated transducer, there is at most one outgoing transition labeled with any given distribution name, reducing recognition time even more.

In a final step, we use the erasing operation πε that replaces the auxiliary distribution symbols by ε's. The complete sequence of operations is summarized by the following construction formula:

N = πε(det(H ◦ det(C ◦ det(L ◦ G)))),   (19)

where parentheses indicate the order in which the operations are performed. The result N is an integrated recognition transducer that can be constructed even in very large-vocabulary tasks and leads to a substantial reduction in recognition time, as the experimental results below show.

4.2.2. Minimization

Once we have determinized the integrated transducer, we can reduce it further by minimization. The auxiliary symbols are left in place, the minimization algorithm is applied, and then the auxiliary symbols are removed:

N = πε(min(det(H ◦ det(C ◦ det(L ◦ G))))).   (20)

Weighted minimization can be used in different semirings. Both minimization in the tropical semiring and minimization in the log semiring can be used in this context. It is not hard to prove that the results of these two minimizations have exactly the same number of states and transitions and only differ in how weight is distributed along paths. The difference in weights arises from differences in the definition of the weight pushing operation for different semirings.

Weight pushing in the log semiring has a very large beneficial impact on the pruning efficacy of a standard Viterbi beam search. In contrast, weight pushing in the tropical semiring, which is based on the lowest path weights described earlier, produces a transducer that may slow down beam-pruned Viterbi decoding many fold.

To push weights in the log semiring instead of the tropical semiring, the potential function is the − log of the total probability of paths from each state to the super-final state rather than the lowest weight from the state to the super-final state. In other words, the transducer is pushed in terms of probabilities along all future paths from a given state rather than the highest probability over the single best path. By using − log probability pushing, we preserve a desirable property of the language model, namely that the weights of the transitions leaving each state are normalized as in a probabilistic automaton [Carlyle and Paz, 1971]. We have observed that probability pushing makes pruning more effective [Mohri and Riley, 2001], and conjecture that this is because the acoustic likelihoods and the transducer probabilities are now synchronized to obtain the optimal likelihood ratio test for deciding whether to prune. We further conjecture that this reweighting is the best possible for pruning. A proof of these conjectures will require a careful mathematical analysis of pruning.
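The difference between the two pushing operations lies only in the potential function d[q]. The sketch below computes both potentials for a small acyclic automaton whose weights are negative log probabilities: the tropical potential is the best-path weight, while the log-semiring potential log-adds over all paths. The encoding follows the earlier sketches and is an assumption.

    # A minimal sketch of the tropical vs. log-semiring potentials used for pushing.
    import math

    def potentials(transitions, finals, semiring="log"):
        """transitions: dict state -> list of (label, weight, next_state), acyclic.
        finals: dict state -> final weight. Weights are -log probabilities.
        Returns d[q] for every state."""
        d = {}

        def dist(q):
            if q in d:
                return d[q]
            terms = []
            if q in finals:
                terms.append(finals[q])                  # empty path: just ρ(q)
            for _, w, nq in transitions.get(q, []):
                terms.append(w + dist(nq))               # ⊗ is + on -log weights
            finite = [t for t in terms if t != float("inf")]
            if not finite:
                d[q] = float("inf")
            elif semiring == "tropical":
                d[q] = min(finite)                       # ⊕ = min (best path)
            else:                                        # log semiring: ⊕ = log-add
                d[q] = -math.log(sum(math.exp(-t) for t in finite))
            return d[q]

        for q in set(transitions) | set(finals):
            dist(q)
        return d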

We have thus standardized the integrated transducer in our construction: it is the unique deterministic, minimal transducer for which the weights for all transitions leaving any state sum to 1 in probability, up to state relabeling. If one accepts that these are desirable properties of an integrated decoding transducer, then our methods obtain the optimal solution among all integrated transducers.


Figure 17: Recognition transducer construction: (a) grammar G, (b) lexicon L, (c) L ◦ G, (d) det(L ◦ G), (e) min_tropical(det(L ◦ G)), (f) min_log(det(L ◦ G)).


Figure 17 illustrates the steps in this construction. For simplicity, we consider a small toy grammar and show the construction only down to the context-independent phone level. Figure 17(a) shows the toy grammar G and Figure 17(b) shows the lexicon L. Note the word labels on the lexicon are on the initial transitions and that disambiguating auxiliary symbols have been added at the word ends. Figure 17(c) shows their composition L ◦ G. Figure 17(d) shows the resulting determinization, det(L ◦ G); observe how phone redundancy is removed. Figures 17(e)-(f) show the minimization step, min(det(L ◦ G)); identical futures are combined. In Figure 17(e), the minimization uses weight pushing over the tropical semiring, while in Figure 17(f), the log semiring is used.

4.2.3. Factoring

For efficiency reasons, our decoder has a separate representation for variable-length left-to-right HMMs, which we will call the HMM specification. The integrated transducer of the previous section does not take good advantage of this since, having combined the HMMs into the recognition transducer proper, the HMM specification consists of trivial one-state HMMs. However, by suitably factoring the integrated transducer, we can again take good advantage of this feature.

A path whose states other than the first and last have at most one outgoing and one incoming transition is called a chain. The integrated recognition transducer just described may contain many chains after the composition with H, and after determinization. As mentioned before, we do not explicitly represent the HMM-state self-loops but simulate them in the run-time decoder. The set of all chains in N is denoted by chain(N).

The input labels of N name one-state HMMs. We can replace the input of each length-n chain in N by a single label naming an n-state HMM. The same label is used for all chains with the same input string. The result of that replacement is a more compact transducer denoted by F. The factoring operation on N leads to the following decomposition:

N = H′ ◦ F,   (21)

where H′ is a transducer mapping variable-length left-to-right HMM state distribution names to n-state HMMs. Since H′ can be separately represented in the decoder's HMM specification, the actual recognition transducer is just F.

    transducer               states        transitions
    G                        1,339,664     3,926,010
    L ◦ G                    8,606,729     11,406,721
    det(L ◦ G)               7,082,404     9,836,629
    C ◦ det(L ◦ G)           7,273,035     10,201,269
    det(H ◦ C ◦ L ◦ G)       18,317,359    21,237,992
    F                        3,188,274     6,108,907
    min(F)                   2,616,948     5,497,952

Table 2: Size of the first-pass recognition transducers in the NAB 40,000-word vocabulary task.

Chain inputs are in fact replaced by a single label only when this helps to reduce the size of the transducer. This can be measured by defining the gain of the replacement of an input string σ of a chain by:

G(σ) = ∑_{π ∈ chain(N), i[π]=σ} ( |σ| − |o[π]| − 1 ),   (22)

where |σ| denotes the length of the string σ, i[π] the input label and o[π] the output label of a path π. The replacement of a string σ helps reduce the size of the transducer if G(σ) > 0.
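The gain of equation (22) can be tallied with a few lines. The sketch below assumes the chains are available as (input labels, output labels) pairs with ε-labels already removed, which is an illustrative encoding rather than the decoder's actual data structure.

    # A minimal sketch of the gain computation of equation (22).
    from collections import defaultdict

    def factoring_gains(chains):
        """chains: list of (input_labels, output_labels) tuples, one per chain,
        where each element is a tuple of labels. Returns a dict mapping each
        input string sigma to its gain G(sigma)."""
        gains = defaultdict(int)
        for input_labels, output_labels in chains:
            sigma = tuple(input_labels)
            gains[sigma] += len(sigma) - len(output_labels) - 1
        return dict(gains)

    # Only strings with a positive gain are worth replacing:
    # worth_replacing = {s for s, g in factoring_gains(chains).items() if g > 0}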

Our implementation of the factoring algorithm allows one to specify the maximum number r of replacements done (the r chains with the highest gain are replaced), as well as the maximum length of the chains that are factored.

Factoring does not affect recognition time. It can, however, significantly reduce the size of the recognition transducer. We believe that even better factoring methods may be found in the future.

4.2.4. Experimental Results – First-Pass Transducers

    (a) transducer              × real-time
        C ◦ L ◦ G               12.5
        C ◦ det(L ◦ G)          1.2
        det(H ◦ C ◦ L ◦ G)      1.0
        min(F)                  0.7

    (b) transducer              × real-time
        C ◦ L ◦ G               .18
        C ◦ det(L ◦ G)          .13
        C ◦ min(det(L ◦ G))     .02

Table 3: (a) Recognition speed of the first-pass transducers in the NAB 40,000-word vocabulary task at 83% word accuracy. (b) Recognition speed of the second-pass transducers in the NAB 160,000-word vocabulary task at 88% word accuracy.

We used the techniques discussed in the previous sections to build many recognizers. To illustrate the effectiveness of the techniques and explain some practical details, we discuss here an integrated, optimized recognition transducer for a 40,000-word vocabulary North American Business News (NAB) task. The following models are used:

• Acoustic model of 7,208 distinct HMM states, each with an emission mixture distribution of up to twelve Gaussians.

• Triphonic context-dependency transducer C with 1,525 states and 80,225 transitions.

• 40,000-word pronunciation dictionary L with an average of 1.056 pronunciations per word and an out-of-vocabulary rate of 2.3% on the NAB Eval '95 test set.

• Trigram language model G with 3,926,010 transitions built by Katz's back-off method with frequency cutoffs of 2 for bigrams and 4 for trigrams, shrunk with an epsilon of 40 using the method of [Seymore and Rosenfeld, 1996], which retained all the unigrams, 22.3% of the bigrams and 19.1% of the trigrams. Perplexity on the NAB Eval '95 test set is 164.4 (142.1 before shrinking).

We applied the transducer optimization steps as described in the previous section, except that we applied the minimization and weight pushing after factoring the transducer. Table 2 gives the size of the intermediate and final transducers.

Observe that the factored transducer min(F) has only about 40% more transitions than G. The HMM specification H′ consists of 430,676 HMMs with an average of 7.2 states per HMM. It occupies only about 10% of the memory of min(F) in the decoder (due to the compact representation possible from its specialized topology). Thus, the overall memory reduction from factoring is substantial.

We used these transducers in a simple, general-purpose, one-pass Viterbi decoder applied to the DARPA NAB Eval '95 test set. Table 3(a) shows the recognition speed on a Compaq Alpha 21264 processor for the various optimizations, where the word accuracy has been fixed at 83.0%. We see that the fully-optimized recognition transducer, min(F), substantially speeds up recognition.

To obtain improved accuracy, we might widen the decoder beam,7 use a larger vocabulary, or use a less shrunken language model. Figure 18(a) shows the effect of vocabulary size (with a bigram LM and optimization only to the L ◦ G level). We see that beyond 40,000 words, there is little benefit to increasing the vocabulary either in real-time performance or asymptotically. Figure 18(b) shows the effect of the language model shrinking parameter. These curves were produced by Stephan Kanthak of RWTH using our transducer construction, but RWTH's acoustic models, as part of a comparison with lexical tree methods [Kanthak et al., 2002]. As we can see, decreasing the shrink parameter from 40 as used above to 10 has a significant effect, while further reducing it to 5 has very little effect. An alternative to using a larger LM is to use a two-pass system to obtain improved accuracy, as described in the next section. This has the advantage that it allows quite compact shrunken bigram LMs in the first pass, while the second pass performs as well as the larger-model single-pass systems.

7 These models have an asymptotic wide-beam accuracy of 85.3%.

Figure 18: (a) Effect of vocabulary size: NAB bigram recognition results for vocabularies of (1) 10,000 words, (2) 20,000 words, (3) 40,000 words, and (4) 160,000 words (L ◦ G optimized only). (b) NAB Eval '95 recognition results for the Seymore & Rosenfeld shrink factors of 5, 10, and 40 (thanks to RWTH; uses RWTH acoustic models).

While our examples here have been on NAB, we have also applied these methods to Broadcast News [Saraclar et al., 2002], Switchboard, and various AT&T-specific large-vocabulary tasks [Allauzen et al., 2004b]. In our experience, fully-optimized and factored recognition transducers provide very fast decoding while often having well less than twice the number of transitions as their word-level grammars.

4.2.5. Experimental Results – Rescoring Transducers

The weighted transducer approach is also easily applied to multipass recognition. To illustrate this, we now show how to implement lattice rescoring for a 160,000-word vocabulary NAB task. The following models are used to build lattices in a first pass:

• Acoustic model of 5,520 distinct HMM states, each with an emission mixture distribution of up to four Gaussians.

• Triphonic context-dependency transducer C with 1,525 states and 80,225 transitions.

• 160,000-word pronunciation dictionary L with an average of 1.056 pronunciations per word and an out-of-vocabulary rate of 0.8% on the NAB Eval '95 test set.

• Bigram language model G with 1,238,010 transitions built by Katz's back-off method with frequency cutoffs of 2 for bigrams. It is shrunk with an epsilon of 160 using the method of [Seymore and Rosenfeld, 1996], which retained all the unigrams and 13.9% of the bigrams. Perplexity on the NAB Eval '95 test set is 309.9.

We used an efficient approximate lattice generation method [Ljolje et al., 1999] to generate word lattices. These word lattices are then used as the 'grammar' in a second rescoring pass. The following models are used in the second pass:

• Acoustic model of 7,208 distinct HMM states, each with an emission mixture distribution of up to twelve Gaussians. The model is adapted to each speaker using a single full-matrix MLLR transform [Leggetter and Woodland, 1995].

• Triphonic context-dependency transducer C with 1,525 states and 80,225 transitions.

• 160,000-word stochastic, TIMIT-trained, multiple-pronunciation lexicon L [Riley et al., 1999].

• 6-gram language model G with 40,383,635 transitions built by Katz's back-off method with frequency cutoffs of 1 for bigrams and trigrams, 2 for 4-grams, and 3 for 5-grams and 6-grams. It is shrunk with an epsilon of 5 using the method of Seymore and Rosenfeld, which retained all the unigrams, 34.6% of the bigrams, 13.6% of the trigrams, 19.5% of the 4-grams, 23.1% of the 5-grams, and 11.73% of the 6-grams. Perplexity on the NAB Eval '95 test set is 156.83.

We applied the transducer optimization steps described in the previous section, but only to the level of L ◦ G (where G is each lattice). Table 3(b) shows the speed of second-pass recognition on a Compaq Alpha 21264 processor for these optimizations when the word accuracy is fixed at 88.0% on the DARPA Eval '95 test set.8 We see that the optimized recognition transducers again substantially speed up recognition. The median number of lattice states and arcs is reduced by ∼50% by the optimizations.

8 The recognition speed excludes the offline transducer construction time.

5. CONCLUSION

We presented an overview of weighted finite-state transducer methods and their application to speech recognition. The methods are quite general, and can also be applied in other areas of speech and language processing, including information extraction, speech synthesis [Sproat, 1997, Allauzen et al., 2004a], and phonological and morphological analysis [Kaplan and Kay, 1994, Karttunen, 1995], as well as in optical character recognition, biological sequence analysis, other pattern matching and string processing applications [Crochemore and Rytter, 1994], and image processing [Culik II and Kari, 1997], to mention only some of the most active application areas.

Acknowledgments

We thank Andrej Ljolje for providing the acoustic models and Don Hindle and Richard Sproat for providing the language models used in our experiments.

The first author's work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR). The second author's work was partly funded by NSF grants EIA 0205456, EIA 0205448, and IIS 0428193.

References

Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison Wesley, Reading, MA, 1974.

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques and Tools. Addison Wesley, Reading, MA, 1986.

Cyril Allauzen and Mehryar Mohri. An Optimal Pre-Determinization Algorithm for Weighted Transducers. Theoretical Computer Science, 328(1-2):3–18, November 2004.

Cyril Allauzen and Mehryar Mohri. Efficient Algorithms for Testing the Twins Property. Journal of Automata, Languages and Combinatorics, 8(2):117–144, 2003.

Cyril Allauzen, Mehryar Mohri, and Michael Riley. Statistical Modeling for Unit Selection in Speech Synthesis. In 42nd Meeting of the Association for Computational Linguistics (ACL 2004), Proceedings of the Conference, Barcelona, Spain, July 2004a.

Cyril Allauzen, Mehryar Mohri, Brian Roark, and Michael Riley. A Generalized Construction of Integrated Speech Recognition Transducers. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada, May 2004b.

Jean Berstel. Transductions and Context-Free Languages. Teubner Studienbücher, Stuttgart, 1979.

Jean Berstel and Christophe Reutenauer. Rational Series and Their Languages. Springer-Verlag, Berlin-New York, 1988.

Jack W. Carlyle and Azaria Paz. Realizations by Stochastic Finite Automaton. Journal of Computer and System Sciences, 5:26–40, 1971.

Maxime Crochemore and Wojciech Rytter. Text Algorithms. Oxford University Press, 1994.

Karel Culik II and Jarkko Kari. Digital Images and Formal Languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, pages 599–616. Springer, 1997.

Samuel Eilenberg. Automata, Languages and Machines, volume A–B. Academic Press, 1974–1976.

John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, Reading, MA, 1979.

Stephan Kanthak, Hermann Ney, Michael Riley, and Mehryar Mohri. A Comparison of Two LVR Search Optimization Techniques. In Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP '02), Denver, Colorado, September 2002.

Ronald M. Kaplan and Martin Kay. Regular Models of Phonological Rule Systems. Computational Linguistics, 20(3), 1994.

Lauri Karttunen. The Replace Operator. In 33rd Meeting of the Association for Computational Linguistics (ACL 95), Proceedings of the Conference, MIT, Cambridge, Massachusetts. ACL, 1995.

Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.

Werner Kuich and Arto Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany, 1986.

Chris Leggetter and Phil Woodland. Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density HMMs. Computer Speech and Language, 9(2):171–186, 1995.

Daniel J. Lehmann. Algebraic Structures for Transitive Closures. Theoretical Computer Science, 4:59–76, 1977.

Andrej Ljolje, Fernando Pereira, and Michael Riley. Efficient General Lattice Generation and Rescoring. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, Hungary, 1999.

Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.

Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(2), 1997.

Mehryar Mohri. Statistical Natural Language Processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press, 2005.

Mehryar Mohri and Mark-Jan Nederhof. Regular Approximation of Context-Free Grammars through Transformation. In Jean-Claude Junqua and Gertjan van Noord, editors, Robustness in Language and Speech Technology, pages 153–163. Kluwer Academic Publishers, The Netherlands, 2001.

Mehryar Mohri and Fernando C. N. Pereira. Dynamic Compilation of Weighted Context-Free Grammars. In 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics, volume 2, pages 891–897, 1998.

Mehryar Mohri and Michael Riley. Integrated Context-Dependent Networks in Very Large Vocabulary Speech Recognition. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, Hungary, 1999.

Mehryar Mohri and Michael Riley. A Weight Pushing Algorithm for Large Vocabulary Speech Recognition. In Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech '01), Aalborg, Denmark, September 2001.

Mehryar Mohri and Michael Riley. Network Optimizations for Large Vocabulary Speech Recognition. Speech Communication, 25(3), 1998.

Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In ECAI-96 Workshop, Budapest, Hungary. ECAI, 1996.

Mehryar Mohri, Michael Riley, Don Hindle, Andrej Ljolje, and Fernando Pereira. Full Expansion of Context-Dependent Networks in Large Vocabulary Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), Seattle, Washington, 1998.

Mehryar Mohri, Fernando Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17–32, January 2000.

Mark-Jan Nederhof. Practical Experiments with Regular Approximation of Context-Free Languages. Computational Linguistics, 26(1), 2000.

Stefan Ortmanns, Hermann Ney, and A. Eiden. Language-Model Look-Ahead for Large Vocabulary Speech Recognition. In Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), pages 2095–2098. University of Delaware and Alfred I. duPont Institute, 1996.

Fernando Pereira and Michael Riley. Finite-State Language Processing, chapter Speech Recognition by Composition of Weighted Finite Automata. The MIT Press, 1997.

Fernando Pereira and Rebecca Wright. Finite-State Approximation of Phrase-Structure Grammars. In E. Roche and Y. Schabes, editors, Finite-State Language Processing, pages 149–173. MIT Press, 1997.

Dominique Revuz. Minimisation of Acyclic Deterministic Automata in Linear Time. Theoretical Computer Science, 92:181–189, 1992.

Michael Riley, Fernando Pereira, and Mehryar Mohri. Transducer Composition for Context-Dependent Network Expansion. In Proceedings of Eurospeech '97, Rhodes, Greece, 1997.

Michael Riley, William Byrne, Michael Finke, Sanjeev Khudanpur, Andrej Ljolje, John McDonough, Harriet Nock, Murat Saraclar, Charles Wooters, and George Zavaliagkos. Stochastic pronunciation modelling from hand-labelled phonetic corpora. Speech Communication, 29:209–224, 1999.

Arto Salomaa and Matti Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, New York, 1978.

Murat Saraclar, Michael Riley, Enrico Bocchieri, and Vincent Goffin. Towards automatic closed captioning: Low latency real time broadcast news transcription. In Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), 2002.

Kristie Seymore and Roni Rosenfeld. Scalable Backoff Language Models. In Proceedings of ICSLP, Philadelphia, Pennsylvania, 1996.

Richard Sproat. Multilingual Text Analysis for Text-to-Speech Synthesis. Journal of Natural Language Engineering, 2(4):369–380, 1997.

