
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 4, MAY 2009 803

Applications of Statistical Machine Translation Approaches to Spoken Language Understanding

Klaus Macherey, Oliver Bender, and Hermann Ney, Senior Member, IEEE

Abstract—In this paper, we investigate two statistical methods for spoken language understanding based on statistical machine translation. The first approach employs the source-channel paradigm, whereas the other uses the maximum entropy framework. Starting with an annotated corpus, we describe the problem of natural language understanding as a translation from a source sentence to a formal language target sentence. We analyze the quality of different alignment models and feature functions and show that the direct maximum entropy approach outperforms the source channel-based method. Furthermore, we investigate how both methods perform if the input sentences contain speech recognition errors. Finally, we investigate a new approach to combine speech recognition and spoken language understanding. For this purpose, we employ minimum error rate training which directly optimizes the final evaluation criterion. By combining all knowledge sources in a log-linear way, we show that we can decrease both the word error rate and the slot error rate. Experiments were carried out on two German in-house corpora for spoken dialogue systems.

Index Terms—Combined approach, machine translation, maximum entropy, minimum error rate training, speech recognition, spoken language understanding.

I. INTRODUCTION

In spoken language understanding (SLU), the objective is to extract all the information from a speech-based input that is relevant to a specific task. Formally, the objective can be described as follows: given an input utterance represented as a sequence of acoustic vectors $x_1^T = x_1, \dots, x_T$, we search for the sequence of concepts $\hat{c}_1^N = \hat{c}_1, \dots, \hat{c}_N$ having the highest probability among all possible concept sequences:

$\hat{c}_1^N = \operatorname{argmax}_{c_1^N} p(c_1^N \mid x_1^T)$   (1)

$= \operatorname{argmax}_{c_1^N} \sum_{w_1^M} p(x_1^T \mid w_1^M)\, p(w_1^M)\, p(c_1^N \mid w_1^M)$   (2)

Manuscript received May 21, 2008; revised November 23, 2008. Current version published April 01, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ruhi Sarikaya.

K. Macherey was with the Computer Science Department 6, RWTH Aachen University, D-52056 Aachen, Germany. He is now with Google, Inc., Mountain View, CA 94043 USA (e-mail: [email protected]).

O. Bender and H. Ney are with the Human Language Technology and Pattern Recognition, Computer Science Department 6, RWTH Aachen University, D-52056 Aachen, Germany (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2009.2014262

where the word sequence $w_1^M = w_1, \dots, w_M$ is a hidden variable and $c_1^N$ is assumed to be independent of the sequence of acoustic vectors given $w_1^M$. $p(x_1^T \mid w_1^M)\, p(w_1^M)$ denotes the statistical approach to automatic speech recognition (ASR) and $p(c_1^N \mid w_1^M)$ is the statistical natural language understanding (NLU) component. Often, (2) is approximated by a sequential decoding

$\hat{w}_1^M = \operatorname{argmax}_{w_1^M} p(x_1^T \mid w_1^M)\, p(w_1^M), \qquad \hat{c}_1^N = \operatorname{argmax}_{c_1^N} p(c_1^N \mid \hat{w}_1^M)$   (3)

that is, in the first step, the optimal word sequence $\hat{w}_1^M$ maximizing $p(x_1^T \mid w_1^M)\, p(w_1^M)$ is determined, which is then passed to the NLU component in the second step. This approximation can be suboptimal because errors made in the ASR component often remain undetected and are propagated to the subsequent NLU component. To alleviate this problem, the optimal word sequence can be replaced by word graphs or $N$-best lists on which the search for the optimal concept sequence is then carried out [1].

Modeling the conditional probability $p(c_1^N \mid w_1^M)$ is a complex task. Many NLU systems perform a rule-based semantic analysis using stochastic context-free grammars in order to determine the optimal sequence of concepts [2]. Although rule-based approaches are useful if only little or no data is available, providing handcrafted, consistent rules is often very costly. Therefore, alternative approaches are useful that reduce the manual effort of writing rules explicitly. In this paper, our main focus is on modeling the NLU probability and on combining ASR with NLU such that the overall error criterion is minimized. The NLU models we propose are based on statistical machine translation (SMT). Throughout the paper, we will make use of ideas and concepts used in SMT and apply them to the problem of NLU. The remainder of this paper is organized as follows. In Section II, we give a general view of NLU based on SMT from which we derive two approaches, a source channel-based approach and a maximum entropy (ME)-based approach. We describe the concept language used and introduce alignment models employed for both approaches. In Section III, we describe the source-channel approach using phrase-based machine translation. In Section IV, we define several feature functions that are suitable for NLU and are used in the ME model. The framework for combining ASR with NLU is outlined in Section V. Related work is discussed in Section VI and results are presented in Section VII.

II. GENERAL VIEW OF NATURAL LANGUAGE UNDERSTANDING

The objective of NLU is to extract all the information from a natural language-based input that is relevant to a specific task.

1558-7916/$25.00 © 2009 IEEE


Often, stochastic grammars are used for this task, requiring handcrafted rules, that is, dependencies between words and concepts must be modeled explicitly. Although partial parsing techniques are well suited to parse ungrammatical sentences, the sole usage of rule-based methods can turn out to be inflexible. Especially when extending the application scenario or the application domain itself, many rules must be rewritten by hand to adjust them to new phrases and keywords. Therefore, the question is how this process can be simplified and whether the complex dependencies between words and concepts can be learned automatically by just providing pairs of sentences and concept strings.

To achieve this goal, we take a closer look at the NLU problem. Basically, NLU can be viewed as a translation from a source language to a formal target language, that is, given a natural language source sentence $w_1^M = w_1, \dots, w_M$ we choose the formal language target sentence $\hat{c}_1^N = \hat{c}_1, \dots, \hat{c}_N$ among the space of all possible translations that has the highest probability, leading to the following decision rule:

$\hat{c}_1^N = \operatorname{argmax}_{c_1^N} p(c_1^N \mid w_1^M)$   (4)

$= \operatorname{argmax}_{c_1^N} p(w_1^M \mid c_1^N)\, p(c_1^N)$   (5)

Equations (4) and (5) induce two different approaches: either we can directly optimize the posterior probability $p(c_1^N \mid w_1^M)$ using, for example, the maximum entropy framework, or we can employ the classical source-channel approach using Bayes' theorem and decompose the posterior probability into two probability distributions: the translation probability $p(w_1^M \mid c_1^N)$ and the concept language probability $p(c_1^N)$. Once we have substituted these probability distributions with suitable model distributions, we can learn the model parameters with supervised learning techniques. Before we describe the methods in more detail, we will first discuss the concept language used and review some basic ideas from machine translation like alignment models that we employ for both approaches in order to automatically align words to concepts.

A. Concept Language

A concept is defined as the smallest unit of meaning that is relevant to a specific task [3]. Concept representations are sometimes used in a hierarchical manner, that is, concepts can be nested and subsume other concepts [4]. Here, we use non-hierarchical concepts which we call flat concepts. By using flat concepts as formal target language, the annotation of the input sentences can easily be provided because for a human annotator no structural knowledge about concepts and their interplay is necessary. Although not a strict requirement, we assume that the concepts occur in the same order as given by the source text. An example is provided in Fig. 1. The links between the sentences indicate to which concepts the words are aligned. Note that these alignments between words and concepts are not given explicitly but must be determined automatically from the training data which is provided as pairs of source sentences and

Fig. 1. Example of a correctly and wrongly aligned word sequence for the utterance "I would like to go from Cologne to Munich." In (a), the phrase "from Cologne" is correctly aligned to the concept @origin. In (b), the city name "Cologne" is wrongly aligned to the concept @destination. If each concept together with its associated words is considered separately, it will be impossible to extract the attribute value for the concept @origin in case (b).

concept strings. Although not depicted in the figure, each concept is associated with a (possibly empty) set of attributes and each attribute is associated with a more or less complex rule describing how to extract the value from the words aligned to the concept.

B. Alignment Models and Local Context Dependencies

As depicted in Fig. 1, the meaning of a word and thus its alignment to the correct concept depends on its context. For example, the allocation of the city name Köln (Cologne) to the concept @origin is only apparent when looking at the preceding preposition von (from) in the source sentence. Although the sequence of concepts is identical in both cases, only case (a) allows for a correct extraction of the attribute values. If each concept together with its aligned words is considered independently from all the other concepts, it will be impossible to extract the correct attribute value from the concept @origin in case (b). Therefore, finding correct alignments automatically is essential for both approaches. In the following, we will briefly review different alignment models that are used in machine translation and which we will use for either approach. A consequence of words and concepts having the same order is that the generated alignments will be monotonic. Also, we enforce that every word is aligned to exactly one concept. This might be problematic if the source language contains compounds that could be aligned to more than one concept. For example, the German compound "Privatadresse (home address)," as opposed to employer's address, could be aligned to two target concepts @private and @address to distinguish it from "Büroadresse (office address)," which could be mapped to @office and @address. To enforce the one-word-to-one-concept constraint, we can design the target language appropriately and shift the distinction between home and office address into the attributes of the @address concept. Words not contributing to the meaning of a sentence are aligned to filler concepts.

For the introduction of the alignment models, we switch our notation and identify a word sequence with $f_1^J$ and a sequence of concepts with $e_1^I$ in order to use the same notation that is common in machine translation. If we rewrite the translation probability $p(f_1^J \mid e_1^I)$ by introducing a hidden alignment $a_1^J$


Fig. 2. Phrase-based translation with alignment templates for the utterance "yes, hello, I need a connection from Cologne to Munich." The words on the horizontal axis denote the source words, the words on the vertical axis are the concepts. The rectangles mark phrase boundaries; the solid boxes within the rectangles describe word-to-concept alignments.

, with $a_j \in \{1, \dots, I\}$, we obtain the following identity:

$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)$   (6)

$p(f_1^J, a_1^J \mid e_1^I) = p(J \mid e_1^I) \prod_{j=1}^{J} p(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$   (7)

$= p(J \mid e_1^I) \prod_{j=1}^{J} p(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)\, p(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I)$   (8)

The different alignment models we consider result from different decompositions of $p(f_1^J, a_1^J \mid e_1^I)$. In the following, we will briefly introduce these alignment models which were proposed in [5] and [6], and are the basis for extracting word-concept phrase pairs.

1) HMM Alignment: If we assume a first-order dependence $p(a_j \mid a_{j-1}, I)$ for the alignment $a_j$ in (8), restrict the dependent quantities of the translation probability only to $e_{a_j}$, and assume a simple length model $p(J \mid e_1^I) = p(J \mid I)$, we arrive at the HMM alignment [6]

$p(f_1^J, a_1^J \mid e_1^I) = p(J \mid I) \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\, p(f_j \mid e_{a_j})$   (9)

2) Model 1: By replacing the dependency in (9) from $a_{j-1}$ to $j$, we obtain a zero-order hidden Markov model. Assuming a uniform alignment probability $p(a_j \mid j, I) = 1/I$, we obtain the much simpler Model 1 which we use as starting point

$p(f_1^J, a_1^J \mid e_1^I) = \frac{p(J \mid I)}{I^J} \prod_{j=1}^{J} p(f_j \mid e_{a_j})$   (10)
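Model 1 is usually trained with the EM algorithm: the E-step distributes each source word's probability mass over the aligned target concepts, and the M-step renormalizes the expected counts. The following is a minimal sketch of that standard procedure on a tiny invented corpus, ignoring the length model $p(J \mid I)$; all data and names are illustrative, not the paper's setup.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM training sketch for IBM Model 1 lexical probabilities.
    corpus: list of (source_words, target_concepts) pairs.
    Returns t with t[f][e] ~ p(f | e)."""
    src_vocab = {f for fs, _ in corpus for f in fs}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[f][e] for e in es)   # E-step normalization
                for e in es:
                    frac = t[f][e] / norm         # expected alignment count
                    count[f][e] += frac
                    total[e] += frac
        for f in count:                           # M-step: renormalize
            for e in count[f]:
                t[f][e] = count[f][e] / total[e]
    return t

corpus = [(["from", "cologne"], ["@origin"]),
          (["to", "cologne"], ["@destination"]),
          (["from", "munich"], ["@origin"])]
t = train_model1(corpus)
# "from" co-occurs only with @origin, so it ends up more probable there
# than the city name "cologne", which also appears under @destination.
assert t["from"]["@origin"] > t["cologne"]["@origin"]
```

In the paper's pipeline, such Model 1 parameters serve only as the starting point for the more complex models that follow.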

3) Model 3, Model 4, and Model 5: In Models 3 and 4, fertilities of target words are introduced. The fertility of a target word is the number of source words it can generate. Models 3–5 are alignment models with increasing complexity. Their exact definitions can be found in [5] but are not important for the basic ideas presented in this paper. Starting from Model 1, the sequential application of the more complex alignment models together with a search process finally yields the Viterbi alignment between words and concepts.

III. NATURAL LANGUAGE UNDERSTANDING USING THE SOURCE-CHANNEL PARADIGM

For the source channel-based approach, we use a phrase-based translation system based on so-called alignment templates [7]. Due to the restriction $a_j \in \{1, \dots, I\}$, the basic alignment models introduced in the previous section can capture only 1-to-many alignments. A phrase-based translation system goes one step further and allows many-to-many alignments by providing a two-level alignment: a phrase-level alignment and a within-phrase many-to-many word-level alignment. The key idea for obtaining many-to-many word alignments is to symmetrize the training directions. By performing a training in both translation directions (from source to target and from target to source) we obtain two Viterbi alignments for each sentence pair which can be merged. Although 1-to-many alignments are sufficient because of the restriction that each word must align to exactly one concept, alignments obtained from symmetrization and merging usually have a higher precision [8]. Let the source and target sentence be segmented into $K$ word-groups describing the phrases $\tilde{f}_1^K$ and $\tilde{e}_1^K$.

By decomposing the translation probability with the above-mentioned definitions, we arrive at the following first-order decomposition:

$p(f_1^J \mid e_1^I) = \prod_{k=1}^{K} p(\tilde{f}_k \mid \tilde{e}_k)$   (11)

With $z = (F, E, \tilde{A})$ denoting an alignment template, we obtain $p(\tilde{f}_k \mid \tilde{e}_k) = \sum_{z} p(\tilde{f}_k, z \mid \tilde{e}_k)$. The phrase translation probability is then decomposed according to the following equation:

$p(\tilde{f}_k, z \mid \tilde{e}_k) = p(z \mid \tilde{e}_k)\, p(\tilde{f}_k \mid z, \tilde{e}_k)\, \delta(F, \tilde{f}_k)\, \delta(E, \tilde{e}_k)$   (12)

The Kronecker functions $\delta(\cdot, \cdot)$ ensure that only those alignment templates are chosen which are consistent with $\tilde{f}_k$ and $\tilde{e}_k$. The probability $p(\tilde{f}_k \mid z, \tilde{e}_k)$ can be decomposed in the following


Fig. 3. Architecture of the natural language understanding component using a statistical machine translation approach based on the source-channel paradigm.

way:

$p(\tilde{f} \mid z, \tilde{e}) = \prod_{j} p(f_j \mid z, \tilde{e})$   (13)

$p(f_j \mid z, \tilde{e}) = \sum_{i} p(i \mid j, \tilde{A})\, p(f_j \mid e_i)$   (14)

$p(i \mid j, \tilde{A}) = \begin{cases} \dfrac{1}{|\{i' : (i', j) \in \tilde{A}\}|} & \text{if } (i, j) \text{ are linked in the merged alignment} \\ 0 & \text{otherwise} \end{cases}$   (15)

Fig. 2 depicts an example of a phrase-based translation using alignment templates.

A. Training

In machine translation, source words that have no correspondence on the target side are mapped onto the empty word. In contrast to translations between natural languages, we do not allow empty words, that is, each word of the source sentence must be aligned to a concept of the target sentence, since words that do not contribute to the meaning of a sentence are explicitly modeled by a filler concept. The same holds if we exchange source and target language. This results in the following training procedure.

1) The fertility for the empty word is explicitly set to 0.
2) We compute the Viterbi alignment for each sentence of the training corpus and symmetrize the training directions [8].
3) We compute the alignment matrices using intersections with refinements as combination method [8].
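The symmetrization in steps 2) and 3) can be sketched as follows: start from the intersection of the two directional Viterbi alignments and then add links from the union that neighbor an existing link and whose row or column is still uncovered. This only loosely follows the refined combination heuristic of [8]; the exact refinement rules may differ, and the alignment representation here (dicts mapping positions to positions) is illustrative.

```python
def symmetrize(a_s2t, a_t2s):
    """Merge two directional Viterbi alignments into one set of links.
    a_s2t: dict source position j -> target position i (source-to-target Viterbi)
    a_t2s: dict target position i -> source position j (target-to-source Viterbi)
    Returns a set of (j, i) links."""
    s2t = {(j, a_s2t[j]) for j in a_s2t}
    t2s = {(a_t2s[i], i) for i in a_t2s}
    union, links = s2t | t2s, s2t & t2s   # start from the intersection
    added = True
    while added:
        added = False
        for (j, i) in sorted(union - links):
            # add a union link if it touches an existing link and its row
            # or column is not yet covered
            neighbors = {(j - 1, i), (j + 1, i), (j, i - 1), (j, i + 1)}
            row_free = all(jj != j for (jj, _) in links)
            col_free = all(ii != i for (_, ii) in links)
            if neighbors & links and (row_free or col_free):
                links.add((j, i))
                added = True
    return links

# two source words, one concept: the merge recovers both links
assert symmetrize({0: 0, 1: 0}, {0: 0}) == {(0, 0), (1, 0)}
```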

B. Search

According to (5), both probability distributions are combined during a search process which leads to the architecture depicted in Fig. 3. If we plug in the phrase-based translation model together with a standard left-to-right trigram language model into the source-channel approach, we obtain the following search criterion in maximum approximation which is used in combination with beam search:

$\hat{e}_1^I = \operatorname{argmax}_{e_1^I} \left\{ p(f_1^J \mid e_1^I)\, p(e_1^I) \right\}$   (16)

$\approx \operatorname{argmax}_{e_1^I, \tilde{e}_1^K, z_1^K} \left\{ \prod_{k=1}^{K} p(z_k \mid \tilde{e}_k)\, p(\tilde{f}_k \mid z_k, \tilde{e}_k) \cdot \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1}) \right\}$   (17)

IV. NATURAL LANGUAGE UNDERSTANDING USING MAXIMUM ENTROPY

Alternatively to the source-channel approach, we can directly model the posterior probability $p(c_1^N \mid w_1^M)$ in (4). A well-founded framework for doing this is maximum entropy (ME). This framework allows for integrating information from many heterogeneous information sources for classification. The data for a classification problem is described as a number of problem-specific features $h_m(w_1^M, c_1^N)$. Each feature corresponds to a constraint of the model and is associated with a model parameter $\lambda_m$. Among all possible models that satisfy these constraints, the model with maximum entropy is computed during the training phase [9]. Within this framework, the posterior probability can be modeled as follows:

$p_{\lambda}(c_1^N \mid w_1^M) = \dfrac{\exp\left[\sum_{m} \lambda_m h_m(w_1^M, c_1^N)\right]}{\sum_{\tilde{c}_1^N} \exp\left[\sum_{m} \lambda_m h_m(w_1^M, \tilde{c}_1^N)\right]}$   (18)
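The log-linear form of (18) is a softmax over weighted feature sums. The sketch below normalizes over a small explicit candidate set for illustration (the true denominator runs over all concept sequences); the feature functions and weights are invented toy values.

```python
import math

def me_posterior(features, weights, candidates, x):
    """p(c | x) proportional to exp(sum_m lambda_m * h_m(x, c)), Eq. (18) sketch.
    features: list of callables h(x, c) -> float; weights: matching lambdas."""
    def score(c):
        return sum(lam * h(x, c) for lam, h in zip(weights, features))
    z = sum(math.exp(score(c)) for c in candidates)   # normalization constant
    return {c: math.exp(score(c)) / z for c in candidates}

# two toy binary features over a single word
h1 = lambda x, c: 1.0 if x == "cologne" and c == "@origin" else 0.0
h2 = lambda x, c: 1.0 if c == "@filler" else 0.0
p = me_posterior([h1, h2], [2.0, 0.5], ["@origin", "@destination", "@filler"], "cologne")
assert abs(sum(p.values()) - 1.0) < 1e-9
assert p["@origin"] > p["@filler"] > p["@destination"]
```

Because the score is a sum of weighted features, heterogeneous knowledge sources can be mixed freely, which is the property the paper exploits.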

The architecture of the ME approach is depicted in Fig. 4. The feature functions in (18) depend on complete sentences. Similar to [10], we could generate a set of candidate translations and apply the ME framework on top of this candidate set, so that the parameters $\lambda_m$ can be estimated such that we obtain better translations. This step would require a separate decoder to generate the candidate translations, which we want to avoid. If we can formulate the NLU task directly within the ME framework, we will not depend on an additional classifier generating lists of candidate translations, but would define feature functions on the basis of words and phrases and apply them directly on the source sentences. Furthermore, we do not just want to combine the knowledge sources from the previous section in a log-linear way, but also want to overcome some shortcomings of the alignment templates approach. A shortcoming of the alignment templates approach is that each word can be used in at most one template. Overlapping templates are not allowed due to the segmentation into word groups. For the ME approach, we still want to assign each word to one concept, but we also want to use the context of each word in a more flexible way such that each word can be used in more than one feature function. To solve this problem in the ME approach, we proceed as follows. We slightly modify the NLU task such that source and target sentence are of the same length, yielding a one-to-one


Fig. 4. Architecture of the natural language understanding component using a statistical machine translation approach based on the maximum entropy framework.

correspondence between words and concepts.¹ As discussed in Section II-B, mapping compounds to concepts is not a problem. A case that occurs more frequently is that several contiguous words are mapped onto a single concept. Therefore, we distinguish whether a word belongs to an initial or a noninitial concept. This modeling was explicitly required in shallow parsing tasks [11]. In the context of NLU, it was introduced in [12]. With these changes, our classification model becomes very similar to maximum entropy Markov models (MEMMs) which are often used in information extraction tasks [13]. The difference to a standard information extraction problem is that for the NLU task described in this paper, we do not know the correct alignment between words and concepts during training, that is, word and concept sequences have different lengths. Thus, MEMMs are not directly applicable here. We solve this problem by first computing the Viterbi alignment between words and concepts on our training data and then by keeping the alignment fixed for all subsequent training steps. Fig. 5 depicts a one-to-one mapping applied to a sentence/concept pair from the German TABA corpus. We assume that the decisions only depend on a window $w_{j-2}^{j+2}$ around word $w_j$ and on the predecessor concept $c_{j-1}$. Thus, we can factorize the posterior probability of (4) and obtain the following first-order model:

$p(c_1^M \mid w_1^M) = \prod_{j=1}^{M} p(c_j \mid c_{j-1}, w_{j-2}^{j+2})$   (19)

$p(c_j \mid c_{j-1}, w_{j-2}^{j+2}) = \dfrac{\exp\left[\sum_{m} \lambda_m h_m(c_{j-1}, c_j, w_{j-2}^{j+2})\right]}{\sum_{c'} \exp\left[\sum_{m} \lambda_m h_m(c_{j-1}, c', w_{j-2}^{j+2})\right]}$   (20)

Due to the distinction between initial and noninitial concepts, we must ensure that a noninitial concept only follows its corresponding initial one, which can be guaranteed by appropriate transition features.

¹Because a concept can cover multiple words, this will turn the notion of a concept into a fractional concept which we call a concept tag. Thus, a consecutive sequence of similar concept tags then forms a concept. For ease of terminology, we will continue to speak of concepts rather than of tags.

Fig. 5. Example of a sentence-to-concept mapping for the utterance "yes, hello, I need a connection from Cologne to Munich" using maximum entropy. Each concept starts with an initial concept marked by "i" and optionally continues with a noninitial concept denoted by "n."
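The initial/noninitial distinction can be illustrated by expanding a segment-level annotation into one tag per word, similar to BIO-style chunk tagging. The function name, tag suffixes ("_i"/"_n"), and the example segmentation are illustrative, not the paper's exact notation.

```python
def to_concept_tags(words, segments):
    """Expand a segment-level annotation into one concept tag per word,
    marking the first word of each concept as initial ('_i') and the
    remaining words as noninitial ('_n').
    segments: list of (concept, number_of_words) in source order."""
    tags = []
    for concept, length in segments:
        tags.append(concept + "_i")
        tags.extend([concept + "_n"] * (length - 1))
    assert len(tags) == len(words)   # one-to-one word/tag correspondence
    return tags

words = "I would like a connection from Cologne".split()
segments = [("@filler", 5), ("@origin", 2)]
print(to_concept_tags(words, segments))
# ['@filler_i', '@filler_n', '@filler_n', '@filler_n', '@filler_n', '@origin_i', '@origin_n']
```

A noninitial tag is only legal immediately after a tag of the same concept, which is exactly the constraint enforced by the transition features above.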

A. Feature Functions

In the following, we will describe a set of binary-valued feature functions that are used for text-based NLU. We will introduce additional feature functions when we apply the ME framework on speech-based inputs.

1) Lexical Features: The lexical feature $h_{\tilde{w}, c, k}$ employs lexical information. The parameter $k$ allows for taking context words into account. Formally, the feature is defined as follows:

$h_{\tilde{w}, c, k}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \delta(w_{j+k}, \tilde{w}) \cdot \delta(c_j, c), \quad k \in \{-2, \dots, 2\}$   (21)

It will only fire if the word $w_{j+k}$ matches $\tilde{w}$ and if the prediction for the current concept is equal to $c$. Here, $\delta(\cdot, \cdot)$ denotes the Kronecker function.

2) Prefix and Suffix Features: In case a test corpus contains words that were not observed in the training corpus (unknown words), the necessity for vocabulary-independent features arises. To achieve this, we generalize the words by looking at their prefixes or suffixes. Let $w_j = p \cdot s$ be a decomposition of word $w_j$ such that $p$ is a prefix of $w_j$ and $s$ is its suffix. If the prefix (suffix) of $w_j$ is equal to a given prefix $\tilde{p}$ (suffix $\tilde{s}$), these features will fire:

$h_{\tilde{p}, c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \begin{cases} 1 & \text{if } w_j = \tilde{p} \cdot s \text{ and } c_j = c \\ 0 & \text{otherwise} \end{cases}$   (22)

$h_{\tilde{s}, c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \begin{cases} 1 & \text{if } w_j = p \cdot \tilde{s} \text{ and } c_j = c \\ 0 & \text{otherwise} \end{cases}$   (23)

3) Capitalization Features: A capitalization feature will fire if $w_j$ starts with a capitalized letter, has an internal capital letter,


or is all capitalized:

$h_{\text{cap}, c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \begin{cases} 1 & \text{if } w_j \text{ starts with a capital letter and } c_j = c \\ 0 & \text{otherwise} \end{cases}$   (24)

$h_{\text{int}, c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \begin{cases} 1 & \text{if } w_j \text{ has an internal capital letter and } c_j = c \\ 0 & \text{otherwise} \end{cases}$   (25)

$h_{\text{all}, c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \begin{cases} 1 & \text{if } w_j \text{ is all capitalized and } c_j = c \\ 0 & \text{otherwise} \end{cases}$   (26)

Capitalization features can be used to detect new proper names that were not observed during training.

4) Transition Features: Transition features model the dependence on the predecessor concept:

$h_{c', c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \delta(c_{j-1}, c') \cdot \delta(c_j, c)$   (27)

This feature will fire if the prediction of the current class is equal to $c$ and the predecessor concept is equal to $c'$.

5) Prior Features: The single concept priors are incorporated by prior features. The prior feature describes the prior knowledge of the concept $c$ on the training corpus and is the concept's unigram count:

$h_{c}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \delta(c_j, c)$   (28)

6) Compound Features: The feature functions defined so far produce only features that refer to single words or concepts. To enable word phrases and word-concept combinations, we introduce the following compound features:

$h_{\tilde{w}_1^n, c, k}(c_{j-1}, c_j, w_{j-2}^{j+2}) = \prod_{l=1}^{n} \delta(w_{j+k+l-1}, \tilde{w}_l) \cdot \delta(c_j, c)$   (29)

where $n$ defines the length of a compound feature. For sequences of words, the compound feature is constructed of lexical features. For sequences of concepts, it corresponds to multiple transition features with the additional constraint that the concepts are consecutive.
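The binary features above can be sketched as small closures that share the signature of (20), i.e., they see the predecessor concept, the current concept, and a window of words around position j. The constructor names, the sentence-list window representation, and the example values are illustrative, not the paper's implementation.

```python
def lexical_feature(target_word, concept, offset=0):
    """Sketch of Eq. (21): fires iff the word at position j+offset equals
    target_word and the predicted concept equals concept."""
    def h(prev_c, c, words, j):
        k = j + offset
        in_range = 0 <= k < len(words)   # window positions may fall off the sentence
        return 1.0 if in_range and words[k] == target_word and c == concept else 0.0
    return h

def suffix_feature(suffix, concept):
    """Sketch of Eq. (23): fires iff the current word ends in the given suffix."""
    def h(prev_c, c, words, j):
        return 1.0 if words[j].endswith(suffix) and c == concept else 0.0
    return h

def transition_feature(prev_concept, concept):
    """Sketch of Eq. (27): fires iff predecessor and current concept match."""
    def h(prev_c, c, words, j):
        return 1.0 if prev_c == prev_concept and c == concept else 0.0
    return h

words = ["from", "cologne"]
h_ctx = lexical_feature("from", "@origin", offset=-1)   # preceding-word context
assert h_ctx(None, "@origin", words, 1) == 1.0
assert suffix_feature("ogne", "@origin")(None, "@origin", words, 1) == 1.0
assert transition_feature("@filler", "@origin")("@filler", "@origin", words, 1) == 1.0
```

Because each constructor fixes its parameters in a closure, the same word position can feed many overlapping features, which is precisely the flexibility the alignment templates approach lacks.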

B. Training

The convexity of the ME objective function prohibits the incorporation of hidden variables. Thus, it is not possible to incorporate a hidden alignment as we did in Section III. Therefore, we will determine the alignment beforehand and keep it fixed during training. For training, we consider the set of aligned training sentences to form a single long sentence. Since models trained within the ME framework tend to overfit on training data, we follow [14] and apply a Gaussian prior on the parameter set. This yields the following training criterion:

$\{\hat{\lambda}_m\} = \operatorname{argmax}_{\{\lambda_m\}} \left\{ \sum_{j} \log p_{\lambda}(c_j \mid c_{j-1}, w_{j-2}^{j+2}) + \sum_{m} \log p(\lambda_m) \right\}$   (30)

where

$\log p(\lambda_m) = -\frac{\lambda_m^2}{2\sigma^2} + \text{const}$   (31)

with the standard deviation $\sigma$ as the smoothing parameter.
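The criterion in (30)–(31) is a conditional log-likelihood penalized by an L2 term. The sketch below evaluates that objective for a fixed-alignment training set; the event tuple layout (prev concept, concept, words, position, class set) and all example values are assumptions for illustration, and a real trainer would maximize this with a gradient-based method such as L-BFGS.

```python
import math

def penalized_log_likelihood(data, features, weights, sigma):
    """Training criterion sketch for Eqs. (30)-(31): conditional log-likelihood
    of fixed-alignment events minus the Gaussian prior penalty sum(l^2/(2*sigma^2)).
    data: list of (prev_c, c, words, j, candidate_classes) events."""
    ll = 0.0
    for prev_c, c, words, j, classes in data:
        def score(cc):
            return sum(l * h(prev_c, cc, words, j) for l, h in zip(weights, features))
        z = sum(math.exp(score(cc)) for cc in classes)   # per-event normalization
        ll += score(c) - math.log(z)
    penalty = sum(l * l for l in weights) / (2.0 * sigma ** 2)
    return ll - penalty

# one toy event with two candidate classes and a single feature
h = lambda prev_c, c, words, j: 1.0 if c == "@origin" else 0.0
data = [(None, "@origin", ["cologne"], 0, ["@origin", "@filler"])]
# with all-zero weights every class is equiprobable: log-likelihood = -log(2)
assert abs(penalized_log_likelihood(data, [h], [0.0], 1.0) + math.log(2.0)) < 1e-9
```

Shrinking $\sigma$ increases the penalty and pulls the weights toward zero, which is how the prior counters overfitting.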

C. Search

For decoding a sequence of words $w_1^M$, we perform a breadth-first search where all hypotheses are expanded position-synchronously, that is, we start with expanding hypotheses for word $w_1$ at position 1, then continue with hypotheses for word $w_2$ at position 2, etc. To solve the search problem in maximum approximation (Viterbi search) we define the auxiliary quantity $Q(j, c)$ as the maximal probability of the best partial path covering positions $1, \dots, j$ and ending in concept $c$. This leads to the following recursive equations which are solved using dynamic programming:

$Q(j, c) = \max_{c'} \left\{ p(c \mid c', w_{j-2}^{j+2}) \cdot Q(j-1, c') \right\}, \qquad Q(0, c) = \delta(c, \$)$   (32)

$\hat{c}_M = \operatorname{argmax}_{c} Q(M, c)$   (33)

Here, $\$$ denotes the sentence start symbol. If no pruning is applied, the Viterbi search is guaranteed to find the most likely sequence of concepts w.r.t. the underlying models.
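The position-synchronous recursion of (32)–(33) can be sketched with a standard Viterbi table plus backpointers. The transition model is assumed to be given as a callable `trans_prob(prev_c, c, words, j)`; names and the toy model in the test are illustrative.

```python
def viterbi_decode(words, concepts, trans_prob):
    """Dynamic-programming sketch of Eqs. (32)-(33).
    trans_prob(prev_c, c, words, j) -> p(c | prev_c, window around position j).
    Q[j][c] holds the probability of the best partial path ending in c at j."""
    START = "$"   # sentence start symbol
    Q = [{c: trans_prob(START, c, words, 0) for c in concepts}]
    back = [dict()]
    for j in range(1, len(words)):
        Q.append({})
        back.append({})
        for c in concepts:
            best_prev = max(Q[j - 1],
                            key=lambda cp: Q[j - 1][cp] * trans_prob(cp, c, words, j))
            Q[j][c] = Q[j - 1][best_prev] * trans_prob(best_prev, c, words, j)
            back[j][c] = best_prev
    # backtrace from the best final concept
    c = max(Q[-1], key=Q[-1].get)
    path = [c]
    for j in range(len(words) - 1, 0, -1):
        c = back[j][c]
        path.append(c)
    return path[::-1]
```

Without pruning this explores all $|C|$ concepts per position, i.e., $O(M \cdot |C|^2)$ transitions, and returns the exact maximizer of the model score.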

V. COMBINING SPEECH RECOGNITION AND NATURAL LANGUAGE UNDERSTANDING

As discussed in Section I, the sequential application of ASR and NLU can turn out to be suboptimal. Therefore, we need an approach that allows for a tighter coupling between both components. Furthermore, the approach used should allow us to easily integrate additional knowledge sources. The ASR module is based on the source-channel paradigm. Since it is difficult to integrate additional knowledge sources into this framework, we will use $N$-best lists for representing the most likely word hypotheses that are promising candidates for the NLU task, and perform all following investigations on this structure. Using $N$-best lists rather than word graphs has the advantage that also sentence-based features can be employed. The features will be combined in a log-linear model and the corresponding feature weights are trained in a discriminative manner using the minimum error rate criterion. We will proceed as follows: we briefly review the minimum error rate criterion and describe the optimization of the feature weights using the line sweep algorithm. The optimization is an iterative procedure that is applied on an $N$-best repository. As error criterion to be minimized we will use both the slot error rate and the word error rate. Thus, we can investigate whether for a given set of feature functions the NLU component can benefit from ASR features and, vice versa, whether the ASR component can benefit from features derived from the NLU component.

A. Minimum Error Criterion

Minimum error criteria have a wide range of applications in pattern recognition tasks. In the field of ASR, a minimum error criterion was for example employed in [15]. In the context of


machine translation, it was first used in [16]. To the best of ourknowledge minimum error criteria have not been applied to SLUtasks. Since we regard NLU as a translation task, we can definethe minimum error criterion similar to the approach proposed in[16] with the extension that we aim at minimizing both the worderror rate and the slot error rate. The challenging task is then tofind appropriate feature functions that result in a reduction ofboth error rates.

Let $x_s$ denote the sequence of acoustic observation vectors, $w$ the sequence of words, and $c$ the sequence of concepts. We then want to find the sequence of parameters $\lambda_1^M$ which minimizes the number of errors, denoted by an error function $E$:

$\hat{\lambda}_1^M = \operatorname{argmin}_{\lambda_1^M} \sum_{s=1}^{S} E\big((w,c)_s^{\mathrm{ref}},\, (\hat{w},\hat{c})_s(\lambda_1^M)\big)$   (34)

where

$(\hat{w},\hat{c})_s(\lambda_1^M) = \operatorname{argmax}_{(w,c)\in\mathcal{C}_s} \sum_{m=1}^{M} \lambda_m h_m(w,c,x_s)$   (35)

Here, $s$ denotes the sentence index, and $\hat{w}_s$ and $\hat{c}_s$ denote the maximizing word and concept sequence, respectively. $\mathcal{C}_s$ defines the repository of candidates, which for each utterance is the product of the N-best candidates from the ASR module times the M-best list of concept strings for each recognized sentence candidate. For ease of terminology, we will refer to pairs of translation and sentence hypotheses as candidate translations. Once produced by the two decoding steps, they are kept fixed throughout the algorithm.

B. Line Sweep Algorithm

Algorithm 1: Minimum Error Rate Training

Input: initial weights λ
Output: optimized weights λ̂

repeat
    Generate N-best repositories with current λ
    for all dimensions d do
        for all sentences s do
            Compute upper envelope and error statistics
        end for
        Merge error statistics
        Search for optimal γ_d and determine error reduction ΔE_d
        λ'_d ← λ_d + γ_d
    end for
    update λ along the dimension d̂ = argmax_d ΔE_d with the best error reduction
until convergence

The sum of weighted feature functions can be expressed as a scalar product

$\sum_{m=1}^{M} \lambda_m h_m(w,c,x) = \lambda^{\top} h(w,c,x)$   (36)

In each iteration, we try to change the weight vector $\lambda$ such that the total number of errors is minimized. This is an $M$-dimensional optimization problem. For simplicity, we loop over all dimensions and optimize each dimension separately. Let $\gamma$ denote the change of weight for a given dimension $d$. Then this change can be expressed as

$\lambda' = \lambda + \gamma\, e_d$   (37)

where $e_d$ is the $M$-dimensional unit vector whose $d$th component equals 1. For each utterance there is a repository of candidate translations. Each candidate translation defines a line with slope $h_d(w,c,x)$ and offset $\lambda^{\top} h(w,c,x)$, and is associated with its error w.r.t. the reference translation. For the one-dimensional optimization problem, the goal is now to adjust the parameter $\gamma$ such that the error is minimal. From a geometric point of view, the intersection points of the lines divide each line into line segments. The line segments form polyhedra partitioning the plane into regions of equal numbers of errors. However, due to the argmax decision rule, we only need to consider those line segments with maximal scores, that is, those which belong to the upper envelope. A line segment belongs to the upper envelope if a vertical upward ray starting at this line segment does not hit any other line. Since each line is associated with its error count, we can search for the segment along the upper envelope which minimizes the error. By projecting the points of intersection of the corresponding line segments together with their error counts onto the $\gamma$-axis, we obtain a histogram of the total error counts. The phase transitions mark the positions where the total error may change. The basic principle is depicted in Fig. 6.

The upper envelope can be efficiently computed using the line sweep algorithm, which was introduced in [17]. A vertical line is moved from the far left to the far right of the plane, and the exact order of the intersections with the segments is maintained as the line sweeps along. If we compute the error counts for all sentences and merge the projected points of intersection together with their error counts, we obtain a finite sequence of $\gamma$ values together with their associated error changes. A linear search in this sequence yields the optimal parameter $\hat{\gamma}$. The pseudo code for this procedure is given in Algorithm 1, which is guaranteed to converge to a local optimum.
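The envelope construction and the subsequent one-dimensional error search can be sketched as follows. This is a minimal illustration in which each candidate is reduced to its slope, offset, and error count; all names (`Line`, `upper_envelope`, `best_gamma`) are ours, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class Line:
    slope: float   # feature value h_d of the candidate (gradient along gamma)
    offset: float  # score of the candidate at gamma = 0
    errors: int    # error count of the candidate w.r.t. the reference

def upper_envelope(lines):
    """Sweep over the lines in slope order and keep only the segments that
    are maximal for some gamma; returns a list of (start_gamma, line)."""
    env = []
    for l in sorted(lines, key=lambda l: (l.slope, l.offset)):
        while env:
            start, top = env[-1]
            if top.slope == l.slope:
                if l.offset <= top.offset:
                    break           # dominated everywhere by the envelope top
                env.pop()
                continue
            x = (l.offset - top.offset) / (top.slope - l.slope)
            if x <= start:
                env.pop()           # top line is never maximal, discard it
            else:
                env.append((x, l))  # l takes over from gamma = x onwards
                break
        else:
            env.append((float('-inf'), l))
    return env

def best_gamma(envelopes):
    """Merge the transition points of all sentence envelopes and scan for
    the gamma with the lowest total error count."""
    events, err = [], 0
    for env in envelopes:
        err += env[0][1].errors     # errors to the left of all transitions
        for (x, line), (_, prev) in zip(env[1:], env):
            events.append((x, line.errors - prev.errors))
    events.sort()
    best_err, best_g = err, 0.0
    for x, delta in events:
        err += delta
        if err < best_err:
            best_err, best_g = err, x + 1e-9  # just right of the transition
    return best_g, best_err
```

A production implementation would pick the midpoint of the best interval rather than a point just past the transition; the sketch only illustrates the envelope and the linear scan over merged transition points.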

C. Feature Functions

We introduce a new set of sentence-based feature functions in order to take ASR and NLU features into account. The ASR N-best lists are annotated with concepts using the ME approach. We omit the one-to-one correspondence between words and initial-continue concepts and use the notation $(w, c)$, where $w$ denotes an ASR sentence hypothesis stored in an N-best list, and $c$ denotes a candidate translation from the


Fig. 6. Upper envelope and error counting. The bold red line segments define the upper envelope of the set of solid blue lines. The dashed black lines are parallel lines that have a lower score than the blue lines and can be discarded for the computation of the upper envelope. Each segment of the upper envelope defines an area of constant error. By projecting the upper envelope onto the γ-axis, we obtain a histogram of the total number of errors. The total number of errors can only change at phase transitions, that is, at the points of intersection of the line segments.

NLU module stored in a corresponding M-best list; that is, each recognition result stored in an N-best list produces a set of M-best translations. Thus, a user utterance is represented by a total of $N \cdot M$ sentence pair hypotheses. For ease of notation, we will not distinguish between $N$ and $M$, but rather use the term N-best list. Using these conventions, we can define the following sentence-based feature functions.

1) Language Model Features: The language model feature assigns each candidate sentence derived from an N-best list its language model probability. Here, we use trigram language models and define feature functions for both class-based and non-class-based language models. Thus, we obtain the following two feature functions:

$h_{\mathrm{LM}}(x, w_1^N, c) = \log \prod_{n=1}^{N} p(w_n \mid w_{n-2}, w_{n-1})$   (38)

and

$h_{\mathrm{classLM}}(x, w_1^N, c) = \log \prod_{n=1}^{N} p(w_n \mid g_n)\, p(g_n \mid g_{n-2}, g_{n-1})$   (39)

where $g_n \in G$ and $G$ denotes the set of word classes for the class-based language model.

2) Acoustic Model Features: Similar to the language model features, we use the acoustic score of a sentence hypothesis as an additional feature. This yields the following feature function:

$h_{\mathrm{AM}}(x, w_1^N, c) = \log \prod_{n=1}^{N} p(x_n \mid w_n)$   (40)

where $x_n$ denotes the sequence of acoustic vectors assigned to word hypothesis $w_n$.

3) ASR Sentence Length: The length of an ASR sentence hypothesis is defined as the number of evaluation words contained in this sentence, disregarding silence and noise words. The weight for the sentence length is usually referred to as word penalty. Formally, the feature function is defined as follows:

$h_{\mathrm{len}}(x, w_1^N, c) = N$   (41)

4) ASR Posterior Probabilities: Word posterior probabilities were introduced in [18]. By defining the posterior probability of a sentence hypothesis as the product of its word posterior probabilities, we arrive at

$h_{\mathrm{post}}(x, w_1^N, c) = \log \prod_{n=1}^{N} p(w_n \mid x)$   (42)

5) ASR Confidence Measures: Word posterior probabilities are also the basis for word-based confidence measures. For a sentence-based feature function, we define the geometric mean of the word-based confidence values contained in a sentence hypothesis as the sentence confidence value. This yields the following feature function:

$h_{\mathrm{conf}}(x, w_1^N, c) = \Big(\prod_{n=1}^{N} \mathrm{conf}(w_n)\Big)^{1/N}$   (43)

6) NLU Model Features: The NLU model feature is the factorized posterior probability obtained from the ME model defined in (20) for a given input sentence $w_1^N$. This is the first feature function that takes NLU-related knowledge sources into account. It is defined as follows:

$h_{\mathrm{ME}}(x, w_1^N, c) = \log p(c \mid w_1^N)$   (44)

7) NLU Sentence Length: Similar to the ASR sentence length, we use the NLU sentence length as an additional feature in order to penalize sentence hypotheses that are too short or too long:

$h_{\mathrm{clen}}(x, w, c_1^K) = K$   (45)

8) Concept Language Model: A trigram language model trained on concepts is used as an additional feature function. The concept language model uses absolute discounting with backing off for smoothing. The feature function is defined as follows:

$h_{\mathrm{CLM}}(x, w, c_1^K) = \log \prod_{k=1}^{K} p(c_k \mid c_{k-2}, c_{k-1})$   (46)
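The log-linear combination of these sentence-based features and the argmax decision rule can be sketched as follows; the dictionary representation and the feature names (`lm`, `wp`) are illustrative assumptions, not the paper's notation:

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of sentence-based feature values; `features` and
    `weights` are dicts keyed by (illustrative) feature names."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rescore(candidates, weights):
    """Index of the (word, concept) candidate pair with the maximal
    log-linear score, mirroring the argmax decision rule."""
    return max(range(len(candidates)),
               key=lambda i: loglinear_score(candidates[i], weights))

def sentence_confidence(word_confidences):
    """Sentence confidence feature: the geometric mean of the word-based
    confidence values, computed in the log domain for stability."""
    return math.exp(sum(math.log(c) for c in word_confidences) / len(word_confidences))
```

For example, with `weights = {"lm": 1.0, "wp": 0.1}` the candidate with the higher weighted sum wins; in minimum error rate training, the weight dict is exactly what the line sweep procedure adjusts.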

VI. RELATED WORK

The first approach to NLU using methods derived from machine translation is described in [19]. The approach uses a variant of model 1 and lacks local context dependencies. In [10], this approach is extended by employing the ME framework for reranking the hypotheses of the system output.

Using generative models for NLU has been widely investigated. In [3], the NLU component is based on an HMM-like process where the hidden states correspond to flat concepts. In [20], tree-structured concepts are employed. The hidden vector state model introduced in [21] associates each state of a push-down automaton with the state of an HMM and is complex enough to capture hierarchical structures. More recently, discriminative methods have been applied. In [22], support vector machines are used for a call classification task, which is extended in [23] using AdaBoost.

Another active research field is how to obtain a tighter coupling between ASR and NLU. In [24], finite-state transducers are employed for combining ASR with translation models. Because NLU can be viewed as a translation task, this approach is directly applicable to SLU. In [25], the language model is replaced by a word-to-concept translation model which is directly integrated into an ASR framework. Other researchers have enriched the NLU component with ASR knowledge sources. In [26], ASR-based confidence measures are incorporated into the understanding process. In [27], word-level confusion networks obtained from ASR word graphs are exploited and combined with confidence measures in order to reduce the classification error rate of a dialogue system. We relate the contributions of this paper to this prior work after presenting the results.

VII. RESULTS

All experiments were conducted on two corpora, the TELDIR and the TABA corpus. The TELDIR corpus is a German in-house telephone directory assistance corpus. The objective is to answer naturally spoken requests for telephone numbers, fax numbers, and email addresses of persons, companies, and organizations. The corpus also covers utterances for spelling units. The data were recorded over several months from fixed-line phones as well as wireless and mobile phones under changing conditions. The conditions cover clean speech, office noise, and traffic noise. The corpus allocation is summarized in Table I. In contrast to the ASR vocabulary, the NLU source vocabulary is open.

The TABA corpus is a German in-house corpus covering the domain of a train timetable scheduling task. The speech data were collected from real users interacting with a fully functional automatic spoken dialogue system developed by Philips Speech Processing [28]. The TABA corpus was kindly provided by Philips Speech Processing and ScanSoft. The corpus allocation is summarized in Table II. As with the TELDIR corpus, each utterance of the TABA corpus is annotated with a flat conceptual meaning representation. Again, the ASR vocabulary is closed, whereas the NLU vocabulary is open.

The ASR component for both tasks is based on maximum-likelihood-trained Gaussian mixture models with a pooled variance. We use 6-state HMMs in Bakis topology for modeling triphones, where two contiguous states are identical. An acoustic feature vector consists of 12 MFCCs with first and second derivatives, resulting in a 25-dimensional feature vector. An LDA is applied which maps three consecutive acoustic feature vectors onto a final 33-dimensional feature vector. The ASR decoder employs an integrated trigram Viterbi beam search. The system is optimized for real-time recognition in a spoken dialogue system. For acoustic adaptation, we use an incremental F-MLLR which estimates the transformation matrix on previous utterances from a user within a single dialogue transaction. The system generates word graphs and N-best lists under real-time constraints. Table III lists recognition and confidence measure results for both tasks. The ASR confidence measures are computed according to [18]. The confidence error rate (CER) is the ratio of the sum of word insertion and substitution errors not detected by the confidence measure, normalized by the total number of recognized words. The CER baseline is the error rate obtained if no ASR confidence measure is used. The graph error rate (GER) is the word error rate of the oracle-best sentences found in the word graphs.

TABLE I: Corpus allocation for the TELDIR corpus. TELDIR is a German in-house corpus covering the domain of a telephone directory assistance task. Within this task, users can ask for telephone numbers, fax numbers, and email addresses of persons, companies, and organizations.

TABLE II: Corpus allocation for the TABA corpus. TABA is a German in-house corpus covering the domain of a train timetable information system for more than 1000 German terminals. The corpus was kindly provided by Philips and ScanSoft.

TABLE III: Recognition and confidence measure results in terms of word error rate (WER), graph error rate (GER), and confidence error rate (CER) for the TELDIR and TABA corpus without and with acoustic adaptation.

A. Evaluation Metrics

To evaluate the quality of the different approaches, we use the following error criteria.

• Slot Error Rate (a.k.a. Concept Error Rate): The slot error rate (Slot-ER) is defined similarly to the well-known word error rate and is the ratio of the sum of deleted, inserted, and substituted concepts (slots), normalized by the total number of reference concepts.

• Attribute Error Rate: The attribute error rate (AER) is defined as the number of falsely assigned or missed attributes, normalized by the total number of attributes. Thus, the AER measures to what extent an approach is able to extract the correct values from the input sentences.

• Word-based Slot Error Rate: The word-based slot error rate (word-based Slot-ER) is defined as the number of words that have been aligned to wrong concepts, normalized by the number of all words. Thus, the word-based Slot-ER quantifies the number of wrongly aligned words, which the AER cannot capture. Note that this measure requires reference alignments.
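Since the Slot-ER is the usual edit-distance criterion applied to concept sequences, it can be sketched as follows (a minimal illustration; the concept labels in the test values are invented):

```python
def levenshtein(hyp, ref):
    """Minimal number of substitutions, insertions, and deletions that
    turn `hyp` into `ref`, computed with a single rolling DP row."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (h != r))   # substitution or match
    return d[len(ref)]

def slot_error_rate(hyp_concepts, ref_concepts):
    """Slot-ER: concept-level edit errors normalized by the number of
    reference concepts."""
    return levenshtein(hyp_concepts, ref_concepts) / len(ref_concepts)
```

The same routine applied to word sequences yields the WER used throughout the experiments.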

B. Preprocessing

Each source sentence is preprocessed before the sequence of concepts is determined. For both NLU approaches, the preprocessing consists of the following steps.

1) Removal of no-word entities: This step mainly affects transcriptions derived from the ASR module. All no-word entities like silence, hesitations, and other noise occurrences are removed.

2) Normalization of pronunciation variants: Similar to the first step, this step only affects transcriptions obtained from the ASR module. Pronunciation variants are mapped onto the first pronunciation variant, which is determined by the order in which the words occur in the pronunciation lexicon.

3) Categorization: To reduce the number of unseen events, proper names, numbers, and date expressions are mapped onto categories. Categories for proper names can easily be obtained from the entries of the database a dialogue system is connected with.
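The three steps above can be sketched as a single pass over the word sequence; the noise symbols, the variant notation, and the category labels below are all illustrative, not taken from the paper:

```python
NOISE = {"<sil>", "<noise>", "<hes>"}  # illustrative no-word entities

def preprocess(words, variant_map, category_lexicon):
    """Drop no-word entities, normalize pronunciation variants, and map
    database-covered words onto categories, in that order."""
    out = []
    for w in words:
        if w in NOISE:                      # step 1: remove no-word entities
            continue
        w = variant_map.get(w, w)           # step 2: first pronunciation variant
        out.append(category_lexicon.get(w, w))  # step 3: categorization
    return out
```

In a real system, `variant_map` would be derived from the pronunciation lexicon and `category_lexicon` from the application database.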

C. Postprocessing

No postprocessing steps were applied for the ME approach. However, an additional filler-sequence penalty was introduced during decoding that avoids chains of filler concepts. This filler-sequence penalty is similar to the word penalty in ASR. For the source-channel approach, special postprocessing steps were applied for unknown words. Unknown words are words which do not have an entry in the phrase table and are translated by their identity. The steps are applied in the following order.

1) Unknown words that occur at the beginning of a translated sentence are always mapped onto the filler concept.

2) Unknown words that occur after the first position in the target sentence are assigned to their preceding concept.

3) Sequences of equal concepts are reduced to a single concept.
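The three rules can be sketched as follows, under the assumption that unknown words are flagged during translation; the `@filler` label and the argument layout are illustrative:

```python
def postprocess(concepts, unknown_flags, filler="@filler"):
    """Apply the three unknown-word rules to a word-aligned concept
    sequence; `unknown_flags[i]` marks words translated by identity."""
    out = []
    for i, (c, unk) in enumerate(zip(concepts, unknown_flags)):
        if unk:
            c = filler if i == 0 else out[-1]  # rule 1 / rule 2
        out.append(c)
    # rule 3: reduce sequences of equal concepts to a single concept
    return [c for i, c in enumerate(out) if i == 0 or c != out[i - 1]]
```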

D. Comparison of Alignment Templates and Maximum Entropy Approach Using Text Input

We first determine the alignments between source and target sentences for the TELDIR training corpus using a fixed sequence of alignment models for the source-channel approach (i.e., we apply three iterations of model 1, continue with three iterations of the HMM model, etc.) and likewise for the ME approach. For the TABA corpus, a single sequence of model iterations is used for both the source-channel and the ME approach. The optimal training sequences were determined with respect to the error rates obtained on the development sets.

1) Feature Selection, Count Cutoffs, and Smoothing: Feature selection for the feature functions described in the previous section is done on the development corpus of each task. The selection is always done for complete feature types rather than for individual features. For the lexical features, the count cutoff is set to 1, that is, we keep all triples observed during training. For prefix and suffix features, we additionally require that a word consist of at least ten characters to generate a feature. The smoothing parameter is set to 3 for all corpora.

2) Preprocessing Using Categorizations: A problem that often occurs in NLU tasks is that many semantically relevant words are not covered by the training corpus. For example, it is unlikely that all proper names are observed during training. To improve the results, we use categorizations for both approaches and map words like proper names and numbers onto categories. Such categories can easily be obtained from the entries of the database a dialogue system is connected with. Table IV shows an excerpt of the categories used in the TELDIR and TABA corpora. Because the database does not contain all named entities, a large proportion of the proper names occurring in the test data remains unobserved.

TABLE IV: Excerpt of word categories. In a preprocessing step, source words covered by the application database are mapped onto corresponding word categories.

TABLE V: Effect of different alignment models on the slot error rate for the TELDIR corpus. Error rates are presented with and without categorizations for the alignment templates (AT) approach and the maximum entropy (ME) framework.

3) Effect of Alignment Models: Tables V and VII summarize results for both the alignment templates (AT) approach and the ME framework using different alignment models. Both tables present results with and without categorizations. Unlike translations between natural language pairs, word reorderings do not occur in the NLU tasks described in this paper. Therefore, we disallow word reorderings for the AT approach by penalizing possible reorderings with additional costs set to infinity. Because the ME approach basically works like a tagger, the monotonicity of the word-concept alignment is guaranteed for this method.

The ME results presented in Tables V and VII are based on the six feature types listed in Tables VI and VIII. Starting with only lexical features, we successively extend the model by including additional feature functions. The results show that the ME models clearly outperform the AT approach. The quality of the AT approach is achieved within the ME framework by just including lexical and transition features, and is significantly improved by adding further features.

Another interesting effect when looking at the results obtained without categorizations is that the ME framework suffers less from unknown words than the AT approach. The problem can be mitigated but not eliminated if more training data is available, as is the case for the TABA corpus. The reason is that the ME model can use variable-length contexts flexibly. Whereas the template approach either has an appropriate template or not, and each word can occur in at most one template, the ME model can reuse contexts flexibly for various adjacent words.

TABLE VI: Dependence of error rates on the number of included feature types for the TELDIR corpus using ME. The source sentences are preprocessed using categorizations. The features were trained on the model-5-aligned training corpus.

TABLE VII: Effect of different alignment models on the slot error rate for the TABA corpus. Error rates are presented with and without categorizations for the alignment templates (AT) approach and the maximum entropy (ME) framework.

Comparing the performance on the TELDIR and the TABA task, we see that the error rates are much lower for the TABA task than for the TELDIR task; the reason is the very limited training data of the TELDIR task. However, we still achieve good results using the ME framework, even though the training corpus is very small.

4) Effect of Context Lengths: Both the AT approach and the feature functions defined for the ME framework make use of local word contexts, that is, predecessor and successor words are explicitly taken into account. For the AT approach, we have chosen a maximum template length of 7 for both corpora. This value was selected empirically on the TABA corpus, as shown in Fig. 7. As can be seen from the figure, a larger value does not improve the Slot-ER. The feature functions for the ME model use a shorter context of five consecutive words. Augmenting the context length slightly deteriorates the performance and results in an overfitted model.

Fig. 7. Effect of the maximal allowed context length on the slot error rate for both the AT approach and the ME-based framework, where a symmetric context window is used. The plot shows error rates for the TABA development corpus using categorizations.

TABLE VIII: Dependence of error rates on the number of included feature types for the TABA corpus using ME. The source sentences are preprocessed using categorizations. The features were trained on the model-5-aligned training corpus.

5) Effect of Amount of Training Data on Error Rate: The amount of training data used for the TELDIR and the TABA corpus differs roughly by a factor of 40. Since the source sentences need to be annotated manually with sequences of formal concepts in order to apply supervised learning techniques, the question is what degradation in error rate we have to expect if we reduce the amount of training data used. This is important if an NLU component is to be trained for a new domain where only little data is available. Because additional training data should be collected from dialogue transactions of real users interacting with a fully functional, running dialogue system, a curve plotting the NLU performance against the amount of training data gives some hints on how much training data should be used for setting up a new system. Fig. 8 shows the performance of the NLU component as a function of the amount of training data used. Both plots use the full set of feature functions described in Section IV-A. In order to obtain a reasonable Slot-ER, it is sufficient to have around 1000 training tokens for the TELDIR task. The TABA task is more complex due to its time and date expressions. Here, a Slot-ER of around 6% can be achieved with 10,000 training samples. Although both tasks differ in their complexity, both curves have almost the same slope.

Fig. 8. Effect of the amount of training data on the slot error rate for the TELDIR and TABA evaluation corpora using the ME-based approach.

Fig. 9. Attribute error rate as a function of the slot error rate for the TELDIR and TABA evaluation corpora using the ME-based approach.

6) Effect of Slot Error Rate on Attribute Error Rate: Each concept is associated with a (possibly empty) set of attributes. Because a wrongly assigned concept also affects the AER, we want to analyze this effect more thoroughly. Fig. 9 plots the AER as a function of the Slot-ER using the ME approach. Increasing Slot-ERs were obtained by successively reducing the amount of training data (see Fig. 8). For both tasks, the AER is proportional to the Slot-ER. This effect partly results from the categorizations used, which yield good generalization even if only a few training samples are available.

E. Comparison of Alignment Templates and Maximum Entropy Approach Using Speech Input

If the input modality is speech, inputs to the NLU component can contain speech recognition errors. Furthermore, spontaneous speech effects like disfluencies, hesitations, and stressed speech can deteriorate the performance of the ASR module and, thus, the quality of the NLU component. Therefore, we want to investigate how both approaches perform if the input modality is speech. For this purpose, both models were trained on sentence pairs using reference transcriptions. Table IX summarizes Slot-ERs and AERs for both tasks. For comparison with Section VII-D, the first row contains error rates for text-based results.

TABLE IX: Speech understanding results for the TELDIR and TABA corpus.

If we compare the source-channel approach with the direct model, we see that the ME approach clearly outperforms the AT approach, which is in accordance with the results obtained for text input. However, compared to the TABA corpus, the AER for the TELDIR corpus deteriorates much more on erroneous input data. Apart from the smaller amount of training data, this effect results from erroneous recognitions of spelling sequences. The TELDIR domain allows users to spell names, which is modeled by a spelling concept. The attribute of a spelling concept is the sequence of letters uttered by a user. If this sequence contains only one wrong letter, the concept most likely remains a spelling concept, but the attribute value is considered wrong, independent of the total number of letters that were recognized correctly. To mitigate this effect, we can use the proper names contained in the application database as an additional knowledge source. The idea is to extract all proper names from the database and decompose them into character sequences. We can then match a recognized spelling sequence against the character sequences obtained from the proper names contained in the database. If a sufficient number of characters matches a decomposed database entry, we assume that the remaining unmatched characters resulted from an erroneous recognition and replace the sequence by the best matching database entry. The number of sufficiently matching characters must be chosen such that, on the one hand, we benefit from using the database as an additional knowledge source while, on the other hand, we allow for accepting new spelling sequences that are not contained in the database. The threshold for the mutual error rate between a hypothesized spelling sequence and a database entry was optimized on the development set. Results are listed in Table X.

TABLE X: Speech understanding results for the TELDIR corpus using automatic spelling correction.
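The matching scheme can be sketched as follows. We use difflib's similarity ratio as a stand-in for the paper's mutual error rate, and the 0.4 threshold is purely illustrative (the paper optimizes its threshold on the development set):

```python
import difflib

def correct_spelling(letters, database_names, max_error_rate=0.4):
    """Replace a recognized spelling sequence by the closest database name
    if it matches closely enough; otherwise keep it as a new, unseen
    spelling sequence."""
    hyp = "".join(letters).lower()
    best_name, best_rate = None, 1.0
    for name in database_names:
        # dissimilarity of the letter sequence w.r.t. the decomposed entry
        rate = 1.0 - difflib.SequenceMatcher(None, hyp, name.lower()).ratio()
        if rate < best_rate:
            best_name, best_rate = name, rate
    if best_name is not None and best_rate <= max_error_rate:
        return list(best_name.lower())
    return letters
```

An edit-distance-based error rate could be substituted for the similarity ratio without changing the structure of the procedure.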

F. Combining Speech Recognition and Language Understanding

To combine ASR with NLU, we proceed as follows: we first generate ASR word graphs for all utterances and compute confidence values for each hypothesis contained in a word graph. We then extract N-best lists from the word graphs and process each N-best entry with the ME approach, thus enriching the N-best lists with the additional features described in Section V-C. The enriched N-best lists are then inserted into a repository on which the minimum error rate training algorithm is applied. A problem with this procedure is that the ASR decoder cannot produce better results in subsequent iterations because the new features are not integrated into the ASR decoding process, but are added after the N-best lists have been extracted. Since the decoder cannot use the new knowledge sources directly, it cannot produce new entries for the repository. Therefore, we have to make sure that word graphs and N-best lists contain a sufficiently large number of hypotheses after the first iteration. According to Table XI, a 10x10-best list provides enough sentence alternatives from which we can select promising candidates in order to lower the error rate. Furthermore, selecting a candidate out of a small N-best list might be more realistic than striving to find the oracle-best candidate in a huge N-best list. The motivation behind this is that the additional features should rather aim at reranking candidates whose scores are very close to each other than at trying to move a very low-ranked candidate to the first position. The final repository contains a 10x10-best list for each utterance, where each N-best entry consists of three streams together with the associated error counts: the word stream obtained from ASR decoding, the concept stream obtained from NLU decoding, and the attribute stream obtained from the attribute extraction. The N-best lists do not contain duplicates. The hypothesized word-concept segmentation is kept for the concept stream in order to simplify the attribute extraction.

1) Minimizing the Word Error Rate: We first investigate whether NLU features can help to reduce ASR errors. Table XII presents results for combining ASR with NLU. The feature weights were optimized on the development set according to the algorithm described in Section V-B. The objective function used is minimizing the WER. The first line in each table is the baseline WER obtained under real-time recognition constraints, which is equal to the ASR performance reported in Table III. To measure the effect of newly added features, we first optimize the weights for the ASR baseline features on the development N-best repository and apply those weights on the evaluation N-best repository. The ASR baseline features comprise the acoustic model, the language model, and the word penalty. For the TELDIR task, this gives a small improvement of 0.2% on the development set, which does not generalize to the evaluation set. For the TABA task, the baseline optimization reduces the WER much more on both the development and the evaluation corpus. Due to real-time constraints, we cannot use the optimized feature weights directly during speech decoding, since this would require larger beam settings. Therefore, applying the optimized baseline feature weights on the N-best repository is similar to a second-pass decoding, where we rescore the candidates from the first pass with optimized weights.

TABLE XI: Oracle error rates for the spoken language understanding component. Concept graphs are produced by applying the ME approach to the first-best results from the ASR component. The 100-best lists are directly extracted from the concept graphs, whereas the 100-best* lists are obtained by merging 10-best concept lists derived from 10 concept decodings. Each decoding uses one candidate from a 10-best ASR list as input.

TABLE XII: Combination of speech recognition and natural language understanding for the TELDIR and TABA task. The objective function used is minimizing the word error rate.

The first added feature is the NLU concept score. With this feature, the WER is reduced by 0.6% on the TELDIR development set and by 0.5% on the evaluation set. The improvements on the TABA task are smaller but consistent. A special feature that we can use for the TELDIR corpus is the spelling correction feature. We already showed in Section VII-E that this feature reduces the AER by comparing the actual attribute values with the values stored in the application database. We can use the same knowledge source here and use the Levenshtein distance of a hypothesized spelling sequence w.r.t. the closest database entry as an additional feature. Because only a fraction of the spelled proper names is covered by the database, we leave it to the minimum error rate training framework to find a tradeoff between covered and uncovered proper names.

TABLE XIII: Combination of speech recognition and natural language understanding for the TELDIR and TABA task. The objective function used is minimizing the slot error rate.

2) Minimizing the Slot Error Rate: Now, we investigate if we can reduce the Slot-ER by adding ASR-related features to the NLU features. Table XIII presents results for combining ASR with NLU when minimizing the Slot-ER. The baseline Slot-ER is equal to the Slot-ER reported in Table XI. Note that the concept score is a single feature that employs all the feature functions described in Section IV-A. Therefore, no baseline optimization on the N-best repository is necessary. However, the first-best result determined on the repository might actually be worse compared to the baseline result, given that we use only the concept score as a single feature. The reason is that the baseline result is computed from the first-best ASR sentence hypothesis, for which the ASR features like the acoustic model weight, the language model weight, etc., are already fixed. If we take only the concept score into account and rescore the repository, we decouple the concept feature function from the ASR knowledge sources and allow choosing much shorter sentence hypotheses which might have more deletion errors but produce lower concept scores. As soon as we add more ASR feature functions, we observe a reduction in terms of Slot-ER. Optimizing the weights for all standard ASR feature functions, that is, adding acoustic scores, language model scores, and word penalties, significantly reduces the Slot-ER.

3) Minimizing Word and Slot Error Rate Simultaneously: By combining the error counts for ASR and NLU and by using all features defined previously, we can minimize both the WER and the Slot-ER simultaneously. The unweighted error counts are combined by summation. Results are presented in Table XIV.
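The combined objective can be sketched as follows, assuming token-level Levenshtein counts for both the word sequence and the slot sequence; the example sentences and slot labels are invented.

```python
def edit_errors(hyp: list[str], ref: list[str]) -> int:
    """Token-level Levenshtein error count
    (substitutions + insertions + deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1]

def combined_errors(hyp_words, ref_words, hyp_slots, ref_slots):
    """Unweighted sum of word-level and slot-level error counts: the
    quantity a simultaneous WER/Slot-ER optimization would minimize."""
    return edit_errors(hyp_words, ref_words) + edit_errors(hyp_slots, ref_slots)

# Invented example: one word deletion ("uhr"), no slot errors.
hyp_w = "zug nach aachen um acht".split()
ref_w = "zug nach aachen um acht uhr".split()
hyp_s = ["@destination", "@time"]
ref_s = ["@destination", "@time"]
print(combined_errors(hyp_w, ref_w, hyp_s, ref_s))  # prints 1
```

Summing the raw counts rather than the normalized rates keeps the objective additive over sentences, which is what the line-search in minimum error rate training requires.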


TABLE XIV. COMBINATION OF SPEECH RECOGNITION AND NATURAL LANGUAGE UNDERSTANDING. THE OBJECTIVE FUNCTION USED IS MINIMIZING THE SLOT AND THE WORD ERROR RATE SIMULTANEOUSLY

VIII. SUMMARY AND CONTRIBUTIONS

In this paper, we proposed two approaches to SLU based on statistical machine translation: a source channel-based approach and an ME-based approach. We evaluated both approaches on two NLU tasks which were derived from different domains and showed that the ME approach clearly outperforms the source channel-based method within these settings. In contrast to [19], where almost no context was taken into account, we showed that local contexts are crucial to this task. The ME approach described in [10] was evaluated only on pure text data and required a separate decoder that supplies a list of candidate translations on which the ME model was then trained. Here, we directly trained the ME model on bilingual training data. We analyzed the quality of different alignment models and feature functions and investigated several effects, including the effect of categorizations, different context lengths, and the relation between Slot-ERs and AERs. We also analyzed the performance of both approaches when speech was used as the input modality. Finally, we showed how ASR- and NLU-based knowledge sources can be log-linearly combined such that a combination of error rates is minimized. This provided a tighter coupling between ASR and NLU. We employed the minimum error rate training framework to determine an optimal set of feature weights and showed that the WER could be reduced using NLU-based feature functions and that, correspondingly, the Slot-ER was reduced using ASR-based feature functions. In contrast to [24], we used a different optimization technique which computes the error surface for each feature function used. Furthermore, we investigated a larger set of feature functions and took sentence-based features into account.

ACKNOWLEDGMENT

The authors would like to thank Philips Speech Processing and Scansoft for kindly providing the TABA corpus.

REFERENCES

[1] M. Oerder and H. Ney, "Word graphs: An efficient interface between continuous-speech recognition and language understanding," in Proc. ICASSP, Minneapolis, MN, Apr. 1993, vol. 2, pp. 119–122.
[2] Y.-Y. Wang and A. Acero, "SGStudio: Rapid semantic grammar development for spoken language understanding," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, vol. 2, pp. 1869–1872.
[3] E. Levin and R. Pieraccini, "Concept-based spontaneous speech understanding system," in Proc. Eurospeech, Madrid, Spain, Sep. 1995, vol. 2, pp. 555–558.
[4] S. Miller, R. Bobrow, R. Schwartz, and R. Ingria, "Statistical language processing using hidden understanding models," in Proc. Human Lang. Technol. Workshop Sponsored by ARPA, Plainsboro, NJ, Mar. 1994, pp. 278–282.
[5] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer, "The mathematics of machine translation: Parameter estimation," Comput. Linguist., vol. 19, no. 2, pp. 263–311, 1993.
[6] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," in Proc. COLING, Copenhagen, Denmark, Aug. 1996, vol. 2, pp. 836–841.
[7] F. J. Och, C. Tillmann, and H. Ney, "Improved alignment models for statistical machine translation," in Proc. EMNLP/VLC, College Park, MD, Jun. 1999, pp. 20–28.
[8] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Comput. Linguist., vol. 29, no. 1, pp. 19–51, 2003.
[9] A. Berger, "Improved iterative scaling: A gentle introduction," 1997 [Online]. Available: http://www.cs.cmu.edu/afs/cs/user/aberger/www/ps/scaling.ps
[10] K. A. Papineni, S. Roukos, and R. T. Ward, "Feature-based language understanding," in Proc. Eurospeech, Rhodes, Greece, Sep. 1997, pp. 1435–1438.
[11] E. T. K. Sang and S. Buchholz, "Introduction to the CoNLL-2000 shared task: Chunking," in Proc. CoNLL-LLL, Lisbon, Portugal, Sep. 2000, pp. 127–132.
[12] O. Bender, K. Macherey, F.-J. Och, and H. Ney, "Comparison of alignment templates and maximum entropy models for natural language understanding," in Proc. EACL, Budapest, Hungary, Apr. 2003, pp. 11–18.
[13] A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proc. ICML, Stanford, CA, Jun. 2000, pp. 591–598.
[14] S. F. Chen and R. Rosenfeld, "A Gaussian prior for smoothing maximum entropy models," School of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-99-108, Feb. 1999.
[15] P. Beyerlein, "Diskriminative Modellkombination in Spracherkennung mit großem Wortschatz [Discriminative model combination in large-vocabulary speech recognition]," Ph.D. dissertation, Comput. Sci. Dept., RWTH Aachen Univ., Aachen, Germany, 2000.
[16] F. J. Och, "Minimum error rate training in statistical machine translation," in Proc. ACL, Sapporo, Japan, Jul. 2003, pp. 160–167.
[17] J. L. Bentley and T. A. Ottmann, "Algorithms for reporting and counting geometric intersections," IEEE Trans. Comput., vol. C-28, no. 9, pp. 643–647, Sep. 1979.
[18] F. Wessel, K. Macherey, and R. Schlüter, "Using word probabilities as confidence measures," in Proc. ICASSP, Seattle, WA, May 1998, vol. 1, pp. 225–228.
[19] M. Epstein, K. Papineni, S. Roukos, T. Ward, and S. Della Pietra, "Statistical natural language understanding using hidden clumpings," in Proc. ICASSP, Atlanta, GA, May 1996, vol. 1, pp. 176–179.
[20] S. Miller, M. Bates, R. Ingria, J. Makhoul, and R. Schwartz, "Recent progress in hidden understanding models," in Proc. ARPA Spoken Lang. Technol. Workshop, Austin, TX, Jan. 1995, pp. 276–280.
[21] Y. He and S. Young, "A data-driven SLU system," in Proc. ASRU, St. Thomas, U.S. Virgin Islands, Dec. 2003, pp. 583–588.
[22] P. Haffner, G. Tur, and J. Wright, "Optimizing SVMs for complex call classification," in Proc. ICASSP, Hong Kong, China, Apr. 2003, vol. 1, pp. 632–635.
[23] N. Gupta, G. Tur, and D. Hakkani-Tür, "The AT&T SLU system," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 213–222, Jan. 2006.
[24] E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 3177–3180.
[25] K. Sudoh and H. Tsukada, "Tightly integrated spoken language understanding using word-to-concept translation," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, vol. 1, pp. 429–432.
[26] T. J. Hazen, T. Burianek, J. Polifroni, and S. Seneff, "Integrating recognition confidence scoring with language understanding and dialogue modeling," in Proc. ICSLP, Beijing, China, Oct. 2000, vol. 2, pp. 1042–1045.
[27] G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür, "Improving spoken language understanding using word confusion networks," in Proc. ICSLP, Denver, CO, Sep. 2002, pp. 1137–1140.


[28] H. Aust, M. Oerder, F. Seide, and V. Steinbiss, "The Philips automatic train timetable information system," Speech Commun., vol. 17, pp. 249–262, Nov. 1995.

Klaus Macherey received the Diploma degree in computer science from RWTH Aachen University, Aachen, Germany, in 1999.

From May 1999 until March 2006, he was a Research Assistant in the Computer Science Department 6, RWTH Aachen. Since the end of March 2006, he has been a Research Scientist at Google, Inc., Mountain View, CA. His primary research interests cover machine translation, speech recognition, natural language understanding, dialogue systems, and reinforcement learning.

Oliver Bender received the Diploma degree in computer science from RWTH Aachen University, Aachen, Germany, in 2002.

Since then, he has been a Research Assistant with the Computer Science Department 6, RWTH. His primary research interests cover statistical machine translation, natural language understanding, spoken dialogue systems, and information extraction. In November 2008, he joined Nuance Communications Aachen GmbH.

Hermann Ney (M'86–SM'07) received the Diploma degree in physics from Goettingen University, Goettingen, Germany, in 1977 and the Dr.-Ing. degree in electrical engineering from Braunschweig University of Technology, Braunschweig, Germany, in 1982.

He is a Full Professor of computer science at RWTH Aachen University, Aachen, Germany. In 1977, he joined Philips Research in Germany, and since 1985 he has headed the Speech Recognition Group that pioneered the first prototype systems for large vocabulary continuous speech recognition and spoken dialogue systems. From October 1988 to October 1989, he was a Visiting Scientist at Bell Laboratories, Murray Hill, NJ. In July 1993, he joined the Computer Science Department, RWTH Aachen University. His responsibilities include planning, directing, and carrying out research for national, European, and industrial sponsors and supervising Ph.D. students. His main research interests lie in the area of statistical methods for pattern recognition and human language technology and their specific applications to speech recognition, machine translation, and image object recognition. In particular, he has worked on dynamic programming and discriminative training for speech recognition, on language modeling, and on phrase-based approaches to machine translation. His work has resulted in more than 400 conference and journal papers.

Prof. Ney was the recipient of the Technical Achievement Award of the IEEE Signal Processing Society in 2005.