Building Ubiquitous and Robust Speech and Natural Language Interfaces I
Gary Geunbae Lee, Ph.D., Professor, Dept. of CSE, POSTECH
IUI 2007 tutorial
Contents
• PART-I: Statistical Speech/Language Processing (60 min)
– Natural Language Processing – short intro
– Automatic Speech Recognition
– (Spoken) Language Understanding
• PART-II: Technology of Spoken Dialog Systems (80 min)
– Spoken Dialog Systems
– Dialog Management
– Dialog Studio
– Information Access Dialog
– Emotional & Context-sensitive Chatbot
– Multi-modal Dialog
– Conversational Text-to-Speech
• PART-III: Statistical Machine Translation (40 min)
– Statistical Machine Translation
– Phrase-based SMT
– Speech Translation
Ubiquitous computing
• Ubiquitous computing: network + sensor + computing
• Also known as: pervasive computing, third-paradigm computing, calm technology, invisible computing
• "I, Robot"-style interface – human language + hologram
Ubiquitous computer interface?
• Computer – robot, home appliances, audio, telephone, fax machine, toaster, coffee machine, etc. (every object)
• Universal speech interface project (CMU)
• VoiceBox commercial systems
• Telematics Dialog Interface (POSTECH, LG, DiQuest)
Example Domains
• Tele-service
• Car navigation
• Home networking
• Robot interface
What's hard – ambiguities, ambiguities, all different levels of ambiguities
• John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner's lecture notes]
– donut: to get a donut (doughnut; spare tire) for his car?
– donut store: a store where donuts shop? or that is run by donuts? or that looks like a big donut? or that is made of donuts?
– from work: well, actually, he stopped there from hunger and exhaustion, not just from work
– every few hours: that's how often he thought it? or that's how often the coffee was good?
– it: the particular coffee that was good every few hours? the donut store? the situation?
– too expensive: too expensive for what? what are we supposed to conclude about what John did?
Structural vs. Statistical: Technology innovation through dialectic
• Statistical analysis
– data-driven, empirical
– connectionist
– speech community
• Structural analysis
– rule-driven
– rational, symbolic
– NLU, Chomskyan, Schankian, AI community
Structural NLP
• Grammar rules + lexicons
– Grammatical category (POS, syntactic category)
– Unification features (connectivity, agreement, semantics, ...)
• Chart parsing
• Compositional semantics
• Limitation: enormous ambiguity
– "List the sales of the products produced in 1973 with the products produced in 1972" ==> 455 parses (Martin et al. 1981)
Statistical NLP
• Grammar's role? – estimating which word sequences are legal
– Pr(w1, w2, ..., wn) = Pr(w1) Pr(w2|w1) Pr(w3|w1,w2) ... Pr(wn|w1, ..., wn-1)  [chain rule]
– Pr(w2|w1) = count(w1 w2) / count(w1)  [MLE]
– e.g., "the (big, pig) dog"
– Shannon game – predicting the next word given the word sequence so far
• Language modeling – probability matrix
– Language model evaluation – cross entropy: −Σ Pr(w1,n) log PrM(w1,n)
– When PrM(w1,n) = Pr(w1,n), the cross entropy is minimized and the language model M is perfect
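The chain rule with MLE bigram estimates, and the cross-entropy style of evaluation, can be sketched in a few lines of Python (toy three-sentence corpus; everything here is illustrative):

```python
import math
from collections import Counter

def train_bigram_mle(corpus):
    """MLE bigram estimates: pr(w2|w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])                    # history words only
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda w1, w2: (bigrams[(w1, w2)] / unigrams[w1]) if unigrams[w1] else 0.0

def cross_entropy(model_p, sent):
    """Per-word cross entropy -(1/n) * sum log2 prM(wi|wi-1); lower is better."""
    tokens = ["<s>"] + sent.split() + ["</s>"]
    pairs = list(zip(tokens[:-1], tokens[1:]))
    return -sum(math.log2(model_p(a, b)) for a, b in pairs) / len(pairs)

corpus = ["the big dog barks", "the big dog sleeps", "the pig sleeps"]
p = train_bigram_mle(corpus)
# After "the", "big" was seen twice as often as "pig", so MLE gives 2/3 vs 1/3.
assert abs(p("the", "big") - 2/3) < 1e-9
assert abs(p("the", "pig") - 1/3) < 1e-9
```

The Shannon game falls out of the same table: predicting the next word after "the" means taking the argmax of p("the", ·).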
The Noisy Channel Model
• Automatic speech recognition (ASR) is the process by which an acoustic speech signal is converted into a sequence of words [Rabiner and Juang, 1993]
• The noisy channel model [Lee et al., 1996]
– The acoustic input is treated as a noisy version of a source sentence
[Figure: source sentence "Where is the bus stop?" → noisy channel → noisy sentence → decoder → guess at the original sentence "Where is the bus stop?"]
The Noisy Channel Model
• What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
• Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, ..., ot
• Define a sentence as a sequence of words: W = w1, w2, w3, ..., wn
• Golden rule: Ŵ = argmax_{W∈L} P(W|O)
• Bayes rule: Ŵ = argmax_{W∈L} P(O|W) P(W) / P(O)
• Since P(O) is constant over W: Ŵ = argmax_{W∈L} P(O|W) P(W)
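The decoding rule can be made concrete with a toy rescoring example: score each candidate transcription by log P(O|W) + log P(W). All probabilities below are made up for illustration; a real system would get them from an acoustic model and a trained language model:

```python
import math

# Toy noisy-channel decoding: W* = argmax_W P(O|W) * P(W).
candidates = ["where is the bus stop", "wear is the bus stop", "where is the best top"]
acoustic = {"where is the bus stop": 0.30,   # P(O|W): acoustic model score
            "wear is the bus stop": 0.35,    # homophone fits the audio slightly better
            "where is the best top": 0.10}
prior = {"where is the bus stop": 0.60,      # P(W): language model score
         "wear is the bus stop": 0.05,
         "where is the best top": 0.02}

best = max(candidates, key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))
# The language model prior overrides the slightly better acoustic score of
# the homophone "wear", so the decoder recovers the intended sentence.
assert best == "where is the bus stop"
```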
Speech Recognition Architecture Meets Noisy Channel
[Figure: speech signals → feature extraction → observations O → decoding (Ŵ = argmax_{W∈L} P(O|W) P(W)) → word sequence "Where is the bus stop?". Decoding searches a network constructed from three knowledge sources: an acoustic model (HMM estimation from a speech DB), a pronunciation model (G2P), and a language model (LM estimation from text corpora).]
Network Construction
[Figure: building the search network for the Korean digits 일 (il, "one"), 이 (i, "two"), 삼 (sam, "three"), 사 (sa, "four"). The acoustic model supplies HMM states for each phone; the pronunciation model maps each word to its phone sequence (일 → I L, 이 → I, 삼 → S A M, 사 → S A); the language model is applied at word transitions, adding P(일|x), P(이|x), P(삼|x), P(사|x) between start and end nodes, with intra-word and between-word transitions.]
• Expanding every word to the state level, we get a search network [Demuynck et al., 1997]
References (1/2)
• L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. ICASSP, pp. 49–52.
• C. Beaujard and M. Jardino. 1999. Language modeling based on automatic word concatenations. Proceedings of the 6th European Conference on Speech Communication and Technology, vol. 4, pp. 1563–1566.
• K. Demuynck, J. Duchateau, and D. V. Compernolle. 1997. A static lexicon network representation for cross-word context dependent phones. Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143–146.
• T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu. 2002. Pronunciation modeling using a finite-state transducer representation. Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99–104.
• M. Mohri, F. Pereira, and M. Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, vol. 16, no. 1, pp. 69–88.
References (2/2)
• B. H. Juang, S. E. Levinson, and M. M. Sondhi. 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309.
• C. H. Lee, F. K. Soong, and K. K. Paliwal. 1996. Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer Academic Publishers.
• K. K. Paliwal. 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer. Digital Signal Processing, vol. 2, pp. 157–173.
• L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
• L. R. Rabiner and B. H. Juang. 1993. Fundamentals of Speech Recognition. Prentice-Hall.
• S. J. Young, N. H. Russell, and J. H. S. Thornton. 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
• S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. 1996. The HTK Book. Entropic Cambridge Research Lab., Cambridge, UK.
Spoken Language Understanding (SLU)
• Spoken language understanding maps natural-language speech to a semantic frame structure encoding its meaning [Wang et al., 2005]
• What's the difference between NLU and SLU?
– Robustness: noisy and ungrammatical spoken language
– Domain dependence: deeper domain-specific semantics (e.g., Person vs. Cast)
– Dialog: analysis is dialog-history dependent and proceeds utterance by utterance
• Traditional approach: natural language to SQL conversion
[Figure: a typical ATIS system – speech → ASR → text → SLU → semantic frame → SQL generation → SQL → database → response (from [Wang et al., 2005])]
Semantic Representation
• Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002]
– An intermediate semantic representation that serves as the interface between the user and the dialog system
– Each frame contains several typed components called slots; the type of a slot specifies what kind of fillers it is expecting
• "Show me flights from Seattle to Boston"
– Hierarchical representation: ShowFlight(Subject=FLIGHT, Flight(Departure_City=SEA, Arrival_City=BOS))
– XML format:
<frame name="ShowFlight" type="void">
  <slot type="Subject">FLIGHT</slot>
  <slot type="Flight">
    <slot type="DCity">SEA</slot>
    <slot type="ACity">BOS</slot>
  </slot>
</frame>
• Semantic representation on the ATIS task: XML format and hierarchical representation [Wang et al., 2005]
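The ShowFlight frame can be rendered as well-formed XML and traversed with standard tools; a small sketch using Python's ElementTree (slot names as on the slide, nesting reconstructed from the hierarchical representation):

```python
import xml.etree.ElementTree as ET

# A well-formed rendering of the ShowFlight frame from the slide.
frame_xml = """
<frame name="ShowFlight" type="void">
  <slot type="Subject">FLIGHT</slot>
  <slot type="Flight">
    <slot type="DCity">SEA</slot>
    <slot type="ACity">BOS</slot>
  </slot>
</frame>
"""

frame = ET.fromstring(frame_xml)
# Walk the hierarchy: the Flight slot contains the departure/arrival city fillers.
fillers = {slot.get("type"): (slot.text or "").strip()
           for slot in frame.iter("slot")}
assert fillers["DCity"] == "SEA" and fillers["ACity"] == "BOS"
assert fillers["Subject"] == "FLIGHT"
```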
Knowledge-based Systems
• Knowledge-based systems:
– Developers write a syntactic/semantic grammar
– A robust parser analyzes the input text with the grammar
– No large amount of training data is required
• Previous work
– MIT: TINA (natural language understanding) [Seneff, 1992]
– CMU: PHOENIX [Pellom et al., 2000]
– SRI: GEMINI [Dowding et al., 1993]
• Disadvantages
1) Grammar development is an error-prone process
2) It takes multiple rounds to fine-tune a grammar
3) Combined linguistic and engineering expertise is required to construct a grammar with good coverage and optimized performance
4) Such a grammar is difficult and expensive to maintain
Statistical Systems
• Statistical SLU approaches:
– The system can automatically learn from example sentences paired with their corresponding semantics
– The annotations are much easier to create and do not require specialized knowledge
• Previous work
– Microsoft: HMM/CFG composite model [Wang et al., 2005]
– AT&T: CHRONUS (finite-state transducers) [Levin and Pieraccini, 1995]
– Cambridge Univ.: hidden vector state model [He and Young, 2005]
– POSTECH: semantic frame extraction using statistical classifiers [Eun et al., 2004; Eun et al., 2005; Jeong and Lee, 2006]
• Disadvantages
1) Data-sparseness problem: the system requires a large corpus
2) Lack of domain knowledge
Reducing the Effort of Human Annotation
• Active + semi-supervised learning for SLU [Tur et al., 2005]
– Use raw data, and divide it into two sets: S_raw = S_active + S_semi
[Figure: a model trained on a small labeled set predicts labels and confidence estimates for the raw data; examples with confidence below a threshold go to active learning (human labeling), while those above the threshold are filtered and kept as machine-labeled samples; both augment the training data.]
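The confidence-threshold split at the heart of the [Tur et al., 2005] scheme can be sketched as follows; the confidence function here is a deliberately naive stand-in, not a real ASR/SLU confidence score:

```python
def split_raw_data(raw, model_confidence, threshold=0.8):
    """Route unlabeled examples: low-confidence ones to human annotation
    (active learning), high-confidence ones to machine-labeled augmentation
    (semi-supervised learning)."""
    s_active, s_semi = [], []
    for x in raw:
        (s_active if model_confidence(x) < threshold else s_semi).append(x)
    return s_active, s_semi

# Hypothetical confidence function: utterance length as a toy stand-in.
conf = lambda utt: min(1.0, len(utt.split()) / 5)
active, semi = split_raw_data(
    ["book a flight", "show me all morning flights to boston"], conf)
assert active == ["book a flight"]
assert semi == ["show me all morning flights to boston"]
```

The labeled output of both branches is then merged back into the training set, and the model is retrained for the next round.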
Semantic Frame Extraction
• Overall architecture for the semantic analyzer: feature extraction/selection over the information source feeds four stages – dialog act identification, frame-slot extraction, relation extraction, and unification
• Semantic frame extraction (~ an information extraction approach)
1) Dialog act / main action identification ~ classification
2) Frame-slot object extraction ~ named entity recognition
3) Object-attribute attachment ~ relation extraction
– 1) + 2) + 3) ~ unification
• Examples of the semantic frame structure
– "I like DisneyWorld." → Domain: Chat, Dialog Act: Statement, Main Action: Like, Object.Location=DisneyWorld
– "How to get to DisneyWorld?" → Domain: Navigation, Dialog Act: WH-question, Main Action: Search, Object.Location.Destination=DisneyWorld
Frame-Slot Object Extraction
• Frame-slot extraction ~ NER = a sequence labeling problem
• Sequence labeling inference with conditional random fields [Lafferty et al. 2001]
– A CRF is an undirected graphical model over a label sequence y1, ..., yT conditioned on the observation sequence x1, ..., xT
– A probabilistic model: p(y|x) = (1/Z(x)) exp(Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, x, t))
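Decoding with a trained linear-chain CRF reduces to Viterbi search over label sequences. A minimal sketch of that inference step, with made-up emission and transition scores (in a real system these would come from the learned feature weights λ_k):

```python
def viterbi(obs_scores, trans):
    """Best label sequence given per-token label scores (e.g. CRF potentials)
    and label-to-label transition scores; exact MAP inference, log-domain."""
    labels = list(trans)
    # best[t][y] = (score of the best path ending in label y at position t, backpointer)
    best = [{y: (obs_scores[0].get(y, float("-inf")), None) for y in labels}]
    for t in range(1, len(obs_scores)):
        best.append({
            y: max((best[t - 1][p][0] + trans[p][y]
                    + obs_scores[t].get(y, float("-inf")), p)
                   for p in labels)
            for y in labels})
    # Backtrack from the best final label.
    y = max(labels, key=lambda l: best[-1][l][0])
    path = [y]
    for t in range(len(obs_scores) - 1, 0, -1):
        y = best[t][y][1]
        path.append(y)
    return path[::-1]

# Hypothetical scores for "flights from seattle to boston": higher means the
# features favor that label at that position; CITY->CITY is slightly penalized.
trans = {"O": {"O": 0.0, "CITY": 0.0}, "CITY": {"O": 0.0, "CITY": -0.5}}
obs = [{"O": 1.0, "CITY": -1.0}, {"O": 1.0, "CITY": -1.0},
       {"O": -1.0, "CITY": 1.0}, {"O": 1.0, "CITY": -1.0},
       {"O": -1.0, "CITY": 1.0}]
assert viterbi(obs, trans) == ["O", "O", "CITY", "O", "CITY"]
```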
Long-distance Dependency in NER
• "... fly from denver to chicago on dec. 10th 1999" → dec = DEPART.MONTH
• "... return from denver to chicago on dec. 10th 1999" → dec = RETURN.MONTH
– The correct label of "dec." depends on a distant trigger word ("fly" vs. "return")
• A solution: trigger-induced CRF [Jeong and Lee, 2006]
– The basic idea is to add only the bundle of (trigger) features that increases the log-likelihood of the training data
– Feature gain: evaluate candidate (trigger) features by measuring their gain with the Kullback-Leibler divergence
References (1/2)
• J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. 1993. Gemini: a natural language system for spoken language understanding. ACL, pp. 54–61.
• J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction approach for spoken language understanding. ICSLP.
• J. Eun, M. Jeong, and G. G. Lee. 2005. A multiple classifier-based concept-spotting approach for robust spoken language understanding. Interspeech 2005-Eurospeech.
• D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
• Y. He and S. Young. 2005. Semantic processing using the Hidden Vector State model. Computer Speech and Language, 19(1):85–106.
• M. Jeong and G. G. Lee. 2006. Exploiting non-local features for spoken language understanding. COLING/ACL.
• J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML.
References (2/2)
• E. Levin and R. Pieraccini. 1995. CHRONUS, the next generation. Proceedings of the 1995 ARPA Spoken Language Systems Technical Workshop, pp. 269–271, Austin, Texas.
• B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: an architecture for dialogue systems. ICSLP.
• R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. ICML, pp. 538–545.
• S. Seneff. 1992. TINA: a natural language system for spoken language applications. Computational Linguistics, 18(1):61–86.
• G. Tur, D. Hakkani-Tür, and R. E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45:171–186.
• Y. Wang, L. Deng, and A. Acero. 2005. Spoken language understanding: an introduction to the statistical framework. IEEE Signal Processing Magazine, 22(5).
Dialog for EPG (POSTECH)
Unified Chatting and Goal-oriented Dialog (POSTECH)
Spoken Dialog System
[Figure: pipeline ASR → SLU → DM → RG, each component driven by its own models and rules.]
• Automatic speech recognition: user speech "I need a flight from Washington DC to Denver roundtrip" → recognized sentence
• Spoken language understanding: recognized sentence → semantic meaning (ORIGIN_CITY: WASHINGTON, DESTINATION_CITY: DENVER, FLIGHT_TYPE: ROUNDTRIP)
• Dialog management: semantic meaning → system action (GET DEPARTURE_DATE)
• Response generation: system action → system speech "Which date do you want to fly from Washington to Denver?"
VoiceXML-based System
• What is VoiceXML?
– The HTML (XML) of the voice web [W3C, working draft]
– The open standard markup language for voice applications
• Can do
– Rapid implementation and management
– Integration with the World Wide Web
– Mixed-initiative dialog
– Touch-tone (push-button) input on the telephone
– A simple dialog implementation solution
• VoiceXML dialogs are built from
– <menu>, <form> (similar to a "slot & filling" system)
• Limiting the user's responses gives
– Verification, and help for invalid responses
– Good speech recognition accuracy
Example – <form>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <field name="phone_number" type="phone">
      <prompt>Please say your complete phone number</prompt>
    </field>
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
    </field>
    <block>
      <submit next="http://www.example.com/servlet/login"
              namelist="phone_number pin_code"/>
    </block>
  </form>
</vxml>

Browser: Please say your complete phone number
User: 800-555-1212
Browser: Please say your PIN code
User: 1 2 3 4
Frame-based Approach
• Frame-based system [McTear, 2004]
– Asks the user questions to fill slots in a template in order to perform a task (a form-filling task)
– Permits the user to respond more flexibly to the system's prompts (as in Example 2)
– Recognizes the main concepts in the user's utterance
• Example 1)
– System: What is your destination?
– User: London.
– System: What day do you want to travel?
– User: Friday.
• Example 2)
– System: What is your destination?
– User: London on Friday around 10 in the morning.
– System: I have the following connection ...
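The form-filling loop behind such a system can be sketched in a few lines; the keyword-spotting SLU here is a deliberately naive stand-in for a real concept recognizer:

```python
SLOTS = ["destination", "day"]

def naive_slu(utterance):
    """Toy concept spotter: picks out known destinations and weekdays."""
    found = {}
    for w in utterance.lower().replace(",", "").split():
        if w in ("london", "paris"):
            found["destination"] = w
        if w in ("monday", "friday"):
            found["day"] = w
    return found

def dialog(user_turns):
    """Prompt for the first empty slot until the frame (template) is full."""
    frame, transcript = {}, []
    turns = iter(user_turns)
    while any(s not in frame for s in SLOTS):
        empty = next(s for s in SLOTS if s not in frame)
        transcript.append(f"System: What is your {empty}?")
        frame.update(naive_slu(next(turns)))  # the user may over-answer (Example 2)
    return frame, transcript

# One flexible user utterance fills both slots, as in Example 2 above.
frame, log = dialog(["London on Friday around 10 in the morning"])
assert frame == {"destination": "london", "day": "friday"}
assert len(log) == 1
```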
Agent-based Approach
• Properties [Allen et al., 1996]
– Complex communication using unrestricted natural language
– Mixed initiative
– Co-operative problem solving
– Theorem proving, planning, distributed architectures
– Conversational agents
• An example
– User: I'm looking for a job in the Calais area. Are there any servers?
– System: No, there aren't any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?
• The system attempts to provide a more co-operative response that might address the user's needs
Galaxy Communicator Framework
• The Galaxy Communicator software infrastructure is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialog systems [Bayer et al., 2001]
• An open-source architecture for constructing dialog systems
• History: originated as the MIT Galaxy system; developed and maintained by MITRE
• Message-passing protocol; hub-and-clients architecture
References (1/2)
• J. F. Allen, B. Miller, E. Ringger, and T. Sikorski. 1996. A robust system for natural spoken dialogue. ACL.
• S. Bayer, C. Doran, and B. George. 2001. Dialogue interaction with the DARPA Communicator infrastructure: the development of useful software. HLT Research.
• R. Cole, editor. 1997. Survey of the State of the Art in Human Language Technology. Cambridge University Press, New York, NY, USA.
• G. Ferguson and J. F. Allen. 1998. TRIPS: an integrated intelligent problem-solving assistant. AAAI, pp. 26–30.
• K. Komatani, F. Adachi, S. Ueno, T. Kawahara, and H. Okuno. 2003. Flexible spoken dialogue system based on user models and dynamic generation of VoiceXML scripts. SIGDIAL.
• S. Larsson and D. Traum. 2000. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6(3-4).
• S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer. 2003. Providing the basis for human-robot interaction: a multi-modal attention system for a mobile robot. ICMI, pp. 28–35.
References (2/2)
• E. Levin, R. Pieraccini, and W. Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.
• C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A situation-based dialogue management using dialogue examples. ICASSP.
• M. Walker, L. Hirschman, and J. Aberdeen. 2000. Evaluation for DARPA Communicator spoken dialogue systems. LREC.
• M. F. McTear. 2004. Spoken Dialogue Technology. Springer.
• I. O'Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialogue management in Java. Science of Computer Programming, 54(1):99–124.
• B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: an architecture for dialogue systems. ICSLP.
• A. Rudnicky, E. Thayer, P. Constantinides, C. Tchou, R. Shern, K. Lenzo, W. Xu, and A. Oh. 1999. Creating natural dialogs in the Carnegie Mellon Communicator system. Eurospeech, vol. 4, pp. 1531–1534.
• W3C. Voice Extensible Markup Language (VoiceXML) Version 2.0, Working Draft. http://www.w3c.org/TR/voicexml20/
The Role of Dialog Management
• For example, in the flight reservation system:
– System: Welcome to the Flight Information Service. Where would you like to travel to?
– Caller: I would like to fly to London on Friday, arriving around 9 in the morning.
– System: ????????????????????
• In order to process this utterance, the system has to engage in the following processes:
1) Recognize the words that the caller said (speech recognition)
2) Assign a meaning to these words (language understanding)
3) Determine how the utterance fits into the dialog so far and decide what to do next (dialog management)
• Target response: "There is a flight that departs at 7:45 a.m. and arrives at 8:50 a.m."
Information State Update Approach – Rule-based DM (Larsson and Traum, 2000)
• A method of specifying a dialogue theory that makes it straightforward to implement
• Consists of the following five constituents:
– Informational components
– Aspects of common context (e.g., participants, common ground, linguistic and intentional structure, obligations and commitments, beliefs, intentions, user models, etc.)
– Formal representations
– How to model the informational components (e.g., as lists, sets, typed feature structures, records, etc.)
Information State Approach
– Dialogue moves
– Trigger the update of the information state
– Are correlated with externally performed actions
– Update rules
– Govern the updating of the information state
– Update strategy
– Decides which rules to apply at a given point from the set of applicable ones
The Hand-crafted Dialog Model is Not Domain Portable
• A tree branching for every possible situation – it can become very complex
[Figure: a branching dialog-state tree from Start through states such as Information + Origin, Information + Destination, Information + Origin + Dest., Information + Date, Information + Origin + Date, Information + Dest + Date, Information + Origin + Dest + Date, Flight #, Flight # + Date, Flight # + Information, and Flight # + Reservation.]
An Optimization Problem
• Dialog management as an optimization problem
– Optimization goal: achieve the application goal while minimizing a cost (objective) function
– In general: minimize the number of user-system turns and DB accesses until all slots are filled
• Simple example: the month-and-day problem
– Design a dialog system that gets a correct date (month and day) from a user through the shortest possible interaction
– Objective function: C_D = ω_i · #interactions + ω_e · #errors + ω_f · #unfilled slots
• How to mathematically formalize? – Markov decision process (MDP)
Mathematical Formalization
• Markov decision process (MDP) (Levin et al. 2000)
– Problems with a cost (or reward) objective function are well modeled as Markov decision processes
– The specification of a sequential decision problem for a fully observable environment that satisfies the Markov assumption and yields additive rewards
[Figure: the dialog manager sends a dialog action (prompts, queries, etc.) to the environment (user, external DB, or other servers), which returns a dialog state and a cost (turns, errors, DB accesses, etc.).]
Month and Day Example
• The optimal strategy is the one that minimizes the expected cost
• Strategy 1: give up immediately ("Good bye."), leaving both slots unfilled
– C1 = 2·ω_f
• Strategy 2: one open prompt ("Which date?") filling month and day, each with error probability P1, then "Good bye."
– C2 = ω_i + 2·P1·ω_e
• Strategy 3: two directed prompts ("Which month?", "Which day?"), each with lower error probability P2, then "Good bye."
– C3 = 2·ω_i + 2·P2·ω_e
• Strategy 1 is optimal if ω_i + 2·P1·ω_e − 2·ω_f > 0, i.e., when the recognition error rate is too high
• Strategy 3 is optimal (over Strategy 2) if 2·(P1 − P2)·ω_e − ω_i > 0, i.e., when P1 is much higher than P2, justifying the cost of a longer interaction
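A reconstruction of the Levin et al. strategy comparison can be checked numerically; the cost formulas follow the slide's objective function C_D, and the weights and error rates below are illustrative:

```python
def costs(w_i, w_e, w_f, p1, p2):
    """Expected costs of the three month/day strategies.
    p1: per-slot error rate of the open 'Which date?' prompt;
    p2: (lower) error rate of the separate month/day prompts."""
    c1 = 2 * w_f                 # give up immediately: two unfilled slots
    c2 = w_i + 2 * p1 * w_e      # one open prompt, two error-prone slots
    c3 = 2 * w_i + 2 * p2 * w_e  # two directed prompts, lower error rate
    return {"S1": c1, "S2": c2, "S3": c3}

# A high open-prompt error rate makes the longer directed dialog optimal,
# matching the condition 2*(p1 - p2)*w_e - w_i > 0.
c = costs(w_i=1.0, w_e=10.0, w_f=5.0, p1=0.4, p2=0.1)
assert min(c, key=c.get) == "S3"
assert abs(c["S2"] - 9.0) < 1e-9 and abs(c["S3"] - 4.0) < 1e-9
```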
POMDP (Young 2002)
• Partially observable Markov decision process (POMDP)
– A POMDP extends the Markov decision process by removing the requirement that the system knows its current state precisely
– Instead, the system makes observations about the outside world that give incomplete information about the true current state
– Belief state: a distribution b(s) over MDP states, maintained in the absence of exact state knowledge
• Belief update: b'(s') = p(s'|o,a,b) = p(o|s',a) Σ_{s∈S} p(s'|s,a) b(s) / p(o|a,b)
• Expected reward: ρ(b,a) = Σ_{s∈S} b(s) r(s,a)
• MDP vs. POMDP: current state s vs. belief state b(s); reward function r(s,a) vs. ρ(b,a); next state s' vs. next belief state b'(s')
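The belief update can be written out directly; below is a toy two-state confirmation example (all transition/observation probabilities are illustrative):

```python
def belief_update(b, a, o, p_trans, p_obs, states):
    """b'(s') = p(o|s',a) * sum_s p(s'|s,a) * b(s), normalized by p(o|a,b)."""
    unnorm = {s2: p_obs(o, s2, a) * sum(p_trans(s2, s, a) * b[s] for s in states)
              for s2 in states}
    z = sum(unnorm.values())      # = p(o | a, b), the normalizing constant
    return {s2: v / z for s2, v in unnorm.items()}

states = ["correct", "wrong"]                        # did SLU get the slot right?
trans = lambda s2, s, a: 1.0 if s2 == s else 0.0     # the true state persists
obs = lambda o, s2, a: {"correct": 0.9, "wrong": 0.2}[s2]  # P(user says "yes" | s)

# Start maximally uncertain; a "yes" confirmation shifts belief toward "correct".
b2 = belief_update({"correct": 0.5, "wrong": 0.5}, "confirm", "yes",
                   trans, obs, states)
assert abs(b2["correct"] - 0.45 / 0.55) < 1e-9       # ~0.818
```

This is exactly why POMDP dialog managers degrade gracefully under ASR errors: uncertainty is carried in b(s) rather than forced into a single state.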
Example-based Dialog Model Learning (Lee et al 2006)
• Example-based dialog modeling
– Automatically modeled from a dialog corpus
– Example-based techniques using a dialog example database (DEDB)
– The model is simple and domain portable
• DEDB indexing and searching
– Query keys: user intention, semantic frames, discourse history
• Tie-breaking by an utterance similarity measure
– Lexico-semantic similarity: normalized edit distance
– Discourse history similarity: cosine similarity
Example-based Dialog Modeling
• Indexing and querying
– Semantic-based indexing for the dialog example database
– A lexical-based example database would need many more examples
– The SLU result is the most important index key
– Indexing is done automatically from the dialog corpus
• Example entry
– Input (indexing keys from the user utterance "그럼 SBS 드라마는 언제 하지?" – "Then, when is the SBS drama showing?"):
– Dialog act: Wh-question
– Main action: Search_start_time
– Component slots: [channel = SBS, genre = drama]
– Discourse history: [1,0,1,0,0,0,0,0,0]
– Output (system concept): Inform(date, start_time, program)
Example-based Dialog Modeling
• Tie-breaking via a lexico-semantic representation
– The user utterance "그럼 SBS 드라마는 언제 하지?" ("Then, when is the SBS drama showing?") with component slots [channel = SBS, genre = 드라마 (drama)] becomes the class sequence "그럼 [channel] [genre] 는 언제 하 지"
• Utterance similarity measure between the current utterance and retrieved examples
– Lexico-semantic similarity: current "그럼 [channel] [genre] 는 언제 하 지" (slot-filling vector [1,0,1,0,0,0,0,0,0]) vs. a retrieved example "[date] [genre] 는 몇 시에 하 니" (slot-filling vector [1,0,0,1,0,0,0,0,0])
– Discourse history similarity computed over the slot-filling vectors
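The two similarity measures can be sketched directly; the token sequences below are the slide's example, and everything else is illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution/match
        prev = cur
    return prev[-1]

def lexico_semantic_sim(u1, u2):
    """Normalized edit distance, turned into a similarity in [0, 1]."""
    return 1.0 - edit_distance(u1, u2) / max(len(u1), len(u2))

def history_sim(v1, v2):
    """Cosine similarity between slot-filling (discourse history) vectors."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = (sum(x * x for x in v1) ** 0.5) * (sum(y * y for y in v2) ** 0.5)
    return dot / norm if norm else 0.0

u_cur = "그럼 [channel] [genre] 는 언제 하 지".split()
u_ex = "[date] [genre] 는 몇 시에 하 니".split()
assert 0.0 < lexico_semantic_sim(u_cur, u_ex) < 1.0
assert abs(history_sim([1, 0, 1, 0], [1, 0, 0, 1]) - 0.5) < 1e-9
```

A weighted combination of the two scores would then rank the tied retrieved examples.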
Strategy of Example-based Dialog Modeling
[Figure: a domain expert builds a dialogue corpus, which is automatically indexed into the dialogue example DB. At run time, the user's utterance is analyzed into a semantic frame and user intention; query generation (also using the discourse history) retrieves candidate dialogue examples from the DB; tie-breaking by utterance similarity (lexico-semantic similarity and discourse history similarity) selects the best dialogue example, from which the system responses are generated.]
Multi-domain/genre Dialog Expert
[Figure: processing USER: "What is on TV now?" – the agent/domain spotter sets Agent = Task, Domain = EPG; dialog act identification yields Dialog Act = Wh-question; frame-slot extraction (EPG) yields Main Action = Search_Program, Start_Time = now; discourse inference consults the discourse history stack (previous user utterance, previous dialog act and semantic frame, previous slot-filling vector); dialog examples are retrieved from the EPG DEDB (indexed from an EPG dialog corpus built by an EPG expert) and ranked by utterance similarity; the database manager fills the response from the TV schedule database (built from web contents); when no example is retrieved, EPG meta-rules are applied via an XML rule parser; the result is SYSTEM: "XXX" is on SBS, .....]
References
• S. Larsson and D. Traum. 2000. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, vol. 6, no. 3-4, pp. 323–340.
• E. Levin, R. Pieraccini, and W. Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 11–23.
• S. Young. 2002. Talking to machines (statistically speaking). ICSLP, Denver.
• I. Lane and T. Kawahara. 2006. Verification of speech recognition results incorporating in-domain confidence and discourse coherence measures. IEICE Transactions on Information and Systems, 89(3):931–938.
• C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A situation-based dialogue management using dialogue examples. ICASSP.
• C. Lee, S. Jung, M. Jeong, and G. G. Lee. 2006. Chat and goal-oriented dialog together: a unified example-based architecture for multi-domain dialog management. Proceedings of the IEEE/ACL 2006 Workshop on Spoken Language Technology (SLT), Aruba.
• D. Litman and S. Pan. 1999. Empirically evaluating an adaptable spoken dialogue system. International Conference on User Modeling, pp. 55–64.
• M. F. McTear. 2004. Spoken Dialogue Technology. Springer.
• I. O'Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialog management in Java. Science of Computer Programming, 54(1):99–124.
• M. Walker, D. Litman, C. Kamm, and A. Abella. 1997. PARADISE: a general framework for evaluating spoken dialogue agents. ACL/EACL, pp. 271–280.
Dialog Workbench/Studio
• Motivation
– The biggest obstacle to using dialog systems in practice is that system maintenance is difficult!
– Practical dialog systems need:
– Easy and fast dialog modeling, to handle new patterns of dialog
– Easy build-up of new information sources (e.g., the TV-guide domain needs a new TV schedule every day)
– Reduced human effort for maintenance – all dialog components should stay synchronized!
– Easy tutoring of the system – a semi-automatic learning ability is necessary, since humans can't teach everything
• Previous work
– Rapid application development: CSLU Toolkit [CSLU Toolkit]
– Schema design & management: SGStudio [Wang and Acero, 2005]
– Helping non-experts develop a user interface: SUEDE [Anoop et al., 2001]
Dialog Workbench
• Dialog Studio [Jung et al., 2006]
– A dialog workbench system for the example-based spoken dialog system
– Can do:
– Tutor the dialog system by adding and editing dialog examples
– Synchronize all dialog components: ASR + SLU + DM + information access
– Provide a semi-automatic learning ability
– Reduce the human effort of building up and maintaining dialog systems
– Key idea:
– Generate possible dialog candidates from the corpus
– Predict the possible dialog tagging information using the current model
– Have a human approve or disapprove the predictions
Issue – "Human Effort Reduction"
• Tagging of new dialog examples can be supported by the system using the old models
– The dialog utterance pool (DUP) holds automatically generated example candidates
– The administrator can audit the DUP and modify the instances
– The ASR and SLU models are then automatically retrained
[Figure: a new dialog utterance enters; the old dialog manager tries to handle it and displays the result as a recommendation; a human audits and modifies the result (dialog example editing); approved examples flow through the dialog utterance pool into new corpus generation and example-DB indexing, updating the ASR model, SLU model, and example-based DM model.]
POSTECH Dialog Studio Demo
References (1/2)
• S. J. Cox, and S. Dasmahapatra. 2000. A semantically-based confidence measure for speech recognition. In Proc. of the ICSLP 2000, Beijing.
• J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction approach for spoken language understanding. In: Proc. of the ICSLP, Jeju Korea.
• T. J. Hazen, J. Polifroni, and S. Seneff. 2002. Recognition confidence scoring and its use in speech language understanding systems. Computer Speech and Language, vol. 16, no. 1, pp. 49–67.
• T. J. Hazen, T. Burianek, J. Polifroni, and S. Seneff. 2000. Recognition confidence scoring for use in speech understanding systems. In Proc. of the ISCA ASR2000 Tutorial and Research Workshop, Paris.
• H. Jiang. 2005. Confidence measures for speech recognition. Speech Communication, vol. 45, no. 4, pp. 455–470.
• S. Jung, C. Lee, and G. G. Lee. 2006. Three-Phase Verification for Spoken Dialog System. In Proc. of IUI.
References (2/2)
• M. McTear, I. O’Neill, P. Hanna, and X. Liu. 2005. Handling errors and determining confirmation strategies - an object-based approach. Speech Communication, vol. 45, no. 3, pp. 249–269.
• I. O’Neill, P. Hanna, X. Liu, D.Greer, and M. McTear. 2005. Implementing advanced spoken dialogue management in Java. Science of Computer Programming, vol. 54, no. 1, pp. 99–124.
• T. Paek, and E. Horvitz. 2000. Conversation as action under uncertainty. In Proc. of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 455-464.
• A. Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Dissertation. University of Pennsylvania.
• F. Torres, L.F. Hurtado, F. García, E. Sanchis, and E. Segarra. 2005. Error handling in a stochastic dialog system through confidence measures. Speech Communication, vol. 45, no. 3, pp. 211–229.
References
• K. S. Anoop, R.K. Scott, J. Chen, A. Landay, and C. Chen, 2001. SUEDE: Iterative, Informal Prototyping for Speech Interfaces. Video poster in Extended Abstracts of Human Factors in Computing Systems: CHI, Seattle, WA, pp. 203-204.
• S. Jung, C. Lee, G. G. Lee. 2006. Dialog Studio: An Example Based Spoken Dialog System Development Workbench, Dialogs on dialog: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems, Interspeech2006-ICSLP satellite workshop
• Y. Wang, and A. Acero. 2005. SGStudio: Rapid Semantic Grammar Development for Spoken Language Understanding. Proceedings of the Eurospeech Conference. Lisbon, Portugal.
• CSLU Toolkit, http://cslu.cse.ogi.edu/toolkit/
Information Access Dialog
[Figure: the dialog manager turns the user's question into a query to the information sources, and the returned result into an answer.]
Information Access Agent
[Figure: the information access agent comprises an RDB access module, backed by a relational database, and a question answering module, backed by the web, as its information sources.]
Building Relational DB from Unstructured Data
• A Relational DB Model is Equivalent to an Entity-Relationship Model
• We can build an ER Model with the Information Extraction Approach– Named-Entity Recognition (NER)– Relation Extraction
Named-Entity Recognition
• Named-Entity Recognition (NER)
– A task that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, etc. [Chinchor, 1998]
– Example: “Hillary Clinton moved to New York last year.”
– Hillary Clinton = Person; New York = Geo-Political Entity
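NER output like the example above is usually represented as BIO sequence labeling, the encoding used by CRF-style taggers such as [McCallum and Li, 2003]. A minimal sketch of turning BIO tags back into entity spans; the tag names PER/GPE and the helper `bio_decode` are illustrative, not from the tutorial:

```python
# Hypothetical sketch: decode BIO tags into (entity_text, entity_type) spans.
def bio_decode(tokens, tags):
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # start of a new entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the entity
            current.append(tok)
        else:                                  # O tag: flush any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tokens = ["Hillary", "Clinton", "moved", "to", "New", "York", "last", "year", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-GPE", "I-GPE", "O", "O", "O"]
print(bio_decode(tokens, tags))
# [('Hillary Clinton', 'PER'), ('New York', 'GPE')]
```

A statistical tagger (e.g., a CRF) predicts the tag sequence; the decoding step above is the same regardless of the model.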
Relation Extraction
• Relation Extraction
– A task that detects and classifies relations between named entities
– Example: “Hillary Clinton moved to New York last year.”
– (Hillary Clinton [Person], New York [Geo-Political Entity]): AT.Residence
Question Answering
• Question Answering System for Information Access Dialog
– SiteQ [Lee et al. 2001; Lee and Lee, 2002]
– Searches for answers, not documents
[Figure: SiteQ pipeline. The question is POS-tagged; answer type identification yields the answer type while query formation feeds document retrieval; dynamic answer passage selection, answer finding, and answer justification then produce the answer.]
References (1/2)
• C. Blaschke, L. Hirschman, and A. Yeh. 2004. BioCreative Workshop.
• N. Chinchor. 1998. Overview of MUC-7/MET-2. MUC-7.
• N. Kambhatla. 2004. Combining lexical, syntactic and semantic features with Maximum Entropy models for extracting relations. ACL.
• E. Kim, Y. Song, C. Lee, K. Kim, G. G. Lee, B. Yi, and J. Cha. 2006. Two-phase learning for biological event extraction and verification. ACM TALIP 5(1):61-73.
• J. Kim, T. Ohta, Y. Tsuruoka, and Y. Tateisi. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining, Bioinformatics, Vol 19 Suppl.1, pp. 180-182.
• J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labelling sequence data. ICML.
• G. G. Lee, J. Seo, S. Lee, H. Jung, B. H. Cho, C. Lee, B. Kwak, J. Cha, D. Kim, J. An, H. Kim, and K. Kim. 2001. SiteQ: Engineering High Performance QA system Using Lexico-Semantic Pattern Matching and Shallow NLP. TREC-10.
References (2/2)
• S. Lee, and G. G. Lee. 2002. SiteQ/J: A question answering system for Japanese. NTCIR workshop 3 meeting: evaluation of information retrieval, automatic text summarization and question answering, QA tasks.
• A. McCallum, and W. Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, CoNLL.
• S. Soderland. 1999. Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 233-72
• Y. Song, E. Kim, G. G. Lee, and B. Yi. 2005. POSBIOTM-NER: a trainable biomedical named-entity recognition system. Bioinformatics, 21 (11): 2794-2796.
• G. Zhou, J. Su, J. Zhang, M. Zhang. 2005. Exploring Various Knowledge in Relation Extraction. ACL.
POSTECH Chatbot Demo
Emotion Recognition
• Why is emotion recognition important in dialog systems?
– Emotion is part of the user context.
– It has been recognized as one of the most significant factors in how people communicate with each other [T. Polzin, 2000].
– Applications: affective HCI (human-computer interaction)
– Home networking, intelligent robots, chatbots, …
“I feel blue today.”
“Do you need a cheer-up music?”
“What's up?”
Traditional Emotion Recognition
[Figure: traditional emotion recognition. The user's turn ("I am very happy.") provides facial expression, speech, and text inputs; facial expression analysis, speech analysis, and linguistic analysis feed a classifier that makes the final emotion decision and outputs an emotion hypothesis.]
Emotional Categories
• Emotional categories by system
– Emotional Speech DB: Positive (confident, encouraging, friendly, happy, interested); Negative (angry, anxious, bored, frustrated, sad, fear); Neutral. Ex) EPSaT (Emotional Prosody Speech and Transcription), SiTEC DB
– Call center: Positive vs. Non-Positive; Anger, Fear, Satisfaction, Excuse, Neutral. Ex) HMIHY, stock exchange customer service center
– Tutoring system: Positive, Negative, Neutral. Ex) ITSpoke
– Chat messenger: Neutral, Happy, Sad, Surprise, Afraid, Disgusted, Bored, …
Emotional Features
• Speech-to-Emotion
– Acoustic correlates of prosody, such as the pitch, energy, and speaking rate of the utterance, have been used to recognize emotions.
– In general, the features extracted from speech play a significant role in recognizing emotion.
– Feature sets:
– Acoustic-prosodic: fundamental frequency (f0) – max, min, mean, standard deviation; energy – max, min, mean, standard deviation; speaking rate – voiced frames / total frames
– Pitch contour: ToBI contour, nuclear pitch accent, phrase and boundary tones
– Voice quality: spectral tilt
Emotional Features
• Text-to-Emotion
– Basic idea: people tend to use specific words to express their emotions in spoken dialogs, because they have learned how certain words relate to the corresponding emotions.
– Psychologists have tried to identify the language of emotions by asking people to list the English words that describe specific emotions.
– The emotional keywords identified in spoken language are highly domain dependent.
– Feature sets:
– Lexical: n-grams (unigram, bigram, trigram); non-speech human noise (laughter, sighs); filled pauses ("Oh"); emotional keywords
– Pragmatic: dialog act, user identifier
– Context: past observations in the previous user turns
Classifier
• Basic algorithm of an emotion recognizer
– Very similar to text categorization and topic detection; most emotion detection systems use the same basic algorithms.
• Emotional keyword extraction
– Emotional salience (using mutual information)
– TF*IDF (cosine normalization)
• Emotion classifier
– Feature-based classification combining several features, using statistical classifiers:
– Linear discriminant classifier (LDC)
– k-nearest neighbor (k-NN)
– Support vector machine (SVM)
Emotional salience of word w_n:
sal(w_n) = I(E; W = w_n) = Σ_{j=1..k} P(e_j | w_n) · i(w_n, e_j), where i(w_n, e_j) = log[ P(e_j | w_n) / P(e_j) ]

TF*IDF weight with cosine normalization:
w_{i,j} = tf_{i,j} · log(N / n_i) / sqrt( Σ_{i=1..t} [ tf_{i,j} · log(N / n_i) ]² )
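Emotional salience rewards words whose presence sharply shifts the emotion distribution away from the prior. A rough sketch under that definition; the utterance data and labels below are toy examples, not from the tutorial:

```python
import math
from collections import Counter

# Hedged sketch of emotional salience:
# sal(w) = sum_j P(e_j | w) * log( P(e_j | w) / P(e_j) )
# `utterances` is a list of (word_list, emotion_label) pairs.
def emotional_salience(word, utterances):
    n = len(utterances)
    emo_counts = Counter(e for _, e in utterances)      # prior counts of emotions
    with_word = [e for words, e in utterances if word in words]
    if not with_word:
        return 0.0
    emo_given_w = Counter(with_word)                    # emotion counts given the word
    sal = 0.0
    for e, c in emo_given_w.items():
        p_e_w = c / len(with_word)                      # P(e_j | w)
        p_e = emo_counts[e] / n                         # prior P(e_j)
        sal += p_e_w * math.log(p_e_w / p_e)
    return sal

data = [(["i", "am", "so", "angry"], "negative"),
        (["this", "is", "awful"], "negative"),
        (["i", "am", "happy"], "positive"),
        (["what", "time", "is", "it"], "neutral")]
print(emotional_salience("angry", data) > emotional_salience("is", data))  # True
```

As expected, an emotion-laden word ("angry") scores higher than a function word ("is"), which is why salience is useful for selecting emotional keywords.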
Advanced ChatBot
• Context-Sensitive ChatBot
– Most chatbots cannot reflect the discourse context of previous user turns [ALICE, 2002; B. Shawar, 2002].

Context-free chatbot (obscure answering):
USER: Did you watch the Korean team's game yesterday?
SYSTEM: What game?
USER: How come you didn't know that?
SYSTEM: Well, I don't know.

Context-sensitive chatbot (more intelligent answering):
USER: Did you watch the Korean team's game yesterday?
SYSTEM: What game?
USER: How come you didn't know that?
SYSTEM: I am not interested in sports.

Which is better?
Discourse History in ChatBot
Utterance 1 – USER: I had a date last week.
(Dialog Act = statement-non-opinion; Main Goal = perform; Agent = chat; Domain = friend; date = last week; event = date)
SYSTEM: Did you like her?

Utterance 2 – USER: Well, I think so, but…
(Dialog Act = statement-opinion; Main Goal = judge; Agent = chat; Domain = friend)
SYSTEM: But, what?

Utterance 3 – USER: I don't know her real feeling.
(Dialog Act = statement-opinion; Main Goal = know; Agent = chat; Domain = friend; theme = real feeling)
SYSTEM: Why? What happened?
DA = statement-non-opinionMA = performDOMAIN = friendDate = 1Theme = 1
DA = statement-opinionMA = judgeDOMAIN = friend
DA = statement-opinionMA = knowDOMAIN = friendTheme = 1
Context-Free Index Key
Discourse Coherence
P(DA_t, MA_t | DA_{t-1}, MA_{t-1}) = P(statement-opinion, judge | statement-non-opinion, perform) > θ
Context-Sensitive Index Key
Previous Semantics = “statement-non-opinion,perform”Previous Keyword = “date”Scenario Session = “2”DA = statement-opinionMA=judgeDOMAIN=friend
Previous Semantics = “statement-opinion,judge”Previous Keyword = “NULL”Scenario Session = “2”DA = statement-opinionMA=knowDOMAIN=friend
Previous Semantics = “<s>,<s>”Previous Keyword = “date”DA = statement-non-opinionMA = performDOMAIN = friendDate = 1Theme = 1
Abstraction of previous user turn
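The coherence test above keeps a candidate response only if the dialog-act/main-goal transition from the previous turn is probable enough. A minimal sketch; the transition table and threshold are invented for illustration:

```python
# Toy dialog-act/main-goal transition probabilities P(DA_t, MA_t | DA_{t-1}, MA_{t-1}).
# These numbers are made up; a real system would estimate them from a dialog corpus.
trans = {
    (("statement-non-opinion", "perform"), ("statement-opinion", "judge")): 0.4,
    (("statement-non-opinion", "perform"), ("yes-no-question", "ask")): 0.05,
}

def coherent(prev, cur, theta=0.1):
    """Accept the transition only when its probability exceeds the threshold theta."""
    return trans.get((prev, cur), 0.0) > theta

print(coherent(("statement-non-opinion", "perform"),
               ("statement-opinion", "judge")))  # True
```

An incoherent candidate (e.g., an abrupt yes-no question after a narrative turn) falls below θ and is filtered out before example retrieval.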
References
• ALICE. 2002. A.L.I.C.E. A.I. Foundation. http://www.alicebot.org/
• L. Holzman and W. Pottenger. 2003. Classification of Emotions in Internet Chat: An Application of Machine Learning Using Speech Phonemes. Technical Report LU-CSE-03-002, Lehigh University.
• J. Liscombe. 2006. Detecting and Responding to Emotion in Speech: Experiments in Three Domains. Ph.D. Thesis Proposal, Columbia University.
• D. Litman and K. Forbes-Riley, 2005. Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors, Speech Communication, 48(5):559-590.
• C. M. Lee and S. S. Narayanan. 2005. Toward Detecting Emotions in Spoken Dialogs. IEEE Transactions on Speech and Audio Processing, 13(2):293-303.
• T. Polzin and A. Waibel. 2000. Emotion-sensitive human-computer interfaces. the ISCA Workshop on Speech and Emotion.
• B. Shawar and E. Atwell, 2002. A comparison between Alice and Elizabeth chatbot systems. School of Computing Research Report, University of Leeds
• X. Zhe and A. Boucouvalas, 2002. Text-to-Emotion Engine for Real Time Internet Communication, CSNDDSP.
POSTECH multimodal Dialog System Demo
Multi-Modal Dialog
• Task performance and user preference for multi-modal over speech interfaces [Oviatt et al., 1997]
– 10% faster task completion
– 23% fewer words
– 35% fewer task errors
– 35% fewer spoken disfluencies

"What is a decent Japanese restaurant near here?" – hard to represent using a single modality!
Multi-Modal Dialog
• Components of multi-modal dialog system [Chai et al., 2002]
[Figure: speech is processed by spoken language understanding and gesture by gesture understanding (uni-modal understanding, each producing a uni-modal interpretation frame); facial expression is a further input. The multimodal integrator performs multi-modal understanding and reference analysis to build a multi-modal interpretation frame, which passes through discourse understanding to the dialog manager.]
References (1/2)
• R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics interface,” Computer Graphics Vol. 14, no. 3, 262-270.
• J. Chai, S. Pan, M. Zhou, and K. Houck, 2002, Context-based Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).
• J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.
• J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.
• P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I. Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.
References (2/2)
• H. Holzapfel, K. Nickel, and R. Stiefelhagen. 2004. Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. Proceedings of the International Conference on Multimodal Interfaces (ICMI).
• M. Johnston, 1998. Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics , 624-630.
• M. Johnston, and S. Bangalore. 2000. Finite-state multimodal parsing and understanding. Proceedings of COLING-2000.
• M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. 2002. MATCH: An architecture for multimodal dialogue systems. In Proceedings of ACL-2002.
• S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.
POSTECH Conversational TTS Demo: Korean (Dialog)
• Text-to-speech system [M. Beutnagel, et al., 1999; J. Schroeter, 2005]
– Front end
– Text normalization: converts raw-text items such as numbers and abbreviations into their written-out word equivalents
– Linguistic analysis: POS tagging, grapheme-to-phoneme conversion
– Prosody generation: pitch, duration, intensity, pauses
– Back end
– Unit selection: selects the units in the speech DB most similar to the target to produce the actual sound output
Conversational Text-to-Speech
[Figure: TTS pipeline. Text → text normalization → linguistic analysis → prosody generation → (symbolic linguistic representation) → unit selection → synthesis back end → speech.]
• Given an alphabet of spelling symbols (graphemes) and an alphabet of phonetic symbols (phonemes), a mapping must be achieved that transliterates strings of graphemes into strings of phonemes [W. Daelemans, et al., 1996]
• Alignment
Multilingual Grapheme-to-Phoneme Conversion
[Figure: alignment example. Korean graphemes (ㅎ ㅏ ㄱ ㄱ ㅛ _ ㅇ ㅔ) are aligned symbol-by-symbol to phonemes (h a g g yo _ _ e), with null symbols where no counterpart exists. Rule generation: alignment → rule extraction → rule pruning → rule association → dictionary. G2P conversion: input text → text normalizer → canonical form of graphemes → phonemes.]
• Predicting the break index from a POS-tagged / syntactically analyzed sentence
• Break index [J. Lee, et al., 2002]
– No break: phrase-internal word boundary and a juncture smaller than a word boundary
– Minor break: minimal phrasal juncture such as an AP (accentual phrase) boundary
– Major break: a strong phrasal juncture such as an IP (intonational phrase) boundary
Break Index Prediction
[Figure: two-step break index prediction. A POS tag sequence is first labeled by a probabilistic break index predictor based on a trigram model (wtag wtag break wtag); a C4.5 decision tree then corrects errors in the break-index-tagged POS tag sequence.]
• Uses C4.5 (decision trees)
• Assumes that linguistic and lexical information influence the tone of each syllable
• IP tone label prediction [K. E. Dusterhoff, et al., 1999]
– Assigns one of the tones "L%", "H%", "LH%", "HL%", "LHL%", and "HLH%" to the last syllable of the IP
– Features: POS, punctuation type, phrase length, onset, nucleus, coda
• AP tone label prediction
– Assigns an "L" or "H" tone to each syllable of the AP
– Features: POS, phrase length, location in the prosodic phrase
Pitch Prediction using K-ToBI
• Index of units: pitch, duration, position in syllable, neighboring phones• Half-diphone synthesis [A. J. Hunt, 1996; A. Conkie, 1999]
– The diphone approach cuts units at points of relative stability (the center of a phonetic realization) rather than at the volatile phone-to-phone transition, where so-called coarticulatory effects appear.
Unit Selection
References (1/2)
• M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal. 1999. The AT&T Next-Gen TTS System. Joint Meeting of ASA, EAA, and DAGA.
• A. Conkie. 1999. Robust Unit Selection System for Speech Synthesis. Joint Meeting of ASA, EAA, and DAGA.
• W. Daelemans. 1996. Language-Independent Data-Oriented Grapheme-to-Phoneme Conversion. Progress in Speech Synthesis, Springer Verlag, pp77-90.
• K. E. Dusterhoff, A. W. Black, and P. Taylor. 1999. Using decision trees within the tilt intonation model to predict f0 contours. Eurospeech-99.
• A. J. Hunt, and A. W. Black. 1996. Unit Selection in a concatenation speech synthesis system using a large speech database. ICASSP-96, vol. 1, pp 373-376.
References (2/2)
• S. Kim. 2000. K-ToBI (Korean ToBI) Labelling Conventions. UCLA Working Papers in Phonetics 99.
• S. Kim, J. Lee, B. Kim, and G. G. Lee. 2006. Incorporating Second-Order Information Into Two-Step Major Phrase Break Prediction for Korean. ICSLP-06
• J. Lee, B. Kim, and G. G. Lee. 2002. Automatic Corpus-based Tone and Break-Index Prediction using K-ToBI Representation. ACM transactions on Asian language information processing (TALIP), Vol 1, Issue 3, pp207-224.
• J. Lee, S. Kim, and G. G. Lee. 2006. Grapheme-to-Phoneme Conversion Using Automatically Extracted Associative Rules for Korean TTS System. ICSLP-06
• J. Schroeter. 2005. Electrical Engineering Handbook, pp16(1)-16(12).
Statistical Machine Translation
POSTECH Statistical MT System Demo
– Korean-English
– Japanese-Korean
– Speech-to-speech
SMT Task
• SMT: Statistical Machine Translation
• Task:
– Translate a sentence in one language into another language
– using statistical features of the data
나는 생각한다, 고로 나는 존재한다.
→ I think, thus I am.

P(I | 나는) = 0.7, P(me | 나는) = 0.2, …
P(think | 생각하다) = 0.5, P(think | 생각) = 0.4, …
The Machine Translation Pyramid
[Figure: the machine translation pyramid. Foreign sentence → foreign syntax → foreign semantics → interlingua → native semantics → native syntax → native sentence.]
An interlingua-based system requires syntactic analysis, semantic analysis, language generation, and so on: that is, all other NLP techniques and linguistic knowledge.
SMT in the Machine Translation Pyramid
[Figure: the same MT pyramid; the statistical system translates directly from the foreign sentence to the native sentence along the base of the pyramid.]
A statistical system requires nothing but data and statistics; it does not require any other NLP techniques or linguistic knowledge.
Statistical Model
• Statistical Modeling
[Figure: statistical modeling. Statistical analysis of Korean-English parallel text yields the translation model P(k|e); statistical analysis of English text yields the language model P(e). Decoding maps Korean → broken English → English.]
Statistical Model
• Fundamental models
– Language model: makes the output English fluent
– Translation model: makes the translation faithful
– Decoding algorithm: finds the best sentence
[Figure: decoding. The decoding algorithm combines the translation model and the language model to map the input to the output.]

e_best = argmax_e P(e | k) = argmax_e P(k | e) · P(e)
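The noisy-channel decision rule (pick the English sentence maximizing the product of translation probability and language model probability) can be sketched with toy tables. The probability values below are made-up illustrations, not real model parameters:

```python
# Toy noisy-channel scorer: e_best = argmax_e P(k|e) * P(e).
translation = {("나는", "I"): 0.7, ("나는", "me"): 0.2}   # P(k|e), invented numbers
language    = {"I": 0.05, "me": 0.03}                     # P(e), invented numbers

def score(korean, english):
    # Unseen pairs get a tiny floor probability instead of zero.
    return translation.get((korean, english), 1e-9) * language.get(english, 1e-9)

candidates = ["I", "me"]
best = max(candidates, key=lambda e: score("나는", e))
print(best)  # "I"
```

A real decoder searches over whole sentences rather than single words, but the scoring principle is the same.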
Translation Model
• Assigns a probability to each word/phrase pair
– For a given word/phrase, lists all possible translations
– Gives high probability to good translations and low probability to poor ones
• Independence assumption
– Word translations are independent of one another
– The probability of a sentence translation is the product of the word translation probabilities
P(K | E) = Π_i P(k_i | e_i)
Decoding
• Search space: exponential in the length of the sentence
– Pruning reduces the search space
– Threshold pruning and beam search algorithms
[Figure: decoding search tree. Hypotheses expand from no word translated (coverage -----, P = 1.0) to one word translated ("I", *----, P = 0.5; "think", -*---, P = 0.4) to two words translated ("I think", **---, P = 0.25; "am", *---*, P = 0.13), and so on.]
Evaluation
• BLEU score– Most famous metric– Range 0~1.– Higher score means better translation
BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )

BP = 1 if c > r; exp(1 − r/c) if c ≤ r

p_n: n-gram precision, ignoring duplicate counts
N: maximum order of n-gram
w_n: weight
BP: brevity factor related to the length of the candidate translation
c: length of the candidate translation
r: length of the reference sentence
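A minimal sketch of the BLEU computation for a single candidate/reference pair, with uniform weights w_n = 1/N and no smoothing (so any zero n-gram precision gives BLEU = 0); this simplification is mine, not the tutorial's:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)          # brevity penalty
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        if clipped == 0:                                 # unsmoothed: zero kills the score
            return 0.0
        log_p += (1.0 / max_n) * math.log(clipped / total)
    return bp * math.exp(log_p)

print(bleu("i think thus i am".split(), "i think thus i am".split()))  # 1.0
```

Production implementations average over multiple references and apply smoothing for short sentences, but the core formula matches the one above.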
IBM Model
• Model 1– Source length only dependent on target length– Assume uniform probability for position alignment– Source word only dependent on aligned word
• Model 2– Target position depends on the source position
• Model 3– Add Fertility Model
• Model 4
– Models re-ordering of phrases
– Deficient: alignments can generate source positions outside the sentence length
• Model 5– Remove deficiency from model 4
GIZA++
• GIZA
– Part of the SMT toolkit EGYPT
– A word alignment tool
– An implementation of IBM Model 4
• GIZA++
– An extension of GIZA
– Adds Model 5, the HMM alignment model, …
Phrase-based SMT
• Pharaoh [Philipp Koehn, 2003]
– An implementation of statistical phrase-based machine translation
– Phrase: not a syntactic phrase, but a sequence of contiguous words
– Still SMT, but the translation unit is the phrase
Pharaoh Overview
• Based on the noisy channel model (typical SMT)
• Language model p(e) replaced with p(e) = p_LM(e) · ω^{length(e)}
– The word cost ω adjusts the output length:
– ω > 1: prefer longer translations
– ω = 1: indifferent to length
– ω < 1: prefer shorter translations
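In log space the word cost simply adds length(e) · log(ω) to the language model score, so a large enough ω can flip the ranking toward a longer hypothesis. A toy illustration with invented log-probabilities:

```python
import math

# Toy word-cost illustration: score(e) = log p_LM(e) + length(e) * log(w).
# The log-probabilities and lengths below are made up.
def scored(log_plm, length, w):
    return log_plm + length * math.log(w)

short = scored(-4.0, 3, 1.5)   # short hypothesis with a better LM score
long_ = scored(-5.0, 6, 1.5)   # longer hypothesis with a worse LM score
print(long_ > short)  # True: with w = 1.5 the word cost favors the longer output
```

With ω = 1 the extra term vanishes and the ranking is decided by the language model alone.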
Pharaoh Overview
• Translation model p(f|e) replaced with
• The input sentence f is segmented into a sequence of I phrases
– Translation occurs phrase by phrase, and each phrase translation is assumed independent
– A distortion probability d() is introduced:
– a_i: start position of the foreign phrase translated into the i-th English phrase
– b_{i-1}: end position of the foreign phrase translated into the (i-1)-th English phrase
p(k_1^I | e_1^I) = Π_{i=1..I} φ(k_i | e_i) · d(a_i − b_{i-1}), where each foreign phrase k_i is translated into the English phrase e_i

d(a_i − b_{i-1}) = α^{| a_i − b_{i-1} − 1 |}
Pharaoh Training
• Alignment: intersection and union
[Figure: word alignment example for "생맥주 한 잔 주세요." ↔ "A draft beer, please." GIZA++ is run in both directions (K-E and E-K); the intersection of the two alignments gives high-precision alignment points, and a heuristic grows the intersection toward the union.]
Pharaoh Training
• Learning all phrase pairs that are consistent with the word alignment
• (A Draft | 생맥주) ( Beer | 한 잔 ) (, | 주) (Please | 세요) (. | .)• (A Draft Beer | 생맥주 한 잔) (Beer , | 한 잔 주) (, Please | 주 세요) (Please . | 세요 .)• (A Draft Beer , | 생맥주 한 잔 주 ) ( Beer , Please | 한 잔 주 세요 ) ( , Please | 주 세요 .)• (A Draft Beer , Please | 생맥주 한 잔 주 세요 ) ( Beer , Please . | 한 잔 주 세요 .)• (A Draft Beer , Please . | 생맥주 한 잔 주 세요 . )
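The phrase pairs above are exactly the source/target spans that are "consistent" with the word alignment: no alignment point may link a word inside the pair to a word outside it. A hedged sketch of that extraction rule over index spans (the tiny two-word alignment is a toy example):

```python
# Sketch of consistent phrase-pair extraction from a word alignment.
# alignment: list of (src_index, tgt_index) points; spans are inclusive index pairs.
def extract_phrases(n_src, n_tgt, alignment, max_len=4):
    pairs = set()
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_len, n_src)):
            tgts = [t for s, t in alignment if s1 <= s <= s2]
            if not tgts:
                continue                      # must contain at least one point
            t1, t2 = min(tgts), max(tgts)
            # consistency: no point in [t1, t2] may reach outside [s1, s2]
            if all(s1 <= s <= s2 for s, t in alignment if t1 <= t <= t2):
                pairs.add(((s1, s2), (t1, t2)))
    return pairs

# 0:"I" 1:"think"  <->  0:"나는" 1:"생각한다" (toy diagonal alignment)
alignment = [(0, 0), (1, 1)]
print(sorted(extract_phrases(2, 2, alignment)))
# [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 1), (1, 1))]
```

Running this rule over the beer-ordering example yields the enumerated pairs above, from single words up to the full sentence pair.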
Techniques to improve
• Pre-processing
– Normalize the input text into an "easy to translate" form
– Reordering, tagging, paraphrasing, …
[Figure: foreign → normalization → normalized foreign → translation → native.]
Techniques to improve
• Post-processing
– The translation may contain errors.
– Perform error-correction decoding to fix trivial errors
– E.g., morpheme connectivity checks
[Figure: foreign → translation → native with errors → error correction → native.]
Add POS tag
• Approach– Add part-of-speech (POS) tags to the training data
• Effect
– Distinguishes some of the homonyms
– Changes the spacing unit
• Why useful?
– For many languages, automatic POS tagging is available.
– The spacing unit becomes a unit of meaning.
Delete Useless words
• Approach
– For some language pairs, certain words are useless for translation.
– Delete such words to help word alignment.
• Effect
– Reduces the number of misaligned pairs
• Example: Korean-English translation
– English: the, a, an, -es (Korean tends not to mark number on nouns)
– Korean: some kinds of post-positions (은, 는, 이, 가, 을, 를, …) (English has no case markers)
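This preprocessing amounts to filtering each side of the corpus through a drop list before alignment. A minimal sketch; the drop lists below are illustrative, not the tutorial's actual lists:

```python
# Sketch of "delete useless words" preprocessing before word alignment.
# These drop sets are examples only; real lists would be tuned per language pair.
EN_DROP = {"the", "a", "an"}
KO_DROP = {"은", "는", "이", "가", "을", "를"}

def strip_tokens(tokens, drop):
    """Remove tokens that have no counterpart in the other language."""
    return [t for t in tokens if t not in drop]

print(strip_tokens("I read the book".split(), EN_DROP))  # ['I', 'read', 'book']
```

Fewer spurious tokens means fewer chances for the aligner to link an article or case marker to a content word.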
Using Dictionary
• Approach
– Simply append the dictionary to the end of the parallel corpus
• Effects
– Adds one count for each correct phrase pair in the dictionary
– Increases the coverage of the vocabulary
• Why useful?
– A dictionary is usually easy to obtain (already built for the web or other applications)
– Adding a dictionary gives a significant improvement.
Dividing Language Model
[Figure: dividing the language model. The translation model P(k|e) is trained on Korean/English bilingual text; two language models are trained, one on the English side of the bilingual text and one on separate English text, and the decoder selects which LM to use for the broken-English → English step.]
Speech Translation
• ASR (Automatic Speech Recognizer)
– Generates text from the input speech signal
• TTS (Text-To-Speech)
– Synthesizes speech from the given text
• Speech translation task
– Translate speech in one language into speech in another language
– Combines ASR, TTS, and machine translation
Combining ASR, TTS and SMT
• Cascading approach
– Connect ASR, SMT, and TTS in a cascade: the ASR result is the input to the SMT system, and the SMT output is the input to the TTS system.
– Simple!
[Figure: cascade. Original speech → ASR → recognized text → SMT → translated text → TTS → translated speech.]
Combining ASR, TTS and SMT
ASRASR
ASRASRResult1Result1
ASRASRResult2Result2
ASRASRResult3Result3
ASRASRResult4Result4
ASRASRResult nResult n
SMTSMTResult1Result1
SMTSMTResult2Result2
SMTSMTResult3Result3
SMTSMTResult4Result4
SMTSMTResult nResult n
HHighest score translationighest score translation
Speech SignalSpeech Signal
TTSTTS
Speech SignalSpeech Signal
SMTSMTSystemSystem
scoringscoring
scoringscoring
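The n-best combination can be sketched as rescoring: translate every ASR hypothesis and keep the translation with the best combined score. The `toy_translate` table, weights, and log-probabilities below are invented stand-ins for a real recognizer and decoder:

```python
# Hypothetical n-best rescoring sketch.
# asr_nbest: list of (recognized_text, asr_log_prob) pairs.
# translate: callable returning (translation, smt_log_prob); a stand-in for a real SMT decoder.
def rescore(asr_nbest, translate, asr_weight=1.0, smt_weight=1.0):
    best, best_score = None, float("-inf")
    for text, asr_lp in asr_nbest:
        translation, smt_lp = translate(text)
        score = asr_weight * asr_lp + smt_weight * smt_lp
        if score > best_score:
            best, best_score = translation, score
    return best

def toy_translate(text):
    # Known sentence translates well; misrecognitions translate poorly.
    table = {"나는 생각한다": ("I think", -1.0)}
    return table.get(text, ("<unk>", -10.0))

nbest = [("나는 생각한다", -2.0), ("나는 새 각한다", -1.5)]
print(rescore(nbest, toy_translate))  # "I think"
```

Note that the second ASR hypothesis has the better recognition score, but its poor translatability lets the combined score recover the correct sentence, which is the point of integrating the two systems.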
References (1/2)
• P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, and R.L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pages 263-311.
• C. Callison-Burch, P. Koehn, and M. Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of NAACL.
• M. Collins, P. Koehn, and I. Kucerova. 2005. Clause Restructuring for Statistical Machine Translation. ACL.
• P. Koehn, F.J. Och and D. Marcu. 2003. Statistical Phrase-Based Translation. In proceedings of HLT, Pages 127-133.
• P. Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrase-Based SMT. In Proceedings of AMTA pages 115-124.
• P. Koehn. 2004. Pharaoh, User Manual and Description for Version 1.2. http://www.isi.deu./licensed-sw/pharaoh/.
References (2/2)
• P. Koehn, A. Axelrod, A. Birch Mayne, C. Callison-Burch, M. Osborne and D. Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. IWSLT.
• J. Lee, D. Lee, and G. G. Lee. 2006. Improving phrase-based Korean-English statistical machine translation. ICSLP-06
• F.J. Och and H. Ney. 2000. Improved statistical alignment models. 38th Annual Meeting of the ACL, pages 440-447.
• K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. 40th Annual Meeting of the ACL, pages 311-318. Philadelphia, PA, Jul.
• R. Zhang, G. Kikui. 2006. Integration of Speech Recognition and Machine Translation: Speech Recognition word Lattice Translation. Speech Communication, Vol. 48, Issues 3-4.
Thanks To
• Minwoo Jung
• Cheongjae Lee
• SangKeun Jung
• Seungwon Kim
• Jinsik Lee
• Jonghun Lee
• Kyungdeok Kim
• Sukwhan Kim
• DonghHyeon Lee
• HyungJong Noh
• And others…
Thank you! Any questions?