Human-Machine Dialogue: Expectation and Reality

Dr. Zhang Sen ([email protected])
Chinese Academy of Sciences, Beijing, CHINA
2014/8/15
OUTLINE
• Overview
• Core Technologies
  – Speech-to-Text
  – Text-to-Speech
  – Natural Language Processing
  – Dialogue Management
  – Middlewares & Protocols
• Conclusion
Overview
• Motivation and Goal
• State of the art
• Why so difficult?
• Application Areas
• My work
Motivation and Goal
• Machines are tools invented by humans
  – the Industrial Revolution freed humans from manual labor
  – will the Information Revolution free humans from mental labor?
  – fundamental functions are required
• Expectation and goal (Bill Gates)
  – talk with machines freely via speech/natural language
  – machines can understand/imitate human activities
• Machine intelligence
  – the Turing test, classical and extended
Turing’s Question
• Alan M. Turing, “Computing Machinery and Intelligence” (Mind, 1950, Vol. 59, No. 236, pp. 433-460)
– I propose to consider the question, “Can machines think?” This should begin with definitions of the meaning of the terms “machine” and “think”.
• To answer this question, Turing proposed the “Imitation Game” later named the “Turing Test”
Turing Test
[Diagram: an observer interrogates Subject #1 and Subject #2 and must decide which subject is the machine]
Simple, operative, objective, and convincing
Conditions and Answers
• Conditions
  – classical Turing test: communication assumed to be via typed text (keyboard)
  – extended Turing test: communication assumed to be via speech input/output
  – communication assumed to be unrestricted (as to subject, etc.)
• The ability to communicate is equated with “thinking” and “intelligence” (Turing)
Turing Test - Today
• Today, despite great advances in hardware and software (a computer can even defeat the greatest chess player), machines are still unable to fool an interrogator on unrestricted subjects.
• Turing predicted that the (classical) test would be passed within 50 years; strictly speaking, it has neither been passed nor failed yet. The extended Turing test is harder and still has a long way to go. The same holds for some AI experts’ predictions from the 1950s and 1960s.
• Human-machine dialogue has become possible and can provide useful functions
  – travel reservations, stock brokerages, banking, etc.
Impact and Influence
• Though the Turing test has not been passed, it has promoted and boosted great advances in many areas:
  – Computer Science
– AI
– Cognitive Science
– Natural Language Processing (NLU, NLG, ...)
– MT
– Robot
– Speech-to-Text
– Text-to-Speech
– Computer Vision
– etc
Projects
• DARPA projects, two rounds in the 80s and 90s
  – ATIS (996 words, connected speech)
  – Communicator (>5000 words, continuous speech)
• MIT projects, Galaxy
• CMU, OGI, JANUS project
• Bell Lab, IBM, Microsoft, VUI, VoiceXML, SALT
• Verbmobil, DFKI (Germany), SUNDIAL
• Grenoble, INRIA (France), MIAMM, OZONE
• ATR, JSPS projects (Japan)
• CSTAR-I, II, III, S2S project
• etc
State of the Art
• Subject-restricted, small-vocabulary dialogue is possible, but far from satisfactory
• Metrics for the evaluation of H-M dialogue systems (CU Communicator 2002; the values are means)
  – task completion (70%), time to completion (260 s)
  – total turns to completion (37), response latency (2 s)
  – user words to task end (39), system words to task end (332)
  – number of reprompts (3)
  – WER (22-30%)
  – the DARPA Communicator project proposed a set of metrics including more than 18 items
Overview Architecture
[Diagram: overall architecture: Speech I/O, NLU, and the DM connected through middleware to applications and a DB/KB]
Ozone’s Architecture
[Diagram: the Ozone layered architecture. Ozone applications & services sit on top. A Service Enabling layer (WP2) provides context awareness (context model, community/preference/profile management), a user-interface system (UI management, smart agents, multi-modal widgets, perception QoS), identity and security (user/key/access-rights/digital-rights management), and a service infrastructure (discovery/lookup, naming, control, communication, eventing and transactions, composition, migration). A Software Environment layer (WP3) provides middleware services (registry, stream manager), data processing (compression, encryption, rendering, scaling), storage (file system, DBMS, content and knowledge stores), external and application-related services, networking, and application & content infrastructure. A Platform Architecture layer (WP4) covers the device platform: compute infrastructure (run-time environment, reconfigurable computing elements, multi-processors, memory hierarchies, power control), the Ozone run-time environment (standardized execution environment, portable code), network abstraction (addressing, streaming, QoS monitoring and control, topology view), device and functionality control services, and I/O devices (screen, speaker, camera, microphone, keyboard, pointer, sensors, actuators, clock). Stated system goals: seamless operation, interoperability, extendibility, high performance, adaptability, reconfigurability.]
Ozone’s Architecture (WP2)
[Diagram: the Service Enabling layer (WP2). Ozone applications & services sit above application services; a user-interaction module provides interaction services (speech recognition, gesture recognition, animated agent, video browser); dialog management works with multi-modal widgets, a smart agent, user-interface management, and perception QoS; context awareness covers user context and Ozone context; security services provide authentication, content-access protection, and encryption; everything rests on the Software Environment layer.]
Galaxy Hub Architecture (MIT, CU)
[Diagram: MIT Galaxy hub architecture with the CU Communicator: a central Hub connects ASR, TTS, an audio server, the database, an NL generator, an NL parser, the DM, a confidence server, and the WWW.]
Why So Difficult?
• Natural language variation
  – ambiguity at the word and sentence levels
  – NL is an open, changing set; can it be treated numerically?
• Speech variation and communication-channel distortion
  – non-stationary signal; rate, power, timbre, …
  – what is the fundamental feature of speech?
• Computing power limitations
  – the requirements of optimal search algorithms
• Current computer architecture limitations
  – weak at dealing with analog, fuzzy values
• Limited knowledge of human intelligence
  – the learning mechanism of human beings
Open Issues
• Can ASR hear everything?
• Can NLP understand everything heard?
• Can DM deal with multiple strands?
• Does TTS sound natural?
• In my opinion, problems such as ASR, NLP, TTS, and MT share some common characteristics: once one is solved, the others will follow.
Main Methodologies
• Statistical approaches
  – training problems, false-sample problems
• Rule-based approaches
  – rule selection and conflicts
• DP-based search algorithms
  – Viterbi, forward-backward search, beam search
• Mathematical modeling
  – time-series finite-state transition models
Application Areas
• Improving existing applications
  – scheduling: airlines, hotels
  – financial: banks, brokerages
• Enabling new applications
  – complex travel planning
  – voice Web search and browsing
  – speech-to-speech MT
  – catalogue ordering
• Many applications require Text-to-Speech
  – role games
  – speaking toys
Works in Waseda
• Project “Research on human-machine dialog through spoken language”, JSPS sponsored, 1998-2000
  – Improved DTW approach with regard to prominent acoustic features, Proc. of the ASJ, 1999
  – Re-estimation of LP coefficients in the sense of the L∞ criterion, IEEE ICSLP 2000, Beijing, China
  – Visual approach for Automatic Pitch Period Estimation, IEEE ICASSP 2000, Istanbul, Turkey
  – Automatic Labeling of Initials and Finals in a Chinese Speech Corpus, IEEE ICSLP 2000, Beijing, China
  – A speech coding approach based on a human hearing model, Proc. of the ASJ, 2000
Works in CSLR, CU
• Project “CU Communicator”, DARPA sponsored and NSF supported, 2000-2001
• N-gram LM smoothing based on word-class information
• Dynamic pronunciation modeling for ASR adaptation (Amdahl’s law, 50 most common words)
• What kinds of pronunciation variation are hard for tri-phones to model? IEEE ICASSP 2001, Salt Lake City, USA
Works in INRIA-LORIA
• Project “Multidimensional Information Access using Multiple Modalities”, EU IST sponsored, 2002-2003
• Middleware between the ASR engine and the DM, XML
• Domain-specific N-gram LM generation based on a set of French language rules, Perl
• HMM-based acoustic modeling improvement
  – Some issues on speech signal re-sampling at arbitrary rate, IEEE ISSPA 2003, Paris, France
  – An Effective Combination of Different Order N-Grams, The 17th Pacific Asia Conference on Language, Information and Computation, 2003, Singapore
  – Comparison of speech signal resampling approaches, Proc. of the ASJ, 2003, Tokyo, Japan
  – Text-to-Pinyin conversion based on context knowledge and d-tree for Mandarin, IEEE NLP-KE 2003, Beijing, China
Spoken Language Toolkit
• Finished in 2003; the speech signal analysis module was integrated into Snorri, LORIA
• Functions:
  – speech signal analysis
  – speech-to-text
  – text-to-speech
  – text-to-grapheme
Snapshot of Toolkit (1)
Snapshot of Toolkit (2)
Snapshot of Toolkit (3)
Core Technologies
Based on the requirements analysis of human-machine
communication, at least the following technologies
should be included:
• Speech-to-Text
• Text-to-Speech
• Natural Language Processing
• Dialogue Management
• Middlewares & Protocols
Speech-To-Text
The Speech-to-Text Problem
Find the most likely word sequence Ŵ among all possible sequences given the acoustic evidence A:

  Ŵ = argmax_W P(W | A)

A tractable reformulation of the problem (Bayes’ rule, dropping the constant P(A)) is:

  Ŵ = argmax_W P(A | W) P(W)

where P(W) is the language model and P(A | W) is the acoustic model; evaluating the argmax is a daunting search task.
Speech Recognition Architecture
[Diagram: analog speech enters the front end, producing the observation sequence O1 O2 … OT; the decoder, using the acoustic model, dictionary, and language model, outputs the best word sequence W1 W2 … WT.]
Front-End Processing: Feature Extraction
[Figure: feature extraction pipeline, including dynamic features; after K.F. Lee]
Overlapping Sample Windows
The speech signal is non-stationary; under a short-term approximation, each short windowed segment is viewed as stationary.
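This short-term assumption is what motivates cutting the signal into short overlapping windows before spectral analysis. A minimal pure-Python sketch; the frame and hop sizes are illustrative values for 16 kHz audio, not taken from the slides:

```python
import math

def frames(signal, frame_len=400, hop=160):
    """Split a signal into overlapping frames: e.g. 25 ms windows
    every 10 ms at a 16 kHz sampling rate (frame_len=400, hop=160)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window, applied to each frame before spectral
    analysis to reduce edge discontinuities."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1))
            for k in range(n)]

x = [0.0] * 1600                     # 100 ms of silence at 16 kHz
w = hamming(400)
windowed = [[s * wk for s, wk in zip(f, w)] for f in frames(x)]
```

Each windowed frame is then passed to the spectral analysis described on the following slides.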
Cepstrum Computation
• Cepstrum is the inverse Fourier transform of the log spectrum
  c(n) = (1/2π) ∫_{-π}^{π} log |S(e^{jω})| e^{jωn} dω,   n = 0, 1, …, L-1

In computation the IDFT takes the form of a weighted DCT; see HTK.
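The definition above can be sketched directly: forward DFT, log magnitude, inverse DFT. This toy O(N²) version is for illustration only; a real front end uses an FFT and, for MFCCs, a DCT of mel filter-bank log energies:

```python
import cmath
import math

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.
    Toy O(N^2) DFTs for clarity, not efficiency."""
    n = len(frame)
    # forward DFT
    spec = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]
    log_mag = [math.log(abs(s) + 1e-12) for s in spec]  # guard log(0)
    # inverse DFT; the log-magnitude spectrum of a real signal is
    # real and even, so the cepstrum comes out (numerically) real
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
cep = real_cepstrum(frame)
```

The low-order coefficients of `cep` summarize the spectral envelope, which is why truncated cepstra make compact ASR features.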
Mel Cepstral Coefficients
• Construct mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples:
• Filter bank: linear spacing below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics
• The most common feature set for recognizers
Cepstrum as Vector Space Features
[Figure: overlapping analysis frames mapped to cepstral feature vectors]
Features Used in ASR
• LPC– Linear predictive coefficients
• PLP– Perceptual Linear Prediction
• Though MFCC has been used successfully, what is the truly robust speech feature?
Acoustic Models
• Template-based AM, used in DTW, obsolete
• Acoustic states represented by Hidden Markov Models (HMMs)
– Probabilistic State Machines - state sequence unknown, only feature vector outputs observed
– Each state has output symbol distribution
– Each state has transition probability distribution
– Issues: what topology is proper? how many states in a model?
How many mixtures in a state?
[Figure: example HMM topologies: normal, silence, and connected models]
Limitations of HMM
• HMMs assume that state duration follows an exponential (geometric) distribution
• The transition probability depends only on the origin and destination states
• All observation frames are assumed to depend only on the state that generated them, not on the neighboring observation frames (the frame-independence assumption)
Paper: “Transition control in acoustic modeling and Viterbi search”
Basic Speech Unit Models
• Create a set of HMMs representing the basic sounds (phones) of a language
  – English has about 40 distinct phonemes
  – Chinese has about 22 Initials + 37 Finals
  – a “lexicon” is needed for pronunciations
  – letter-to-sound rules for unusual words
  – co-articulation effects must be modeled
• Tri-phones: each phone modified by its onset and trailing context phones (1k-2k used in English)
  – e.g. pl-c+pr
Language Models
• What is a language model?
  – a quantitative ordering of the likelihood of word sequences (statistical viewpoint)
  – a set of rules specifying how to create word sequences or sentences (grammar viewpoint)
• Why use language models?
  – not all word sequences are equally likely
  – search space optimization (*)
  – improved accuracy (multiple passes)
  – word-lattice to n-best conversion
Finite-State Language Model
• Write a grammar of possible sentence patterns
• Advantages:
  – long history/context
  – no need for a large text database (rapid prototyping)
  – integrated syntactic parsing
• Problems:
  – the work of writing grammars
  – word sequences the grammar does not enable simply do not exist for the recognizer
  – used in small-vocabulary ASR, not for LVCSR
• Example network: (show me | display) (any | the next | the last) (page | picture | text file)
Statistical Language Models
• Predict the next word based on the current word and the history
• The probability of the next word is given by
  – Trigram: P(wi | wi-1, wi-2)
  – Bigram: P(wi | wi-1)
  – Unigram: P(wi)
• Advantages:
  – trainable on large text databases
  – “soft” prediction (probabilities)
  – can be directly combined with the AM in decoding
• Problems:
  – need a large text database for each domain
  – sparseness problems, smoothing approaches
    • backoff approach
    • word-class approach
• Used in LVCSR
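As a concrete illustration of the bigram case, here is a toy trainer with add-one (Laplace) smoothing; add-one is chosen only for brevity, where the backoff or word-class approaches above would be used in practice:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> and </s> sentence
    boundary markers."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def bigram_prob(uni, bi, vocab_size, w_prev, w):
    """P(w | w_prev) with add-one (Laplace) smoothing, so unseen
    bigrams still get non-zero probability."""
    return (bi[(w_prev, w)] + 1) / (uni[w_prev] + vocab_size)

corpus = [["show", "me", "flights"], ["show", "me", "fares"]]
uni, bi = train_bigram(corpus)
V = len(uni)                      # vocabulary incl. boundary markers
print(bigram_prob(uni, bi, V, "show", "me"))   # 0.375 = (2+1)/(2+6)
```

The "soft" prediction property is visible here: even the unseen bigram ("me", "show") gets a small non-zero probability rather than being ruled out, which is exactly what lets these models combine with the AM in decoding.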
Statistical LM Performance
ASR Decoding Levels
[Diagram: decoding levels: states, phonemes, words, and sentences, each constrained by a knowledge source (acoustic models, dictionary, language model); e.g. /w/ /ah/ /ts/ maps to “what's” and /th/ /ax/ to “the”, with candidate words such as “display”, “kirk's”, “willamette's”, “sterett's”, “location”, “longitude”, “latitude”.]
Decoding Algorithms
• Given the observations, how do we determine the most probable utterance/word sequence? (DTW in template-based matching)
• The Dynamic Programming (DP) algorithm was proposed by Bellman in the 1950s for multistep decision processes; its “principle of optimality” is a divide-and-conquer strategy.
• DP-based search algorithms are used in speech recognition decoders to return the n-best paths or a word lattice through the acoustic model and the language model.
• A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation.
• Issues: computational underflow, balancing the LM and AM scores.
Viterbi Search
• Uses Viterbi decoding
  – takes MAX, not SUM (Viterbi vs. Forward)
  – finds the optimal state sequence, not the optimal word sequence
  – computational load: O(T·N²)
• Time synchronous
  – extends all paths at each time step
  – all paths have the same length (no need to normalize to compare scores, unlike A* decoding)
Viterbi Search Algorithm
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  For each time step t from 0 to T do
    For each state s from 0 to num-states do
      For each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        If viterbi[s',t+1] = 0 or viterbi[s',t+1] < new-score then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
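The pseudocode above can be turned into a small runnable sketch. The HMM here (a two-state “sil”/“speech” toy with made-up probabilities) is purely illustrative, not a real acoustic model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding over a dict-based toy HMM: take the MAX
    (not the SUM) over predecessors at each step; O(T * N^2)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for s in states:
            # best predecessor state for s at time t
            prev, score = max(((r, V[t - 1][r][0] * trans_p[r][s])
                               for r in states), key=lambda x: x[1])
            col[s] = (score * emit_p[s][obs[t]], prev)
        V.append(col)
    # backtrace from the highest-probability state in the last column
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ("sil", "speech")
start = {"sil": 0.8, "speech": 0.2}
trans = {"sil": {"sil": 0.7, "speech": 0.3},
         "speech": {"sil": 0.2, "speech": 0.8}}
emit = {"sil": {"low": 0.9, "high": 0.1},
        "speech": {"low": 0.3, "high": 0.7}}
print(viterbi(["low", "high", "high"], states, start, trans, emit))
# -> ['sil', 'speech', 'speech']
```

The back-pointers recover the single best state sequence, which, as the previous slide notes, is not necessarily the best word sequence.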
Viterbi Search Trellis
[Figure: Viterbi search trellis over words W1 and W2 across time steps 0-3]
Viterbi Search Insight
[Figure: trellis columns at time t and t+1 for Word 1 and Word 2, each with states S1-S3; within-word transitions score OldProb(S1) · OutProb · TransProb, while cross-word transitions score OldProb(S3) · P(W2 | W1); each cell stores a score, a back-pointer, and a parameter pointer.]
Backtracking
• Find the best association between words and the signal
• Compose words from phones using the dictionary
• Backtracking finds the best state sequence
[Figure: alignment of /th/ and /e/ against the signal from t1 to tn]
N-Best Speech Results
• Use a grammar to guide recognition
• Post-processing based on the grammar/LM
• Word-lattice to n-best conversion
[Figure: the speech waveform plus a grammar go into the ASR, which returns an n-best result, e.g. N=1 “Get me two movie tickets…”, N=2 “I want to movie trips…”, N=3 “My car's too groovy”]
Complexity of Search
• Lexicon: contains all the words in the system’s vocabulary along with their pronunciations (often multiple pronunciations per word; # of items in the lexicon)
• Acoustic models: HMMs representing the basic sound units the system can recognize (# of models, # of states per model, # of mixtures per state)
• Language model: determines the possible word sequences allowed by the system (fan-out, perplexity, entropy)
ASR As Modern AI
• Draws on a wide range of AI techniques
  – knowledge representation & manipulation
    • AM and LM, lexicon, observation vectors
  – machine learning
    • Baum-Welch for HMMs
    • nearest-neighbor & k-means clustering for signal identification
  – “soft” probabilistic reasoning / Bayes’ rule
    • manages the uncertain mapping between signal, phone, and word
• ASR as an expert system
ASR Summary
• The performance criterion is WER (word error rate)
• Three main knowledge sources
  – acoustic model (Gaussian mixture models)
  – language model (N-grams, finite-state grammars)
  – dictionary (context-dependent sub-phonetic units)
• Decoding
  – Viterbi decoder
  – time-synchronous
  – A* decoding (stack decoding; IBM, X.D. Huang)
Text-to-Speech
Text-to-Speech
• What is Text-to-Speech?
  – producing spoken language from input text and high-level prosodic parameters
• Main approaches
  – concatenative synthesis: glue waveforms together (Festival, MBROLA)
  – parameter-based synthesis: Klatt’s formant synthesis (MITalk, some clones)
  – articulatory synthesis (still under R&D)
• Basic unit selection
  – di-(tri-)phone models: mid-point to mid-point
  – syllable, sub-syllable (Initials, Finals)
Text-to-Speech Status
• State-of-the-art Text-to-Speech is
  – intelligible, but
  – needs to sound more natural
  – needs better prosody and a more personal feel
  – handling of nouns and special names is incomplete
  – handling of times and digits is incomplete
  – handling of abbreviations is incomplete
• Some TTS systems
  – Festival, MBROLA, Jin-Sheng-Yu-Zheng, CTTS
Human Speech Production Levels
• World knowledge (text normalization)
• Semantics (concept, thought, meaning)
• Syntax (grammar)
• Word (word pronunciation)
• Phonology (intonation assignment)
• Articulation (articulator movements, F0, amplitude, duration)
• Acoustics (synthesis)
Concatenative Synthesis
• Pre-recorded human speech
  – cut into units, coded, stored (indexed)
  – diphones, triphones
• Given a phonemic transcription
  – rules to select the unit sequence
  – rules to concatenate units based on selection criteria
  – rules to modify duration, amplitude, and pitch, and to smooth the spectrum across junctures
Concatenative Synthesis Issues
• Speech quality varies based on
  – the size and number of units (coverage)
  – the rules for selection and concatenation
  – the speech coding method used to decompose the acoustic signal into spectral, F0, and amplitude parameters
  – how the original signal is modified to produce output that meets the target pattern
Formant Synthesis
• Parameters of the acoustic model: formant frequencies, bandwidths, amplitudes, etc.
• Phonemes have target values for the parameters
• Given a phonemic transcription of the input:
  – rules to select the sequence of targets
  – rules to determine the duration of target values
• Speech quality is not natural
  – the acoustic model is incomplete
  – human knowledge of linguistic and acoustic control rules is incomplete (parameter acquisition by short-term analysis)
Articulatory Synthesis
• Models the articulators: tongue body, tip, jaw, lips, velum, vocal folds, etc. (via 3D X-ray data)
• Rules control the timing of each articulator’s movements
• Coarticulation is easy to model since the articulators are modeled separately
• But it sounds very unnatural
  – the mapping from vocal tract to acoustics is not well understood
  – knowledge of articulator control rules is incomplete
  – model parameter acquisition is an issue
TTS Front End
• Segmentation and combination
• Plain or tagged text; tag analysis
• Word-to-phoneme-sequence conversion
  – English: pronunciation model and rules
  – Chinese: lexicon and decision tree
• Text analysis tools: POS tagger, morphological analyzer, light parsing
Text Normalization
• Context-independent:
  – Mr., 22, $n, USA, VISA
• Context-dependent:
  – Dr., St., 1997, 3/16
• How to resolve abbreviation ambiguities?
  – Dr. (doctor or drive?), PM (? or ?)
  – application restrictions
  – rule- or corpus-based decision procedures
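A rule-based decision procedure of the kind mentioned above can be sketched with a couple of context-sensitive rules; the patterns are illustrative toys, not a production normalizer:

```python
import re

def expand_abbrev(text):
    """Toy context-dependent normalization: 'Dr.' becomes 'Doctor'
    before a capitalized name but 'Drive' after a lowercase word;
    'St.' is handled the same way (Saint vs. Street)."""
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "Doctor ", text)   # Dr. Smith
    text = re.sub(r"(?<=[a-z] )Dr\.", "Drive", text)        # Elm Dr.
    text = re.sub(r"\bSt\.\s+(?=[A-Z])", "Saint ", text)    # St. Paul
    text = re.sub(r"(?<=[a-z] )St\.", "Street", text)       # Main St.
    return text

print(expand_abbrev("Dr. Smith lives on Elm Dr."))
# -> Doctor Smith lives on Elm Drive
```

Real systems layer many such rules, or replace them with a corpus-trained classifier, since capitalization alone is a weak cue.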
Duration Modeling
• How long should each phoneme be? Factors:
  – context phonemes
  – position within the syllable and word
  – number of syllables
  – phrasing
  – stress
  – speaking rate
  – speaking style
Pitch Modeling
• How to create an F0 contour from the accent/phrasing/contour assignment plus the duration assignment and phonemes?
  – contour or target models for accents and phrase boundaries (Fujisaki model, statistical models)
  – rules to align with the phoneme string and smooth
  – how does F0 align with different phonemes?
Prosody Factors
Can Prosody be Modified?
• Model duration and pitch variation
  – the pitch contour could be extracted directly
    • time domain: autocorrelation, peak detection
    • frequency domain: FFT, WT
    • still under research (recent ICASSP papers)
  – common approach: TD-PSOLA (Time-Domain Pitch-Synchronous Overlap and Add)
    • center frames around pitchmarks out to the next pitch period
    • adjust prosody by combining frames at pitchmarks for the desired pitch and duration
    • increase pitch by shrinking the distance between pitchmarks
    • can sound squeaky
TD-PSOLA
Text-to-Speech Architecture
[Diagram: Text-to-Speech pipeline: the input text goes through text analysis (word dictionary), prosody generation (prosody templates, prosody model), and unit concatenation (unit inventory) to produce the speech output.]
Text-to-Speech Summary
• Intelligible, but not very natural
• Many TTS applications now (e.g., e-dictionaries)
• How to model prosody to meet the target?
• Some other approaches
  – large-corpus-based concatenative synthesis
  – synthesis-by-analysis
Natural Language Understanding in Human-Machine Dialog
NLU in H-M Dialog
• The task of NLU in human-machine dialog is to analyze the discourse and then send a message to the DM to act/respond.
• It does not “fully” understand the meaning of the discourse or sentence; it only parses it into classes and determines their attributes and relationships.
Knowledge Issues for NLU
• Knowledge representation: how to organize and describe knowledge
• Knowledge control: how to apply knowledge
• Knowledge integration: how to use the various knowledge sources
• Knowledge acquisition: how to acquire the required knowledge and maintain consistency of the knowledge base
Why Parsing Needed?
• Allows dialogue to be more flexible and open
• Allows “wild card” descriptions
  – I want to fly from “$X” to “$Y”
• Allows out-of-sequence phrases
  – I want to go from Chicago to Dallas today
  – Today I want to go to Dallas from Chicago
• Extracts the needed information
  – fills slots and frames
  – sets the values of global and local variables
Parsing in Dialog
• Knowledge for dialog parsing
  – lexicon
  – parsing rules (grammar)
  – ontology
• Linguistic analysis of discourse
  – syntactic analysis (shallow parsing)
  – semantic analysis (deep parsing)
• Parsing methods
  – whole matching: driven by an FSG
  – partial matching: driven by an SLM
Lexicon Structure
• Lexicon: a list of words and their syntactic and semantic attributes
• Root or stem word form
  – fox, run, Boston
• Optional forms: plurals, tenses
  – fox, foxes
  – run, ran, running
• Part of speech
  – fox: noun
  – run: verb
  – Boston: proper noun
• Link to the ontology
  – fox: animal, brown, furry
  – run: action, move fast
  – Boston: city
Structural Parsing
[Figure: parse tree for “Which is the biggest American city”: POS tags WP VBD DT JJ NNP NN, with an NP and VP under S; the head “city” maps to the semantic class PLACE with the attributes biggest, American.]
Classes-Relationships Parsing
[Figure: semantic class hierarchy for slot filling: top-level classes PERSON, LOCATION, DATE, TIME, PRODUCT, NUMERICAL VALUE, MONEY, ORGANIZATION, MANNER, DEGREE, DIMENSION, RATE, DURATION, PERCENTAGE, COUNT; with sub-classes such as time of day (midnight, prime time, clock time), organization (hockey team, team/squad, institution, financial/educational institution), count (numerosity, integer/whole number, population, denominator), and dimension (thickness, width/breadth, distance/length, altitude, wingspan).]
• Slot filling method
Phoenix Parser
• Designed by W. Ward at CMU in the 1990s
• Used in DARPA Communicator
• Parses a sentence into a sequence of semantic frames
• Parsing as pattern matching and slot filling
  – concept: a set of organized frames
  – frame: a set of organized slots
  – slot: patterns, attributes, context-free grammar
  – pattern: a set of constraints
Phoenix Parser Example
• “I want to go from Boston to Denver Tuesday morning”
• Phoenix parsing result:
  – Flight_Constraint: Depart_Location.City.Boston
  – Flight_Constraint: Arrive_Location.City.Denver
  – Flight_Constraint: [Date_Time].[Date].[Day_Name].tuesday [Time_Range].[Period_Of_Day].morning
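The spirit of this slot-filling parse can be sketched with simple partial matching; the slot names mirror the example above, but the regex patterns are a drastic simplification invented for illustration, not Phoenix's actual grammar formalism:

```python
import re

# Illustrative patterns in the spirit of Phoenix frames/slots.
CITIES = r"(Boston|Denver|Chicago|Dallas)"
PATTERNS = {
    "Depart_Location.City": re.compile(r"from " + CITIES),
    "Arrive_Location.City": re.compile(r"to " + CITIES),
    "Date.Day_Name": re.compile(
        r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
        re.I),
    "Time_Range.Period_Of_Day": re.compile(
        r"\b(morning|afternoon|evening)\b", re.I),
}

def parse(utterance):
    """Partial matching: fill whichever slots the patterns find,
    ignoring everything else in the utterance."""
    slots = {}
    for slot, pat in PATTERNS.items():
        m = pat.search(utterance)
        if m:
            slots[slot] = m.group(1)
    return slots

print(parse("I want to go from Boston to Denver Tuesday morning"))
```

Note how the filler words ("I want to go") are simply skipped: that robustness to unparsed material is what makes partial matching practical for spoken input.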
NLU Summary
• Semantic analysis is still very difficult in NLU due to the ambiguity of natural language
• Today, slot filling is used as a practical parsing technique in H-M dialogue systems and NLU
• The knowledge base and its organization heavily influence parsing performance
Dialog Manager
Tasks of DM
• The DM is the hub of the H-M dialog system; it performs the following functions:
  – controls the interaction between the user and the system
  – decides and plans the system’s action at each step
  – resolves ambiguities in the interpretation from NLU
  – estimates confidence in the extracted information
  – integrates new input with the dialog context/history
  – prompts the user for missing information
  – sends information to NLG for presentation to the user
State-of-the-Art DM
• DM as a decision support system (DSS)
  – decision making based on input, rules, and context
  – decision-making method: decision tree
• Dialog modes
  – directed dialog: current practice
  – free dialog: future
• DM design
  – event driven (DARPA Communicator)
Directed Dialogue
• The computer asks all the questions
  – usually presented as a menu or a set of choices
  – “Do you want your account balance, cleared checks, or deposits?”
• The computer always has the initiative
  – the user just answers questions and never gets to ask any
• The DM avoids asking open-ended questions
  – “What can I do for you?”
• The answers to its questions can be explicitly predicted
  – “Do you want to buy or sell stocks?”
• All possible answers must be pre-defined by the application developer (grammars)
• The job gets done, but it may be tedious and tiresome
CU Communicator DM
• Context
  – a set of frames and a set of global variables
• Event driven
  – an incoming parse triggers a set of actions and modifies the current context
• The DM attempts the following actions in order:
  – clarify if necessary
  – sign off if all jobs are done
  – retrieve data and present it to the user
  – prompt the user for required information
Rules to Prompt
• The rules for deciding what to prompt for next are based on the frame in focus or the last system prompt:
  – if there are unfilled slots in the focus frame, prompt for the highest-priority unfilled slot in the frame
  – if there are no unfilled slots in the focus frame, prompt for the highest-priority missing piece of information in the context
  – the system prompts for whatever information is missing until the frame is complete
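The prompting rule can be sketched over a hypothetical frame with prioritized slots; the slot names and prompt texts are invented for illustration and are not taken from the CU Communicator:

```python
# Hypothetical air-travel frame: (slot, priority, prompt),
# highest priority = lowest number.
AIR_FRAME = [
    ("depart_city", 1, "Where are you departing from?"),
    ("arrive_city", 2, "Where would you like to go?"),
    ("depart_date", 3, "What day would you like to travel?"),
]

def next_prompt(filled):
    """Return the prompt for the highest-priority unfilled slot,
    or None when the frame is complete."""
    for slot, _prio, prompt in sorted(AIR_FRAME, key=lambda x: x[1]):
        if slot not in filled:
            return prompt
    return None

context = {"depart_city": "Boston"}
print(next_prompt(context))    # -> Where would you like to go?
```

Looping this function against user answers until it returns None is exactly the "prompt until the frame is complete" behavior described above.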
Task Frame Example
Frame: Air
  [Depart_Loc]+
    Prompt: "where are you departing from?"
    [City_Name]*
      Confirm: "you are departing from $([City_Name]), is that correct?"
      SQL: "dep_$[leg_num] in (select airport_code from airport_codes where city is like ‘!%’ $(and state_province like ‘[Depart_Loc].[State]’))"
    [Airport_Code]*
Issues of DM
• Though the job can be done, the dialog process may be quite long and tiring
• The dialog process is controlled by the system; users don’t have the initiative
• Exception/strand handling when unexpected information arrives
• The work required to create frames and grammars
Middlewares & Protocols
Middlewares
• Middlewares sit between sub-systems or layers (e.g., MS ODBC sits between VB applications and a DBMS)
• Middlewares are responsible for communication between sub-systems and let them work together as an integrated system
• Middleware design covers
  – the input/output of the related sub-systems
  – protocols for formatting information for communication
  – format conversion
  – example: middleware between ASR and NLU
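As a sketch of such a middleware, an ASR n-best list might be wrapped in an XML message for the NLU component; the element and attribute names here are invented for illustration, not a standard format:

```python
import xml.etree.ElementTree as ET

def asr_to_xml(hypotheses):
    """Format an ASR n-best list as a hypothetical XML message
    for the NLU component."""
    root = ET.Element("asr_result")
    for rank, (text, score) in enumerate(hypotheses, start=1):
        hyp = ET.SubElement(root, "hypothesis",
                            rank=str(rank), score=f"{score:.2f}")
        hyp.text = text
    return ET.tostring(root, encoding="unicode")

msg = asr_to_xml([("show me flights to denver", 0.91),
                  ("show me flights to denmark", 0.42)])
print(msg)
```

The NLU side would parse this message back with `ET.fromstring`, so both components only need to agree on the message schema, not on each other's internals, which is the point of the middleware.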
VoiceXML
• VoiceXML is a Web-oriented voice-application markup language approved by the W3C as a standard
  – some dialog system developers have adopted VoiceXML in their project design and development
• Assumes a telephone as the user’s input/output device
• Assumes voice or key (DTMF) input
• Pre-recorded audio or TTS for output
Web applications using VoiceXML
• VoiceXML uses a voice browser (on a voice gateway) for audio input and output
• Users use a regular phone to access a VoiceXML-based application
VoiceXML Evolution
[Diagram: VoiceXML evolution, 1995-2003: AT&T Bell Labs PML/PhoneWeb (1996-), AT&T PML, Lucent PML, IBM SpeechML, and Motorola VoxML converge into the VoiceXML Forum (with 380+ other companies): VoiceXML 0.9 (8/1999), VoiceXML 1.0 (3/2000), VoiceXML 2.0 working draft (4/2002), then W3C standardization.]
[Diagram: W3C speech interface framework: telephone audio feeds the ASR (with lexicon, N-gram grammar ML, and speech recognition grammar ML) and a DTMF recognizer; language understanding (natural language semantics ML) and context interpretation feed the dialog manager (VoiceXML, call control), which connects to the WWW; output flows through media planning and language generation to pre-recorded audio and TTS (speech synthesis ML, lexicon).]
Relations with other MLs
[Diagram: SGML is the meta-language parent of HTML and XML; XML yields XHTML, VoiceXML, WML, and SALT, on which applications are built.]
• Versions of the MLs: SGML [ISO 8879], XML 1.0, HTML 4.0, XHTML 1.0, VoiceXML 2.0, WML 1.0
• VXML := VoiceXML
VoiceXML Architecture
• Voice (document) server: processes requests received from the VoiceXML interpreter and responds with VoiceXML documents
• VoiceXML browser: interprets the VoiceXML documents it receives from the document server, and generates events in response to user actions and system events
• The browser drives the application’s ASR engine, TTS engine, and DTMF input
VoiceXML Example
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Example 1 for VoiceXML Review -->
  <form>
    <block> Hello, World! </block>
  </form>
</vxml>
VoiceXML Applications
• Query applications
  – information retrieval
    • news, sports, hotels, stock quotes, traffic
  – telephone services
    • voice routing, voice dialing
• Transaction applications
  – e-transactions (e-commerce, e-tailing, etc.)
    • call centers, account status, stock trading
  – intranet
    • inventory, ordering
SALT
• SALT: Speech Application Language Tags
• SALT targets speech applications across a whole spectrum of devices, including telephones, PDAs, tablet computers, and desktop PCs
• SALT supports multi-modal systems
• Input is assumed to come from speech recognition, keyboard or keypad, or mouse
• Output goes to the screen or to a speaker (speech)
• Both VoiceXML and SALT are markup languages that describe a speech interface; the main difference is the assumed device.
SALT Code
<!-- Speech Application Language Tags -->
<salt:prompt id="askOriginCity"> Where would you like to leave from? </salt:prompt>
<salt:prompt id="askDestCity"> Where would you like to go to? </salt:prompt>
<salt:prompt id="sayDidntUnderstand" onComplete="runAsk()">
  Sorry, I didn't understand. </salt:prompt>
<salt:listen id="recoOriginCity"
    onReco="procOriginCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
<salt:listen id="recoDestCity"
    onReco="procDestCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
Conclusions
• Human-machine dialogue is now possible for some restricted subjects, such as stock brokerages and travel agencies, but it is far from convenient and satisfactory.
• Artificial Intelligence and natural language technology have made rapid advances and promoted human-machine dialogue R&D and many conversational applications.
• Machine intelligence goes beyond human-machine dialogue.
• Research on human-machine dialogue will benefit and enrich computer science.
References
• Speech & Language Processing, Jurafsky & Martin, Prentice Hall, 2000
• Spoken Language Processing, X. D. Huang et al., Prentice Hall, 2000
• Statistical Methods for Speech Recognition, Jelinek, MIT Press, 1999
• Foundations of Statistical Natural Language Processing, Manning & Schutze, MIT Press, 1999
• Fundamentals of Speech Recognition, L. R. Rabiner and B. H. Juang, Prentice-Hall, 1993
• Dr. J. Picone, Speech Website, www.isip.msstate.edu
Thanks