Post on 18-Dec-2015
constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus
Computer Science Department, Carnegie Mellon University
Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus | [email protected]
2
problem
spoken language interfaces lack robustness when faced with understanding errors
errors stem mostly from speech recognition
typical word error rates: 20-30%
significant negative impact on interactions
3
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
4
two types of understanding errors
NONunderstanding
MISunderstanding
5
approaches for increasing robustness
gracefully handle errors through interaction
improve recognition
1. detect the problems
2. develop a set of recovery strategies
3. know how to choose between them (policy)
7
construct more accurate beliefs by integrating information over multiple turns in a conversation
today’s talk …
detection
misunderstandings
S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
8
belief updating: problem statement
S: traveling to Seoul. What day did you need to travel?
destination = {seoul/0.65}
destination = {?}
[THE TRAVELING TO BERLIN P_M / 0.60]
given
  an initial belief Pinitial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  Pupdated(C) ← f (Pinitial(C), SA, R)
9
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
10
current solutions
most systems only track values, not beliefs: new values overwrite old values
use confidence scores
explicit confirm + yes → trust hypothesis
explicit confirm + no → delete hypothesis
explicit confirm + "other" → non-understanding
implicit confirm: not much
"users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al, 2002]
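The explicit-confirmation heuristic above can be sketched as a simple update rule (a minimal sketch with hypothetical names, not the actual system implementation):

```python
def heuristic_update(top_hypothesis, response_type):
    """Typical heuristic belief update after an explicit confirmation.

    top_hypothesis: the current top hypothesis for the concept.
    response_type: 'yes', 'no', or 'other' (anything else the user said).
    """
    if response_type == "yes":
        return top_hypothesis       # trust the confirmed hypothesis
    if response_type == "no":
        return None                 # delete the hypothesis
    return "NONUNDERSTANDING"       # treat 'other' as a non-understanding
```

Note that nothing here uses the recognizer's confidence score or the content of an "other" response; that is exactly the information the belief-updating approach later exploits.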
related work : restricted version : data : user response analysis : results : current and future work
11
confidence / detecting misunderstandings
traditionally focused on word-level errors [Chase, Cox, Bansal, Ravinshankar, and many others]
recently: detecting misunderstandings[Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
machine learning approach: binary classification in-domain, labeled dataset features from different knowledge sources
acoustic, language model, parsing, dialog management
~50% relative reduction in classification error
12
detecting corrections
detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
machine learning approach binary classification in-domain, labeled dataset features from different knowledge sources
acoustic, prosody, language model, parsing, dialog management
~50% relative reduction in classification error
13
integration
confidence annotation and correction detection are useful tools
but separately, neither solves the problem
bridge together in a unified approach to accurately track beliefs
14
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
15
belief updating: general form
given
  an initial belief Pinitial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  Pupdated(C) ← f (Pinitial(C), SA, R)
16
two simplifications

1. belief representation
system unlikely to "hear" more than 3 or 4 values for a concept within a dialog session
in our data [considering only top hypothesis from recognition]:
  max = 3 (conflicting values heard)
  only in 6.9% of cases, more than 1 value heard
compressed beliefs: top-K concept hypotheses + other
for now, K = 1

2. updates following system confirmation actions
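The compressed top-K + other representation can be illustrated directly (a minimal sketch; assumes a belief is given as a dict from hypothesized values to probabilities):

```python
def compress_belief(belief, k=1):
    """Compress a belief over concept values into the top-k hypotheses
    plus a catch-all 'other' mass covering everything else."""
    ranked = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:k])
    # 'other' absorbs the remaining probability mass, including values
    # the system never heard at all
    top["other"] = round(1.0 - sum(top.values()), 6)
    return top

# e.g. compress_belief({"boston": 0.65, "austin": 0.11, "aspen": 0.04})
# keeps boston and lumps the rest into 'other'
```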
17
belief updating: reduced version

given
  an initial confidence score Confinit(thC) for the current top hypothesis of concept C
  a system confirmation action SA
  a user response R
construct an updated confidence score for that hypothesis
  Confupd(thC) ← f (Confinit(thC), SA, R)

{boston/0.65; austin/0.11; … } + ExplicitConfirm( Boston ) + [NOW] → {boston/ ?}
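As an interface, the reduced update is just a function from these three inputs to a probability. A toy sketch (illustrative weights and feature encoding only, standing in for a fitted model):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def updated_confidence(conf_init, said_yes, said_no, weights):
    """Reduced belief update: combine the initial confidence with simple
    user-response indicators into an updated probability that the top
    hypothesis is correct. `weights` = (bias, w_conf, w_yes, w_no) is a
    stand-in for a fitted logistic regression model."""
    bias, w_conf, w_yes, w_no = weights
    z = bias + w_conf * conf_init + w_yes * said_yes + w_no * said_no
    return logistic(z)

# illustrative weights only: a 'yes' raises confidence, a 'no' lowers it
w = (0.0, 2.0, 3.0, -4.0)
```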
18
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
19
I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?
data
collected with RoomLine, a phone-based mixed-initiative spoken dialog system for conference room reservation
explicit and implicit confirmations
confidence threshold model (+ some exploration)
unplanned implicit confirmations
20
corpus
user study
  46 participants (naïve users)
  10 scenario-based interactions each
  compensated per task success
corpus
  449 sessions, 8848 user turns
  orthographically transcribed
  manually annotated: misunderstandings, corrections, correct concept values
21
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
22
user response types
following [Krahmer and Swerts, 2000]
study on a Dutch train-timetable information system
3 user response types:
  YES: yes, right, that's right, correct, etc.
  NO: no, wrong, etc.
  OTHER
cross-tabulated against correctness of system confirmations
23
user responses to explicit confirmations
                       YES        NO         Other
CORRECT                94% [93%]  0% [0%]    5% [7%]
INCORRECT (~10%)       1% [6%]    72% [57%]  27% [37%]

[numbers in brackets from Krahmer & Swerts]
24
other responses to explicit confirmations
~70% of users repeat the correct value
~15% of users don't address the question (attempt to shift the conversation focus)

how often do users correct the system?

            User does not correct   User corrects
CORRECT     1159                    0
INCORRECT   29 [10% of incor]       250 [90% of incor]
25
user responses to implicit confirmations
            YES        NO         Other
CORRECT     30% [0%]   7% [0%]    63% [100%]
INCORRECT   6% [0%]    33% [15%]  61% [85%]

[numbers in brackets from Krahmer & Swerts]
26
ignoring errors in implicit confirmations
how often do users correct the system?

            User does not correct   User corrects
CORRECT     552                     2
INCORRECT   118 [51% of incor]      111 [49% of incor]

explanation:
  users correct later (40% of the 118)
  users interact strategically / correct only if essential

            ~correct later   correct later
~critical   55               2
critical    14               47
27
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
28
machine learning approach
problem: Confupd(thC) ← f (Confinit(thC), SA, R)
need good probability outputs
  low cross-entropy between model predictions and reality
logistic regression
  sample efficient
  stepwise approach → feature selection
logistic model tree: one for each action; root splits on response type
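The logistic-model-tree idea, a root split on response type with a separate logistic model per branch, can be sketched as follows (weights and the single remaining feature are illustrative, not the trained model):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# one stand-in logistic model per response type; weights are illustrative
LEAF_MODELS = {
    "yes":   (2.5, 1.0),   # (bias, weight on initial confidence)
    "no":    (-3.0, 1.0),
    "other": (-0.5, 2.0),
}

def lmt_updated_confidence(response_type, conf_init):
    """Logistic model tree for one system action: the root splits on the
    user's response type, and each leaf applies its own logistic model
    to the remaining features (here, just the initial confidence)."""
    bias, w = LEAF_MODELS[response_type]
    return logistic(bias + w * conf_init)
```

Splitting at the root first lets each leaf model specialize: the evidence carried by, say, the prosody of a "no" differs from that of an unrelated "other" response.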
29
features. target.
Initial: initial confidence score of top hypothesis, # of initial hypotheses, concept type (bool / non-bool), concept identity
System action: indicators describing other system actions in conjunction with the current confirmation
User response
  Acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
  Lexical: number of words, lexical terms highly correlated with corrections (MI)
  Grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
  Dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in

target: was the top hypothesis correct?
30
baselines
initial baseline: accuracy of system beliefs before the update
heuristic baseline: accuracy of the heuristic update rule used by the system
oracle baseline: accuracy if we knew exactly what the user said
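The two metrics reported in the results that follow can be computed as below (a minimal sketch, assuming hard error is the classification error of the thresholded confidence and soft error is the average cross-entropy / negative log-likelihood):

```python
import math

def hard_error(confidences, labels, threshold=0.5):
    """Fraction of cases where the thresholded confidence disagrees
    with whether the hypothesis was actually correct."""
    wrong = sum((c >= threshold) != bool(y)
                for c, y in zip(confidences, labels))
    return wrong / len(labels)

def soft_error(confidences, labels, eps=1e-12):
    """Average cross-entropy (negative log-likelihood) between the
    predicted confidences and the 0/1 ground truth."""
    total = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(c) + (1 - y) * math.log(1 - c))
    return total / len(labels)
```

Soft error is the more demanding metric: it rewards well-calibrated probabilities, not just correct decisions at the threshold.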
31
results: explicit confirmation
Hard error (%): initial 31.15 · heuristic 8.41 · logistic model tree 3.57 · oracle 2.71
Soft error: initial 0.51 · heuristic 0.19 · logistic model tree 0.12

[bar charts: initial / heuristic / logistic model tree / oracle]
32
results: implicit confirmation
Hard error (%): initial 30.40 · heuristic 23.37 · logistic model tree 16.15 · oracle 15.33
Soft error: initial 0.61 · heuristic 0.67 · logistic model tree 0.43

[bar charts: initial / heuristic / logistic model tree / oracle]
33
results: unplanned implicit confirmation
Hard error (%): initial 15.40 · heuristic 14.36 · logistic model tree 12.64 · oracle 10.37
Soft error: initial 0.43 · heuristic 0.46 · logistic model tree 0.34

[bar charts: initial / heuristic / logistic model tree / oracle]
34
informative features
initial confidence score
prosody features
barge-in
expectation match
repeated grammar slots
concept identity
35
summary
data-driven approach for constructing accurate system beliefs
  integrates information across multiple turns
  bridges detection of misunderstandings and detection of corrections
  performs better than current heuristics
user response analysis: users don't correct unless the error is critical
36
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
37
current extensions
belief representation: top hypothesis + other, logistic regression model → k hyps + other, multinomial GLM
system action: confirmation actions → all actions: confirmation (expl/impl), request, unexpected
features: added priors
38
2 hypotheses + other

[bar charts: hard error per system action — explicit confirmation, implicit confirmation, request, unexpected update, unplanned impl. conf.; bars: initial, heuristic, lmt(basic), lmt(basic+concept), oracle]
39
other work

detection · strategies · policy

detection: misunderstandings, non-understandings
costs for errors
rejection threshold adaptation
non-understanding impact on performance [Interspeech-05]
comparative analysis of 10 recovery strategies [SIGdial-05]
impact of policy on performance
towards learning non-understanding recovery policies [SIGdial-05]
belief updating [ASRU-05]
transferring confidence annotators across domains [in progress]
RavenClaw: dialog management for task-oriented systems - RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime [EuroSpeech-03, HLT-05]
41
a more subtle caveat: distribution of training data

training data distribution: confidence annotator + heuristic update rules
run-time data distribution: confidence annotator + learned model
always a problem when interacting with the world!
hopefully, the distribution shift will not cause a large degradation in performance
  remains to be validated empirically
  maybe a bootstrap approach?
42
KL-divergence & cross-entropy

KL divergence:
  D(p||q) = Σx p(x) log ( p(x) / q(x) )

Cross-entropy:
  CH(p, q) = H(p) + D(p||q) = − Σx p(x) log q(x)

Negative log likelihood:
  NLL(q) = − Σ log q(x)
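These identities can be checked numerically (a minimal sketch over discrete distributions given as lists of probabilities):

```python
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CH(p, q) = -sum_x p(x) log q(x) = H(p) + D(p||q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```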
43
logistic regression

regression model for binomial (binary) dependent variables

  P(x=1 | f) = 1 / (1 + e^(−w·f))

equivalently, the log-odds are linear in the features:

  log [ p(x=1) / p(x=0) ] = w·f

fit a model using max likelihood (avg log-likelihood); any stats package will do it for you
no R2 measure; test fit using the "likelihood ratio" test
stepwise logistic regression:
  keep adding variables while data likelihood increases significantly
  use the Bayesian information criterion to avoid overfitting
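The model and its log-odds form can be written out in a few lines (a minimal sketch; weight and feature vectors are plain lists):

```python
import math

def p_logistic(w, f):
    """P(x=1 | f) = 1 / (1 + e^(-w.f)) for weights w, features f."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(w, f):
    """log [ p(x=1) / p(x=0) ] = w.f -- the linear form that is
    fitted by maximum likelihood."""
    return sum(wi * fi for wi, fi in zip(w, f))
```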
44
logistic regression
[plot: fitted logistic curve of P(Task Success = 1) vs. % non-understandings (FNON), 0–50%]
45
logistic model tree

regression tree, but with logistic models on the leaves

[diagram: a tree splitting first on f (f=0 vs. f=1), then on g (g<=10 vs. g>10), with a fitted logistic curve of P(Task Success = 1) vs. % non-understandings (FNON) at each leaf]