Post on 18-Dec-2015
constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus
Computer Science Department, Carnegie Mellon University
Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus | [email protected]
2
problem
spoken language interfaces lack robustness when faced with understanding errors
errors stem mostly from speech recognition
typical word error rates: 20-30%
significant negative impact on interactions
3
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
4
two types of understanding errors
NONunderstanding
MISunderstanding
5
approaches for increasing robustness
gracefully handle errors through interaction
improve recognition
1. detect the problems
2. develop a set of recovery strategies
3. know how to choose between them (policy)
7
construct more accurate beliefs by integrating information over multiple turns in a conversation
today’s talk …
detection
misunderstandings
S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
8
belief updating: problem statement
S: traveling to Seoul. What day did you need to travel?
destination = {seoul/0.65}
destination = {?}
[THE TRAVELING TO BERLIN P_M / 0.60]
given
  an initial belief Pinitial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  Pupdated(C) ← f (Pinitial(C), SA, R)
9
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
10
current solutions
most systems only track values, not beliefs: new values overwrite old values
use confidence scores
explicit confirm + yes → trust hypothesis
explicit confirm + no → delete hypothesis
explicit confirm + "other" → non-understanding
implicit confirm: not much
"users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al, 2002]
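The explicit-confirmation heuristic above can be sketched as a simple update rule (a minimal sketch with hypothetical names, not the actual system implementation):

```python
def heuristic_update(top_hypothesis, response_type):
    """Typical heuristic belief update after an explicit confirmation.

    top_hypothesis: the current top hypothesis for the concept.
    response_type: 'yes', 'no', or 'other' (anything else the user said).
    """
    if response_type == "yes":
        return top_hypothesis       # trust the confirmed hypothesis
    if response_type == "no":
        return None                 # delete the hypothesis
    return "NONUNDERSTANDING"       # treat 'other' as a non-understanding
```

Note that nothing here uses the recognizer's confidence score or the content of an "other" response; that is exactly the information the belief-updating approach later exploits.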
related work : restricted version : data : user response analysis : results : current and future work
11
confidence / detecting misunderstandings
traditionally focused on word-level errors [Chase, Cox, Bansal, Ravinshankar, and many others]
recently: detecting misunderstandings[Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
machine learning approach: binary classification in-domain, labeled dataset features from different knowledge sources
acoustic, language model, parsing, dialog management
~50% relative reduction in classification error
12
detecting corrections
detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
machine learning approach binary classification in-domain, labeled dataset features from different knowledge sources
acoustic, prosody, language model, parsing, dialog management
~50% relative reduction in classification error
13
integration
confidence annotation and correction detection are useful tools
but separately, neither solves the problem
bridge together in a unified approach to accurately track beliefs
14
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
15
belief updating: general form
given
  an initial belief Pinitial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  Pupdated(C) ← f (Pinitial(C), SA, R)
16
two simplifications

1. belief representation
system unlikely to "hear" more than 3 or 4 values for a concept within a dialog session
in our data [considering only top hypothesis from recognition]:
  max = 3 (conflicting values heard)
  only in 6.9% of cases, more than 1 value heard
compressed beliefs: top-K concept hypotheses + other
for now, K = 1

2. updates following system confirmation actions
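The compressed top-K + other representation can be illustrated directly (a minimal sketch; assumes a belief is given as a dict from hypothesized values to probabilities):

```python
def compress_belief(belief, k=1):
    """Compress a belief over concept values into the top-k hypotheses
    plus a catch-all 'other' mass covering everything else."""
    ranked = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:k])
    # 'other' absorbs the remaining probability mass, including values
    # the system never heard at all
    top["other"] = round(1.0 - sum(top.values()), 6)
    return top

# e.g. compress_belief({"boston": 0.65, "austin": 0.11, "aspen": 0.04})
# keeps boston and lumps the rest into 'other'
```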
17
belief updating: reduced version

given
  an initial confidence score Confinit(thC) for the current top hypothesis of concept C
  a system confirmation action SA
  a user response R
construct an updated confidence score for that hypothesis
  Confupd(thC) ← f (Confinit(thC), SA, R)

{boston/0.65; austin/0.11; … } + ExplicitConfirm( Boston ) + [NOW] → {boston/ ?}
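As an interface, the reduced update is just a function from these three inputs to a probability. A toy sketch (illustrative weights and feature encoding only, standing in for a fitted model):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def updated_confidence(conf_init, said_yes, said_no, weights):
    """Reduced belief update: combine the initial confidence with simple
    user-response indicators into an updated probability that the top
    hypothesis is correct. `weights` = (bias, w_conf, w_yes, w_no) is a
    stand-in for a fitted logistic regression model."""
    bias, w_conf, w_yes, w_no = weights
    z = bias + w_conf * conf_init + w_yes * said_yes + w_no * said_no
    return logistic(z)

# illustrative weights only: a 'yes' raises confidence, a 'no' lowers it
w = (0.0, 2.0, 3.0, -4.0)
```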
18
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
19
I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?
data
collected with RoomLine, a phone-based mixed-initiative spoken dialog system for conference room reservation
explicit and implicit confirmations
confidence threshold model (+ some exploration)
unplanned implicit confirmations
20
corpus
user study
  46 participants (naïve users)
  10 scenario-based interactions each
  compensated per task success
corpus
  449 sessions, 8848 user turns
  orthographically transcribed
  manually annotated: misunderstandings, corrections, correct concept values
21
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
22
user response types
following [Krahmer and Swerts, 2000]
study on a Dutch train-timetable information system
3 user response types:
  YES: yes, right, that's right, correct, etc.
  NO: no, wrong, etc.
  OTHER
cross-tabulated against correctness of system confirmations
23
user responses to explicit confirmations
                       YES        NO         Other
CORRECT                94% [93%]  0% [0%]    5% [7%]
INCORRECT (~10%)       1% [6%]    72% [57%]  27% [37%]

[numbers in brackets from Krahmer & Swerts]
24
other responses to explicit confirmations
~70% of users repeat the correct value
~15% of users don't address the question (attempt to shift the conversation focus)

how often do users correct the system?

            User does not correct   User corrects
CORRECT     1159                    0
INCORRECT   29 [10% of incor]       250 [90% of incor]
25
user responses to implicit confirmations
            YES        NO         Other
CORRECT     30% [0%]   7% [0%]    63% [100%]
INCORRECT   6% [0%]    33% [15%]  61% [85%]

[numbers in brackets from Krahmer & Swerts]
26
ignoring errors in implicit confirmations
how often do users correct the system?

            User does not correct   User corrects
CORRECT     552                     2
INCORRECT   118 [51% of incor]      111 [49% of incor]

explanation:
  users correct later (40% of the 118)
  users interact strategically / correct only if essential

            ~correct later   correct later
~critical   55               2
critical    14               47
27
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
28
machine learning approach
problem: Confupd(thC) ← f (Confinit(thC), SA, R)
need good probability outputs
  low cross-entropy between model predictions and reality
logistic regression
  sample efficient
  stepwise approach → feature selection
logistic model tree: one for each action; root splits on response type
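The logistic-model-tree idea, a root split on response type with a separate logistic model per branch, can be sketched as follows (weights and the single remaining feature are illustrative, not the trained model):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# one stand-in logistic model per response type; weights are illustrative
LEAF_MODELS = {
    "yes":   (2.5, 1.0),   # (bias, weight on initial confidence)
    "no":    (-3.0, 1.0),
    "other": (-0.5, 2.0),
}

def lmt_updated_confidence(response_type, conf_init):
    """Logistic model tree for one system action: the root splits on the
    user's response type, and each leaf applies its own logistic model
    to the remaining features (here, just the initial confidence)."""
    bias, w = LEAF_MODELS[response_type]
    return logistic(bias + w * conf_init)
```

Splitting at the root first lets each leaf model specialize: the evidence carried by, say, the prosody of a "no" differs from that of an unrelated "other" response.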
29
features. target.
Initial: initial confidence score of top hypothesis, # of initial hypotheses, concept type (bool / non-bool), concept identity
System action: indicators describing other system actions in conjunction with the current confirmation
User response
  Acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
  Lexical: number of words, lexical terms highly correlated with corrections (MI)
  Grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
  Dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in

target: was the top hypothesis correct?
30
baselines
initial baseline: accuracy of system beliefs before the update
heuristic baseline: accuracy of the heuristic update rule used by the system
oracle baseline: accuracy if we knew exactly what the user said
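The two metrics reported in the results that follow can be computed as below (a minimal sketch, assuming hard error is the classification error of the thresholded confidence and soft error is the average cross-entropy / negative log-likelihood):

```python
import math

def hard_error(confidences, labels, threshold=0.5):
    """Fraction of cases where the thresholded confidence disagrees
    with whether the hypothesis was actually correct."""
    wrong = sum((c >= threshold) != bool(y)
                for c, y in zip(confidences, labels))
    return wrong / len(labels)

def soft_error(confidences, labels, eps=1e-12):
    """Average cross-entropy (negative log-likelihood) between the
    predicted confidences and the 0/1 ground truth."""
    total = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(c) + (1 - y) * math.log(1 - c))
    return total / len(labels)
```

Soft error is the more demanding metric: it rewards well-calibrated probabilities, not just correct decisions at the threshold.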
31
results: explicit confirmation
Hard error (%): initial 31.15 · heuristic 8.41 · logistic model tree 3.57 · oracle 2.71
Soft error: initial 0.51 · heuristic 0.19 · logistic model tree 0.12

[bar charts: initial / heuristic / logistic model tree / oracle]
32
results: implicit confirmation
Hard error (%): initial 30.40 · heuristic 23.37 · logistic model tree 16.15 · oracle 15.33
Soft error: initial 0.61 · heuristic 0.67 · logistic model tree 0.43

[bar charts: initial / heuristic / logistic model tree / oracle]
33
results: unplanned implicit confirmation
Hard error (%): initial 15.40 · heuristic 14.36 · logistic model tree 12.64 · oracle 10.37
Soft error: initial 0.43 · heuristic 0.46 · logistic model tree 0.34

[bar charts: initial / heuristic / logistic model tree / oracle]
34
informative features
initial confidence score
prosody features
barge-in
expectation match
repeated grammar slots
concept identity
35
summary
data-driven approach for constructing accurate system beliefs
  integrates information across multiple turns
  bridges detection of misunderstandings and detection of corrections
  performs better than current heuristics
user response analysis: users don't correct unless the error is critical
36
outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
37
current extensions
belief representation: top hypothesis + other, logistic regression model → k hyps + other, multinomial GLM
system action: confirmation actions → all actions: confirmation (expl/impl), request, unexpected
features: added priors
38
2 hypotheses + other

[bar charts: hard error per system action — explicit confirmation, implicit confirmation, request, unexpected update, unplanned impl. conf.; bars: initial, heuristic, lmt(basic), lmt(basic+concept), oracle]
39
other work

detection · strategies · policy

detection: misunderstandings, non-understandings
costs for errors
rejection threshold adaptation
non-understanding impact on performance [Interspeech-05]
comparative analysis of 10 recovery strategies [SIGdial-05]
impact of policy on performance
towards learning non-understanding recovery policies [SIGdial-05]
belief updating [ASRU-05]
transferring confidence annotators across domains [in progress]
RavenClaw: dialog management for task-oriented systems - RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime [EuroSpeech-03, HLT-05]
41
a more subtle caveat: distribution of training data

training data distribution: confidence annotator + heuristic update rules
run-time data distribution: confidence annotator + learned model
always a problem when interacting with the world!
hopefully, the distribution shift will not cause a large degradation in performance
  remains to be validated empirically
  maybe a bootstrap approach?
42
KL-divergence & cross-entropy

KL divergence:
  D(p||q) = Σx p(x) log ( p(x) / q(x) )

Cross-entropy:
  CH(p, q) = H(p) + D(p||q) = − Σx p(x) log q(x)

Negative log likelihood:
  NLL(q) = − Σ log q(x)
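These identities can be checked numerically (a minimal sketch over discrete distributions given as lists of probabilities):

```python
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CH(p, q) = -sum_x p(x) log q(x) = H(p) + D(p||q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```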
43
logistic regression

regression model for binomial (binary) dependent variables

  P(x=1 | f) = 1 / (1 + e^(−w·f))

equivalently, the log-odds are linear in the features:

  log [ p(x=1) / p(x=0) ] = w·f

fit a model using max likelihood (avg log-likelihood); any stats package will do it for you
no R2 measure; test fit using the "likelihood ratio" test
stepwise logistic regression:
  keep adding variables while data likelihood increases significantly
  use the Bayesian information criterion to avoid overfitting
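The model and its log-odds form can be written out in a few lines (a minimal sketch; weight and feature vectors are plain lists):

```python
import math

def p_logistic(w, f):
    """P(x=1 | f) = 1 / (1 + e^(-w.f)) for weights w, features f."""
    z = sum(wi * fi for wi, fi in zip(w, f))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(w, f):
    """log [ p(x=1) / p(x=0) ] = w.f -- the linear form that is
    fitted by maximum likelihood."""
    return sum(wi * fi for wi, fi in zip(w, f))
```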
44
logistic regression
[plot: fitted logistic curve of P(Task Success = 1) vs. % non-understandings (FNON), 0–50%]
45
logistic model tree

regression tree, but with logistic models on the leaves

[diagram: a tree splitting first on f (f=0 vs. f=1), then on g (g<=10 vs. g>10), with a fitted logistic curve of P(Task Success = 1) vs. % non-understandings (FNON) at each leaf]