Cost of Misunderstandings
Modeling the Cost of Misunderstanding Errors in the CMU Communicator Dialog System
Presented by: Dan Bohus (dbohus@cs.cmu.edu)
Work by: Dan Bohus, Alex Rudnicky
Carnegie Mellon University, 2001
11-04-01 Modeling the cost of misunderstanding … 2
Outline
- Quick overview of previous utterance-level confidence annotation work
- Modeling the cost of misunderstandings in spoken dialog systems
- Experiments & results
- Further analysis
- Summary, further work, conclusion
Utterance-Level Confidence Annotation Overview
- Confidence annotation = data-driven classification
- Corpus: 2 months, 131 dialogs, 4550 utterances
- Features: 12 features from the decoder, parsing, and dialog management levels
- Classifiers: Decision Tree, ANN, BayesNet, AdaBoost, NaiveBayes, SVM + Logistic Regression model (later on)
Confidence Annotator Performance
- Baseline error rate: 32%
- Garble baseline: 25%
- Classifier performance: 16%
- Differences between classifiers are statistically insignificant, except for Naïve Bayes
- On a soft metric, the logistic regression model clearly outperformed the others
- But is this the right way to evaluate performance?
Judging Performance
- Classification error rate (FP + FN) implicitly assumes that FP and FN errors have the same cost
- But the cost of a misunderstanding in a dialog system is presumably different for FPs and FNs
- Instead, build an error function which takes these costs into account, and optimize for that
- Cost also depends on the domain/system (not a problem, since we model per system) and on dialog state
Problem Formulation
(1) Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
(2) Use the costs to pick the optimal tradeoff point on the classifier ROC.
[Figure: false negative, false positive, and total error rates (0 to 0.7) vs. boosting threshold (-2 to 2)]
The Cost Model
- Model the impact of the FPs and FNs on system performance
- Identify a suitable performance metric P
- Build a statistical regression model at the dialog session level: P = f(FPs, FNs)
- P = k + CostFP*FP + CostFN*FN (linear regression)
- Then we can plot f, and implicitly optimize for P
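The session-level regression above can be sketched as follows. Everything here is a hypothetical stand-in, not the Communicator corpus: the per-session FP/FN counts, the "true" costs (-2 per FP, -1 per FN), and the noise level are synthetic, chosen only to show the fitting step.

```python
import numpy as np

# Hypothetical per-session counts of false positives and false negatives.
rng = np.random.default_rng(0)
FP = rng.integers(0, 10, size=50)
FN = rng.integers(0, 10, size=50)

# Suppose the "true" costs were -2.0 per FP and -1.0 per FN (plus noise).
P = 5.0 - 2.0 * FP - 1.0 * FN + rng.normal(0, 0.1, size=50)

# Least-squares fit of P = k + CostFP*FP + CostFN*FN.
X = np.column_stack([np.ones_like(FP), FP, FN]).astype(float)
k, cost_fp, cost_fn = np.linalg.lstsq(X, P, rcond=None)[0]
print(round(k, 1), round(cost_fp, 1), round(cost_fn, 1))
```

The fitted coefficients recover the planted costs; on real data the signs and magnitudes are what carry the interpretation.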
Measuring Performance
- User satisfaction (e.g. a 5-point scale): hard to get; very subjective ~ hard to make it consistent across users
- Concept transfer efficiency: CTC = correctly transferred concepts per turn; ITC = incorrectly transferred concepts per turn
- Completion
Detour: The Dataset
- 134 dialogs (2561 utterances), collected using 4 scenarios
- Satisfaction scores available for only 35 dialogs
- Corpus manually labeled at the concept level; 4 labels: OK / RBAD / PBAD / OOD
- Aggregate utterance labels generated
- Confidence annotator decisions logged
- Computed counts of FPs, FNs, CTCs, ITCs for each session
Example
U: I want to fly from Pittsburgh to Boston
S: I want to fly from Pittsburgh to Austin
C: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]

Only 2 relevantly expressed concepts.
If Accept: CTC = 1, ITC = 1. If Reject: CTC = 0, ITC = 0.
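This bookkeeping can be sketched in a few lines; the helper function and label tuples are illustrative, not taken from the system:

```python
# Hypothetical concept-level labels for the relevant concepts of the turn
# above (I_want is not a relevantly expressed concept, so it is excluded).
concepts = [("Depart_Loc", "OK"), ("Arrive_Loc", "RBAD")]

def transfer_counts(concepts, accept):
    """Return (CTC, ITC) for one turn: if the utterance is rejected,
    nothing is transferred, correctly or otherwise."""
    if not accept:
        return 0, 0
    ctc = sum(1 for _, label in concepts if label == "OK")
    itc = sum(1 for _, label in concepts if label != "OK")
    return ctc, itc

assert transfer_counts(concepts, accept=True) == (1, 1)
assert transfer_counts(concepts, accept=False) == (0, 0)
```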
Targeting Efficiency: Model 1
3 successively refined models.
CTC = FP + FN + TN + k
- CTC: correctly transferred concepts / turn
- TN: true negatives

Model              R2 all   R2 train   R2 test
CTC = FP+FN+TN     0.81     0.81       0.73
Targeting Efficiency: Model 2
CTC - ITC = (REC +) FP + FN + TN + k
- ITC: incorrectly transferred concepts / turn
- REC: relevantly expressed concepts

Model                      R2 all   R2 train   R2 test
CTC = FP+FN+TN             0.81     0.81       0.73
CTC-ITC = FP+FN+TN         0.86     0.86       0.78
CTC-ITC = REC+FP+FN+TN     0.89     0.89       0.83
Targeting Efficiency: Model 3
CTC - ITC = REC + FPC + FPNC + FN + TN + k
2 types of FPs:
- With concepts: FPC
- Without concepts: FPNC

Model                           R2 all   R2 train   R2 test
CTC = FP+FN+TN                  0.81     0.81       0.73
CTC-ITC = FP+FN+TN              0.86     0.86       0.78
CTC-ITC = REC+FP+FN+TN          0.89     0.89       0.83
CTC-ITC = REC+FPC+FPNC+FN+TN    0.94     0.94       0.90
Model 3 - Results
CTC - ITC = REC + FPC + FPNC + FN + TN + k

Coefficient   Value
k              0.41
C_REC          0.62
C_FPNC        -0.48
C_FPC         -2.12
C_FN          -1.33
C_TN          -0.55
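With the fitted coefficients, the model's prediction for a session is just a weighted sum of the per-turn counts. A minimal sketch, using the slide's coefficients; the session counts plugged in are hypothetical:

```python
# Coefficients reported for Model 3 (CTC - ITC per turn).
coef = {"k": 0.41, "REC": 0.62, "FPNC": -0.48,
        "FPC": -2.12, "FN": -1.33, "TN": -0.55}

def predicted_efficiency(counts):
    """Predicted CTC - ITC per turn, given per-turn counts for a session."""
    return coef["k"] + sum(coef[name] * v for name, v in counts.items())

# Hypothetical error-free session: one relevantly expressed concept per
# turn, no misunderstandings of any kind.
print(round(predicted_efficiency(
    {"REC": 1.0, "FPNC": 0, "FPC": 0, "FN": 0, "TN": 0}), 2))  # → 1.03
```

The signs match intuition: relevantly expressed concepts help, concept-bearing false acceptances (FPC) hurt the most.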
Other Models
- Completion (binary): logistic regression model; the estimated model does not indicate a good fit
- User satisfaction (5-point scale): based on only 35 dialogs; R2 = 0.61 (similar to the literature - Walker et al.); explanation: subjectivity of the metric + limited dataset
Problem Formulation
(1) Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
(2) Use the costs to pick the optimal tradeoff point on the classifier ROC.
[Figure: false negative, false positive, and total error rates (0 to 0.7) vs. boosting threshold (-2 to 2)]
Tuning the Confidence Annotator
- Using Model 3: CTC - ITC = REC + FPNC + FPC + FN + TN + k
- Drop k & REC, plug in the values: Cost = 0.48*FPNC + 2.12*FPC + 1.33*FN + 0.56*TN
- Minimize Cost instead of classification error rate (FP + FN), and we'll implicitly maximize concept transfer efficiency.
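A minimal sketch of this tuning loop. The (score, label) pairs are synthetic, and FPC/FPNC are collapsed into a single FP term charged at the concept-bearing cost, so the exact numbers are illustrative only:

```python
# Hypothetical (confidence score, utterance-is-OK) pairs.
data = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
        (0.4, False), (0.3, True), (0.2, False), (0.1, False)]

def session_cost(fp, fn, tn, c_fp=2.12, c_fn=1.33, c_tn=0.56):
    """Cost in the spirit of Model 3 (k and REC dropped); all FPs are
    treated as concept-bearing here, a simplifying assumption."""
    return c_fp * fp + c_fn * fn + c_tn * tn

def counts(threshold):
    """FP/FN/TN counts when accepting every utterance scoring >= threshold."""
    fp = sum(1 for s, ok in data if s >= threshold and not ok)
    fn = sum(1 for s, ok in data if s < threshold and ok)
    tn = sum(1 for s, ok in data if s < threshold and not ok)
    return fp, fn, tn

# Pick the threshold minimizing Cost rather than raw error rate (FP + FN).
best_cost, best_threshold = min(
    (session_cost(*counts(t)), t) for t in (0.0, 0.25, 0.5, 0.75, 1.0))
print(best_threshold)  # → 0.75
```

Because TN also carries a (small) cost, the optimum is not simply "reject everything doubtful": rejections are penalized too, just less than concept-bearing false acceptances.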
Operating Characteristic
Further Analysis
Is CTC - ITC really modeling dialog performance?
- Mean = 0.71, Std.Dev = 0.28
- Mean for completed dialogs = 0.82
- Mean for uncompleted dialogs = 0.57
- Difference between means is significant at a very high level of confidence: p-value = 7.23*10^-9 (t-test)
So, it looks like CTC - ITC is okay, right?
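The comparison of the two group means can be sanity-checked with a two-sample t statistic. The scores below are synthetic, chosen only to echo the reported means (0.82 vs. 0.57), and Welch's unequal-variance form is used here:

```python
import math

# Hypothetical CTC - ITC scores for completed vs. uncompleted dialogs.
completed = [0.80, 0.85, 0.78, 0.84, 0.83]
uncompleted = [0.55, 0.60, 0.52, 0.58, 0.61]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(completed, uncompleted)
```

A large positive t (far beyond typical critical values) is what licenses the "difference is significant" claim on the slide.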
Further Analysis (cont’d)
Can we reliably extrapolate to other areas of the operating characteristic?
Yes - look at the distribution of the FP and FN ratios across dialogs.
Further Analysis (cont’d)
- Impact of the baseline error rate? Compared models constructed from high and low error rates.
- For the low error rate, the cost curve becomes monotonically increasing.
- This clearly indicates that "trust everything / use no confidence annotation" is the way to go in this setting.
Our Explanation So Far…
- Ability to easily overwrite incorrectly captured information in the CMU Communicator
- Relatively low error rates: the likelihood of repeated misrecognition is low
Conclusion
- Data-driven approach to quantitatively assess the costs of various types of misunderstandings.
- Models based on efficiency fit the data well; the obtained costs confirm intuition.
- For the CMU Communicator, the model predicts that total cost stays the same across a large range of the classifier's operating characteristic.
Further Experiments
- But, of course, we can verify the predictions experimentally.
- Collect new data with the system running with a very low threshold: 55 dialogs collected so far.
- Thanks to those who have participated in these experiments; "help if you have the time" to the others … www.cs.cmu.edu/~dbohus/scenarios.htm
- Re-estimate the models, verify the predictions.
Confusion Matrix

                   OK   BAD
System says OK     TP   FP
System says BAD    FN   TN

FP = false acceptance; FN = false detection/rejection
Fallout = FP/(FP+TN) = FP/N_BAD
CDR = 1 - Fallout = 1 - (FP/N_BAD)
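These definitions translate directly into code; the counts below are hypothetical:

```python
# Fallout and correct detection rate (CDR) from confusion-matrix counts.
# N_BAD = FP + TN: the number of truly bad utterances.

def fallout(fp, tn):
    """Fraction of truly bad utterances falsely accepted: FP / N_BAD."""
    return fp / (fp + tn)

def cdr(fp, tn):
    """Correct detection rate: 1 - Fallout."""
    return 1.0 - fallout(fp, tn)

# Hypothetical counts: 5 bad utterances accepted, 15 correctly rejected.
assert fallout(fp=5, tn=15) == 0.25
assert cdr(fp=5, tn=15) == 0.75
```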