Cost of Misunderstandings
Modeling the Cost of Misunderstanding Errors in the CMU Communicator Dialog System
Presented by: Dan Bohus (dbohus@cs.cmu.edu)
Work by: Dan Bohus, Alex Rudnicky
Carnegie Mellon University, 2001
11-04-01 Modeling the cost of misunderstanding … 2
Outline
- Quick overview of previous utterance-level confidence annotation work
- Modeling the cost of misunderstandings in spoken dialog systems
- Experiments & results
- Further analysis
- Summary, further work, conclusion
Utterance-Level Confidence Annotation Overview
- Confidence annotation = data-driven classification
- Corpus: 2 months, 131 dialogs, 4550 utterances
- Features: 12 features from the decoder, parsing, and dialog management levels
- Classifiers: Decision Tree, ANN, BayesNet, AdaBoost, NaiveBayes, SVM + Logistic Regression model (later on)
Confidence Annotator Performance
- Baseline error rate: 32%
- Garble baseline: 25%
- Classifier performance: 16%
- Differences between classifiers are statistically insignificant, except for Naïve Bayes
- On a soft metric, the logistic regression model clearly outperformed the others
- But is this the right way to evaluate performance?
Judging Performance
- Classification error rate (FP + FN) implicitly assumes that FP and FN errors have the same cost
- But the cost of a misunderstanding in a dialog system is presumably different for FPs and FNs
- Instead, build an error function which takes these costs into account, and optimize for that
- Cost also depends on the domain/system (not a problem, since we model per system) and on dialog state
Problem Formulation
(1) Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
(2) Use the costs to pick the optimal tradeoff point on the classifier ROC.
[Figure: false negative, false positive, and total error rates (0 to 0.7) vs. boosting threshold (-2 to 2)]
The Cost Model
- Model the impact of the FPs and FNs on system performance
- Identify a suitable performance metric P
- Build a statistical regression model at the dialog session level: P = f(FPs, FNs)
- P = k + CostFP*FP + CostFN*FN (linear regression)
- Then we can plot f, and implicitly optimize for P
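The session-level regression above can be sketched as follows. Everything here is a hypothetical stand-in, not the Communicator corpus: the per-session FP/FN counts, the "true" costs (-2 per FP, -1 per FN), and the noise level are synthetic, chosen only to show the fitting step.

```python
import numpy as np

# Hypothetical per-session counts of false positives and false negatives.
rng = np.random.default_rng(0)
FP = rng.integers(0, 10, size=50)
FN = rng.integers(0, 10, size=50)

# Suppose the "true" costs were -2.0 per FP and -1.0 per FN (plus noise).
P = 5.0 - 2.0 * FP - 1.0 * FN + rng.normal(0, 0.1, size=50)

# Least-squares fit of P = k + CostFP*FP + CostFN*FN.
X = np.column_stack([np.ones_like(FP), FP, FN]).astype(float)
k, cost_fp, cost_fn = np.linalg.lstsq(X, P, rcond=None)[0]
print(round(k, 1), round(cost_fp, 1), round(cost_fn, 1))
```

The fitted coefficients recover the planted costs; on real data the signs and magnitudes are what carry the interpretation.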
Measuring Performance
- User satisfaction (e.g. a 5-point scale): hard to get; very subjective ~ hard to make it consistent across users
- Concept transfer efficiency: CTC = correctly transferred concepts per turn; ITC = incorrectly transferred concepts per turn
- Completion
Detour: The Dataset
- 134 dialogs (2561 utterances), collected using 4 scenarios
- Satisfaction scores available for only 35 dialogs
- Corpus manually labeled at the concept level; 4 labels: OK / RBAD / PBAD / OOD
- Aggregate utterance labels generated
- Confidence annotator decisions logged
- Computed counts of FPs, FNs, CTCs, ITCs for each session
Example
U: I want to fly from Pittsburgh to Boston
S: I want to fly from Pittsburgh to Austin
C: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]

Only 2 relevantly expressed concepts.
If Accept: CTC = 1, ITC = 1. If Reject: CTC = 0, ITC = 0.
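This bookkeeping can be sketched in a few lines; the helper function and label tuples are illustrative, not taken from the system:

```python
# Hypothetical concept-level labels for the relevant concepts of the turn
# above (I_want is not a relevantly expressed concept, so it is excluded).
concepts = [("Depart_Loc", "OK"), ("Arrive_Loc", "RBAD")]

def transfer_counts(concepts, accept):
    """Return (CTC, ITC) for one turn: if the utterance is rejected,
    nothing is transferred, correctly or otherwise."""
    if not accept:
        return 0, 0
    ctc = sum(1 for _, label in concepts if label == "OK")
    itc = sum(1 for _, label in concepts if label != "OK")
    return ctc, itc

assert transfer_counts(concepts, accept=True) == (1, 1)
assert transfer_counts(concepts, accept=False) == (0, 0)
```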
Targeting Efficiency: Model 1
3 successively refined models.
CTC = FP + FN + TN + k
- CTC: correctly transferred concepts / turn
- TN: true negatives

Model              R2 all   R2 train   R2 test
CTC = FP+FN+TN     0.81     0.81       0.73
Targeting Efficiency: Model 2
CTC - ITC = (REC +) FP + FN + TN + k
- ITC: incorrectly transferred concepts / turn
- REC: relevantly expressed concepts

Model                      R2 all   R2 train   R2 test
CTC = FP+FN+TN             0.81     0.81       0.73
CTC-ITC = FP+FN+TN         0.86     0.86       0.78
CTC-ITC = REC+FP+FN+TN     0.89     0.89       0.83
Targeting Efficiency: Model 3
CTC - ITC = REC + FPC + FPNC + FN + TN + k
2 types of FPs:
- With concepts: FPC
- Without concepts: FPNC

Model                           R2 all   R2 train   R2 test
CTC = FP+FN+TN                  0.81     0.81       0.73
CTC-ITC = FP+FN+TN              0.86     0.86       0.78
CTC-ITC = REC+FP+FN+TN          0.89     0.89       0.83
CTC-ITC = REC+FPC+FPNC+FN+TN    0.94     0.94       0.90
Model 3 - Results
CTC - ITC = REC + FPC + FPNC + FN + TN + k

Coefficient   Value
k              0.41
C_REC          0.62
C_FPNC        -0.48
C_FPC         -2.12
C_FN          -1.33
C_TN          -0.55
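With the fitted coefficients, the model's prediction for a session is just a weighted sum of the per-turn counts. A minimal sketch, using the slide's coefficients; the session counts plugged in are hypothetical:

```python
# Coefficients reported for Model 3 (CTC - ITC per turn).
coef = {"k": 0.41, "REC": 0.62, "FPNC": -0.48,
        "FPC": -2.12, "FN": -1.33, "TN": -0.55}

def predicted_efficiency(counts):
    """Predicted CTC - ITC per turn, given per-turn counts for a session."""
    return coef["k"] + sum(coef[name] * v for name, v in counts.items())

# Hypothetical error-free session: one relevantly expressed concept per
# turn, no misunderstandings of any kind.
print(round(predicted_efficiency(
    {"REC": 1.0, "FPNC": 0, "FPC": 0, "FN": 0, "TN": 0}), 2))  # → 1.03
```

The signs match intuition: relevantly expressed concepts help, concept-bearing false acceptances (FPC) hurt the most.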
Other Models
- Completion (binary): logistic regression model; the estimated model does not indicate a good fit
- User satisfaction (5-point scale): based on only 35 dialogs; R2 = 0.61 (similar to the literature - Walker et al.); explanation: subjectivity of the metric + limited dataset
Problem Formulation
(1) Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
(2) Use the costs to pick the optimal tradeoff point on the classifier ROC.
[Figure: false negative, false positive, and total error rates (0 to 0.7) vs. boosting threshold (-2 to 2)]
Tuning the Confidence Annotator
- Using Model 3: CTC - ITC = REC + FPNC + FPC + FN + TN + k
- Drop k & REC, plug in the values: Cost = 0.48*FPNC + 2.12*FPC + 1.33*FN + 0.56*TN
- Minimize Cost instead of classification error rate (FP + FN), and we'll implicitly maximize concept transfer efficiency.
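A minimal sketch of this tuning loop. The (score, label) pairs are synthetic, and FPC/FPNC are collapsed into a single FP term charged at the concept-bearing cost, so the exact numbers are illustrative only:

```python
# Hypothetical (confidence score, utterance-is-OK) pairs.
data = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
        (0.4, False), (0.3, True), (0.2, False), (0.1, False)]

def session_cost(fp, fn, tn, c_fp=2.12, c_fn=1.33, c_tn=0.56):
    """Cost in the spirit of Model 3 (k and REC dropped); all FPs are
    treated as concept-bearing here, a simplifying assumption."""
    return c_fp * fp + c_fn * fn + c_tn * tn

def counts(threshold):
    """FP/FN/TN counts when accepting every utterance scoring >= threshold."""
    fp = sum(1 for s, ok in data if s >= threshold and not ok)
    fn = sum(1 for s, ok in data if s < threshold and ok)
    tn = sum(1 for s, ok in data if s < threshold and not ok)
    return fp, fn, tn

# Pick the threshold minimizing Cost rather than raw error rate (FP + FN).
best_cost, best_threshold = min(
    (session_cost(*counts(t)), t) for t in (0.0, 0.25, 0.5, 0.75, 1.0))
print(best_threshold)  # → 0.75
```

Because TN also carries a (small) cost, the optimum is not simply "reject everything doubtful": rejections are penalized too, just less than concept-bearing false acceptances.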
Operating Characteristic
Further Analysis
Is CTC - ITC really modeling dialog performance?
- Mean = 0.71, Std.Dev = 0.28
- Mean for completed dialogs = 0.82
- Mean for uncompleted dialogs = 0.57
- Difference between means is significant at a very high level of confidence: p-value = 7.23*10^-9 (t-test)
So, it looks like CTC - ITC is okay, right?
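The comparison of the two group means can be sanity-checked with a two-sample t statistic. The scores below are synthetic, chosen only to echo the reported means (0.82 vs. 0.57), and Welch's unequal-variance form is used here:

```python
import math

# Hypothetical CTC - ITC scores for completed vs. uncompleted dialogs.
completed = [0.80, 0.85, 0.78, 0.84, 0.83]
uncompleted = [0.55, 0.60, 0.52, 0.58, 0.61]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(completed, uncompleted)
```

A large positive t (far beyond typical critical values) is what licenses the "difference is significant" claim on the slide.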
Further Analysis (cont’d)
Can we reliably extrapolate to other areas of the operating characteristic?
Yes - look at the distribution of the FP and FN ratios across dialogs.
Further Analysis (cont’d)
- Impact of the baseline error rate? Compared models constructed from high and low error rates.
- For the low error rate, the cost curve becomes monotonically increasing.
- This clearly indicates that "trust everything / use no confidence annotation" is the way to go in this setting.
Our Explanation So Far…
- Ability to easily overwrite incorrectly captured information in the CMU Communicator
- Relatively low error rates: the likelihood of repeated misrecognition is low
Conclusion
- Data-driven approach to quantitatively assess the costs of various types of misunderstandings.
- Models based on efficiency fit the data well; the obtained costs confirm intuition.
- For the CMU Communicator, the model predicts that total cost stays the same across a large range of the classifier's operating characteristic.
Further Experiments
- But, of course, we can verify the predictions experimentally.
- Collect new data with the system running with a very low threshold: 55 dialogs collected so far.
- Thanks to those who have participated in these experiments; "help if you have the time" to the others … www.cs.cmu.edu/~dbohus/scenarios.htm
- Re-estimate the models, verify the predictions.
Confusion Matrix

                   OK   BAD
System says OK     TP   FP
System says BAD    FN   TN

FP = false acceptance; FN = false detection/rejection
Fallout = FP/(FP+TN) = FP/N_BAD
CDR = 1 - Fallout = 1 - (FP/N_BAD)
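These definitions translate directly into code; the counts below are hypothetical:

```python
# Fallout and correct detection rate (CDR) from confusion-matrix counts.
# N_BAD = FP + TN: the number of truly bad utterances.

def fallout(fp, tn):
    """Fraction of truly bad utterances falsely accepted: FP / N_BAD."""
    return fp / (fp + tn)

def cdr(fp, tn):
    """Correct detection rate: 1 - Fallout."""
    return 1.0 - fallout(fp, tn)

# Hypothetical counts: 5 bad utterances accepted, 15 correctly rejected.
assert fallout(fp=5, tn=15) == 0.25
assert cdr(fp=5, tn=15) == 0.75
```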