TRANSCRIPT
Reliable Probability Forecasting – a Machine Learning Perspective
David Lindsay
Supervisors: Zhiyuan Luo, Alex Gammerman, Volodya Vovk
Overview
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint
Probability Forecasting
Qualified predictions are important in many applications (especially medicine).
Most machine learning algorithms make “bare” predictions.
Those that do make qualified predictions offer no guarantees of how effective those measures are!
Probability Forecasting: Generalisation of Pattern Recognition
Goal of pattern recognition = find the “best” label for each new test object.
Example: Abdominal Pain Dataset

Training set to “learn” from (object = patient details, label = diagnosis):

  Object (Patient Details)                Label (Diagnosis)
  Name: David    Sex: M   Height: 6’2”   Appendicitis
  Name: Daniil   Sex: M   Height: 6’4”   Dyspepsia
  Name: Mark     Sex: M   Height: 6’1”   Non-specific
  ...
  Name: Sian     Sex: F   Height: 5’8”   Dyspepsia

Test object (true label unknown or withheld from learner):

  Name: Wilma    Sex: F   Height: 5’6”   ?
Probability Forecasting: Generalisation of Pattern Recognition
A probability forecast estimates the conditional probability of a label given an observed object:

  P̂(y | x) ≈ Pr(y | x)

We want the learner to estimate probabilities for all possible class labels. For example, given the training set and the test object (Name: Helen, Sex: F, Height: 5’6”), the learner might output:

  P̂(Dyspepsia | x) = 0.1
  P̂(Appendicitis | x) = 0.7
  P̂(Non-specific | x) = 0.2
  etc…
Probability forecasting more formally… Let X be the object space, Y the label space, and Z = X × Y the example space. Given the previous examples z_1, …, z_{n−1} and a new object x_n, our learner makes probability forecasts for all possible labels:

  P̂(y_n = 1 | x_n), P̂(y_n = 2 | x_n), …, P̂(y_n = |Y| | x_n)

We can use these probability forecasts to predict the most likely label:

  ŷ_n = arg max_{i ∈ Y} P̂(y_n = i | x_n)
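The protocol above can be sketched in a few lines of Python (a hypothetical illustration, not code from the talk; the forecast values echo the abdominal-pain example):

```python
# Hypothetical sketch of the forecasting protocol above: a forecast assigns
# a probability estimate to every label in Y, and the predicted label is
# the arg max. Values are illustrative.

def predict_label(forecast):
    """Return the label with the highest forecast probability (arg max)."""
    return max(forecast, key=forecast.get)

forecast = {"Appendicitis": 0.7, "Dyspepsia": 0.1, "Non-specific": 0.2}
assert abs(sum(forecast.values()) - 1.0) < 1e-9  # probabilities sum to one
print(predict_label(forecast))  # → Appendicitis
```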
Back to the plan…
Studies of Probability Forecasting
Probability forecasting has been a well-studied area since the 1970s, in:
- Psychology
- Statistics
- Meteorology

These studies assessed two criteria of probability forecasts:
- Reliability = the probability forecasts should not lie
- Resolution = the probability forecasts are practically useful

Reliability
When an event is predicted with probability p̂, it should have approximately 1 − p̂ chance of being incorrect.
a.k.a. being well calibrated. Considered an asymptotic property. Dawid (1985) proved that no deterministic learner can be reliable for all data – still interesting to investigate.
This property is often overlooked in practical studies!
Resolution
Probability forecasts are practically useful, e.g. they can be used to rank the labels in order of likelihood!
Closely related to classification accuracy – the common focus of machine learning.
Separate from reliability, i.e. the two do not go “hand in hand” (Lindsay, 2004).
Back to the plan…
Experimental design
Tested several learners on many datasets in the online setting:
- ZeroR = Control
- K-Nearest Neighbour
- Neural Network
- C4.5 Decision Tree
- Naïve Bayes
- Venn Probability Machine meta-learner (see later…)
The Online Learning Setting

  Before:  2 7 6 1 7 ? ?
  After:   2 7 6 1 7 2 ?

1. Learning machine makes a prediction for the new example (label withheld).
2. Update the training data for the learning machine for the next trial.
3. Repeat the process for all examples.
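The loop above can be sketched as follows (a minimal illustration, assuming a toy most-frequent-label learner rather than any of the algorithms tested):

```python
# Sketch of the online setting: predict for each new example with the label
# withheld, then reveal the label, count an error if wrong, and update the
# training data before the next trial. The "learner" is a toy frequency
# counter, not one of the algorithms from the experiments.
from collections import Counter

def online_errors(stream):
    counts = Counter()
    errors = 0
    for obj, label in stream:
        # Prediction made before the true label is revealed.
        prediction = counts.most_common(1)[0][0] if counts else None
        if prediction != label:
            errors += 1
        counts[label] += 1  # label revealed: update for the next trial
    return errors

print(online_errors([("x1", 7), ("x2", 7), ("x3", 2), ("x4", 7)]))  # → 2
```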
Lots of benchmark data
Tested on data available from the UCI Machine Learning repository:
- Abdominal Pain: 6387 examples, 135 features, 9 classes, noisy
- Diabetes: 768 examples, 8 features, 2 classes
- Heart-Statlog: 270 examples, 13 features, 2 classes
- Wisconsin Breast Cancer: 685 examples, 10 features, 2 classes
- American Votes: 435 examples, 16 features, 2 classes
- Lymphography: 148 examples, 18 features, 4 classes
- Credit Card Applications: 690 examples, 15 features, 2 classes
- Iris Flower: 150 examples, 4 features, 3 classes
- And many more…
Programs
Extended the WEKA data mining system (implemented in Java):
- Added the VPM meta-learner to the existing library of algorithms
- Allowed learners to be tested in the online setting
Created Matlab scripts to easily create plots (see later)
Results, papers and website
All results that I discuss today can be found in my 3 tech reports:
- The Probability Calibration Graph – a useful visualisation of the reliability of probability forecasts, Lindsay (2004), CLRC-TR-04-01
- Multi-class probability forecasting using the Venn Probability Machine – a comparison with traditional machine learning methods, Lindsay (2004), CLRC-TR-04-02
- Rapid implementation of Venn Probability Machines, Lindsay (2004), CLRC-TR-04-03
And on my web site: http://www.david-lindsay.co.uk/research.html
Back to the plan…
Loss Functions

Square loss:

  s_n = Σ_{i=1..n} Σ_{j∈Y} ( I_{y_i = j} − p̂_{i,j} )²

Log loss:

  l_n = − Σ_{i=1..n} Σ_{j∈Y} I_{y_i = j} log p̂_{i,j}

where I_{y_i = j} is the indicator that example i has true label j, and p̂_{i,j} is the forecast probability of label j at trial i.

There are many other possible loss functions… DeGroot and Fienberg (1982) showed that all loss functions measure a mixture of reliability and resolution. Log loss punishes more harshly: the learner is forced to spread its bets.
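As a concrete illustration, the two losses for a single trial can be computed as follows (a sketch with made-up forecast values, not data from the experiments):

```python
import math

# Per-trial square loss and log loss for a forecast p over the label space,
# as defined above; summing over trials gives s_n and l_n. Values are
# illustrative.

def square_loss(p, y):
    # sum over labels j of (I(y = j) - p_j)^2
    return sum(((1.0 if j == y else 0.0) - pj) ** 2 for j, pj in p.items())

def log_loss(p, y):
    # only the probability assigned to the true label is scored
    return -math.log(p[y])

p = {"Appendicitis": 0.7, "Dyspepsia": 0.2, "Non-specific": 0.1}
print(round(square_loss(p, "Appendicitis"), 2))  # → 0.14
print(round(log_loss(p, "Appendicitis"), 3))     # → 0.357
```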
ROC Curves
(Example: Naïve Bayes on the Abdominal Pain data set)
1. The graph shows the trade-off between false and true positive predictions.
2. We want the curve to be as close to the upper left corner as possible (away from the diagonal).
3. My results show that this graph tests resolution.
4. The area under the curve provides a measure of the quality of the probability forecasts.
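The area under the curve can also be computed directly from the forecasts via the rank-sum identity (a generic sketch, not the thesis's implementation): AUC equals the probability that a randomly chosen positive example receives a higher forecast than a randomly chosen negative one.

```python
# Generic AUC sketch via the Mann-Whitney identity: the fraction of
# positive/negative pairs ranked correctly (ties count half). Data are
# illustrative, not from the abdominal-pain experiments.

def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if ps > ns else 0.5 if ps == ns else 0.0
               for ps in pos for ns in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # → 0.75
```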
Table comparing traditional scores (ranks in parentheses)

  Algorithm        Error      Sqr Loss   Log Loss   ROC Area
  VPM C4.5         40.7 (8)   0.54 (5)   0.8 (4)    0.76 (1)
  Naïve Bayes      29.2 (2)   0.50 (4)   1.3 (7)    0.72 (5)
  VPM Naïve Bayes  28.9 (1)   0.44 (1)   0.6 (1)    0.75 (2)
  10-NN            33.4 (4)   1.0 (11)   2.6 (10)   0.54 (10)
  20-NN            33.4 (4)   0.96 (10)  2.2 (9)    0.55 (9)
  C4.5             39.6 (7)   0.67 (7)   3.3 (11)   0.57 (8)
  Neural Net       30.5 (3)   0.45 (2)   0.72 (2)   0.75 (3)
  30-NN            34.3 (5)   0.47 (3)   0.73 (3)   0.74 (4)
  VPM 1-NN         41.6 (9)   0.58 (6)   0.9 (5)    0.61 (6)
  1-NN             34.6 (6)   0.73 (8)   2.1 (8)    0.59 (7)
  ZeroR            55.6 (10)  0.74 (9)   1.1 (6)    0.49 (11)
Problems with Traditional Assessment
Loss functions and ROC curves give more information than error rate about the quality of probability forecasts. But…
- loss functions = a mixture of resolution and reliability
- ROC curve = measures resolution
- We don’t have any method of solely assessing reliability
- We don’t have a method of telling if probability forecasts are over- or under-estimated
Back to the plan…
Inspiration for PCG (Meteorology)
Murphy & Winkler (1977): calibration data for precipitation forecasts. Reliable points lie close to the diagonal.
A PCG plot of ZeroR on Abdominal Pain
[Plot: predicted probability (x-axis) vs. empirical frequency of being correct (y-axis), showing the line of calibration and the PCG coordinates.]
Reliability: the PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate, but it is reliable!
The plot may not span the whole axis – ZeroR makes no predictions with high probability.
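PCG-style coordinates can be sketched as follows (an assumption on my part: the tech report's exact construction may differ; this version bins forecasts by predicted probability and plots the empirical hit rate per bin):

```python
# Sketch of PCG-style coordinates: bin forecasts by predicted probability;
# in each bin, measure the empirical frequency of being correct. Reliable
# forecasts give points near the diagonal y = x. The equal-width binning
# scheme is an assumption, not necessarily the report's construction.

def pcg_points(probs, correct, n_bins=5):
    """probs: forecast probabilities; correct: 1 if the forecast's label
    was the true label, else 0. Returns (mean forecast, hit rate) per
    non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, c))
    return [(sum(p for p, _ in b) / len(b), sum(c for _, c in b) / len(b))
            for b in bins if b]

points = pcg_points([0.1, 0.15, 0.9, 0.95], [0, 0, 1, 1])
print(points)  # low forecasts never correct, high forecasts always correct
```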
PCG a visualisation tool and measure of reliability
[PCG plots: Naïve Bayes vs. VPM Naïve Bayes]

Deviations from the line of calibration:

                      Naïve Bayes   VPM Naïve Bayes
  Total               2764.5        496.7
  Mean                0.0483        0.0087
  Standard Deviation  0.0757        0.0112
  Max                 0.4203        0.1017
  Min                 4.9e-17       9.2e-8

Naïve Bayes over- and under-estimates its probabilities – much like real doctors!
- Unreliable: a forecast of 0.9 has only a 0.55 chance of being right (over-estimate)!
- Unreliable: a forecast of 0.1 has a 0.3 chance of being right (under-estimate)!
VPM is reliable, as its PCG follows the diagonal!
Learners predicting like people!
[PCG plots: Naïve Bayes vs. People]
There is lots of psychological research showing that people make unreliable probability forecasts.
Back to the plan…
Table comparing scores with PCG (ranks in parentheses)

  Algorithm        Error      Sqr Loss   Log Loss   ROC Area   PCG
  VPM C4.5         40.7 (8)   0.54 (5)   0.8 (4)    0.76 (1)   838.1 (4)
  Naïve Bayes      29.2 (2)   0.50 (4)   1.3 (7)    0.72 (5)   2764.5 (7)
  VPM Naïve Bayes  28.9 (1)   0.44 (1)   0.6 (1)    0.75 (2)   496.7 (1)
  10-NN            33.4 (4)   1.0 (11)   2.6 (10)   0.54 (10)  5062.9 (11)
  20-NN            33.4 (4)   0.96 (10)  2.2 (9)    0.55 (9)   4492.7 (10)
  C4.5             39.6 (7)   0.67 (7)   3.3 (11)   0.57 (8)   3481.2 (8)
  Neural Net       30.5 (3)   0.45 (2)   0.72 (2)   0.75 (3)   1320.5 (6)
  30-NN            34.3 (5)   0.47 (3)   0.73 (3)   0.74 (4)   921.2 (5)
  VPM 1-NN         41.6 (9)   0.58 (6)   0.9 (5)    0.61 (6)   554.6 (2)
  1-NN             34.6 (6)   0.73 (8)   2.1 (8)    0.59 (7)   4307.5 (9)
  ZeroR            55.6 (10)  0.74 (9)   1.1 (6)    0.49 (11)  678.6 (3)
Correlations of scores

  Scores                     Corr. Coeff.   Interpretation
  PCG vs. Sqr Reliability    0.76           Direct, strong
  PCG vs. Sqr Resolution     0.04           Direct, none
  PCG vs. Error              0.26           Direct, weak
  ROC vs. Sqr Reliability    -0.1           Inverse, none
  ROC vs. Sqr Resolution     0.67           Direct, strong
  ROC vs. Error              -0.52          Inverse, moderate
Back to the plan…
What is the VPM meta-learner?

Volodya’s VPM:
1. Predicts a label
2. Produces upper u and lower l bounds for the predicted label only

My VPM extension:
1. Extracts more information
2. Produces a probability forecast for all possible labels
3. Predicts a label using these probability forecasts
4. Produces Volodya’s bounds as well!

The VPM meta-learning framework “sits on top” of an existing learner Γ to complement its predictions with probability estimates.
Volodya’s original use of VPM
[Plot: error rate and bounds vs. online trial number. The upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data.]

  Low Error   1414.1   22.1%
  Error       1835     28.9%
  Up Error    2216.5   34.7%
Output from VPM compared with that of the original underlying learner
Key: Predicted = underlined, Actual =

Naïve Bayes (probability forecast for each class label; no bounds):

  Trial #  Dysp.    Renal.   Pancr    Intest obstr  Choli    Non. Spec  Perf. Pept.  Div.     Appx     Low  Up
  5831     7.6e-9   6.3e-10  4.0e-11  2.2e-9        1.3e-9   0.07       1.7e-13      2.9e-9   0.93     NA   NA
  2490     2.2e-4   2.2e-7   0.2      0.46          0.16     2.3e-5     0.17         0.01     9.4e-5   NA   NA
  1653     1.3e-4   4.1e-10  3.4e-3   4.2e-3        0.99     4.4e-5     3.3e-6       4.5e-6   3.08e-9  NA   NA

VPM Naïve Bayes (probability forecast for each class label, with bounds):

  Trial #  Dysp.  Renal.  Pancr  Intest obstr  Choli  Non. Spec  Perf. Pept.  Div.  Appx  Low   Up
  5831     0.01   0.01    0.0    0.01          0.01   0.42       0.0          0.01  0.53  0.41  0.68
  2490     0.4    0.09    0.08   0.15          0.05   0.07       0.10         0.03  0.02  0.07  0.71
  1653     0.09   0.01    0.04   0.0           0.73   0.08       0.03         0.0   0.03  0.08  0.82
Back to the plan…
ZeroR
[PCG plots: Heart Disease, Lymphography, Diabetes]
• ZeroR outputs probability forecasts which are mere label frequencies.
• ZeroR predicts the majority class label at each trial.
• It uses no information about the objects in its learning – the simplest of all learners.
• Accuracy is poor, but reliability is good.
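ZeroR's forecasting rule is simple enough to sketch in full (a minimal illustration; the function name is mine, not WEKA's API):

```python
# ZeroR as a probability forecaster: ignore the object entirely and
# forecast the label frequencies observed so far. Names are illustrative,
# not WEKA's API.
from collections import Counter

def zeror_forecast(train_labels, label_space):
    counts = Counter(train_labels)
    n = len(train_labels)
    return {y: counts[y] / n for y in label_space}

print(zeror_forecast(["a", "a", "b", "a"], ["a", "b", "c"]))
# → {'a': 0.75, 'b': 0.25, 'c': 0.0}
```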
K-NN
[PCG plots: 10-NN, 20-NN, 30-NN]
• K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset.
• Acts like a more sophisticated version of ZeroR that uses the information held in the object.
• An appropriate choice of K must be made to obtain reliable probability forecasts (depends on the data).
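The K-NN forecasting rule can be sketched the same way (an illustration; the squared-Euclidean metric, the value of K and all names are assumptions, not the thesis's setup):

```python
# K-NN as a probability forecaster: find the K nearest training examples
# under a distance metric (squared Euclidean here) and forecast the label
# frequencies amongst them. Metric, K and names are illustrative.
from collections import Counter

def knn_forecast(train, x, k, label_space):
    """train: list of (feature vector, label) pairs."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return {y: counts[y] / k for y in label_space}

train = [((0.0,), "a"), ((0.1,), "a"), ((1.0,), "b"), ((1.1,), "b")]
print(knn_forecast(train, (0.05,), k=3, label_space=["a", "b"]))
```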
Traditional Learners and VPM
• Traditional learners can be very unreliable (yet accurate) – depends on the data.
• My research shows empirically that VPM is reliable, and it can recalibrate a learner’s original probability forecasts to make them more reliable!
• The improvement in reliability often comes without detriment to classification accuracy.
[PCG plots: Naïve Bayes vs. VPM Naïve Bayes, C4.5 vs. VPM C4.5, Neural Net vs. VPM Neural Net, 1-NN vs. VPM 1-NN]
Back to the plan…
Psychological Heuristics
When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones:
- Availability – an event is predicted as more likely to occur if it has occurred frequently in the past.
- Representativeness – one compares the essential features of the event to those of the structure of previous events.
- Simulation – the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state.
Interpretation of reliable learners using heuristics
ZeroR, K-NN and VPM learners are reliable probability forecasters. We can identify these heuristics in the learning algorithms. Remember, psychological research states: more heuristics → more reliable forecasts.
Psychological Interpretation of ZeroR
The simplest of all reliable probability forecasters uses 1 heuristic:
- The learner merely counts the labels it has observed so far, and uses the frequencies of labels as its forecasts (Availability).
Psychological Interpretation of K-NN
More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics:
- Uses the distance metric to find the subset of the K closest examples in the training set (Representativeness).
- Then counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).
Psychological Interpretation of VPM
Even more sophisticated, the VPM meta-learner uses all 3 heuristics:
- The VPM tries each new test example with all possible classifications (Simulation).
- Then, under each tentative simulation, clusters similar training examples into groups (Representativeness).
- Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability).
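The three heuristics line up with the VPM's steps, which can be caricatured in a toy sketch (emphatically not the thesis's implementation: the taxonomy used here, the label of an example's nearest neighbour, and all names are illustrative assumptions):

```python
# Toy Venn-machine sketch mirroring the three heuristics above. For each
# postulated label for the test object (Simulation), examples are grouped
# by a taxonomy - here, the label of their nearest other example
# (Representativeness) - and the label frequencies in the test object's
# group give one distribution per postulated label (Availability).
# Objects are single numbers for simplicity; all names are illustrative.

def venn_forecast(train, x, label_space):
    def category(i, bag):
        # taxonomy: the label of the nearest *other* example in the bag
        j = min((k for k in range(len(bag)) if k != i),
                key=lambda k: abs(bag[k][0] - bag[i][0]))
        return bag[j][1]

    dists = {}
    for y in label_space:                     # Simulation: try each label
        bag = train + [(x, y)]
        cat = category(len(bag) - 1, bag)     # Representativeness
        group = [bag[i][1] for i in range(len(bag))
                 if category(i, bag) == cat]
        dists[y] = {lab: group.count(lab) / len(group)  # Availability
                    for lab in label_space}
    return dists  # a set of distributions, one per postulated label

train = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b")]
print(venn_forecast(train, 0.3, ["a", "b"]))
```

The spread across the returned distributions (here the probability of "a" ranges between 2/3 and 1) is what gives the VPM its lower and upper bounds.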
Theoretical justifications
- ZeroR can be proven to be asymptotically reliable (and experiments show it performs well on finite data).
- K-NN has a large body of theory (Stone, 1977) to support its convergence to the true probability distribution.
- VPM has a lot of theoretical justification for finite data, using martingales.
Take home points
- Probability forecasting is useful for real-life applications, especially medicine.
- We want learners to be reliable and accurate.
- The PCG can be used to check reliability.
- ZeroR, K-NN and VPM provide consistently reliable probability forecasts.
- Traditional learners (Naïve Bayes, Neural Net and Decision Tree) can provide unreliable forecasts.
- VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.
Fin – Acknowledgments
Supervision: Alex Gammerman, Volodya Vovk, Zhiyuan Luo
Mathematical Advice: Daniil Riabko, Volodya Vovk, Teo Sharia
Proofreading: Zhiyuan Luo, Siân Cox
Graphics & Design: Siân Cox
Catering: Siân Cox