Analysis of Classification-based Error Functions
Mike Rimer
Dr. Tony Martinez
BYU Computer Science Dept.
18 March 2006
Overview
- Machine learning
- Teaching artificial neural networks with an error function
- Problems with conventional error functions
- CB algorithms
- Experimental results
- Conclusion and future work
Machine Learning
Goal: automating learning of problem domains. Given a training sample from a problem domain, induce a correct solution-hypothesis over the entire problem population.

The learning model is often used as a black box:

[Diagram: input → f(x) → output]
Teaching ANNs with an Error Function
An error function is used to train a multi-layer perceptron (MLP), guiding the gradient descent learning procedure to an optimal state.

- Conventional error metrics are sum-squared error (SSE) and cross entropy (CE)
- SSE is suited to function approximation
- CE is aimed at classification problems
- CB error functions [Rimer & Martinez 06] work better for classification
SSE, CE
Both attempt to approximate 0-1 targets in order to represent making a decision.

[Figure: two outputs O1 and O2 on the 0–1 scale with targets T and ~T; ERROR 1 and ERROR 2 shown for a pattern labeled as class 2]
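As a concrete illustration, the two conventional per-pattern error metrics on 0-1 targets might be sketched as follows (a minimal sketch; function names are mine):

```python
import numpy as np

def sse(outputs, targets):
    """Sum-squared error over the output nodes for one pattern."""
    o, t = np.asarray(outputs, float), np.asarray(targets, float)
    return float(np.sum((t - o) ** 2))

def cross_entropy(outputs, targets, eps=1e-12):
    """Cross-entropy error for 0-1 targets; outputs clipped for numerical safety."""
    o = np.clip(np.asarray(outputs, float), eps, 1 - eps)
    t = np.asarray(targets, float)
    return float(-np.sum(t * np.log(o) + (1 - t) * np.log(1 - o)))
```

Both push every output toward its hard 0 or 1 target, which is exactly what the next slide criticizes.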
Issues with approximating hard targets
- Requires weights to be large to achieve optimality, which leads to premature weight saturation (weight decay, etc., can improve the situation)
- Learns areas of the problem space unevenly and at different times during training, which makes global learning problematic
Classification-based Error Functions
Designed to more closely match the goal of learning a classification task (i.e., correct classifications, not low error on 0-1 targets), avoiding premature weight saturation and discouraging overfitting.

- CB1 [Rimer & Martinez 02, 06]
- CB2 [Rimer & Martinez 04]
- CB3 (submitted to ICML '06)
CB1
Only backpropagates error on misclassified training patterns
Per-output error for a pattern with target class T (reconstructed from the slide; o_k is the k-th output, c(k) its class, and õ_max the highest non-target output):

    ε_k = õ_max − o_k    if c(k) = T and o_k ≤ õ_max
    ε_k = o_T − o_k      if c(k) ≠ T and o_k ≥ o_T
    ε_k = 0              otherwise

[Figure: outputs T and ~T on the 0–1 scale for a correct pattern (no error) and a misclassified pattern (ERROR between the ~T and T outputs)]
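The CB1 rule (error only when the pattern is misclassified or tied) might be sketched in Python as follows. This is a hedged reconstruction from the slide, not the authors' reference implementation:

```python
import numpy as np

def cb1_errors(outputs, target):
    """CB1-style error signals for one pattern.

    outputs: the network's output-node values; target: target class index.
    Error is zero for every node unless the target output fails to exceed
    the best competing (non-target) output."""
    o = np.asarray(outputs, float)
    o_tilde = np.delete(o, target).max()   # highest non-target output
    err = np.zeros_like(o)
    if o[target] <= o_tilde:               # misclassified (or tied):
        err[target] = o_tilde - o[target]  # push the target output up...
        mask = o >= o[target]              # ...and offending competitors down
        mask[target] = False
        err[mask] = o[target] - o[mask]
    return err
```

A correctly classified pattern yields an all-zero error vector, so it contributes nothing to the gradient.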
CB2
Adds a confidence margin, μ, that is increased globally as training progresses
Per-output error with margin μ (reconstructed; same notation as CB1):

    ε_k = (õ_max + μ) − o_k    if c(k) = T and o_k ≤ õ_max + μ
    ε_k = (o_T − μ) − o_k      if c(k) ≠ T and o_k ≥ o_T − μ
    ε_k = 0                    otherwise

[Figure: three cases on the 0–1 output scale — misclassified (full ERROR); correct but the margin μ is not satisfied (error up to the margin); correct and the margin is satisfied (no error)]
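Extending the CB1 sketch with the margin gives a CB2-style rule. Again a hedged reconstruction of the slide's rule, not the authors' code:

```python
import numpy as np

def cb2_errors(outputs, target, mu):
    """CB2-style error: like CB1, but the target output must beat the best
    competitor by a margin mu (which the algorithm grows globally as
    training progresses)."""
    o = np.asarray(outputs, float)
    o_tilde = np.delete(o, target).max()       # highest non-target output
    err = np.zeros_like(o)
    if o[target] <= o_tilde + mu:              # margin not yet satisfied
        err[target] = (o_tilde + mu) - o[target]
        mask = o >= o[target] - mu             # competitors inside the margin
        mask[target] = False
        err[mask] = (o[target] - mu) - o[mask]
    return err
```

With mu = 0 this degenerates to the CB1 rule; a growing mu demands progressively more separation between target and competitors.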
CB3
Learns a confidence C_i for each training pattern i as training progresses:
- Patterns often misclassified have low confidence
- Patterns consistently classified correctly gain confidence
[Figure: error on the 0–1 output scale for a misclassified pattern, for a pattern correct with learned low confidence C_i, and for a pattern correct with learned high confidence C_i]
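The per-pattern confidence dynamic described above might be sketched with a simple drift rule. This is purely illustrative (the slide does not give CB3's exact update formula):

```python
def update_confidence(c_i, correct, step=0.05):
    """Illustrative CB3-style per-pattern confidence update (not the authors'
    exact rule): confidence in [0, 1] drifts up when the pattern is
    classified correctly and down when it is misclassified."""
    c_i = c_i + step if correct else c_i - step
    return min(1.0, max(0.0, c_i))  # clamp to [0, 1]
```

Over many epochs, consistently correct patterns saturate toward high confidence, while frequently misclassified patterns stay near zero, which is the behavior the slide attributes to CB3.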
Neural Network Training
Influenced by:
- Initial parameter (weight) settings
- Pattern presentation order (stochastic training)
- Learning rate
- Number of hidden nodes

Goal of training: high generalization, low bias and variance.
Experiments
- Empirical comparison of six error functions: SSE, CE, CE w/ WD (weight decay), and CB1–CB3
- Eleven benchmark problems from the UC Irvine Machine Learning Repository: ann, balance, bcw, derm, ecoli, iono, iris, musk2, pima, sonar, wine
- Testing performed using stratified 10-fold cross-validation
- Model selection by hold-out set; results averaged over ten tests
- Learning rate LR = 0.1, momentum M = 0.7
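The stratified 10-fold split keeps each class's proportion roughly equal across folds. A plain-Python sketch of such a fold assignment (function name is mine; the original experiments' exact splitting code is not shown in the slides):

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each pattern index to one of k folds so that class
    proportions are approximately preserved in every fold."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            fold_of[idx] = j % k  # deal each class out round-robin
    return fold_of
```

Each fold then serves once as the test set while the remaining k−1 folds are used for training and hold-out model selection.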
Classifier output difference (COD)
Evaluation of behavioral difference of two hypotheses (e.g. classifiers)
    D̂(H1, H2) = (1/|T|) · Σ_{x∈T} I(H1(x) ≠ H2(x))
where T is the test set and I is the indicator (characteristic) function.
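The COD formula translates directly into code: count the test patterns on which the two hypotheses disagree and normalize by the test-set size.

```python
def cod(h1, h2, test_set):
    """Classifier output difference: the fraction of test patterns on which
    the two hypotheses h1 and h2 produce different classifications."""
    disagreements = sum(1 for x in test_set if h1(x) != h2(x))
    return disagreements / len(test_set)
```

A COD of 0 means the two classifiers behave identically on the test set; higher values mean more behavioral variance between runs.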
Robustness to initial network weights
Averaged 30 random runs over all datasets
| Algorithm | Test acc (%) | St dev | Epochs |
|-----------|--------------|--------|--------|
| CB3       | 93.468       | 4.7792 | 200.67 |
| CB2       | 92.839       | 4.0800 | 366.69 |
| CB1       | 92.828       | 5.3290 | 514.14 |
| CE        | 92.789       | 5.3937 | 319.57 |
| CE w/ WD  | 92.251       | 5.4735 | 197.24 |
| SSE       | 91.951       | 5.6131 | 774.70 |
Robustness to initial network weights
Averaged over all tests:

| Algorithm | Test error | COD    |
|-----------|------------|--------|
| CB3       | 0.0653     | 0.0221 |
| CB2       | 0.0716     | 0.0274 |
| CB1       | 0.0717     | 0.0244 |
| CE        | 0.0721     | 0.0248 |
| CE w/ WD  | 0.0774     | 0.0255 |
| SSE       | 0.0804     | 0.0368 |

[Bar chart: COD (0.02–0.038) by algorithm — CB1, CB2, CB3, CE, CE w/ WD, SSE]
Robustness to pattern presentation order
Averaged 30 random runs over all datasets
| Algorithm | Test acc (%) | St dev | Epochs |
|-----------|--------------|--------|--------|
| CB3       | 93.446       | 5.0409 | 200.46 |
| CB2       | 92.641       | 5.4197 | 402.52 |
| CB1       | 92.542       | 5.473  | 560.09 |
| CE        | 92.290       | 5.6020 | 329.65 |
| CE w/ WD  | 91.818       | 5.6278 | 221.21 |
| SSE       | 91.817       | 5.6653 | 593.30 |
Robustness to pattern presentation order
Averaged over all tests:

| Algorithm | Test error | COD    |
|-----------|------------|--------|
| CB3       | 0.0655     | 0.0259 |
| CB2       | 0.0736     | 0.0302 |
| CB1       | 0.0746     | 0.0282 |
| CE        | 0.0771     | 0.0329 |
| CE w/ WD  | 0.0818     | 0.0338 |
| SSE       | 0.0818     | 0.0344 |

[Bar chart: COD (0.02–0.038) by algorithm — CB1, CB2, CB3, CE, CE w/ WD, SSE]
Robustness to learning rate
Averaged over learning rates varied from 0.01 to 0.3:

| Algorithm | Test acc (%) | St dev | Epochs |
|-----------|--------------|--------|--------|
| CB3       | 93.175       | 3.514  | 334.8  |
| CB2       | 92.285       | 3.437  | 617.8  |
| SSE       | 92.211       | 3.449  | 525.7  |
| CB1       | 91.908       | 3.880  | 505.4  |
| CE        | 91.629       | 3.813  | 466.2  |
| CE w/ WD  | 91.330       | 3.845  | 234.6  |
Robustness to learning rate
[Line chart: test accuracy (90–94%) vs. learning rate (0.01–0.26) for CB1, CB2, CB3, CE, CE w/ WD, and SSE]
Robustness to number of hidden nodes
Averaged over hidden-layer sizes varied from 1 to 30 nodes:

| Algorithm | Test acc (%) | St dev | Epochs |
|-----------|--------------|--------|--------|
| CB3       | 93.026       | 3.397  | 303.9  |
| CB1       | 92.291       | 3.610  | 381.0  |
| CB2       | 92.136       | 3.410  | 609.4  |
| SSE       | 92.066       | 3.402  | 623.1  |
| CE        | 91.956       | 3.563  | 397.0  |
| CE w/ WD  | 91.74        | 3.493  | 190.6  |
Robustness to number of hidden nodes
[Line chart: test accuracy (90–94%) vs. number of hidden nodes (1–30) for CB1, CB2, CB3, CE, CE w/ WD, and SSE]
Conclusion
CB1–CB3 are generally more robust than SSE, CE, and CE w/ WD with respect to:
- Initial weight settings
- Pattern presentation order
- Pattern variance
- Learning rate
- Number of hidden nodes

CB3 is the most robust overall, with the most consistent results.
Questions?
[Backup chart: SSE — values binned 0–9.9+ (y-axis 0–100) at epochs 300, 600, and 900]
[Backup chart: Cross-entropy — values binned 0–9.9+ (y-axis 0–100) at epochs 300, 600, and 900]
[Backup chart: Cross-entropy w/ weight decay — values binned 0–9.9+ (y-axis 0–100) at epochs 300, 600, and 900]
[Backup chart: CB1 — values binned 0–9.5 (y-axis 0–100) at epochs 300, 600, and 900]
[Backup chart: CB2 — values binned 0–9.5 (y-axis 0–100) at epochs 300, 600, and 900]
[Backup chart: CB3 — values binned 0–9.5 (y-axis 0–100) at epochs 300, 600, and 900]