1
Part II: Practical Implementations.
2
Modeling the Classes
Stochastic Discrimination
3
Algorithm for Training a SD Classifier
Generate a projectable weak model
Evaluate the model w.r.t. the training set, check enrichment
Check uniformity w.r.t. the existing collection
Add to the discriminant
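The four-step loop above can be sketched in code. This is a minimal illustration, not Kleinberg's actual implementation: the weak models are random axis-aligned rectangles, the enrichment check compares class coverages, and the uniformity check is a crude "cover an under-covered point" heuristic; all names and parameters are our own.

```python
import random

def make_weak_model(bounds, rng):
    """Generate a random rectangle: a simple projectable weak model."""
    (x0, x1), (y0, y1) = bounds
    xa, xb = sorted(rng.uniform(x0, x1) for _ in range(2))
    ya, yb = sorted(rng.uniform(y0, y1) for _ in range(2))
    return lambda p: xa <= p[0] <= xb and ya <= p[1] <= yb

def coverage(model, points):
    """Fraction of the given points covered by the model."""
    return sum(model(p) for p in points) / len(points)

def train_sd(class1, class2, n_models=100, bounds=((0, 20), (0, 20)), seed=0):
    rng = random.Random(seed)
    collection = []
    counts = {p: 0 for p in class1}     # per-point coverage, for uniformity
    while len(collection) < n_models:
        m = make_weak_model(bounds, rng)                # 1. generate
        if coverage(m, class1) <= coverage(m, class2):  # 2. check enrichment
            continue
        # 3. crude uniformity check: keep the model only if it covers at
        #    least one class-1 point that is currently covered no more
        #    than average
        avg = sum(counts.values()) / len(counts)
        if collection and not any(m(p) for p in counts if counts[p] <= avg):
            continue
        collection.append(m)                            # 4. add to discriminant
        for p in counts:
            counts[p] += m(p)
    return collection
```

Every retained model is enriched for class 1, and the uniformity filter steers coverage toward points the collection has so far neglected.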
4
Dealing with Data Geometry:
SD in Practice
5
2D Example
• Adapted from [Kleinberg, PAMI, May 2000]
6
• An “r=1/2” random subset in the feature space that covers ½ of all the points
7
• Watch how many such subsets cover a particular point, say, (2,17)
8
It’s in 0/1 models: Y = 0/1 = 0.0
It’s in 1/2 models: Y = 1/2 = 0.5
It’s in 2/3 models: Y = 2/3 = 0.67
It’s in 3/4 models: Y = 3/4 = 0.75
It’s in 4/5 models: Y = 4/5 = 0.8
It’s in 5/6 models: Y = 5/6 = 0.83
[Figure: six panels showing whether each successive random subset covers the point (In/Out)]
9
It’s in 5/7 models: Y = 5/7 = 0.71
It’s in 6/8 models: Y = 6/8 = 0.75
It’s in 7/9 models: Y = 7/9 = 0.78
It’s in 8/10 models: Y = 8/10 = 0.8
It’s in 8/11 models: Y = 8/11 = 0.73
It’s in 8/12 models: Y = 8/12 = 0.67
[Figure: six more panels showing whether each new random subset covers the point (In/Out)]
10
• Fraction of “r=1/2” random subsets covering point (2,17) as more such subsets are generated
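The plot described above is easy to reproduce by Monte-Carlo simulation. In this sketch the feature space is taken to be a 20×20 grid and each "r=1/2" model is a random subset containing exactly half the grid points; both choices are our assumptions for illustration.

```python
import random

rng = random.Random(42)
grid = [(x, y) for x in range(1, 21) for y in range(1, 21)]
point = (2, 17)

in_count = 0
trace = []                       # running fraction of subsets covering the point
for t in range(1, 5001):
    # an "r = 1/2" random subset: covers exactly half of all points
    subset = set(rng.sample(grid, len(grid) // 2))
    in_count += point in subset
    trace.append(in_count / t)

# trace[-1] is close to r = 1/2, by the law of large numbers
```

The running fraction fluctuates wildly at first and then settles near 0.5, which is exactly the behavior the slide's plot shows.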
11
• Fractions of “r=1/2” random subsets covering several selected points as more such subsets are generated
12
• Distribution of model coverage for all points in space, with 100 models
13
• Distribution of model coverage for all points in space, with 200 models
14
• Distribution of model coverage for all points in space, with 300 models
15
• Distribution of model coverage for all points in space, with 400 models
16
• Distribution of model coverage for all points in space, with 500 models
17
• Distribution of model coverage for all points in space, with 1000 models
18
• Distribution of model coverage for all points in space, with 2000 models
19
• Distribution of model coverage for all points in space, with 5000 models
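The sequence of histograms above (100 through 5000 models) shows the coverage distribution concentrating around r = 1/2. A small simulation makes the effect quantitative; the grid size and checkpoints are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(0)
grid = [(x, y) for x in range(1, 21) for y in range(1, 21)]
counts = dict.fromkeys(grid, 0)     # how many models cover each point

spreads = {}
models_so_far = 0
for checkpoint in (100, 500, 5000):
    while models_so_far < checkpoint:
        subset = set(rng.sample(grid, len(grid) // 2))  # one r=1/2 model
        for p in subset:
            counts[p] += 1
        models_so_far += 1
    fractions = [c / models_so_far for c in counts.values()]
    spreads[checkpoint] = statistics.pstdev(fractions)

# The spread of the coverage distribution shrinks roughly like
# 1 / (2 * sqrt(n_models)), so the histogram narrows around 0.5.
```

This is why the histograms in the slides become progressively taller and thinner around 0.5.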
20
• Introducing enrichment:
For any discrimination to happen, the models must have some difference in coverage for different classes.
21
• Enforcing enrichment (adding in a bias): require each subset to cover more points of one class than another
[Figures: the class distribution, and a biased (enriched) weak model]
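Enforcing the bias described above can be done by simple rejection sampling: keep drawing r = 1/2 subsets until one covers more of class 1 than of class 2. The toy class regions below are our own illustration.

```python
import random

rng = random.Random(1)
grid = [(x, y) for x in range(1, 21) for y in range(1, 21)]
class1 = [p for p in grid if p[0] + p[1] < 18]    # toy class-1 region
class2 = [p for p in grid if p[0] + p[1] >= 24]   # toy class-2 region

def enriched_subset():
    """Rejection-sample until the subset is enriched for class 1."""
    while True:
        subset = set(rng.sample(grid, len(grid) // 2))
        cov1 = len(subset & set(class1)) / len(class1)
        cov2 = len(subset & set(class2)) / len(class2)
        if cov1 > cov2:          # enrichment: biased toward class 1
            return subset, cov1, cov2

subset, cov1, cov2 = enriched_subset()
```

Each accepted subset is still a random half of the space, but its coverage now carries class information, which is what makes discrimination possible.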
22
• Distribution of model coverage for points in each class, with 100 enriched weak models
23
• Distribution of model coverage for points in each class, with 200 enriched weak models
24
• Distribution of model coverage for points in each class, with 300 enriched weak models
25
• Distribution of model coverage for points in each class, with 400 enriched weak models
26
• Distribution of model coverage for points in each class, with 500 enriched weak models
27
• Distribution of model coverage for points in each class, with 1000 enriched weak models
28
• Distribution of model coverage for points in each class, with 2000 enriched weak models
29
• Distribution of model coverage for points in each class, with 5000 enriched weak models
30
• Error rate decreases as number of models increases
Decision rule: if Y < 0.5 then class 2 else class 1
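The decision rule on this slide is a one-liner once Y is computed. In this sketch a model is represented as a plain set of covered points; that representation is our simplification.

```python
def classify(point, models):
    """Decision rule from the slide: Y is the fraction of models covering
    the point; if Y < 0.5 then class 2, else class 1."""
    y = sum(point in m for m in models) / len(models)
    return 2 if y < 0.5 else 1

# Tiny illustrative ensemble of three models (sets of covered points):
models = [{(1, 1), (2, 2)}, {(1, 1)}, {(3, 3)}]
```

As the slide notes, the error rate of this rule falls as the number of (enriched) models grows, because Y concentrates on opposite sides of 0.5 for the two classes.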
31
• Sparse Training Data:
Incomplete knowledge about class distributions
Training Set Test Set
32
• Distribution of model coverage for points in each class, with 100 enriched weak models
Training Set Test Set
33
• Distribution of model coverage for points in each class, with 200 enriched weak models
Training Set Test Set
34
• Distribution of model coverage for points in each class, with 300 enriched weak models
Training Set Test Set
35
• Distribution of model coverage for points in each class, with 400 enriched weak models
Training Set Test Set
36
• Distribution of model coverage for points in each class, with 500 enriched weak models
Training Set Test Set
37
• Distribution of model coverage for points in each class, with 1000 enriched weak models
Training Set Test Set
38
• Distribution of model coverage for points in each class, with 2000 enriched weak models
Training Set Test Set
39
• Distribution of model coverage for points in each class, with 5000 enriched weak models
Training Set Test Set
No discrimination!
40
• Models of this type, when enriched with respect to the training set, are not necessarily enriched with respect to the test set
Training Set Test Set
Random model with 50% coverage of space
41
• Introducing projectability:
Maintain local continuity of class interpretations.
Neighboring points of the same class should share similar model coverage.
42
• Allow some local continuity in model membership, so that interpretation of a training point can generalize to its immediate neighborhood
[Figures: the class distribution, and a projectable model]
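One simple way to build a projectable model of the kind described above is to take the union of small neighborhoods around a few randomly chosen training points, so that membership varies smoothly and a training point's interpretation extends to its neighbors. The radius and the number of centers here are illustrative choices of ours.

```python
import random

rng = random.Random(7)

def projectable_model(train_points, n_centers=5, radius=2.5):
    """Union of balls around randomly chosen training points."""
    centers = rng.sample(train_points, n_centers)
    def member(p):
        return any((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 <= radius ** 2
                   for c in centers)
    return member, centers

train = [(x, y) for x in range(1, 21) for y in range(1, 21)]
m, centers = projectable_model(train)
# Points adjacent to a center fall inside the same ball, so neighboring
# points receive similar coverage -- the local continuity we want.
```

Unlike an arbitrary scatter of points, such a model's enrichment measured on training points tends to carry over to nearby test points.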
43
• Distribution of model coverage for points in each class, with 100 enriched, projectable weak models
Training Set Test Set
44
• Distribution of model coverage for points in each class, with 300 enriched, projectable weak models
Training Set Test Set
45
• Distribution of model coverage for points in each class, with 400 enriched, projectable weak models
Training Set Test Set
46
• Distribution of model coverage for points in each class, with 500 enriched, projectable weak models
Training Set Test Set
47
• Distribution of model coverage for points in each class, with 1000 enriched, projectable weak models
Training Set Test Set
48
• Distribution of model coverage for points in each class, with 2000 enriched, projectable weak models
Training Set Test Set
49
• Distribution of model coverage for points in each class, with 5000 enriched, projectable weak models
Training Set Test Set
50
• Promoting uniformity:
All points in the same class should be equally likely to be covered by a model of each particular rating.
Retain models that cover points that are under-covered by the current collection
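The retention heuristic above can be sketched as a simple filter: accept a candidate model only if it covers at least one point that the current collection covers less than average. Models are plain sets of points here, and the whole helper is illustrative.

```python
def promotes_uniformity(candidate, points, collection):
    """Accept the candidate only if it covers an under-covered point."""
    if not collection:
        return True                      # first model: nothing to balance yet
    counts = {p: sum(p in m for m in collection) for p in points}
    avg = sum(counts.values()) / len(counts)
    under = [p for p in points if counts[p] < avg]
    # no under-covered points means coverage is already uniform
    return not under or any(p in candidate for p in under)
```

Applied repeatedly, this pushes every point of a class toward the same coverage count, which is what uniformity demands.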
51
• Distribution of model coverage for points in each class, with 100 enriched, projectable, uniform weak models
Training Set Test Set
52
• Distribution of model coverage for points in each class, with 1000 enriched, projectable, uniform weak models
Training Set Test Set
53
• Distribution of model coverage for points in each class, with 5000 enriched, projectable, uniform weak models
Training Set Test Set
54
• Distribution of model coverage for points in each class, with 10000 enriched, projectable, uniform weak models
Training Set Test Set
55
• Distribution of model coverage for points in each class, with 50000 enriched, projectable, uniform weak models
Training Set Test Set
56
The 3 necessary conditions
Enrichment: discriminating power
Uniformity: complementary information
Projectability: generalization power
57
Extensions and Comparisons
58
Alternative Discriminants
• [Berlind 1994]
• Different discriminants for N-class problems
• Additional condition on symmetry
• Approximate uniformity
• Hierarchy of indiscernibility
59
Estimates of Classification Accuracies
• [Chen 1997]
• Statistical estimate of classification accuracy
under weaker conditions:
Approximate uniformity
Approximate indiscernibility
60
Multi-class Problems
• For n classes, define n discriminants Yi, one for each class i vs. the others
• Classify an unknown point to the class i for which the computed Yi is the largest
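The two bullets above amount to an argmax over the one-vs-rest discriminants. A minimal sketch, again representing each weak model as a set of covered points (our simplification):

```python
def classify_multiclass(point, discriminants):
    """discriminants: dict mapping class label -> list of weak models
    (sets of points) forming the one-vs-rest discriminant Y_i.
    Assign the point to the class with the largest Y_i."""
    def y(models):
        return sum(point in m for m in models) / len(models)
    return max(discriminants, key=lambda c: y(discriminants[c]))
```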
61
[Ho & Kleinberg ICPR 1996]
62
63
64
65
Open Problems
• Algorithm for uniformity enforcement: deterministic methods?
• Desirable form of weak models: fewer, more sophisticated classifiers?
• Other ways to address the 3-way trade-off: enrichment / uniformity / projectability
66
Random Decision Forest
• [Ho 1995, 1998]
• A structured way to create models: fully split a tree, use leaves as models
• Perfect enrichment and uniformity on the training set
• Promote projectability by subspace projection
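The construction on this slide can be sketched concretely: grow each tree on a random subspace (a random subset of feature axes) and split until its leaves are pure, so every leaf is a weak model that is perfectly enriched and uniform on the training set. The recursive tree below is our own toy version, not the original implementation.

```python
import random

def grow_tree(data, features, rng):
    """Fully split the data; data is a list of (feature_vector, label)."""
    labels = [y for _, y in data]
    if len(set(labels)) <= 1:
        return ("leaf", labels[0])           # pure leaf: a weak model
    f = rng.choice(features)                 # pick an axis of the subspace
    values = sorted(x[f] for x, _ in data)
    thr = values[len(values) // 2]           # median split
    left = [(x, y) for x, y in data if x[f] < thr]
    right = [(x, y) for x, y in data if x[f] >= thr]
    if not left or not right:                # cannot split on this subspace
        return ("leaf", max(set(labels), key=labels.count))
    return ("node", f, thr,
            grow_tree(left, features, rng), grow_tree(right, features, rng))

def predict(tree, x):
    while tree[0] == "node":
        _, f, thr, left, right = tree
        tree = left if x[f] < thr else right
    return tree[1]

def grow_forest(data, n_trees, n_features, seed=0):
    """Each tree sees a different random subspace of the features."""
    rng = random.Random(seed)
    all_features = list(range(len(data[0][0])))
    return [grow_tree(data, rng.sample(all_features, n_features), rng)
            for _ in range(n_trees)]

def vote(trees, x):
    preds = [predict(t, x) for t in trees]
    return max(set(preds), key=preds.count)
```

Fully grown trees fit the training set exactly (perfect enrichment and uniformity there), while the subspace projection is what promotes projectability to unseen points.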
67
Compact Distribution Maps
• [Ho & Baird 1993, 1997]
• Another structured way to create models
• Start with projectable models by coarse quantization of feature value range
• Seek enrichment and uniformity
[Figure: signatures of 2 types of events, and measurements from a new observation; axes: signal index vs. signal level]
68
SD & Other Ensemble Methods
• Ensemble learning via boosting:
A sequential way to promote uniformity of ensemble element coverage
• XCS (a genetic algorithm)
A way to create, filter, and use stochastic models that are regions in feature space
69
XCS Classifier System
• [Wilson, 1995]; recent focus of the GA community
Good performance
Reinforcement Learning + Genetic Algorithms
Model: set of rules
[Diagram: the environment feeds inputs to a set of rules, which outputs a class; reinforcement learning updates the rules from rewards; genetic algorithms search for new rules]
Example rules:
if (shape = square and number > 10) then class = red
if (shape = circle and number < 5) then class = yellow
70
Multiple Classifier Systems: Examples in Word Image Recognition
71
Complementary Strengths of Classifiers
The case for classifier combination
… decision fusion
… mixture of experts
… committee decision making
Rank of true class out of a lexicon of 1091 words, by 10 classifiers for 20 images
72
Classifier Combination Methods
• Decision Optimization:
find consensus among a given set of classifiers
• Coverage Optimization:
create a set of classifiers that work best with a given decision combination function
73
Decision Optimization
• Develop classifiers with expert knowledge
• Try to make the best use of their decisions via majority/plurality vote, sum/product rules, probabilistic methods, Bayesian methods, rank/confidence score combination, …
• The joint capability of the classifiers sets an intrinsic limit on the combined accuracy
• There is no way to handle the blind spots
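Two of the combination functions listed above, plurality voting and the sum rule, are simple enough to state in a few lines; the classifier outputs here are illustrative.

```python
from collections import Counter

def plurality_vote(decisions):
    """decisions: one class label per classifier; pick the most common."""
    return Counter(decisions).most_common(1)[0][0]

def sum_rule(score_lists):
    """score_lists: per-classifier dicts of class -> confidence score.
    Pick the class with the largest summed score."""
    totals = Counter()
    for scores in score_lists:
        totals.update(scores)        # adds the scores class-by-class
    return max(totals, key=totals.get)
```

The sum rule uses the classifiers' confidence scores rather than only their top decisions, which is why it can outvote a plurality of low-confidence errors.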
74
Difficulties in Decision Optimization
• Reliability versus overall accuracy
• Fixed or trainable combination function
• Simple models or combinatorial estimates
• How to model complementary behavior
75
Coverage Optimization
• Fix a decision combination function
• Generate classifiers automatically and systematically via training-set sub-sampling (stacking, bagging, boosting), subspace projection (RSM), superclass/subclass decomposition (ECOC), random perturbation of training processes, noise injection, …
• Need enough classifiers to cover all blind spots (how many are enough?)
• What else is critical?
76
Difficulties in Coverage Optimization
• What kind of differences to introduce:
– Subsamples? Subspaces? Super/subclasses?
– Training parameters?
– Model geometry?
• 3-way trade-off: discrimination + diversity + generalization
• Effects of the form of component classifiers
77
Dilemmas and Paradoxes in Classifier Combination
• Weaken individuals for a stronger whole?
• Sacrifice known samples for unseen cases?
• Seek agreements or differences?
78
Stochastic Discrimination
• A mathematical theory that relates several key concepts in pattern recognition:
– Discriminative power … enrichment
– Complementary information … uniformity
– Generalization power … projectability
• It offers a way to describe complementary behavior of classifiers
• It offers guidelines to design multiple classifier systems (classifier ensembles)