Machine Learning: Ensemble Learning
Hamid Beigy
Sharif University of Technology
Fall 1395
Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 19
Table of contents
1 Introduction
2 Diversity measures
3 Design of ensemble systems
4 Building ensemble based systems
Introduction
1 In our daily life:
  1 Asking different doctors' opinions before undergoing a major surgery.
  2 Reading user reviews before purchasing a product.
  3 There are countless examples where we consider the decision of a mixture of experts.
2 Ensemble systems follow exactly the same approach to data analysis.

Problem (Ensemble learning)
Given a training data set S = {(x1, t1), (x2, t2), ..., (xN, tN)} drawn from a common instance space X, and a collection of inductive learning algorithms, return a new classification algorithm for x ∈ X that combines the outputs of the classification algorithms in the collection.

3 Desired property: guarantees on the performance of the combined prediction.
Why we combine classifiers?
Reasons for using ensemble based systems
1 Statistical reasons
  1 A set of classifiers with similar training data may have different generalization performance.
  2 Classifiers with similar performance may perform differently in the field (depending on the test data).
  3 In this case, averaging (combining) may reduce the overall risk of the decision.
  4 In this case, averaging (combining) may or may not beat the performance of the best classifier.
2 Large volumes of data
  1 Training a classifier with a very large volume of data is usually impractical.
  2 A more efficient approach is to
    Partition the data into smaller subsets
    Train different classifiers on different partitions of the data
    Combine their outputs using an intelligent combination rule
3 Too little data
  1 We can use resampling techniques to produce overlapping random training data subsets.
  2 Each training subset can be used to train a different classifier.
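The partition-train-combine pipeline above can be sketched in a few lines. This is a toy illustration, not from the slides: the 1-D Gaussian dataset, the number of partitions K, and the nearest-centroid base classifier are all illustrative assumptions.

```python
import random
from collections import Counter

# Hypothetical sketch: partition a large 1-D training set into K disjoint
# subsets, train one nearest-centroid classifier per subset, and combine
# the classifiers by majority vote.
random.seed(1)
data = [(random.gauss(mu, 1.0), mu) for mu in (0, 5) for _ in range(300)]
random.shuffle(data)

K = 3
partitions = [data[i::K] for i in range(K)]  # disjoint partitions of the data

def train_centroids(part):
    # "training": compute the mean of each class within this partition
    sums = {}
    for x, label in part:
        s, n = sums.get(label, (0.0, 0))
        sums[label] = (s + x, n + 1)
    return {label: s / n for label, (s, n) in sums.items()}

models = [train_centroids(p) for p in partitions]

def ensemble_classify(x):
    # each model votes for its nearest class centroid; majority wins
    votes = [min(m, key=lambda c: abs(x - m[c])) for m in models]
    return Counter(votes).most_common(1)[0][0]

print(ensemble_classify(0.2), ensemble_classify(4.8))
```

Each base model sees only a third of the data, yet the vote recovers a sensible overall decision.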
Why we combine classifiers? (cont.)
Reasons for using ensemble based systems
1 Data fusion
  1 Multiple sources of data (sensors, domain experts, etc.).
  2 The data need to be combined systematically; for example, a neurologist may order several tests: an MRI scan, an EEG recording, and blood tests.
  3 A single classifier cannot be used to classify data from different sources (heterogeneous features).
2 Divide and conquer
  1 Regardless of the amount of data, certain problems are difficult for a single classifier to solve.
  2 Complex decision boundaries can be implemented using ensemble learning.
Too Little Data: Ensemble systems can also be used to address the exact opposite problem of having too little data. Availability of an adequate and representative set of training data is of paramount importance for a classification algorithm to successfully learn the underlying data distribution. In the absence of adequate training data, resampling techniques can be used for drawing overlapping random subsets of the available data, each of which can be used to train a different classifier, creating the ensemble. Such approaches have also proven to be very effective.
Divide and Conquer: Regardless of the amount of available data, certain problems are just too difficult for a given classifier to solve. More specifically, the decision boundary that separates data from different classes may be too complex, or lie outside the space of functions that can be implemented by the chosen classifier model. Consider the two-dimensional, two-class problem with a complex decision boundary depicted in Figure 1. A linear classifier, one that is capable of learning linear boundaries, cannot learn this complex non-linear boundary. However, an appropriate combination of an ensemble of such linear classifiers can learn this (or any other, for that matter) non-linear boundary.
As an example, let us assume that we have access to a classifier model that can generate elliptic/circular shaped boundaries. Such a classifier cannot learn the boundary shown in Figure 1. Now consider a collection of circular decision boundaries generated by an ensemble of such classifiers as shown in Figure 2, where each classifier labels the data as class 1 (O) or class 2 (X), based on whether the instances fall within or outside of its boundary. A decision based on the majority voting of a sufficient number of such classifiers can easily learn this complex non-circular boundary. In a sense, the classification system follows a divide-and-conquer approach by dividing the data space into smaller and easier-to-learn partitions, where each classifier learns only one of the simpler partitions. The underlying complex decision boundary can then be approximated by an appropriate combination of different classifiers.
Data Fusion: If we have several sets of data obtained from various sources, where the nature of the features is different (heterogeneous features), a single classifier cannot be used to learn the information contained in all of the data. In diagnosing a neurological disorder, for example, the neurologist may order several tests, such as an MRI scan, an EEG recording, blood tests, etc. Each test generates data with a different number and type of features, which cannot be used collectively to train a single classifier. In such cases, data from each testing modality can be used to train a different classifier, whose outputs can then be combined. Applications in which data from different sources are combined to make a more informed decision are referred to as data fusion applications, and ensemble based approaches have successfully been used for such applications.
(Excerpt from IEEE Circuits and Systems Magazine, Third Quarter 2006.)
[Figure 1. Complex decision boundary that cannot be learned by linear or circular classifiers. Axes: Observation/Measurement/Feature 1 (horizontal) and Observation/Measurement/Feature 2 (vertical); O marks training data examples for class 1, X marks training data examples for class 2.]
[Figure 2. Ensemble of classifiers spanning the decision space: circular decision boundaries generated by individual classifiers partition the same data into class 1 (O) and class 2 (X) regions.]
Diversity
1 Strategy of ensemble systems: create many classifiers and combine their outputs in such a way that the combination improves upon the performance of a single classifier.
2 Requirement: the individual classifiers must make errors on different inputs.
3 If the errors are different, then a strategic combination of the classifiers can reduce the total error.
4 Solution: we need classifiers whose decision boundaries are adequately different from those of the others. Such a set of classifiers is said to be diverse.
5 Classifier diversity can be obtained by
  Using different training datasets for training different classifiers.
  Using unstable classifiers.
  Using different training parameters (such as different topologies for a neural network).
  Using different feature sets (such as the random subspace method).
6 Reference: G. Brown, J. Wyatt, R. Harris, and X. Yao, "Diversity creation methods: a survey and categorization", Information Fusion, Vol. 6, pp. 5-20, 2005.
Classifier diversity using different training sets
The overarching principle in ensemble systems is therefore to make each classifier as unique as possible, particularly with respect to misclassified instances. Specifically, we need classifiers whose decision boundaries are adequately different from those of others. Such a set of classifiers is said to be diverse.

Classifier diversity can be achieved in several ways. The most popular method is to use different training datasets to train individual classifiers. Such datasets are often obtained through resampling techniques, such as bootstrapping or bagging, where training data subsets are drawn randomly, usually with replacement, from the entire training data. This is illustrated in Figure 3, where random and overlapping training data subsets are selected to train three classifiers, which then form three different decision boundaries. These boundaries are combined to obtain a more accurate classification.

To ensure that individual boundaries are adequately different, despite using substantially similar training data, unstable classifiers are used as base models, since they can generate sufficiently different decision boundaries even for small perturbations in their training parameters. If the training data subsets are drawn without replacement, the procedure is also called jackknife or k-fold data split: the entire dataset is split into k blocks, and each classifier is trained only on k-1 of them. A different subset of k blocks is selected for each classifier, as shown in Figure 4.

Another approach to achieve diversity is to use different training parameters for different classifiers. For example, a series of multilayer perceptron (MLP) neural networks can be trained by using different weight initializations, numbers of layers/nodes, error goals, etc. Adjusting such parameters allows one to control the instability of the individual classifiers, and hence contributes to their diversity. The ability to control the instability of neural network and decision tree type classifiers makes them suitable candidates to be used in an ensemble.
[Figure 3. Combining classifiers that are trained on different subsets of the training data: three classifiers produce three different decision boundaries, which are combined (Σ) into the ensemble decision boundary. Axes: Feature 1 and Feature 2.]
Diversity measures
1 Pairwise measures (assuming that we have T classifiers)
  We can calculate T(T-1)/2 pairwise diversity measures.

                     h_j is correct   h_j is incorrect
    h_i is correct         a                 b
    h_i is incorrect       c                 d

  Here a, b, c, and d are the fractions of instances falling in each cell (a + b + c + d = 1).
  For a team of T classifiers, a pairwise measure d_{ij} is averaged over all pairs:

    \bar{d} = \frac{2}{T(T-1)} \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} d_{ij}

2 Pairwise diversity measures
  1 Correlation: diversity is measured as the correlation between the two classifiers' outputs,

      \rho_{ij} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

    When the classifiers are uncorrelated, maximum diversity is obtained and \rho_{ij} = 0.
  2 Q-statistic, defined as

      Q_{ij} = \frac{ad - bc}{ad + bc}

    Q is positive when the same instances are correctly classified by both classifiers, and negative otherwise. Maximum diversity is, once again, obtained for Q_{ij} = 0.
Diversity measures
1 Pairwise measures (assuming that we have T classifiers)
  We can calculate T(T-1)/2 pairwise diversity measures, and average them.

                     h_j is correct   h_j is incorrect
    h_i is correct         a                 b
    h_i is incorrect       c                 d

2 Pairwise diversity measures (cont.)
  1 Disagreement measure is the probability that the two classifiers disagree,

      D_{ij} = b + c

    The diversity increases with the disagreement value.
  2 Double-fault measure is the probability that both classifiers are incorrect,

      DF_{ij} = d

    The diversity decreases as the double-fault value increases.
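The four pairwise measures above can be computed directly from the 2x2 table. A small sketch, where h_i and h_j are two classifiers' predicted labels, y the true labels, and a, b, c, d the cell fractions (the toy label vectors are illustrative):

```python
import math

def pairwise_counts(h_i, h_j, y):
    # a: both correct, b: only h_i correct, c: only h_j correct, d: both wrong
    a = b = c = d = 0
    for pi, pj, yi in zip(h_i, h_j, y):
        ci, cj = pi == yi, pj == yi
        if ci and cj:
            a += 1
        elif ci:
            b += 1
        elif cj:
            c += 1
        else:
            d += 1
    n = len(y)
    return a / n, b / n, c / n, d / n  # fractions, summing to 1

def correlation(a, b, c, d):
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def q_statistic(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

def disagreement(a, b, c, d):
    return b + c

def double_fault(a, b, c, d):
    return d

y   = [1, 1, 1, 1, 0, 0, 0, 0]
h_i = [1, 1, 1, 0, 0, 0, 0, 1]
h_j = [1, 1, 0, 1, 0, 0, 1, 0]
a, b, c, d = pairwise_counts(h_i, h_j, y)
print(disagreement(a, b, c, d), q_statistic(a, b, c, d))
```

In this example the two classifiers never fail together (d = 0), so Q is at its most negative and the disagreement is high.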
Diversity measures
1 Non-pairwise measures (assuming that we have T classifiers)
  1 Entropy measure makes the assumption that diversity is highest if half of the classifiers are correct and the remaining ones are incorrect,

      E = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T - \lceil T/2 \rceil} \min\{\xi_i, T - \xi_i\}

    where \xi_i is the number of classifiers that misclassify instance x_i. Entropy varies between 0 and 1, where 1 indicates the highest diversity.
  2 Kohavi-Wolpert variance,

      KW = \frac{1}{NT^2} \sum_{i=1}^{N} \xi_i (T - \xi_i)

    The Kohavi-Wolpert variance follows a similar approach to the disagreement measure.
  3 Measure of difficulty. Let z_i = \xi_i / T be the fraction of classifiers that misclassify x_i, so each z_i takes a value in \{0, 1/T, 2/T, \ldots, 1\}. The difficulty is the variance

      \theta = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2

    where \bar{z} is the mean of the z_i. How does the measure of difficulty show diversity? If the classifiers tend to fail on the same instances, the z_i are spread between 0 and 1 and the variance is large (low diversity); if the errors are spread over different instances, the z_i cluster around \bar{z} and the variance is small (high diversity).
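The three non-pairwise measures above can be sketched as follows; `predictions` holds one list of predicted labels per classifier, and the toy data (T = 3 classifiers, N = 4 instances) is illustrative:

```python
# xi[i] is the number of classifiers that misclassify instance x_i
def misclassify_counts(predictions, y):
    return [sum(p[i] != y[i] for p in predictions) for i in range(len(y))]

def entropy_measure(xi, T):
    # (T + 1) // 2 is the integer ceiling of T / 2
    N = len(xi)
    return sum(min(x, T - x) for x in xi) / (N * (T - (T + 1) // 2))

def kohavi_wolpert(xi, T):
    N = len(xi)
    return sum(x * (T - x) for x in xi) / (N * T * T)

def difficulty(xi, T):
    # variance of z_i = xi_i / T over the N instances
    N = len(xi)
    z = [x / T for x in xi]
    z_bar = sum(z) / N
    return sum((zi - z_bar) ** 2 for zi in z) / N

y = [1, 1, 1, 1]
predictions = [[1, 1, 1, 0],
               [1, 1, 0, 0],
               [1, 0, 0, 0]]
xi = misclassify_counts(predictions, y)  # [0, 1, 2, 3]
print(entropy_measure(xi, 3), kohavi_wolpert(xi, 3), difficulty(xi, 3))
```

Here the per-instance error counts run from 0 to T, giving a moderate entropy (0.5) and a nonzero difficulty.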
Diversity measures
Comparison of different diversity measures
Reference: L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy", Machine Learning, Vol. 51, pp. 181-207, 2003.
Design of ensemble systems
Two key components of an ensemble system

1 Creating an ensemble of weak learners, e.g. via
  1 Bagging
  2 Boosting
  3 Stacked generalization
  4 Mixture of experts
2 Combining the classifiers' outputs (trainable vs. fixed rule), e.g. via
  1 Majority voting
  2 Weighted majority voting
  3 Averaging

What is a weak learner?

Definition (Weak learner)
A weak learner is not guaranteed to do better than random guessing.
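The three fixed combination rules listed above can be sketched in a few lines (the toy labels, weights, and posteriors are illustrative):

```python
from collections import Counter

def majority_vote(labels):
    # each classifier casts one vote; the most common label wins
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    # votes are weighted, e.g. by each classifier's estimated accuracy
    score = {}
    for label, w in zip(labels, weights):
        score[label] = score.get(label, 0.0) + w
    return max(score, key=score.get)

def averaging(posteriors):
    # posteriors: one dict of class probabilities per classifier;
    # pick the class with the largest summed (equivalently, mean) probability
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: sum(p[c] for p in posteriors))

print(majority_vote(["A", "B", "A"]))
print(weighted_majority_vote(["A", "B", "B"], [0.9, 0.3, 0.3]))
print(averaging([{"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}]))
```

Note that the weighted vote can overturn a plain majority: two weak "B" votes lose to one confident "A" vote.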
Design of ensemble systems
In ensemble learning, a rule is needed to combine the outputs of the classifiers.

1 Classifier selection
  1 Each classifier is trained to become an expert in some local area of the feature space.
  2 The combination of the classifiers is based on the given feature vector.
  3 The classifier that was trained with the data closest to the vicinity of the feature vector is given the highest credit.
  4 One or more local classifiers can be nominated to make the decision.
2 Classifier fusion
  1 Each classifier is trained over the entire feature space.
  2 Classifier combination involves merging the individual weak classifiers to obtain a single strong classifier.
Outline
1 Introduction
2 Diversity measures
3 Design of ensemble systems
4 Building ensemble based systems
Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 14 / 19
Bagging
Bootstrap aggregating (bagging)
1 Create T bootstrap samples S[1], S[2], ..., S[T].
2 Train a distinct inducer on each S[t] to produce T classifiers.
3 Classify a new instance by a vote of the T classifiers (majority vote).

Application of bootstrap sampling
1 Given a set S containing N training examples.
2 Create S[t] by drawing N examples at random with replacement from S.
3 S[t] of size N is expected to contain about 63.2% of the distinct examples of S, leaving out about 36.8%. (Show it.)

Variations
1 Random forests
  Can be created from decision trees, in which certain parameters vary randomly.
2 Pasting small votes (for large datasets)
  1 RVotes: creates the data sets randomly.
  2 IVotes: creates the data sets based on the importance of instances, from easy to hard.
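The bootstrap property is easy to check empirically: since each draw misses a given example with probability (1 - 1/N), a sample of size N contains a fraction 1 - (1 - 1/N)^N ≈ 1 - e^{-1} ≈ 0.632 of the distinct examples. A quick simulation (the values of N and the trial count are arbitrary choices):

```python
import random

# Empirical check: average fraction of distinct examples in a bootstrap
# sample of size N drawn with replacement from N examples.
random.seed(0)
N, trials = 1000, 200
fractions = []
for _ in range(trials):
    distinct = {random.randrange(N) for _ in range(N)}
    fractions.append(len(distinct) / N)

avg = sum(fractions) / trials
print(round(avg, 3))  # close to 1 - 1/e ≈ 0.632
```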
Boosting
Schapire proved that a weak learner can be converted into a strong learner, one that generates a classifier which can correctly classify all but an arbitrarily small fraction of the instances.

In boosting, the training instances are ordered from easy to hard: easy samples are classified first, and hard samples are classified later.

Boosting algorithm
1 Create the first classifier in the same way as in bagging.
2 Train the second classifier on a training set of which only half is correctly classified by the first classifier and the other half is misclassified.
3 Train the third classifier on the data about which the first two classifiers disagree.

Variations
1 AdaBoost.M1
2 AdaBoost.R

Reference: Robert E. Schapire, "The strength of weak learnability", Machine Learning, Vol. 5, pp. 197-227, 1990.
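AdaBoost.M1 is listed above only by name; the following is a minimal sketch of it, not the three-classifier scheme described on the slide. The 1-D toy dataset and the use of decision stumps as weak learners are illustrative assumptions.

```python
import math

# Minimal AdaBoost.M1 sketch with 1-D decision stumps; labels are +1/-1.
def stump_predict(theta, direction, x):
    # direction=+1: predict +1 if x < theta, else -1; direction=-1 flips this
    return direction if x < theta else -direction

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n  # instance weights, initially uniform
    thresholds = [v - 0.5 for v in sorted(set(xs))] + [max(xs) + 0.5]
    ensemble = []  # (alpha, theta, direction) triples
    for _ in range(rounds):
        # pick the stump with the smallest weighted error
        best = None
        for theta in thresholds:
            for direction in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, xs, ys)
                          if stump_predict(theta, direction, xi) != yi)
                if best is None or err < best[0]:
                    best = (err, theta, direction)
        err, theta, direction = best
        if err == 0 or err >= 0.5:  # stopping conditions (assumed)
            break
        beta = err / (1.0 - err)
        # down-weight correctly classified instances, then renormalize
        w = [wi * (beta if stump_predict(theta, direction, xi) == yi else 1.0)
             for wi, xi, yi in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((math.log(1.0 / beta), theta, direction))
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(t, d, x) for a, t, d in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
ens = adaboost(xs, ys, rounds=3)
print([predict(ens, x) for x in xs] == ys)  # prints True
```

No single stump can separate the +++----+++ pattern, but the weighted vote of three stumps fits it exactly, illustrating how boosting turns weak learners into a strong one.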
Stacked Generalization
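The stacked-generalization slide carries no text, so here is a hypothetical minimal sketch of the idea: tier-1 classifiers' outputs become the features on which a tier-2 (meta) combiner is trained. The XOR-like toy data, the two base rules, and the lookup-table "meta-learner" are all illustrative assumptions, not from the slides.

```python
# XOR-like toy problem: neither single-feature rule solves it alone
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

def base1(p):  # tier-1 rule looking only at the first feature
    return 1 if p[0] == 1 else -1

def base2(p):  # tier-1 rule looking only at the second feature
    return 1 if p[1] == 1 else -1

# Tier-2 training set: the base outputs are the new features
meta_training = [((base1(p), base2(p)), label) for p, label in data]

# "Train" the meta-combiner; a lookup table stands in for a trained classifier
meta = {features: label for features, label in meta_training}

def stacked(p):
    return meta[(base1(p), base2(p))]

print(all(stacked(p) == label for p, label in data))  # prints True
```

Each base rule gets only 2 of the 4 instances right, yet the meta-combiner, trained on their outputs, classifies all of them correctly: the tier-2 learner learns how to correct the tier-1 learners' errors.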
Mixture Models
1 Train multiple learners
  1 Each uses a subsample of S.
  2 Each may be an ANN, a decision tree, etc.
2 The gating network is usually an NN.
[Figure: mixture-of-experts architecture. A gating network receives the input x and produces weights g1 and g2; expert networks produce outputs y1 and y2, which are combined by a weighted sum (Σ).]
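The gating/expert structure in the figure can be sketched as a simple forward pass. The linear gate, the softmax normalization, and the two toy experts are illustrative assumptions:

```python
import math

# Toy mixture-of-experts forward pass: a gating network assigns
# input-dependent weights g1, g2 to two experts, and the expert outputs
# y1, y2 are combined by a weighted sum.
def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture(x, experts, gate_params):
    gates = softmax([a * x + b for a, b in gate_params])  # g1, g2, ...
    return sum(g * expert(x) for g, expert in zip(gates, experts))

experts = [lambda x: 2 * x,      # expert 1 (toy model for positive x)
           lambda x: -x + 3]     # expert 2 (toy model for negative x)
gate_params = [(5, 0), (-5, 0)]  # gate strongly prefers expert 1 when x > 0

print(round(mixture(2.0, experts, gate_params), 3))   # ~ expert 1's output
print(round(mixture(-2.0, experts, gate_params), 3))  # ~ expert 2's output
```

Because the gate's weights depend on x, each expert effectively specializes in its own region of the input space, which is the "classifier selection" view described earlier.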
Mixture Models
Cascade learners in order of complexity
References
1 Robi Polikar, "Ensemble based systems in decision making", IEEE Circuits and Systems Magazine, Vol. 6, No. 3, pp. 21-45, 2006.
2 T. G. Dietterich, "Machine learning research: four current directions", AI Magazine, Vol. 18, No. 4, pp. 97-136, 1997.
3 T. G. Dietterich, "Ensemble methods in machine learning", Lecture Notes in Computer Science, Vol. 1857, pp. 1-15, 2000.
4 Ron Meir and Gunnar Ratsch, "An introduction to boosting and leveraging", Lecture Notes in Computer Science, Vol. 2600, pp. 118-183, 2003.
5 David Opitz and Richard Maclin, "Popular ensemble methods: an empirical study", Journal of Artificial Intelligence Research, Vol. 11, pp. 169-198, 1999.
6 L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, New York, NY: Wiley Interscience, 2005.