
Ensemble Classification Techniques
By Swapnajit Chakraborti

Basic Idea
• K base classifiers

• Composite classification model

• A new tuple is classified based on the combined results of all K classifiers, as in the sketch below
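A minimal sketch of this basic idea, assuming scikit-learn as the tool (the slides do not name a library): K = 3 base classifiers of different types are trained and combined into a composite model that classifies a new tuple from all of their votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for the original training tuples.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# K = 3 base classifiers combined into a composite classification model.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # each base classifier casts one vote
)
ensemble.fit(X, y)

# A new tuple is classified based on the results of all K classifiers.
print(ensemble.predict(X[:1]))
```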

Pros and Cons
• More accurate, with less chance of error

• Can be run in parallel, suitable for big data

• More diversity across classifiers, hence more robust

• Helps address the class-imbalance problem: each classifier can use a different, tool-dependent method of handling imbalance

• The ensemble's decision boundaries are generally better than those of a single classifier

Ensemble Techniques
• Majority votes

• Bagging

• Boosting (AdaBoost)

• Random forest

• Stacking

Majority votes
• Partition the dataset into K groups

• Use them to train the K classifiers

• Use the class predicted by the majority of classifiers as the class of the new tuple
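A sketch of this majority-vote scheme, under the same scikit-learn/NumPy assumption as above: the training set is partitioned into K disjoint groups, one classifier is trained per group, and a new tuple receives the class predicted by most of the K classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

K = 5
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
rng = np.random.default_rng(0)

# Partition the dataset into K groups and train one classifier per group.
classifiers = []
for group in np.array_split(rng.permutation(len(X)), K):
    classifiers.append(DecisionTreeClassifier(random_state=0).fit(X[group], y[group]))

def predict_majority(x):
    # The class predicted by the majority of the K classifiers.
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
    return np.bincount(votes).argmax()

print(predict_majority(X[0]))
```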

Bagging
• Bootstrap Aggregation

• Create subsets for training using random sampling of tuples with replacement

• The size of each subset is the same as that of the original set

• Because sampling is done with replacement, a subset's tuples are not exactly those of the original set: some tuples repeat while others are left out

• Each training tuple has the same weight

• Each iteration is used to train one classifier

• Majority vote is used for predicting the class of a new tuple
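A sketch of bagging under the same assumptions: each bootstrap sample is drawn with replacement and has the same size as the original set, every tuple carries equal weight, and the class of a new tuple is decided by majority vote. (scikit-learn's BaggingClassifier wraps this same procedure.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

K = 10
X, y = make_classification(n_samples=500, n_features=10, random_state=2)
rng = np.random.default_rng(0)

classifiers = []
for _ in range(K):
    # Bootstrap sample: same size as the original set, drawn with replacement,
    # so some tuples appear more than once and others not at all.
    idx = rng.integers(0, len(X), size=len(X))
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagging_predict(x):
    # Majority vote across the K bootstrap-trained classifiers.
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
    return np.bincount(votes).argmax()

print(bagging_predict(X[0]))
```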

Boosting
• A weight is assigned to each training tuple

• Each iteration trains and tests one classifier

• For the next iteration, the weights of misclassified tuples are increased so that the next classifier focuses more on them and improves its accuracy

AdaBoost (Adaptive Boosting)
• Initially, each training tuple has weight 1/d, where d is the number of tuples in the original dataset

• Sampling with replacement

• At iteration i, train model i

• Calculate the error of model i, using the training dataset itself as the test set

• The weight of a correctly classified tuple is decreased

• The weight of an incorrectly classified tuple is increased

• Repeat for the next iteration, i+1 (a sketch of this loop follows below)
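A sketch of this training loop. The multiplicative update error/(1 − error) applied to correctly classified tuples is the standard AdaBoost.M1 rule; the slides only state that correct tuples are down-weighted and incorrect ones up-weighted, so the exact formula is an assumption here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

T = 10                                 # number of boosting iterations
X, y = make_classification(n_samples=500, n_features=10, random_state=3)
d = len(X)
w = np.full(d, 1.0 / d)                # each tuple starts with weight 1/d
rng = np.random.default_rng(0)

models, alphas = [], []
for i in range(T):
    # Sample the training set with replacement according to the current weights.
    idx = rng.choice(d, size=d, replace=True, p=w)
    model = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])

    # Error of model i, using the original training data as the test set.
    miss = model.predict(X) != y
    error = max(float(np.sum(w[miss])), 1e-10)
    if error >= 0.5:                   # model no better than chance: skip it
        continue

    # Decrease the weights of correctly classified tuples, then renormalize,
    # which effectively increases the weights of the misclassified ones.
    w[~miss] *= error / (1.0 - error)
    w /= w.sum()

    models.append(model)
    alphas.append(np.log((1.0 - error) / error))   # vote weight, used below
```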

Prediction by AdaBoost
• Not by simple majority vote

• Weight is assigned to each classifier’s vote

• The weight of a classifier's vote decreases with its error rate: the lower the error, the larger the vote weight

• For each class, the weights of all classifiers are summed up

• The class with the highest sum is the predicted class
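Continuing the sketch above, prediction combines the weighted votes: each classifier votes with weight alpha = log((1 − error)/error), the weights are summed per class, and the class with the highest total wins. The log formula is the standard AdaBoost vote weight, stated here as an assumption since the slides only say the weight grows as the error rate shrinks.

```python
import numpy as np

def adaboost_predict(x, models, alphas, classes=(0, 1)):
    # Sum up, for each class, the vote weights of the classifiers predicting it.
    totals = {c: 0.0 for c in classes}
    for model, alpha in zip(models, alphas):
        predicted = model.predict(x.reshape(1, -1))[0]
        totals[predicted] += alpha
    # The class with the highest weighted sum is the predicted class.
    return max(totals, key=totals.get)

# Usage with the models and alphas from the training sketch above:
# print(adaboost_predict(X[0], models, alphas))
```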

Pros and Cons
• Boosting can suffer from overfitting

• Errors of the many classifiers can accumulate

• Bagging, in contrast, is not prone to overfitting

Random forest
• Comparable in accuracy to AdaBoost

• More robust to errors and outliers

• The trees, i.e. the individual classifiers, should have minimal overlap or correlation with one another

• No overfitting problem
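A minimal random-forest sketch, again assuming scikit-learn: each tree is grown on a bootstrap sample and considers only a random subset of features at every split, which is what keeps the individual trees minimally correlated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=4)

forest = RandomForestClassifier(
    n_estimators=100,        # number of trees in the ensemble
    max_features="sqrt",     # random feature subset at each split decorrelates the trees
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```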