TRANSCRIPT
Ensemble Methods
Adapted from slides by Todd Holloway http://abeautifulwww.com/2007/11/23/ensemble-machine-learning-tutorial/
Outline
• The Netflix Prize – Success of ensemble methods in the Netflix Prize
• Why Ensemble Methods Work
• Algorithms
– Bagging
– Random forests
– AdaBoost
Definition
Ensemble Classification: Aggregation of predictions of multiple classifiers with the goal of improving accuracy.
Teaser: How good are ensemble methods?
Let's look at the Netflix Prize Competition…
• Supervised learning task
– Training data is a set of users and ratings (1, 2, 3, 4, 5 stars) those users have given to movies.
– Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars
• $1 million prize for a 10% improvement over Netflix's current movie recommender/classifier (RMSE = 0.9514)
Began October 2006
Just three weeks after it began, at least 40 teams had bested the Netflix classifier.
Top teams showed about 5% improvement.
from http://www.research.att.com/~volinsky/netflix/
However, improvement slowed…
Today, the top team has posted an 8.5% improvement. Ensemble methods are the best performers…
“Thanks to Paul Harrison's collaboration, a simple mix of our solutions improved our result from 6.31 to 6.75”
Rookies
“My approach is to combine the results of many methods (also two-way interactions between them) using linear regression on the test set. The best method in my ensemble is regularized SVD with biases, post-processed with kernel ridge regression”
Arek Paterek
http://rainbow.mimuw.edu.pl/~ap/ap_kdd.pdf
“When the predictions of multiple RBM models and multiple SVD models are linearly combined, we achieve an error rate that is well over 6% better than the score of Netflix's own system.”
U of Toronto
http://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf
Gravity
home.mit.bme.hu/~gtakacs/download/gravity.pdf
“Our common team blends the result of team Gravity and team Dinosaur Planet.” Might have guessed from the name…
When Gravity and Dinosaurs Unite
And, yes, the top team, which is from AT&T… “Our final solution (RMSE = 0.8712) consists of blending 107 individual results.”
BellKor / KorBell
The winner was an ensemble of ensembles (including BellKor).
Some Intuitions on Why Ensemble Methods Work…
Intuitions
• Utility of combining diverse, independent opinions in human decision-making
– Protective mechanism (e.g. stock portfolio diversity)
• Violation of Ockham's Razor
– Identifying the best model requires identifying the proper "model complexity"
See Domingos, P. Occam’s two razors: the sharp and the blunt. KDD. 1998.
Majority vote
Suppose we have 5 completely independent classifiers…
– If accuracy is 70% for each:
• 10(0.7^3)(0.3^2) + 5(0.7^4)(0.3) + (0.7^5) = 83.7% majority vote accuracy
– With 101 such classifiers:
• 99.9% majority vote accuracy
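These numbers follow from the binomial distribution. As a quick check, here is a minimal sketch (the majority_vote_accuracy helper is our naming, not from the slides):

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right class."""
    k_min = n // 2 + 1  # smallest winning majority (n assumed odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

print(majority_vote_accuracy(5, 0.7))    # 0.83692 -> the 83.7% above
print(majority_vote_accuracy(101, 0.7))  # > 0.999  -> the 99.9% above
```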
Strategies
Bagging
– Use different samples of observations and/or predictors (features) of the examples to generate diverse classifiers
– Aggregate classifiers: average in regression, majority vote in classification
Boosting
– Make examples currently misclassified more important (or less, in some cases)
Bagging (Constructing for Diversity)
1. Use random samples of the examples to construct the classifiers
2. Use random feature sets to construct the classifiers • Random Decision Forests
• Bagging: Bootstrap Aggregation
Leo Breiman
• Bootstrap: consider the following situation:
– We have a random sample x = (x_1, …, x_N) from an unknown probability distribution F
– We wish to estimate a parameter θ = t(F)
– We build an estimate θ̂ = s(x)
– What is the s.d. of θ̂?
Examples: 1) estimate mean and sd of expected prediction error 2) estimate point-wise confidence bands in smoothing
• Bootstrap:
– It is completely automatic
– Requires no theoretical calculations
– Not based on asymptotic results
– Available regardless of how complicated the estimator θ̂ is
• Bootstrap algorithm:
1. Select B independent bootstrap samples x*^1, …, x*^B, each consisting of N data values drawn with replacement from x
2. Evaluate the bootstrap replication corresponding to each bootstrap sample: θ̂*(b) = s(x*^b), b = 1, …, B
3. Estimate the standard error of θ̂ using the sample standard error of the B estimates
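A minimal sketch of these three steps in Python (the bootstrap_se helper and the choice of the median as the statistic are our illustration, not from the slides):

```python
import numpy as np

def bootstrap_se(x, stat, B=1000, seed=0):
    """Bootstrap standard error: draw B resamples of size N with
    replacement from x, evaluate stat on each replicate, and return
    the sample standard deviation of the B estimates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    reps = np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    return reps.std(ddof=1)

x = np.random.default_rng(42).normal(size=100)  # toy sample
print(bootstrap_se(x, np.median))  # s.e. of the sample median
```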
• Bagging: use the bootstrap to improve predictions
1. Create bootstrap samples; estimate the model from each bootstrap sample
2. Aggregate predictions (average if regression, majority vote if classification)
• This works best when perturbing the training set can cause significant changes in the estimated model
• For instance, for least squares, one can show variance is decreased while bias is unchanged
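A minimal sketch of bagging a classifier, assuming scikit-learn decision trees as the base learner and non-negative integer class labels (the bagged_predict helper is our naming):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, B=50, seed=0):
    """Fit one tree per bootstrap sample, then take a majority vote.
    Assumes numpy arrays and non-negative integer class labels."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)               # shape (B, n_test)
    # majority vote: most frequent predicted label per test point
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```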
Random forests
• At every level, choose a random subset of the variables (predictors, not examples) and choose the best split among those attributes
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Random forests
• Let the number of training points be M, and the number of variables in the classifier be N.
• For each tree,
1. Choose a training set by sampling M times with replacement from all M available training cases.
2. For each node, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these.
Random forests
• Grow each tree as deep as possible – no pruning!
• Out-of-bag data can be used to estimate cross-validation error
– For each training point, get a prediction by averaging over the trees whose bootstrap samples do not include that point
• Variable importance measures are easy to calculate
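For illustration, scikit-learn's RandomForestClassifier exposes all three of these ideas; a minimal sketch (the iris data and the parameter values are our choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features sets m, the number of variables tried at each split;
# oob_score=True turns the out-of-bag points into a built-in error estimate.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print(rf.oob_score_)            # out-of-bag estimate of accuracy
print(rf.feature_importances_)  # variable importance measures
```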
Boosting
1. Create a sequence of classifiers, giving higher influence to more accurate classifiers
2. At each iteration, make examples currently misclassified more important (they get a larger weight in the construction of the next classifier).
3. Then combine the classifiers by weighted vote (weight given by classifier accuracy)
AdaBoost Algorithm
1. Initialize weights: each case gets the same weight, w_i = 1/N, i = 1, …, N
2. Construct a classifier g_m using the current weights. Compute its weighted error:
ε_m = Σ_i w_i I{y_i ≠ g_m(x_i)} / Σ_i w_i
3. Get the classifier influence α_m, and update the example weights:
α_m = log((1 − ε_m) / ε_m)
w_i ← w_i exp{α_m I{y_i ≠ g_m(x_i)}}
4. Go to step 2…
The final prediction is a weighted vote, with weights α_m
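A minimal sketch of this loop, with decision stumps as the weak learner g_m (our choice; the slide does not fix one) and labels in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """AdaBoost as on the slide; y must take values in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                   # step 1: equal weights
    learners, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = g.predict(X) != y               # I{y_i != g_m(x_i)}
        eps = w[miss].sum() / w.sum()          # step 2: weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = np.log((1 - eps) / eps)        # step 3: classifier influence
        w = w * np.exp(alpha * miss)           # up-weight misclassified cases
        learners.append(g); alphas.append(alpha)
    def predict(X_new):                        # final weighted vote
        scores = sum(a * g.predict(X_new) for a, g in zip(alphas, learners))
        return np.sign(scores)
    return predict
```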
AdaBoost Algorithm
Algorithm 31 AdaBoost(W, D, K)
1: d^(0) ← (1/N, 1/N, …, 1/N)  // Initialize uniform importance to each example
2: for k = 1 … K do
3:   f^(k) ← W(D, d^(k−1))  // Train kth classifier on weighted data
4:   ŷ_n ← f^(k)(x_n), ∀n  // Make predictions on training data
5:   ε^(k) ← Σ_n d_n^(k−1) [y_n ≠ ŷ_n]  // Compute weighted training error
6:   α^(k) ← ½ log((1 − ε^(k)) / ε^(k))  // Compute “adaptive” parameter
7:   d_n^(k) ← (1/Z) d_n^(k−1) exp[−α^(k) y_n ŷ_n], ∀n  // Re-weight examples and normalize
8: end for
9: return f(x) = sgn(Σ_k α^(k) f^(k)(x))  // Return (weighted) voted classifier
questions that you got right, you pay less attention to. Those that you got wrong, you study more. Then you take the exam again and repeat this process. You continually down-weight the importance of questions you routinely answer correctly and up-weight the importance of questions you routinely answer incorrectly. After going over the exam multiple times, you hope to have mastered everything.
The precise AdaBoost training algorithm is shown above (Algorithm 31). The basic functioning of the algorithm is to maintain a weight distribution d over data points. A weak learner f^(k) is trained on this weighted data. (Note that we implicitly assume that our weak learner can accept weighted training data, a relatively mild assumption that is nearly always true.) The (weighted) error rate of f^(k) is used to determine the adaptive parameter α, which controls how “important” f^(k) is. As long as the weak learner does, indeed, achieve < 50% error, then α will be greater than zero. As the error drops to zero, α grows without bound.
(What happens if the weak learning assumption is violated and ε is equal to 50%? What if it is worse than 50%? What does this mean, in practice?)
After the adaptive parameter is computed, the weight distribution is updated for the next iteration. As desired, examples that are correctly classified (for which y_n ŷ_n = +1) have their weight decreased multiplicatively. Examples that are incorrectly classified (y_n ŷ_n = −1) have their weight increased multiplicatively. The Z term is a normalization constant to ensure that the sum of d is one (i.e., d can be interpreted as a distribution). The final classifier returned by AdaBoost is a weighted vote of the individual classifiers, with weights given by the adaptive parameters.
To better understand why α is defined as it is, suppose that our weak learner simply returns a constant function that returns the (weighted) majority class. So if the total weight of positive examples exceeds that of negative examples, f(x) = +1 for all x; otherwise f(x) = −1 for all x. To make the problem moderately interesting, suppose that in the original training set, there are 80 positive examples and 20 negative examples. In this case, f^(1)(x) = +1. Its weighted error rate will be ε^(1) = 0.2 because it gets every negative example wrong.
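To make those numbers concrete, here is that single update carried out in code (a sketch following Algorithm 31): the re-weighting exactly rebalances the two classes, so the constant classifier is useless on the next round.

```python
import numpy as np

# The text's example: 80 positives, 20 negatives, f1(x) = +1 everywhere.
d = np.full(100, 1 / 100)                  # uniform initial weights
correct = np.arange(100) < 80              # the 80 positives are correct
eps = d[~correct].sum()                    # epsilon(1) = 0.2
alpha = 0.5 * np.log((1 - eps) / eps)      # 0.5 * ln 4 = ln 2 ~ 0.693
d *= np.exp(np.where(correct, -alpha, alpha))
d /= d.sum()                               # normalize by Z
print(d[correct].sum(), d[~correct].sum())  # ~0.5 0.5: classes rebalanced
```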
[Figure: classifications (colors) and weights (size) after 1 iteration of AdaBoost, after 3 iterations, and after 20 iterations.]
from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.
AdaBoost
• Advantages
– Very little code
– Reduces variance
• Disadvantages
– Sensitive to noise and outliers. Why?
How was the Netflix prize won?
Gradient boosted decision trees… Details: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
Sources
• David Mease. Statistical Aspects of Data Mining. Lecture. http://video.google.com/videoplay?docid=-4669216290304603251&q=stats+202+engEDU&total=13&start=0&num=10&so=0&type=search&plindex=8
• Dietterich, T. G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz
• Elder, John and Seni, Giovanni. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. KDD 2007 Tutorial. http://videolectures.net/kdd07_elder_ftfr/
• Netflix Prize. http://www.netflixprize.com/
• Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press. 1995.