
Master's Thesis
APPLIED PROBABILITY AND STATISTICS

Analysing Ensemble Methods Based on Tree-Structured Base Learners
with a 0/1 Bias Variance Decomposition

MADS LINDSKOU
SEPTEMBER 1ST, 2017

Aalborg University • Department of Mathematical Sciences • Skjernvej 4A • 9220 Aalborg


Department of Mathematical Sciences

Mathematics and Statistics

Skjernvej 4A

Title: Analysing Ensemble Methods Based on Tree-Structured Base Learners with a 0/1 Bias Variance Decomposition

Type: Master's Thesis

Author: Mads Lindskou

Supervisor: Rasmus Plenge Waagepetersen

Circulation: 2

Pages: 64

Submitted: September 1st, 2017

Abstract:

In this thesis I provide an R interface for a bias variance decomposition of the 0/1 loss function. This is used to analyse the bias and variance of decision tree learners. Furthermore, I use the ensemble methods bagging, random forests and boosting, with decision trees as base learners, to analyse how much such methods improve on single trees. Finally, I provide an original data driven example of a binary classification problem in forensic science. I show that the two state-of-the-art methods, random forests and AdaBoost, are extremely powerful in terms of segregating Greenlanders from non-Greenlanders based on 40 important gene locations. Furthermore, the single tree-structured CART procedure was able to produce some useful insight into how such a segregation is carried out by the random forest, which is essentially a black box.

The contents of this report are confidential.


Contents

1 Introduction
  1.1 Statistical Modelling - The Two Cultures
  1.2 Statistical Learning
  1.3 Purpose of the Thesis
  1.4 Reading Guide

2 Assessing Model Accuracy
  2.1 A Unified Bias-Variance Decomposition
  2.2 Implementing the Bias-Variance Decomposition

3 Decision Trees
  3.1 C4.5
    3.1.1 Continuous Inputs and Outputs
    3.1.2 Golf Data Example
    3.1.3 Tree Pruning
  3.2 CART
  3.3 Consistency
  3.4 Bias Variance Study of Decision Trees

4 Bagging
  4.1 Regression
  4.2 Classification

5 Random Forests
  5.1 Classification
    5.1.1 Strength and Correlation
  5.2 Regression
  5.3 Consistency
  5.4 Out of Bag Samples and Variable Importance
  5.5 Bias Variance Study of Forest-RI

6 Boosting
  6.1 AdaBoost
  6.2 Bounding the Training Error
  6.3 Choosing the Base Learners
  6.4 Consistency
  6.5 The Margin Distribution
  6.6 Bias Variance Study of AdaBoost

7 Forensic Genetics - Data Driven Example
  7.1 Analysis of GRLDNA
  7.2 Conclusion
  7.3 CART Decision Trees for GRLDNA

Bibliography

A Hoeffding's Inequality

B Vignette of ClassifyR

C Code Chunks from ClassifyR
  C.1 bivar
  C.2 catch ellipsis
  C.3 bivar methods
  C.4 bivars
  C.5 predict <function>


1 Introduction

The terms statistical learning, data mining and machine learning are often used interchangeably; so what is the difference? The short answer is: none. They are concerned with the same question, how do we learn from data, and they cover almost exactly the same material using almost exactly the same techniques.

As an example, if we examine the tables of contents of The Elements of Statistical Learning by Hastie, Tibshirani and Friedman [26], Machine Learning: A Probabilistic Perspective by Murphy [18] and Top 10 algorithms in data mining by Wu and Kumar [30], we essentially see the same topics. Throughout this thesis I will use the term "statistical learning".

1.1 Statistical Modelling - The Two Cultures

According to Efron and Hastie [12], statistics has developed immensely over the past 60 years due to advances in computing. Furthermore, they sort some of the major statistical topics in increasing order by year as

• Classic statistical inference: Bayesian, frequentist and Fisherian inference.

• Early computer-age developments (from the 1950s to the 1990s): ridge regression, generalized linear models, regression trees, survival analysis etc.

• Twenty-first century topics: random forests, boosting, neural networks, support-vector machines etc.

In 2001, Breiman [8] challenged classical statistical inference and, to some extent, the early computer-age developments. He thinks of data as being generated by a black box in which a vector of input variables x goes in one side, and on the other side the output variables y come out. Inside the black box, nature functions to associate the input variables with the output variables, as depicted in Figure 1.1 (b). The two goals of a data analysis are prediction and information, and Breiman distinguishes between the data modelling culture and the algorithmic modelling culture. In the data modelling culture (classic statistical inference and the early computer-age developments) the black box is filled with models like linear regression and the Cox model, where it is assumed that nature behaves in a specific way, see Figure 1.1 (a). These "traditional" models are validated by goodness-of-fit and residual examination. In the algorithmic modelling culture, on the other hand, the inside of the black box is considered complex and unknown, as depicted in Figure 1.1 (c). In this approach, model validity is measured by predictive accuracy. Breiman's key point is that it is nowhere written on a stone tablet what kind of model should be used to solve problems involving data, and he argues that within the field of statistics there is a predominant interest in the data modelling culture. What Breiman calls the algorithmic modelling culture is exactly what is also known as statistical learning.


Figure 1.1: (a) The data modelling culture (linear regression, Cox model). (b) The true relation between x and y (nature). (c) The algorithmic modelling culture (inside unknown; random forests etc.).

1.2 Statistical Learning

Assume a learning set of observations D = {(xi, yi)}_{i=1}^N is given, where the p-dimensional inputs xi = (xi1, xi2, ..., xip) are drawn from the input space X = S1 × S2 × ··· × Sp and the outputs yi are drawn from the output space Y.¹ In a statistical sense, xi is an observed value of the random vector (X1, X2, ..., Xp) and yi is an observed value of the random variable Y.

Statistical learning is usually divided into supervised and unsupervised learning. In the supervised approach, the goal is to learn a model yD : X → Y, where D is the learning set used to build the model yD. In general, if Y = {y1, y2, ..., yJ} is a finite set of classes or labels, yD is called a classifier, and when Y = ℝ, yD is called a regressor. The problem of learning a classifier is referred to as classification and the problem of learning a regressor is referred to as regression.

In unsupervised learning we are only given the inputs {xi}_{i=1}^N and the goal is to find "interesting patterns". Compared to supervised learning, there is no obvious error metric, hence it is a much harder task to evaluate a model's performance. A common problem in unsupervised learning is clustering, where the task is to group observations with similar inputs.

In statistical learning one speaks of a "learner" as the algorithm producing a model. One of the simplest (supervised) classification algorithms in statistical learning is the k-nearest neighbor (KNN), which is used as a running example. The idea is to memorize the learning set and then predict the class of any new observation on the basis of the classes of its closest neighbors in the learning set. Denote by Nk(x) the neighborhood of x defined by the k closest points in D. Then the k-nearest neighbor classifier is defined as

$$y_D(x) = \arg\max_{c \in \mathcal{Y}} \sum_{x_i \in N_k(x)} \mathbf{1}[y_i = c], \qquad (1.1)$$

with 1[·] denoting the indicator function. This amounts to a majority vote in the neighborhood of x. If a tie occurs, one possibility is to assign a class at random among the tied classes.

The k-nearest neighbor is also applicable for regression; here the prediction is defined to be the average value of the k nearest neighbors, as

$$y_D(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i. \qquad (1.2)$$

The k-nearest neighbor fit thus has a single parameter, the number of neighbors k. Choosing k = 1 will most likely lead to overfitting, such that yD is too flexible and captures unwanted noise. On the other hand, choosing k too large will result in underfitting, such that yD fails to capture the relationship between the inputs and the output. Hence, the task is to find the value of k such that yD is a good approximation of the underlying true model, preventing both under- and overfitting.
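A minimal R sketch of the two estimators in (1.1) and (1.2) may clarify the definitions. This is my own illustration; the function names knn_classify and knn_regress are made up, Euclidean distance is assumed, and no attempt at efficiency is made.

# k-nearest neighbor classification: majority vote among the k closest
# learning points, with ties broken at random.
knn_classify <- function(x, X_train, y_train, k = 5) {
  d     <- sqrt(colSums((t(X_train) - x)^2))     # distances to all learning points
  nbrs  <- y_train[order(d)[1:k]]                # classes of the k nearest neighbors
  votes <- table(nbrs)
  sample(names(votes)[votes == max(votes)], 1)   # majority vote, random tie break
}

# k-nearest neighbor regression: the average output of the k nearest neighbors.
knn_regress <- function(x, X_train, y_train, k = 5) {
  d <- sqrt(colSums((t(X_train) - x)^2))
  mean(y_train[order(d)[1:k]])
}

# Example: predict the species of the first iris flower from the remaining 149.
X <- as.matrix(iris[, 1:4])
knn_classify(X[1, ], X[-1, ], iris$Species[-1], k = 5)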

¹Inputs and outputs are sometimes referred to as features and targets, respectively.


1.3 Purpose of the Thesis

One of the major challenges in this thesis was to decide upon a unified and logical notation throughout the entire report. This turned out to be difficult for several reasons; for one, I have tried to tie together an introduction to several big topics using some 30 different sources. Furthermore, I have focused on classification, for which there is no easy way to assess model accuracy in terms of bias and variance. Thus, I have spent many days and hours programming the R package ClassifyR, which is an implementation of a unified framework for the bias variance decomposition that generalizes to the classification problem.

In this thesis, I have tried to present the theoretical foundation of state-of-the-art models in statistical learning. Furthermore, I provide an original data driven example of a binary classification problem in forensic science.

1.4 Reading Guide

Throughout this thesis, I use the numeric BibTeX style, and a complete bibliography can be found after the appendices. Figures, tables, equations etc. are numbered sequentially within chapters; hence, the second figure in Chapter 2 is referred to as Figure 2.2, for example. Every definition ends with a filled diamond (◆), whereas the end of a proof is indicated with a filled square (■).

Acknowledgement

I would like to thank my supervisor, Professor Rasmus Plenge Waagepetersen, Aalborg University, for always taking the time to listen and discuss interesting details. His scientific and professional approach to mathematics has influenced my work and thinking dramatically.


2 Assessing Model Accuracy

Recall the definition of yD, used to predict output values from the inputs, where D is used to construct the model. It is often convenient to regard the learning set as random when predicting an outcome, and we let YD(x) denote the random prediction given X = x when D is regarded as a random learning set drawn independently from the universe Ω = X × Y. Furthermore, YD(X) will denote the random prediction where D and X are both random.

Suppose we wish to predict a (real valued) future observation Y from a random vector X ∈ X based on yD and the information X = x, where

$$Y = f(x) + \varepsilon,$$

for some function f and where ε is an error term, independent of all previous error terms, with mean zero and variance σ². The goal is to assess the future error we commit by predicting Y based on the value yD(x). In order to accomplish this we need a loss function L(y, yD(x)) to measure the errors committed when predicting Y from yD(x); the most common loss function is the squared loss function L(y, yD(x)) = (y − yD(x))².

2.1 Definition: The expected prediction error of yD at input x is defined as

$$\mathrm{Err}(y_D(x)) = E_x[L(Y, y_D(x))] := E[L(Y, y_D(x)) \mid X = x].$$

That is, Err(yD(x)) is the average error we commit when we predict Y from yD(x) based on D and the input x. When we estimate the expected prediction error, we average over all x.

Define by E(yD, D′) the average prediction error of the model yD over some set D′ (possibly different from the learning set D used to produce the model),

$$E(y_D, D') = \frac{1}{N'} \sum_{(x,y) \in D'} L(y, y_D(x)),$$

where N′ is the size of D′. The first and simplest estimate of the expected prediction error over all observations is the training error

$$E_{\text{train}}(y_D) = E(y_D, D).$$

In general this is a poor estimate, since it does not account for new unseen observations; i.e. the training error is based only on observations used for building the model.

The second estimate is known as the test error, where the learning set D is divided into two disjoint sets Dtrain and Dtest called the training set and the test set. The test error is then defined as

$$E_{\text{test}}(y_D) = E(y_{D_{\text{train}}}, D_{\text{test}}).$$

The idea is to account for new unseen observations in Dtest, as opposed to the training error. As a rule of thumb, Dtrain is usually taken as 70% of the samples in D and Dtest as the remaining 30%, [17]. While being an unbiased estimate of the expected prediction error, the test error has the drawback that it reduces the effective sample size on which the model yDtrain is learned.


The optimal predictor is the model minimizing the expected prediction error. If L is the squared loss function, it is well known that the optimal predictor is given by y∗(x) = E[Y | X = x]. The k-nearest neighbor in (1.2) attempts to implement the optimal predictor directly from the training data, with the expectation approximated by an average and the conditioning relaxed to conditioning on some region "close" to the input.

Since the model yD depends on the given data D, a more natural error function than the expected prediction error is the expected generalisation error.

2.2 Definition: The expected generalization error at input x is defined as

$$\mathrm{Err}(Y_D(x)) = E_x[L(Y, Y_D(x))] = E_x\big[E_x[L(Y, Y_D(x)) \mid D]\big].$$

Hence, the expected generalisation error takes into account the randomness in the learning set D. A commonly used estimate of the expected generalisation error over all observations is the K-fold cross validation (CV) error, where the learning set D is randomly divided into K disjoint sets D1, D2, ..., DK. The K-fold cross validation error is then defined as

$$E_{\text{CV}}(y_D) = \frac{1}{K} \sum_{k=1}^{K} E(y_{D \setminus D_k}, D_k).$$

The idea is that, since each model $y_{D \setminus D_k}$ is built using almost all of D, it should be close to the model yD learned on the entire learning set. As a result the estimates $E(y_{D \setminus D_k}, D_k)$ should also all be close to the expected generalisation error. The number of folds, K, is usually fixed to 10, and when K = N the estimate is also known as the leave-one-out cross validation error (LOOCV); denote this by E_LOOCV(yD). A second estimate of the expected generalisation error is the bootstrap error, where B different training sets, $D^{(1)}_{\text{train}}, D^{(2)}_{\text{train}}, \ldots, D^{(B)}_{\text{train}}$, are created from D by the bootstrap method [26]: a bootstrap replicate is created by taking N samples with replacement from D, with each observation having probability 1/N of being selected at each turn. The B bootstrap replicates then serve as samples from the random learning set D. The estimate is then defined as

$$E_{\text{boot}}(y_D) = \frac{1}{B} \sum_{b=1}^{B} E(y_{D \setminus D_b}, D_b).$$

Consider the squared loss function. Then the expected generalisation error can be decomposed as follows:

$$
\begin{aligned}
E_x[L(Y, Y_D(x))] &= E_x[Y^2] + E_x[Y_D(x)^2] - 2E_x[Y\,Y_D(x)] \\
&= \mathrm{Var}_x[Y] + E_x[Y]^2 + \mathrm{Var}_x[Y_D(x)] + E_x[Y_D(x)]^2 - 2E_x[Y\,Y_D(x)] \\
&= \sigma^2 + \mathrm{Var}_x[Y_D(x)] + f(x)^2 + E_x[Y_D(x)]^2 - 2f(x)E_x[Y_D(x)] \\
&= \sigma^2 + \mathrm{Var}_x[Y_D(x)] + \big(E_x[Y_D(x)] - f(x)\big)^2 \\
&= \sigma^2 + \mathrm{Var}_x[Y_D(x)] + \mathrm{Bias}(Y_D(x))^2, \qquad (2.1)
\end{aligned}
$$

where $E_x[Y\,Y_D(x)] = f(x)E_x[Y_D(x)]$, since ε was assumed to be independent of all previous error terms. This decomposition is referred to as the bias-variance decomposition. The variance σ² is irreducible, but the bias and variance of YD(x) depend on how complex the model is. In other words, there is a trade-off between bias and variance governed by the model complexity. For the k-nearest neighbor, the number of neighbors, k, can be seen as the complexity.


Notice that in order to minimize the expected generalization error it suffices to minimize the expected prediction error and then average the result over all learning sets.

Assume now that Y is a categorical random variable and let L(y, yD(x)) = 0 if y = yD(x) and L(y, yD(x)) = 1 otherwise; this is the 0/1 loss function. Then

$$\mathrm{Err}(y_D(x)) = \sum_{y \in \mathcal{Y}} L(y, y_D(x)) P(Y = y \mid X = x).$$

Since yD(x) can only attain a single value in Y, we have L(y, yD(x)) = 1 for all y except y = yD(x). Thus

$$\mathrm{Err}(y_D(x)) = 1 - P(Y = y_D(x) \mid X = x). \qquad (2.2)$$

The optimal model is the one minimizing (2.2), hence

$$y^*(x) = \arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x).$$

This solution is also known as the Bayes classifier, and the expected prediction error of the Bayes classifier is known as the Bayes rate. Again we see that the k-nearest neighbor in (1.1) attempts to implement the Bayes classifier. However, the bias-variance decomposition does not automatically extend to the classification problem, where the 0/1 loss function is usually applied. In analogy with the bias-variance decomposition for the squared error loss, similar decompositions have been proposed in the literature for the expected generalization error based on the 0/1 loss function. By redefining the concepts of bias and variance of a model, [10] provided a unified bias-variance decomposition for both the squared and 0/1 loss functions.

2.1 A Unified Bias-Variance Decomposition

In this section, the dependency on x in the model yD(x) is suppressed when it is clear that x is the input used to predict y from yD(x). In order to generalize the bias-variance decomposition, the following definitions are needed.

2.3 Definition: Let y∗ be the optimal predictor for the expected prediction error. Then a model yD at x has

i) main prediction defined as $y_{\text{main}} = \arg\min_{y'} E_x[L(Y_D, y')]$,

ii) bias defined as $B(x) = L(y^*, y_{\text{main}})$,

iii) variance defined as $V(x) = E_x[L(Y_D, y_{\text{main}})]$,

iv) and noise defined as $N(x) = E_x[L(Y, y^*)]$.

In words, the main prediction of a model at x is the value whose average loss relative to all other predictions of Y, based on x, is minimal. For the squared loss function we have

$$y_{\text{main}} = \arg\min_{y'} \; E_x[Y_D^2] + (y')^2 - 2y'E_x[Y_D],$$

which is minimized by setting $y_{\text{main}} = E_x[Y_D]$. That is, ymain is the mean of the predictions. For the 0/1 loss function

$$y_{\text{main}} = \arg\max_{y'} P_x(Y_D = y'),$$

which is minimized when ymain is the most frequent prediction of Y, i.e. the mode. Hopefully most of the predictions yD are identical, since we want the prediction of y, based on x, to be the same regardless of the given data set. The bias is the loss incurred by the main prediction relative to the optimal prediction, and the variance is the average loss incurred by predictions relative to the main prediction. Finally, the noise is the unavoidable component of the loss, where Y = y is the true label of x. Bias typically occurs when a model is underfitted and variance occurs when a model is overfitted. The following result justifies the definitions in Definition 2.3.

2.1 Proposition: For the squared loss function the expected generalization error of yD at x decomposes as

$$E_x[L(Y, Y_D(x))] = c_1 N(x) + B(x) + c_2 V(x), \qquad (2.3)$$

with c₁ = c₂ = 1.

Proof: When L is the squared loss function the optimal prediction is given by y∗ = Ex[Y] and ymain = Ex[YD]. By inserting these into equation (2.3) it follows that

$$\mathrm{Err}(Y_D(x)) = E_x[(Y - E_x[Y])^2] + (E_x[Y] - E_x[Y_D])^2 + E_x[(Y_D - E_x[Y_D])^2],$$

corresponding to the decomposition in (2.1).

The decomposition in (2.3) also holds for the 0/1 loss function in the binary classification problem, only with different values of c₁ and c₂. From now on, if not otherwise stated, the loss function is implicitly the 0/1 loss function and Y is a discrete set of values/labels.

2.2 Proposition: In the binary classification problem, i.e. |Y| = 2, (2.3) holds with c₁ = 2Px(YD = y∗) − 1 and c₂ = 1 if B(x) = 0 (ymain = y∗) and c₂ = −1 otherwise.

In order to prove Proposition 2.2, the following Lemma is needed.

2.3 Lemma: In the binary classification problem

i) Ex[L(Y, YD) | D is fixed] = L(yD, y∗) + c₀N(x) with c₀ = 1 if yD = y∗ and c₀ = −1 otherwise.

ii) Ex[L(YD, y∗)] = B(x) + c₂V(x) with c₂ = 1 if ymain = y∗ and c₂ = −1 otherwise.

Proof:

i) The result is trivial for yD = y∗. Assume now that yD ≠ y∗. Then y = yD implies that y ≠ y∗ and vice versa. Hence

$$
\begin{aligned}
E_x[L(Y, Y_D) \mid D \text{ is fixed}] &= P_x(Y \neq y_D) \\
&= 1 - P(Y \neq y^* \mid X = x) \\
&= L(y_D, y^*) + c_0 E_x[L(Y, y^*)] \\
&= L(y_D, y^*) + c_0 N(x),
\end{aligned}
$$

with c₀ = −1.

ii) The result is trivial for ymain = y∗. Assume that ymain ≠ y∗; then yD = y∗ implies that yD ≠ ymain and vice versa. Hence

$$
\begin{aligned}
E_x[L(Y_D, y^*)] &= 1 - P_x(Y_D = y^*) \\
&= L(y_{\text{main}}, y^*) + c_2 E_x[L(Y_D, y_{\text{main}})] \\
&= B(x) + c_2 V(x),
\end{aligned}
$$

with c₂ = −1.


Proof (of Proposition 2.2): Using Lemma 2.3 it follows that

$$
\begin{aligned}
E_x[L(Y, Y_D)] &= E_x[E_x[L(Y, Y_D) \mid D]] \\
&= E_x[L(Y_D, y^*) + c_0 N(x)] \\
&= E_x[c_0] N(x) + B(x) + c_2 V(x),
\end{aligned}
$$

where

$$E_x[c_0] = P_x(Y_D = y^*) - P_x(Y_D \neq y^*) = 2P_x(Y_D = y^*) - 1 = c_1,$$

which ends the proof.

Interestingly, the variance is additive when B(x) = 0 but subtractive when B(x) = 1 for the 0/1 loss in binary classification. Thus, if a model is biased at x, increasing the variance decreases the loss. This behaviour is substantially different from that of the squared loss function. However, it helps explain how highly unstable models like decision trees, discussed in Chapter 3, can produce excellent results in practice, even given very limited quantities of data. We can define the net variance

$$V_n(x) = V_u(x) - \kappa(x) V_b(x),$$

where Vu(x) = (1 − B(x))V(x) is the unbiased variance, Vb(x) = B(x)V(x) the biased variance and κ(x) = 1 in the binary case. Notice that Vn(x) = c₂V(x). The net variance takes into account the combined effect of the unbiased and biased variances.

The decomposition in (2.3) and the net variance decomposition are also valid in multiclass problems, for specific coefficients c₁, c₂ and κ(x).

2.4 Proposition: In the multiclass problem, i.e. |Y| > 2, (2.3) holds with

$$c_1 = P_x(Y_D = y^*) - P_x(Y_D \neq y^*)\, P_x(Y = y_D \mid Y \neq y^*)$$

and c₂ = 1 if B(x) = 0 (ymain = y∗) and c₂ = −Px(YD = y∗ | YD ≠ ymain) otherwise.

The proof of Proposition 2.4 is similar to that of Proposition 2.2; for details, see [10]. The relationship Vn(x) = c₂V(x) = Vu(x) − κ(x)Vb(x) is also valid in the multiclass problem, only with κ(x) = Px(YD = y∗ | YD ≠ ymain). From Proposition 2.4 it is seen that the expected generalization error is only reduced by a proportion, κ(x), of the variance when B(x) = 1 in the multiclass problem.

The expected generalisation error is defined for a single observation in the population. The generalization to the entire population is given by the expected loss

$$E\big[E_X[L(Y, Y_D(X))]\big] = c_1 E[N(X)] + E[B(X)] + E[V_n(X)],$$

where E[N(X)] is the average noise, E[B(X)] the average bias, E[Vu(X)] the average unbiased variance, E[Vb(X)] the average biased variance and E[Vn(X)] the average net variance. The expected loss is estimated by either ECV(yD) or Eboot(yD), as discussed in the beginning of the chapter.

In summary, variance hurts when B(x) = 0, but it helps when B(x) = 1. Nonetheless, to obtain a low overall expected loss, we want the bias to be small, hence we seek to reduce both the bias and the unbiased variance. Ultimately, the combined effect of the variances, the net variance, is the most interesting source of variance.

2.2 Implementing the Bias-Variance Decomposition

To my knowledge, no R [20] interface for the bias-variance decomposition described in Section 2.1 exists. I have written the R package ClassifyR [16], freely available on GitHub via https://github.com/Lindskou/ClassifyR, that implements the bias-variance decomposition for the 0/1 loss function. In the current version, 1.0.1, ClassifyR depends on [29, 2, 11] and suggests [15, 25, 27], depending on the choice of model.

The main function in ClassifyR is bivar (for "bias-variance"), which produces the bias-variance decomposition described in Section 2.1 under the assumption that N(x) = 0; that is, y∗ = y for all realizations (x, y) of (X, Y). It is easy to install the package using the library devtools [28] with the following commands

# If devtools is not installed

install.packages("devtools")

# Dependencies

install.packages(c("Rcpp", "tidyr", "dplyr", "doParallel", "ggplot2"))

devtools::install_github("Lindskou/ClassifyR")

# For more information on bivar

?bivar

Dependencies must be installed manually, since install_github does not act like install.packages, which installs packages from CRAN¹ including all dependencies automatically. The code for ClassifyR can also be seen in Appendix C. In the current version, the following models (called methods in bivar) are accessible:

  method        package       type
1 C5.0          C50           Tree
2 rpart         rpart         Tree
3 knn           class         Non-parametric method
4 J48           RWeka         Tree (C4.5)
5 bagging       adabag        Ensemble
6 randomForest  randomForest  Ensemble
7 AdaBoost      internal      Ensemble

However, bivar makes heavy use of R's closures, see Section C.5 in Appendix C. Closures are functions created by other functions. In bivar, closures allow for a streamlined syntax for all methods, reducing the number of lines needed to program bivar and making it easier to extend bivar to include other methods in the future. One neat feature of ClassifyR is the function tuning_params, which prints important tuning parameters for a specific method; for example

> tuning_params("rpart")
  parameter  type     usage
1 minsplit   numeric  bivar(..., control = rpart.control(minsplit = 20))
2 cp         numeric  bivar(..., control = rpart.control(cp = 0.01))
3 maxdepth   numeric  bivar(..., control = rpart.control(maxdepth = 30))

See Appendix B for a vignette of ClassifyR. Assume we make the call bivar(form, method, data), where form is a formula of the form y ~ x1 + x2, indicating the output on the left and the inputs on the right of the tilde. The function then works by first dividing D (randomly) into a training set Dtrain and a test set Dtest, and then creating 100 different training sets, $D^{(1)}_{\text{train}}, D^{(2)}_{\text{train}}, \ldots, D^{(100)}_{\text{train}}$, from Dtrain by the bootstrap method. A model is then learned on each bootstrap replicate with the model specified in method, using parallel computing to reduce the run time considerably. For each bootstrap replicate a prediction for each element in Dtest is carried out. Thus we obtain the predictions $y^j_{D^{(i)}_{\text{train}}}$, for i = 1, 2, ..., 100 and j = 1, 2, ..., M with M = |Dtest|. For multiclass problems κ(x) is estimated for each xj in Dtest by the empirical probabilities

$$\kappa_j = \frac{1}{100} \sum_{i=1}^{100} \mathbf{1}\Big[\, y^j_{D^{(i)}_{\text{train}}} = y_j \;\Big|\; y^j_{D^{(i)}_{\text{train}}} \neq y^j_{\text{main}} \,\Big],$$

¹CRAN is short for the Comprehensive R Archive Network.


where yj is the true class value of observation xj and $y^j_{\text{main}}$ is the mode of all predictions of yj. Finally, the average bias, average unbiased variance, average biased variance, average net variance and the expected error are estimated as

$$
\begin{aligned}
b &= \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}\big[y_j \neq y^j_{\text{main}}\big] \\
v &= \frac{1}{100M} \sum_{i=1}^{100} \sum_{j=1}^{M} \mathbf{1}\Big[y^j_{D^{(i)}_{\text{train}}} \neq y^j_{\text{main}}\Big] \\
v_u &= \frac{1}{100M} \sum_{i=1}^{100} \sum_{j=1}^{M} \mathbf{1}\Big[y^j_{D^{(i)}_{\text{train}}} \neq y^j_{\text{main}} \text{ and } y_j = y^j_{\text{main}}\Big] \\
v_b &= \frac{1}{100M} \sum_{i=1}^{100} \sum_{j=1}^{M} \mathbf{1}\Big[y^j_{D^{(i)}_{\text{train}}} \neq y^j_{\text{main}} \text{ and } y_j \neq y^j_{\text{main}}\Big] \\
v_n &= v_u - v_{b,\kappa} \\
\text{error} &= b + v_n,
\end{aligned}
$$

respectively, where

$$v_{b,\kappa} = \frac{1}{100M} \sum_{i=1}^{100} \sum_{j=1}^{M} \kappa_j\, \mathbf{1}\Big[y^j_{D^{(i)}_{\text{train}}} \neq y^j_{\text{main}} \text{ and } y_j \neq y^j_{\text{main}}\Big].$$
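The estimators above reduce to simple counting once the 100 × M prediction matrix is available. The following minimal R sketch is my own illustration, not code from ClassifyR; preds, y_test and the helper mode_of are made-up names, and κ_j is fixed to 1 as in the binary case:

# preds:  a 100 x M character matrix; preds[i, j] is the prediction of test
#         observation j by the model learned on the i'th bootstrap replicate.
# y_test: a character vector of length M with the true classes.
estimate_bias_variance <- function(preds, y_test) {
  mode_of <- function(z) names(which.max(table(z)))       # main prediction = mode
  y_main  <- apply(preds, 2, mode_of)

  disagree <- sweep(preds, 2, y_main, FUN = "!=")          # 1[prediction != main]
  b   <- mean(y_test != y_main)                            # average bias
  v   <- mean(disagree)                                    # average variance
  v_u <- mean(sweep(disagree, 2, y_test == y_main, "&"))   # unbiased variance
  v_b <- mean(sweep(disagree, 2, y_test != y_main, "&"))   # biased variance (kappa = 1)
  v_n <- v_u - v_b                                         # net variance
  c(bias = b, var = v, v_u = v_u, v_b = v_b, v_n = v_n, error = b + v_n)
}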

Consider the data set iris with 150 observations (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The species are setosa, versicolor, and virginica, and the problem is to predict the species of a new unseen observation from the aforementioned variables. Using the k-nearest neighbor (method = "knn"), we must determine the value of k that minimizes the error. From Figure 2.1 we see that the error decreases as a function of k for k = 1, 2, ..., 15. Choosing k = 1 will most likely result in overfitting and cause high loss, as seen in Figure 2.1. The smallest error is obtained at k = 14.

Figure 2.1 is also part of the example code in bivar; type ?bivar in R to see the code.
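A call producing a decomposition like the one in Figure 2.1 might look as follows. This is only a sketch based on the bivar(form, method, data) signature described above; the way the neighborhood size is forwarded to the underlying knn method (here as k = 14) is an assumption about the interface and may differ from the actual package.

library(ClassifyR)

# Bias-variance decomposition of the 0/1 loss for 14-nearest neighbor on iris.
# The extra argument is assumed to be passed on to the underlying knn method.
res <- bivar(Species ~ ., method = "knn", data = iris, k = 14)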

Figure 2.1: Example of k-nearest neighbor applied via bivar for k = 1, 2, ..., 15; the curves show b, v, v_u, v_b, v_n and the error in %.

In the rest of the thesis, I will only plot the error, bias and net variance when using bivar, since these are ultimately the interesting quantities.


3 Decision Trees

I will, to some extent, employ the notation used in [17], which is based on the earlier work of Breiman, Friedman, Stone, and Olshen in the popular book Classification and Regression Trees [7].

Assume first that the output space Y = {y1, y2, ..., yJ} is a finite set of values and consider the classifier y : X → Y. Another way of looking at supervised learning problems is then to consider the input space as a union of disjoint sets

$$\mathcal{X} = \mathcal{X}^{(1)} \cup \mathcal{X}^{(2)} \cup \cdots \cup \mathcal{X}^{(J)},$$

where $\mathcal{X}^{(i)} = \{x \in \mathcal{X} \mid y(x) = y_i\}$. Learning a classifier can then be regarded as learning a partition of X matching some optimal partition as closely as possible. In regression problems the input space can be partitioned such that $\mathcal{X}^{(i)} = \{x \in \mathcal{X} \mid y(x) \in I_i\}$, where Ii is some interval and Ii ∩ Ij = ∅ for all i ≠ j. From a geometrical point of view, the principle of a tree-structured model is to recursively partition the input space X into subspaces such that the model predictions based on the terminal subspaces are optimal. To be more specific, the following definitions are introduced:

3.1 Definition: A tree is an undirected graph T = (V, E) in which any two nodes (or vertices) t1, t2 ∈ V are connected by exactly one path consisting of edges from E.

3.2 Definition: A rooted tree is a tree in which one node has been designated as the root, t0, and every edge is directed away from the root.

3.3 Definition: Consider a rooted tree. If there exists an edge from t1 to t2, the node t1 is said to be the parent of node t2 while t2 is said to be the child of node t1.

3.4 Definition: In a rooted tree, a node is said to be internal if it has one or more children and terminal if it has no children. Terminal nodes are also known as leaves. A rooted tree is called binary if all internal nodes have exactly two children.

3.5 Definition: The height of a tree is the number of edges on the longest path from the root to a leaf, and the depth of a node is the number of edges from the root to the node.

From now on, all trees discussed are rooted trees. With the above terms, a decision tree can be defined as a model y : X → Y represented by a tree, where any node t represents a subspace Xt ⊆ X of the input space, with the root t0 corresponding to X itself. In order to grow a tree, the subspaces of Xt are constructed using a split st dividing Xt into disjoint subspaces corresponding to each of its children. Each internal node is labelled with a split, which induces q questions of the form "x ∈ Xti" for i = 1, 2, ..., q if t1, t2, ..., tq are the children of node t. Denote by yt(x) the prediction of Y based on Xt. If t is a leaf we label it with yt(x). As an example, consider Figure 3.1 (left), which illustrates a binary decision tree y with 9 nodes of height 3, with Y = {y1, y2, y3, y4} and where the input space is partitioned as in Figure 3.1 (right). Node t0 is the root and the first split is conducted based on the question "X1 ≤ r1"; if this is true we go to node t1 and otherwise we go to node t2. This procedure is followed along all internal nodes until a leaf is reached.


Figure 3.1: Left: A binary decision tree y(x) of height 3, made of 9 nodes of which 4 are leaves. Right: A partition of a two-dimensional input space (axes X1 and X2) corresponding to the decision tree, with splits s0 = "X1 ≤ r1", s1 = "X2 ≤ r2", s2 = "X1 ≤ r3" and s6 = "X2 ≤ r4", producing the regions X^(1), X^(2), X^(3) and X^(4).

For example, at leaf t5 we predict the output of an observation with X1 ∈ (r1, r3] as y3, and at leaf t4 we predict the output of an observation with X1 ≤ r1 and X2 > r2 as y2. The recursive partitioning of the input space Xt0 is as follows:

$$
\begin{aligned}
\mathcal{X}_{t_0} &= \mathcal{X}_{t_1} \cup \mathcal{X}_{t_2} \\
&= (\mathcal{X}_{t_3} \cup \mathcal{X}_{t_4}) \cup (\mathcal{X}_{t_5} \cup \mathcal{X}_{t_6}) \\
&= (\mathcal{X}_{t_3} \cup \mathcal{X}_{t_4}) \cup \big(\mathcal{X}_{t_5} \cup (\mathcal{X}_{t_7} \cup \mathcal{X}_{t_8})\big),
\end{aligned}
$$

where

$$\mathcal{X}^{(1)} = \mathcal{X}_{t_3}, \quad \mathcal{X}^{(2)} = \mathcal{X}_{t_4}, \quad \mathcal{X}^{(3)} = \mathcal{X}_{t_5} \cup \mathcal{X}_{t_8} \quad \text{and} \quad \mathcal{X}^{(4)} = \mathcal{X}_{t_7}.$$

Learning a decision tree ideally amounts to determining the tree structure producing the partition which is closest to some optimal partition. However, such a partition is unknown, and the objective is instead to find a model that partitions the learning set D as well as possible (in the sense of best prediction accuracy of the model) over the observed outputs. There may exist many trees partitioning D equally well, and the convention is to apply Occam's razor principle¹ and search for the simplest solution (fewest internal nodes) while minimizing the expected generalisation error. This makes sense, since a smaller tree is easier to interpret than a high and complex tree. Also, for high trees the input space is partitioned into more subsets than it is for small trees, and thus high trees are prone to overfitting and high variance, as shown in Section 3.4.

Define, broadly for now, an impurity measure as a function that evaluates the goodness of any node t, and assume that the smaller the function value, the purer the node and the better the predictions yt(x) for all x where (x, y) ∈ Dt = {(x, y) ∈ D | x ∈ Xt}. Starting from the root (representing the entire learning set D), a greedy approach towards growing a tree is then to iteratively divide nodes into purer nodes until all leaves cannot be any purer or some stopping criterion is met. A pure node, say t, is one for which Xt only contains inputs x such that the corresponding output y is the same for all x (or y lies within a single range for regressors). Specific impurity functions are given in the following sections.

¹Occam's razor is a problem-solving principle attributed to William of Ockham (c. 1287-1347), who was an English Franciscan friar, scholastic philosopher, and theologian. His principle can be interpreted as stating that among competing hypotheses, the one with the fewest assumptions should be selected. In science, Occam's razor is used as a heuristic guide in the development of theoretical models.

In order to prevent high and complex trees, we need stopping rules telling us when to stop splitting nodes. Such stopping rules vary from algorithm to algorithm, but the most common ones are:

i) Set t as a leaf if t is pure.

ii) Set t as a leaf if Dt contains fewer than Nmin samples.

iii) Set t as a leaf if the depth dt is greater than or equal to a threshold dmax.

iv) Set t as a leaf if the total decrease in impurity is less than a fixed threshold β.

v) Set t as a leaf if there is no split such that the induced subsets corresponding to the children of t contain at least Nleaf samples each.

The above criteria are all user-defined, and choosing appropriate values is usually performed using a model selection procedure such as bivar or cross validation.
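For concreteness, the stopping rules above roughly correspond to the control parameters of common implementations. The following sketch (illustrative values of my own choosing; rpart is the CART implementation discussed in Section 3.2) shows the mapping onto rpart.control:

library(rpart)

ctrl <- rpart.control(
  minsplit  = 20,   # rule ii): do not split nodes with fewer than N_min samples
  minbucket = 7,    # rule v): every child must contain at least N_leaf samples
  cp        = 0.01, # rule iv): a split must decrease the lack of fit by at least beta
  maxdepth  = 30    # rule iii): no node may be deeper than d_max
)

fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)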

Another way of finding an optimal tree is that of pruning, where a tree is first grown as high as possible with no stopping rules other than leaves having a minimum number of observations Nmin. The procedure is then to remove the nodes, starting from the leaves, that degrade the expected generalisation error. An example of pruning is given in Section 3.1.3.

3.1 C4.5

The C4.5 algorithm is used to generate (possibly non-binary) classification trees; it is an extension of the earlier ID3 (Iterative Dichotomiser 3) invented by Ross Quinlan. C4.5 has several advantages over ID3, such as the ability to handle continuous input variables. The implementation of C4.5 in Weka² is accessible from R via the function J48 found in the RWeka package. Ross Quinlan went on to create C5.0, which improves on C4.5 in a number of ways, such as speed and memory usage. The R implementation of C5.0 is found in the C50 package via the function C5.0. Both C4.5 and C5.0 became very popular when Top 10 algorithms in data mining [30] was published in 2008.

For now, all input variables are assumed to be discrete. ID3, C4.5 and C5.0 all use Shannon entropy (or just entropy) as the impurity measure. The entropy of a discrete random variable Y is defined as

$$H(Y) = -\sum_{y} p(y) \log_2\big(p(y)\big),$$

where p(y) is the probability mass function of Y.³ The entropy can be seen as the expected value of the information −log₂(p(Y)), where 0 · log₂(0) is defined to be zero. The logarithm of the probability distribution is useful as a measure of information about the random output variable Y because it is monotone, additive for independent sources and non-negative. In a decision tree we seek to reduce the amount of information, or degree of surprise, about the output variable at each node, such that at node t, Dt contains as few unique values of the output as possible. This decision, of reducing the information, is justified by the following proposition.

3.1 Proposition: Assume Y takes on a finite number, J, of values. Then H(Y) ≤ log₂(J) with equality iff p(yj) = 1/J for j = 1, 2, ..., J. Furthermore, H(Y) has its minimum at the points (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1).

²Weka is a collection of statistical learning algorithms for data mining tasks, programmed in Java.
³Here the random variable Y is not necessarily an output as discussed in previous sections. However, the goal is in fact to consider the entropy of an output variable, which is the reason to keep this notation.


Proof: Since −log₂(·) is a convex function, we have from Jensen's inequality⁴

$$-\log_2\!\Big(\sum_{j=1}^{J} \frac{1}{p(y_j)}\, p(y_j)\Big) \;\leq\; \sum_{j=1}^{J}\Big(-\log_2\Big(\frac{1}{p(y_j)}\Big)\Big) p(y_j) = -H(Y).$$

The left-hand side equals −log₂(J), so H(Y) ≤ log₂(J). Since −log₂(·) is strictly convex, the inequality above is strict unless p(y1) = p(y2) = ··· = p(yJ) = 1/J. It is trivial to see that H(Y) is minimized at the points (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1).

In other words, H(Y) is maximized when p(y) is the uniform distribution, that is, when all outcomes are equally likely to occur. On the other hand, H(Y) is minimized when a single outcome of Y is certain. A related quantity is the conditional entropy, defined as

$$H(Y \mid X) = \sum_{x} p(x) H(Y \mid X = x),$$

which quantifies the amount of information needed to describe the random variable Y given that the value of another random variable X is known. We can write the conditional entropy as

$$H(Y \mid X) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log\big(p(y \mid x)\big) = \sum_{x}\sum_{y} p(x, y) \log\Big(\frac{p(x)}{p(x, y)}\Big).$$

3.2 Proposition: For discrete random variables Y and X it holds that H(Y | X) ≤ H(Y).

Proof: Using Jensen's inequality we have by direct calculation

$$
\begin{aligned}
H(Y \mid X) &= \sum_{x}\sum_{y} p(x, y) \log\Big(\frac{p(x)}{p(x, y)}\Big) = \sum_{y} p(y) \sum_{x} \frac{p(x, y)}{p(y)} \log\Big(\frac{p(x)}{p(x, y)}\Big) \\
&\leq \sum_{y} p(y) \log\Big(\sum_{x} \frac{p(x, y)}{p(y)} \cdot \frac{p(x)}{p(x, y)}\Big) \\
&= \sum_{y} p(y) \log\Big(\frac{1}{p(y)}\Big) = H(Y),
\end{aligned}
$$

which finishes the proof.

Define by

$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$

the information gain in Y given that we know X. In the context of decision trees, an input variable X and an output Y are typically dependent, hence the entropy of Y becomes smaller if we learn the value of X. In the worst case X and Y are independent and H(Y) = H(Y | X), which implies that X gives no information about Y. In all other cases we have by Proposition 3.2 that H(Y) − H(Y | X) ≥ 0, and in particular if H(Y | X) is much smaller than H(Y) the information gain will be large, implying that X contains much information about Y.

⁴For a convex function f it holds that f(E[Z]) ≤ E[f(Z)] for any random variable Z.


Denote by

$$p(y_j; D_t) = \frac{|\{(x, y) \in D_t \mid y = y_j\}|}{|D_t|} \qquad (3.1)$$

the empirical probability of the outcome yj in node t based on Dt, and denote by

$$H(y; D_t) = -\sum_{j=1}^{J} p(y_j; D_t) \log_2\big(p(y_j; D_t)\big)$$

the empirical estimate of H(Y) in node t based on Dt. In order to minimize H(y; Dt) we must recursively partition the input space such that each partition contains as few unique values of the output variable as possible. Notice that H(y; Dt) also has its minimum at the points (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1).

In C4.5, we split a node t according to the i'th input variable based on the improvement in entropy, represented by the information gain, which is estimated by

$$IG(y, i; D_t) = H(y; D_t) - \sum_{v \in S_i} \frac{|D_t^{i,v}|}{|D_t|} H(y; D_t^{i,v}),$$

where $D_t^{i,v} = \{(x, y) \in D_t \mid x_i = v\}$.⁵ The term

$$\sum_{v \in S_i} \frac{|D_t^{i,v}|}{|D_t|} H(y; D_t^{i,v})$$

is the estimated conditional entropy of Y given Xi based on Dt. This estimate is also an average of the entropy in subsequent nodes if we choose to split by input i. In the extreme case where $H(y; D_t) = H(y; D_t^{i,v})$ for all v ∈ Si, the information gain is zero, and thus the i'th input will not reduce the entropy in subsequent nodes, corresponding to a situation in which Y and Xi are independent. For each node we seek the input maximizing the information gain.

Since the information gain favours inputs with many classes, Ross Quinlan suggested to replace the information gain with the gain ratio, defined as

$$GR(y, i; D_t) = \frac{IG(y, i; D_t)}{H(x_i; D_t)},$$

where H(xi; Dt) is the empirical estimate of the entropy of the random variable Xi in node t based on Dt. The rationale of the gain ratio is as follows. Suppose that Si is a discrete set with the same number of elements as Dt. Then Xi is ultimately the best input to split on, since it divides the entire data into |Dt| values. At the same time H(xi; Dt) will be maximized, and dividing by this quantity in the gain ratio penalizes such inputs with many classes.

The goal of C4.5 is then to ask the right questions such that the gain ratio is maximized along the tree. Algorithm 3.1 contains the pseudo code for the discrete version of C4.5, and the steps are explained in the following, succeeded by an illustrative example. First an empty tree is created, and if D is pure or some other stopping criterion is met, we return the empty tree and the decision is trivial. If D is not pure, we seek the input leading to the highest gain ratio and create new data sets. We then call the algorithm recursively on these new data sets and attach nodes along the way. Finally, for each leaf t we predict an outcome based on x by yt(x), the majority class in Dt.

⁵Recall that Si denotes the i'th space used to construct the input space X.


Algorithm 3.1 C4.5 - Discrete version
Input: A data set D where all inputs are discrete

1:  Tree = {}
2:  if D is pure w.r.t. y OR other stopping criteria met then
3:      return Tree
4:  end if
5:  gains = {}
6:  for i = 1, 2, ..., p do
7:      gains = gains ∪ GR(y, i; D)
8:  end for
9:  Split on the i'th input variable if GR(y, i; D) = max(gains)
10: Create a decision node that tests x_i in the root of Tree
11: for all v ∈ S_i do
12:     D_v = all observations in D where x_i = v
13: end for
14: for all D_v do
15:     Tree_v = C4.5(D_v, y)
16:     Attach Tree_v to the corresponding node of Tree
17: end for
18: for all leaves t in Tree do
19:     The prediction y_t(x) is the majority class in D_t
20: end for
21: return Tree

3.1.1 Continuous Inputs and Outputs

Assume that Xi is a continuous random variable with observed values {xi1, xi2, ..., xiN}. There may be several of these observations having the same value. Let {v1, v2, ..., vm} be a sorted set of distinct observed values of Xi and define the j'th threshold by

$$u_j = \frac{v_{j+1} + v_j}{2},$$

for j = 1, 2, ..., m − 1. Then, for all thresholds we can split the data according to the question "Xi ≤ uj", which is either true or false. We split based on the threshold inducing the highest gain ratio [19]. For the entropy (or information gain), it is not necessary to examine all thresholds: if all observations where xi takes on the value vj or vj+1 belong to the same class of the output, the threshold uj cannot lead to a partition of the data that has the maximum gain ratio [13], which can speed up the running time considerably.

Another approach for splitting nodes on continuous inputs, perhaps sufficient in some situations, is to discretize the inputs into a given number of classes.
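A minimal R sketch of the threshold search (my own illustration, not code from the thesis; the helpers entropy_vec and find_best_threshold are made-up names, and the best threshold is chosen by information gain rather than gain ratio for brevity) could look like this:

# Empirical entropy of a class vector
entropy_vec <- function(y) {
  p <- table(y) / length(y)
  sum(ifelse(p == 0, 0, -p * log2(p)))
}

# Find the midpoint threshold u_j on a continuous input x that maximizes the
# information gain for the output y.
find_best_threshold <- function(x, y) {
  v  <- sort(unique(x))
  u  <- (head(v, -1) + tail(v, -1)) / 2   # midpoints u_j = (v_j + v_{j+1}) / 2
  ig <- sapply(u, function(thr) {
    left  <- y[x <= thr]
    right <- y[x >  thr]
    cond  <- (length(left) * entropy_vec(left) +
              length(right) * entropy_vec(right)) / length(y)
    entropy_vec(y) - cond                  # information gain of the split
  })
  u[which.max(ig)]
}

find_best_threshold(iris$Petal.Length, iris$Species)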

3.1.2 Golf Data Example

The golf data set, appearing in [19, 30], is summarised in Table 3.1. The problem is how to decide whether or not to play golf, given certain weather conditions. In this example I will use the inputs outlook, temp, humid and wind to predict play.

day  outlook   temp  humid   wind    play
1    sunny     hot   high    weak    no
2    sunny     hot   high    strong  no
3    overcast  hot   high    weak    yes
4    rain      mild  high    weak    yes
5    rain      cool  normal  weak    yes
6    rain      cool  normal  strong  no
7    overcast  cool  normal  strong  yes
8    sunny     mild  high    weak    no
9    sunny     cool  normal  weak    yes
10   rain      mild  normal  weak    yes
11   sunny     mild  normal  strong  yes
12   overcast  mild  high    strong  yes
13   overcast  hot   normal  weak    yes
14   rain      mild  high    strong  no

Table 3.1: golf data.

I have written the R functions entropy, IG and GR, as seen below, facilitating the calculation of the gain ratio:

entropy = function(D, target = "play") {
  if (nrow(D) == 0) return(0)
  p = table(D[[target]]) / nrow(D)
  sum(ifelse(p == 0, 0, -p * log2(p)))
}

IG = function(D, input, target = "play") {
  H_input = lapply(split(D, D[[input]]),
                   function(x) nrow(x) * entropy(x, target))
  H_input = sum(unlist(H_input)) / nrow(D)
  entropy(D, target) - H_input
}

GR = function(D, input, target = "play") {
  IG(D, input, target) / entropy(D, target = input)
}

Using these functions we find that

> GR(golf, input = "outlook", target = "play")

[1] 0.1564276

> GR(golf, input = "temp", target = "play")

[1] 0.01877265

> GR(golf, input = "humid", target = "play")

[1] 0.1518355

> GR(golf, input = "wind", target = "play")

[1] 0.04884862

Hence, we split the root according to outlook. Denote the induced sub-datasets by D_s, D_r and D_o, where the subscripts indicate "sunny", "rainy" and "overcast" respectively. From Table 3.1 it is seen that D_o is pure, and we decide to play golf when the weather is overcast. For D_s we have

> GR(D_s, input = "temp", target = "play")

[1] 0.3751495

> GR(D_s, input = "humid", target = "play")

[1] 1

> GR(D_s, input = "wind", target = "play")

[1] 0.02057066

and therefore we split on humid which implies a pure split. Finally, for D_r we have


> GR(D_r, input = "temp", target = "play")

[1] 0.02057066

> GR(D_r, input = "humid", target = "play")

[1] 0.02057066

> GR(D_r, input = "wind", target = "play")

[1] 1

which implies a pure split on wind. The final decision tree is depicted in Figure 3.2.

Figure 3.2: Decision tree obtained from C4.5 applied to the golf data. The root splits on outlook (sunny / overcast / rainy); the sunny branch splits on humid (normal: yes, high: no), the overcast branch predicts yes, and the rainy branch splits on wind (strong: no, weak: yes).
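The manual calculations above can be automated. The following recursive sketch is my own illustration built on the entropy, IG and GR functions defined above; build_c45 and the nested-list tree representation are ad hoc choices and not part of the thesis. For purely discrete inputs it grows the same tree:

# Recursively grow a C4.5-style tree; the result is a nested list where each
# leaf holds the majority class of its subset.
build_c45 <- function(D, inputs, target = "play") {
  majority <- names(which.max(table(D[[target]])))
  # Stop if the node is pure or no inputs remain
  if (length(unique(D[[target]])) == 1 || length(inputs) == 0)
    return(majority)

  ratios <- sapply(inputs, function(i) GR(D, input = i, target = target))
  best   <- inputs[which.max(ratios)]                # input with maximal gain ratio

  children <- lapply(split(D, D[[best]], drop = TRUE),   # one child per observed value
                     build_c45,
                     inputs = setdiff(inputs, best),
                     target = target)
  list(split = best, children = children)
}

tree <- build_c45(golf, inputs = c("outlook", "temp", "humid", "wind"))
str(tree)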

3.1.3 Tree Pruning

Small trees with few splits might lead to better interpretation and lower variance and prevent overfitting, at the cost of a little bias. To accomplish this, C4.5 uses pessimistic pruning. Assume we end up with E incorrect decisions out of N possible at a leaf. Then Ross Quinlan, in [19], regards e = E/N as the estimated probability of observing an error in one trial. It should be noted that Quinlan is very aware that it is not realistic to regard e in this way; however, he proceeds in this heuristic way after all. By doing so, an upper bound on the true probability, p, is given by the upper bound of the approximate Gaussian 100(1 − α)% confidence interval, defined as

$$e_{\max} = e + z_{\alpha/2} \frac{\sqrt{e(1-e)}}{\sqrt{N}},$$

where z_{α/2} is the 100(1 − α/2)'th percentile of the standard normal distribution. If z_{α/2} = 0.67 then α = 0.50 and p < e_max with probability 0.75, which is the default in C4.5. Pessimistic pruning is best explained by an example. Consider the tree in Figure 3.3, where (+j, −k) indicates that j observations are predicted correctly and k are predicted falsely at the leaf under consideration. Let e₁ = 6/13, e₂ = 3/7 and e₃ = 3/5 be the empirical errors at the leaves. Then

$$
\begin{aligned}
e_{1,\max} &= e_1 + 0.67 \times \frac{\sqrt{e_1(1-e_1)}}{\sqrt{13}} = 0.5569 \\
e_{2,\max} &= e_2 + 0.67 \times \frac{\sqrt{e_2(1-e_2)}}{\sqrt{7}} = 0.5576 \\
e_{3,\max} &= e_3 + 0.67 \times \frac{\sqrt{e_3(1-e_3)}}{\sqrt{5}} = 0.7511
\end{aligned}
$$

and the average of these upper bounds is 0.6219.


Figure 3.3: Example of pessimistic pruning. Left: the subtree at internal node t with leaves labelled (+7; −6), (+4; −3) and (+2; −3). Right: the pruned tree, where node t is replaced by a single leaf y_prune with (+13; −12).

On the other hand, if the internal node t were replaced by a leaf having 13 correct predictions and 12 false, (+13; −12), the upper bound of the error on this single leaf would be 0.5489, and we therefore decide to remove the three leaves and replace them with a single leaf. The estimate y_prune is the most frequent class among the remaining 25 observations.
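The upper bound itself is a one-liner. Below is a minimal sketch of my own with z = 0.67 as above; the numbers it produces are close to, but may differ slightly from, the values quoted in the example, depending on rounding and on the exact correction used by C4.5.

# Pessimistic upper bound e_max = e + z * sqrt(e * (1 - e)) / sqrt(N)
e_max <- function(E, N, z = 0.67) {
  e <- E / N
  e + z * sqrt(e * (1 - e)) / sqrt(N)
}

leaf_bounds <- c(e_max(6, 13), e_max(3, 7), e_max(3, 5))
mean(leaf_bounds)   # average bound for the three leaves
e_max(12, 25)       # bound if the subtree is replaced by a single leaf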

3.2 CART

Classification and regression trees (CART) [7], invented by Breiman, Friedman, Stone, and Olshen, is another popular tree-based model. For classification, CART works in much the same way as C4.5. By default it uses the Gini index

$$G(Y) = 1 - \sum_{y \in \mathcal{Y}} p(y)^2$$

as the impurity measure; one can choose the entropy instead. The Gini index has its minimum and maximum at the same points as the entropy, and in [21] it was found that in only 2% of the cases in a large number of studies did the Gini index and the entropy lead to different decisions. Another difference is that CART only uses binary splits.
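For comparison with the entropy function from Section 3.1.2, an empirical Gini index can be written analogously. This is a small sketch of my own, reusing the same data-frame convention:

# Empirical Gini index of the output in a data set D, analogous to entropy()
gini = function(D, target = "play") {
  if (nrow(D) == 0) return(0)
  p = table(D[[target]]) / nrow(D)
  1 - sum(p^2)
}

gini(golf)   # impurity of the root node of the golf data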

For a categorical input variable with many classes, CART faces significantly larger computational challenges in finding a good split than for continuous inputs. To find an exact optimal binary split for a categorical input with $q$ classes, CART needs to consider $2^{q-1}-1$ splits: observe that we can assign the $q$ distinct values to the left and right nodes in $2^q$ ways, and two of these configurations lead to an empty node, which we disregard. Finally we divide by 2, i.e. $(2^q-2)/2$, since all values moved to the left could as well have been moved to the right and vice versa. Thus the computations become prohibitive for large $q$. However, for binary outputs $y \in \{0, 1\}$ this computation simplifies. Assume we have five input classes $x_1, x_2, x_3, x_4$ and $x_5$, and let $\eta$ be such that $\eta(x_j)$ is the number of observations where the $j$'th input class has output label 1. Also, denote by $n(x_j)$ the number of observations having $x = x_j$, and order the input classes according to the fractions $\pi(x_j) := \eta(x_j)/n(x_j)$. Assume that $\pi(x_3) \le \pi(x_5) \le \pi(x_1) \le \pi(x_2) \le \pi(x_4)$. Then the optimal split, among all $2^4 - 1 = 15$ possible, is one of the following:
$$\{x_3\} \mid \{x_5, x_1, x_2, x_4\}, \quad \{x_3, x_5\} \mid \{x_1, x_2, x_4\}, \quad \{x_3, x_5, x_1\} \mid \{x_2, x_4\}, \quad \{x_3, x_5, x_1, x_2\} \mid \{x_4\}.$$
Thus the number of splits to consider is reduced from 15 to 4, and in general this result reduces the number of splits to consider to $q - 1$. The proof is given in [7]; it is approximately five pages long, and I omit it here. For non-binary outputs, no such simplification is possible, although various approximations have been proposed. The ordering trick is easy to sketch in R, as shown below.
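The function below is only a sketch: it enumerates the $q-1$ candidate left-node sets obtained by ordering the categories after their fraction of label-1 outcomes, and leaves the impurity evaluation of each candidate to the reader.

candidate_splits <- function(x, y) {        # x: factor input, y: 0/1 output
  frac1 <- tapply(y, x, mean)               # pi(x_j) = eta(x_j) / n(x_j)
  lv    <- names(sort(frac1))               # categories ordered by pi
  lapply(seq_len(length(lv) - 1), function(k) lv[1:k])
}

x <- factor(c("x1", "x1", "x2", "x3", "x3", "x4", "x5", "x5"))
y <- c(1, 0, 1, 0, 0, 1, 0, 1)
candidate_splits(x, y)                      # q - 1 = 4 candidate left-node sets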


For regression problems, CART works by finding the best binary partition in terms of minimum sum of squares using a greedy algorithm, which is very fast. Finally, CART prunes using a cost-complexity model whose parameters are estimated by cross validation. For more details on CART, see [7].

Using the CART algorithm from the package rpart with $N_{leaf} = 1$ on the golf data introduced in Subsection 3.1.2, we obtain the binary decision tree seen in Figure 3.4, in which we see that outlook is used in a binary split, as opposed to the tree in Figure 3.2. Hence, CART is allowed to use the categorical input outlook more than once, which is not allowed in C4.5 for obvious reasons. The tree in Figure 3.4 is quite complex for a dataset with 14 observations, and this is only due to the stopping rule $N_{leaf} = 1$; if this is not set, the algorithm produces an empty tree in which the decision is always to play golf. The default value of $N_{leaf}$ is 7.
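A sketch of the corresponding rpart call is given below; the golf data frame and its variable names (play, outlook, humid, wind) are assumptions, and $N_{leaf}$, $N_{min}$ and CP correspond to rpart's minbucket, minsplit and cp.

library(rpart)

fit <- rpart(play ~ outlook + humid + wind,
             data    = golf,
             method  = "class",
             control = rpart.control(minbucket = 1,   # N_leaf
                                     minsplit  = 3,   # N_min = 3 * N_leaf
                                     cp        = 0.01))
plot(fit); text(fit)    # draw the fitted tree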

[Figure: a binary tree with repeated splits on outlook, humid and wind; leaves are labelled yes/no.]
Figure 3.4: Decision tree obtained from CART applied to the golf dataset.


3.3 Consistency

Let $y^*$ denote the optimal model in either classification or regression, defined in Chapter 2, and denote by $y_D$ some model dependent on the learning set $D$.

3.6 Definition: A model $y_D$ is said to be consistent if
$$\mathrm{Err}(Y_D(x)) \to \mathrm{Err}(y^*(x))$$
almost surely as the size $N$ of the learning set $D$ tends to infinity. That is, the expected generalisation error converges to the smallest possible error.

In [7], Leo Breiman showed that decision tree models constructed using the estimates in (3.1) are consistent.

3.4 Bias Variance Study of Decision Trees

In this section, I study the bias and variance behaviour of C5.0 decision trees, based on bivar, on six different data sets from the UCI Machine Learning Repository6. Information about the six data sets is given in Table 3.2; half of the data sets (Image, Ionosphere and Sonar) have fairly many inputs, and two of the data sets (Ecoli and Image) deal with non-binary classification.

Data set         Size   Inputs   Output classes
(a) Ecoli         336        7                8
(b) Image        2310       19                7
(c) Ionosphere    351       34                2
(d) Liver         345        6                2
(e) Diabetes      768        8                2
(f) Sonar         208       60                2

Table 3.2: Summary of six data sets from the UCI repository.

It is a common understanding in the field of statistical learning that bias is reduced and variance is increased when decision trees grow large. Using bivar on the data sets from Table 3.2 produces the six plots in Figure 3.5. The general trend in these plots agrees with this common understanding of the behaviour of bias and variance for decision trees: the bias decreases and the net variance increases, except for (e) Diabetes, where the net variance increases faster than the bias is reduced. This trend is an expression of overfitting. Notice, for example, that the test error in (f) Sonar is minimised at height 4, after which the error is neither decreasing nor increasing. Hence, applying the principle of Occam's razor to search for a simple, small and interpretable tree also has advantages in terms of lowering the variance of future predictions, at least for this particular data set.

6The UCI Machine Learning Repository is a large collection of data sets used in many academic fields, where statistical learning (or machine learning) has been applied to understand data. The repository is found at https://archive.ics.uci.edu/ml/datasets.html
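The following is a rough sketch of how such a study can be set up; it is not the bivar interface. It repeatedly draws bootstrap learning sets, fits a C5.0 tree, and summarises the 0/1 bias and variability of the predictions on a fixed test set, with the observed label standing in for the optimal prediction $y^*$ and without the net-variance bookkeeping of the unified decomposition. A factor column Class is an assumed name.

library(C50)

estimate_bias_variance <- function(data, test_idx, n_rep = 100) {
  test  <- data[test_idx, ]
  pool  <- data[-test_idx, ]
  preds <- replicate(n_rep, {
    boot <- pool[sample(nrow(pool), replace = TRUE), ]   # bootstrap learning set
    fit  <- C5.0(Class ~ ., data = boot)
    as.character(predict(fit, test))
  })
  main <- apply(preds, 1, function(p) names(which.max(table(p))))   # main prediction
  c(bias     = mean(main != test$Class),                  # 0/1 bias (label as stand-in for y*)
    variance = mean(preds != main),                       # average deviation from the main prediction
    error    = mean(preds != as.character(test$Class)))   # average 0/1 loss
}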


[Figure: six panels plotting bias (b), test error and net variance (v_n), in %, against tree height.]
Figure 3.5: Six plots using bivar where "Height" is the height of the respective trees. (a) Ecoli. (b) Image. (c) Ionosphere. (d) Liver. (e) Diabetes. (f) Sonar.


4 Bagging

Bagging is a method for generating multiple versions of a model, the base learners (not necessarily tree-structured), and using these to obtain an aggregated model; such procedures are also known as ensemble learning. Let the notation be as in Chapter 2. The method was developed by Leo Breiman [5], and the idea is to replace $y_D(x)$ by the main prediction introduced in Section 2.1. That is, if $Y$ is numerical, $y_D(x)$ is replaced with the expected value
$$y_A(x) = \mathrm{E}_x[Y_D(x)],$$
where the subscript $A$ denotes "aggregation". If instead $Y$ is discrete, taking values in $\{1, 2, \ldots, J\}$, we define
$$y_A(x) = \arg\max_j P_x(Y_D(x) = j)$$
as the aggregated prediction at input $x$. In practice, $y_A$ is estimated with the bootstrap method described in Chapter 2, hence the name bagging (bootstrap aggregating). It is a relatively easy way to improve an existing method, since all that needs to be added is a loop over the bootstrap replicates. What one loses, with trees, is a simple and interpretable structure; what one gains is increased accuracy. It should be noted that in practice one uses an empirical estimate of $y_A$. Bagging is available in R through the popular package adabag [1]; the name adabag is a concatenation of "AdaBoost" and "Bagging", where AdaBoost is another ensemble method introduced in Chapter 6.
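The loop itself is short; the sketch below bags rpart trees by hand and aggregates with a majority vote (data is assumed to hold a factor response). The packaged equivalent is adabag::bagging(y ~ ., data, mfinal = B).

library(rpart)

bag_trees <- function(formula, data, newdata, B = 100) {
  votes <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]    # bootstrap replicate
    fit  <- rpart(formula, data = boot, method = "class")
    as.character(predict(fit, newdata, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v)))) # most frequent class
}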

4.1 Regression

Let $Y$ be a continuous random output variable and assume that $L$ is the squared loss function. Using Jensen's inequality it follows that
$$
\begin{aligned}
\mathrm{E}_x[L(Y, Y_D(x))] &= \mathrm{E}_x[Y^2] + \mathrm{E}_x[Y_D(x)^2] - 2\,\mathrm{E}_x[Y]\,\mathrm{E}_x[Y_D(x)]\\
&\ge \mathrm{E}_x[Y^2] + \big(\mathrm{E}_x[Y_D(x)]\big)^2 - 2\,\mathrm{E}_x[Y]\,\mathrm{E}_x[Y_D(x)]\\
&= \mathrm{E}_x[L(Y, y_A(x))],
\end{aligned}
$$
showing that the expected generalisation error of the bagged model $y_A$ never exceeds that of $Y_D$. Moreover, since averaging leaves the bias unchanged, the expected generalisation error is improved through variance reduction, due to the bias-variance decomposition in (2.1), leading to more stable predictions.

4.2 Classification

Denote by
$$Q(j\,|\,x) = P_x(Y_D(x) = j)$$
the probability that the model $Y_D$ predicts class $j$ at input $x$. Assuming that $Y$ and $Y_D$ are conditionally independent given $X$, the probability of correct classification at $x$ is given by
$$P_x(Y = Y_D(x)) = \sum_{j=1}^J Q(j\,|\,x)\,P(j\,|\,x),$$


where $P(j\,|\,x)$ is the probability that $Y$ equals $j$ given the input $x$. Using this, the overall probability of correct classification is
$$R(y_D) = \int\Big[\sum_{j=1}^J Q(j\,|\,x)\,P(j\,|\,x)\Big]P(dx),$$

where $P(dx)$ is the probability distribution of $X$. Notice that
$$\sum_{j=1}^J Q(j\,|\,x)\,P(j\,|\,x) \le \Big[\sum_{j=1}^J Q(j\,|\,x)\Big]\max_i P(i\,|\,x) = \max_i P(i\,|\,x),$$
with equality if
$$Q(j\,|\,x) = \begin{cases} 1 & \text{if } j = \arg\max_i P(i\,|\,x) \\ 0 & \text{otherwise,} \end{cases}$$
and thus the Bayes classifier $y^*(x) = \arg\max_j P(j\,|\,x)$ attains the highest possible probability of correct classification
$$R(y^*) = \int \max_j P(j\,|\,x)\,P(dx),$$

as expected. Although the above argument also follows by minimisation of (2.2), since $R(y_D) = 1 - \mathrm{Err}(Y_D)$, it gives a convenient notation facilitating the following arguments.

Call $Y_D$ order-correct at $x$ if
$$\arg\max_j Q(j\,|\,x) = \arg\max_j P(j\,|\,x).$$
An order-correct predictor is not necessarily an accurate predictor, although, by Definition 2.3, it is unbiased, i.e. $B(x) = 0$, since $y^* = y_{main}$ with $y^*$ playing the role of the Bayes classifier and $y_{main} = \arg\max_j Q(j\,|\,x)$. Suppose, for a binary classification problem, that $P(1\,|\,x) = 0.9$, $P(2\,|\,x) = 0.1$, $Q(1\,|\,x) = 0.6$ and $Q(2\,|\,x) = 0.4$. Then the predictor used to construct $Q$ classifies correctly at $x$ with probability $0.6\cdot 0.9 + 0.4\cdot 0.1 = 0.58$, whereas $y^*(x)$ classifies correctly with probability 0.90.

The aggregated predictor is $y_A(x) = \arg\max_j Q(j\,|\,x)$, and the probability of correct classification at $x$ is
$$\sum_{j=1}^J 1[y_A(x) = j]\,P(j\,|\,x). \tag{4.1}$$

If $Y_D$ is order-correct at $x$, (4.1) equals $\max_j P(j\,|\,x)$. Letting $C$ be the set of all inputs $x$ for which $Y_D$ is order-correct, the overall probability of correct classification of $y_A$ becomes
$$R(y_A) = \int_{x\in C}\max_j P(j\,|\,x)\,P(dx) + \int_{x\in C'}\Big[\sum_{j=1}^J 1[y_A(x) = j]\,P(j\,|\,x)\Big]P(dx),$$

where $C'$ is the complement of $C$. Even if $Y_D$ is order-correct at $x$, its probability of correct classification can be far from optimal, as discussed above; the aggregated predictor $y_A$, however, is optimal at such $x$, since its contribution to $R(y_A)$ there equals $\max_j P(j\,|\,x)$, the Bayes rate. If a predictor is good in the sense that it is order-correct for most inputs $x$, aggregation can thus transform it into a nearly optimal predictor. However, unlike in the numerical prediction situation, poor predictors can be transformed into worse ones. Bagging unstable classifiers with high variance $V(x) = \mathrm{E}_x[L(Y_D(x), y_{main})]$ usually improves them, but bagging stable classifiers is not a good idea, see [5]. Therefore, decision trees are usually taken as base learners.


5 Random Forests

As seen in Chapter 4, bagging is a technique for reducing the variance of an estimated prediction function, working especially well for high-variance/low-bias procedures such as trees. Random forests is an ensemble method used in tandem with bagging, lowering the variance even further.

5.1 Definition: A random forest is a classifier or regressor consisting of a collection of random tree-structured models $h(x;\theta_k, D_k)$, $k = 1, 2, \ldots$, where $\psi = \{\theta_k\}_{k=1,2,\ldots}$ is a family of i.i.d. hyper-parameter vectors drawn from a random vector $\Theta$ independent of the learning sets $D_k$. For classification, each tree casts a unit vote for the most popular class at input $x$, and for regression the tree predictions are averaged.

For random forests as introduced in [6] (the Forest-RI algorithm1) by Breiman, the hyper-parameters determine a random subset of the input variables at each node of the trees used to construct the random forest ensemble. Thus, $\theta_k$ defines the structure of the $k$'th random forest tree in terms of split variables etc. By injecting such randomness via $\Theta$ into the trees, the individual trees become de-correlated before they are aggregated into a single prediction function. This procedure leads to better prediction accuracy and lower variance in general, and in the following section I will argue why. The procedure of Breiman's Forest-RI algorithm is given in Algorithm 5.1. Notice that when $m_{try} = p$ the algorithm reduces to bagging with the CART methodology.

Algorithm 5.1 Forest-RI
Input: A learning set D of size N and a new input x
1: for b = 1, 2, ..., B do
2:    Draw a bootstrap sample Db of size N from D
3:    Grow a random-forest tree h(.; theta_b, Db) by the CART methodology on the bootstrapped data,
      without pruning, by recursively repeating the following steps for each leaf of the tree, until
      a minimum node size Nmin is reached:
        i)   Select mtry input variables at random from the p possible; i.e. draw theta_b from Theta
        ii)  Pick the best split among the mtry chosen input variables
        iii) Split the node according to the best split
4: end for
5: Construct the ensemble of trees {h(.; theta_b, Db)}_{b=1}^B and predict a new output at x by
$$y_{rf}(x) = \frac{1}{B}\sum_{b=1}^B h(x;\theta_b, D_b) \quad\text{(regression)} \qquad\text{or}\qquad y_{rf}(x) = \arg\max_j \sum_{b=1}^B 1[h(x;\theta_b, D_b) = j] \quad\text{(classification).}$$

Although Breiman's Forest-RI algorithm is commonly referred to as "random forests", there exist many other ensemble methods using tree-structured classifiers. Random forest methods mostly differ from each other in the way they introduce randomness via $\Theta$.

1RI is short for "random input".


Rotation Forests, introduced in [22], is another random forest method. For each bootstrap replicate, the set of inputs is randomly split into q subsets and principal component analysis (PCA) is applied to each subset. Pooling all principal components from the q subsets then gives a new dataset which is used by the individual trees in the ensemble. The rotation forest method has been shown to give results as good as, and sometimes better than, Forest-RI, see [22]. In terms of complexity, however, the computational overhead of the q PCAs should not be overlooked. Forest-RI is available in R from the package randomForest and Rotation Forest is available from rotationForest.
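A minimal sketch of a Forest-RI fit with the randomForest package follows; train, test and the factor response y are assumed names.

library(randomForest)

p  <- ncol(train) - 1                       # number of input variables
rf <- randomForest(y ~ ., data = train,
                   ntree    = 500,          # B, the number of trees
                   mtry     = floor(sqrt(p)),
                   nodesize = 1)            # grow the trees fully, no pruning
pred <- predict(rf, newdata = test)         # majority vote over the ensemble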

Throughout the rest of this chapter I will suppress the dependency on the learning set and write $h(x;\theta_k) := h(x;\theta_k, D_k)$ for the $k$'th random forest tree.

5.1 Classification

Given an ensemble of tree-structured classifiers $h(x;\theta_1), h(x;\theta_2), \ldots, h(x;\theta_K)$, with the learning set drawn at random from $\Omega = \mathcal{X}\times\mathcal{Y}$ independently of $\Psi$, define the empirical margin function as
$$mg_K(x, y, \psi) = \frac{1}{K}\sum_{k=1}^K 1[h(x;\theta_k) = y] - \max_{y_j\in\mathcal{Y}:\,y_j\neq y}\frac{1}{K}\sum_{k=1}^K 1[h(x;\theta_k) = y_j].$$

The margin function measures the extent to which the average number of votes for the correct class exceeds the average number of votes for any other class. The empirical ensemble error is defined as
$$PE(\psi)_K = P_\psi(mg_K(X,Y,\Psi) < 0) := P(mg_K(X,Y,\Psi) < 0 \mid \Psi = \psi).$$
That is, $PE(\psi)_K$ is a measure of the confidence we have in the ensemble classifier, and optimally $PE(\psi)_K$ is small.

5.1 Theorem: As the number of trees $K$ tends to infinity, $PE(\psi)_K$ converges, for all $\psi$ outside a $\Psi$ null set $C$, to the ensemble error
$$PE := P(mg(X,Y) < 0), \tag{5.1}$$
where
$$mg(X,Y) := P_{X,Y}(h(X;\Theta) = Y) - \max_{y_j\in\mathcal{Y}:\,y_j\neq Y} P_{X,Y}(h(X;\Theta) = y_j) \tag{5.2}$$
is called the margin function. Here $P_{X,Y}$ denotes the (random) probability over $\Theta$ given $(X,Y)$.2

Proof: Suppose that $mg_K(x, y, \psi)$ converges to $mg(x, y)$ for all $\psi$ outside a null set $C$ not depending on $(x, y)$. Then the result follows by dominated convergence, since
$$P_\psi(mg_K(X,Y,\Psi) < 0) = \mathrm{E}_\psi[1[mg_K(X,Y,\Psi) < 0]]$$
converges to
$$\mathrm{E}_\psi[1[mg(X,Y) < 0]] = P(mg(X,Y) < 0)$$
for all $\psi$ outside $C$, because $(X,Y)$ is independent of $\Psi$.

Suppose now that the input space $\mathcal{X}$ is countable. Then we could simply take $C = \cup_{(x,y)} C(x,y)$, with $C(x,y)$ being a $\Psi$ null set at $(x,y)$, since the countable union of null sets is again a null set.3 However, we do not know whether or not $\mathcal{X}$ is countable, and we now consider the case where $\mathcal{X}$ is not countable.

2As in previous chapters, all probabilities are taken over all random variables except those indicated by a subscript; subscripted probabilities denote conditional probabilities, random or not.

For a fixed output $y_j$ and for any hyper-parameter $\theta$, the set $\{x \mid h(x;\theta) = y_j\}$ is a union of hyper-rectangles.4 Since the state space of $\Theta$ is finite, by construction of the Forest-RI algorithm, there can only be a finite number of such unions of hyper-rectangles, say $S_1, S_2, \ldots, S_M$. Define $\varphi(\theta) = m$ if $\{x \mid h(x;\theta) = y_j\} = S_m$, and note that $\varphi(\theta)$ does not depend on $x$. Then
$$1[\varphi(\theta) = m,\ h(x;\theta) = y_j] = 1[\varphi(\theta) = m,\ x\in S_m] = 1[\varphi(\theta) = m]\,1[x\in S_m].$$

Moreover,
$$1[h(x;\theta) = y_j] = \sum_{m=1}^M 1[\varphi(\theta) = m,\ h(x;\theta) = y_j].$$

Combining these facts gives
$$
\begin{aligned}
\frac{1}{N}\sum_{i=1}^N 1[h(x;\theta_i) = y_j] &= \frac{1}{N}\sum_{i=1}^N\Big(\sum_{m=1}^M 1[\varphi(\theta_i) = m,\ h(x;\theta_i) = y_j]\Big)\\
&= \frac{1}{N}\sum_{m=1}^M\sum_{i=1}^N 1[\varphi(\theta_i) = m]\,1[x\in S_m]\\
&= \frac{1}{N}\sum_{m=1}^M N_m\,1[x\in S_m],
\end{aligned}
$$

where $N_m = \sum_{i=1}^N 1[\varphi(\theta_i) = m]$. Now, by the strong law of large numbers, for each $m$ there is a $\psi$ null set $C_m$ such that $N_m/N$ converges to $P(\varphi(\Theta) = m)$. Let $C = \cup_{m=1}^M C_m$; then $C$ is a null set and
$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N 1[h(x;\theta_i) = y_j] = \sum_{m=1}^M P(\varphi(\Theta) = m)\,1[x\in S_m]$$

for all $\psi$ outside $C$. Finally, note that
$$
\begin{aligned}
\sum_{m=1}^M P(\varphi(\Theta) = m)\,1[x\in S_m] &= \sum_{m=1}^M \mathrm{E}[1[\varphi(\Theta) = m]]\,1[x\in S_m]\\
&= \sum_{m=1}^M \mathrm{E}[1[\varphi(\Theta) = m,\ x\in S_m]]\\
&= \mathrm{E}[1[h(x;\Theta) = y_j]]\\
&= P(h(x;\Theta) = y_j),
\end{aligned}
$$
from which we can deduce that $mg_K(x,y,\psi)$ converges to $mg(x,y)$ for all $\psi$ outside $C$. This concludes the argument.

Theorem 5.1 guarantees a limiting value of $PE(\psi)_K$ when the number of trees grows large, and in this setting I will argue, in the following section, that decorrelation of the individual trees improves prediction accuracy. Also notice that Theorem 5.1 is not limited to Forest-RI; it holds for all tree-structured classifiers with finite $\Theta$.

3Let $(\Omega, \mathcal{F}, P)$ be a measure space and $\{C_n\}_{n\in\mathbb{N}}$ a sequence of null sets in $\mathcal{F}$. Then, by countable subadditivity, it follows that $P(\cup_{n\in\mathbb{N}} C_n) \le \sum_{n\in\mathbb{N}} P(C_n) = 0$.
4Consider for example Figure 3.1, in which $\{x \mid h(x;\theta) = y_3\} = \mathcal{X}^{(3)} = \mathcal{X}_{t_5}\cup\mathcal{X}_{t_8}$, where $\mathcal{X}_{t_5}$ and $\mathcal{X}_{t_8}$ are indeed hyper-rectangles.


5.1.1 Strength and Correlation

Define the strength of a random forest as
$$s = \mathrm{E}[mg(X,Y)],$$
where $mg(x,y)$ is defined in (5.2), and assume that $s > 0$. Then
$$PE \le P(|mg(X,Y) - s| \ge s) \le \frac{\mathrm{Var}[mg(X,Y)]}{s^2}, \tag{5.3}$$

using Chebychev's inequality (see Appendix A), where $PE$ is the ensemble error defined in (5.1). In what follows, the bound in (5.3) is expressed through correlations between the individual classifiers in the forest. First let
$$j(x,y) = \arg\max_{y_j\in\mathcal{Y}:\,y_j\neq y} P_{x,y}(h(x;\Theta) = y_j)$$
such that
$$mg(x,y) = P_{x,y}(h(x;\Theta) = y) - P_{x,y}(h(x;\Theta) = j(x,y)) = \mathrm{E}_{x,y}\big[1[h(x;\Theta) = y] - 1[h(x;\Theta) = j(x,y)]\big],$$
and define the raw margin function as
$$rmg(\theta, x, y) = 1[h(x;\theta) = y] - 1[h(x;\theta) = j(x,y)].$$

Since $mg(x,y) = \mathrm{E}_{x,y}[rmg(\Theta,x,y)]$ and
$$mg(x,y)^2 = \mathrm{E}_{x,y}[rmg(\Theta,x,y)\,rmg(\Theta',x,y)],$$
where $\Theta$ and $\Theta'$ are i.i.d., we obtain
$$
\begin{aligned}
\mathrm{Var}[mg(X,Y)] &= \mathrm{E}\big[\mathrm{E}_{X,Y}[rmg(\Theta,X,Y)\,rmg(\Theta',X,Y)]\big] - \mathrm{E}\big[\mathrm{E}_{X,Y}[rmg(\Theta,X,Y)]\big]\,\mathrm{E}\big[\mathrm{E}_{X,Y}[rmg(\Theta',X,Y)]\big]\\
&= \mathrm{E}\big[\mathrm{E}_{\Theta,\Theta'}[rmg(\Theta,X,Y)\,rmg(\Theta',X,Y)]\big] - \mathrm{E}\big[\mathrm{E}_{\Theta}[rmg(\Theta,X,Y)]\big]\,\mathrm{E}\big[\mathrm{E}_{\Theta'}[rmg(\Theta',X,Y)]\big]\\
&= \mathrm{E}\big[\mathrm{Cov}_{\Theta,\Theta'}[rmg(\Theta,X,Y),\, rmg(\Theta',X,Y)]\big]\\
&= \mathrm{E}\big[\rho_{\Theta,\Theta'}\,\mathrm{sd}_\Theta(rmg(\Theta,X,Y))\,\mathrm{sd}_{\Theta'}(rmg(\Theta',X,Y))\big],
\end{aligned}
$$

where $\rho_{\Theta,\Theta'}$ is the correlation between $rmg(\Theta,X,Y)$ and $rmg(\Theta',X,Y)$ holding $\Theta$ and $\Theta'$ fixed, and $\mathrm{sd}_\Theta(rmg(\Theta,X,Y))$ is the standard deviation of $rmg(\Theta,X,Y)$ holding $\Theta$ fixed. Thus, using Jensen's inequality, we obtain the bound
$$\mathrm{Var}[mg(X,Y)] = \rho\,\mathrm{E}[\mathrm{sd}_\Theta(rmg(\Theta,X,Y))]^2 \le \rho\,\mathrm{E}[\mathrm{Var}_\Theta[rmg(\Theta,X,Y)]], \tag{5.4}$$
where $\rho$ is given by
$$\rho = \frac{\mathrm{E}[\rho_{\Theta,\Theta'}\,\mathrm{sd}_\Theta(rmg(\Theta,X,Y))\,\mathrm{sd}_{\Theta'}(rmg(\Theta',X,Y))]}{\mathrm{E}[\mathrm{sd}_\Theta(rmg(\Theta,X,Y))\,\mathrm{sd}_{\Theta'}(rmg(\Theta',X,Y))]}.$$

Finally, by writing
$$\mathrm{E}[\mathrm{Var}_\Theta[rmg(\Theta,X,Y)]] = \mathrm{E}\big[\mathrm{E}_\Theta[rmg(\Theta,X,Y)^2]\big] - \mathrm{E}\big[\mathrm{E}_\Theta[rmg(\Theta,X,Y)]^2\big] \le 1 - s^2, \tag{5.5}$$
we have, from (5.3), (5.4) and (5.5), the following theorem.


5.2 Theorem: An upper bound for the ensemble error is given by
$$PE \le \frac{\rho(1 - s^2)}{s^2}.$$

From Theorem 5.2 we see that the two ingredients involved in the ensemble error of a random forest are the strength of the individual classifiers and the correlation between them in terms of the raw margin function. Thus, the smaller the ratio $\rho/s^2$, the better. As mentioned, Breiman suggested randomly selecting $m_{try}$ of the $p$ input variables at each node of the random forest trees in order to reduce $\rho/s^2$. In Forest-RI, $m_{try}$ is by default the nearest integer to $\sqrt{p}$.

In [4], 20 datasets selected from the UCI repository were used to study the relationship between the ratio $\rho/s^2$ and the test error of random forests. It was concluded that, in general, when $\rho/s^2$ was small then so was the test error, confirming the theoretical results of this section.

By injecting randomness into the trees as described, the random forest thus reduces the variance even more than bagging.

5.2 Regression

When the random output variable $Y$ is continuous, the random forest predictor is defined as
$$y_{rf}(x) = \frac{1}{K}\sum_{k=1}^K h(x;\theta_k),$$
where each individual tree is a regression tree. Let $\rho(x)$ denote the (Pearson) correlation between $h(x;\Theta_k)$ and $h(x;\Theta_{k'})$, where $\Theta_k$ and $\Theta_{k'}$ are i.i.d. Recall that the $k$'th random forest tree depends on the learning set $D_k$; hence $h(x;\Theta)$ and $h(x;\Theta')$ are not necessarily independent. We have
$$
\begin{aligned}
\mathrm{Var}_x[y_{rf}(x)] &= \frac{1}{K^2}\sum_{k=1}^K \mathrm{Var}_x[h(x;\Theta_k)] + \frac{1}{K^2}\sum_{k\neq k'} \mathrm{Cov}_x[h(x;\Theta_k),\, h(x;\Theta_{k'})]\\
&= \frac{1}{K}\mathrm{Var}_x[h(x;\Theta)] + \frac{K(K-1)}{K^2}\,\rho(x)\,\mathrm{Var}_x[h(x;\Theta)]\\
&= \rho(x)\,\mathrm{Var}_x[h(x;\Theta)] + \frac{1-\rho(x)}{K}\,\mathrm{Var}_x[h(x;\Theta)],
\end{aligned}
$$
since all trees are identically distributed. As the size $K$ of the ensemble tends to infinity, the variance of $y_{rf}(x)$ reduces to $\rho(x)\mathrm{Var}_x[h(x;\Theta)]$. Under the assumption that the randomisation injected via $\Theta$ has some effect, i.e. $\rho(x) < 1$, the variance of a random forest is therefore strictly smaller than the variance of an individual regression tree. As a result, the expected generalization error of a random forest is smaller than that of a single regression tree.

5.3 Consistency

Consistency of single classification trees is discussed in Section 3.3; however, this does not extend to random forests, for the following reasons:

i) In single decision trees, the number of samples in the leaf nodes is allowed to grow large, while trees in random forests are usually fully developed in the sense that no pruning or other stopping criterion is used.

ii) Single decision trees do not make use of bootstrapping.

iii) The splitting strategy in single decision trees consists in selecting the split that maximises the reduction in impurity. By contrast, in random forests the splitting strategy is randomised.

Several consistency results for random forests in special cases have been proven, but the first consistency result for Breiman's original Forest-RI was established only recently, in [24], for regression problems.


5.4 Out of Bag Samples and Variable Importance

An important feature of random forests is the out-of-bag (OOB) samples: for each observation $(x_i, y_i)$, construct its random forest predictor $y_{rf}^{(i)}$ using only those trees corresponding to bootstrap samples in which $(x_i, y_i)$ did not appear. Suppose the problem is regression; then the $i$'th OOB estimate is given by
$$y_{rf}^{(i)}(x_i) = \frac{1}{B_i}\sum_{b:\,(x_i,y_i)\notin D_b} h(x_i;\theta_b, D_b),$$

where $B_i$ is the number of bootstrap samples in which the $i$'th observation was not included. The OOB error estimate is then computed as
$$\mathrm{err}_{OOB}(y_{rf}) = \frac{1}{N}\sum_{i=1}^N L\big(y_i, y_{rf}^{(i)}(x_i)\big),$$

where $L$ is the loss function of interest and $N$ is the number of observations in $D$. The OOB error can be computed along the way in the random forest at virtually no extra cost, by simple book-keeping of out-of-bag samples. Since the probability that an observation, say $z_i = (x_i, y_i)$, belongs to bootstrap sample $b$ is
$$1 - P(Z_i \notin D_b) = 1 - \Big(1 - \frac{1}{N}\Big)^N \approx 1 - e^{-1} = 0.632,$$

it follows that $B_i \approx 0.37B$. Furthermore, the LOOCV error of $y_{rf}$, see Chapter 2, is given by
$$E_{LOOCV}(y_{rf}) = \frac{1}{N}\sum_{i=1}^N L\big(y_i, y_{rf,-i}(x_i)\big),$$

where $y_{rf,-i}(x_i)$ is the random forest predictor constructed without the $i$'th observation, evaluated at $x_i$. The only difference between $\mathrm{err}_{OOB}(y_{rf})$ and $E_{LOOCV}(y_{rf})$ is that the latter uses approximately three times as many bootstrap samples as the former to evaluate each observation. Assume now that $B^* = 3\cdot B_S$ is the effective number of bootstrap samples, where $B_S$ is the number of bootstrap samples needed for the test error of the random forest to stabilise. Then $B_i \approx B_S$, and $\mathrm{err}_{OOB}(y_{rf})$ and $E_{LOOCV}(y_{rf})$ are almost identical, since the error obtained by LOOCV is essentially the same whether we use $B_S$ or $3\cdot B_S$ bootstrap samples.

Consider again the six datasets listed in Table 3.2. Figure 5.1 displays the OOB errors of bagging and Forest-RI over 1500 trees for these datasets. As seen, Forest-RI outperforms bagging in all cases. In the following section it will also be apparent that the OOB errors for random forests in Figure 5.1 are virtually the same as the estimated expected loss computed by bivar, see Figure 5.2.

Another utility of random forests is their ability to measure the importance of each input variable. When the $b$'th tree is grown, the OOB samples are passed down the tree and the prediction accuracy (or the impurity) is recorded. Then the values of the $j$'th variable are randomly permuted in the OOB samples, and the accuracy is computed again. The decrease in accuracy as a result of this permutation is averaged over all trees and is used as a measure of the importance of variable $j$ in the random forest. This is handy, since a random forest is something of a black box, giving good predictions but usually no insight into the relationship between the inputs and the output.
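Both quantities are readily available from a randomForest fit, as in the sketch below; train and the factor response y are assumed names, and importance = TRUE must be requested for the permutation-based measure.

library(randomForest)

rf <- randomForest(y ~ ., data = train, ntree = 1500, importance = TRUE)

rf$err.rate[nrow(rf$err.rate), "OOB"]   # OOB error after all 1500 trees
head(importance(rf, type = 1))          # mean decrease in accuracy per input
varImpPlot(rf)                          # graphical summary of variable importance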


[Figure: six panels of OOB error against the number of trees (0 to 1500), comparing bagging and Forest-RI.]
Figure 5.1: Six plots comparing OOB errors of bagging and Forest-RI. (a) Ecoli. (b) Image. (c) Ionosphere. (d) Liver. (e) Diabetes. (f) Sonar.

5.5 Bias Variance Study of Forest-RI

In this section, I study how the theoretical results for the Forest-RI algorithm apply in practice on the six data sets listed in Table 3.2, using bivar. First notice how the error in all plots in Figure 5.2 is significantly lower than the error in the plots in Figure 3.5, where single trees were studied. Also, the number of trees does not need to be very large in order to reduce the error. The trend in Figure 5.2 is that the bias stays virtually the same in all six plots, whereas the net variances are dramatically reduced, which concurs with the theoretical analysis.


[Figure: six panels plotting bias (b), test error and net variance (v_n), in %, against the number of trees (1 to 500).]
Figure 5.2: Six plots using bivar with Forest-RI. (a) Ecoli. (b) Image. (c) Ionosphere. (d) Liver. (e) Diabetes. (f) Sonar.


6 Boosting

The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base learners; bagging in Chapter 4 and random forests in Chapter 5 are examples of ensemble learners. Boosting is another ensemble learning method where, as opposed to bagging and random forests, the base learners evolve over time.

6.1 AdaBoost

Boosting is actually a family of algorithms, among which the AdaBoost1 algorithm is the most influential one. In fact, AdaBoost was the first algorithm to (constructively) prove that an ensemble of weak classifiers can be combined into a strong classifier. I will briefly address the general boosting procedure and then move on to a more theoretical examination of the AdaBoost algorithm, which was first introduced by Freund and Schapire in [14]. As with the C4.5 algorithm introduced in Section 3.1, AdaBoost became very popular when published in [30].

Let $D = \{(x_i, y_i)\}_{i=1}^N$ be a learning set drawn at random from $\Omega = \mathcal{X}\times\mathcal{Y}$ according to a distribution $\mathcal{D}$, where $y_i \in \{-1,+1\}$ for $i = 1, 2, \ldots, N$. The problem of predicting a new unseen output is then a binary classification problem.

Suppose we are given a weak classifier $h_1$ trained on $D$, where each observation $(x_i, y_i)\in D$ is drawn with probability $1/N$, which is only slightly better than random guessing, say it has a 0/1-loss of 49%. In order to correct the mistakes made by $h_1$ we can try to derive a new distribution $D'$ from $D$ which makes the mistakes of $h_1$ more evident; for example, one that focuses more on the observations wrongly classified by $h_1$. We can then train a new classifier $h_2$ on observations drawn from $D$ using $D'$. Again, suppose we are unlucky and $h_2$ is also a weak classifier. Since $D'$ was derived from $D$, if $D'$ satisfies some condition, $h_2$ will be able to achieve a better performance than $h_1$ on some parts of $D$ where $h_1$ does not work well, without degrading the parts where $h_1$ performs well. Thus, by combining $h_1$ and $h_2$ in an appropriate way, the combined classifier will achieve a smaller loss than $h_1$ alone. By repeating this process, we can expect to get a combined classifier with a very small 0/1-loss.

Algorithm 6.1 AdaBoost
Input: A learning set D of size N, a new input x and a set of base learners.
1: Initialise the weight distribution W_{1,i} := W_1(x_i, y_i) = 1/N for i = 1, 2, ..., N.
2: for t = 1, 2, ..., T do
3:    Train a new base learner h_t using D and W_t.
4:    eps_t = sum_{i=1}^N W_{t,i} 1[y_i != h_t(x_i)]            # error when using h_t
5:    if eps_t > 0.5 then break                                  # else a random guess is better
6:    end if
7:    alpha_t = (1/2) log((1 - eps_t) / eps_t)
8:    W_{t+1,i} = (W_{t,i} / Z_t) exp(-alpha_t y_i h_t(x_i))     # Z_t normalises so W_{t+1} is a distribution
9: end for
10: return y_boost(x) = sign( sum_{t=1}^T alpha_t h_t(x) )

1AdaBoost is short for Adaptive Boosting.


The more specific procedure of AdaBoost is given in Algorithm 6.1. Once the base classifier $h_t$ has been constructed, AdaBoost chooses a parameter $\alpha_t$ as a measure of the importance assigned to $h_t$. Notice that $\alpha_t > 0$ by construction, since we require $\varepsilon_t < 0.5$, and that $\alpha_t$ gets larger as $\varepsilon_t$ gets smaller. Thus, the more accurate the base learner $h_t$ is, the more "importance" we assign to it.

The weight distribution $W_t$ is next updated by multiplying by either $e^{-\alpha_t} < 1$ for correctly classified observations or $e^{\alpha_t} > 1$ for incorrectly classified observations. This update can be thought of as a scaling of each observation $i$ by $\exp(-\alpha_t y_i h_t(x_i))$. The resulting set of values is then renormalised by dividing through by the factor $Z_t$, to ensure that the new distribution $W_{t+1}$ indeed sums to one. The effect of this rule is to increase the weights of observations misclassified by $h_t$ and to decrease the weights of correctly classified observations, so that the former gain more attention in the following iteration.

AdaBoost can be extended to non-binary classification and regression; however, I will not address these problems further and instead focus on binary classification. AdaBoost is implemented in R via the package adabag, which also handles bagging as mentioned in Chapter 4. Unfortunately, the main function boosting is somewhat slow, even for small datasets; hence I have implemented a version of Algorithm 6.1 in ClassifyR named adaboost (see bivar_methods()), which is more than twice as fast as boosting. This makes a huge difference, since bivar needs to run the desired learning method 100 times per default.
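As an illustration of Algorithm 6.1 (and not the ClassifyR implementation), the sketch below boosts rpart decision stumps; the response y is assumed to be coded as -1/+1 and x is a data frame of inputs.

library(rpart)

adaboost_sketch <- function(x, y, rounds = 100) {
  N <- nrow(x); w <- rep(1 / N, N)
  stumps <- vector("list", rounds); alpha <- numeric(rounds)
  for (t in seq_len(rounds)) {
    fit  <- rpart(y ~ ., data = cbind(x, y = factor(y)), weights = w, method = "class",
                  control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
    pred <- ifelse(predict(fit, x, type = "class") == "1", 1, -1)
    eps  <- sum(w * (pred != y))                       # weighted error of h_t
    if (eps >= 0.5) { rounds <- t - 1; break }         # no better than random guessing
    alpha[t]    <- 0.5 * log((1 - eps) / eps)
    stumps[[t]] <- fit
    w <- w * exp(-alpha[t] * y * pred)
    w <- w / sum(w)                                    # divide by Z_t
  }
  list(stumps = stumps[seq_len(rounds)], alpha = alpha[seq_len(rounds)])
}

predict_adaboost <- function(model, newdata) {
  H <- Reduce(`+`, Map(function(fit, a)
    a * ifelse(predict(fit, newdata, type = "class") == "1", 1, -1),
    model$stumps, model$alpha))
  sign(H)                                              # y_boost(x) = sign(H(x))
}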

6.2 Bounding the Training Error

Denote by $Z = \prod_{s=1}^t Z_s$ the product of the first $t$ normalising factors. Then we have the following theorem:

6.1 Theorem: Suppose that $\varepsilon_t$, $\alpha_t$ and $W_t$ are chosen as in Algorithm 6.1. Then
$$E_{train}(y_{boost}) = \frac{1}{N}\sum_{(x,y)\in D} L(y, y_{boost}(x)) \le Z,$$
where $L$ is the 0/1 loss function.

Proof: Notice that the weight distribution can be computed recursively as
$$W_{t+1,i} = \frac{W_{t,i}}{Z_t}\exp(-\alpha_t y_i h_t(x_i)) = \frac{1}{N}\,\frac{1}{Z}\exp\Big(-y_i\sum_{s=1}^t \alpha_s h_s(x_i)\Big),$$
and since $\sum_{i=1}^N W_{t+1,i} = 1$ we have
$$Z = \frac{1}{N}\sum_{i=1}^N \exp(-y_i H(x_i)), \tag{6.1}$$
where $H(x_i) = \sum_{s=1}^t \alpha_s h_s(x_i)$ is the additive weighted combination of the base learners at iteration $t$. Assume we have constructed the predictor $y_{boost} = \mathrm{sign}(H)$ using $t$ iterations and that we misclassify $k$ observations. Then the training error becomes
$$E_{train}(y_{boost}) = \frac{k}{N}.$$
At the same time we see that
$$\frac{1}{N}\sum_{i=1}^N \exp(-y_i H(x_i)) = \frac{1}{N}\sum_{i:\,y_i = y_{boost}(x_i)} \exp(-|H(x_i)|) + \frac{1}{N}\sum_{i:\,y_i \neq y_{boost}(x_i)} \exp(|H(x_i)|). \tag{6.2}$$


Let $A = \max_{i:\,y_i = y_{boost}(x_i)} |H(x_i)|$ and $B = \min_{i:\,y_i \neq y_{boost}(x_i)} |H(x_i)|$. Then (6.2) is always greater than or equal to
$$\frac{N-k}{N}\exp(-A) + \frac{k}{N}\exp(B) > \frac{k}{N},$$
hence $E_{train}(y_{boost}) \le Z$, as we wanted to show.

In fact, $\alpha_t = \frac{1}{2}\log\big(\frac{1-\varepsilon_t}{\varepsilon_t}\big)$ is chosen to minimise $Z_t$; this follows by minimising (6.3) with respect to $\alpha_t$. We can write $Z$ in terms of the errors and improve on the bound in Theorem 6.1.

6.2 Corollary: Assuming that, for all $t$, $\varepsilon_t = 1/2 - \gamma_t$ for some edge $\gamma_t > 0$, it holds that
$$E_{train}(y_{boost}) \le \exp\Big(-2\sum_{s=1}^t \gamma_s^2\Big).$$

Proof: First notice that, at each iteration, the normalising factor $Z_t$ can be written as
$$
\begin{aligned}
Z_t &= \sum_{i=1}^N W_{t,i}\exp(-\alpha_t y_i h_t(x_i))\\
&= \exp(-\alpha_t)\sum_{i:\,y_i = h_t(x_i)} W_{t,i} + \exp(\alpha_t)\sum_{i:\,y_i \neq h_t(x_i)} W_{t,i}\\
&= \exp(-\alpha_t)(1-\varepsilon_t) + \exp(\alpha_t)\,\varepsilon_t \qquad (6.3)\\
&= \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}\,(1-\varepsilon_t) + \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}\,\varepsilon_t\\
&= 2\sqrt{\varepsilon_t(1-\varepsilon_t)}.
\end{aligned}
$$
We now use that $\varepsilon_t = 1/2 - \gamma_t$ for all $t$; since we only consider binary classification problems, a random prediction is correct exactly half of the time, so $\varepsilon_t = 1/2 - \gamma_t$ means that the predictions of the base learners are slightly better than random guessing. It follows that
$$Z_t = \sqrt{1 - 4\gamma_t^2} \le \exp(-2\gamma_t^2),$$
where the inequality $\sqrt{1-4x^2} \le \exp(-2x^2)$ for $x \in [0, 0.5]$ is applied. Finally we have
$$E_{train}(y_{boost}) \le \exp\Big(-2\sum_{s=1}^t \gamma_s^2\Big), \tag{6.4}$$
which ends the proof.

It can be seen that AdaBoost reduces the training error exponentially fast. Also, to achieve a training error less than $\delta$, the number of iterations $t$ needed is bounded below by
$$\Big\lceil \frac{1}{2\gamma^2}\log(1/\delta) \Big\rceil,$$
where it is assumed that all edges are equal, $\gamma = \gamma_1 = \gamma_2 = \cdots = \gamma_t$. Thus, with base learners all having edge $\gamma = 0.05$ (that is, error $\varepsilon_t = 0.45$), we need at least $t = 600$ rounds of AdaBoost in order to guarantee a training error of $\delta = 0.05$.
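This number is easily verified in R:

gamma <- 0.05   # edge of each base learner (error 0.45)
delta <- 0.05   # target training error
ceiling(log(1 / delta) / (2 * gamma^2))   # 600 rounds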

The bound in (6.4) suggests that the training error will eventually become zero as the number of iterations increases. Thus one would conjecture that AdaBoost tends to overfit data. This phenomenon counters Occam's razor, where a model is penalised when its complexity (in this case the number of rounds) is increased. In some studies, researchers have observed decreasing generalization error even after the training error hit zero. In Section 6.5 I give a theoretical justification of this phenomenon, see Theorem 6.3.


6.3 Choosing the Base Learners

In this section it is shown how new base learners are constructed based on the weight distribution given in Algorithm 6.1. In the previous section we saw that AdaBoost tries to minimise the training error through the product of the normalising constants $Z = \frac{1}{N}\sum_{i=1}^N \exp(-y_i H(x_i))$, which is an estimate of the expectation $\mathrm{E}[\exp(-YH(X))]$.

Assume now that we have constructed an additive weighted combination $H$ and weights $\alpha$ according to Algorithm 6.1, and we wish to find a base learner $h$, forming the new additive weighted combination $H + \alpha h$, such that $\mathrm{E}[\exp(-Y(H(X) + \alpha h(X)))]$ is minimised. We can minimise this expectation pointwise by minimising
$$\mathrm{E}_x[\exp(-Y(H(x) + \alpha h(x)))] = \mathrm{E}_x[\exp(-YH(x))\exp(-Y\alpha h(x))].$$

We can represent $\exp(-Y\alpha h(x))$ by a second order Taylor approximation around $h(x) = 0$, fixing $\alpha = 1$, as $1 - Yh(x) + Y^2h(x)^2/2 = 1 - Yh(x) + 1/2$, and therefore
$$\mathrm{E}_x[\exp(-YH(x))\exp(-Yh(x))] \approx \mathrm{E}_x[\exp(-YH(x))(1 - Yh(x) + 1/2)] = -\mathrm{E}_x[\exp(-YH(x))(Yh(x) - 1 - 1/2)]. \tag{6.5}$$

Hence, minimising (6.5) is equivalent to maximising
$$
\begin{aligned}
\mathrm{E}_x[e^{-YH(x)}Yh(x)] &= \big(e^{-H(x)}P_x(Y=1) - e^{H(x)}P_x(Y=-1)\big)h(x)\\
&\propto \frac{e^{-H(x)}P_x(Y=1) - e^{H(x)}P_x(Y=-1)}{e^{-H(x)}P_x(Y=1) + e^{H(x)}P_x(Y=-1)}\,h(x)\\
&= \mathrm{E}_{(x,Y)\sim e^{-yH(x)}P_x(y)}[Y \mid X = x]\,h(x), \qquad (6.6)
\end{aligned}
$$
where $\mathrm{E}_{(x,Y)\sim e^{-yH(x)}P_x(y)}[Y\mid X=x]$ is the expected value of $Y\mid X=x$ drawn from the distribution proportional to $e^{-yH(x)}P_x(y)$. From (6.6) it is seen that $\mathrm{E}_x[\exp(-Y(H(x)+\alpha h(x)))]$ is minimised by taking $h(x) = \mathrm{sign}\big(\mathrm{E}_{(x,Y)\sim e^{-yH(x)}P_x(y)}[Y\mid X=x]\big)$, since this choice maximises (6.6).

Suppose now that we have determined $\alpha_t$ and $h_t$. Then, according to (6.6), the distribution for the next round should be
$$\frac{e^{-y(H(x)+\alpha_t h_t(x))}P_x(y)}{Z_t} = \frac{e^{-yH(x)}P_x(y)}{Z_t}\cdot e^{-y\alpha_t h_t(x)} = W_t\cdot e^{-y\alpha_t h_t(x)},$$

where $Z_t$ is a normalising constant. This is exactly the way AdaBoost updates the weight distribution in Algorithm 6.1. Finally, notice that
$$\mathrm{sign}\big(\mathrm{E}_{(x,Y)\sim e^{-yH(x)}P_x(y)}[Y\mid X=x]\big) = \mathrm{sign}\big(P_{(x,Y)\sim e^{-yH(x)}P_x(y)}(Y=1\mid X=x) - P_{(x,Y)\sim e^{-yH(x)}P_x(y)}(Y=-1\mid X=x)\big) = \arg\max_y P_{(x,Y)\sim e^{-yH(x)}P_x(y)}(Y=y\mid X=x),$$
which is the Bayes classifier under the current distribution.

6.4 Consistency

Consistency of the AdaBoost algorithm has been studied, and proved to hold, in a number of special cases with restrictive assumptions2. In [3], consistency of AdaBoost itself (and not a modification thereof) was proved in a constructive way. In particular, they showed that AdaBoost is consistent if stopped sufficiently early, after $t_N$ iterations, where $t_N = N^\nu$ with $\nu < 1$.

2Leo Breiman is among the contributors to the work on consistency of AdaBoost, and his work was the foundation for the later and more satisfactory results.


6.5 The Margin Distribution

Subsequent to the introduction of the AdaBoost algorithm, a lot of effort was put into explaining why AdaBoost works so well. In this section I give an explanation through the notion of margins, first introduced in [23] by Schapire, Freund, Bartlett, and Lee. The margin is a quantitative measure of the confidence of a prediction made by the combined classifier. Recall that the combined classifier is $y_{boost}(x) = \mathrm{sign}(H(x))$, where
$$H(x) = \sum_{t=1}^T \alpha_t h_t(x) \tag{6.7}$$
is the additive weighted combination of the base learners. In the following it will be convenient to normalise the weights $\alpha_t$ such that
$$\alpha_t := \frac{\alpha_t}{\sum_{t=1}^T \alpha_t}.$$
We now define the normalised additive weighted combination as
$$f(x) = \sum_{t=1}^T \alpha_t h_t(x) = \frac{H(x)}{\sum_{t=1}^T \alpha_t}.$$

Such a normalisation does not change the combined classifier, and thus $y_{boost}(x) = \mathrm{sign}(f(x))$. For a given observation $(x,y)$ we can now define the margin simply as $yf(x)$; this quantity is sometimes referred to as the normalised margin, to distinguish it from the un-normalised margin $yH(x)$ obtained by omitting the normalisation step above. In fact, $yf(x)$ is equivalent to the empirical margin $mg_K$ defined in Section 5.1. To see this, assume WLOG that $\alpha_t = 1/T$ for all $t$. Then, since $y \in \{-1,+1\}$,

$$
\begin{aligned}
yf(x) &= \sum_{t=1}^T \alpha_t\big(y h_t(x)\big)\\
&= \sum_{t=1}^T \alpha_t\big(1[h_t(x) = y] - 1[h_t(x) \neq y]\big)\\
&= \frac{1}{T}\sum_{t=1}^T 1[h_t(x) = y] - \frac{1}{T}\sum_{t=1}^T 1[h_t(x) \neq y],
\end{aligned}
$$

which is indeed the empirical margin $mg_K$.

Recall that the base classifiers $h_t$ have range $\{-1,+1\}$ and that the outputs $y$ are also in $\{-1,+1\}$. Because the weights $\alpha_t$ are normalised, this implies that $f$ has range $[-1,+1]$, and so the margin is also in $[-1,+1]$. Furthermore, $y = y_{boost}(x)$ if and only if $y$ has the same sign as $f(x)$, that is, if and only if the margin of $(x,y)$ is positive. Hence, the sign of the margin indicates whether or not the combined classifier classifies correctly. We can also visualise the effect AdaBoost has on the margins of the learning set by plotting their distribution. In particular, we can create a plot showing, for each $\theta\in[-1,+1]$, the fraction of observations in the learning set with margin at most $\theta$. For such a cumulative distribution curve, the bulk of the distribution lies where the curve rises most steeply. See Figure 7.5 in Chapter 7 for an example of margin cumulative distributions.

Let the set-up be as in the introduction of this chapter, where $D = \{(x_i, y_i)\}_{i=1}^N$ is a learning set drawn at random from $\Omega = \mathcal{X}\times\mathcal{Y}$ according to a distribution $\mathcal{D}$, and where $y_i\in\{-1,+1\}$ for $i = 1, 2, \ldots, N$. The goal is to analyse the expected prediction error in terms of margins. In what follows I denote by $P_{\sim D}(\cdot)$ the empirical distribution of the learning set, where an observation $(x,y)$ is selected uniformly at random from the training set $D$, and by $\mathrm{E}_{\sim D}[\cdot]$ the corresponding expected value3. More specifically,
$$P_{\sim D}(y_{boost}(X) \neq Y) = \frac{1}{N}\sum_{i=1}^N 1[y_{boost}(x_i) \neq y_i]$$
is the training error of $y_{boost}$. Recall that $y_{boost}$ makes a mistake if and only if $yf(x)$ is not positive, implying that the expected prediction error of $y_{boost}$ equals $P(Yf(X)\le 0)$, and similarly for the training error.

In order to derive the next result, let $\mathcal{H}$ be a base learner space and define the convex hull $C(\mathcal{H})$ of $\mathcal{H}$ as the set of all mappings that can be generated by taking a weighted average of classifiers from $\mathcal{H}$:
$$C(\mathcal{H}) := \Big\{ f : x \mapsto \sum_{h\in\mathcal{H}} a_h h(x) \;\Big|\; \sum_h a_h = 1;\ a_h \ge 0 \text{ for all } h\in\mathcal{H} \Big\},$$
where it is understood that only finitely many of the $a_h$'s may be non-zero.

6.3 Theorem: Let $\mathcal{D}$ be a distribution over $\mathcal{X}\times\{-1,+1\}$, and let $D$ be a learning set of $N$ observations chosen independently at random. Assume that the base learner space $\mathcal{H}$ is finite, and let $\delta > 0$. Then, with probability at least $1-\delta$ over the random choice of the learning set, every function $f\in C(\mathcal{H})$ satisfies the following bound for all $\theta > 0$:
$$P(Yf(X) \le 0) \le P_{\sim D}(Yf(X) \le \theta) + O\left(\sqrt{\frac{\log(|\mathcal{H}|)}{N\theta^2}\cdot\log\Big(\frac{N\theta^2}{\log(|\mathcal{H}|)}\Big) + \frac{\log(1/\delta)}{N}}\right). \tag{6.8}$$

Proof: First define by
$$C_m(\mathcal{H}) := \Big\{ \hat f : x \mapsto \frac{1}{m}\sum_{j=1}^m h_j(x) \;\Big|\; h_j\in\mathcal{H} \Big\}$$
the set of unweighted averages over $m$ elements from $\mathcal{H}$, and notice that, for any $\hat f\in C_m(\mathcal{H})$ and $\theta > 0$, we can write
$$P(Yf(X) \le 0) \le P(Y\hat f(X) \le \theta/2) + P(Y\hat f(X) > \theta/2,\ Yf(X) \le 0), \tag{6.9}$$
since for any two events $A$ and $B$ it holds that
$$P(A) = P(B\cap A) + P(B'\cap A) \le P(B) + P(B'\cap A). \tag{6.10}$$
The (normalised) weights $a_h$ in $C(\mathcal{H})$ naturally define a probability distribution over $\mathcal{H}$, and we can imagine an experiment in which $m$ base learners $h_1, h_2, \ldots, h_m$ from $\mathcal{H}$ are selected independently at random from this distribution, where we choose $h_j$ to be equal to $h$ with probability $a_h$. We can then form the average
$$\hat f(x) = \frac{1}{m}\sum_{j=1}^m h_j(x),$$

which is clearly a member of $C_m(\mathcal{H})$. I will denote by $P_{\sim Q}(\cdot)$ a probability where elements of $C_m(\mathcal{H})$ are randomly chosen according to the distribution $Q$ defined by choosing $h_1, h_2, \ldots, h_m$ independently at random according to the normalised weights $a_h$. Since (6.9) holds for any $\hat f\in C_m(\mathcal{H})$, we can take the expectation with respect to $Q$ on the right hand side and get
$$
\begin{aligned}
P(Yf(X) \le 0) &\le \mathrm{E}_{\sim Q}[P(Y\hat f(X) \le \theta/2)] + \mathrm{E}[P_{\sim Q}(Y\hat f(X) > \theta/2,\ Yf(X) \le 0)]\\
&\le \mathrm{E}_{\sim Q}[P(Y\hat f(X) \le \theta/2)] + \mathrm{E}[P_{\sim Q}(Y\hat f(X) - Yf(X) > \theta/2)]\\
&\le \mathrm{E}_{\sim Q}[P(Y\hat f(X) \le \theta/2)] + \mathrm{E}[P_{\sim Q}(|\hat f(X) - f(X)| > \theta/2)],
\end{aligned}
$$
where the second inequality uses that the event $\{Y\hat f(X) > \theta/2,\ Yf(X)\le 0\}$ implies $Y(\hat f(X) - f(X)) > \theta/2$.

3In $P_{\sim D}(\cdot)$ the tilde is used to be explicit that the probability is taken according to the (uniform) distribution that generates $D$. Without the tilde, this notation would clash with earlier notation, where a subscript is used to indicate conditional probabilities, which is not intended here.


Notice that
$$\mathrm{E}_{\sim Q}[\hat f(x)] = \frac{1}{m}\,\mathrm{E}_{\sim Q}\Big[\sum_{j=1}^m h_j(x)\Big] = \mathrm{E}_{\sim Q}[h_1(x)] = \sum_{h\in\mathcal{H}} a_h h(x) = f(x).$$
Thus, by Corollary A.3 in Appendix A, we have that
$$\mathrm{E}[P_{\sim Q}(|\hat f(X) - f(X)| > \theta/2)] \le 2e^{-m\theta^2/8},$$
since each $h_j$ takes values in the interval $[-1,+1]$ of length 2. Next, we want to bound the expectation $\mathrm{E}_{\sim Q}[P(Y\hat f(X) \le \theta/2)]$. In order to proceed, we need the following lemma.

since each hj is in the interval [−1,+1] of length 2. Next, we want to bound the expectationE∼Q[P (Y f(X) ≤ θ/2)]. In order to proceed, we need the following lemma.

6.4 Lemma: With probability at least $1-\delta$ (where the probability is taken over the choice of the random learning set), for all $m\ge 1$, all $\hat f\in C_m(\mathcal{H})$ and all $\theta > 0$ it holds that
$$P(Y\hat f(X) \le \theta/2) \le P_{\sim D}(Y\hat f(X) \le \theta/2) + \varepsilon_m, \tag{6.11}$$
where $\varepsilon_m = \sqrt{\frac{1}{2N}\log\big(m(m+1)^2|\mathcal{H}|^m/\delta\big)}$.

Proof: Let
$$p_{\hat f,\theta} := P(Y\hat f(X) \le \theta/2) \quad\text{and}\quad \hat p_{\hat f,\theta} := P_{\sim D}(Y\hat f(X) \le \theta/2),$$
and assume first that $m$, $\hat f$ and $\theta$ are fixed. Since $p_{\hat f,\theta} = \mathrm{E}[\hat p_{\hat f,\theta}]$, we can apply Corollary A.3 to obtain
$$P(p_{\hat f,\theta} \ge \hat p_{\hat f,\theta} + \varepsilon_m) = P(\hat p_{\hat f,\theta} \le p_{\hat f,\theta} - \varepsilon_m) \le e^{-2N\varepsilon_m^2},$$
since $\hat p_{\hat f,\theta}$ is an average of $N$ indicator functions, each with range in $[0,1]$ of length 1. This means that (6.11) holds for a particular choice of $\hat f$ and $\theta$ with high probability. By construction of $\hat f$, the inequality $y\hat f(x) \le \theta/2$ is equivalent to $y\sum_{j=1}^m h_j(x) \le \frac{m\theta}{2}$, which in turn is equivalent to $y\sum_{j=1}^m h_j(x) \le \lfloor\frac{m\theta}{2}\rfloor$, since the left hand side is an integer. Thus $p_{\hat f,\theta} = p_{\hat f,\bar\theta}$ and $\hat p_{\hat f,\theta} = \hat p_{\hat f,\bar\theta}$, where $\bar\theta$ is chosen such that $m\bar\theta/2 = \lfloor m\theta/2\rfloor$. That is,
$$\bar\theta \in \Theta_m = \Big\{ \frac{2i}{m} \;\Big|\; i = 0, 1, \ldots, m \Big\}.$$

Notice that we never have to consider $\theta > 2$, since $y\hat f(x)\in[-1,+1]$. Hence, for fixed $m$, the probability that the event $F_{\hat f,\theta,m} := \{p_{\hat f,\theta} \ge \hat p_{\hat f,\theta} + \varepsilon_m\}$ occurs for some $\hat f\in C_m(\mathcal{H})$ and some $\theta > 0$ is
$$
\begin{aligned}
P\big(\exists(\hat f\in C_m(\mathcal{H}),\,\theta > 0)\text{ such that } p_{\hat f,\theta} \ge \hat p_{\hat f,\theta} + \varepsilon_m\big)
&= P\big(\exists(\hat f\in C_m(\mathcal{H}),\,\theta\in\Theta_m)\text{ such that } p_{\hat f,\theta} \ge \hat p_{\hat f,\theta} + \varepsilon_m\big)\\
&= P\Big(\bigcup_{\hat f\in C_m(\mathcal{H}),\,\theta\in\Theta_m} F_{\hat f,\theta,m}\Big)\\
&\le \sum_{\hat f\in C_m(\mathcal{H}),\,\theta\in\Theta_m} P(F_{\hat f,\theta,m})\\
&\le |C_m(\mathcal{H})|\cdot|\Theta_m|\cdot e^{-2N\varepsilon_m^2}\\
&\le |\mathcal{H}|^m\cdot(m+1)\cdot e^{-2N\varepsilon_m^2}\\
&= \frac{\delta}{m(m+1)},
\end{aligned}
$$


where the last equality follows from the choice of $\varepsilon_m$. Finally, the probability that $F_{\hat f,\theta,m}$ occurs for some $m\ge 1$, some $\hat f\in C_m(\mathcal{H})$ and some $\theta > 0$ is at most
$$\sum_{m=1}^\infty \frac{\delta}{m(m+1)} = \delta\sum_{m=1}^\infty\Big(\frac{1}{m} - \frac{1}{m+1}\Big) = \delta,$$
implying that $p_{\hat f,\theta} \le \hat p_{\hat f,\theta} + \varepsilon_m$ holds simultaneously for all of them with probability at least $1-\delta$, which ends the proof of the lemma.

Using Lemma 6.4 we can write
$$\mathrm{E}_{\sim Q}[P(Y\hat f(X) \le \theta/2)] \le \mathrm{E}_{\sim Q}[P_{\sim D}(Y\hat f(X) \le \theta/2) + \varepsilon_m] = P_{\sim D,\sim Q}(Y\hat f(X) \le \theta/2) + \varepsilon_m.$$

To finish the argument we apply the inequality in (6.10) again to obtain
$$
\begin{aligned}
P_{\sim D,\sim Q}(Y\hat f(X) \le \theta/2) &\le P_{\sim D,\sim Q}(Yf(X) \le \theta) + P_{\sim D,\sim Q}(Y\hat f(X) \le \theta/2,\ Yf(X) > \theta)\\
&= P_{\sim D}(Yf(X) \le \theta) + \mathrm{E}_{\sim D}[P_{\sim Q}(Y\hat f(X) \le \theta/2,\ Yf(X) > \theta)]\\
&\le P_{\sim D}(Yf(X) \le \theta) + \mathrm{E}_{\sim D}[P_{\sim Q}(|\hat f(X) - f(X)| > \theta/2)]\\
&\le P_{\sim D}(Yf(X) \le \theta) + 2e^{-m\theta^2/8},
\end{aligned}
$$
implying that
$$P(Yf(X) \le 0) \le P_{\sim D}(Yf(X) \le \theta) + 4e^{-m\theta^2/8} + \sqrt{\frac{\log\big(m(m+1)^2|\mathcal{H}|^m/\delta\big)}{2N}}, \tag{6.12}$$

which is true for every $m$. Hence, we can select $m$ to minimise (6.12). Since this is hard to minimise directly, we notice that
$$4e^{-m\theta^2/8} + \sqrt{\frac{\log\big(m(m+1)^2|\mathcal{H}|^m/\delta\big)}{2N}} \le 4e^{-m\theta^2/8} + \frac{\log\big(|\mathcal{H}|^{2m}/\delta\big)}{2N},$$
where the right hand side is minimised when
$$m = \Big\lceil \frac{8}{\theta^2}\log\Big(\frac{N\theta^2}{\log(|\mathcal{H}|)}\Big) \Big\rceil. \tag{6.13}$$

Finally we observe that
$$
O\left(4e^{-m\theta^2/8} + \sqrt{\frac{\log\big(m(m+1)^2|\mathcal{H}|^m/\delta\big)}{2N}}\right)
= O\left(\sqrt{\frac{m + m\log(|\mathcal{H}|) + \log(1/\delta)}{2N}}\right)
= O\left(\sqrt{\frac{\log(|\mathcal{H}|)}{N\theta^2}\cdot\log\Big(\frac{N\theta^2}{\log(|\mathcal{H}|)}\Big) + \frac{\log(1/\delta)}{N}}\right),
$$
where I have used (6.13) and the fact that $\log(n) = O(n)$ for any $n$. This concludes the proof.

Looking at (6.8) we see that the "big-oh" term becomes small as the sample size $N$ gets larger, provided that the size of $\mathcal{H}$ is controlled and $\theta$ is bounded away from zero. In Theorem 6.3 we assumed that the base learner space $\mathcal{H}$ is finite, which restricts the set of base learners to, for instance, decision trees over a discrete-valued input space. Non-tree-structured base learners can also be applied; however, empirical results show that the C4.5 and CART methodologies are very well suited for boosting. A similar result holds when $|\mathcal{H}|$ is not assumed to be finite, see [23]. Interestingly, the bound does not depend on the number of iterations, which indeed suggests that boosting rarely overfits, as mentioned in Section 6.2.


6.5 Theorem: Suppose that the AdaBoost algorithm generates base learners with errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T$. Then, for any $\theta$, it holds that
$$P_{\sim D}(Yf(X) \le \theta) \le \prod_{t=1}^T \sqrt{(1+2\gamma_t)^{1+\theta}(1-2\gamma_t)^{1-\theta}},$$
where $\gamma_t := 1/2 - \varepsilon_t$ and $f$ is the normalised combination of the base learners in (6.7).

Proof: Note that $yf(x) \le \theta$ if and only if
$$y\sum_{t=1}^T \alpha_t h_t(x) \le \theta\sum_{t=1}^T \alpha_t,$$
which in turn holds if and only if
$$1 \le \exp\Big(-y\sum_{t=1}^T \alpha_t h_t(x) + \theta\sum_{t=1}^T \alpha_t\Big).$$
Thus,
$$1[yf(x) \le \theta] \le \exp\Big(-y\sum_{t=1}^T \alpha_t h_t(x) + \theta\sum_{t=1}^T \alpha_t\Big),$$
implying that

$$
\begin{aligned}
P_{\sim D}(Yf(X) \le \theta) &= \frac{1}{N}\sum_{i=1}^N 1[y_i f(x_i) \le \theta]\\
&\le \frac{1}{N}\sum_{i=1}^N \exp\Big(-y_i\sum_{t=1}^T \alpha_t h_t(x_i) + \theta\sum_{t=1}^T \alpha_t\Big)\\
&= \frac{\exp\big(\theta\sum_{t=1}^T \alpha_t\big)}{N}\sum_{i=1}^N \exp\Big(-y_i\sum_{t=1}^T \alpha_t h_t(x_i)\Big)\\
&= \exp\Big(\theta\sum_{t=1}^T \alpha_t\Big)\Big(\prod_{t=1}^T Z_t\Big),
\end{aligned}
$$

where the last equality follows from the identity in (6.1). Plugging in
$$\alpha_t = \frac{1}{2}\log\Big(\frac{1-\varepsilon_t}{\varepsilon_t}\Big) = \frac{1}{2}\log\Big(\frac{1+2\gamma_t}{1-2\gamma_t}\Big) \quad\text{and}\quad Z_t = \sqrt{1-4\gamma_t^2} = \sqrt{(1+2\gamma_t)(1-2\gamma_t)},$$

we get
$$
\begin{aligned}
P_{\sim D}(Yf(X) \le \theta) &\le \exp\left(\sum_{t=1}^T \frac{\theta}{2}\log\Big(\frac{1+2\gamma_t}{1-2\gamma_t}\Big)\right)\prod_{t=1}^T\sqrt{(1+2\gamma_t)(1-2\gamma_t)}\\
&= \exp\left(\sum_{t=1}^T \log\big((1+2\gamma_t)^{\theta/2}(1-2\gamma_t)^{-\theta/2}\big)\right)\prod_{t=1}^T\sqrt{(1+2\gamma_t)(1-2\gamma_t)},
\end{aligned}
$$
from which the result follows.

To reason about this bound, assume that, for all $t$, $\varepsilon_t \le 1/2 - \gamma$ for some $\gamma > 0$. Then we can simplify the upper bound in Theorem 6.5 to
$$\Big(\sqrt{(1+2\gamma)^{1+\theta}(1-2\gamma)^{1-\theta}}\Big)^T.$$


If $\theta < \Gamma(\gamma)$, where
$$\Gamma(\gamma) := \frac{-\log(1-4\gamma^2)}{\log\big(\frac{1+2\gamma}{1-2\gamma}\big)},$$
it is easily seen that
$$\sqrt{(1+2\gamma)^{1+\theta}(1-2\gamma)^{1-\theta}} < 1,$$
implying that the sample probability that $Yf(X) \le \theta$ decreases exponentially fast with $T$. In other words, when every base learner is only slightly better than random guessing, the sample probability that the margin is at most $\theta$ tends to zero as $T$ increases, for every $\theta < \Gamma(\gamma)$, resulting in high prediction accuracy. This will happen sooner or later, since we have restricted the base learners to be better than random predictions.

6.6 Bias Variance Study of AdaBoost

In Figure 6.1, I have applied bivar to the four datasets4 listed in Table 3.2 with a binary outcome: (a), (a2) Ionosphere; (b), (b2) Liver; (c), (c2) Diabetes; (d), (d2) Sonar. A single letter indicates that the base learners are decision stumps, i.e. decision trees of height one (one root and two leaves), whereas a letter followed by "2" indicates that bivar is run on the same dataset but with pruned CART trees of arbitrary height as base learners, that is, trees with the default settings in rpart.

The first thing to notice is that AdaBoost performs better with fully developed trees than with decision stumps, which is also intuitively clear. However, there is a trade-off in runtime between these two base learners which, in some situations, favours decision stumps. In the following I will only consider (a2), (b2), (c2) and (d2), and for these the bias stays approximately the same over all iterations, as expected, whereas the net variance is reduced significantly, which is the same trend as for random forests in Figure 5.2. For Ionosphere, random forests (at 500 trees) and AdaBoost (at 500 iterations) have almost the same error, whereas AdaBoost outperforms random forests on Liver, Diabetes and Sonar. A fair conjecture would be that AdaBoost handles datasets with many predictors better than random forests, since random forests may pick many unhelpful predictors at each split. A much more thorough data analysis would be required to confirm such a conjecture, however.

4The four datasets are those with a binary outcome.


[Figure: eight panels plotting bias (b), test error and net variance (v_n), in %, against the number of AdaBoost iterations (1 to 500).]
Figure 6.1: Eight plots using bivar with AdaBoost. (a), (a2) Ionosphere. (b), (b2) Liver. (c), (c2) Diabetes. (d), (d2) Sonar.


7 Forensic Genetics - Data Driven Example

Genes, DNA (deoxyribonucleic acid) and chromosomes are what make a person (or an organism in general) unique. They are the hereditary material passed on from our parents, holding a specific set of instructions, and are found in all living cells (nerve cells, skin cells, hair cells etc.) of the human body. Almost every cell has the same basic parts: an outer border called the membrane, which contains a liquid material called cytoplasm, and in the cytoplasm is the nucleus. Inside the nucleus are the 23 pairs of chromosomes: 22 autosomal pairs, which are the same in both males and females, and one pair determining the sex (XY in males and XX in females). The chromosomes are really long strings of DNA, which is shaped like a ladder that has been twisted into a double helix, see Figure 7.1. The steps of the ladder are made of four bases (also termed nucleotides): adenine (A), guanine (G), cytosine (C) and thymine (T), and more than 99% of those bases are the same in all people. The bases always come in pairs; A always binds to T and G always binds to C, except when a mutation is formed, where the possible mutations are A to T, T to A, G to C and C to G. For example, the pair A/T may mutate to T/T. A gene is a sequence of the DNA that codes for a protein used to express some trait; not all DNA sequences code for proteins, and so not all DNA sequences are genes. A genome is an organism's complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism. In humans, a copy of the entire genome, more than 3 billion DNA base pairs, is contained in all cells that have a nucleus; that is, a set of 23 chromosomes in the human body constitutes the genome.

An allele is one of a series of different forms of a gene that arise by mutation and are found at the same locus on a chromosome.1 A population or species of organisms typically includes multiple alleles at each locus among its individuals. Alleles are also referred to as genetic markers, since they distinguish individuals in the population.

A SNP (single-nucleotide polymorphism) is a DNA sequence variation occurring when a single base (A, T, C or G) in the genome differs between members of a species. For example, two DNA fragments from different individuals, AAGCCTA and AAGCTTA, contain a difference in a single base, and we refer to the former sequence as allele C and the latter as allele T. There are variations between human populations, so a SNP that is common in one geographical or ethnic group may be much rarer in another.

[Figure: a schematic DNA double helix showing the base pairs A-T, C-G, G-C, T-A and A-T.]
Figure 7.1: An illustration of a DNA sequence with 5 base pairs.

Forensic genetics is the branch of genetics, the science study of heredity, that deals withthe application of DNA in criminal investigations, fatherhood migration and disaster victimidentification, see for example [9]. DNA can be detected and analysed using a number ofdifferent forensic techniques, each of which target different parts of DNA. In what follows, I

1The word "allele" is a short form of allelomorph, which means "other form".

In what follows, I will analyse the dataset GRLDNA, which is a collection of 2017 individuals, Greenlanders (103) and non-Greenlanders (1914), each having 562 recorded SNPs.2 Hence, the goal is to segregate Greenlanders from non-Greenlanders and detect the most influential genetic markers (or SNPs). Four randomly chosen observations in GRLDNA are listed in Table 7.1, where each SNP is one of the forms A/A, A/T, T/T, C/C, C/G and G/G.

Although the table only highlights 5 of the 562 SNPs, it is apparent that variation occurs between GRL (Greenlanders) and NGRL (non-Greenlanders). With that being said, the first NGRL has many similarities with the second GRL. Now, in order to actually analyse GRLDNA, we have to

Population   SNP_1   SNP_2   SNP_3   SNP_4   SNP_5
NGRL         T/T     A/A     G/G     T/T     C/C
NGRL         T/T     A/G     G/G     C/T     C/C
GRL          C/T     G/G     G/G     C/T     T/T
GRL          T/T     A/G     G/G     T/T     C/T

Table 7.1: A small extract of the dataset GRLDNA illustrating the SNP combinations of Greenlanders and non-Greenlanders.

decide upon a "rule" in order to distinguish the groups of GRL and NGRL. It is of course possible todecide that T/T and C/T, for an example, are different (and they are) - but I use a somewhat moresubtle discrimination. I let the first person in GRLDNA be a reference; that is, the first letter for eachSNP is extracted. The first 5 reference letters are T,A,G,C and C. Next, all other observations aretransformed by counting the number of reference alleles, such that an observation with SNP_1 =T/T becomes SNP_1 = 2 and SNP_1 = C/T becomes SNP_1 = 1 etc. In Table 7.2 a conversion ofTable 7.1 with this method is listed.

Population   SNP_1   SNP_2   SNP_3   SNP_4   SNP_5
NGRL         2       2       2       0       2
NGRL         2       1       2       1       2
GRL          1       0       2       1       0
GRL          2       1       2       0       1

Table 7.2: A small extract of the dataset GRLDNA illustrating the conversion from base pairs to counts of the reference base.
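To make the recoding concrete, the following R sketch implements the reference-allele counting described above. It assumes the SNP columns are character vectors of genotypes such as "T/T"; the function name recode_snps and the data layout are mine, mirroring the description rather than the actual thesis code.

# Recode genotype strings to counts of the reference allele. The reference
# letter for each SNP is the first allele of the first individual.
recode_snps <- function(snp_data) {
  reference <- vapply(snp_data[1, ], function(g) strsplit(g, "/")[[1]][1], character(1))
  counts <- mapply(function(genotypes, ref) {
    vapply(strsplit(genotypes, "/"), function(alleles) sum(alleles == ref), integer(1))
  }, snp_data, reference)
  as.data.frame(counts)
}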

7.1 Analysis of GRLDNA

Throughout the analysis I use the CART procedure for single trees, since the running time of CART is very low when the independent variables are numeric, as in GRLDNA. In order to grow CART trees, there are 3 parameters to consider: Nmin, Nleaf (see Chapter 3) and CP, where CP is a complexity parameter controlling the degree of pruning as described in Section 3.2. I let Nmin = 3 · Nleaf, which is the default since it is interrelated with Nleaf, and I also let CP = 0.01, which is the default. Notice that Nleaf implicitly specifies the height of a tree, where small values of Nleaf will produce taller trees and large values will produce smaller trees. Throughout the analysis I will report both training errors and test errors, where the training error is an initial and rough indication of the prediction accuracy. Also, class errors are given when computing the training error, which is of great importance.
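In rpart, which is the CART implementation used here, these quantities presumably map onto minbucket, minsplit and cp; a minimal sketch with the default values just described could look as follows (grldna is a hypothetical name for the recoded data frame).

library(rpart)
n_leaf <- 7                                      # N_leaf; rpart's default minbucket
ctrl   <- rpart.control(minbucket = n_leaf,      # N_leaf
                        minsplit  = 3 * n_leaf,  # N_min = 3 * N_leaf
                        cp        = 0.01)        # CP
fit <- rpart(Population ~ ., data = grldna, method = "class", control = ctrl)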

In the first attempt to segregate GRL from NGRL I ran a CART procedure with all parameters set to default and obtained the confusion matrix in Table 7.3, where the training error is 1.24%.

2The original dataset consisted of 5572 observations; 3555 of them contained missing values. In this analysis I ignore this issue. A possible remedy is imputation.

Training error: 1.24%   GRL    NGRL   class error
GRL                      86      17   16.5%
NGRL                      8    1906   0.42%

Table 7.3: Confusion matrix obtained by CART with default settings. Rows are observations and columns are class predictions.

Furthermore, we see that 17 GRL were wrongly predicted as NGRL and 8 NGRL were wrongly predicted as GRL. Although the training error is very satisfactory, the class error of GRL is 16.5%, which we clearly want to reduce.
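A confusion matrix and the associated errors of this kind can be computed along the following lines (a sketch only; fit and grldna are the hypothetical objects from the previous snippet).

pred <- predict(fit, grldna, type = "class")
conf <- table(observed = grldna$Population, predicted = pred)
conf                               # rows are observations, columns are predictions
1 - sum(diag(conf)) / sum(conf)    # training error
1 - diag(conf) / rowSums(conf)     # class errors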

Next, I applied bivar to GRLDNA using the CART procedure, to assess the test error and locate a reasonable value of Nleaf such that the error is minimized.
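The experiment behind Figure 7.2 can be sketched with the bivar interface documented in Appendix B, letting rpart's minbucket play the role of Nleaf (again, grldna is a hypothetical object name).

library(rpart)
N_leaf <- 1:20
U <- vector(mode = "list")
for (i in seq_along(N_leaf)) {  # parallelise for large datasets
  ctrl   <- rpart.control(minbucket = N_leaf[i], minsplit = 3 * N_leaf[i], cp = 0.01)
  U[[i]] <- bivar(Population ~ ., "rpart", grldna, control = ctrl)
}
plot(bivars(U), N_leaf, x_lab = "N_leaf", reverse_x = TRUE)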

Figure 7.2: A bivar plot of GRLDNA with varying numbers of Nleaf. [The curves show the error, bias (b) and net variance (v_n) in %, with Nleaf running from 20 down to 1.]

In Figure 7.2 it is seen that the test error decreases from 3.5% to 2.75% as Nleaf becomes smaller. The smallest error, 2.6%, is obtained at Nleaf = 2, and the trend of increasing variance and decreasing bias is in accordance with the observations in Section 3.4 for C5.0 trees. Choosing Nleaf = 2 resulted in the confusion matrix in Table 7.4, where the training error is only 0.59%.

Training error: 0.59%   GRL    NGRL   class error
GRL                      98       5   4.9%
NGRL                      7    1907   0.37%

Table 7.4: Confusion matrix obtained by CART with Nleaf = 2.

Notice that the class error for GRL is reduced from 16.5% to 4.9%.

The first random forest was run with default values: T = 500 trees and mtry = ⌈√562⌉ = 24, which resulted in an OOB error of 0.74% and the confusion matrix seen in Table 7.5. I found that increasing the number of trees did not result in any improvements.

OOB error: 0.74%   GRL    NGRL   class error
GRL                 88      15   14.5%
NGRL                 0    1914   0%

Table 7.5: Confusion matrix obtained by random forests with T = 500 and mtry = 24.

However, setting mtry = 80 resulted in the lowest OOB error, 0.3%, where the class error of GRL is only 5.8%, amounting to 97 correct and 6 wrong predictions in the GRL class. All NGRL were still accurately predicted. Next, I ran a variable importance procedure to identify the most prominent variables, using both the mean decrease in accuracy and the mean decrease in Gini index, which resulted in the output in Figure 7.3, where it is seen that SNP_271 and SNP_241 are indeed good predictors.
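A sketch of the two random forest fits and the variable importance computation, using the randomForest package (importance = TRUE is required for the mean decrease in accuracy; the object names are mine):

library(randomForest)
set.seed(1)
rf_default <- randomForest(Population ~ ., data = grldna, ntree = 500, mtry = 24)
rf_default$confusion                       # OOB confusion matrix with class errors
rf_tuned <- randomForest(Population ~ ., data = grldna, ntree = 500, mtry = 80,
                         importance = TRUE)
varImpPlot(rf_tuned, n.var = 10)           # the kind of plot shown in Figure 7.3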

Figure 7.3: Variable importance plot obtained from a random forest. Only the 10 most important variables are displayed. [Most important first: by mean decrease in accuracy, SNP_241, SNP_271, SNP_42, SNP_66, SNP_71, SNP_479, SNP_416, SNP_394, SNP_41, SNP_356; by mean decrease in Gini, SNP_271, SNP_241, SNP_42, SNP_66, SNP_479, SNP_416, SNP_71, SNP_394, SNP_41, SNP_356.]

Since the mean decrease in accuracy and the mean decrease in Gini index give slightly different rankings, I decided to pool the important variables obtained from each measure and take their union, which resulted in the following 40 important SNPs:

[1] SNP_6 SNP_41 SNP_42 SNP_52 SNP_56 SNP_65 SNP_66 SNP_68 SNP_71

[10] SNP_76 SNP_95 SNP_140 SNP_152 SNP_163 SNP_171 SNP_233 SNP_234 SNP_241

[19] SNP_266 SNP_271 SNP_280 SNP_289 SNP_295 SNP_324 SNP_326 SNP_356 SNP_389

[28] SNP_394 SNP_400 SNP_412 SNP_416 SNP_449 SNP_472 SNP_473 SNP_479 SNP_503

[37] SNP_506 SNP_535 SNP_539 SNP_541
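A pooled selection of this kind can be computed from the importance matrix; the cutoff k below is a placeholder, since the number of variables taken from each ranking is not stated explicitly (rf_tuned is the hypothetical object from the previous sketch).

imp <- importance(rf_tuned)
k   <- 25                               # hypothetical cutoff per ranking
top_acc  <- rownames(imp)[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)[1:k]]
top_gini <- rownames(imp)[order(imp[, "MeanDecreaseGini"],     decreasing = TRUE)[1:k]]
important_snps <- union(top_acc, top_gini)   # pooled set of important SNPs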

In Figure 7.4, I have applied bivar with the CART procedure again, only with these 40 important variables, which, interestingly, resulted in an even better test error, 2% at Nleaf = 1, than observed in Figure 7.2. This is hugely favourable in terms of algorithmic running time, but also in terms of locating key variables that distinguish NGRL from GRL. Observe that the trend of increasing variance and decreasing bias also appears in Figure 7.4 until Nleaf ≤ 12, where the variance instead seems to decrease. In fact, at Nleaf = 1 the bias and variance contribute equally to the error, each accounting for 1% of the total error. The final decision tree produced by CART applied to GRLDNA with the aforementioned 40 important variables and Nleaf = 1 is depicted in Figure 7.6 and has training error 0.59% with a GRL class error of 5%. This tree serves as an insight into the complex black box which the random forest indeed is.
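The final tree in Figure 7.6 can be reproduced in outline as follows (a sketch; grldna and important_snps are the hypothetical objects from the previous snippets).

grldna40 <- grldna[, c("Population", important_snps)]
ctrl40   <- rpart.control(minbucket = 1, minsplit = 3, cp = 0.01)   # N_leaf = 1
tree40   <- rpart(Population ~ ., data = grldna40, control = ctrl40)
plot(tree40); text(tree40)   # base-graphics rendering of the fitted tree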

Finally, I ran GRLDNA through the boosting procedure boosting in adabag, where a zero training error was obtained after 5 iterations. This is also reflected in Figure 7.5, where the margin cumulative distribution graphs for T ≥ 5 iterations never lie to the left of zero.3 It is seen that, as the number of iterations increases, we gain more confidence in the AdaBoost classifier, and at T = 50 the cumulative distribution is nearly identical to the one with T = 200. This implies that the number of iterations needed to achieve a good prediction accuracy does not need to be large.

3The cumulative distribution graphs are obtained by calculating margins on the same data as used to build the model and thus they reflect the training error. We should expect a slight deviation if these graphs were based on new, unseen data.
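The adabag fit and the margin graphs can be sketched as follows; mfinal is adabag's name for the number of iterations T, and grldna is again a hypothetical object name.

library(adabag)
ada  <- boosting(Population ~ ., data = grldna, mfinal = 50)
marg <- margins(ada, grldna)$margins
plot(ecdf(marg), xlab = "m", ylab = "% observations", main = "")   # cf. Figure 7.5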


Figure 7.4: A bivar plot of GRLDNA with varying numbers of Nleaf, based on the 40 important variables. [The curves show the error, bias (b) and net variance (v_n) in %, with Nleaf running from 20 down to 1.]

Furthermore, running bivar with the boosting method, with T = 100 iterations, resulted in a test error of 0.5%, which is quite similar to the 0.3% OOB error achieved by the random forest procedure.
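This run can be sketched with bivar's internal AdaBoost learner, whose iter and maxdepth arguments are read from the ellipsis (see Appendix C.5); the chosen maxdepth is a placeholder.

library(rpart)
bv_boost <- bivar(Population ~ ., "boosting", grldna, B = 100,
                  iter = 100, control = rpart.control(maxdepth = 30))
bv_boost   # prints the error, bias and variance components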

Figure 7.5: Margin cumulative distribution graphs of GRLDNA with various numbers of iterations T. [Curves for T = 1, 2, 3, 4, 5, 20, 50 and 200; the x-axis shows the margin m from −1 to 1 and the y-axis the cumulative percentage of observations.]

7.2 Conclusion

I have shown throughout this analysis that the two state-of-the-art methods, random forests and AdaBoost, are extremely powerful in terms of segregating Greenlanders from non-Greenlanders based on 40 important SNPs. Furthermore, the single tree-structured CART procedure was able to produce some useful insight into how such a segregation is conducted. It would be interesting to redo this analysis with the complete data set of 5572 observations if the missing values could be handled, for instance by imputation. Hopefully the results would be as satisfactory as those seen here.

7.3 CART Decision Trees for GRLDNA

< 0.5

≥ 1.5

< 1.5

≥ 1.5 < 1.5

≥ 1.5

< 1.5

< 0.5

≥ 1.5

< 1.5

< 1.5 ≥ 1.5

≥ 1.5

< 1.5

≥ 0.5

< 0.5

≥ 0.5 < 0.5

≥ 0.5

≥ 0.5

≥ 0.5

< 1.5 ≥ 1.5

< 0.5 ≥ 0.5

< 0.5

< 0.5

< 0.5 ≥ 0.5

≥ 0.5

SNP271

SNP289

SNP326

SNP171

GRL55/56

NGRL3/3

NGRL8/8

SNP56

SNP171

SNP412

SNP163

GRL21/22

NGRL2/2

NGRL4/4

NGRL27/27

SNP241

SNP152

GRL7/8

NGRL14/14

NGRL98/99

SNP295

SNP66

GRL9/12

SNP266

GRL2/2

NGRL29/29

SNP66

SNP394

GRL3/3

NGRL36/38

NGRL1687/1690

Figure 7.6: A CART decision tree of GRLDNA with Nleaf = 1 constructed based on 40 importantvariables.

Bibliography

[1] Esteban Alfaro, Matías Gámez, and Noelia García. "adabag: An R Package for Classification with Boosting and Bagging". In: Journal of Statistical Software 54.2 (2013), pp. 1–35. URL: http://www.jstatsoft.org/v54/i02/ (cit. on p. 23).

[2] Revolution Analytics and Steve Weston. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.10. 2015. URL: https://CRAN.R-project.org/package=doParallel (cit. on p. 9).

[3] Peter L Bartlett and Mikhail Traskin. "AdaBoost is consistent". In: Journal of Machine Learning Research 8.Oct (2007), pp. 2347–2368 (cit. on p. 36).

[4] Simon Bernard, Laurent Heutte, and Sébastien Adam. "A study of strength and correlation in random forests". In: International Conference on Intelligent Computing. Springer. 2010, pp. 186–191 (cit. on p. 29).

[5] Leo Breiman. "Bagging predictors". In: Machine Learning 24.2 (1996), pp. 123–140 (cit. on pp. 23, 24).

[6] Leo Breiman. "Random forests". In: Machine Learning 45.1 (2001), pp. 5–32 (cit. on p. 25).

[7] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984 (cit. on pp. 11, 19–21).

[8] Leo Breiman et al. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)". In: Statistical Science 16.3 (2001), pp. 199–231 (cit. on p. 1).

[9] John M Butler. Advanced Topics in Forensic DNA Typing: Methodology. Academic Press, 2011 (cit. on p. 44).

[10] Pedro Domingos. "A unified bias-variance decomposition". In: Proceedings of the 17th International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann. 2000, pp. 231–238 (cit. on pp. 6, 8).

[11] Dirk Eddelbuettel, Romain François, J Allaire, John Chambers, Douglas Bates, and Kevin Ushey. "Rcpp: Seamless R and C++ integration". In: Journal of Statistical Software 40.8 (2011), pp. 1–18 (cit. on p. 9).

[12] Bradley Efron and Trevor Hastie. Computer Age Statistical Inference. Vol. 5. Cambridge University Press, 2016 (cit. on p. 1).

[13] Usama M Fayyad and Keki B Irani. "On the handling of continuous-valued attributes in decision tree generation". In: Machine Learning 8.1 (1992), pp. 87–102 (cit. on p. 16).

[14] Yoav Freund and Robert E Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting". In: European Conference on Computational Learning Theory. Springer. 1995, pp. 23–37 (cit. on p. 33).

[15] Max Kuhn, Steve Weston, Nathan Coulter, and Mark Culp. C code for C5.0 by R. Quinlan. C50: C5.0 Decision Trees and Rule-Based Models. R package version 0.1.0-24. 2015. URL: https://CRAN.R-project.org/package=C50 (cit. on p. 9).

[16] Mads Lindskou. ClassifyR: An implementation of bias-variance decomposition for zero-one loss. R package version 0.1.1. URL: http://github.com/Lindskou/ClassifyR (cit. on p. 8).

[17] Gilles Louppe. "Understanding random forests: From theory to practice". In: arXiv preprint arXiv:1407.7502 (2014) (cit. on pp. 4, 11).

[18] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN: 0262018020, 9780262018029 (cit. on p. 1).

[19] J Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014 (cit. on pp. 16, 18).

[20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2016. URL: https://www.R-project.org/ (cit. on p. 8).

[21] Laura Elena Raileanu and Kilian Stoffel. "Theoretical comparison between the Gini index and information gain criteria". In: Annals of Mathematics and Artificial Intelligence 41.1 (2004), pp. 77–93 (cit. on p. 19).

[22] Juan José Rodriguez, Ludmila I Kuncheva, and Carlos J Alonso. "Rotation forest: A new classifier ensemble method". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 28.10 (2006), pp. 1619–1630 (cit. on p. 26).

[23] Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. "Boosting the margin: A new explanation for the effectiveness of voting methods". In: The Annals of Statistics 26.5 (1998), pp. 1651–1686 (cit. on pp. 37, 40).

[24] Erwan Scornet, Gérard Biau, Jean-Philippe Vert, et al. "Consistency of random forests". In: The Annals of Statistics 43.4 (2015), pp. 1716–1741 (cit. on p. 29).

[25] Terry Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursive Partitioning and Regression Trees. R package version 4.1-10. 2015. URL: https://CRAN.R-project.org/package=rpart (cit. on p. 9).

[26] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second edition. Springer Series in Statistics, 2009. ISBN: 9780387848570 (cit. on pp. 1, 5).

[27] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth edition. ISBN 0-387-95457-0. New York: Springer, 2002. URL: http://www.stats.ox.ac.uk/pub/MASS4 (cit. on p. 9).

[28] Hadley Wickham and Winston Chang. devtools: Tools to Make Developing R Packages Easier. R package version 1.12.0. 2016. URL: https://CRAN.R-project.org/package=devtools (cit. on p. 9).

[29] Hadley Wickham and Romain Francois. dplyr: A Grammar of Data Manipulation. R package version 0.5.0. 2016. URL: https://CRAN.R-project.org/package=dplyr (cit. on p. 9).

[30] Xindong Wu and Vipin Kumar. Top 10 Algorithms in Data Mining. Vol. 14. 1. Springer, 2008, pp. 1–37 (cit. on pp. 1, 13, 16, 33).

A Hoeffding’s Inequality

Let Z be any non-negative random variable and t > 0. Then

E[Z] = E[Z · 1[Z ≥ t]] ≥ E[t · 1[Z ≥ t]] = P(Z ≥ t) · t,

implying that P(Z ≥ t) ≤ E[Z]/t, which is also known as Markov's inequality. Using this we can, for an arbitrary random variable X, derive Chebyshev's inequality:

P(|X − E[X]| ≥ t) = P((X − E[X])^2 ≥ t^2) ≤ Var[X]/t^2.

A.1 Lemma: Let X be a random variable such that E[X] = 0 and a ≤ X ≤ b with probability 1. Then

E[e^{sX}] ≤ e^{s^2 (b−a)^2 / 8},   s > 0.

Proof: Notice that

e^{sx} = e^{αsb + (1−α)sa},

when α = (x−a)/(b−a). Since the exponential function is convex and α ∈ [0, 1], this implies that

e^{sx} ≤ α e^{sb} + (1−α) e^{sa} = ((x−a)/(b−a)) e^{sb} + ((b−x)/(b−a)) e^{sa}.

Thus

E[e^{sX}] ≤ E[((X−a)/(b−a)) e^{sb}] + E[((b−X)/(b−a)) e^{sa}]
         = (b/(b−a)) e^{sa} − (a/(b−a)) e^{sb}
         = (1 − λ + λ e^{s(b−a)}) e^{−λs(b−a)},

since E[X] = 0 and where λ = −a/(b−a). Now let u = s(b−a) and define

φ(u) := −λu + log(1 − λ + λe^u)

such that

E[e^{sX}] ≤ (1 − λ + λe^{s(b−a)}) e^{−λs(b−a)} = e^{φ(u)}.   (A.1)

We see that φ(0) = 0 and

φ′(u) = −λ + λe^u / (1 − λ + λe^u),   (A.2)

implying that also φ′(0) = 0. From (A.2) we have

φ″(u) = [λe^u (1 − λ + λe^u) − λ^2 e^{2u}] / (1 − λ + λe^u)^2
      = (λe^u / (1 − λ + λe^u)) · (1 − λe^u / (1 − λ + λe^u))
      = κ(1 − κ)

where κ = λe^u / (1 − λ + λe^u). Note that κ(1 − κ) ≤ 1/4 for any value of κ. Finally, a Taylor series of φ around zero with remainder is given by

φ(u) = φ(0) + uφ′(0) + (u^2/2) φ″(ν),   for some ν ∈ [0, u]
     = (u^2/2) φ″(ν)
     ≤ u^2 / 8
     = s^2 (b−a)^2 / 8,

and the result now follows from (A.1).

A.2 Theorem (Hoeffding's Inequality): Let Z_1, Z_2, …, Z_N be independent bounded random variables such that Z_i ∈ [a_i, b_i] with probability 1. Let S_N = ∑_{i=1}^N Z_i. Then for any t > 0 we have

P(S_N − E[S_N] ≥ t) ≤ e^{−2t^2 / ∑_{i=1}^N (b_i − a_i)^2}.   (A.3)

Proof: For s > 0 we have by independence and Markov's inequality that

P(S_N − E[S_N] ≥ t) = P(e^{s(S_N − E[S_N])} ≥ e^{st}) ≤ e^{−st} E[e^{s(S_N − E[S_N])}] = e^{−st} ∏_{i=1}^N E[e^{s(Z_i − E[Z_i])}].

Hence we seek a good bound for the expectation E[e^{s(Z_i − E[Z_i])}]. Using Lemma A.1 we have

P(S_N − E[S_N] ≥ t) ≤ e^{−st} ∏_{i=1}^N E[e^{s(Z_i − E[Z_i])}] ≤ e^{−st} e^{s^2 ∑_{i=1}^N (b_i − a_i)^2 / 8}.   (A.4)

Choosing s = 4t / ∑_{i=1}^N (b_i − a_i)^2 to minimize (A.4) we obtain

P(S_N − E[S_N] ≥ t) ≤ e^{−2t^2 / ∑_{i=1}^N (b_i − a_i)^2},

which is the desired result.

A.3 Corollary: With the same assumptions as in Theorem A.2 it holds that

P(S_N − E[S_N] ≤ −t) ≤ e^{−2t^2 / ∑_{i=1}^N (b_i − a_i)^2}   and   P(|S_N − E[S_N]| ≥ t) ≤ 2e^{−2t^2 / ∑_{i=1}^N (b_i − a_i)^2}.

Proof: To prove the first inequality one just needs to apply (A.3) to the random variables −Z_1, −Z_2, …, −Z_N. Finally, the second inequality follows by using (A.3) and the first inequality simultaneously.
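As a quick numerical sanity check of the two-sided bound in Corollary A.3, one can compare it with a simulated tail probability; the following R snippet (purely illustrative, with arbitrarily chosen N and t) does this for i.i.d. Uniform(0, 1) variables, for which b_i − a_i = 1.

set.seed(1)
N <- 100; t <- 10; reps <- 1e5
S <- replicate(reps, sum(runif(N)))
mean(abs(S - N / 2) >= t)    # empirical two-sided tail probability
2 * exp(-2 * t^2 / N)        # Hoeffding bound: 2 * exp(-2 t^2 / sum((b_i - a_i)^2))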

B Vignette of ClassifyR

Package ‘ClassifyR’, August 19, 2017

Type Package

Title An implementation of bias-variance decomposition for zero-one loss

Version 1.0.1

Description Compute the bias-variance decomposition of the expected generalization error according to the approach in Domingos (2000).

License GPL (>= 2)

Encoding UTF-8

LazyData true

Depends Rcpp (>= 0.12.10), tidyr, dplyr, doParallel, ggplot2

Suggests C50, rpart, class, RWeka, randomForest, adabag

LinkingTo Rcpp

RoxygenNote 6.0.1

URL http://github.com/Lindskou/ClassifyR

BugReports http://github.com/Lindskou/ClassifyR

R topics documented: bivar, bivars, bivar_methods, plot.bivars, print.bivar, tuning_params

bivar Bias-variance decomposition for the zero-one loss function

Description

Compute the bias-variance decomposition of the misclassification rate according to the approach in Domingos (2000).

Usage

bivar(form, method, data, B = 100, div = 4/5, ...)

Arguments

form Formula

method A character with a learner name.

data Data frame

B Number of bootstrap samples

div The ratio of observations used for training

... Further arguments passed to a learner

Details

The ellipsis argument, ..., is used to pass on further arguments to one of the methods. For more details about available methods, use bivar_methods(). Use tuning_params(method) to see a list of tuning/complexity parameters available for a specific method. Use bivars to turn a list of elements of type bivar into an object of class bivars, and apply plot to this object along with a vector of tuning parameters to obtain a plot.

Value

A list with the expected error, bias, variance, unbiased variance, biased variance and the net variance.

Author(s)

Mads Lindskou

References

Domingos, P. (2000). A unified bias-variance decomposition for zero-one and squared loss.

See Also

bivars, bivar_methods, tuning_params and plot.bivars

Examples

# knn from library class
bivar(Species ~ ., "knn", iris, k = 5)

# C5.0 from library C50
bivar(Species ~ ., "C5.0", iris)

# The following only works when package C50 is loaded
# since C5.0Control is in namespace C50

library(C50)
ctrl1 = C5.0Control(minCases = 20)
bivar(Species ~ ., "C5.0", iris, control = ctrl1)

# rpart from library rpart
bivar(Species ~ ., "rpart", iris)

# The following only works when package rpart is loaded
# since rpart.control is in namespace rpart

library(rpart)
ctrl2 = rpart.control(maxdepth = 2)
bivar(Species ~ ., "rpart", iris, control = ctrl2)

# The following example is from 'Title of thesis' by Mads Lindskou
U = vector(mode = "list")
form = Species ~ .
K = 1:15

for (i in K) { # Parallelize for large datasets
  U[[i]] = bivar(form, "knn", iris, k = i)
}

U = bivars(U)
plot(U, K, x_lab = "k")

bivars Converting a list of bivar objects into a bivars object

Usage

bivars(bivar_list)

Arguments

bivar_list A list with objects of type bivar

bivar_methods Listing the possible learners for bivar

Usage

bivar_methods()

plot.bivars Plotting an object of class bivars

Usage

## S3 method for class 'bivars'
plot(bivars, tp, x_lab = "Tuning parameter", reverse_x = FALSE)

Arguments

bivars      An object of type bivars
tp          A vector with tuning parameters
x_lab       Label of the x-axis
reverse_x   Logical: should the x-axis be reversed?

print.bivar Default printing method for objects of type bivar

Usage

## S3 method for class 'bivar'
print(x)

Arguments

x An object of type bivar

tuning_params Listing the possible tuning parameters for method in function bivar

Usage

tuning_params(method)

Arguments

method A character with a learner name. See bivar_methods()

Examples

tuning_params("rpart")

C Code Chunks from ClassifyR

C.1 bivar

#' @name bivar
#' @title Bias-variance decomposition for the zero-one loss function
#' @description Compute the bias-variance decomposition of the misclassification
#' rate according to the approach in Domingos (2000).
#' @param form Formula
#' @param method A character with a learner name.
#' @param data Data frame
#' @param B Number of bootstrap samples
#' @param div The ratio of observations used for training
#' @param ... Further arguments passed to a learner
#' @details The ellipsis argument, \code{...}, is used to pass on further arguments
#' to one of the methods. For more details about available methods, use
#' \code{bivar_methods()}. Use \code{tuning_params(method)} to see a list of tuning/complexity
#' parameters available for a specific method. Use \code{bivars} to turn a list of
#' elements of type bivar into a class of bivars and apply this object to
#' \code{plot} along with a vector of tuning parameters to obtain a plot.
#' @return A list with the expected error, bias, variance, unbiased variance,
#' biased variance and the net variance.
#' @author Mads Lindskou
#' @examples
#' # knn from library class
#' bivar(Species ~ ., "knn", iris, k = 5)
#'
#' # C5.0 from library C50
#' bivar(Species ~ ., "C5.0", iris)
#'
#' # The following only works when package C50 is loaded
#' # since C5.0Control is in namespace C50
#'
#' library(C50)
#' ctrl1 = C5.0Control( minCases = 20)
#' bivar(Species ~ ., "C5.0", iris, control = ctrl1)
#'
#' # rpart from library rpart
#' bivar(Species ~ ., "rpart", iris)
#'
#' # The following only works when package rpart is loaded
#' # since rpart.control is in namespace rpart
#'
#' library(rpart)
#' ctrl2 = rpart.control( maxdepth = 2)
#' bivar(Species ~ ., "rpart", iris, control = ctrl2)
#'
#' U = vector(mode = "list")
#' form = Species ~ .
#' K = 1:15
#'
#' for ( i in K ) { # Parallelize for large datasets
#'   U[[i]] = bivar(form, "knn", iris, k = i)
#' }
#'
#' U = bivars(U)
#' plot(U, K, x_lab = "k")
#' @seealso \code{\link{bivars}}, \code{\link{bivar_methods}}, \code{\link{tuning_params}}
#' and \code{\link{plot.bivars}}
#' @references Domingos, P. (2000). A unified bias-variance decomposition for
#' zero-one and squared loss.
#' @import Rcpp
#' @import dplyr
#' @import doParallel
#' @export

bivar = function(form,
                 method,
                 data,
                 B = 100,
                 div = 4/5,
                 ...) {
  # Error handling of ellipsis argument
  if( !is.character(method)) stop("method has to be a character. See bivar_methods()")
  if( !class(form) == "formula") stop("form must be of class formula")
  if( !( method %in% bivar_methods()$method ) ) stop("Not a valid method. See bivar_methods()")
  catch_ellipsis(method, list(...))
  # .call = match.call()
  envir = environment()
  class = as.character(form[[2]]) # Bad naming!
  if(!is.factor(data[, c(class)])) stop("Response variable must be of type factor")

  n_class = length(levels(data[, c(class)]))
  # if( method == "boost" && n_class > 2 ) stop("Boosting can only handle binary responses")
  levels(data[, c(class)]) = 1:n_class
  N = nrow(data)

  # Split data into training and test data
  set.seed(90210)
  n_train = floor(div * N)
  n_test = N - n_train
  train_indx = sample(1:N, n_train)
  train = data[train_indx, ]
  test = data[-train_indx, ]

  P = switch( method,
    C5.0 = predict_C5.0( form, data, ... ),
    rpart = predict_rpart( form, data, ... ),
    knn = predict_knn(form, data, class, envir, ...),
    J48 = predict_J48(form, data, ...),
    bagging = predict_bagging(form, data, ...),
    randomForest = predict_randomForest(form, data, ...),
    boosting = predict_boosting(form, data, ...)
  )

  mc = parallel::detectCores()
  cl = parallel::makeCluster(mc)
  doParallel::registerDoParallel(cl, cores = mc)
  spl = clsplit(mc, B)

  low = 0L
  high = 0L

  pred_list = foreach(j = 1:mc) %dopar% {

    if(j == 1) {
      low = 1
      high = spl[1]
    } else {
      low = spl[j-1] + 1
      high = spl[j]
    }

    out = list()
    for(b in low:high) {

      if (method == "knn")
      {
        S_b = dplyr::slice( train, sample(1:n_train, n_train, replace = TRUE) )
        true_labels_S_b = S_b[, class]
        S_b = dplyr::select( S_b, -matches( class ) )
        pred_b = P(S_b, test_knn, true_labels_S_b, ...)
      } else
      {
        S_b = dplyr::slice(train, sample(1:n_train, n_train, replace = TRUE) )
        if( method == "boosting") {
          pred_b = P(as.data.frame(S_b), test, ...)
        } else pred_b = P(S_b, test, ...)
      }
      out[[b - low + 1]] = pred_b
    }
    out
  }

  stopCluster(cl)
  pred_list = unlist(pred_list, recursive = FALSE)
  pred_matrix = matrix( unlist(pred_list), ncol = n_test, byrow = TRUE )

  # Main prediction
  ym = apply(pred_matrix,
             2,
             function(x) {
               names(table(x))[which.max(table(x))]
             }
  )
  # Bias
  bias = as.numeric(ym != test[, c(class)])
  bias_false = which(bias == 0)
  bias_true = which(bias == 1)

  # Variance
  V = rbind(ym, pred_matrix)
  variance = apply(V,
                   2,
                   function(x) length(which(x != x[1])) / B
  )

  # Correcting factor in multiclass problems
  if( n_class > 2 )
  {
    kappa_list = apply(V,
                       2,
                       function(x) {
                         ym_neq_pred = which(x != x[1])
                         p1 = length(which(x[ym_neq_pred] == test[ym_neq_pred, c(class)]))
                         p2 = length(ym_neq_pred)
                         ifelse(p2 == 0, 0, p1/p2)
                       }
    )
    kappa = bias * kappa_list
  } else
  {
    kappa = 1
  }

  # Unbiased Variance
  variance_u = variance
  variance_u[bias_true] = 0

  # Biased Variance
  variance_b = variance
  variance_b[bias_false] = 0

  # Net Variance
  variance_n = variance_u - kappa * variance_b

  # Error
  error = bias + variance_n

  out = list(error = mean(error),
             b = mean(bias),
             v = mean(variance),
             v_u = mean(variance_u),
             v_b = mean(variance_b),
             v_n = mean(variance_n))

  class(out) = c(class(out), "bivar")
  return(out)
}

print.bivar = function(x) {
  x = as.data.frame(x)
  x[] = lapply(x, round, digits = 4)
  print(x, row.names = FALSE)
}


C.2 catch_ellipsis

catch_ellipsis = function(method, expr) {

  M = which(bivar_methods()$method == method)
  pckg = as.character(bivar_methods()$package[M])

  tryCatch(expr
    , warning = function(w) {
        stop(paste("Ellipsis argument not valid. Maybe you have forgotten to load package ", pckg))
      }
    , error = function(e) {
        stop(paste("Ellipsis argument not valid. Maybe you have forgotten to load package ", pckg))
      }
  )
}

C.3 bivar_methods

#' @name bivar_methods
#' @title Listing the possible learners for \code{bivar}
#' @export
bivar_methods = function() {

  learners = c("C5.0",
               "rpart",
               "knn",
               "J48",
               "bagging",
               "randomForest",
               "boosting")

  pckgs = c("C50",
            "rpart",
            "class",
            "RWeka",
            "adabag",
            "randomForest",
            "Internal function using rpart")

  lrn_type = c("Tree",
               "Tree",
               "Non-parametric method",
               "Tree (C4.5)",
               "Ensemble",
               "Ensemble",
               "Ensemble (Binary AdaBoost)")

  out = data.frame(method = learners, package = pckgs, type = lrn_type)
  out
}


C.4 bivars

is.bivar = function(x) {
  ifelse("bivar" %in% class(x), TRUE, FALSE)
}

#' @name bivars
#' @title Converting a list of bivar objects into a bivars object
#' @param bivar_list A list with objects of type bivar
#' @export
bivars = function(bivar_list) {
  if(sum(unlist(lapply(bivar_list, is.bivar))) != length(bivar_list)) {
    stop("One or more elements is not of type bivar")
  }
  class(bivar_list) = c(class(bivar_list), "bivars")
  invisible(bivar_list)
}

#' @name plot.bivars
#' @title Plotting an object of class bivars
#' @param bivars An object of type bivars
#' @param tp A vector with tuning parameters
#' @param x_lab Label of x-axis
#' @param reverse_x Logical: Should the x-axis be reversed?
#' @import tidyr
#' @import dplyr
#' @import ggplot2
#' @export
plot.bivars = function(bivars, tp, x_lab = "Tuning parameter", reverse_x = FALSE) {

  D = lapply(bivars, unlist)
  D = as.data.frame(do.call(rbind, D))
  D = D %>% tidyr::gather(error_source, estimate)
  D = cbind( tuning_parameter = rep(tp, 6) , D)
  D$estimate = 100 * D$estimate
  D$tuning_parameter = factor(D$tuning_parameter)

  p = ggplot(D, aes( tuning_parameter,
                     estimate,
                     group = error_source,
                     color = error_source))

  p = p + geom_point(stat = "summary", fun.y = sum)
  p = p + stat_summary(fun.y = sum, geom = "line")
  p = p + theme_bw()
  p = p + ylab("Error (%)")
  p = p + theme(legend.title = element_blank())
  p = p + xlab(x_lab)
  if(reverse_x) p = p + scale_x_discrete(limits = rev(levels(D$tuning_parameter)))
  p
}

C.5 predict_<function>

predict_boosting = function(form, data, ...) {

  # iter SHALL be determined in the ellipses!
  # iter = 100, depth = 30
  if(!requireNamespace("rpart", quietly = TRUE)) stop("Package 'rpart' is missing")

  adaboost <- function(form, data, test, ...) {

    # require(rpart)
    # Output must be factor - ensured by bivar!

    # The following is CRUCIAL in order for rpart to use the weight argument properly
    # See http://r.789695.n4.nabble.com/Rpart-and-case-weights-working-with-functions-td849795.html
    environment(form) <- environment()
    y_char  <- as.character(form[[2]])
    y_class <- levels(data[, y_char]) # Use levels() instead!
    if(length(y_class) != 2) stop("Output class must have exactly two levels")

    data[, c(y_char)] <- ifelse( data[, c(y_char)] == y_class[1], "-1", "1")
    test[, c(y_char)] <- ifelse( test[, c(y_char)] == y_class[1], "-1", "1")

    y_train <- as.factor(data[, c(y_char)])
    y_new   <- as.factor(test[, c(y_char)])

    N   <- nrow(data)
    W_t <- rep(1/N, N)
    D_t <- data

    h_ts <- list()
    a_ts <- list()
    cnt  <- 1
    iter  <- list(...)$iter
    depth <- ( list(...)$control )$maxdepth

    for( t in 1:iter ) { # control = rpart::rpart.control( maxdepth = depth )
      h_t <- rpart::rpart(form, D_t, weights = W_t,
                          control = rpart::rpart.control( maxdepth = depth ))
      h_t_pred_train <- unname(predict(h_t, data, type = "class"))
      h_t_pred_new   <- unname(predict(h_t, test, type = "class"))
      eps_t <- sum( W_t * ifelse( h_t_pred_train == y_train, 0, 1))
      if( eps_t >= 0.5 ) break
      if( eps_t == 0 ) {
        h_ts[[cnt]] <- h_t_pred_new
        a_ts[[cnt]] <- 1e06
        break
      }
      a_t   <- 0.5 * log( (1 - eps_t) / ( eps_t ) )
      Z_t   <- 2 * sqrt( eps_t * (1 - eps_t) )
      exp_t <- exp( -a_t * ifelse(h_t_pred_train == y_train, 1, -1) )
      W_t   <- W_t * exp_t / Z_t

      h_ts[[cnt]] <- h_t_pred_new
      a_ts[[cnt]] <- a_t
      cnt <- cnt + 1
    }

    h <- lapply(h_ts, function(f) as.numeric(levels(f))[f])
    h <- unname(as.matrix(as.data.frame(h)))
    a_ts <- unlist(a_ts)
    H <- h %*% a_ts
    S <- sign(H)
    y_boost <- as.factor(ifelse(S == -1, as.character(y_class[1]), as.character(y_class[2])))

    return(y_boost)
  }

  P = function(data, test, ...) {
    out = adaboost(form, data, test, ...)
    out
  }
  return(P)
}

predict_C5.0 = function(form, data, ...) {

  if(!requireNamespace("C50", quietly = TRUE)) stop("Package 'C50' is missing")

  L = function(data, ...) {
    C50::C5.0(form, data, ...)
  }

  P = function(train, test, ...) {
    out = predict( L(train, ...), test )
    factor(out)
  }

  return(P)
}

predict_rpart = function(form, data, ...) {

  if(!requireNamespace("rpart", quietly = TRUE)) stop("Package 'rpart' is missing")


  L = function(data, ...) {
    rpart::rpart(form, data, ...)
  }

  P = function(train, test, ...) {
    out = unname(predict(L(train, ...), test, type = "vector"))
    factor(out)
  }

  return(P)
}

predict_knn = function(form, data, class, envir, ...) {

  if(!requireNamespace("class", quietly = TRUE)) stop("Package 'class' is missing")
  # Selecting all independent variables for the knn syntax
  rhs_form = as.character(form)[3]
  if( length(rhs_form) == 1 && rhs_form == ".") {
    vars = names(data); rm = which(vars == class); vars = vars[-rm]
  } else {
    str = strsplit(rhs_form, " ")[[1]]
    del = which(str == "+")
    vars = str[-del]
  }

  assign("test_knn", envir$test[, vars], envir)

  P = function(train, test, train_labels, ...) {
    class::knn(train, test, train_labels, ...)
  }

  return(P)
}

predict_J48 = function(form, data, ...) {

  if(!requireNamespace("RWeka", quietly = TRUE)) stop("Package 'RWeka' is missing")

  L = function(data, ...) {
    RWeka::J48(form, data, ...)
  }

  P = function(train, test, ...) {
    out = predict(L(train, ...), test)
    factor(out)
  }

  return(P)
}

predict_randomForest = function(form, data, ...) {

  if(!requireNamespace("randomForest", quietly = TRUE)) stop("Package 'randomForest' is missing")

  L = function(data, ...) {
    randomForest::randomForest(form, data, ...)
  }

  P = function(train, test, ...) {
    out = unname(predict( L(train, ...), test, type = "response" ))
    factor(out)
  }
  return(P)
}
