

Computational Statistics & Data Analysis 50 (2006) 533–550
www.elsevier.com/locate/csda

Boosting and instability for regression trees

Servane Gey (a,∗), Jean-Michel Poggi (b)

(a) Laboratoire MAP5, Université Paris V, 45, rue des Saints Pères, 75006 Paris Cedex 06, France
(b) Laboratoire de mathématiques, U.M.R. 8628, Université Paris XI, 91405 Orsay, France

Received 13 June 2003; received in revised form 3 September 2004; accepted 4 September 2004. Available online 30 September 2004.

Abstract

The AdaBoost-like algorithm for boosting CART regression trees is considered. The boosting predictors sequence is analysed on various data sets and the behaviour of the algorithm is investigated. An instability index of a given estimation method with respect to some training sample is defined. Based on the bagging algorithm, this instability index is then extended to quantify the additional instability provided by the boosting process with respect to the bagging one. Finally, the ability of boosting to track outliers and to concentrate on hard observations is used to explore a non-standard regression context.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Bagging; Boosting; CART; Instability; Prediction; Regression

1. Introduction

Coming from the machine learning community, many authors have proposed boosting algorithms during the nineties. In the first half of the decade, the main concern was around the so-called PAC (Perturb and Combine) algorithms; see Schapire (1999) for a concise overview. Starting from a training set {(X_i, Y_i)}_{1≤i≤n}, the problem addressed is to fit a model delivering a prediction ŷ = f̂(x) at input x. The basic idea is to generate many different base predictors obtained by perturbing the training set and to combine them. Breiman (1996a) introduces bagging, theoretically analysed by Bühlmann and Yu (2002), which averages the predictions given by a sequence of predictors built from bootstrap samples of the training set, and then proposes different variants (see Breiman, 2001a,b).

∗ Corresponding author. Tel.: +33 1 42862113; fax: +33 1 69156577.
E-mail addresses: [email protected] (S. Gey), [email protected] (J.-M. Poggi).

0167-9473/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2004.09.001


After the paper of Freund and Schapire (1997) introducing the AdaBoost algorithm for classification problems, considerable interest in boosting schemes grew not only in artificial intelligence but also in statistics. The basic idea of boosting is to improve the performance of a given estimation method using a bagging-like scheme, except that each predictor copes with a bootstrap sample obtained from the original one by adaptive resampling, highlighting the observations that are poorly predicted. Many contributions try to elucidate the surprisingly good behaviour of this algorithm in the classification context; see Schapire et al. (1998), Friedman et al. (2000), Lugosi and Vayatis (2004) and Blanchard et al. (2003). An idea of the discussions five years ago can be found in the important discussion paper written by Breiman (1998) about arcing classifiers.

This paper deals with boosting for regression problems. Roughly speaking, two different approaches have been considered. The first one is the gradient-based approach following the ideas initiated by Friedman (see Friedman, 2001; Zemel and Pitassi, 2001; Rätsch et al., 2000). The second one, following an entirely different line, stays closer to the original algorithm of Freund and Schapire. On the one hand, Ridgeway et al. (1999) consider the regression problem as a large classification one. On the other hand, Drucker (1997) proposes a direct adaptation of AdaBoost to the regression framework, which exhibits interesting performance when boosting CART regression trees (see also Borra and Di Ciacco, 2002). These two papers conclude with interesting remarks about global performance comparisons but do not analyse the boosting predictors sequence and do not address the stability issue. These two points are the main purpose of this paper.

The paper is organized as follows. Section 2 sets the model and presents the regression tree estimation method. Drucker's boosting algorithm is recalled in Section 3. The six reference data sets are described in Section 4. Section 5 analyses the boosting predictors sequence. Section 6 provides definitions of two instability indices for a given estimation method with respect to some training sample: an instability index based on bagging and an incremental instability index measuring the contribution to instability provided by boosting with respect to bagging. The ability of boosting techniques to track outliers and to concentrate on hard observations is then used in Section 7 to explore a non-standard regression context. Some concluding remarks are collected in Section 8.

2. Model and estimation method

We consider the following regression model:

$$Y = f(X) + \varepsilon, \qquad (1)$$

where (X, Y) ∈ 𝒳 × ℝ, f is the unknown regression function to be recovered and ε is an unobservable additive noise, centered conditionally on X, with unknown variance σ²_ε. The generalization error of an estimator f̂ of f is defined as R(f, f̂) = E[‖f − f̂‖²].

Let us consider a training (or learning) sample L and a test sample T, of respective sizes n and n_t, composed of independent realizations of the variable (X, Y). The goal is then to construct from L an estimator f̂ of f having a low generalization error. Since the joint distribution is unknown, we evaluate the training error (often called the resubstitution error) and the prediction error of f̂, respectively defined as R_L(f, f̂) and R_T(f, f̂), where, for a given sample S of size m of the random variable (X, Y),

$$R_S(f, \hat{f}) = \frac{1}{m} \sum_{(X_i, Y_i) \in S} \big(Y_i - \hat{f}(X_i)\big)^2.$$

Since f̂ is constructed from L and since T and L are independent, R_T(f, f̂) is used as an estimator of R(f, f̂).

In this paper we focus on CART regression trees to generate predictors. One particularly attractive property of this estimation method is, in this context, its instability: the bootstrap regression trees do not have the same number of terminal nodes and involve different features of the data. Let us denote by A the base algorithm providing some base estimator (or predictor) f̂ of f, which is here the well-known CART for regression (see Breiman et al., 1984), where the choice of the final tree is performed using a tuning set randomly taken from L, keeping T to give independent estimates of R(f, f̂).
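As an illustration (not part of the original paper), the following minimal Python sketch computes these training and prediction errors; scikit-learn's DecisionTreeRegressor stands in for CART with its pruning/tuning step, the data are random placeholders, and the helper name empirical_risk is ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def empirical_risk(model, X, y):
    """Empirical error R_S(f, f_hat): mean squared error on the sample S = (X, y)."""
    return np.mean((y - model.predict(X)) ** 2)

# Placeholder data standing in for L (size n = 500) and T (size nt = 600).
rng = np.random.default_rng(0)
X_L, y_L = rng.uniform(size=(500, 10)), rng.normal(size=500)
X_T, y_T = rng.uniform(size=(600, 10)), rng.normal(size=600)

# A single regression tree; the depth limit is a crude stand-in for CART's pruning step.
tree = DecisionTreeRegressor(max_depth=6).fit(X_L, y_L)

print("training error R_L:", empirical_risk(tree, X_L, y_L))
print("prediction error R_T:", empirical_risk(tree, X_T, y_T))
```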

3. Algorithms

To improve the performance of CART, we focus on two algorithms: bagging and boosting.

Bagging is based on a bootstrap scheme: first, generate K replicate samples L_k from the uniform distribution, then construct predictors f̂_k^(bag) using A on L_k, and finally aggregate by averaging

$$\hat{f}^{(bag)} = \frac{1}{K} \sum_{k=1}^{K} \hat{f}_k^{(bag)}.$$

It is now well known that bagging improves, often considerably, the performance of the considered method A (see Breiman, 1996a, 2001b and Drucker, 1997 for regression trees), which leads us to take bagging as the reference against which boosting is evaluated.
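A minimal sketch of this bagging scheme (again with scikit-learn trees as the base method A; the depth limit, K and the function names are our own choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_trees(X, y, K=200, rng=None):
    """Fit K regression trees on bootstrap samples L_k drawn uniformly from (X, y)."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(y)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample L_k
        trees.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Aggregate the K predictors by averaging their predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```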

The boosting algorithm used here is given in Table 1. It is the algorithm of Drucker (1997), directly adapted from the AdaBoost algorithm of Freund and Schapire (1997) for classification problems. It rests on two ideas, theoretically motivated for the classification case in Freund and Schapire (1997). The first concerns the updating of the resampling distribution. At iteration k, the distribution p_{k+1} is computed so as to increase the probability that observations of L poorly predicted by f̂_k appear in the next sample L_{k+1}; the next predictor f̂_{k+1} will therefore concentrate more on these observations. Here β_k (see Table 1) can be viewed as a synthetic index of the performance of f̂_k on L_k, where, for brevity, we refer to the performance on L under p_k as the performance on L_k. The smaller β_k, the better f̂_k performs on L_k. Since d_k(i) is the individual performance of f̂_k on L, multiplying p_k(i) by β_k^(1−d_k(i)) merges in p_{k+1} the global performance of f̂_k on L_k and the individual performances on L.


Table 1
Boosting algorithm

Input: the training sample L of size n, the estimation method A and the number of iterations K.
Initialization: set $p_1 = D$, the uniform distribution on $\{1, \dots, n\}$.
Loop: for k = 1 to K do
• randomly draw from L with replacement, according to $p_k$, a sample $L_k$ of size n;
• using A, construct an estimator $\hat{f}_k$ of f from $L_k$;
• set, from the original training sample L, for $i = 1, \dots, n$:
  $l_k(i) = \big(Y_i - \hat{f}_k(X_i)\big)^2$ and $\varepsilon^{(boost)}_{p_k} = \sum_{i=1}^{n} p_k(i)\, l_k(i)$,
  $\beta_k = \dfrac{\varepsilon^{(boost)}_{p_k}}{\max_{1 \le i \le n} l_k(i) - \varepsilon^{(boost)}_{p_k}}$ and $d_k(i) = \dfrac{l_k(i)}{\max_{1 \le i \le n} l_k(i)}$,
  if $\varepsilon^{(boost)}_{p_k} < 0.5\, \max_{1 \le i \le n} l_k(i)$, set $w_{k+1}(i) = \beta_k^{1 - d_k(i)} p_k(i)$; else set $w_{k+1} = p_1$,
  $p_{k+1}(i) = \dfrac{w_{k+1}(i)}{\sum_{j=1}^{n} w_{k+1}(j)}$.
Output: $\hat{f}$, the median of $(\hat{f}_k)_{1 \le k \le K}$ weighted by $\big(\log(1/\beta_k)\big)_{1 \le k \le K}$.

The second idea concerns the aggregation part of the algorithm. It is important to notice that the predictor f̂_k is constructed under the distribution p_k, so the combination of the (f̂_k) has to take the resampling distributions (p_k) into account: each predictor f̂_k is designed to fit the observations of L_k obtained from p_k, and thus if f̂_k performs well on L_k but not globally on L, its weight will be large anyway. Let us mention that if β_k ≥ 1, boosting is regenerated by resetting p_k to p_1, avoiding possible degeneration.
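The sketch below implements the loop of Table 1, with a scikit-learn tree standing in for A. Two points are our own reading of the algorithm rather than statements of the paper: the degenerate case β_k ≥ 1 resets the distribution and discards that predictor, and the weighted median is computed per test point from the cumulative weights.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, K=200, rng=None):
    """Drucker-style boosting of regression trees, following Table 1."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(y)
    p = np.full(n, 1.0 / n)                          # p_1: uniform distribution on L
    trees, betas = [], []
    for _ in range(K):
        idx = rng.choice(n, size=n, replace=True, p=p)   # draw L_k according to p_k
        f_k = DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx])
        l = (y - f_k.predict(X)) ** 2                # losses l_k(i) on the original L
        l_max, eps = l.max(), np.sum(p * l)          # max loss and epsilon_{p_k}
        if eps >= 0.5 * l_max:                       # beta_k >= 1: reset p and drop f_k
            p = np.full(n, 1.0 / n)
            continue
        beta = eps / (l_max - eps)
        d = l / l_max
        w = p * beta ** (1.0 - d)                    # w_{k+1}(i)
        p = w / w.sum()                              # p_{k+1}
        trees.append(f_k)
        betas.append(beta)
    return trees, np.log(1.0 / np.array(betas))      # predictors and weights log(1/beta_k)

def boost_predict(trees, weights, X):
    """Weighted median of the tree predictions, with weights log(1/beta_k)."""
    preds = np.array([t.predict(X) for t in trees])  # shape (K, m)
    order = np.argsort(preds, axis=0)
    cum = np.cumsum(weights[order], axis=0)
    pos = np.argmax(cum >= 0.5 * weights.sum(), axis=0)
    sorted_preds = np.take_along_axis(preds, order, axis=0)
    return sorted_preds[pos, np.arange(preds.shape[1])]
```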

4. Data sets and global performance

4.1. Data sets

We consider three simulated data sets, denoted by FR#1, FR#2 and FR#3, and three real data sets, Boston Housing, Paris Pollution and Vitrolles Pollution. They are summarized in Tables 2 and 3, respectively. Data sets FR#1, FR#2, FR#3 and Boston Housing are classical test examples (see Breiman, 1996a; Breiman et al., 1984; Drucker, 1997; Breiman, 2001a), while P-Pollution and V-Pollution are French ozone data.

The simulated data sets have been considered by Friedman (1991). They are relevant for our experiments for two reasons. First, they exhibit different difficulties with respect to CART regression trees. Indeed, model FR#1 is defined using a simple nonlinear function, partially additive and separable, but involves in addition five useless variables, while models FR#2 and FR#3 correspond to highly nonlinear functions with strong interactions (stronger for FR#2). Second, these simulated data sets allow us to consider three different situations (see Section 4.2): FR#1, for which boosting outperforms bagging; FR#2, for which bagging outperforms boosting; and FR#3, which is a borderline example.


Table 2
Simulated data sets

FR#1. Predictors: $X_i \sim U([0,1])$, $i = 1, \dots, 10$; noise: $\varepsilon \sim N(0,1)$.
Model: $f(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + 0 \sum_{i=6}^{10} x_i$.

FR#2. Predictors: $X_1 \sim U([0,100])$, $X_2/2\pi \sim U([20,280])$, $X_3 \sim U([0,1])$, $X_4 \sim U([1,11])$; noise: $\varepsilon \sim N(0,\sigma_\varepsilon)$, SNR = 3.
Model: $f(x) = \sqrt{x_1^2 + \big(x_2 x_3 - 1/(x_2 x_4)\big)^2}$.

FR#3. Predictors: same as FR#2; noise: $\varepsilon \sim N(0,\sigma_\varepsilon)$, SNR = 3.
Model: $f(x) = \tan^{-1}\!\left[\dfrac{x_2 x_3 - 1/(x_2 x_4)}{x_1}\right]$.

Table 3
Real data sets

Boston Housing (Bost. Hous.). Response: median housing price in the tract. Predictors: 13 predictors, fully described in Breiman et al. (1984). Number of observations: 506.

Paris Pollution (P-Pollution). Response: daily maximum ozone concentration. Predictors: 3 predictors, fully described in Chèze et al. (2003). Number of observations: 1200.

Vitrolles Pollution (V-Pollution). Response: daily maximum ozone concentration. Predictors: 41 predictors, fully described in Ghattas (1999). Number of observations: 822.

In addition to data sets FR#2 and FR#3, corresponding to a signal-to-noise ratio (SNR) equal to 3, we consider data sets denoted by FR#2-b and FR#3-b for which the SNR is equal to 9. In the first case the signal explains 75% of the variance, while it explains 90% in the second one (the fraction of explained variance being SNR/(SNR + 1)). These two variants have been considered separately in Breiman (1996a) and Drucker (1997). Thus, we can compare FR#2 with FR#2-b and FR#3 with FR#3-b to get an idea of the impact of changing the SNR from a moderate to a high value.

The first real data set, called Boston Housing, is fully described in Breiman et al. (1984, pp. 217–220) and extensively used in the regression literature, allowing fair comparisons.

Let us sketch the specific problem addressed in this paper for the P-Pollution data, dealing with the analysis and prediction of ozone concentration in the Paris area. Highly polluted days are often hard to predict: usual estimation methods need to be suitably post-processed to improve the performance on these observations. We investigate in Section 7.2 whether, starting from a CART regression tree, boosting performs this improvement automatically. The last data set, called V-Pollution, differs from the previous one since the interesting days are sufficiently numerous and a large number of explanatory variables is considered.


4.2. Global performance

To start with a reference, we evaluate the performance of three estimators: a single tree and those obtained by bagging and boosting. We perform K = 200 iterations for bagging and boosting. The performance is measured by the prediction error R_T(f, f̂) computed using a test sample T. This estimation is computed by Monte Carlo from 10 runs of each algorithm. For the simulated data, 500 training examples and 600 test examples are used. For the Boston Housing data, we randomly take 10% of the data as test examples. For the pollution data sets, we use a test sample containing 10% of the data, stratified with respect to the response to preserve its distribution in the test set, since the performance on polluted days is considered crucial.
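One possible implementation of such a stratified split (our own device, not described in the paper: the continuous response is binned into quantiles so that scikit-learn's train_test_split can stratify on it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split(X, y, test_size=0.1, n_bins=10, seed=0):
    """Hold out test_size of the data, stratified on quantile bins of the response."""
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(y, edges)
    return train_test_split(X, y, test_size=test_size,
                            stratify=strata, random_state=seed)

# X_L, X_T, y_L, y_T = stratified_split(X, y)   # with X, y the pollution data
```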

Remark 1. Usually (see Breiman, 1996a; Drucker, 1997; Ghattas, 1999) the samples L and T (often of fixed sizes) are randomized, which is convenient when the objective is to estimate the true generalization errors. In this paper, since we focus on definitions of instability depending on the training sample, all the predictors are generated from fixed samples L and T, chosen in such a way that the prediction error of a single tree designed using L, evaluated using T, is the median of the errors obtained from numerous randomly generated candidates.

The performance of the three estimators on the data sets is given in Table 4.

Remark 2. Let us mention that the results for a single tree and for bagging are close to those obtained by Breiman (1996a) on the common classical data sets and by Ghattas (1999) on V-Pollution. The prediction errors obtained for boosting are of the same order of magnitude (larger for FR#1 and smaller for FR#2, FR#3 and Bost. Hous.) as those obtained by Drucker (1997), who uses a different pruning step in CART.

First of all, bagging and boosting perform significantly better than a single tree. This is a widely known behaviour. However, boosting does not always perform better than bagging: boosting outperforms bagging only for FR#1, FR#3-b, Boston Housing and V-Pollution.

Table 4
Single tree, bagging and boosting performance for the data sets

Data          Single tree   Bagging   Boosting
FR#1          8.79          5.75      4.46
FR#2          65,981        51,734    57,038
FR#2-b        26,147        18,981    19,507
FR#3          0.067         0.046     0.050
FR#3-b        0.036         0.026     0.024
Bost. Hous.   13.8          11.9      10.6
P-Pollution   556           469       488
V-Pollution   835           415       394


For FR#2 and FR#3 it is clear that boosting performs worse than bagging, while the degradation for P-Pollution is smaller. But when the SNR is increased (see the rows labeled FR#2-b and FR#3-b), the performance of a single tree and of the bagged predictor is dramatically improved and, more interestingly, bagging and boosting become closer. The comparison is even inverted for FR#3-b.

5. Analysis of the boosting predictors sequence

In the classification case, one can observe (see Breiman, 1998) from the boosting predictors sequence the five following facts. (1) Boosting outperforms bagging in all but a few data sets examined in the literature. (2) The sequence (β_k), where β_k is defined via the misclassification rate of f̂_k, oscillates without any pointwise stabilization. (3) Training and prediction errors decrease rapidly. Furthermore, one observes that the training (or resubstitution) error rapidly reaches zero and that the prediction error keeps on decreasing even after the training error is equal to zero. (4) No characteristic behaviour indicates possible overfitting. (5) Breiman notices a "strange signature" which can be interpreted as an expression of the stabilization of the learning method via boosting. More precisely, he plots the misclassification rate of each observation of L along the K iterations, defined by

$$r(i) = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}_{Y_i \ne \hat{f}_k(X_i)}, \qquad i = 1, \dots, n,$$

with respect to the number of times that observation i appears in the different bootstrap samples L_k. The intuitive idea is that the more often observation i is misclassified, the more often it appears in the bootstrap samples. But r seems to stabilize around a constant value reached by the majority of the observations; Breiman calls it "the plateau". That is why boosting is often interpreted in the classification context as a misclassification rate equalizer.

We sketch in this section a similar analysis for boosting in regression, to study whether the main attractive features of AdaBoost are preserved.

5.1. The weights (β_k) and the prediction errors

The behaviour of (β_k) (defined in Table 1) and of the training and prediction errors with respect to k for FR#1 is plotted in Fig. 1.

The sequence (β_k) (see Fig. 1(a)) is very unstable and oscillates around 0.07. Since (β_k) is related to the global quality of f̂_k on L_k, it follows that this quality exhibits the same behaviour all along the boosting process, without any pointwise stabilization. Similarly, f̂_k(x_i) typically oscillates around y_i without any pointwise stabilization. In Fig. 1(b), the training and prediction errors have nearly the same behaviour as in the classification case, despite the fact that the prediction error does not obviously keep on decreasing after the training error stabilizes. Moreover, it is interesting to notice that there is no characteristic behaviour indicating possible overfitting, even after 200 iterations, which is usually considered a large value (the corresponding curves for the other data sets are similar).


Fig. 1. FR#1: on the left (a), the sequence (β_k); on the right (b), from bottom to top, two curves representing the boosting training and prediction errors and a line indicating the prediction error of a single tree.

Fig. 2. For FR#1 (a) and P-Pollution (b), the boosting resampling probabilities for two observations. The line represents the uniform probability.

5.2. The probability distributions (p_k)

Fig. 2 shows the boosting resampling probabilities along the iterations, for two observations from FR#1 (panel (a)) and P-Pollution (panel (b)). In each case the two observations have been chosen in such a way that the associated probabilities are, respectively, almost always above and below the line representing the uniform probability. Again it appears that, in general, no pointwise stabilization occurs.


Fig. 3. FR#1: average error over the 200 iterations vs. the number of appearances in the samples L_k.

5.3. The plateau

To study the plateau in regression, we define the individual average prediction error of Y_i using (f̂_k) along the K iterations by

$$r^{(boost)}(i) = \frac{1}{K} \sum_{k=1}^{K} \big(Y_i - \hat{f}_k(X_i)\big)^2. \qquad (2)$$

Then we plot, for each observation i of L, r^(boost)(i) with respect to the number of times that observation i appears in the different bootstrap samples L_k (see Fig. 3). Boosting behaves in regression as in classification, since the plateau phenomenon occurs (the same behaviour is observed for the other data sets). So boosting can be considered as an error equalizer.

After this quick tour of the behaviour of the boosting predictors sequence, let us focus on the instability issue.
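A sketch of how the quantities of Fig. 3 can be computed, assuming the boosting loop sketched in Section 3 is modified to also record the bootstrap index arrays (collected here in idx_list); the function name is ours.

```python
import numpy as np

def plateau_data(trees, idx_list, X, y):
    """r_boost(i) of Eq. (2) and the number of appearances of each observation
    of L in the bootstrap samples L_k."""
    n = len(y)
    preds = np.array([t.predict(X) for t in trees])      # shape (K, n)
    r_boost = np.mean((y - preds) ** 2, axis=0)
    counts = np.zeros(n, dtype=int)
    for idx in idx_list:                                  # indices drawn at iteration k
        counts += np.bincount(idx, minlength=n)
    return r_boost, counts            # scatter-plot one against the other, as in Fig. 3
```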

6. Bagging, boosting and instability

In the classification case, it is well known (see for example Breiman, 1998) that, in order to make bagging and boosting effective, a key property of the given estimation method A is to be unstable. Intuitively, A is an unstable method if small changes in the training sample L may cause large changes in the resulting estimator f̂; in other words, A is unstable if it is not a robust method. Hence different ways to define the instability or the stability of A appear in the classification literature. On the one hand, Bousquet and Elisseeff (2002) studied the stability of A by replacing one observation in L with another one coming from the same model. On the other hand, Breiman (1996b, 1998) studies a version of the variance of A and connects instability with the good performance of bagging. We follow the second line and try to connect instability with the bagging and boosting sequences of predictors.

6.1. Bagging and instability

We define an instability index depending on a specific choice of the training sample, in order to stay close to the actual problem of model estimation, for which only one realization is available.

6.1.1. Instability index

The key quantities measuring the instability can be derived from the errors, evaluated on L, of the bagging predictors (f̂_k^(bag)):

$$\varepsilon^{(bag)}_k = R_L\big(f, \hat{f}_k^{(bag)}\big). \qquad (3)$$

Indeed, the fluctuations of these errors describe the changes of the estimator caused by changes in the training sample. So the coefficient of variation of ε^(bag) provides an instability index defined as the ratio

$$I_L = \frac{\mathrm{std}\big(\varepsilon^{(bag)}\big)}{\bar{\varepsilon}^{(bag)}}, \qquad (4)$$

where ε^(bag) denotes the vector (ε_k^(bag))_{1≤k≤K}, ε̄^(bag) its average and std(ε^(bag)) its empirical standard deviation.

It follows from this definition of instability that, for a given L, the more unstable A, the larger I_L, and, as soon as A is a local estimation method, I_L is not sensitive to outliers, since bagged predictors are involved. Let us mention that, with respect to model (1), for a given training sample L and some nonparametric estimation method A, I_L can be viewed as an estimate of the coefficient of variation of the random variable ‖f − f̂‖².

6.1.2. Instability for data sets

To relate instability to the performance of bagging, as Breiman does in Breiman (1998) to show that bagging improves only unstable methods, we compute, for each data set, the instability index and the percentage of improvement of a single tree by bagging, defined by 100 (pe_st − pe_bag)/pe_st, where pe_st and pe_bag denote the prediction errors (given in Table 4) for a single tree and for bagging, respectively. CART is bagged on the training sample L and 10 runs of bagging with 200 iterations per run are performed to estimate I_L by the Monte Carlo method. The question is: can we consider that the larger I_L is, the more bagging will improve CART? The results are given in Table 5. The data sets are split into two groups, numbered 1 when boosting performs better than bagging and numbered 2 otherwise.

As expected, the percentage of improvement increases with I_L, except for Boston Housing, which behaves as an outlier: for this last case, we observe the highest value of I_L together with a percentage close to the smallest one.


Table 5
Instability index and CART improvement by bagging

Data          Group   I_L     % Improvement by bagging
FR#1          1       0.075   34.6
FR#2          2       0.052   21.6
FR#2-b        2       0.055   27.4
FR#3          2       0.081   31.3
FR#3-b        1       0.093   27.8
Bost. Hous.   1       0.167   14.1
P-Pollution   2       0.039   12.0
V-Pollution   1       0.074   50.2

This is due to the intrinsic difficulty of this data set, since a very large instability index is combined with a large improvement of bagging by boosting (see Table 4). This point will be addressed in the next section. Let us remark that the data sets of Group 1 seem to have larger instability indices than those of Group 2.

Two obvious remarks illustrate, on the data sets, the good behaviour of the previously defined instability index: for a given regression function, I_L increases with the SNR (see the FR#2 and FR#3 related rows) and I_L decreases with the number of missing variables (see the Pollution data rows). This point is developed in Gey and Poggi (2002) by considering nested models.

6.2. Boosting and instability

6.2.1. Incremental instability index

Let us now define some indices measuring the contribution of boosting to instability. The general idea is that, compared to bagging, the adaptive updating of the probabilities (p_k) should provide a sequence of boosted predictors (f̂_k) much more unstable than the sequence of bagged predictors (f̂_k^(bag)).

Starting from the key sequence of the training errors of the boosting predictors (f̂_k),

$$\varepsilon^{(boost)}_k = R_L(f, \hat{f}_k), \qquad (5)$$

the instability index I_L^(boost) carried out by boosting for the sample L is obtained from I_L by taking ε^(boost) rather than ε^(bag), that is:

$$I_L^{(boost)} = \frac{\mathrm{std}\big(\varepsilon^{(boost)}\big)}{\bar{\varepsilon}^{(boost)}}. \qquad (6)$$

Then the incremental instability index of boosting for the sample L is defined as the ratio of the instability index carried out by boosting and the instability index:

$$\Delta I_L^{(boost)} = \frac{I_L^{(boost)}}{I_L} = \frac{\mathrm{std}\big(\varepsilon^{(boost)}\big)}{\bar{\varepsilon}^{(boost)}} \times \frac{\bar{\varepsilon}^{(bag)}}{\mathrm{std}\big(\varepsilon^{(bag)}\big)}. \qquad (7)$$


Table 6
Instability indices for bagging and boosting

Data          Group   I_L     ΔI_L^(boost)   Ratio bagging/boosting
FR#1          1       0.075   1.87           1.29
FR#2          2       0.052   2.38           0.91
FR#2-b        2       0.055   2.58           0.97
FR#3          2       0.081   1.57           0.92
FR#3-b        1       0.093   1.58           1.08
Bost. Hous.   1       0.167   1.50           1.11
P-Pollution   2       0.039   3.50           0.97
V-Pollution   1       0.074   1.97           1.05

It follows that, for a given L, the larger ΔI_L^(boost), the more boosting contributes to instability with respect to bagging.
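A sketch of the corresponding computation of ΔI_L^(boost), under the same conventions as the instability_index sketch above (argument names are ours):

```python
import numpy as np

def incremental_instability(bag_predictors, boost_predictors, X_L, y_L):
    """Delta I_L^(boost) of Eq. (7): ratio of the coefficient of variation of the
    boosting training errors (Eqs. (5)-(6)) to the bagging-based index I_L."""
    eps_bag = np.array([np.mean((y_L - f.predict(X_L)) ** 2) for f in bag_predictors])
    eps_boost = np.array([np.mean((y_L - f.predict(X_L)) ** 2) for f in boost_predictors])
    I_L = eps_bag.std() / eps_bag.mean()
    I_boost = eps_boost.std() / eps_boost.mean()
    return I_boost / I_L
```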

6.2.2. Incremental instability for data sets

For each data set, the two numerical instability indices and the ratio of the prediction errors of bagging and boosting are given in Table 6.

First of all, let us notice that the contribution of boosting to instability is effective for all data sets, since ΔI_L^(boost) is always greater than 1. Second, the results of Table 6 have to be balanced against the classification of the data sets recalled in the second column of the table. We obtain that, for Group 1, the instability index is large and the incremental instability index is moderate, while for Group 2, except for FR#3, the instability index is small and the incremental instability index is significantly larger than the ones of Group 1, particularly for P-Pollution.

This leads to the conjecture that the effectiveness of boosting is not only related to the global instability of CART, but also to the incremental contribution of boosting to instability: a large (resp. small) instability combined with a moderate (resp. large) contribution of boosting to instability may imply that boosting will perform well (resp. poorly).

The FR#3 data set is a borderline example, since a change of the SNR suffices to invert the bagging/boosting ratio.

The P-Pollution data set illustrates that boosting focuses on hard observations, which are the highly polluted days. Indeed, the instability index is small since it is robust with respect to them, but the incremental instability index is very large since it is sensitive to them. Table 4 shows that boosting performs slightly worse than bagging. A deeper analysis carried out in Section 7.2 exhibits a more interesting behaviour of boosting: it performs well and improves the bagging performance on highly polluted days, even if it slightly degrades the global performance.

6.2.3. More about the training errors

Let us formulate a surprising remark about the averaged errors by considering the following ratio:

$$\rho^{(boost)}_{(bag)} = \frac{\bar{\varepsilon}^{(boost)}}{\bar{\varepsilon}^{(bag)}}.$$


Table 7
Ratio of the average errors of the predictor sequences provided by boosting and bagging, respectively, and incremental instability index, for the data sets

Data          ρ^(boost)_(bag)   ΔI_L^(boost)
FR#1          1.51              1.87
FR#2          1.69              2.38
FR#2-b        1.57              2.58
FR#3          1.55              1.57
FR#3-b        1.49              1.58
Bost. Hous.   1.50              1.50
P-Pollution   1.68              3.50
V-Pollution   1.55              1.97

Intuitively, since bagging is not sensitive to outliers or hard observations, the more boosting concentrates on some observations of the training sample L, the farther the average error of the sequence (f̂_k) is from the average error of the sequence (f̂_k^(bag)). So Table 7 contains ρ^(boost)_(bag) and the incremental instability index of boosting for all the considered data sets.

Surprisingly, ρ^(boost)_(bag) remains within the interval [1.5; 1.7] for the considered data sets. This means that Drucker's boosting seems to design predictors so as to equalize their average error to at least 1.5 times the average error of the bagging predictors sequence, as soon as K is sufficiently large. This remark shows how important the aggregation part of the algorithm is. In addition, it follows that the incremental instability index given by (7) seems to be close to the ratio of the standard deviations of ε^(boost) and ε^(bag), up to a multiplicative constant, roughly speaking.

Let us come back to the plateau evidenced in Section 5.3 by connecting r, defined by (2), and ε, both for bagging and boosting. Indeed, we have that r̄^(boost) = ε̄^(boost). Then, by replacing f̂_k by f̂_k^(bag) in the definition of r^(boost)(i), we define similarly r^(bag)(i), and we have r̄^(bag) = ε̄^(bag). So

$$\rho^{(boost)}_{(bag)} = \frac{\bar{\varepsilon}^{(boost)}}{\bar{\varepsilon}^{(bag)}} = \frac{\bar{r}^{(boost)}}{\bar{r}^{(bag)}}.$$

So ρ^(boost)_(bag) is also equal to the ratio of r̄^(boost) and r̄^(bag), which approximately gives the ordinate of the plateau.

7. Outliers and hard observations

Let us analyse the boosting dynamics in front of hard observations, first by contaminating a population with outliers, and second by studying whether boosting is able to track the non-dominant population and to perform a kind of compromise. We first consider simulated data, to cope with outliers, and then the P-Pollution data, which can be seen as a more intricate case of a mixture of populations.

7.1. Outliers

The main point of this section is to supply evidence that boosting does concentrate on the outliers. We start with the model generating the FR#1 data set and contaminate it incrementally. Since boosting performs well on this problem, its behaviour in front of outliers can be easily analysed. From model FR#1, used to generate the dominant population, we simulate a training sample L and a test sample T of sizes 500 and 600, respectively. Then L and T are altered by adding to each of them α% of observations generated using a shifted version of the former regression function (f_2 = f + 20) to produce the outliers. In addition, we simulate two test samples T_1 and T_2 of size 600, coming from the dominant population and from the outliers, respectively.
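A sketch of one reading of this contamination scheme for the FR#1 model of Table 2 (function names and seed are ours; the outliers are appended at the end of the sample, as in Fig. 4):

```python
import numpy as np

def friedman1(x):
    """Regression function of model FR#1 (Table 2); the last five variables are useless."""
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4])

def make_fr1_contaminated(n, alpha, rng):
    """n observations from FR#1 plus ceil(alpha * n) outliers drawn from f_2 = f + 20."""
    n_out = int(np.ceil(alpha * n))
    X = rng.uniform(size=(n + n_out, 10))
    y = friedman1(X) + rng.normal(size=n + n_out)
    y[n:] += 20.0                       # the last n_out observations are the outliers
    return X, y

rng = np.random.default_rng(0)
X_L, y_L = make_fr1_contaminated(500, 0.25, rng)    # alpha = 25%, as in Fig. 4
```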

Let us summarize the comparison of CART, bagging and boosting with respect to prediction errors evaluated on T, T_1 and T_2, for α = 1%, 2%, 3%, 5%, 10%, 25% (see Gey and Poggi, 2002 for details). The prediction error of bagging is always smaller than that of a single tree: bagging is not sensitive to outliers and generally improves the performance of CART. For α less than 10%, the prediction errors of boosting and bagging are close to each other, but bagging outperforms boosting on the three test sets. This illustrates the sensitivity of boosting to outliers, since boosting has better performance than bagging on the unaltered FR#1 data set. Moreover, when there is a large proportion of outliers (α = 25%), boosting performs even worse than a single tree on the dominant population. Nevertheless, an improvement on T_2 is reached. This balance between the performance on the dominant population and on a second one is the key point highlighted in the next section.

What is more interesting is to focus on the outliers. Fig. 4 shows the number of times each observation is taken in the bootstrap samples L_k and the percentage of outliers taken in L_k at each iteration k.

each observation is taken in the bootstrap sampleLk and the percentage of outliers taken in

Fig. 4. (a) Number of appearances of each observation in the bootstrap samples. (b) Percentage of outliers taken in the bootstrap samples during the loop, for FR#1 altered with α = 25% of outliers. Note: the outliers are the last observations in the training sample.


Fig. 4(a) shows that the outliers (the last observations in the training sample) are taken in almost every bootstrap sample L_k. Fig. 4(b) shows that the outliers do not represent more than 55% of the observations in L_k. Let us mention that the weight (with respect to the resampling distributions) of the outliers varies between 40% and 50%. This shows that boosting not only concentrates on the outliers but mixes observations from the two subpopulations. This behaviour is analogous to the behaviour of boosting in classification, for which the p_{k+1}-weight of the observations misclassified by f̂_k is equal to 50%.

7.2. Hard observations

We now consider the P-Pollution data. A comparison between different approaches shows that a simple nonlinear additive model captures the main features of the complex underlying dynamics, despite the small number of explanatory variables (see Chèze et al., 2003). The main difficulty is to correctly predict alarms, defined by exceedances of a high-level threshold by the maximum ozone concentration, for which only few observations are available. When a classical estimator is used, one can observe an underestimation for these interesting days, despite the fact that the mean of absolute errors is satisfactory since it is close to the measurement error. To improve this estimator, Chèze et al. (2003) define partial estimators for nonlinear additive models, which are recombined using suitably chosen weights, leading to a modified estimator improving tremendously the performance for interesting days. We study whether boosting is able to perform this automatically. Since the legal threshold of maximum concentration of ozone over Paris is 130, population 1 (denoted by P1) is composed of the observations having a response smaller than 130 and population 2 (P2) contains the other ones. From Section 4.2, we know that boosting performs worse than bagging on this data set when performance is evaluated using T, which merges P1 and P2. Let us evaluate the performance separately on each population. The training and prediction errors obtained from 10 runs with 200 iterations are given in Table 8.

As claimed, a single tree performs better on the dominant population P1, both on the training and test sets.

On the training set, bagging improves on a single tree both on P1 and P2. What is more interesting is that boosting improves the performance of bagging again on both populations, and the accuracy on P2 becomes close to the one obtained for P1. This is illustrated in Fig. 5, which gives, along the iterations, the training errors for bagging (panel (a)) and boosting (panel (b)). Let us remark that the final values are already reached for K = 80 iterations.
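A sketch of the per-population evaluation (threshold 130 as in the text; predict stands for any prediction function X → ŷ, e.g. a single tree's predict method, or a wrapper around the bagging_predict or boost_predict sketches above):

```python
import numpy as np

def error_per_population(predict, X, y, threshold=130.0):
    """Mean squared error computed separately on P1 (y < threshold) and P2 (y >= threshold)."""
    p2 = y >= threshold                       # population P2: highly polluted days
    err = (y - predict(X)) ** 2
    return err[~p2].mean(), err[p2].mean()    # (error on P1, error on P2)
```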

Table 8
Performance per population for the P-Pollution data: training and prediction mean squared errors

                        Single tree   Bagging   Boosting
Training error    P1    305           249       207
                  P2    864           640       236
Prediction error  P1    549           477       503
                  P2    632           386       318


Fig. 5. P-Pollution: the training errors for bagging (a) and boosting (b). From bottom to top of each graph, the curves correspond to P1, P1 ∪ P2 and P2.

Fig. 6. Number of times that each observation is taken in the bootstrap samples (a) and percentage of P2 observations taken in the bootstrap samples during the loop (b), for the P-Pollution data.

On the test set, a more surprising situation occurs: on the one hand, bagging has better performance on P2 than on P1, and on the other hand, boosting slightly degrades the performance of bagging on P1 but improves the bagging performance on P2 a lot. The results on the test set must be taken with caution since only a small number of observations is involved.

From another viewpoint, it is clear from Fig. 6 that boosting concentrates on the P2 observations.

8. Concluding remarks

The analysis of Drucker's boosting algorithm, focusing on the predictors sequence, reveals that many meaningful properties of the original AdaBoost algorithm are preserved. The definitions of instability introduced in this paper are also derived from the predictors sequence and confirm the importance of instability in making such resampling schemes effective. The instability indices allow a better understanding of the performance of bagging and boosting on data sets. For example, the effectiveness of boosting is not only related to the global instability of CART, but also to the incremental contribution of boosting to instability. The experiments carried out in this paper confirm the interest of Drucker's algorithm for improving regression trees: the Paris pollution data illustrate the good behaviour of the algorithm in front of a non-standard regression model, performing a kind of trade-off between the two underlying subpopulations. In addition, our experiments suggest that boosting could be used to highlight outliers or hard observations.

Finally, let us mention that, following numerous experiments (see Gey and Poggi, 2002), it seems rather difficult to significantly improve the original algorithm using variants such as aggregation using the mean rather than the median, aggregation using a space-adaptive definition of the weights, or probability updating using a classification-like rule following Zemel and Pitassi (2001).

Acknowledgements

We would like to thank Badih Ghattas for introducing us to boosting during exciting discussions and for useful advice about computational issues. We are also grateful to Gilles Blanchard for helpful discussions. In addition, we thank an anonymous referee and an Associate Editor for remarks and suggestions which helped to improve the presentation of this article.

References

Blanchard, G., Lugosi, G., Vayatis, N., 2003. On the rate of convergence of regularized boosting classifiers. J. Machine Learning Res. 4, 861–894 (special issue on learning theory).
Borra, S., Di Ciacco, A., 2002. Improving nonparametric regression methods by bagging and boosting. Comput. Statist. Data Anal. 38, 407–420.
Bousquet, O., Elisseeff, A., 2002. Stability and generalization. J. Machine Learning Res. 2, 499–526.
Breiman, L., 1996a. Bagging predictors. Machine Learning 24 (2), 123–140.
Breiman, L., 1996b. Heuristics of instability and stabilization in model selection. Ann. Statist. 24 (6), 2350–2383.
Breiman, L., 1998. Arcing classifiers (with discussion). Ann. Statist. 26 (3), 801–849.
Breiman, L., 2001a. Random forests. Machine Learning 45 (1), 5–32.
Breiman, L., 2001b. Using iterated bagging to debias regressions. Machine Learning 45 (3), 261–277.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Chapman & Hall.
Bühlmann, P., Yu, B., 2002. Nonparametric classification—analyzing bagging. Ann. Statist. 30 (4), 927–961.
Chèze, N., Poggi, J.-M., Portier, B., 2003. Partial and recombined estimators for nonlinear additive models. Statist. Inf. Stochastic Process. 6 (2), 155–197.
Drucker, H., 1997. Improving regressors using boosting techniques. In: Kaufmann, M. (Ed.), Proceedings of the 14th International Conference on Machine Learning, pp. 107–115.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 (1), 119–139.
Friedman, J., 1991. Multivariate adaptive regression splines (with discussion). Ann. Statist. 19, 1–141.
Friedman, J., 2001. Greedy function approximation: the gradient boosting machine. Ann. Statist. 29 (5), 1189–1232.
Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 (2), 337–407.
Gey, S., Poggi, J.-M., 2002. Boosting and instability for regression trees. Technical Report 36, Université Paris XI.
Ghattas, B., 1999. Prévision des pics d'ozone par arbres de régression, simples et agrégés par bootstrap. Rev. Statist. Appl. 47 (2), 61–80.
Lugosi, G., Vayatis, N., 2004. On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 (1), 30–55.
Rätsch, G., Warmuth, M., Mika, S., Onoda, T., Lemm, S., Müller, K., 2000. Barrier boosting. In: Proceedings of the 13th Conference on Computer Learning Theory.
Ridgeway, G., Madigan, D., Richardson, T., 1999. Boosting methodology for regression problems. In: Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics.
Schapire, R.E., 1999. A brief introduction to boosting. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence.
Schapire, R.E., Freund, Y., Bartlett, P., Sun Lee, W., 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26 (5), 1651–1686.
Zemel, R.S., Pitassi, T., 2001. A gradient-based boosting algorithm for regression problems. In: Advances in Neural Information Processing Systems, no. 13 in NIPS-13, pp. 696–702.