
Overfitting of Boosting and Regularized Boosting Algorithms

Takashi Onoda

Communication and Information Research Laboratory, Central Research Institute of Electric Power Industry, Komae, 201-8511 Japan

SUMMARY

The impressive generalization capacity of AdaBoost has been explained using the concept of a margin introduced in the context of support vector machines. However, this ability to generalize is limited to cases where the data does not include misclassification errors or significant amounts of noise. In addition, the research of Schapire and colleagues has served to provide theoretical support for these results from the perspective of improving margins. In this paper we propose a set of new algorithms, AdaBoostReg, ν-Arc, and ν-Boost, that attempt to avoid the overfitting that can occur with AdaBoost by introducing a normalization term into the objective function minimized by AdaBoost. © 2007 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 90(9): 69–78, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.20344

Key words: AdaBoost; overfitting; normalization; margin; support vector machines.

1. Introduction

In recent years ensemble learning methods such as AdaBoost, Arcing, and Bagging [1–3] have been in the spotlight as methods for solving classification problems. In particular, Boosting methods related to AdaBoost have been shown to produce excellent results on optical character recognition benchmark data sets and on various practical problems such as state identification for domestic electrical appliances, and have received the attention of many researchers. The AdaBoost algorithm proposed by Schapire and colleagues [1] is shown in Fig. 1. In Fig. 1, I is a loss function that evaluates to +1 if its argument is true and to –1 if it is false.

Broadly speaking, the algorithm shown in Fig. 1 proceeds by focusing training on those samples that are difficult to classify, lowering the importance of those training samples that were correctly classified by the most recent learning hypothesis. If the base learner used by AdaBoost is always able to find a learning hypothesis whose training error, εt in Fig. 1, is less than 1/2 for arbitrarily weighted training samples, then AdaBoost will be able to generate a final learning hypothesis with a training error of zero in a finite number of iterations of generating learning hypotheses.

Electronics and Communications in Japan, Part 3, Vol. 90, No. 9, 2007. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J85-D-II, No. 5, May 2002, pp. 776–784.

Fig. 1. The AdaBoost algorithm. [The pseudo-code, including its display equations (1)–(3), is not reproduced in this transcription; Eq. (1) defines the weighted training error εt.]
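For concreteness, the following is a minimal Python sketch of the procedure described above. Because the display equations (1)–(3) of Fig. 1 are not reproduced here, the weighted error, hypothesis weight, and reweighting rule below are the standard AdaBoost choices, and the decision-stump base learner is only an illustrative stand-in.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=200):
    """Minimal sketch of discrete AdaBoost in the spirit of Fig. 1 (labels y in {-1, +1})."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # uniform initial sample weights
    hypotheses, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)   # any base learner accepting sample weights
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w * (pred != y))             # weighted training error (cf. Eq. (1))
        if eps >= 0.5:                            # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # hypothesis weight (cf. Eq. (2))
        w *= np.exp(-alpha * y * pred)            # emphasize misclassified samples (cf. Eq. (3))
        w /= w.sum()                              # renormalize to a distribution
        hypotheses.append(h)
        alphas.append(alpha)

    def final_hypothesis(X_new):
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)

    return final_hypothesis
```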

Of course, this is guaranteed to hold only on the training sample; it does not provide a guarantee that the generalization error as measured on an unseen sample will also be small. Since AdaBoost generates a more complex final learning hypothesis that is drawn from a larger space, it has been suggested that it should lead to overfitting and a final learning hypothesis that results in a larger generalization error than a single learning hypothesis that does not employ ensemble learning methods. However, there are many experimental results showing that, using a variety of different well-known learning algorithms as the base learners and in a variety of different application domains, AdaBoost is able to obtain a small generalization error [4, 5]. Against this background, however, there have recently been experimental results showing that AdaBoost can overfit when applied to noisy training samples [6–8]; there have also been attempts to explain this overfitting from the perspective of improving margins [7, 9].

In this paper we propose the new algorithms AdaBoostReg, ν-Arc, and ν-Boost, which attempt to avoid the overfitting that can occur with AdaBoost by introducing a normalization term into the objective function minimized by AdaBoost, based on the theory introduced in Refs. 7 and 9.

2. Overfitting of Boosting

In the case of binary classification, given input–output pairs zi = (yi, xi), where yi = ±1 (i = 1, . . . , n), and learning hypotheses ht (t = 1, . . . , T) with weights a = [α1, . . . , αT], the margin ρ(zi, a) is defined as in Eq. (4). Here T is the total number of iterations for which learning hypotheses are generated.

In addition, the smallest margin is defined as in Eq. (5).

The training procedure in AdaBoost consists of a process of minimizing the function Gab(a), expressed in terms of the margins ρ(zi, a), that is given in Eq. (6) [7].
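Equations (4)–(6) are not reproduced in this transcription. In the notation above they presumably take the standard forms used in Refs. 7 and 9 (the constant in the exponent depends on the convention adopted for αt):

\[
\rho(z_i, a) = \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} \alpha_t}, \qquad
\rho(a) = \min_{i=1,\dots,n} \rho(z_i, a), \qquad
G_{ab}(a) = \sum_{i=1}^{n} \exp\!\bigl(-\rho(z_i, a)\,\|a\|_1\bigr).
\]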

The following proposition was introduced as Proposition 6 in Freund and Schapire [10]. This proposition has the following corollary, which is of significant interest [9].

[Proposition 1] Letting ε1, ⋅ ⋅ ⋅ , εT be the weighted training errors for the learning hypotheses h1, . . . , hT generated by AdaBoost, the inequality of Eq. (7) holds for θ ∈ [−1, 1]. See Ref. 9 or 10 for the proof of this proposition.

Here the function I evaluates to +1 if its argument is true and to –1 if it is false, and f is the final hypothesis.

We can derive the following corollary from this proposition.

[Corollary 1] A lower bound on the minimum margin of AdaBoost is given asymptotically by Eq. (8), which depends on ε. See Ref. 9 for a proof.

Here ε = maxt εt. When the maximum training error ε is less than 1/2, the RHS of Eq. (8) in Corollary 1 will be positive and AdaBoost will classify all training samples correctly. We call the situation where the value of this minimum margin becomes positive a hard margin [11]. If the training data contains input noise or classification noise, the formation of a hard margin signifies overfitting and will lead to a degradation in generalization capacity [7, 9, 11].
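For reference, the bound of Schapire and colleagues [1] on the empirical margin distribution, which is presumably the content of Eq. (7) (with I read here as a 0/1 indicator), and one form of the asymptotic lower bound of Eq. (8) that it yields when the product on the right-hand side tends to zero, can be written as

\[
\frac{1}{n}\sum_{i=1}^{n} I\bigl(\rho(z_i,a) \le \theta\bigr) \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t^{\,1-\theta}\,(1-\varepsilon_t)^{\,1+\theta}},
\qquad
\rho(a) \;\ge\; \frac{\ln\bigl(1/(4\varepsilon(1-\varepsilon))\bigr)}{\ln\bigl((1-\varepsilon)/\varepsilon\bigr)}.
\]

The second expression is positive exactly when ε < 1/2, in agreement with the discussion above.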

3. Boosting and Linear Programming Problems

In this section we describe the relationship between both AdaBoost and Arc-GV (an improved version of AdaBoost) and linear programming problems from the perspective of improving margins. Below we define the margin of AdaBoost after t (< T) iterations of the algorithm and the minimum margin ρ(at) as follows:

[Display equations (4)–(10) appear here in the original: (4) the margin ρ(zi, a); (5) the minimum margin ρ(a); (6) the loss function Gab(a) minimized by AdaBoost; (7) the inequality of Proposition 1; (8) the asymptotic lower bound of Corollary 1; (9) the margin ρ(zi, at) after t iterations; (10) the corresponding minimum margin ρ(at).]


Here at = [α1, . . . , αt]. The objective function minimized by AdaBoost, Gab, is shown in Eq. (6); the weight update for the training samples, wt+1(zi), is expressed by Eq. (11) using the margin notation ρ(zi, a).

In addition, the weights αt for each learning hypothesis are chosen as in Eq. (12).

Here εt is given by Eq. (1). In addition, the objective function GArc minimized by Arc-GV can be expressed as in Eq. (13) using the margin notation ρ(zi, a), while the weight update for the training samples is the same as for AdaBoost, given in Eq. (11).

In addition, the weights αt for the learning hypotheses of Arc-GV are expressed as in Eq. (14).

Both AdaBoost and Arc-GV attempt to minimize the loss functions expressed in Eqs. (6) and (13) by gradually maximizing the minimum margin at each iteration. From this it follows that we can think of AdaBoost as asymptotically computing a solution to the linear programming problem for the complete set of learning hypotheses given by Eq. (15), and of the Arc-GV algorithm as asymptotically computing a solution to the linear programming problem of Eq. (16).
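The two linear programs are not reproduced in this transcription. Judging from the remarks below (Arc-GV constrains ||a||1 = 1 while AdaBoost does not, and Eq. (15) carries a constraint ρ(a) > 0), Eq. (16) is presumably the standard margin-maximization linear program

\[
\max_{a,\;\rho}\ \rho \quad \text{s.t.}\quad y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \;\ge\; \rho \quad (i = 1, \dots, n), \qquad \alpha_t \ge 0, \qquad \|a\|_1 = 1,
\]

and Eq. (15) the analogous problem for AdaBoost without the normalization of ||a||1.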

Let ρlp be the solution to the linear programming problem shown in Eq. (16).
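For a finite hypothesis class such as the one used in the experiment below, ρlp can be computed with an off-the-shelf LP solver. A minimal sketch, assuming the formulation of Eq. (16) sketched above; the matrix U and the function name are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(U):
    """Solve max rho s.t. sum_t alpha_t * U[i, t] >= rho, alpha_t >= 0, sum_t alpha_t = 1,
    where U[i, t] = y_i * h_t(x_i). Returns (alpha, rho_lp)."""
    n, T = U.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                            # maximize rho == minimize -rho
    A_ub = np.hstack([-U, np.ones((n, 1))])                 # rho - sum_t alpha_t U[i, t] <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])   # sum_t alpha_t = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * T + [(None, None)]             # alpha_t >= 0, rho unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:T], res.x[-1]
```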

We consider an experimental setup with a data set consisting of 20 suitable learning hypotheses for a training set of 50 patterns.* With this finite class of learning hypotheses we can easily solve the linear programming problem to maximize the margin; once a suitable number of iterations have elapsed, we expect both Arc-GV and AdaBoost to have reached the region where they show their asymptotic characteristics. First, in order to obtain ρlp, we generate the suitable hypotheses and training data. In these experiments we considered both the case where ρlp is positive and the case where it is negative. We then ran Arc-GV and AdaBoost on the training data we generated and recorded ρ(at) after 100 and after 1000 iterations. Figure 2 shows the relationship between the values of ρlp and the values of ρ(at) for the learning hypotheses generated after 100 and 1000 iterations. The filled circles in Fig. 2 represent the learning hypotheses generated after 100 iterations, while the crosses represent those generated after 1000 iterations. From these figures we can draw the following conclusions regarding the asymptotic solutions of Arc-GV and AdaBoost.

(1) When the optimal solution to the linear programming problem of Eq. (16) is ρlp < 0, it is difficult for AdaBoost to approach the solution ρlp.

(2) Regarding the convergence of the approximate solutions of AdaBoost and Arc-GV toward the solutions of the linear programming problems: when the solution satisfies ρlp > 0, then after 1000 iterations of generating learning hypotheses the solution computed by AdaBoost converges more quickly to ρlp than that computed by Arc-GV.

The occurrence of the phenomenon described in (1) above is clear from Eqs. (15) and (8). When the solution ρlp to Eq. (16) satisfies ρlp < 0, the constraint ρ(a) > 0 of the linear programming problem in Eq. (15) cannot be satisfied. In other words, ||a||1 will end up converging to a certain value. This corresponds to the fact that αt becomes 0, in other words εt = 1/2, so that εt is no longer less than 1/2 as demanded by AdaBoost. In addition, the phenomenon described in (2) above can be anticipated from the fact that when ρlp > 0, for samples that are difficult to classify the weighting assigned by Arc-GV is constrained such that ||a||1 = 1, while the ||a||1 computed by AdaBoost has no such constraint and can therefore take on a larger value more quickly and update the weights of these difficult samples to be larger.

[Display equations (11)–(16) appear here in the original: (11) the weight update wt+1(zi); (12) the hypothesis weight αt of AdaBoost; (13) the loss function GArc minimized by Arc-GV; (14) the hypothesis weight αt of Arc-GV; (15) the linear programming problem associated with AdaBoost; (16) the linear programming problem associated with Arc-GV.]

*This data can be obtained from http://ida.first.gmd.de/~raetsch/data.

From the above discussion we see that, in addition to the difference in the convergence of the asymptotic solutions computed by AdaBoost and Arc-GV when the condition ρlp > 0 is satisfied, there is a tendency for the solutions of both algorithms to converge asymptotically to the solution ρlp as the number of iterations of generating learning hypotheses grows. In particular, regarding the margin ρ(a) of Arc-GV, Breiman showed theoretically in Ref. 12 that it converges asymptotically to ρlp.

4. Avoiding Overfitting of Boosting by Normalization

In this section we report on a method for avoiding the overfitting that can arise with Boosting by adding a normalization term to the margin.

4.1. AdaBoostReg

With a view to avoiding overfitting and obtaining better generalization capacity, we have introduced [9] an extension of AdaBoost that is based on an analogy with the weight decay method [13]. Below we assume that the maximum training error ε is less than 1/2.

With AdaBoost, as the number of iterations becomes large, the margin on all training samples will become non-negative. This characteristic leads to overtraining and is the reason for the formation of a hard margin. In other words, the inequality of Eq. (17) becomes valid.

Here ρ takes the value obtained at the point when the inequality in Eq. (8) holds with equality.

Here we introduce a coefficient ζi to ensure that the minimum margin ρ satisfies Eq. (18).

This corresponds to replacing the constraint ρ ≥ 0 in the linear programming problem of Eq. (15) with the constraint ρ ≥ −Cζi. In other words, the linear programming problem becomes that of Eq. (19).

In this way we give up on classifying some of the training samples that appear to contain large amounts of noise or to be mislabeled, and instead allow a certain number of training errors. By introducing Eq. (18) we are able to obtain a trade-off between the margin on a training sample and how important we consider the sample to be. If we set the coefficient C to zero in Eq. (18), we recover the original AdaBoost algorithm.

By analogy with the weight decay method, we choose to set ζi as in Eq. (20).

Fig. 2. The relation between the margin ρ(at) of AdaBoost and the margin ρlp (upper figure), and the relation between the margin ρ(at) of Arc-GV and the margin ρlp (lower figure).

[Display equations (17)–(20) appear here in the original: (17) the inequality characterizing the hard margin; (18) the relaxed condition on the minimum margin involving ζi; (19) the resulting linear programming problem; (20) the choice of ζi.]
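Equations (17)–(20) are not reproduced in this transcription. From the description above and the expression for ζit given later in this section, the weight-decay analogy presumably amounts to setting

\[
\zeta_i \;=\; \Bigl(\sum_{r=1}^{t} \alpha_r\, w_r(z_i)\Bigr)^{2},
\]

so that a sample which has repeatedly received large weights is permitted a correspondingly negative margin −Cζi; that is, the constraint ρ ≥ 0 of Eq. (15) is relaxed to ρ ≥ −Cζi, as stated above, and the regularized margin of Eq. (21) is presumably ρ~(zi, a) = ρ(zi, a) + Cζi.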


Here the sum on the RHS represents the cumulative weight of the training sample over all the preceding iterations. For samples that can be classified easily, the introduction of ζi does not affect the manner in which AdaBoost deals with them. However, samples which are hard to classify are assigned a weight that is significantly different from the one that the original AdaBoost would have assigned.

From the relation in Eq. (18) we can define a new margin ρ~(zi, a), which is given in Eq. (21).

If we use Eq. (21) as the margin that we substitute into Eq. (6), we obtain the new loss function of Eq. (22).

Using this loss function, we are able to achieve a trade-off between the weight obtained by a training sample in the previous iterations and its margin. The weight for a training sample at iteration t of the learning hypothesis generation process is computed as the derivative of Eq. (22) with respect to ρ(zi, at−1) [compare Eq. (11)] and is expressed by Eq. (23). Here Zt−1 is the normalization constant of Eq. (24), set to ensure that ∑i=1,…,n wt(zi) = 1, and ζit = (∑r=1,…,t αr wr(zi))². The rule for updating the weights of the training samples on the t-th iteration of generating learning hypotheses is then given by Eq. (25).

Here the computation of the weight αt for the t-th learning hypothesis is slightly troublesome; in particular, it is hard to calculate this weight analytically. However, it is possible to obtain αt by performing a line search on Eq. (22) [13]. There exists an α > 0 such that if αt > α, then the derivative of Eq. (22) is positive; since αt > 0 holds by the conditions for using AdaBoost, we have (∂/∂αt) G~Reg > 0 and the line search on Eq. (22) has a unique solution.
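A minimal sketch of such a line search in Python. Since Eq. (22) is not reproduced here, the loss below uses an assumed soft-margin exponential functional, exp(−||a||1 (ρ(zi, a) + Cζi)); the exact constants of Ref. 9 may differ, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def choose_alpha_by_line_search(margins_prev, h_margins, zeta, C, norm1_prev):
    """Pick the hypothesis weight alpha_t by a 1-D search over a regularized
    exponential loss in the style of Eq. (22).

    margins_prev[i] = y_i * sum_{r<t} alpha_r h_r(x_i)   (unnormalized)
    h_margins[i]    = y_i * h_t(x_i), in {-1, +1}
    zeta[i]         = regularization term for sample i
    """
    def loss(alpha_t):
        norm1 = norm1_prev + alpha_t
        soft_margin = (margins_prev + alpha_t * h_margins) / norm1 + C * zeta
        return np.sum(np.exp(-norm1 * soft_margin))

    res = minimize_scalar(loss, bounds=(1e-12, 50.0), method="bounded")
    return res.x
```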

To evaluate the performance of the AdaBoostReg algorithm, we applied radial basis function (RBF) networks, which do not make use of ensemble methods, AdaBoost and AdaBoostReg using RBF networks as their class of base learners, and support vector machines using RBF kernels to various benchmark data sets. We selected nine data sets from the STATLOG, UCI, and DELVE benchmark data collections: breast cancer, diabetes, german, heart, image, splice, new-thyroid, titanic, and twonorm; these are all binary classification tasks.

For each benchmark data set we selected 100 sets of samples such that the ratio between training samples and test samples was in each case approximately 6:4. We set the parameter C that appears in AdaBoostReg by cross-validation on the training samples. We applied each method to each sample set, training the method on the training set and computing its generalization error on the test set; to evaluate the performance of each method we calculated the average generalization error over the 100 sets. These average error values are shown in Table 1. For AdaBoost and AdaBoostReg we set the number of iterations for which learning hypotheses were generated to 200 in these numerical experiments. Using RBF networks, which have a high expressive capacity, as the base learner in this simulation, we found that significant overfitting did occur with 200 learning iterations [9].
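A minimal sketch of this evaluation protocol in Python; the split generation and the learner wrappers (fit, predict) are illustrative stand-ins for the precomputed benchmark splits and the methods being compared.

```python
import numpy as np

def average_test_error(splits, fit, predict):
    """Average generalization error over a list of precomputed
    (X_train, y_train, X_test, y_test) splits (roughly 6:4 each)."""
    errors = []
    for X_tr, y_tr, X_te, y_te in splits:
        model = fit(X_tr, y_tr)          # e.g. AdaBoostReg with C chosen by CV on the training set
        y_hat = predict(model, X_te)
        errors.append(np.mean(y_hat != y_te))
    return float(np.mean(errors))
```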

From Table 1 we can see that AdaBoostReg shows better generalization performance than either RBF networks or the original AdaBoost algorithm, and we confirmed that it achieves approximately the same level of performance as support vector machines.

[Display equations (21)–(25) appear here in the original: (21) the regularized margin ρ~(zi, a); (22) the regularized loss function G~Reg; (23) the training sample weight wt(zi); (24) the normalization constant Zt−1; (25) the weight update rule.]

Table 1. Results of the RBF networks, AdaBoost, AdaBoostReg, and support vector machines for the benchmark data sets. [The table body is not reproduced in this transcription.]

4.2. ν-algorithm

The meaning of the parameter C that appears in the AdaBoostReg algorithm reported in the previous section is not intuitively clear. Therefore, based on the concept of the ν-support vector machine of Ref. 14, we introduce a parameter 0 ≤ ν ≤ 1 that controls the equilibrium between the complexity of the model and the relaxation variables ξi in the linear programming problems expressed in Eqs. (15) and (16). This parameter, which is used to prevent a hard margin from forming, corresponds to the ratio of samples in the training set that should not be learned, for example because they are misclassified; in this respect the meaning of the parameter is intuitively understandable. Therefore, it may be possible to determine the value of ν easily and thereby contribute to a reduction in computational costs.

By introducing the parameter ν, Eqs. (15) and (16) can be expressed as Eqs. (26) and (27), respectively.
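The displays are not reproduced here. By analogy with the ν-support vector machine of Ref. 14, the ν-parameterized counterpart of Eq. (16) presumably takes a soft-margin form along the lines of

\[
\max_{a,\;\rho,\;\xi}\ \rho - \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i\sum_{t=1}^{T}\alpha_t h_t(x_i) \;\ge\; \rho - \xi_i,\qquad \xi_i \ge 0,\qquad \alpha_t \ge 0,\qquad \|a\|_1 = 1,
\]

with the counterpart of Eq. (15) again dropping the normalization constraint on ||a||1.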

The optimization problems of Eqs. (26) and (27) are nonlinear min-max problems; they can be used to find the asymptotic solutions of Arc-GV and AdaBoost.

4.2.1. ν-Arc

ρ(zi, a) in Eq. (13) can be replaced by ρ~ν(zi, a) given below.

Here ξi is given by ξi = (ρν − ρ(zi, a))+, where (⋅)+ = max(⋅, 0); ρν is expressed as shown below.

In addition, it is possible to obtain a new algorithm by replacing ρ(z, at) and ρ(at) from Eqs. (11) and (14), respectively, with ρ~ν(z, at) and ρ~ν(at) given below. This algorithm is shown in Fig. 3; we call it ν-Arc.

Here ρν(at) is expressed as shown below.

[Display equations (26)–(31), (33), and (34) appear here in the original: (26) and (27) are the ν-parameterized counterparts of the linear programs (15) and (16), and (28)–(31) define the soft-margin quantities ρ~ν and ρν used by ν-Arc and ν-Boost; Eq. (33) belongs to the pseudo-code of Fig. 3.]

Fig. 3. Pseudo-code for the ν-Arc algorithm.


Here ξit is given by ξit = (ρν(at) − ρ(zi, at))+.

Let us consider the case where ν = 0. Rewriting Eq. (29) in an equivalent form and setting ν = 0, all of the ξ in Eq. (33) in Fig. 3 become zero, and Eq. (33) is then equivalent to Eq. (14).

Similarly to Arc-GV, given the required number of learning hypotheses to generate, T, the ν-Arc algorithm shown in Fig. 3 adaptively attaches weights to the training samples on which it made a classification error and finally, after having generated the required T learning hypotheses, outputs a model that is a combination of the T learning hypotheses.
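A minimal sketch of the soft-margin computation described above, under the assumption (not stated explicitly in the surviving text) that ρν is taken as the empirical ν-quantile of the current margins; the function and variable names are illustrative.

```python
import numpy as np

def soft_margins(margins, nu):
    """Compute soft margins for nu-Arc style updates.

    margins: array of current margins rho(z_i, a).
    Assumes rho_nu is the empirical nu-quantile of the margins and that the
    soft margin is rho~_nu(z_i, a) = rho(z_i, a) + xi_i with
    xi_i = max(rho_nu - rho(z_i, a), 0), as described in the text."""
    rho_nu = np.quantile(margins, nu)
    xi = np.maximum(rho_nu - margins, 0.0)
    return margins + xi, rho_nu, xi
```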

4.2.2. ν-Boost

By replacing ρ(zi, a) of Eq. (6) with ρ~ν(zi, a) of Eq. (30), and ρ(zi, at) of Eqs. (11) and (12) with ρ~ν(zi, at) of Eq. (31), we can obtain a new algorithm. This algorithm is shown in Fig. 4; we call it ν-Boost.

As with the ν-Arc algorithm, we can consider the case of ν = 0 for ν-Boost as well. Setting ν = 0 in the corresponding expressions, all of the ξ in Eq. (38) become zero, and Eq. (38) is then equivalent to Eq. (6).

Similarly to AdaBoost, given the required number of learning hypotheses to generate, T, the ν-Boost algorithm shown in Fig. 4 adaptively attaches weights to the training samples on which it made a classification error and finally, after having generated the required T learning hypotheses, outputs a model that is a combination of the T learning hypotheses.

[Display equations (32) and (35)–(42) appear here in the original; they include the equivalent form of Eq. (29), the intermediate relations used in the ν = 0 arguments above, and the displays belonging to the ν-Boost description and the pseudo-code of Fig. 4.]

Fig. 4. Pseudo-code for the ν-Boost algorithm.


4.2.3. Numerical simulations

Here we report on numerical experiments that exhibit the general characteristics of ν-Arc and ν-Boost. In these numerical experiments we used radial basis function (RBF) networks as the base learners, and used samples from a binary classification task generated by aggregating several two-dimensional normal distributions.* These samples consisted of an overlapping dense region, with noise with a standard deviation of 0.16 added to the input data xi, i = 1, . . . , n. In addition, in order to confirm that we are able to avoid overfitting even when using RBF networks, a highly expressive class of base learning hypotheses, we again set T, the number of iterations of learning hypotheses to generate, to 200. The numerical results of these experiments are shown in Figs. 5 and 6.

• In the upper figures of both Figs. 5 and 6 the unbroken line shows the fraction of important patterns which received a weight of ∑t=1,…,T wt(zi) > 1/2n during the training process. From the fact that the number of important samples grows approximately linearly with respect to the value of ν, we can conclude that both ν-Arc and ν-Boost are asymptotically able to determine the νn samples that can be effectively used during the training process. In other words, the ratio of samples that have ρ(zi, a) ~ ρ is controlled by ν.*

• The broken line in the upper figures of Figs. 5 and 6 shows the fraction of samples for which the margin ρ(zi, a) < ρ. From the fact that the broken line grows approximately linearly with respect to the value of ν, we can conclude that both ν-Arc and ν-Boost are able to accurately determine the fraction of training samples for which the margin ρ(zi, a) < ρ.†

• From the lower figures of Figs. 5 and 6 we can confirm that for both Arc-GV and AdaBoost, if the parameter ν is set appropriately, we can create a model with a small average test error.

Both ν-Arc and ν-Boost correspond to bootstrap sampling when ν = 1; the weight for all training samples is set the same, at wt(zi) = 1/n. If the weighting coefficient for the learning hypotheses is αt ~ 1/T, then they correspond to Bagging (see Figs. 5 and 6).

Fig. 5. The relationship between the parameter ν and the number of samples whose margin is ρ(zi, a) ~ ρ or ρ(zi, a) < ρ for ν-Arc (upper figure), and the relationship between the parameter ν and the generalization error for ν-Arc (lower figure).

Fig. 6. The relationship between the parameter ν and the number of samples whose margin is ρ(zi, a) ~ ρ or ρ(zi, a) < ρ for ν-Boost (upper figure), and the relationship between the parameter ν and the generalization error for ν-Boost (lower figure).

*This data is available from http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.

*Regarding the parameter ν, it can be shown [9] that asymptotically νn will be equal to a lower bound on the number of samples that have margin ρ(zi, a) ~ ρ.
†Regarding the parameter ν, it can be shown [9] that asymptotically νn will be equal to an upper bound on the number of samples that have margin ρ(zi, a) < ρ.

5. Conclusions

In this paper we have proposed two types of normalized Boosting algorithms that introduce a normalization term into the objective function minimized by AdaBoost as a means of avoiding the overfitting seen in Boosting. The normalized versions of Boosting proposed in this paper consist of AdaBoostReg, which is based on an analogy with the weight decay method, and ν-Arc and ν-Boost, which make use of the convergence of the algorithm to a solution of a nonlinear min-max problem and employ a normalization parameter ν that controls the ratio of samples that enter the margin region.

In future work we hope to conduct theoretical and empirical research comparing these algorithms to the ν-support vector machine introduced in Ref. 14.

REFERENCES

1. Schapire R, Freund Y, Bartlett P, Lee W. Boosting the margin: A new explanation for the effectiveness of voting methods. Ann Statist 1998;26:1651–1686.

2. Breiman L. Bias, variance, and arcing classifiers. Tech Rep 460, Statistics Department, University of California, 1997.

3. Breiman L. Bagging predictors. Machine Learning 1996;24:123–140.

4. Freund Y, Schapire R. Experiments with a new boosting algorithm. Proc 13th International Conference on Machine Learning, Bari, Italy. Morgan Kaufmann; 1996. p 148–156.

5. Schwenk H, Bengio Y. AdaBoosting neural networks. Proc ICANN'97, Vol. 1327 of LNCS, Springer, p 967–972, 1997.

6. Quinlan J. Boosting first-order learning. Proc 7th International Workshop on Algorithmic Learning Theory, Vol. 1160 of LNAI, Springer, p 143–155, 1996.

7. Raetsch G, Onoda T, Mueller K-R. Soft margins for AdaBoost. Machine Learning 2001;42:287–320.

8. Grove A, Schuurmans D. Boosting in the limit: Maximizing the margin of learned ensembles. Proc 15th National Conference on Artificial Intelligence, Madison, WI, p 692–699, 1998.

9. Onoda T, Raetsch G, Mueller K-R. An analysis and improvement of the asymptotic characteristics of AdaBoost in binary classification tasks. J Japan Soc Artif Intell 2000;7:705–719. (in Japanese)

10. Freund Y, Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 1997;55:119–139.

11. Vapnik V. The nature of statistical learning theory. Springer; 1995.

12. Breiman L. Prediction games and arcing algorithms. Tech Rep 504, Statistics Department, University of California, 1997.

13. Bishop C. Neural networks for pattern recognition. Clarendon Press; 1995.

14. Schoelkopf B, Smola A, Williamson R, Bartlett P. New support vector algorithms. Neural Comput 2000;12:1207–1245.

15. Onoda T, Raetsch G, Mueller K-R. The Arcing algorithm with non-intuitive learning parameters. J Japan Soc Artif Intell 2001;16:417–426. (in Japanese)


AUTHOR

Takashi Onoda graduated from the Department of General Studies and Natural Sciences at the International Christian University, Japan, in 1986, completed the master's program in nuclear physics at Tokyo Institute of Technology in 1988, and joined the Central Research Institute of Electric Power Industry. His research interests are in theoretical and mathematical aspects of machine learning. He holds a D.Eng. degree.
