
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 3, MARCH 2010

An $L_2$-Boosting Algorithm for Estimation of a Regression Function

Adil M. Bagirov, Conny Clausen, and Michael Kohler

Abstract—An $L_2$-boosting algorithm for estimation of a regression function from random design is presented, which consists of fitting repeatedly a function from a fixed nonlinear function space to the residuals of the data by least squares and by defining the estimate as a linear combination of the resulting least squares estimates. Splitting of the sample is used to decide after how many iterations of smoothing of the residuals the algorithm terminates. The rate of convergence of the algorithm is analyzed in the case of an unbounded response variable. The method is used to fit a sum of maxima of minima of linear functions to a given data set, and is compared with other nonparametric regression estimates using simulated data.

Index Terms—$L_2$-boosting, greedy algorithm, rate of convergence, regression, statistical learning.

I. INTRODUCTION

In regression analysis, an $\mathbb{R}^d \times \mathbb{R}$-valued random vector $(X,Y)$ with $\mathbf{E}\{Y^2\} < \infty$ is considered, and the dependency of $Y$ on the value of $X$ is of interest. More precisely, the goal is to find a function $f \colon \mathbb{R}^d \to \mathbb{R}$ such that $f(X)$ is a "good approximation" of $Y$. In the sequel we assume that the main aim of the analysis is minimization of the mean squared prediction error or risk

$$\mathbf{E}\{|f(X) - Y|^2\}. \tag{1}$$

In this case, the optimal function is the so-called regression function $m \colon \mathbb{R}^d \to \mathbb{R}$, $m(x) = \mathbf{E}\{Y \mid X = x\}$, i.e.,

$$\min_{f \colon \mathbb{R}^d \to \mathbb{R}} \mathbf{E}\{|f(X) - Y|^2\} = \mathbf{E}\{|m(X) - Y|^2\}, \tag{2}$$

because for an arbitrary (measurable) function $f \colon \mathbb{R}^d \to \mathbb{R}$ we have

$$\mathbf{E}\{|f(X) - Y|^2\} = \mathbf{E}\{|m(X) - Y|^2\} + \int |f(x) - m(x)|^2 \, \mathbf{P}_X(dx)$$

Manuscript received September 29, 2008; revised October 14, 2009. Current version published March 12, 2010.

A. Bagirov is with the School of Information Technology and Mathematical Sciences, University of Ballarat, Ballarat, Victoria 3353, Australia (e-mail: [email protected]).

C. Clausen is with the Department of Mathematics, Universität des Saarlandes, D-66041 Saarbrücken, Germany (e-mail: [email protected]).

M. Kohler is with the Department of Mathematics, Technische Universität Darmstadt, D-64289 Darmstadt, Germany (e-mail: [email protected]).

Communicated by A. Krzyzak, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

Digital Object Identifier 10.1109/TIT.2009.2039161

(cf., e.g., [14, Sec. 1.1]). In addition, (3) implies that any function $f$ is a good predictor, in the sense that its risk is close to the optimal value, if and only if the so-called $L_2$ error

$$\int |f(x) - m(x)|^2 \, \mathbf{P}_X(dx) \tag{3}$$

is small. This motivates measuring the error caused by using a function $f$ instead of the regression function $m$ by the $L_2$ error (3).
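For completeness, the decomposition above can be verified in one line using only $\mathbf{E}\{Y \mid X\} = m(X)$ (this is the argument given in [14, Sec. 1.1]):

$$
\begin{aligned}
\mathbf{E}\{|f(X)-Y|^2\}
&= \mathbf{E}\{|f(X)-m(X)|^2\} + 2\,\mathbf{E}\{(f(X)-m(X))(m(X)-Y)\} + \mathbf{E}\{|m(X)-Y|^2\}\\
&= \int |f(x)-m(x)|^2\,\mathbf{P}_X(dx) + \mathbf{E}\{|m(X)-Y|^2\},
\end{aligned}
$$

since conditioning on $X$ shows that the cross term vanishes: $\mathbf{E}\{(f(X)-m(X))(m(X)-Y)\mid X\} = (f(X)-m(X))\,(m(X)-\mathbf{E}\{Y\mid X\}) = 0$.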

In applications, usually the distribution of $(X,Y)$ (and hence also the regression function) is unknown. But often it is possible to observe a sample of the underlying distribution. This leads to the regression estimation problem. Here $(X,Y)$, $(X_1,Y_1)$, $(X_2,Y_2), \ldots$ are independent and identically distributed (i.i.d.) random vectors. The set of data

$$\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$$

is given, and the goal is to construct an estimate

$$m_n(\cdot) = m_n(\cdot, \mathcal{D}_n) \colon \mathbb{R}^d \to \mathbb{R}$$

of the regression function such that the $L_2$ error

$$\int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx)$$

is small. For a detailed introduction to nonparametric regression, we refer the reader to the monograph [14].

In this paper, we are mainly interested in results which hold under very weak assumptions on the underlying distribution. In particular, we do not assume that a density of the distribution of $X$ exists or that the conditional distribution of $Y$ given $X$ is a normal distribution. Related results in this respect can be found, e.g., in [7], [15], [16], [17], or [18].

A closely related problem to nonparametric regression is pattern recognition, where $Y$ takes on values only in a finite set (cf., e.g., [8]). One of the main achievements in pattern recognition in the last fifteen years was boosting (cf. [10] and [11]), where the outputs of many "weak" classifiers are combined to produce a new powerful classification rule. Boosting can be considered as a way of fitting an additive expansion in a set of "elementary" basis functions (cf. [13]). This view makes it possible to extend the whole idea to regression by repeatedly fitting functions from some fixed function space to residuals and by using the sum of the fitted functions as the final estimate (cf. [12]). Reference [6] showed that this so-called $L_2$-boosting is able to estimate very high-dimensional linear models well. Reference [5] analyzed the rate of convergence of corresponding greedy algorithms, where iteratively functions from a fixed function space are fitted to the residuals of the previous estimate, and the estimates are defined by a linear combination of these functions. In [5], this algorithm



was used to fit a linear combination of perceptrons to the data, and under the assumption of a bounded first moment of the Fourier transform of the regression function and of boundedness of the response variable, it was shown that these estimates are able to achieve (up to some logarithmic factors) the same dimension-free parametric rate of convergence as [4] showed for least squares neural networks.

In this paper, we modify the general algorithm from [5] by combining it with splitting of the sample in order to determine how often the residuals are smoothed. We analyze the modified general algorithm in the context of an unbounded response variable satisfying a sub-Gaussian condition. We use it to fit a sum of maxima of minima of linear functions to the data. Since this function class contains in particular perceptrons, we get as a corollary the rate of convergence mentioned above, but this time for unbounded response variables, too. We use an algorithm from Bagirov, Clausen, and Kohler [2] to compute our estimate, apply our new method to simulated data, and compare it with other nonparametric regression estimates.

The outline of the paper is as follows. Section II contains the definition and our theoretical result on the general $L_2$-boosting algorithm. In Section III, we apply it to estimate the regression function by a sum of maxima of minima of linear functions. This algorithm is applied to simulated data and compared to other nonparametric regression estimates in Section IV. Finally, Section V contains the proofs.

II. A GENERAL $L_2$-BOOSTING ALGORITHM

Let  be such that , let  (which will later be chosen such that ), and let  be a (nonlinear) class of functions . Set

where and

and define

(4)

Depending on a parameter , we define estimates

as follows. Set

(5)

and

(6)

where

(7)

Here we assume for simplicity that the above minima exist; however, we do not require that they are unique. Next we truncate the estimate at heights . More precisely, we set for

(8)

Finally we use splitting of the sample to select the parameterof the estimate. To do this, we set

(9)

where

(10)
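As a rough illustration of the procedure just described (fit a function from the fixed class to the residuals by least squares, add it to the running estimate, truncate at height $\beta_n$, and pick the number of boosting iterations by splitting the sample), here is a minimal Python sketch. It is not the exact estimate defined by (4)–(10); the names `l2_boost`, `fit_base`, and `fit_linear` are illustrative only.

```python
import numpy as np

def l2_boost(x_tr, y_tr, x_va, y_va, fit_base, k_max, beta_n):
    """Sketch: repeatedly fit a function from a fixed class to the residuals
    by least squares (via `fit_base`), add it to the running estimate,
    truncate at height beta_n, and choose the number of boosting iterations
    on the held-out half of the sample."""
    learners = []
    f_tr = np.zeros(len(y_tr))            # running (untruncated) fit on the training half
    f_va = np.zeros(len(y_va))            # same running fit evaluated on the validation half
    errs = [np.mean(y_va ** 2)]           # k = 0: the zero estimate
    for _ in range(k_max):
        g = fit_base(x_tr, y_tr - f_tr)   # least squares fit to the current residuals
        learners.append(g)
        f_tr += g(x_tr)
        f_va += g(x_va)
        errs.append(np.mean((np.clip(f_va, -beta_n, beta_n) - y_va) ** 2))
    k_star = int(np.argmin(errs))         # sample splitting picks the iteration count

    def m_n(x):                           # final (truncated) estimate
        s = np.zeros(len(x))
        for g in learners[:k_star]:
            s += g(x)
        return np.clip(s, -beta_n, beta_n)

    return m_n, k_star

def fit_linear(x, r):
    """Toy base learner (a stand-in for the nonlinear class used in the paper):
    least squares fit of an affine function to the residuals r."""
    A = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(A, r, rcond=None)
    return lambda z: np.column_stack([z, np.ones(len(z))]) @ coef
```

Whether the residuals are taken with respect to the truncated or the untruncated running sum is one of the details fixed by (4)–(10); the sketch uses the untruncated sum for the residuals and truncates only the reported estimate. In Section III, the base learner would fit a maximum of minima of linear functions; the affine fit above is used only to keep the sketch self-contained.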

In order to be able to formulate our main theoretical result,we need the notion of covering numbers.

Definition 1: Let $\epsilon > 0$ and set $x_1^n = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$. Let $\mathcal{F}$ be a set of functions $f \colon \mathbb{R}^d \to \mathbb{R}$.

An $\epsilon$–$L_1$-cover of $\mathcal{F}$ on $x_1^n$ is a finite set of functions $f_1, \ldots, f_k \colon \mathbb{R}^d \to \mathbb{R}$ with the property

$$\min_{1 \le j \le k} \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - f_j(x_i)| < \epsilon \quad \text{for all } f \in \mathcal{F}. \tag{11}$$

The $\epsilon$–$L_1$-covering number $\mathcal{N}_1(\epsilon, \mathcal{F}, x_1^n)$ of $\mathcal{F}$ on $x_1^n$ is the minimal size of an $\epsilon$–$L_1$-cover of $\mathcal{F}$ on $x_1^n$. In case there exists no finite $\epsilon$–$L_1$-cover of $\mathcal{F}$, the $\epsilon$–$L_1$-covering number of $\mathcal{F}$ on $x_1^n$ is defined by $\mathcal{N}_1(\epsilon, \mathcal{F}, x_1^n) = \infty$.
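As a small illustration of the quantity in Definition 1 (assuming, as in [14], that the cover is taken with respect to the empirical $L_1$ distance), the following Python helpers compute the distance in (11) and check the cover property for finite families; the names `emp_l1_dist` and `is_l1_cover` are illustrative.

```python
import numpy as np

def emp_l1_dist(f, g, xs):
    # empirical L1 distance (1/n) * sum_i |f(x_i) - g(x_i)| on the points xs
    return np.mean(np.abs(f(xs) - g(xs)))

def is_l1_cover(cover, family, xs, eps):
    # checks property (11): every f in `family` is within eps of some g in `cover`
    return all(min(emp_l1_dist(f, g, xs) for g in cover) < eps for f in family)
```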

For a given class of functions , and fixed, we define as the class of functions

with , where  and  are such that the two conditions

(12)

for all , , and

(13)

for all and are satisfied. Our maintheoretical result is the following theorem.


Theorem 1: Let  be a class of functions . Let  be an upper bound on the – -covering number of  on any finite set of  points, i.e., assume

for all 

Define the estimate by (4)–(10) with  for some . Furthermore, assume that the distribution of  satisfies

(14)

for some constant  and that the regression function is bounded in absolute value by some constant. Then

holds for sufficiently large constants , , which do not depend on , , or .

The upper bound on the expected $L_2$ error in Theorem 1 can be interpreted as follows: If we ignore the minimum over , then the first term in the sum is the usual bound for the estimation error of a least squares estimate in the case that a sum of  functions from  is fitted to the data. The second term in the sum measures the approximation error, where besides the usual bound

an additional term occurs which comes from the fact that we use a greedy algorithm to minimize the empirical $L_2$ risk of the estimate. Finally, the minimum in front of the sum of these two terms shows that, by splitting of the sample, our estimate behaves in view of the above error bound (up to some constant factor) as well as if we had chosen the value of  optimally according to the underlying distribution. In this sense our estimate is able to adapt to the underlying distribution.

Remark 1: In principle, it is also possible to choose the parameter  of the estimate by splitting of the sample. But in the case of the simulated data in Section IV it turned out that the estimate always improves for large values of . Therefore, in our simulations we choose a fixed, very large value for .

III. FITTING OF A SUM OF MAXIMA OF MINIMA OF LINEAR FUNCTIONS TO THE DATA

In this section we apply our general algorithm to classes offunctions consisting of maxima of minima of linear functionsas introduced in [1], i.e., we apply it to

(15)

where $a^T x$ denotes the scalar product between $a \in \mathbb{R}^d$ and $x \in \mathbb{R}^d$. This class of functions consists of continuous piecewise linear

functions. For $K \ge 2$, it contains in particular perceptrons of the form

$$f(x) = \sigma(a^T x + b)$$

for a suitably chosen squashing function $\sigma$ (i.e., for a suitably chosen monotone increasing function $\sigma \colon \mathbb{R} \to [0,1]$ satisfying $\sigma(u) \to 0$ as $u \to -\infty$ and $\sigma(u) \to 1$ as $u \to \infty$). This is obvious if we choose for $\sigma$ the so-called ramp squasher

$$\sigma(u) = \max\{0, \min\{1, u\}\}.$$

In the sequel, we will choose $\mathcal{F}_K$ as the function class for the general algorithm of Section II, for some $K \in \mathbb{N}$. Here $K$ is independent of the sample size. In the application in Section IV, we will choose $K$ depending on the dimension $d$ of $X$, and we will use larger values of $K$ in case of larger $d$.
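As an illustration of functions that are maxima of minima of linear functions, and of how a ramp perceptron can be written in this form, here is a small Python sketch; the parameterization by groups of $(a, b)$ pairs is only illustrative.

```python
import numpy as np

def maxmin_eval(x, groups):
    """Evaluate f(x) = max_k min_{j in group k} (a_{k,j}^T x + b_{k,j}).
    `groups` is a list of lists of (a, b) pairs; x has shape (n, d)."""
    group_mins = []
    for group in groups:
        vals = np.stack([x @ a + b for (a, b) in group], axis=0)
        group_mins.append(vals.min(axis=0))
    return np.stack(group_mins, axis=0).max(axis=0)

# A ramp perceptron sigma(a^T x + b) with sigma(u) = max{0, min{1, u}}
# written as a maximum of two minima of affine functions:
d = 2
a, b = np.array([1.0, -2.0]), 0.5
zero = (np.zeros(d), 0.0)                     # the constant function 0
one = (np.zeros(d), 1.0)                      # the constant function 1
ramp_perceptron = [[zero], [(a, b), one]]     # max{ 0, min{a^T x + b, 1} }

x = np.random.randn(5, d)
u = x @ a + b
assert np.allclose(maxmin_eval(x, ramp_perceptron), np.clip(u, 0.0, 1.0))
```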

It is well known that in order to derive nontrivial rate of convergence results, we have to make some smoothness assumptions on the regression function (cf., e.g., [8, Th. 7.2 and Prob. 7.2] and [9, Sec. 3]). In the sequel we will impose such smoothness conditions implicitly on the regression function by imposing conditions on its Fourier transform. More precisely, we will consider functions $f \colon \mathbb{R}^d \to \mathbb{R}$ which satisfy

$$f(x) = \int_{\mathbb{R}^d} e^{i \omega^T x} \hat{F}(\omega) \, d\omega \quad (x \in \mathbb{R}^d) \tag{16}$$

where $\hat{F}$ is the Fourier transform of $f$, that is,

and we assume

$$\int_{\mathbb{R}^d} \|\omega\| \, |\hat{F}(\omega)| \, d\omega \le C \tag{17}$$

for some $C > 0$ (cf. [3]). We denote the class of functions which satisfy (16) and (17) by .

Condition (17) is often used in the analysis of the rate of convergence of neural network regression estimates. It is an extremely strong assumption; in particular, it implies that the


smoothness of the function increases more and more as the dimension $d$ of $X$ grows. By imposing it on the regression function, we are able to derive the following rate of convergence result for our estimate.

Corollary 1: Let  and assume that the distribution of  satisfies (14) for some constant ,  a.s. for some , that the regression function is bounded in absolute value by some constant less than or equal to , and that  for some . Let the estimate be defined by (4)–(10), with  for some , and with . Then we have, for ,

for a sufficiently large constant  that does not depend on  or .

Remark 2: By using standard approximation results for neural networks (e.g., [14]), it is easy to see that the proof of Corollary 1 implies

$$\mathbf{E} \int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx) \to 0 \qquad (n \to \infty)$$

for all distributions of $(X,Y)$ satisfying (14) with $m$ bounded in absolute value. By a careful analysis of the proof of Lemma 2, it should be possible to show the same result even for all distributions of $(X,Y)$ satisfying $\mathbf{E}\{Y^2\} < \infty$.

IV. APPLICATION TO SIMULATED DATA

In this section, we want to compare our new $L_2$-boosting estimate with other nonparametric regression estimates. To do this, we use results from a simulation study conducted in [2]. There, data were generated according to

where  is standard normally distributed and independent of  and , and where  is uniformly distributed on  with , and where . As regression functions, the following 11 functions have been considered:

• the eleven regression functions considered in [2] (four univariate, three bivariate, and four of higher dimension; see Tables I–VI below).

For these 11 different regression functions and each value

data sets of size  have been generated, so altogether  different distributions have been considered, and for each of these distributions the estimates have been compared for two different sample sizes. The maxmin estimate proposed in [2], which uses splitting of the sample and the principle of least squares to fit a maximum of minima of linear functions to a data set, has been compared for  with kernel estimates (with Gaussian kernel) (see, e.g., [14, Ch. 5]), local linear kernel estimates (see, e.g., [14, Sec. 5.4]), smoothing splines (see, e.g., [14, Ch. 20]), neural networks, and regression trees (as implemented in the freely available statistics software R). Since for  not all of these estimates are easily applicable in , for  the maxmin estimate has been compared only with neural networks and regression trees.

In order to compute the $L_2$ errors of the estimates, Monte Carlo integration was used, i.e.,

was approximated by

where the random variables  are i.i.d. with distribution  and independent of , and where . Since this error is a random variable itself, the experiment was repeated 25 times with independent realizations of the sample, and the mean and the standard deviation of the Monte Carlo estimates of the $L_2$ error were reported.
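A minimal Python sketch of this Monte Carlo approximation of the $L_2$ error follows; the number of fresh design points (`n_mc`) and all names are illustrative choices, not values from the study.

```python
import numpy as np

def mc_l2_error(m_hat, m_true, draw_x, n_mc=10_000, rng=None):
    """Monte Carlo approximation of the L2 error
    int |m_hat(x) - m(x)|^2 P_X(dx)  ~  (1/N) sum_k |m_hat(X_k) - m(X_k)|^2,
    with X_1, ..., X_N i.i.d. ~ P_X and independent of the training sample."""
    rng = np.random.default_rng(rng)
    x_new = draw_x(n_mc, rng)              # fresh design points with the distribution of X
    return np.mean((m_hat(x_new) - m_true(x_new)) ** 2)

# In the study described above, the whole experiment (generate a sample,
# compute the estimate, evaluate its Monte Carlo error) is repeated 25 times,
# and the mean and standard deviation of the 25 error values are reported.
```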

In the sequel, we perform the same simulations with our newly proposed $L_2$-boosting estimate. Here we set  for  and  for , , repeat seven boosting steps, and use splitting of the sample with  to choose one of these seven estimates as the final estimate. In the sequel, we present the mean and the standard deviation of the Monte Carlo estimates of the $L_2$ error of our estimates. In order to save space, we do not repeat the error values already published in [2]; instead, we just summarize them by reporting whether the error of the $L_2$-boosting estimate is better than, worse than, or the same as the error of the maxmin estimate (coded by , , and , resp.), and by reporting which position the error of the $L_2$-boosting estimate achieves if we order the mean error values of all estimates (except the maxmin estimate) increasingly (which gives us a number between 1 and 6 in the case of , and a number between 1 and 3 in the case of ).

Tables I and II summarize the results for the four univariate regression functions, Tables III and IV summarize the results for the three bivariate regression functions, and Tables V and VI summarize the results for the four regression functions where .

Considering the results in Tables I–VI, we can first see that the $L_2$ error of our $L_2$-boosting estimate is in 47 cases smaller than, but only in 15 cases bigger than, the error of the original maxmin estimate. Taking into account that the newly proposed estimate requires on average three to four times less computation time, we can say that $L_2$-boosting clearly leads to an improvement of the maxmin estimate.

Second, by looking at Table VI we can see that the $L_2$-boosting estimate is especially suited for high-dimensional


TABLE I
SIMULATION RESULTS AND COMPARISON WITH SIX OTHER NONPARAMETRIC REGRESSION ESTIMATES FOR FOUR UNIVARIATE REGRESSION FUNCTIONS AND SAMPLE SIZE � � ���

TABLE II
SIMULATION RESULTS AND COMPARISON WITH SIX OTHER NONPARAMETRIC REGRESSION ESTIMATES FOR FOUR UNIVARIATE REGRESSION FUNCTIONS AND SAMPLE SIZE � � ����

TABLE III
SIMULATION RESULTS AND COMPARISON WITH THREE OTHER NONPARAMETRIC REGRESSION ESTIMATES FOR THREE BIVARIATE REGRESSION FUNCTIONS AND SAMPLE SIZE � � ���

TABLE IV
SIMULATION RESULTS AND COMPARISON WITH THREE OTHER NONPARAMETRIC REGRESSION ESTIMATES FOR THREE BIVARIATE REGRESSION FUNCTIONS AND SAMPLE SIZE � � ����

data sets and large sample size in comparison with other nonparametric regression estimates.

TABLE V
SIMULATION RESULTS AND COMPARISON WITH THREE OTHER NONPARAMETRIC REGRESSION ESTIMATES FOR REGRESSION FUNCTIONS WHERE � � �� FOR SAMPLE SIZE � � ���

TABLE VI
SIMULATION RESULTS AND COMPARISON WITH THREE OTHER NONPARAMETRIC REGRESSION ESTIMATES FOR REGRESSION FUNCTIONS WHERE � � �� FOR SAMPLE SIZE � � ����

V. PROOFS

A. A Deterministic Lemma

Let be a class of functions and let, and define

recursively by

(18)

and

(19)

where

(20)
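Lemma 1 concerns the greedy recursion (18)–(20). A generic relaxed-greedy recursion of this type, in the spirit of [5] and of the verbal description in Section II, can be sketched as follows, with $h$ the target element and $\|\cdot\|$ the norm of the underlying inner product space; this sketch is not necessarily the exact form of (18)–(20):

$$f_0 \equiv 0, \qquad f_k = f_{k-1} + \beta_k\, g_k, \qquad (\beta_k, g_k) \in \operatorname*{arg\,min}_{\beta \in \mathbb{R},\; g \in \mathcal{F}} \bigl\| h - f_{k-1} - \beta\, g \bigr\|^2 .$$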

Lemma 1: Let  be defined by (18)–(20). Then for any , , and  such that


for all , , and

for all

we have

The proof of the above lemma is a modification of [5, Proof ofTh. 2.4]. For the sake of completeness, we repeat it as follows.

Proof of Lemma 1: In the first step of the proof we show

To do this, let  and set . Because of  we have by definition of the estimate

where we have used

Since and we can conclude


Using we get

which implies the assertion of the first step. In the second step of the proof we show

To do this, let and set . Then

and arguing as above we get

from which we conclude the assertion of the second step. In the third step, we finish the proof. To do this,

we observe that by the results of the previous steps we knowalready that

satisfies

where is defined as .

But from this we get the assertion, since implies

where the last inequality follows from

B. Splitting of the Sample for Unbounded $Y$

The following lemma is an extension of [14, Th. 7.1] to unbounded data. It is about bounding the $L_2$ error of estimates which are defined by splitting of the sample. Let , let  be a finite set of parameters, and assume that for each parameter  an estimate

is given, which depends only on the training data. Then we define

for all (21)

where is chosen such that

(22)
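Under the natural reading of (21) and (22), the parameter is chosen so that its estimate has the smallest empirical $L_2$ risk on the held-out part of the sample. A minimal Python sketch of this selection rule follows; all names are illustrative.

```python
import numpy as np

def select_by_sample_splitting(estimates, x_test, y_test):
    """`estimates` maps each parameter p to a fitted estimate m_p (a callable)
    trained on the first n_t observations only; the choice is made on the
    remaining held-out observations (x_test, y_test)."""
    def empirical_risk(m):
        return np.mean((m(x_test) - y_test) ** 2)
    p_star = min(estimates, key=lambda p: empirical_risk(estimates[p]))
    return estimates[p_star], p_star
```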


Lemma 2: Let  for some constant  and assume that the estimates  are bounded in absolute value by  for . Assume furthermore that the distribution of  satisfies the sub-Gaussian condition (14) for some constant , and that the regression function fulfils  for some , with . Then, for every estimate  defined by (21) and (22) and any ,

holds, with  and a sufficiently large constant .

Proof: We use the following error decomposition

where

and where denotes the truncated version of and

Due to equality (22) we can bound the last term by

for every , and this entails for its conditional expectation

By using we get for

With the Cauchy–Schwarz inequality and

(23)

it follows that


owing to the boundedness of . With  for  we get

and hence is bounded by

which is finite by the assumptions of the lemma. Furthermore, the third term  is bounded by , because

which follows again as above. With the settingit follows for some constants ,

From the Cauchy–Schwarz inequality, we get

where we can bound the second factor on the right-hand side in the above inequality in the same way we have bounded the second factor from , because by assumption  is bounded, and  is clearly also bounded, namely by . Thus, we get, for some constant ,

Next we consider the first term. With Jensen's inequality, it follows that

Hence, we get

and therefore the calculations from imply

for some constant . Altogether we getfor some constant .

With the same arguments, we get also

for sufficiently large . Hence, it suffices to show

to complete this proof. But a bound on  can be derived analogously to the bounding of the corresponding term in [14, Proof of Th. 7.1] by an application of Bernstein's inequality, because  contains only the bounded versions of  and the corresponding bounded regression function. This yields the desired assertion and completes the proof.

C. Proof of Theorem 1

By Lemma 2 applied with and with

we get

For , we now use the following error decomposition:

where


and

Here, again is the truncated version of and is theregression function of .

Both terms  and  can be bounded like their corresponding terms in the proof of Lemma 2, and hence we have

for a constant . Next, we consider . Let  be the event that there exists  such that , and let  be the indicator function of . Then we get

By the Cauchy–Schwarz inequality we get, for

where the last inequality follows analogously to inequality (23).Because holds for all , we get

which is finite by assumption (14). Furthermore,  is bounded by , and therefore the first factor is

bounded by

for some constant , . The second factor is boundedby , because (14) leads to

Since , this further leads to

(24)

With the definition of and defined as in (8), it followsfor

Lemma 1 yields for arbitrary and


which together with (24) implies

The last part of the proof considers . To get bounds on the expectation of , we need results on the covering numbers of . With the notations

and

for some (25)

it is clear that . Furthermore, for an

arbitrary class of real functions on

(26)

holds, because whenever is an - -cover of onthen is an - -cover of on , too.

Together with [14, Proof of Lemma 16.4] this yields

This bound will be used to get a bound on the following probability. We have, for arbitrary ,

Thus, by [14, Th. 11.4], the bound derived above, and

we get for

Using this we get for arbitrary

With

we get

for some sufficiently large constant , which does not depend on , , or . Gathering the above results, the proof is complete.

D. Proof of Corollary 1

In the proof we will use the following bound on the covering number of , shown in [1, Lemma 2].

Lemma 3: Let . Then we have fordefined by (15), that

holds for all . Furthermore, we need the following approximation result for

neural networks, which is proven in [14, Lemma 16.8].

Lemma 4: Let  be a squashing function, i.e.,

assume that  is monotone increasing and satisfies  and . Then for every probability measure  on , every measurable , every  and every  there exists a neural network  in


such that

where  is the closed ball around zero with radius . The coefficients of this neural network may be chosen such that

.

Proof of Corollary 1: Application of Theorem 1 with the

choice together with Lemma 3 yields

for large enough constants , . Choosing

, we can bound the minimum above by

for a sufficiently large constant  that does not depend on , , or . Hence we only need a bound on the infimum over

to conclude this proof. For this purpose we will use Lemma 4. It is quite easy to see that, for the so-called ramp squasher $\sigma$, defined by $\sigma(u) = \max\{0, \min\{1, u\}\}$, functions of the form

are elements of . This results from the fact that for arbitraryand

with and also

with  as well, which ensures that condition (13) holds. Therefore, we can rewrite

by using the sign of the  to choose whether  or , as

In this notation, it is now obvious that

whereas the validity of condition (12) follows from the fact that multiplication of a function from  by a positive factor still yields a function from . If  is large enough, the same is true for , because in this case the boundedness of the weights in Lemma 4 together with the boundedness of the regression function implies that the truncation makes no changes at all.

We have moreover assumed  and for  we have . Thus, with Lemma 4 and the assumptions  and , we can now bound the last term

for a sufficiently large constant  that does not depend on , , or .

ACKNOWLEDGMENT

The authors would like to thank two anonymous referees andthe associate editor for many detailed and helpful comments.

REFERENCES

[1] A. M. Bagirov, C. Clausen, and M. Kohler, "Estimation of a regression function by maxima of minima of linear functions," IEEE Trans. Inf. Theory, vol. 55, pp. 833–845, 2009.

[2] A. M. Bagirov, C. Clausen, and M. Kohler, "An algorithm for the estimation of a regression function by continuous piecewise linear functions," Comput. Optim. Appl., 2008, to be published.

[3] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inf. Theory, vol. 39, pp. 930–944, 1993.

[4] A. R. Barron, "Approximation and estimation bounds for artificial neural networks," Machine Learn., vol. 14, pp. 115–133, 1994.


[5] A. R. Barron, A. Cohen, W. Dahmen, and R. DeVore, "Approximation and learning by greedy algorithms," Ann. Stat., vol. 36, pp. 64–94, 2008.

[6] P. Bühlmann, "Boosting for high-dimensional linear models," Ann. Stat., vol. 34, pp. 559–583, 2006.

[7] L. Devroye, "On the almost everywhere convergence of nonparametric regression function estimates," Ann. Stat., vol. 9, pp. 1310–1319, 1981.

[8] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.

[9] L. P. Devroye and T. J. Wagner, "Distribution-free consistency results in nonparametric discrimination and regression function estimation," Ann. Stat., vol. 8, pp. 231–239, 1980.

[10] Y. Freund, "Boosting a weak learning algorithm by majority," Inf. Comput., vol. 121, pp. 256–285, 1995.

[11] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, pp. 119–139, 1997.

[12] J. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Stat., vol. 29, pp. 1189–1232, 2001.

[13] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting (with discussion)," Ann. Stat., vol. 28, pp. 337–407, 2000.

[14] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression, ser. Springer Series in Statistics. New York: Springer, 2002.

[15] L. Györfi and H. Walk, "On the strong universal consistency of a recursive regression estimate by Pál Révész," Stat. Prob. Lett., vol. 31, pp. 177–183, 1997.

[16] M. Kohler, "Multivariate orthogonal series estimates for random design regression," J. Stat. Planning Inference, vol. 138, pp. 3217–3237, 2008.

[17] A. Krzyzak, T. Linder, and G. Lugosi, "Nonparametric estimation and classification using radial basis function nets and empirical risk minimization," IEEE Trans. Neural Netw., vol. 7, pp. 475–487, 1996.

[18] H. Walk, "Strong universal pointwise consistency of recursive regression estimates," Ann. Inst. Stat. Math., vol. 53, pp. 691–707, 2001.

Adil M. Bagirov was born on January 7, 1960 in Bilesuvar, Azerbaijan. He received the Master's degree in applied mathematics from Baku State University, Azerbaijan, in 1983, and the Ph.D. degrees in mathematical cybernetics from the Institute of Cybernetics of the Azerbaijan National Academy of Sciences in 1989 and in optimization from the University of Ballarat, Australia, in 2001.

From 2001 to 2005, he was a Research Fellow at the University of Ballarat. Since 2006, he has been an Australian Research Council Research Fellow at the University of Ballarat. His main research interests are in the area of nonsmooth and global optimization and their applications in data mining and regression analysis.

Conny Clausen was born on June 8, 1980 in Flensburg, Germany. She received the degree in mathematics from Saarland University in 2005 and the Ph.D. degree in mathematics from Saarland University in 2008.

Since 2008, she has been working as an IT-Consultant at Beck et al. projectsGmbH, Munich, Germany.

Michael Kohler was born on July 17, 1969 in Esslingen, Germany. He receiveddegrees in computer science and mathematics from the University of Stuttgart,Germany, in 1995 and the Ph.D. degree in mathematics from the University ofStuttgart in 1997.

In 1998, he worked as a Visiting Scientist at Stanford University, Stanford, CA. From 2005 to 2007, he was Professor of Applied Mathematics at the University of Saarbrücken, and since 2007, he has been Professor of Mathematical Statistics at the Technische Universität Darmstadt. He coauthored with L. Györfi, A. Krzyzak, and H. Walk the book A Distribution-Free Theory of Nonparametric Regression (New York: Springer, 2002). His main research interests are in the area of nonparametric statistics, especially curve estimation and applications in mathematical finance.