
Regularization Paths

Trevor Hastie, Stanford University

September 2006

Drawing on collaborations with Brad Efron, Mee-Young Park, Saharon Rosset, Rob Tibshirani, Hui Zou and Ji Zhu.


Theme

• Boosting fits a regularization path toward a max-margin classifier. Svmpath does as well.

• In neither case is this endpoint always of interest — somewhere along the path is often better.

• Having efficient algorithms for computing entire paths facilitates this selection.

• A mini-industry has emerged for generating regularization paths covering a broad spectrum of statistical problems.


[Figure: Adaboost Stumps for Classification — test misclassification error (about 0.24 to 0.36) against the number of iterations (0 to 1000), for Adaboost stumps with and without shrinkage 0.1.]


[Figure: Boosting Stumps for Regression — MSE (squared error loss) against the number of trees (1 to 1000, log scale), for GBM stumps with and without shrinkage 0.1.]


Least Squares Boosting

Friedman, Hastie & Tibshirani — see Elements of Statistical Learning (chapter 10).

Supervised learning: response y, predictors x = (x1, x2, . . . , xp).

1. Start with the function F(x) = 0 and residual r = y.

2. Fit a CART regression tree to r, giving f(x).

3. Set F(x) ← F(x) + εf(x), r ← r − εf(x), and repeat steps 2 and 3 many times.
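As a concrete illustration of steps 1–3, here is a minimal R sketch of least squares boosting with depth-one rpart trees as the weak learner. The simulated data, variable names, shrinkage ε = 0.1, and 500 iterations are illustrative choices, not settings from the talk.

    # Minimal R sketch of steps 1-3 with depth-one rpart trees (stumps);
    # data, shrinkage and iteration count are illustrative choices.
    library(rpart)

    set.seed(1)
    n   <- 200
    dat <- data.frame(x1 = runif(n), x2 = runif(n))
    y   <- sin(4 * dat$x1) + dat$x2^2 + rnorm(n, sd = 0.1)

    eps  <- 0.1          # shrinkage factor (epsilon)
    M    <- 500          # number of boosting iterations
    Fhat <- rep(0, n)    # current fit F(x) on the training data
    r    <- y            # current residual

    for (m in 1:M) {
      # step 2: fit a small regression tree (a stump) to the residual
      stump <- rpart(r ~ x1 + x2, data = dat,
                     control = rpart.control(maxdepth = 1, cp = 0, minsplit = 10))
      f <- predict(stump, dat)
      # step 3: take a small step towards the fitted tree, update the residual
      Fhat <- Fhat + eps * f
      r    <- r - eps * f
    }
    mean((y - Fhat)^2)   # training MSE after M steps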


Linear Regression

Here is a version of least squares boosting for multiple linear regression (assume the predictors are standardized):

(Incremental) Forward Stagewise

1. Start with r = y and β1 = β2 = · · · = βp = 0.

2. Find the predictor xj most correlated with r.

3. Update βj ← βj + δj, where δj = ε · sign〈r, xj〉.

4. Set r ← r − δj · xj and repeat steps 2–4 many times.

Taking δj = 〈r, xj〉 gives the usual forward stagewise procedure, which differs from forward stepwise.

This is analogous to least squares boosting, with individual predictors playing the role of trees (see the sketch below).
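A small R sketch of the incremental forward stagewise steps above; the simulated design, step size, and number of iterations are illustrative assumptions.

    # Sketch of incremental forward stagewise for linear regression;
    # simulated standardized design, illustrative step size and iteration count.
    set.seed(1)
    n <- 100; p <- 10
    X <- scale(matrix(rnorm(n * p), n, p))          # standardized predictors
    y <- drop(X[, 1:3] %*% c(3, -2, 1)) + rnorm(n)
    y <- y - mean(y)

    eps  <- 0.01                                    # step size (epsilon)
    beta <- rep(0, p)
    r    <- y
    for (step in 1:5000) {
      cors  <- drop(crossprod(X, r))                # <r, x_j> for each j
      j     <- which.max(abs(cors))                 # step 2: most correlated x_j
      delta <- eps * sign(cors[j])                  # step 3: tiny signed update
      beta[j] <- beta[j] + delta
      r <- r - delta * X[, j]                       # step 4: update the residual
    }
    round(beta, 3)                                  # coefficients after 5000 tiny steps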


Example: Prostate Cancer Data

[Figure: coefficient profiles for the prostate cancer predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). Left panel: Lasso coefficients against t = ∑j |βj|. Right panel: Forward Stagewise coefficients against the iteration number. The two sets of profiles are nearly identical.]


Linear regression via the Lasso (Tibshirani, 1995)

• Assume y and each xj have mean 0, and Var(xj) = 1 for all j.

• Minimize ∑i (yi − ∑j xij βj)² subject to ||β||1 ≤ t.

• Similar to ridge regression, which has constraint ||β||2 ≤ t

• Lasso does variable selection and shrinkage, while ridge only shrinks.

[Figure: elliptical contours of the residual sum of squares around the least squares estimate, with the diamond-shaped lasso constraint region and the disk-shaped ridge constraint region in the (β1, β2) plane.]
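One way to see the selection-versus-shrinkage contrast is the orthonormal-design special case, where both estimators have closed forms in terms of the OLS coefficients. The numbers below are made up, and the exact form of each shrinkage rule depends on how the penalty is parameterized (stated in the comments).

    # Closed forms under an orthonormal design (X'X = I):
    #   lasso, (1/2)RSS + lambda*||beta||_1      : sign(b) * pmax(|b| - lambda, 0)
    #          -> soft thresholding: exact zeros (selection + shrinkage)
    #   ridge, (1/2)RSS + (lambda/2)*||beta||_2^2: b / (1 + lambda)
    #          -> proportional shrinkage, no zeros
    # b holds the OLS coefficients; the values are made up for illustration.
    b      <- c(2.5, -1.2, 0.3, -0.1)
    lambda <- 0.5

    lasso_coef <- sign(b) * pmax(abs(b) - lambda, 0)
    ridge_coef <- b / (1 + lambda)

    rbind(ols = b, lasso = lasso_coef, ridge = ridge_coef)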


Diabetes Data

[Figure: coefficient paths for the diabetes data, ten predictors labeled 1–10, plotted against t = ∑j |βj|: Lasso (left) and Stagewise (right). The two panels are virtually identical.]


Why are Forward Stagewise and Lasso so similar?

• Are they identical?

• In the orthogonal predictor case: yes.

• In the hard-to-verify case of monotone coefficient paths: yes.

• In general, almost!

• Least angle regression (LAR) provides answers to these questions, and an efficient way to compute the complete lasso sequence of solutions.


Least Angle Regression — LAR

Like a “more democratic” version of forward stepwise regression.

1. Start with r = y and β1 = β2 = · · · = βp = 0. Assume the xj are standardized.

2. Find the predictor xj most correlated with r.

3. Increase βj in the direction of sign〈r, xj〉 until some other competitor xk has as much correlation with the current residual as does xj.

4. Move (βj, βk) in the joint least squares direction for (xj, xk) until some other competitor xℓ has as much correlation with the current residual.

5. Continue in this way until all predictors have been entered. Stop when 〈r, xj〉 = 0 for all j, i.e. at the OLS solution.
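The lars package (introduced a few slides later) computes all three paths, which makes the comparison easy to see for yourself. This sketch assumes the diabetes data that ships with the package.

    # The lars package computes all three paths; on the diabetes data shipped
    # with the package the coefficient profiles nearly coincide.
    library(lars)
    data(diabetes)

    fit_lar   <- lars(diabetes$x, diabetes$y, type = "lar")
    fit_lasso <- lars(diabetes$x, diabetes$y, type = "lasso")
    fit_stage <- lars(diabetes$x, diabetes$y, type = "forward.stagewise")

    par(mfrow = c(1, 3))
    plot(fit_lar); plot(fit_lasso); plot(fit_stage)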


[Figure: the LAR geometry with two predictors x1 and x2, showing successive fits µ0, µ1 and the equiangular direction u2.]

The LAR direction u2 at step 2 makes an equal angle with x1 and x2.


Relationship between the 3 algorithms

• Lasso and forward stagewise can be thought of as restricted versions of LAR.

• Lasso: start with LAR. If a coefficient crosses zero, stop; drop that predictor, recompute the best direction and continue. This gives the lasso path.

Proof: use the KKT conditions for the appropriate Lagrangian. Informally:

∂/∂βj [ (1/2)||y − Xβ||² + λ ∑j |βj| ] = 0

⇔ 〈xj, r〉 = λ · sign(βj) if βj ≠ 0 (active)
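A quick numerical check of this stationarity condition, again using the lars diabetes data (whose columns are already centered with unit norm, so no extra standardization is needed); the step index chosen below is arbitrary.

    # Numerical check: at a breakpoint on the lasso path, |<x_j, r>| is the same
    # (= lambda) for every active predictor, and no smaller than for the
    # inactive ones.
    library(lars)
    data(diabetes)
    x <- diabetes$x
    y <- diabetes$y - mean(diabetes$y)

    fit  <- lars(x, y, type = "lasso")
    beta <- coef(fit)[5, ]              # coefficients after 4 steps (row 1 = all zero)
    r    <- y - x %*% beta
    round(drop(crossprod(x, r)), 2)     # common magnitude on the active set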


• Forward Stagewise: compute the LAR direction, but constrain the signs of the coefficients to match the signs of the correlations corr(r, xj).

• The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size ε → 0, one can show that it coincides with this modified version of LAR.


lars package

• The LARS algorithm computes the entire Lasso/FS/LAR path in the same order of computation as one full least squares fit.

• When p ≫ N, the solution has at most N non-zero coefficients. It works efficiently for microarray data (p in the thousands).

• Cross-validation is quick and easy.

• Available for R (CRAN) and Splus.
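Typical usage might look like the sketch below: fit the whole path once, cross-validate to choose a point on it, then read off coefficients. The diabetes data ships with the package, and the fraction s = 0.5 is just an example value.

    # Fit the full lasso path once, then select a position on it.
    library(lars)
    data(diabetes)

    fit <- lars(diabetes$x, diabetes$y, type = "lasso")
    plot(fit)                                 # full coefficient path

    cv.lars(diabetes$x, diabetes$y, K = 10)   # plots CV error vs |beta|/max|beta|
    coef(fit, s = 0.5, mode = "fraction")     # coefficients half-way along the path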


Forward Stagewise and the Monotone Lasso

[Figure: positive and negative coefficient paths for the Lasso (top pair of panels) and Forward Stagewise (bottom pair), plotted against the standardized L1 norm (0 to 80).]

• Expand the variable set to include the negative versions −xj of each variable.

• The original lasso corresponds to a positive lasso in this enlarged space.

• Forward stagewise corresponds to a monotone lasso. The L1 norm ||β||1 in this enlarged space is arc length.

• Forward stagewise produces the maximum decrease in loss per unit arc length in the coefficients.

Tech-report available from my web page.


Degrees of Freedom of Lasso

• The df, or effective number of parameters, gives us an indication of how much fitting we have done.

• Stein's Lemma: if the yi are independent N(µi, σ²), then

df(µ) := ∑(i=1..n) cov(µi, yi)/σ² = E[ ∑(i=1..n) ∂µi/∂yi ]

• Degrees of freedom formula for LAR: after k steps, df(µk) = k exactly (amazing! with some regularity conditions).

• Degrees of freedom formula for the lasso: let k(λ) be the number of non-zero elements in βλ. Then E[k(λ)] = df(µλ), i.e. the count of non-zero coefficients is an unbiased estimate of the df.
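A rough Monte Carlo check of the LAR result using the covariance form of df; the design, k = 4, and the replication count are arbitrary illustrative choices, and the fitted mean is left out so that only the LAR component is counted.

    # Rough Monte Carlo check of the covariance definition of df:
    #   df = sum_i cov(muhat_i, y_i) / sigma^2,
    # which for the LAR fit after k steps should come out close to k.
    library(lars)
    set.seed(3)
    n <- 50; p <- 8; sigma <- 1; k <- 4; B <- 2000
    X  <- scale(matrix(rnorm(n * p), n, p))           # centered, standardized
    mu <- drop(X %*% c(2, -1.5, 1, rep(0, p - 3)))    # true mean

    ys <- fits <- matrix(0, B, n)
    for (b in 1:B) {
      y    <- mu + sigma * rnorm(n)
      beta <- coef(lars(X, y, type = "lar"))[k + 1, ] # coefficients after k steps
      ys[b, ]   <- y
      fits[b, ] <- drop(X %*% beta)                   # LAR fit (mean term omitted)
    }
    sum(sapply(1:n, function(i) cov(fits[, i], ys[, i]))) / sigma^2   # roughly k = 4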


df for LAR

[Figure: LAR coefficient path (standardized coefficients against |beta|/max|beta|), with the df after each step (0, 2, 3, 4, 5, 7, 8, 10) labeled along the top of the plot.]

• The df are labeled at the top of the figure.

• At the point a competitor enters the active set, the df are incremented by 1.

• This is not true, for example, for stepwise regression.


Back to Boosting

• Work with Rosset and Zhu (JMLR 2004) extends the connections between forward stagewise and L1-penalized fitting to other loss functions, in particular the exponential loss of Adaboost and the binomial loss of Logitboost.

• In the separable case, L1-regularized fitting with these losses converges to an L1-margin-maximizing solution (defined by β*) as the penalty disappears, i.e. if

β(t) = arg min L(y, f) s.t. |β| ≤ t,

then

lim(t↑∞) β(t)/|β(t)| → β*.

• Then mini yi F*(xi) = mini yi xiᵀβ*, the L1 margin, is maximized.


• When the monotone lasso is used in the expanded feature space, the connection with boosting (with shrinkage) is more precise.

• This ties in very nicely with the L1-margin explanation of boosting (Schapire, Freund, Bartlett and Lee, 1998).

• It makes connections between SVMs and boosting, and makes explicit the margin-maximizing properties of boosting.

• Experience from statistics suggests that some β(t) along the path might perform better — a.k.a. stopping early.

• Zhao and Yu (2004) incorporate backward corrections with forward stagewise, and produce a boosting algorithm that mimics the lasso.


Maximum Margin and Overfitting

Mixture data from ESL. Boosting with 4-node trees, gbm package in R, shrinkage = 0.02, Adaboost loss.

[Figure: two panels against the number of trees (0 to 10K): the margin (left, rising from about −0.3 to above 0) and the test error (right, between about 0.25 and 0.28).]
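A hedged sketch of how the setup above might be reproduced with gbm. It assumes the ESL mixture data has been loaded into a data frame mix with predictors x1, x2 and a 0/1 response y (gbm's "adaboost" loss expects a 0/1 coding), and interaction.depth = 3 is my reading of "4-node trees".

    # Hedged sketch: mix, x1, x2, y are assumed names for the ESL mixture data.
    library(gbm)

    fit <- gbm(y ~ x1 + x2, data = mix,
               distribution = "adaboost",     # Adaboost exponential loss
               n.trees = 10000,
               shrinkage = 0.02,
               interaction.depth = 3)         # roughly 4-node trees

    # score f(x) using all 10000 trees; sign(f) gives the class prediction
    f <- predict(fit, mix, n.trees = 10000)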


Lasso or Forward Stagewise?

• Microarray example (Golub data): N = 38, p = 7129, binary response ALL vs. AML.

• Lasso behaves chaotically near the end of the path, while Forward Stagewise is smooth and stable.

[Figure: standardized coefficient paths against |beta|/max|beta| for the Golub data: LASSO (left) and Forward Stagewise (right), with a handful of gene indices labeled on each panel.]


More Path Algorithms

• Elasticnet (Zou and Hastie, 2005). A compromise between lasso and ridge: minimize ∑i (yi − ∑j xij βj)² subject to α||β||1 + (1 − α)||β||2² ≤ t. Useful for situations where variables operate in correlated groups (genes in pathways).

• Glmpath (Park and Hastie, 2005). Approximates the L1 regularization path for generalized linear models, e.g. logistic regression and Poisson regression.


elasticnet package (Hui Zou)

• Min ∑i (yi − ∑j xij βj)² s.t. α · ||β||2² + (1 − α) · ||β||1 ≤ t.

• The mixed penalty selects correlated sets of variables in groups.

• For fixed α, the LARS algorithm, along with a standard ridge regression trick, lets us compute the entire regularization path.
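A sketch of how the enet function might be called, reusing the lars diabetes data. Treat the argument names and the lambda = 0.5 value (matching the elastic net panel in the figure below) as assumptions to check against the package documentation.

    # Sketch of enet() on the lars diabetes data; lambda is the quadratic
    # penalty parameter, and lambda = 0 recovers the lasso path.
    library(elasticnet)
    library(lars)
    data(diabetes)

    fit_lasso <- enet(diabetes$x, diabetes$y, lambda = 0)      # lasso path
    fit_enet  <- enet(diabetes$x, diabetes$y, lambda = 0.5)    # elastic net path

    par(mfrow = c(1, 2))
    plot(fit_lasso)
    plot(fit_enet)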

[Figure: standardized coefficient paths against s = |beta|/max|beta| for six variables: Lasso (left) and Elastic Net with lambda = 0.5 (right).]


glmpath package

[Figure: two coefficient-path plots (standardized coefficients against λ) for five variables x1–x5, with the points computed along each path marked.]

• max ℓ(β) s.t. ||β||1 ≤ t.

• Uses predictor-corrector methods from convex optimization.

• Computes the exact path at a sequence of index points t.

• Can approximate the junctions (in t) where the active set changes.

• coxpath is included in the package.
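A sketch of a glmpath call for L1-regularized logistic regression; it assumes the heart.data example object (with components x and y) that I believe ships with the package.

    # L1-regularized logistic regression path with glmpath.
    library(glmpath)
    data(heart.data)

    fit <- glmpath(heart.data$x, heart.data$y, family = binomial)
    plot(fit)      # coefficient path against lambda, as in the figure above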


. . . and more Path Algorithms

• Bach and Jordan (2004) have path algorithms for kernel estimation, and for efficient ROC curve estimation. The latter is a useful generalization of the svmpath algorithm discussed later.

• Rosset and Zhu (2004) discuss conditions needed to obtain piecewise-linear paths. A combination of a piecewise quadratic/linear loss function and an L1 penalty is sufficient.

• Yuan and Lin (2006), following Lin and Zhang (2002), extend the lasso to deal with groups of variables (e.g. dummy variables for factors):

maxβ ℓ(y; β) − λ ∑m γm ||βm||2 (sum over the M groups).

The ℓ2 norm ensures the vector of coefficients βm for each group are all zero or non-zero together.

• Using predictor-corrector methods we can construct the path for this criterion: Park & Hastie (2006), Regularization path algorithms for detecting gene interactions.


Path algorithms for the SVM

• The two-class SVM classifier f(X) = α0 + ∑(i=1..N) αi K(X, xi) yi can be seen to have a quadratic penalty and piecewise-linear loss. As the cost parameter C is varied, the Lagrange multipliers αi change piecewise-linearly.

• This allows the entire regularization path to be traced exactly. The active set is determined by the points sitting exactly on the margin.
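A hedged sketch of an svmpath call on small simulated data; y must be coded −1/+1, and radial.kernel with param.kernel = 1 is meant to mirror the gamma = 1 panel of the figure below. The argument names (kernel.function, param.kernel, and the lambda argument to predict) are from memory and should be checked against the package documentation.

    # Hedged sketch of svmpath on simulated two-class data.
    library(svmpath)
    set.seed(4)
    x <- rbind(matrix(rnorm(50, mean = -1), 25, 2),
               matrix(rnorm(50, mean =  1), 25, 2))
    y <- rep(c(-1, 1), each = 25)

    fit <- svmpath(x, y, kernel.function = radial.kernel, param.kernel = 1)

    # decision values at one value of the regularization parameter lambda
    # (argument name per my recollection of predict.svmpath)
    pred <- predict(fit, x, lambda = 1)
    table(sign(pred), y)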

[Figure: three snapshots of SVM solutions along the path. Panel 1: 12 points, 6 per class, separated (Step 17, Error 0, Elbow Size 2, Loss 0). Panel 2: mixture data, radial kernel with γ = 1.0 (Step 623, Error 13, Elbow Size 54, Loss 30.46). Panel 3: mixture data, radial kernel with γ = 5 (Step 483, Error 1, Elbow Size 90, Loss 1.01).]


SVM as a regularization method

[Figure: binomial log-likelihood and support vector (hinge) losses plotted against the margin y f(x).]

With f(x) = xᵀβ + β0 and yi ∈ {−1, 1}, consider

min(β0, β) ∑(i=1..N) [1 − yi f(xi)]+ + (λ/2)||β||²

This hinge loss criterion is equivalent to the SVM, with λ monotone in the cost parameter C. Compare with

min(β0, β) ∑(i=1..N) log[1 + e^(−yi f(xi))] + (λ/2)||β||²

This is binomial deviance loss, and the solution is "ridged" linear logistic regression.
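For a quick visual comparison of the loss terms in the two criteria, the short base-R sketch below plots the hinge loss and the binomial deviance against the margin y f(x); the deviance is plotted unscaled rather than matched to the figure above.

    # Hinge loss vs binomial deviance as functions of the margin m = y*f(x).
    hinge     <- function(m) pmax(1 - m, 0)
    binom_dev <- function(m) log(1 + exp(-m))

    curve(hinge(x), from = -3, to = 3, lwd = 2, col = "blue",
          xlab = "y f(x)  (margin)", ylab = "Loss")
    curve(binom_dev(x), add = TRUE, lwd = 2, col = "red")
    legend("topright", c("Hinge (SVM)", "Binomial deviance"),
           col = c("blue", "red"), lwd = 2)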


The Need for Regularization

[Figure: Test Error Curves — SVM with Radial Kernel. Four panels, for γ = 5, 1, 0.5, 0.1, showing test error (about 0.20 to 0.35) against C = 1/λ on a log scale from 1e−01 to 1e+03.]

• γ is a kernel parameter: K(x, z) = exp(−γ||x − z||²).

• λ (or C) is a regularization parameter, which has to be determined by some means such as cross-validation.


• Using logistic regression with binomial loss, or the Adaboost exponential loss, and the same quadratic penalty as the SVM, we get the same limiting margin as the SVM (Rosset, Zhu and Hastie, JMLR 2004).

• Alternatively, using the "hinge loss" of SVMs and an L1 penalty (rather than a quadratic one), we get a lasso version of SVMs (with at most N variables in the solution for any value of the penalty).


Concluding Comments

• Boosting fits a monotone L1 regularization path toward a maximum-margin classifier.

• Many modern function estimation techniques create a path of solutions via regularization.

• In many cases these paths can be computed efficiently and entirely.

• This facilitates the important step of model selection — selecting a desirable position along the path — using a test sample or by cross-validation.