
Lasso applications: regularisation and homotopy

M.R. Osborne
Mathematical Sciences Institute, Australian National University
mailto:[email protected]


Abstract

Ti suggested the use of an l1 norm constraint with variable bound in order to carry out variable selection in linear least squares estimation problems, an application at the time of somewhat marginal interest and not without some controversy (A. Miller, Subset Selection in Regression, Chapman and Hall). The interest level changed a few years later. Do and many others noted the connection between l0 and l1 regularisation and its relevance to compressed sensing. In addition, OPT derived the remarkably efficient homotopy algorithm. One consequence has been further serious consideration of classes of variable selection problems. Aspects of these will be summarised from the point of view of the selection of a suitable regulariser and the availability of an appropriate form of the homotopy algorithm.


Outline

Introduction

Original homotopy

Considerations when p > n

Coping with correlated predictors

When not?

references


The original constraint

This form of the lasso algorithm imposes an l1 norm bound constraint

$$\sum_{i=1}^{p} |\beta_i| \le \kappa$$

to regularise the estimation of least squares models, and then tunes the calculation using this bound. This leads to a form of quadratic program for each value of κ. Given κ, this can be solved by a straightforward modification of a standard active set QP algorithm, the point of the modification being to avoid representing the l1 constraint as a large system of linear inequalities by working with the constraint directly.
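For concreteness, here is a minimal sketch of this bound-constrained problem handed to a generic solver (scipy's SLSQP, chosen purely for illustration; the modified active-set method discussed here handles the l1 constraint directly and far more efficiently):

    import numpy as np
    from scipy.optimize import minimize

    def lasso_bound(X, y, kappa):
        """Solve min ||y - X b||_2^2 subject to sum_i |b_i| <= kappa.

        A generic NLP solve for illustration only, not the active-set
        modification described in the talk.
        """
        p = X.shape[1]
        objective = lambda b: float(np.sum((y - X @ b) ** 2))
        # "ineq" means fun(b) >= 0, encoding kappa - ||b||_1 >= 0
        con = {"type": "ineq", "fun": lambda b: kappa - np.abs(b).sum()}
        res = minimize(objective, np.zeros(p), method="SLSQP", constraints=[con])
        return res.x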


Homotopy

However, the particular interest was the discovery that the complete solution trajectory parametrised by κ is piecewise linear and can be calculated very efficiently. Frequently it takes no more work than a column pivoting algorithm for the unconstrained least squares problem, or than the solution of the l1 constrained quadratic program for a single κ value.


l1 selection

Figure: The selection idea!


Smooth objective applications

Start with the linear model:

$$r = y - X\beta, \qquad X : \mathbb{R}^p \to \mathbb{R}^n,$$

where rank X = min(p, n). Problem: select a small subset of the columns of X so that ‖r‖₂ is small in an appropriate sense. Applications (by no means exhaustive):

1. Exploratory data analysis (y is a signal observed in the presence of noise). Here the classical case corresponds to p ≪ n.

2. Economising the representation of a sampled signal in a manner compatible with adequate reconstruction. This is closely related to the problem of finding sparsest solutions of under-determined linear systems, the l0 problem. Here the case of interest corresponds to p ≫ n.

3. Selection when the correlation structure in the observed variables is important. The need for an appropriate tool for micro-array data analysis, frequently involving hundreds of observations and thousands of genes, has contributed to recent interest.


Necessary conditions

Let μ be the Lagrange multiplier for the l1 constraint. Then

$$r^T X = \mu v^T, \qquad \mu \ge 0, \qquad v \in \partial\|\beta\|_1, \qquad \mu = \frac{r^T X \beta}{\|\beta\|_1}.$$

Note μ = 0 if κ ≥ ‖β_LS‖₁.

Introduce an index set ψ pointing to the nonzero components of β (the currently selected variables) and a permutation matrix P_ψ which collects together these nonzero components. Then

$$\beta = P_\psi^T \begin{bmatrix} \beta_\psi \\ 0 \end{bmatrix}, \qquad v = P_\psi^T \begin{bmatrix} \theta_\psi \\ v_2 \end{bmatrix} \in \partial\|\beta\|_1,$$

$$(\theta_\psi)_j = \operatorname{sgn}(\beta_{\psi(j)}), \qquad -1 \le (v_2)_k \le 1, \ k \in \psi^c,$$

$$\psi \cup \psi^c = \{1, 2, \dots, p\}, \qquad v^T \beta = \|\beta\|_1, \qquad \|v\|_\infty = 1.$$
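These conditions are cheap to verify numerically at a candidate solution. A small diagnostic sketch (function name and tolerances are mine, for illustration):

    import numpy as np

    def check_lasso_kkt(X, y, beta, tol=1e-8):
        """Check the necessary conditions above at a candidate beta.

        On the active set psi: (X^T r)_j = mu * sgn(beta_j);
        off psi: |(X^T r)_k| <= mu, i.e. the v_2 components lie in [-1, 1].
        """
        r = y - X @ beta
        g = X.T @ r                          # the vector r^T X
        psi = np.abs(beta) > tol             # currently selected variables
        if not psi.any():                    # beta = 0 corresponds to kappa = 0
            return float(np.abs(g).max()), True
        mu = float(g @ beta / np.abs(beta).sum())   # mu = r^T X beta / ||beta||_1
        on_ok = np.allclose(g[psi], mu * np.sign(beta[psi]), atol=1e-6)
        off_ok = np.all(np.abs(g[~psi]) <= mu + 1e-6)
        return mu, bool(on_ok and off_ok)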


Solution

Make a partial orthogonal transformation:

$$X P_\psi^T = Q \begin{bmatrix} U_1 & U_{12} \\ 0 & B \end{bmatrix}, \qquad Q^T y = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}.$$

This leads to the equations

$$U_1 \beta_\psi = c_1 - \mu w_\psi,$$

$$\mu v_2 = B^T c_2 + \mu U_{12}^T w_\psi,$$

$$\mu = \frac{w_\psi^T c_1 - \kappa}{w_\psi^T w_\psi},$$

where $w_\psi = U_1^{-T} \theta_\psi$.


Homotopy equations

Differentiating the necessary conditions with respect to κ gives:

$$\frac{d\mu}{d\kappa} = -\frac{1}{w_\psi^T w_\psi},$$

$$U_1 \frac{d\beta_\psi}{d\kappa} = \frac{1}{w_\psi^T w_\psi} w_\psi,$$

$$\frac{d(\mu v_2)}{d\kappa} = -\frac{1}{w_\psi^T w_\psi} U_{12}^T w_\psi.$$

The right hand side of this system is independent of κ. It follows that the ODE solution is linear in κ at all points where the constraint is differentiable.
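In practice the piecewise linear trajectory can be traced with any homotopy/LARS implementation. A minimal sketch, assuming scikit-learn is available (its lars_path parametrises the path by the multiplier rather than by κ, but the breakpoints correspond):

    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(50)

    # alphas: breakpoints of the piecewise linear path (multiplier scale);
    # coefs[:, j]: the solution at breakpoint j. Between breakpoints the
    # coefficients vary linearly, exactly as the homotopy equations predict.
    alphas, active, coefs = lars_path(X, y, method="lasso")
    print(active)        # order in which variables enter the active set
    print(coefs.shape)   # (p, number of breakpoints)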


Starting out – κ small enough

Assume the component of maximum modulus of $X^T y$ is $X_{*1}^T y$ and is unique. Then, using the necessary conditions,

$$\psi = \{1\}, \qquad \mu = |X_{*1}^T y| - \kappa \|X_{*1}\|^2, \qquad \beta_1 = \theta_1 \kappa.$$

The condition that the components of v₂ lie in the correct range is

$$\max_{i>1} \left| \frac{\theta_1 \kappa\, X_{*i}^T X_{*1} - X_{*i}^T y}{|X_{*1}^T y| - \kappa \|X_{*1}\|^2} \right| < 1.$$

This is satisfied for κ small enough.
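The starting point is therefore cheap to compute. A sketch of selecting the first active variable (helper name is mine):

    import numpy as np

    def first_variable(X, y):
        """Identify the variable selected as kappa leaves zero: the column
        of maximum |X^T y| (assumed unique), together with the sign theta_1
        of the entering coefficient beta_1 = theta_1 * kappa."""
        g = X.T @ y
        j = int(np.argmax(np.abs(g)))
        theta = float(np.sign(g[j]))
        return j, theta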


Solution trajectory

The homotopy equations show that the solution is piecewise linear. This makes it a simple computation to follow the solution trajectory until smoothness breaks down. The continuity of the trajectory, guaranteed by standard perturbation results, then shows how to restart at the breakpoints.

This observation is the basis for the OPT homotopy algorithm. In many cases it proves to be remarkably efficient, often computing the entire solution trajectory in little more than the cost of solving the unconstrained problem, and returning significant additional information. It links to the standard least squares solution algorithm with column pivoting based on orthogonal factorization if standard stepwise updating and downdating techniques are used in the partial factorization steps.


Results: LSQ homotopy

                 p     n    XA    XD
    Hald         4    13     4     0
    Iowa         8    33     8     0
    diabetes    10   442    11     1
    housing     13   506    13     0

Table: Step counts for homotopy algorithm – least squares objective

Here XA counts iterations while XD counts delete-variable steps. Variable addition is by far the most common action, which explains the observed efficiency. Ti noted that addition is the only action when the columns of the design are orthogonal. This probably remains true when the variables have a reasonable independence property.


Multiple responses

TVW consider simultaneous selection of a common set of predictor variables for several responses. This was further generalised by TW to permit selection of variable groups based on an a priori imposed partition of the variables. The original TVW problem had 24 observations on 14 responses and 177 predictor variables, so the data had both p ≫ n and significant correlation between the predictor variables. The basic problem quantities are:

$$Y = \begin{bmatrix} y_1 & \cdots & y_k \end{bmatrix}, \qquad X = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix}, \qquad B = \begin{bmatrix} \beta_1 & \cdots & \beta_k \end{bmatrix}, \qquad E = Y - XB.$$


Necessary conditions

The objective is ‖E‖²_F. The constraint considered is

$$\sum_{l=1}^{p} \|\beta_{(l)}\|_\infty \le \kappa,$$

where bracketed subscripts indicate row vectors. The necessary conditions are

$$X^T (Y - XB) = \mu V,$$

with μ ≥ 0 the multiplier for the constraint. Calculation of V is the interesting part! If ‖β_(l)‖_∞ = 0 then ‖v_(l)‖₁ ≤ 1. If ‖β_(l)‖_∞ > 0 then ‖v_(l)‖₁ = 1, with v_lj ≥ 0 if β_lj = ‖β_(l)‖_∞, −v_lj ≥ 0 if β_lj = −‖β_(l)‖_∞, and v_lj = 0 if |β_lj| ≠ ‖β_(l)‖_∞.

The homotopy development is now straightforward.
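The row-wise ∞-norm sum is what forces a common selection across responses: predictor l is dropped only if the whole l-th row of B vanishes. A one-line evaluation sketch (helper name is mine):

    import numpy as np

    def row_linf_sum(B):
        """Constraint value sum_l ||B_(l)||_inf for the p x k matrix B.
        Predictor l is excluded only when its entire row is zero."""
        return float(np.abs(B).max(axis=1).sum())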


l1 lasso penalty objections

Tu may have been the first to consider p > n, in using the lasso to select knots in regression splines. TVW and TW considered problems with p > n and correlated predictors. Subsequent applications (for example ZH) such as compressed sensing and micro-array analysis have led to critical assessment of the l1 lasso:

(a) If p > n the lasso selects at most n variables.

(b) If there is a group of correlated variables then the lasso tends to select just one, and this with some arbitrariness.

(c) For "usual" n > p cases, if there are high correlations between predictors, Ti has observed that standard stepwise regression appears more satisfactory.


Elastic net

In its "naive" form this considers the two-parameter Lagrangian objective

$$L(\beta, \mu, \nu) = \|r\|_2^2 + \mu \|\beta\|_1 + \nu \|\beta\|_2^2,$$

which clearly mixes the lasso and ridge objectives. For each ν > 0 this is a positive definite quadratic form in β subject to a lasso constraint, so the homotopy algorithm applies directly. Its merits are argued by ZH.


Clustering of predictor variables

Advantages: some applications suggest merit in combining predictors. One example is the possibility of introducing variables based on averaging predictors, referred to as "super genes" in micro-array analysis.

Disadvantages: correlation cannot improve the conditioning of the least squares problem.

Note: the known cases where good performance of the homotopy algorithm is guaranteed correspond to strong linear independence results.

Note: the ridge component of the elastic net provides help in this direction. It gives the effective design

$$\hat{X} = \begin{bmatrix} X \\ \nu^{1/2} I \end{bmatrix}.$$
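This augmented design makes the reduction concrete: the naive elastic net is just a lasso on data with $\nu^{1/2} I$ appended to X and zeros appended to y, so any lasso solver (homotopy included) applies unchanged. A sketch (helper name is mine):

    import numpy as np

    def elastic_net_augmented(X, y, nu):
        """Return the augmented (X, y) on which the naive elastic net
        with ridge weight nu becomes an ordinary lasso problem."""
        n, p = X.shape
        X_aug = np.vstack([X, np.sqrt(nu) * np.eye(p)])
        y_aug = np.concatenate([y, np.zeros(p)])
        return X_aug, y_aug

The augmented columns also guarantee full column rank, which is the kind of strong linear independence the homotopy analysis wants.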


OSCAR

BR suggest the constraint

$$\sum_{j=1}^{p} |\beta_j| + c \sum_{k=2}^{p} \sum_{j<k} \max\{|\beta_j|, |\beta_k|\} \le \kappa,$$

where the tuning constant c > 0. Here the predictor variables are centred and scaled, and the response is centred. The role of the mixed term is to help identify correlated variables. Some insight into this comes from noting that the constraint can also be written

$$\sum_{i=1}^{p} \{c(i-1) + 1\} \, |\beta|_{(i)} \le \kappa,$$

where the bracketed subscript indicates that the components are sorted in increasing order of magnitude.
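The sorted form is also the convenient one to compute. A sketch of evaluating the constraint value (helper name is mine):

    import numpy as np

    def oscar_constraint(beta, c):
        """Evaluate sum_i {c*(i-1) + 1} |beta|_(i), with |beta|_(1) the
        smallest magnitude, matching the sorted form of the constraint."""
        a = np.sort(np.abs(beta))             # increasing order of magnitude
        w = c * np.arange(len(a)) + 1.0       # weights c*(i-1) + 1
        return float(w @ a)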


2D unit balls

Figure: 2D unit balls. (a) Elastic net, (b) signed rank norm.


Signed rank selection

Figure: Signed rank constraint in 2D. (a) correlation 0.15, (b) correlation 0.85. Centred and standardised variables.


QP formulation

For each value of κ, OSCAR can be written as a QP with O(p²) constraints:

$$\min_{\beta^+, \beta^-, \eta} \left\| y - X(\beta^+ - \beta^-) \right\|,$$

subject to

$$\sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) + c \sum_{k=2}^{p} \sum_{j<k} \eta_{jk} \le \kappa,$$

$$\eta_{jk} \ge \beta_j^+ + \beta_j^-, \qquad \eta_{jk} \ge \beta_k^+ + \beta_k^-, \qquad 1 \le j < k \le p,$$

$$\beta_j^+ \ge 0, \quad \beta_j^- \ge 0, \qquad j = 1, 2, \dots, p.$$
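A direct transcription of this QP, assuming cvxpy as a generic solver (variable names follow the slide; this brute-force O(p²)-constraint form is exactly what the homotopy discussion below hopes to avoid):

    import cvxpy as cp
    import numpy as np

    def oscar_qp(X, y, kappa, c):
        """Solve the OSCAR QP in the (beta+, beta-, eta) formulation.
        Minimising the squared norm gives the same solution as the norm."""
        n, p = X.shape
        bp = cp.Variable(p, nonneg=True)      # beta^+
        bm = cp.Variable(p, nonneg=True)      # beta^-
        eta = cp.Variable((p, p))
        pairs = [(j, k) for k in range(1, p) for j in range(k)]
        cons = [eta[j, k] >= bp[j] + bm[j] for j, k in pairs]
        cons += [eta[j, k] >= bp[k] + bm[k] for j, k in pairs]
        cons += [cp.sum(bp + bm) + c * sum(eta[j, k] for j, k in pairs) <= kappa]
        prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ (bp - bm))), cons)
        prob.solve()
        return bp.value - bm.value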


Homotopy anyone?

One possibility is as a post-optimality calculation applied to the QP formulation, starting from κ = 0. But note that 0 is a degenerate extreme point of the epigraph of the constraint function. One possible edge leading from here is a group of p − 2 independent ties moving away from zero – not typically the required selection!

A more compact form can follow the l1 homotopy, and requires computation of the sub-differential of the constraint at every point of the trajectory. That is an interesting exercise in convex analysis for a start! If it is worth the effort then it can be done!


Non-smooth objective homotopy

There is one striking difference in the properties of the optimal homotopy trajectory between the case when the objective is at least once continuously differentiable and the non-smooth case when the objective is piecewise linear. In the former case the Lagrange multiplier for the l1 constraint is a piecewise linear, continuous function of the constraint bound κ, with the characteristic property that it decreases steadily from its initial positive value at κ = 0 to 0 for κ large enough. In contrast, the corresponding Lagrange multiplier associated with a piecewise linear objective subject to an l1 bound constraint is a decreasing step function of κ, with jumps at non-smooth points of both the objective and the constraint. It is necessary to include an explicit multiplier update phase, as these jumps have to be determined as part of the computation. This phase uses λ, the multiplier, as the homotopy parameter.


l1 objective

The l1 lasso is a particular case of the quantile regression lasso with the quantile parameter set to 0.5:

$$\min_\beta \|r\|_1, \qquad \|\beta\|_1 \le \kappa.$$

We need the Lagrangian form with multiplier λ,

$$L(\beta, \lambda) = \|r\|_1 + \lambda \{\|\beta\|_1 - \kappa\},$$

which is convex if λ ≥ 0. The necessary conditions give

$$0 \in \partial_\beta L(\beta, \lambda) = \partial_\beta \|r\|_1 + \lambda \, \partial_\beta \|\beta\|_1.$$

This is also the condition for the minimum of the l1 minimization problem (λ fixed)

$$\min_\beta \{\|r\|_1 + \lambda \|\beta\|_1\}.$$
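For a fixed λ this penalised form is a linear program, which gives a cheap (if brute-force) way to check individual points on the trajectory. A sketch using scipy's linprog, splitting β = β⁺ − β⁻ and r = u⁺ − u⁻:

    import numpy as np
    from scipy.optimize import linprog

    def l1_l1_point(X, y, lam):
        """Solve min ||y - X b||_1 + lam * ||b||_1 as an LP.
        Variables ordered [b+, b-, u+, u-], all nonnegative; at the
        optimum u+ + u- = |y - X b| componentwise."""
        n, p = X.shape
        cost = np.concatenate([lam * np.ones(2 * p), np.ones(2 * n)])
        A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])   # X b + u+ - u- = y
        res = linprog(cost, A_eq=A_eq, b_eq=y,
                      bounds=[(0, None)] * (2 * p + 2 * n))
        return res.x[:p] - res.x[p:2 * p]

This is a single-λ check only; the homotopy follows the whole trajectory and locates the multiplier jumps directly.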


Residual zeros are non-smooth points for l1

To follow zeros, set

$$\sigma = \{i : r_i = 0\}, \qquad \psi = \{i : \beta_i \ne 0\}.$$

Define set complements by $\sigma \cup \sigma^c = \{1, 2, \dots, n\}$, $\psi \cup \psi^c = \{1, 2, \dots, p\}$, and permutation matrices $P_\sigma : \mathbb{R}^n \to \mathbb{R}^n$, $Q_\psi : \mathbb{R}^p \to \mathbb{R}^p$ by

$$P_\sigma r = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}, \qquad \begin{cases} (r_1)_i = r_{\sigma^c(i)} \ne 0, & i = 1, 2, \dots, n - |\sigma|, \\ (r_2)_i = r_{\sigma(i)} = 0, & i = 1, 2, \dots, |\sigma|, \end{cases}$$

$$Q_\psi \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \qquad \begin{cases} (\beta_1)_i = \beta_{\psi(i)} \ne 0, & i = 1, 2, \dots, |\psi|, \\ (\beta_2)_i = \beta_{\psi^c(i)} = 0, & i = 1, 2, \dots, p - |\psi|, \end{cases}$$

$$P_\sigma X Q_\psi^T = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}, \qquad P_\sigma y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}.$$


Necessary conditions

We have subdifferential components for the permuted system,

$$\begin{bmatrix} \theta_\sigma^T & v_\sigma^T \end{bmatrix} \in \partial_r \|P_\sigma r\|_1, \qquad \begin{bmatrix} \theta_\psi^T & u_\psi^T \end{bmatrix} \in \partial_\beta \|Q_\psi \beta\|_1.$$

These permit the necessary conditions to be written:

$$\begin{bmatrix} \theta_\sigma^T & v_\sigma^T \end{bmatrix} \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} = \lambda \begin{bmatrix} \theta_\psi^T & u_\psi^T \end{bmatrix}, \qquad \lambda \ge 0,$$

$$-1 \le v_i \le 1, \ i = 1, 2, \dots, |\sigma|, \qquad -1 \le u_i \le 1, \ i = 1, 2, \dots, |\psi^c|,$$

$$\theta_\sigma^T r_1 = \begin{bmatrix} \theta_\sigma^T & v_\sigma^T \end{bmatrix} P_\sigma r = \|r\|_1, \qquad \theta_\psi^T \beta_1 = \begin{bmatrix} \theta_\psi^T & u_\psi^T \end{bmatrix} Q_\psi \beta = \|\beta\|_1 \le \kappa.$$


Results: l1 homotopy

                 p     n   SASD   SAXA   XDXA   XDSD
    Hald         4    13     17      3      0      0
    Iowa         8    33     18     11      1      4
    diabetes    10   442    546     12      0      3
    housing     13   506    872     28      1     16

Table: Step counts for homotopy algorithm – l1 objective

The new feature here is that residual sign changes trigger points of non-differentiability. SA and SD indicate addition and deletion of entries in σ. This is where the extra work is being done, as r adapts to the required sign structure. Double entries (e.g. SA followed by SD) reflect the two phases at each step of the computation.


lasso

R. Tibshirani, Regression shrinkage and selection via the lasso, JRSS B, 58(267–288)1996.

M. Osborne, B. Presnell and B. Turlach, A new approach to variable selection in least squares problems, IMA J. Num. Anal., 20(389–403)2000.

Ming Yuan and Hui Zou, Efficient global approximation of generalised nonlinear l1-regularised solution paths and its applications, JASA, 104(1562–1573)2009.

M. Osborne and B. Turlach, A homotopy algorithm for the quantile regression lasso and related piecewise linear problems, J. Computational and Graphical Statistics, 20(972–987)2011.


Exploiting regularisation

H. Bondell and B. Reich, Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR, Biometrics, 64(115–123)2008.

Hui Zou and Trevor Hastie, Regularization and variable selection via the elastic net, JRSS B, 67(301–320)2005.

B. Turlach, W. Venables and S. Wright, Simultaneous variable selection, Technometrics, 47(349–363)2005.


compressed sensing

A. Bruckstein, D. Donoho and M. Elad, From sparse solutions of systems of equations to sparse modelling of signals and images, SIAM Rev., 51(3–33)2009.
