Quadratic approximation on SCAD penalized estimation

Sunghoon Kwon a, Hosik Choi b, Yongdai Kim a,*

a Seoul National University, Republic of Korea
b Hoseo University, Republic of Korea

* Corresponding author. Tel.: +82 2 880 9091; fax: +82 2 883 6144. E-mail address: [email protected] (Y. Kim).

Computational Statistics and Data Analysis 55 (2011) 421–428. doi:10.1016/j.csda.2010.05.009

Article info

Article history:
Received 29 October 2009
Received in revised form 16 April 2010
Accepted 9 May 2010
Available online 1 June 2010

Keywords:
Penalized approach
Quadratic approximation
SCAD
Variable selection

Abstract

In this paper, we propose a method of quadratic approximation that unifies various types of smoothly clipped absolute deviation (SCAD) penalized estimations. For convenience, we call it the quadratically approximated SCAD penalized estimation (Q-SCAD). We prove that the proposed Q-SCAD estimator achieves the oracle property and requires only the least angle regression (LARS) algorithm for computation. Numerical studies including simulations and real data analysis confirm that the Q-SCAD estimator performs as efficiently as the original SCAD estimator.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Sparse penalized methods are useful techniques for selecting relevant variables in many practical problems. For example, Tibshirani (1996) introduced the least absolute shrinkage and selection operator (LASSO) and found that it can perform parameter estimation and variable selection simultaneously. Another popular method, the smoothly clipped absolute deviation (SCAD) penalized estimation, was proposed by Fan and Li (2001) and Fan and Peng (2004). They proved that the SCAD estimator has the oracle property, that is, the SCAD estimator is asymptotically equivalent to the oracle estimator. Here, the oracle estimator is the estimator obtained by deleting all irrelevant predictive variables (i.e., variables whose true regression coefficients are zero) in advance.

Several theoretical results about sparse penalized approaches have been studied. For the LASSO, Knight and Fu (2000) studied asymptotic properties of LASSO-type estimators with a fixed number of parameters. Zou (2006) developed the adaptive LASSO, which has the oracle property when the weights over the shrinkage parameters are controlled properly. For high-dimensional cases, where the number of parameters exceeds the sample size, the sign consistency of the LASSO estimator was proved by Zhao and Yu (2006) and by Meinshausen and Bühlmann (2006). For the SCAD, Fan and Li (2001) and Fan and Peng (2004) proved that the SCAD estimator achieves the oracle property for the case of a diverging number of parameters, and this result was extended to high-dimensional cases by Kim et al. (2008a).

Computational complexity should also be considered when using sparse penalized methods. For the LASSO, Efron et al. (2004) developed the least angle regression (LARS) algorithm, which can find the entire solution path of the LASSO estimator exactly. A similar path-finding algorithm was proposed by Rosset and Zhu (2007) for families of regularized problems that have the piecewise quadratic property. For generalized linear models, Kim et al. (2008b) suggested a gradient descent algorithm and Park and Hastie (2007) introduced an approximate path-finding algorithm using the idea of the LARS algorithm. For the SCAD, computational techniques are more involved since the SCAD penalty is nonconvex.


Fan and Li (2001) suggested an iterative local quadratic approximation (LQA) algorithm so that a modified Newton–Raphson algorithm can be applied. Kim et al. (2008a) and Wu and Liu (2009) proposed concave–convex procedure (CCCP) techniques to find an exact local minimizer of the SCAD penalized loss function, and Zou and Li (2008) introduced a local linear approximation algorithm and proved that their approximation is the tightest convex upper bound of the SCAD penalty function.

Recently, Wang and Leng (2007) proposed a method of least squares approximation (LSA) which provides a simple unified framework applicable to most LASSO estimations. The LSA estimator retains most of the properties of the original LASSO estimator and can be computed easily by adapting the LARS algorithm. In this paper, we propose a similar method for unifying various types of SCAD penalized estimations by extending the idea of Wang and Leng (2007). We call the proposed method the quadratically approximated SCAD penalized estimation (Q-SCAD). We prove that the Q-SCAD estimator has the oracle property, as does the original SCAD estimator, and propose a simple and efficient computational algorithm that requires only the LARS algorithm. In particular, for general convex loss functions, the Q-SCAD has the same computational cost as SCAD penalized linear regression since the objective function to be minimized is quadratic. By combining the CCCP and LARS algorithms as proposed by Kim et al. (2008a), the Q-SCAD significantly improves the applicability of the SCAD estimator to real problems.

The rest of the paper is organized as follows. Section 2 introduces the Q-SCAD estimator, and Section 3 presents asymptotic results. Section 4 reports the results of the numerical studies. Conclusions and technical details are given in Section 5 and the Appendix, respectively.

2. Quadratically approximated SCAD estimator

In this section, we propose a unified quadratic approximation method for various types of SCAD penalized estimation.

2.1. Model and notations

Assume that the true parameter θ* consists of q nonzero elements and p − q exactly zero elements. Without loss of generality, we write

$$\theta^* = (\theta^*_1, \theta^*_2, \ldots, \theta^*_p)^T = (\theta^{*T}_1, \theta^{*T}_2)^T = (\theta^{*T}_1, 0^T)^T,$$

where θ*_1 contains the q nonzero elements of θ* and θ*_2 consists of the remaining p − q zero elements. Similarly, we write θ_1 for the vector of the first q elements of a p-dimensional vector θ and θ_2 for the rest. We denote by Σ_11 the first q × q submatrix of a given p × p matrix Σ, and we define Σ_12, Σ_21 and Σ_22 similarly, so that

$$\theta = (\theta_1^T, \theta_2^T)^T \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

We consider the SCAD penalized estimator, which is obtained by minimizing the penalized empirical risk function

$$L(\theta) + \sum_{j=1}^{p} J_\lambda(\theta_j),$$

where L(θ) is the empirical risk and J_λ(θ) is the SCAD penalty given by

$$J_\lambda(\theta) = \begin{cases} \lambda|\theta|, & 0 \le |\theta| < \lambda, \\[4pt] \dfrac{a\lambda(|\theta| - \lambda) - (\theta^2 - \lambda^2)/2}{a - 1} + \lambda^2, & \lambda \le |\theta| < a\lambda, \\[4pt] \dfrac{(a - 1)\lambda^2}{2} + \lambda^2, & a\lambda \le |\theta|, \end{cases} \qquad (1)$$

for some a > 2. The empirical risk L(θ) is the sum of the squared residuals for linear regression models or the negative log-likelihood for maximum likelihood estimators. The oracle estimator is defined by

$$\hat\theta^o = (\hat\theta^{oT}_1, 0^T)^T, \qquad (2)$$

where

$$\hat\theta^o_1 = \arg\min_{\theta_1} L\big((\theta_1^T, 0^T)^T\big).$$

Note that the oracle estimator θ̂^o is an ideal estimator which is not available in practice. However, Fan and Li (2001) proved that the SCAD estimator is asymptotically equivalent to the oracle estimator. This implies that we can use the SCAD estimator as an alternative to the ideal oracle estimator.


2.2. Quadratic approximation on SCAD

If the empirical risk function is smooth enough, then it can be approximated at a given point θ^c by a quadratic function:

$$L(\theta) \approx L(\theta^c) + (\theta - \theta^c)^T \frac{\partial L(\theta^c)}{\partial \theta} + \frac{1}{2}(\theta - \theta^c)^T \frac{\partial^2 L(\theta^c)}{\partial \theta^2}(\theta - \theta^c).$$

If we set θ^c = θ̂^c, where θ̂^c = argmin_θ L(θ) is the non-penalized estimator, then we have the simpler approximation

$$L(\theta) \approx L(\hat\theta^c) + \frac{1}{2}(\theta - \hat\theta^c)^T \frac{\partial^2 L(\hat\theta^c)}{\partial \theta^2}(\theta - \hat\theta^c).$$

Since the non-penalized estimator achieves √n-consistency in most statistical models, this approximation behaves well. In some cases (e.g., quantile regression), the loss function is not differentiable, and hence the approximation above is not valid. However, this problem can be handled whenever

$$\sqrt{n}(\hat\theta^c - \theta^*) \to_d N(0, \Sigma)$$

for some positive definite matrix Σ. Suppose we have a consistent and invertible estimator n·var(θ̂^c) of Σ. Replacing the term ∂²L(θ̂^c)/∂θ² with var(θ̂^c)^{-1}/n, we obtain a quadratic approximation of L(θ):

$$L(\theta) \approx L(\hat\theta^c) + \frac{1}{2}(\theta - \hat\theta^c)^T H(\hat\theta^c)(\theta - \hat\theta^c),$$

where H(θ̂^c) = var(θ̂^c)^{-1}/n.

We define the quadratically approximated SCAD (Q-SCAD) estimator as

$$\hat\theta = \arg\min_{\theta} \Big( Q(\theta) + \sum_{j=1}^{p} J_\lambda(\theta_j) \Big), \qquad (3)$$

where

$$Q(\theta) = (\theta - \hat\theta^c)^T H(\hat\theta^c)(\theta - \hat\theta^c)/2.$$

2.3. Computation

An efficient algorithm to minimize the objective function in (3) is the CCCP algorithm proposed by Kim et al. (2008a). By decomposing the SCAD penalty function into the sum of a concave and a convex function, we can use the LARS algorithm to solve the inner loops of the CCCP algorithm. We summarize the CCCP algorithm for the Q-SCAD as follows.

Q-SCAD algorithm

A. Find the non-penalized solution θ̂^c and H(θ̂^c) = var(θ̂^c)^{-1}/n.
B. Do until convergence:
   a. Let
      $$U(\theta) = Q(\theta) + \sum_{j=1}^{p}\Big(\frac{\partial}{\partial \theta_j} J_\lambda(\hat\theta^c_j) - \lambda\,\mathrm{sign}(\hat\theta^c_j)\Big)\theta_j + \lambda \sum_{j=1}^{p} |\theta_j|.$$
   b. Return θ̂^c = argmin_θ U(θ).
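
A rough Python sketch of this loop follows, using scikit-learn's LassoLars as the LARS solver for the inner l1-penalized quadratic problem; a Cholesky factorization H = RᵀR turns that inner problem into an ordinary lasso. The helper names, the a = 3.7 default, and the stopping rule are our choices rather than the paper's.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def scad_concave_grad(theta, lam, a=3.7):
    """Gradient of the concave part J_lambda(theta_j) - lam*|theta_j|, elementwise."""
    t = np.abs(theta)
    dJ = np.where(t < lam, lam, np.where(t < a * lam, (a * lam - t) / (a - 1.0), 0.0))
    return (dJ - lam) * np.sign(theta)

def q_scad(theta_c, cov_c, n, lam, a=3.7, max_iter=50, tol=1e-8):
    """CCCP iterations for the Q-SCAD estimator (Section 2.3), a sketch."""
    theta_c = np.asarray(theta_c, dtype=float)
    p = theta_c.size
    H = np.linalg.inv(cov_c) / n              # H(theta-hat^c) = var(theta-hat^c)^{-1} / n
    R = np.linalg.cholesky(H).T               # H = R^T R with R upper triangular
    theta = theta_c.copy()
    for _ in range(max_iter):
        v = scad_concave_grad(theta, lam, a)  # step B.a: linearize the concave part
        # U(theta) = (1/2)||R theta - b||^2 + lam*||theta||_1 + const, with R^T b = H theta_c - v
        b = np.linalg.solve(R.T, H @ theta_c - v)
        # LassoLars minimizes ||b - R theta||^2 / (2p) + alpha*||theta||_1, so alpha = lam / p
        theta_new = LassoLars(alpha=lam / p, fit_intercept=False).fit(R, b).coef_
        if np.max(np.abs(theta_new - theta)) < tol:   # step B.b: stop when the iterate settles
            theta = theta_new
            break
        theta = theta_new
    return theta
```

On the first pass the concave part is linearized at θ̂^c itself, exactly as in step B.a; subsequent passes re-linearize at the current iterate, which is how the loop in step B proceeds.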

3. Asymptotic properties

In this section, we study asymptotic properties of the Q-SCAD estimator when the non-penalized estimator follows the normal distribution asymptotically.

Regularity condition

A. The non-penalized estimator θ̂^c satisfies
$$\sqrt{n}(\hat\theta^c - \theta^*) \to_d N(0, \Sigma)$$
for some p × p positive definite matrix Σ.


First, we will prove that the following estimator

$$\hat\theta^g = (\hat\theta^{gT}_1, 0^T)^T \qquad (4)$$

is a local minimizer of C(θ), where

$$\hat\theta^g_1 = \arg\min_{\theta_1} Q\big((\theta_1^T, 0^T)^T\big)$$

and

$$C(\theta) = Q(\theta) + \sum_{j=1}^{p} J_\lambda(\theta_j). \qquad (5)$$

That is, we prove that the Q-SCAD estimator possesses the oracle property. In fact, by simple algebra, we have

$$\hat\theta^g_1 = \hat\theta^c_1 + H(\hat\theta^c)_{11}^{-1} H(\hat\theta^c)_{12}\,\hat\theta^c_2, \qquad (6)$$

which is exactly the same as the oracle estimator θ̂^o_1 in linear regression. This nice property holds asymptotically for general models.
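
For completeness, the "simple algebra" behind (6) amounts to setting θ_2 = 0 in Q(θ) and solving the first-order condition in θ_1:

$$\frac{\partial}{\partial \theta_1} Q\big((\theta_1^T, 0^T)^T\big) = H(\hat\theta^c)_{11}(\theta_1 - \hat\theta^c_1) + H(\hat\theta^c)_{12}(0 - \hat\theta^c_2) = 0 \;\Longrightarrow\; \hat\theta^g_1 = \hat\theta^c_1 + H(\hat\theta^c)_{11}^{-1} H(\hat\theta^c)_{12}\,\hat\theta^c_2.$$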

Theorem 1 (Oracle Property). Under condition A, the estimator θ̂^g in (4) is a local minimizer of (5) with probability tending to 1, and it satisfies

$$\sqrt{n}(\hat\theta^g_1 - \theta^*_1) \to_d N(0, \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}),$$

provided λ_n → 0 and √n λ_n → ∞ as n → ∞.

Theorem 1 implies that if we choose a proper tuning parameter λ_n, we can find a √n-consistent local minimizer that has the oracle property.

Remark 1. Note that the oracle property holds only when the asymptotic covariance matrix in Theorem 1 is the same as that of the oracle estimator θ̂^o_1. This condition is easily satisfied in many problems under regularity conditions such as the covariance condition used in Wang and Leng (2007). On the other hand, without any additional assumption, the result in Theorem 1 still holds although the asymptotic covariance matrix in Theorem 1 may not be equal to that of the oracle estimator θ̂^o_1.

In linear regression, Kim et al. (2008a) showed a key property of the SCAD estimator: the oracle estimator is the global minimizer of the SCAD penalized sum of squared residuals. We give a similar result for the Q-SCAD estimator in the following theorem.

Theorem 2 (Global Optimality). Under condition A, the estimator θ̂^g in (4) is the global minimizer of (5) with probability tending to 1, i.e.,

$$P\Big(C(\hat\theta^g) = \inf_{\theta} C(\theta)\Big) \to 1,$$

provided λ_n → 0 and √n λ_n → ∞ as n → ∞.

Theorem 2 implies that P(θ̂ = θ̂^g) → 1. That is, we can eventually find θ̂^g by finding the global minimizer of C(θ).

4. Numerical studies

In this section, we investigate the finite sample performance of the Q-SCAD estimator through simulations as well as real data analysis. We compare the Q-SCAD estimator with the original SCAD estimator and the LSA estimator of Wang and Leng (2007), as well as with the non-penalized estimator. We choose the tuning parameter λ using the BIC criterion

$$\mathrm{BIC}(\hat\theta(\lambda)) = -2L(\hat\theta(\lambda)) + \mathrm{DF}(\hat\theta(\lambda))\,\log n / n,$$

where DF(θ̂(λ)) is the number of coefficients estimated to be nonzero. This type of tuning parameter selection is shown to guarantee selection consistency by Wang and Leng (2007) and Wang et al. (2007).

Example 1 (Logistic Regression). First, we consider a sparse logistic regression model:

$$P(y = 1 \mid x) = \exp(x^T\theta)/(1 + \exp(x^T\theta)).$$

We set the true parameter θ = (3, 1.5, 0, 0, 2, 0, 0, 0, 0)^T, and the correlation between any two covariates x_i and x_j is fixed to be 0.5^{|i−j|}. This model is similar to those considered in many other papers; see, for example, Tibshirani (1996), Fan and Li (2001) and Zou (2006). With sample sizes n = 50, 100 and 200, we repeat the simulation 300 times.
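
For readers who want to reproduce a single replication, here is a sketch of one simulated dataset from this design together with the non-penalized fit that the Q-SCAD needs as input. The standard normal marginals for the covariates, the sample size n = 100, and the use of statsmodels are assumptions on our part; q_scad and select_lambda_bic refer to the earlier sketches.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
theta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0])
p = theta_true.size

# Covariates with corr(x_i, x_j) = 0.5^|i-j| (standard normal marginals assumed here)
Sigma_x = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

# Non-penalized (maximum likelihood) fit: theta-hat^c and its estimated covariance
fit_c = sm.Logit(y, X).fit(disp=0)
theta_c, cov_c = np.asarray(fit_c.params), np.asarray(fit_c.cov_params())

# Q-SCAD over a lambda grid with BIC, reusing the sketches from Sections 2.3 and 4:
# lams = np.exp(np.linspace(np.log(0.01), np.log(1.0), 30))
# bic, lam, theta_hat = select_lambda_bic(lambda th: fit_c.model.loglike(th) / n,
#                                         theta_c, cov_c, n, lams)
```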


Table 1
Simulation results for the logistic regression model.

n    Method   TNL              RME              Noise            Signal
50   SCAD     0.3644 (0.0009)  0.5876 (0.0068)  0.16 (0.0065)    2.31 (0.0097)
     Q-SCAD   0.3621 (0.0009)  0.5376 (0.0063)  0.28 (0.0096)    2.47 (0.0092)
     LSA      0.3818 (0.0009)  0.7124 (0.0053)  1.01 (0.0166)    2.44 (0.0093)
     NPE      0.4661 (0.0017)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   0.3165 (0.0004)  0.2467 (0.0029)  0.00 (0.0000)    3.00 (0.0000)
100  SCAD     0.3107 (0.0005)  0.3918 (0.0061)  0.18 (0.0062)    2.85 (0.0052)
     Q-SCAD   0.3104 (0.0005)  0.3882 (0.0054)  0.28 (0.0086)    2.88 (0.0048)
     LSA      0.3192 (0.0004)  0.5922 (0.0051)  0.89 (0.0147)    2.85 (0.0057)
     NPE      0.3498 (0.0006)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   0.2952 (0.0002)  0.2654 (0.0028)  0.00 (0.0000)    3.00 (0.0000)
200  SCAD     0.2916 (0.0002)  0.3771 (0.0033)  0.12 (0.0052)    2.99 (0.0012)
     Q-SCAD   0.2905 (0.0002)  0.3649 (0.0033)  0.13 (0.0053)    2.99 (0.0012)
     LSA      0.2942 (0.0002)  0.5269 (0.0035)  0.49 (0.0100)    2.99 (0.0012)
     NPE      0.3104 (0.0003)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   0.2886 (0.0002)  0.3239 (0.0026)  0.00 (0.0000)    3.00 (0.0000)

Table 2
Simulation results for the Poisson regression model.

n    Method   TNL               RME              Noise            Signal
50   SCAD     −12.975 (0.0359)  0.4944 (0.0116)  0.73 (0.0109)    2.99 (0.0016)
     Q-SCAD   −12.983 (0.0359)  0.4760 (0.0090)  0.78 (0.0116)    2.99 (0.0014)
     LSA      −13.000 (0.0359)  0.4486 (0.0080)  0.33 (0.0090)    3.00 (0.0000)
     NPE      −12.773 (0.0365)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   −13.065 (0.0362)  0.3041 (0.0092)  0.00 (0.0000)    3.00 (0.0000)
100  SCAD     −13.062 (0.0314)  0.5760 (0.0090)  0.54 (0.0116)    3.00 (0.0000)
     Q-SCAD   −13.062 (0.0314)  0.5708 (0.0086)  0.55 (0.0117)    3.00 (0.0000)
     LSA      −13.068 (0.0314)  0.5141 (0.0076)  0.15 (0.0060)    3.00 (0.0000)
     NPE      −13.007 (0.0311)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   −13.077 (0.0314)  0.4184 (0.0077)  0.00 (0.0000)    3.00 (0.0000)
200  SCAD     −13.136 (0.0326)  0.5120 (0.0069)  0.31 (0.0079)    3.00 (0.0000)
     Q-SCAD   −13.136 (0.0326)  0.5122 (0.0071)  0.32 (0.0081)    3.00 (0.0000)
     LSA      −13.137 (0.0326)  0.4860 (0.0061)  0.08 (0.0045)    3.00 (0.0000)
     NPE      −13.116 (0.0325)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   −13.139 (0.0326)  0.4383 (0.0056)  0.00 (0.0000)    3.00 (0.0000)

The simulation results are summarized in Table 1. The TNL is the average of the mean test negative log-likelihood obtained from 5000 independent test samples. The RME is the median of the relative model error, which is the ratio of the model error of a penalized estimator to that of the non-penalized one. Here, the model error is defined as the mean squared difference between the estimated and true means of the response variable (Fan and Li, 2001). The columns "Signal" and "Noise" report the average numbers of truly nonzero and truly zero coefficients, respectively, that are estimated to be nonzero. The corresponding estimated standard errors are given in parentheses.

As can be seen from Table 1, the Q-SCAD estimator performs better than the SCAD estimator as well as the LSA estimator in terms of prediction accuracy, relative model error and selectivity regardless of the sample size, while the non-penalized estimator performs the worst.

Example 2 (Poisson Regression). We consider a sparse Poisson regression model:

$$y \mid x \sim \mathrm{Poisson}\big(\exp(x^T\theta)\big).$$

We set the true parameter θ = (1.2, 0.6, 0, 0, 0.8, 0, 0, 0, 0)^T as in Zou and Li (2008). The other settings are the same as those of Example 1. We ignored the normalizing constants when we calculated the TNL.

The results are summarized in Table 2. The prediction accuracy, relative model error and selectivity of the LSA estimator are the best regardless of the sample size, while the SCAD estimator performs the worst. However, the differences decrease as the sample size increases.

Example 3 (Least Absolute Deviation Regression). In this example, we consider a sparse least absolute deviation regression. Data were generated from

$$y = x^T\theta + \varepsilon,$$

where ε ∼ N(0, 1). The true coefficient vector θ is set to be the same as that in Example 1. To measure the prediction error, we calculated the test least absolute deviation (TLAD) value between the estimated and true means based on an independent test dataset of size 5000.


Table 3
Simulation results for the least absolute deviation regression.

n    Method   TLAD             RME              Noise            Signal
50   SCAD     0.4265 (0.0003)  0.3528 (0.0040)  0.06 (0.0035)    3.00 (0.0000)
     Q-SCAD   0.4316 (0.0004)  0.4323 (0.0036)  0.07 (0.0090)    3.00 (0.0000)
     LSA      0.4316 (0.0004)  0.4457 (0.0033)  0.09 (0.0048)    3.00 (0.0000)
     NPE      0.4697 (0.0005)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   0.4240 (0.0003)  0.3260 (0.0038)  0.00 (0.0000)    3.00 (0.0000)
100  SCAD     0.4115 (0.0001)  0.3452 (0.0035)  0.03 (0.0024)    3.00 (0.0000)
     Q-SCAD   0.4123 (0.0001)  0.3563 (0.0030)  0.06 (0.0039)    3.00 (0.0000)
     LSA      0.4124 (0.0002)  0.3906 (0.0029)  0.02 (0.0023)    3.00 (0.0000)
     NPE      0.4327 (0.0002)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   0.4108 (0.0001)  0.3375 (0.0035)  0.00 (0.0000)    3.00 (0.0000)
200  SCAD     0.4059 (0.0001)  0.3905 (0.0037)  0.01 (0.0016)    3.00 (0.0000)
     Q-SCAD   0.4057 (0.0001)  0.4072 (0.0030)  0.01 (0.0014)    3.00 (0.0000)
     LSA      0.4060 (0.0001)  0.4195 (0.0030)  0.01 (0.0012)    3.00 (0.0000)
     NPE      0.4149 (0.0001)  1.0000 (0.0000)  6.00 (0.0000)    3.00 (0.0000)
     Oracle   0.4057 (0.0001)  0.3846 (0.0037)  0.00 (0.0000)    3.00 (0.0000)

Table 4
Estimated coefficients for the South Africa Heart Disease Dataset.

       SCAD     Q-SCAD   LSA      NPE      p-value
int    −1.2549  −1.2457  −1.1509  −1.2631  0
sbp    0        0        0        0.1333   0.2563
tob    0.3692   0.3678   0.2960   0.3646   0.0028
ldl    0.3355   0.3335   0.2683   0.3602   0.0035
adi    0        0        0        0.1446   0.5257
fam    0.9082   0.9005   0.7854   0.9254   0
typ    0.3644   0.3616   0.2691   0.3887   0.0013
obe    0        0        0        −0.2651  0.1550
alc    0        0        0        0.003    0.9783
age    0.7372   0.7326   0.6899   0.6607   0.0001

Table 3 shows that all the methods perform similarly, but the SCAD estimator does slightly better than the Q-SCAD estimator, and the LSA estimator is the worst.

Example 4 (South Africa Heart Disease Dataset). This dataset is a retrospective sample of males in a heart disease high-risk region of the Western Cape, South Africa. In this dataset, 9 variables are considered to classify a binary response variable, which indicates the presence or absence of myocardial infarction, for 462 subjects. For more information, see http://www-stat.stanford.edu/~tibs/ElemStatLearn. In Table 4, the estimated coefficients are presented. The column "p-value" indicates the significance of the non-penalized estimates. As expected, the three sparse methods selected the same covariates, and the coefficients estimated to be zero had relatively larger p-values. It is interesting that the coefficients estimated by the SCAD and Q-SCAD are similar to each other as well as to the non-penalized ones, while those of the LSA are relatively smaller. This is partly because the SCAD and Q-SCAD estimators are unbiased while the LSA estimator is not. This example shows that the Q-SCAD estimator can be used in real data analysis without loss of efficiency.

5. Conclusions

We proposed a method of quadratic approximation that unifies various types of SCAD penalized estimations. Theoretically, the proposed method satisfies the oracle property, and the numerical studies confirm that it performs well with moderate sample sizes.

It is an interesting problem to extend the proposed unified approach for SCAD penalized estimation to high-dimensional situations. In that setting, it is not clear how to choose an initial estimator around which the loss function is quadratically approximated. We leave this problem as future work.

Acknowledgements

We are grateful to the anonymous referees, the associate editor, and the editor for their helpful comments. Kwon's research was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Education, Science and Technology (MEST)/Korea Science and Engineering Foundation (KOSEF), grant number R11-2008-007-01002-0. Kim's work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF), grant number KRF-2008-314-C00046.


Appendix

Proof of Theorem 1. Direct calculation yields

$$\sqrt{n}(\hat\theta^g_1 - \theta^*_1) = \sqrt{n}(\hat\theta^c_1 - \theta^*_1) + \sqrt{n}\, H(\hat\theta^c)_{11}^{-1} H(\hat\theta^c)_{12}\,\hat\theta^c_2.$$

From the blockwise inversion formula, we have

$$H(\hat\theta^c) \to_p \begin{pmatrix} \Sigma_{(11)}^{-1} & -\Sigma_{(11)}^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{(11)}^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}\Sigma_{(11)}^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix},$$

where Σ_(11) = Σ_11 − Σ_12 Σ_22^{-1} Σ_21. Using H(θ̂^c)_{11}^{-1} H(θ̂^c)_{12} →_p −Σ_12 Σ_22^{-1}, it is easy to check that

$$\sqrt{n}(\hat\theta^g_1 - \theta^*_1) \to_d N(0, \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}).$$
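
These two convergence facts follow from the standard formula for the inverse of a partitioned positive definite matrix; a quick numerical check (ours, not part of the paper) for a randomly generated Σ is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 6, 3
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)                       # a positive definite Sigma
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

H = np.linalg.inv(Sigma)                              # the probability limit of H(theta-hat^c)
S_11 = S11 - S12 @ np.linalg.inv(S22) @ S21           # Sigma_(11), the Schur complement

# (1,1) block of Sigma^{-1} equals Sigma_(11)^{-1}
assert np.allclose(H[:q, :q], np.linalg.inv(S_11))
# H_{11}^{-1} H_{12} equals -Sigma_12 Sigma_22^{-1}
assert np.allclose(np.linalg.inv(H[:q, :q]) @ H[:q, q:], -S12 @ np.linalg.inv(S22))
```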

Now, it remains to show that the estimator θ̂^g is a local minimizer of the objective function in (5) with probability tending to 1. Let

$$S(\theta) = \partial Q(\theta)/\partial\theta = H(\hat\theta^c)(\theta - \hat\theta^c);$$

then

$$S(\hat\theta^g) = \begin{pmatrix} S(\hat\theta^g)_1 \\ S(\hat\theta^g)_2 \end{pmatrix} = \begin{pmatrix} H(\hat\theta^c)_{11}(\hat\theta^g_1 - \hat\theta^c_1) - H(\hat\theta^c)_{12}\hat\theta^c_2 \\ H(\hat\theta^c)_{21}(\hat\theta^g_1 - \hat\theta^c_1) - H(\hat\theta^c)_{22}\hat\theta^c_2 \end{pmatrix}.$$

By the Karush–Kuhn–Tucker (KKT) condition (see, for example, p. 320 of Bertsekas (1999)), it suffices to check the following two conditions:

$$S(\hat\theta^g)_j + \partial J_{\lambda_n}(\hat\theta^g_j)/\partial\theta_j = 0 \quad \text{for all } \hat\theta^g_j \ne 0$$

and

$$|S(\hat\theta^g)_j| \le \lambda_n \quad \text{for all } \hat\theta^g_j = 0.$$

By the definition of θ̂^g, we have S(θ̂^g)_1 = 0; hence it suffices to show that

$$P\Big(\min_{j \le q} |\hat\theta^g_j| \ge a\lambda_n\Big) \to 1 \qquad (7)$$

and

$$P\Big(\max_{j > q} |S(\hat\theta^g)_j| \le \lambda_n\Big) \to 1 \qquad (8)$$

as n → ∞. From Eq. (6), we have

$$\|\hat\theta^g_1 - \theta^*_1\|/\lambda_n \le \|\hat\theta^c_1 - \theta^*_1\|/\lambda_n + \|H(\hat\theta^c)_{11}^{-1} H(\hat\theta^c)_{12}\,\hat\theta^c_2\|/\lambda_n = O_p\big(1/(\sqrt{n}\lambda_n)\big)$$

and

$$\|S(\hat\theta^g)_2\| \le \|H(\hat\theta^c)_{21}(\hat\theta^g_1 - \hat\theta^c_1)\| + \|H(\hat\theta^c)_{22}\hat\theta^c_2\| = O_p(1/\sqrt{n}).$$

Hence it is easy to check that

$$\min_{j \le q} |\hat\theta^g_j|/\lambda_n \ge \min_{j \le q} |\theta^*_j|/\lambda_n - \|\hat\theta^g_1 - \theta^*_1\|/\lambda_n \to_p \infty$$

and ‖S(θ̂^g)_2‖ = o_p(λ_n), which imply (7) and (8), respectively. This completes the proof. □

Proof of Theorem 2. From simple algebra, we get

$$C(\theta) - C(\hat\theta^g) = S(\hat\theta^g)_2^T \theta_2 + (\theta - \hat\theta^g)^T H(\hat\theta^c)(\theta - \hat\theta^g)/2 + \sum_{j=1}^{p}\big(J_{\lambda_n}(\theta_j) - J_{\lambda_n}(\hat\theta^g_j)\big)$$
$$\ge -o_p(\lambda_n)\sum_{j=q+1}^{p}|\theta_j| + \rho\sum_{j=1}^{p}(\theta_j - \hat\theta^g_j)^2/2 + \sum_{j=1}^{p}\big(J_{\lambda_n}(\theta_j) - J_{\lambda_n}(\hat\theta^g_j)\big)$$
$$= \sum_{j=1}^{p} w_j,$$


where

$$w_j = -o_p(\lambda_n)|\theta_j|\, I(j > q) + \rho(\theta_j - \hat\theta^g_j)^2/2 + J_{\lambda_n}(\theta_j) - J_{\lambda_n}(\hat\theta^g_j)$$

and ρ is the minimum eigenvalue of Σ^{-1}. First, consider the cases where j ≤ q. If |θ_j| ≥ aλ_n, then J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j) = 0, hence w_j ≥ 0. If |θ_j| < aλ_n, then |J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j)| ≤ λ_n|θ_j − θ̂^g_j| and |θ_j − θ̂^g_j| ≥ |θ*_j| − |θ̂^g_j − θ*_j| − aλ_n = O_p(1), and so w_j ≥ (O_p(1) − λ_n)|θ_j − θ̂^g_j| ≥ 0 for sufficiently large n. Next, consider the cases j > q. If |θ_j| ≥ λ_n, then w_j ≥ −o_p(λ_n)|θ_j| + ρ|θ_j|²/2 ≥ 0. If |θ_j| < λ_n, then w_j ≥ λ_n|θ_j| − o_p(λ_n)|θ_j| ≥ 0. Hence, for all 1 ≤ j ≤ p, w_j ≥ 0 for sufficiently large n. This completes the proof. □

References

Bertsekas, D.P., 1999. Nonlinear Programming, second ed. Athena Scientific, Belmont, MA.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The Annals of Statistics 32, 407–499.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Fan, J., Peng, H., 2004. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 32, 928–961.
Kim, Y., Choi, H., Oh, H., 2008a. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association 103, 1656–1673.
Kim, J., Kim, Y., Kim, Y., 2008b. A gradient-based optimization algorithm for lasso. Journal of Computational and Graphical Statistics 17, 994–1009.
Knight, K., Fu, W.J., 2000. Asymptotics for lasso-type estimators. The Annals of Statistics 28, 1356–1378.
Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 1436–1462.
Park, M., Hastie, T., 2007. l1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69, 659–667.
Rosset, S., Zhu, J., 2007. Piecewise linear regularized solution paths. The Annals of Statistics 35.
Tibshirani, R.J., 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B 58, 267–288.
Wang, H., Leng, C., 2007. Unified lasso estimation by least squares approximation. Journal of the American Statistical Association 102, 1039–1048.
Wang, H., Li, R., Tsai, C., 2007. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94, 553–568.
Wu, Y., Liu, Y., 2009. Variable selection in quantile regression. Statistica Sinica 19, 801–817.
Zhao, P., Yu, B., 2006. On model selection consistency of lasso. Journal of Machine Learning Research 7, 2541–2563.
Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zou, H., Li, R., 2008. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36, 1509–1533.