Computational Statistics and Data Analysis 55 (2011) 421–428
Quadratic approximation on SCAD penalized estimation

Sunghoon Kwon a, Hosik Choi b, Yongdai Kim a,∗

a Seoul National University, Republic of Korea
b Hoseo University, Republic of Korea
a r t i c l e  i n f o

Article history:
Received 29 October 2009
Received in revised form 16 April 2010
Accepted 9 May 2010
Available online 1 June 2010

Keywords:
Penalized approach
Quadratic approximation
SCAD
Variable selection
a b s t r a c t

In this paper, we propose a method of quadratic approximation that unifies various types of smoothly clipped absolute deviation (SCAD) penalized estimations. For convenience, we call it the quadratically approximated SCAD penalized estimation (Q-SCAD). We prove that the proposed Q-SCAD estimator achieves the oracle property and requires only the least angle regression (LARS) algorithm for computation. Numerical studies including simulations and real data analysis confirm that the Q-SCAD estimator performs as efficiently as the original SCAD estimator.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
The sparse penalized methods are useful techniques for selecting relevant variables in many practical problems. For example, Tibshirani (1996) introduced the least absolute shrinkage and selection operator (LASSO) and found that it can perform parameter estimation and variable selection simultaneously. Another popular method, the smoothly clipped absolute deviation (SCAD) penalized estimation, was proposed by Fan and Li (2001) and Fan and Peng (2004). They proved that the SCAD estimator has the oracle property, namely the asymptotic equivalence of the SCAD estimator with the oracle estimator. Here, the oracle estimator is an estimator obtained by deleting all irrelevant predictive variables (i.e., variables whose true regression coefficients are zero) in advance.

Several theoretical results about sparse penalized approaches have been studied. For the LASSO, Knight and Fu (2000) studied asymptotic properties of LASSO-type estimators with a fixed number of parameters. Zou (2006) developed the adaptive LASSO, which has the oracle property when the weights over the shrinkage parameters are controlled properly. For high-dimensional cases, where the number of parameters exceeds the sample size, the sign consistency of the LASSO estimator was proved by Zhao and Yu (2006) and Meinshausen and Bühlmann (2006). For the SCAD, Fan and Li (2001) and Fan and Peng (2004) proved that the SCAD estimator achieves the oracle property for the case of a diverging number of parameters, and this result was extended to high-dimensional cases by Kim et al. (2008a).

Computational complexities should be considered in using sparse penalized methods. For the LASSO, Efron et al. (2004) developed the least angle regression (LARS) algorithm, which can find the entire solution path of the LASSO estimator exactly. A similar path-finding algorithm was proposed by Rosset and Zhu (2007) for families of regularized problems that have the piecewise quadratic property. For generalized linear models, Kim et al. (2008b) suggested a gradient descent algorithm and Park and Hastie (2007) introduced an approximated path-finding algorithm using the idea of the LARS algorithm. For the SCAD, computational techniques are more involved since the SCAD penalty is nonconvex. Fan and Li (2001) suggested an iterative local quadratic approximation (LQA) algorithm to apply a modified Newton–Raphson algorithm. Kim et al. (2008a)
∗ Corresponding author. Tel.: +82 2 880 9091; fax: +82 2 883 6144.E-mail address: [email protected] (Y. Kim).
0167-9473/$ – see front matter© 2010 Elsevier B.V. All rights reserved.doi:10.1016/j.csda.2010.05.009
and Wu and Liu (2009) proposed concave–convex procedure (CCCP) techniques to find an exact local minimizer of the SCAD penalized loss function, and Zou and Li (2008) introduced a local linear approximation algorithm and proved that their approximation is the tightest convex upper bound of the SCAD penalty function.

Recently, Wang and Leng (2007) proposed a method of least squares approximation (LSA) which provides a simple unified framework applicable to most LASSO estimations. The LSA estimator still possesses most of the properties of the original LASSO estimator and can be calculated easily by adapting the LARS algorithm. In this paper, we propose a similar method of unifying various types of SCAD penalized estimations by extending the idea of Wang and Leng (2007). We call the proposed method the quadratically approximated SCAD penalized estimation (Q-SCAD). We prove that the Q-SCAD estimator has the oracle property as does the original SCAD estimator, and propose a simple and efficient computational algorithm that requires only the LARS algorithm. In particular, for general convex loss functions, the Q-SCAD has the same computational cost as SCAD penalized linear regression since the objective function to be minimized is quadratic. By combining the CCCP and LARS algorithms as proposed by Kim et al. (2008a), the Q-SCAD significantly improves the applicability of the SCAD estimator to real problems.

The rest of the paper is organized as follows. Section 2 introduces the Q-SCAD estimator, and Section 3 presents asymptotic results. Section 4 includes results of numerical studies. The conclusions and technical details are given in Section 5 and the Appendix, respectively.
2. Quadratically approximated SCAD estimator
In this section, we propose a unified method of the quadratic approximation for the various types of SCAD penalizedestimations.
2.1. Model and notations
Assume that the true parameter θ* consists of q nonzero elements and p − q exactly zero elements. Without loss of generality, we write

θ* = (θ*_1, θ*_2, ..., θ*_p)^T = (θ*^T_1, θ*^T_2)^T = (θ*^T_1, 0^T)^T,

where θ*_1 has the q nonzero elements of θ* and θ*_2 consists of the remaining p − q zero elements. Similarly, we write θ_1 for the vector of the first q elements of a p-dimensional vector θ and θ_2 for the rest. We denote by Σ_11 the first q × q submatrix of a given p × p matrix Σ, and we define Σ_12, Σ_21 and Σ_22 similarly, so that

θ = (θ^T_1, θ^T_2)^T  and  Σ = ( Σ_11  Σ_12
                                 Σ_21  Σ_22 ).
We consider the SCAD penalized estimator, which is obtained by minimizing the penalized empirical risk function

L(θ) + Σ_{j=1}^p J_λ(θ_j),

where L(θ) is the empirical risk and J_λ(θ) is the SCAD penalty given by

J_λ(θ) = λ|θ|,                                          0 ≤ |θ| < λ,
         {aλ(|θ| − λ) − (θ² − λ²)/2}/(a − 1) + λ²,      λ ≤ |θ| < aλ,      (1)
         (a − 1)λ²/2 + λ²,                              aλ ≤ |θ|,

for some a > 2. The empirical risk L(θ) is the sum of squared residuals for linear regression models or the negative log-likelihood for maximum likelihood estimators. The oracle estimator is defined by

θ̂^o = (θ̂^{oT}_1, 0^T)^T    (2)

where

θ̂^o_1 = argmin_{θ_1} L((θ^T_1, 0^T)^T).

Note that the oracle estimator θ̂^o is an ideal estimator which is not available in practice. However, Fan and Li (2001) proved that the SCAD estimator is asymptotically equivalent to the oracle estimator. This implies that we can use the SCAD estimator as an alternative to the ideal oracle estimator.
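As a concrete reference, the three-branch penalty in (1) can be coded directly. This is an illustrative sketch, not the authors' code; the value a = 3.7 is the common default from Fan and Li (2001), while the paper only requires a > 2.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty J_lambda(theta) of Eq. (1), evaluated elementwise.

    Branches: linear near zero, quadratic blend on [lam, a*lam),
    and constant (a + 1) * lam**2 / 2 beyond a*lam.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                             # 0 <= |t| < lam
    middle = (a * lam * (t - lam) - (t**2 - lam**2) / 2) / (a - 1) + lam**2
    flat = (a - 1) * lam**2 / 2 + lam**2                          # |t| >= a*lam
    return np.where(t < lam, linear, np.where(t < a * lam, middle, flat))
```

The penalty is continuous at both branch points, which is what makes the estimator behave like the lasso near zero while leaving large coefficients unpenalized at the margin.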
2.2. Quadratic approximation on SCAD
If the empirical risk function is smooth enough, then it can be approximated at a given point θ^c by a quadratic function:

L(θ) ≈ L(θ^c) + (θ − θ^c)^T ∂L(θ^c)/∂θ + (1/2)(θ − θ^c)^T ∂²L(θ^c)/∂θ² (θ − θ^c).

If we set θ^c = θ̂^c, where θ̂^c = argmin_θ L(θ) is the non-penalized estimator, then we have the simpler approximation

L(θ) ≈ L(θ̂^c) + (1/2)(θ − θ̂^c)^T ∂²L(θ̂^c)/∂θ² (θ − θ̂^c).

Since the non-penalized estimator achieves √n-consistency in most statistical models, this approximation behaves well. In some cases (e.g., quantile regression), the loss function does not have a derivative, and hence the approximation is not valid. However, this problem can be resolved if

√n(θ̂^c − θ*) →_d N(0, Σ)

for some positive definite matrix Σ. Suppose we have a consistent and invertible estimator n·var(θ̂^c) of Σ. By replacing the term ∂²L(θ̂^c)/∂θ² with var(θ̂^c)^{−1}/n, we get a quadratic approximation of L(θ):

L(θ) ≈ L(θ̂^c) + (1/2)(θ − θ̂^c)^T H(θ̂^c)(θ − θ̂^c),

where H(θ̂^c) = var(θ̂^c)^{−1}/n.

We define the quadratically approximated SCAD (Q-SCAD) estimator as

θ̂ = argmin_θ { Q(θ) + Σ_{j=1}^p J_λ(θ_j) }    (3)

where

Q(θ) = (θ − θ̂^c)^T H(θ̂^c)(θ − θ̂^c)/2.
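To make the construction concrete, here is a minimal sketch of building Q(θ) from a non-penalized fit, assuming one already has θ̂^c and an estimate of var(θ̂^c); the function and variable names are illustrative, not the paper's.

```python
import numpy as np

def make_quadratic(theta_c, var_hat, n):
    """Build Q(theta) = (theta - theta_c)^T H (theta - theta_c) / 2
    with H = var_hat^{-1} / n, following the display above."""
    H = np.linalg.inv(var_hat) / n
    def Q(theta):
        d = np.asarray(theta, dtype=float) - theta_c
        return 0.5 * d @ H @ d
    return Q, H

# Tiny example: theta_c = (1, 2), var_hat = I/4, n = 4 gives H = I.
Q, H = make_quadratic(np.array([1.0, 2.0]), np.eye(2) / 4, 4)
```

Q vanishes at θ̂^c and grows quadratically away from it, which is the only information about the loss that the Q-SCAD objective (3) retains.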
2.3. Computation
An efficient algorithm for minimizing the objective function in (3) is the CCCP algorithm proposed by Kim et al. (2008a). By decomposing the SCAD penalty function into the sum of concave and convex functions, we can use the LARS algorithm to solve the inner loops of the CCCP algorithm. We summarize the CCCP algorithm for the Q-SCAD as follows.

Q-SCAD algorithm

A. Find the non-penalized solution θ̂^c and H(θ̂^c) = var(θ̂^c)^{−1}/n.
B. Do until convergence:
   a. Let
      U(θ) = Q(θ) + Σ_{j=1}^p { ∂J_λ(θ̂^c_j)/∂θ_j − λ sign(θ̂^c_j) } θ_j + λ Σ_{j=1}^p |θ_j|.
   b. Return θ̂^c = argmin_θ U(θ).
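The loop above can be sketched in a few lines, with a plain coordinate-descent lasso solver standing in for the LARS step (the paper uses LARS; coordinate descent solves the same ℓ1-penalized inner problem and keeps the sketch self-contained). All names here are illustrative, and this is an assumption-laden sketch rather than the authors' implementation.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (elementwise); a = 3.7 is the usual default."""
    t = np.asarray(t, dtype=float)
    at = np.abs(t)
    mid = (a * lam - at) / (a - 1)
    mag = np.where(at < lam, lam, np.where(at < a * lam, mid, 0.0))
    return mag * np.sign(t)

def lasso_cd(H, b, lam, n_iter=500):
    """Coordinate descent for min_th 0.5*th^T H th - b^T th + lam*||th||_1
    (a stand-in for the LARS inner step)."""
    p = len(b)
    th = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = b[j] - H[j] @ th + H[j, j] * th[j]   # partial residual
            th[j] = np.sign(r) * max(abs(r) - lam, 0.0) / H[j, j]
    return th

def q_scad(theta_c, H, lam, a=3.7, outer=20):
    """CCCP outer loop for the Q-SCAD objective.  Each pass linearizes the
    concave part of the SCAD penalty at the current iterate via
    c_j = J'_lam(th_j) - lam*sign(th_j), leaving a lasso-type inner problem."""
    th = theta_c.copy()
    for _ in range(outer):
        c = scad_deriv(th, lam, a) - lam * np.sign(th)
        # Q(theta) + c^T theta = 0.5 th^T H th - (H theta_c - c)^T th + const
        th_new = lasso_cd(H, H @ theta_c - c, lam)
        if np.max(np.abs(th_new - th)) < 1e-8:
            return th_new
        th = th_new
    return th
```

On a toy diagonal problem this reproduces the hallmark of the SCAD penalty: a large coefficient is returned without shrinkage while small ones are set exactly to zero.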
3. Asymptotic properties
In this section, we study asymptotic properties of the Q-SCAD estimator when the non-penalized estimator asymptotically follows the normal distribution.

Regularity condition

A. The non-penalized estimator θ̂^c satisfies

√n(θ̂^c − θ*) →_d N(0, Σ)

for some p × p positive definite matrix Σ.
First, we will prove that the estimator

θ̂^g = (θ̂^{gT}_1, 0^T)^T    (4)

is a local minimizer of C(θ), where

θ̂^g_1 = argmin_{θ_1} Q((θ^T_1, 0^T)^T)

and

C(θ) = Q(θ) + Σ_{j=1}^p J_λ(θ_j).    (5)

That is, we prove that the Q-SCAD estimator possesses the oracle property. In fact, by simple algebra, we have

θ̂^g_1 = θ̂^c_1 + H(θ̂^c)^{−1}_{11} H(θ̂^c)_{12} θ̂^c_2    (6)

which is exactly the same as the oracle estimator θ̂^o_1 in linear regression. This nice property holds asymptotically for general models.
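The closed form (6) can be checked numerically: it is the stationary point of Q over the first block when the second block is pinned at zero. A small sketch with a random positive definite stand-in for H(θ̂^c) (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 2
A = rng.normal(size=(p, p))
H = A @ A.T + p * np.eye(p)          # positive definite stand-in for H(theta_c)
theta_c = rng.normal(size=p)

# Eq. (6): theta_g1 = theta_c1 + H11^{-1} H12 theta_c2
theta_g1 = theta_c[:q] + np.linalg.solve(H[:q, :q], H[:q, q:] @ theta_c[q:])

# First block of the gradient of Q at (theta_g1, 0):
# H11 (theta_g1 - theta_c1) + H12 (0 - theta_c2), which should vanish.
grad1 = H[:q, :q] @ (theta_g1 - theta_c[:q]) - H[:q, q:] @ theta_c[q:]
```

Using `solve` rather than forming the inverse of H_11 explicitly is the usual numerically stable choice.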
Theorem 1 (Oracle Property). Under condition A, the estimator θ̂^g in (4) is a local minimizer of (5) with probability tending to 1, and it satisfies

√n(θ̂^g_1 − θ*_1) →_d N(0, Σ_11 − Σ_12 Σ^{−1}_22 Σ_21)

provided λ_n → 0 and √n λ_n → ∞ as n → ∞.

Theorem 1 implies that if we choose a proper tuning parameter λ_n, we can find a √n-consistent local minimizer that has the oracle property.
Remark 1. Note that the oracle property holds only when the asymptotic covariance matrix in Theorem 1 is the same as that of the oracle estimator θ̂^o_1. This condition is easily satisfied in many problems under regularity conditions such as the covariance condition used in Wang and Leng (2007). On the other hand, without any additional assumption, the result in Theorem 1 still holds, although the asymptotic covariance matrix in Theorem 1 may not be equal to that of the oracle estimator θ̂^o_1.
In linear regression, Kim et al. (2008a) showed the key nature of the SCAD estimator: the oracle estimator is the global minimizer of the SCAD penalized sum of squared residuals. We give a similar result for the Q-SCAD estimator in the following theorem.
Theorem 2 (Global Optimality). Under condition A, the estimator θ̂^g in (4) is the global minimizer of (5) with probability tending to 1, i.e.,

P( C(θ̂^g) = inf_θ C(θ) ) → 1

provided λ_n → 0 and √n λ_n → ∞ as n → ∞.

Theorem 2 implies that P(θ̂ = θ̂^g) → 1. That is, we can eventually find θ̂^g by finding the global minimizer of C(θ).
4. Numerical studies
In this section, we investigate the finite sample performance of the Q-SCAD estimator through simulation as well as real data analysis. We compare the Q-SCAD estimator with the original SCAD estimator, the LSA estimator of Wang and Leng (2007), and the non-penalized estimator. We choose the tuning parameter λ using the BIC criterion:

BIC(θ̂(λ)) = −2L(θ̂(λ)) + DF(θ̂(λ)) log n / n,

where DF(θ̂(λ)) is the number of estimated nonzero coefficients. This type of tuning parameter selection is shown to guarantee selection consistency by Wang and Leng (2007) and Wang et al. (2007).
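A grid search with this criterion might look as follows; this is an illustrative sketch in which `logliks` are taken to be fitted log-likelihoods L(θ̂(λ)) (so that smaller −2L means a better fit; conventions for the sign and scaling of L vary, and the names are not from the paper).

```python
import numpy as np

def select_lambda(lambdas, logliks, dfs, n):
    """Return the lambda minimizing BIC = -2*L + DF*log(n)/n over a grid,
    following the display above.  dfs counts estimated nonzero coefficients."""
    bics = [-2.0 * L + df * np.log(n) / n for L, df in zip(logliks, dfs)]
    return lambdas[int(np.argmin(bics))]
```

The DF·log n/n term trades fit against sparsity: between two fits with equal likelihood, the sparser one wins.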
Example 1 (Logistic Regression). First, we consider a sparse logistic regression model:
P(y = 1|x) = exp(x^Tθ)/(1 + exp(x^Tθ)).
We set the true parameter θ = (3, 1.5, 0, 0, 2, 0, 0, 0, 0)^T, and the correlation between any two covariates x_i and x_j is fixed at 0.5^|i−j|. This model is similar to those considered in many other papers; for example, see Tibshirani (1996), Fan and Li (2001) and Zou (2006). With sample sizes n = 50, 100 and 200, we repeat the simulation 300 times.
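A covariate design with correlation 0.5^|i−j| can be drawn through the Cholesky factor of the AR(1) correlation matrix; a minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def ar1_design(n, p, rho=0.5, seed=0):
    """Draw n covariate vectors with corr(x_i, x_j) = rho^|i-j|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix
    # If Z has iid N(0,1) entries, Z L^T has covariance L L^T = corr.
    return rng.normal(size=(n, p)) @ np.linalg.cholesky(corr).T
```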
Table 1
Simulation results for the logistic regression model.

n    Method  TNL              RME              Noise          Signal
50   SCAD    0.3644 (0.0009)  0.5876 (0.0068)  0.16 (0.0065)  2.31 (0.0097)
     Q-SCAD  0.3621 (0.0009)  0.5376 (0.0063)  0.28 (0.0096)  2.47 (0.0092)
     LSA     0.3818 (0.0009)  0.7124 (0.0053)  1.01 (0.0166)  2.44 (0.0093)
     NPE     0.4661 (0.0017)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  0.3165 (0.0004)  0.2467 (0.0029)  0.00 (0.0000)  3.00 (0.0000)
100  SCAD    0.3107 (0.0005)  0.3918 (0.0061)  0.18 (0.0062)  2.85 (0.0052)
     Q-SCAD  0.3104 (0.0005)  0.3882 (0.0054)  0.28 (0.0086)  2.88 (0.0048)
     LSA     0.3192 (0.0004)  0.5922 (0.0051)  0.89 (0.0147)  2.85 (0.0057)
     NPE     0.3498 (0.0006)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  0.2952 (0.0002)  0.2654 (0.0028)  0.00 (0.0000)  3.00 (0.0000)
200  SCAD    0.2916 (0.0002)  0.3771 (0.0033)  0.12 (0.0052)  2.99 (0.0012)
     Q-SCAD  0.2905 (0.0002)  0.3649 (0.0033)  0.13 (0.0053)  2.99 (0.0012)
     LSA     0.2942 (0.0002)  0.5269 (0.0035)  0.49 (0.0100)  2.99 (0.0012)
     NPE     0.3104 (0.0003)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  0.2886 (0.0002)  0.3239 (0.0026)  0.00 (0.0000)  3.00 (0.0000)
Table 2
Simulation results for the Poisson regression model.

n    Method  TNL               RME              Noise          Signal
50   SCAD    −12.975 (0.0359)  0.4944 (0.0116)  0.73 (0.0109)  2.99 (0.0016)
     Q-SCAD  −12.983 (0.0359)  0.4760 (0.0090)  0.78 (0.0116)  2.99 (0.0014)
     LSA     −13.000 (0.0359)  0.4486 (0.0080)  0.33 (0.0090)  3.00 (0.0000)
     NPE     −12.773 (0.0365)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  −13.065 (0.0362)  0.3041 (0.0092)  0.00 (0.0000)  3.00 (0.0000)
100  SCAD    −13.062 (0.0314)  0.5760 (0.0090)  0.54 (0.0116)  3.00 (0.0000)
     Q-SCAD  −13.062 (0.0314)  0.5708 (0.0086)  0.55 (0.0117)  3.00 (0.0000)
     LSA     −13.068 (0.0314)  0.5141 (0.0076)  0.15 (0.0060)  3.00 (0.0000)
     NPE     −13.007 (0.0311)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  −13.077 (0.0314)  0.4184 (0.0077)  0.00 (0.0000)  3.00 (0.0000)
200  SCAD    −13.136 (0.0326)  0.5120 (0.0069)  0.31 (0.0079)  3.00 (0.0000)
     Q-SCAD  −13.136 (0.0326)  0.5122 (0.0071)  0.32 (0.0081)  3.00 (0.0000)
     LSA     −13.137 (0.0326)  0.4860 (0.0061)  0.08 (0.0045)  3.00 (0.0000)
     NPE     −13.116 (0.0325)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  −13.139 (0.0326)  0.4383 (0.0056)  0.00 (0.0000)  3.00 (0.0000)
The simulation results are summarized in Table 1. The TNL is the average of the mean test negative log-likelihood obtained from 5000 independent test samples. The RME is the median of the relative model error, which is the ratio of the model error of the penalized estimator to that of the non-penalized one. Here, the model error is defined as the mean squared difference between the estimated and true means of the response variable (Fan and Li, 2001). The columns ''Signal'' and ''Noise'' report the average numbers of nonzero coefficients estimated to be nonzero correctly and incorrectly, respectively. The corresponding estimated standard errors are in parentheses.

As can be seen from Table 1, the Q-SCAD estimator performs better than both the SCAD estimator and the LSA estimator in terms of prediction accuracy, relative model error and selectivity regardless of the sample size, while the non-penalized estimator performs the worst.
Example 2 (Poisson Regression). We consider a sparse Poisson regression model:

y|x ∼ Poisson(exp(x^Tθ)).

We set the true parameter θ = (1.2, 0.6, 0, 0, 0.8, 0, 0, 0, 0)^T as in Zou and Li (2008). The other settings are the same as those of Example 1. We ignored the normalizing constants when calculating the TNL.

The results are summarized in Table 2. The prediction accuracy, relative model error and selectivity of the LSA estimator are the best regardless of the sample size, while the SCAD estimator performs the worst. However, the differences decrease as the sample size increases.
Example 3 (Least Absolute Deviation Regression). In this example, we consider sparse least absolute deviation regression. Data were generated from

y = x^Tθ + ε,

where ε ∼ N(0, 1). The true coefficient vector θ is set to be the same as that in Example 1. To measure the prediction error, we calculated the test least absolute deviation (TLAD) value between the estimated and true means based on an independent test dataset of size 5000.
Table 3
Simulation results for the least absolute deviation regression.

n    Method  TNL              RME              Noise          Signal
50   SCAD    0.4265 (0.0003)  0.3528 (0.0040)  0.06 (0.0035)  3.00 (0.0000)
     Q-SCAD  0.4316 (0.0004)  0.4323 (0.0036)  0.07 (0.0090)  3.00 (0.0000)
     LSA     0.4316 (0.0004)  0.4457 (0.0033)  0.09 (0.0048)  3.00 (0.0000)
     NPE     0.4697 (0.0005)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  0.4240 (0.0003)  0.3260 (0.0038)  0.00 (0.0000)  3.00 (0.0000)
100  SCAD    0.4115 (0.0001)  0.3452 (0.0035)  0.03 (0.0024)  3.00 (0.0000)
     Q-SCAD  0.4123 (0.0001)  0.3563 (0.0030)  0.06 (0.0039)  3.00 (0.0000)
     LSA     0.4124 (0.0002)  0.3906 (0.0029)  0.02 (0.0023)  3.00 (0.0000)
     NPE     0.4327 (0.0002)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  0.4108 (0.0001)  0.3375 (0.0035)  0.00 (0.0000)  3.00 (0.0000)
200  SCAD    0.4059 (0.0001)  0.3905 (0.0037)  0.01 (0.0016)  3.00 (0.0000)
     Q-SCAD  0.4057 (0.0001)  0.4072 (0.0030)  0.01 (0.0014)  3.00 (0.0000)
     LSA     0.4060 (0.0001)  0.4195 (0.0030)  0.01 (0.0012)  3.00 (0.0000)
     NPE     0.4149 (0.0001)  1.0000 (0.0000)  6.00 (0.0000)  3.00 (0.0000)
     Oracle  0.4057 (0.0001)  0.3846 (0.0037)  0.00 (0.0000)  3.00 (0.0000)

(The first column of Table 3 reports the TLAD.)
Table 4
Estimated coefficients for the South Africa Heart Disease Dataset.

      SCAD     Q-SCAD   LSA      NPE      p-value
int   −1.2549  −1.2457  −1.1509  −1.2631  0
sbp   0        0        0        0.1333   0.2563
tob   0.3692   0.3678   0.2960   0.3646   0.0028
ldl   0.3355   0.3335   0.2683   0.3602   0.0035
adi   0        0        0        0.1446   0.5257
fam   0.9082   0.9005   0.7854   0.9254   0
typ   0.3644   0.3616   0.2691   0.3887   0.0013
obe   0        0        0        −0.2651  0.1550
alc   0        0        0        0.003    0.9783
age   0.7372   0.7326   0.6899   0.6607   0.0001
Table 3 shows that all the methods perform similarly, but the SCAD estimator does slightly better than the Q-SCAD estimator, and the LSA estimator is the worst.
Example 4 (South Africa Heart Disease Dataset). This dataset is a retrospective sample of males in a heart disease high-risk region of the Western Cape, South Africa. In this dataset, 9 variables are considered to classify a binary response variable, which indicates the presence or absence of myocardial infarction for 462 subjects. For more information, see http://www-stat.stanford.edu/~tibs/ElemStatLearn. In Table 4, the estimated coefficients are presented. The column ''p-value'' indicates the significance of the non-penalized estimates. As expected, the three sparse methods selected the same covariates, and the coefficients estimated to be zero had relatively larger p-values. It is interesting that the coefficients estimated by the SCAD and Q-SCAD are similar to each other as well as to the non-penalized ones, while those of the LSA are relatively smaller. This is partly because the SCAD and Q-SCAD estimators are unbiased while the LSA estimator is not. This example shows that the Q-SCAD estimator can be used in real data analysis without loss of efficiency.
5. Conclusions
We proposed a method of quadratic approximation that unifies various types of SCAD penalized estimations. Theoretically, the proposed method satisfies the oracle property, and the numerical studies confirm that the proposed method performs well with moderate sample sizes.

It is an interesting problem to extend the proposed unified approach for SCAD penalized estimation to high-dimensional situations. It is not clear how to choose an initial estimator around which the loss function is quadratically approximated. We leave this problem as future work.
Acknowledgements
We are grateful to the anonymous referees, the associate editor, and the editor for their helpful comments. Kwon's research was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Education, Science and Technology (MEST)/Korea Science and Engineering Foundation (KOSEF), grant number R11-2008-007-01002-0. Kim's work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF), grant number KRF-2008-314-C00046.
Appendix
Proof of Theorem 1. Direct calculation yields

√n(θ̂^g_1 − θ*_1) = √n(θ̂^c_1 − θ*_1) + √n H(θ̂^c)^{−1}_{11} H(θ̂^c)_{12} θ̂^c_2.

From the blockwise inversion formula, we have

H(θ̂^c) →_p ( Σ^{−1}_{(11)}                  −Σ^{−1}_{(11)} Σ_12 Σ^{−1}_22
              −Σ^{−1}_22 Σ_21 Σ^{−1}_{(11)}   Σ^{−1}_22 + Σ^{−1}_22 Σ_21 Σ^{−1}_{(11)} Σ_12 Σ^{−1}_22 )

where Σ_{(11)} = Σ_11 − Σ_12 Σ^{−1}_22 Σ_21. Using H(θ̂^c)^{−1}_{11} H(θ̂^c)_{12} →_p −Σ_12 Σ^{−1}_22, it is easy to check that

√n(θ̂^g_1 − θ*_1) →_d N(0, Σ_11 − Σ_12 Σ^{−1}_22 Σ_21).
Now, it remains to show that the estimator θ̂^g is a local minimizer of the objective function in (5) with probability tending to 1. Let

S(θ) = ∂Q(θ)/∂θ = H(θ̂^c)(θ − θ̂^c);

then

S(θ̂^g) = ( S(θ̂^g)_1 )  =  ( H(θ̂^c)_11 (θ̂^g_1 − θ̂^c_1) − H(θ̂^c)_12 θ̂^c_2 )
          ( S(θ̂^g)_2 )     ( H(θ̂^c)_21 (θ̂^g_1 − θ̂^c_1) − H(θ̂^c)_22 θ̂^c_2 ).

By the Karush–Kuhn–Tucker (KKT) condition (see, for example, p. 320 of Bertsekas (1999)), it suffices to check the following two conditions:

S(θ̂^g)_j + ∂J_{λ_n}(θ̂^g_j)/∂θ_j = 0    for all θ̂^g_j ≠ 0

and

|S(θ̂^g)_j| ≤ λ_n    for all θ̂^g_j = 0.

By the definition of θ̂^g, we have S(θ̂^g)_1 = 0; hence it suffices to show that

P( min_{j≤q} |θ̂^g_j| ≥ aλ_n ) → 1    (7)

and

P( max_{j>q} |S(θ̂^g)_j| ≤ λ_n ) → 1    (8)
as n → ∞. From Eq. (6), we have

‖θ̂^g_1 − θ*_1‖/λ_n ≤ ‖θ̂^c_1 − θ*_1‖/λ_n + ‖H(θ̂^c)^{−1}_{11} H(θ̂^c)_{12} θ̂^c_2‖/λ_n = O_p(1/(√n λ_n))

and

‖S(θ̂^g)_2‖ ≤ ‖H(θ̂^c)_21 (θ̂^g_1 − θ̂^c_1)‖ + ‖H(θ̂^c)_22 θ̂^c_2‖ = O_p(1/√n).

Hence it is easy to check that

min_{j≤q} |θ̂^g_j|/λ_n ≥ min_{j≤q} |θ*_j|/λ_n − ‖θ̂^g_1 − θ*_1‖/λ_n = O_p(1/λ_n)

and ‖S(θ̂^g)_2‖ = o_p(λ_n), which imply (7) and (8), respectively. This completes the proof. □
Proof of Theorem 2. From simple algebra, we get

C(θ) − C(θ̂^g) = S(θ̂^g)^T_2 θ_2 + (θ − θ̂^g)^T H(θ̂^c)(θ − θ̂^g)/2 + Σ_{j=1}^p ( J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j) )
              ≥ −o_p(λ_n) Σ_{j=q+1}^p |θ_j| + ρ Σ_{j=1}^p (θ_j − θ̂^g_j)²/2 + Σ_{j=1}^p ( J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j) )
              = Σ_{j=1}^p w_j
where

w_j = −o_p(λ_n)|θ_j| I(j > q) + ρ(θ_j − θ̂^g_j)²/2 + J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j)

and ρ is the minimum eigenvalue of Σ^{−1}. First, consider the cases where j ≤ q. If |θ_j| ≥ aλ_n, then J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j) = 0, hence w_j ≥ 0. If |θ_j| < aλ_n, then J_{λ_n}(θ_j) − J_{λ_n}(θ̂^g_j) ≤ λ_n|θ_j − θ̂^g_j| and |θ_j − θ̂^g_j| ≥ |θ*_j| − ‖θ̂^g_1 − θ*_1‖ − aλ_n = O_p(1), and so w_j ≥ (O_p(1) − λ_n)|θ_j − θ̂^g_j| ≥ 0 for sufficiently large n. Next, consider the cases j > q. If |θ_j| ≥ λ_n, then w_j ≥ −o_p(λ_n)|θ_j| + ρ|θ_j|²/2 ≥ 0. And if |θ_j| < λ_n, then w_j ≥ λ_n|θ_j| − o_p(λ_n)|θ_j| ≥ 0. Hence, for all 1 ≤ j ≤ p, w_j ≥ 0 for sufficiently large n. This completes the proof. □
References
Bertsekas, D.P., 1999. Nonlinear Programming, second ed. Athena Scientific, Belmont, MA.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The Annals of Statistics 32, 407–499.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Fan, J., Peng, H., 2004. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 32, 928–961.
Kim, Y., Choi, H., Oh, H., 2008a. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association 103, 1656–1673.
Kim, J., Kim, Y., Kim, Y., 2008b. A gradient-based optimization algorithm for lasso. Journal of Computational and Graphical Statistics 17, 994–1009.
Knight, K., Fu, W.J., 2000. Asymptotics for lasso-type estimators. The Annals of Statistics 28, 1356–1378.
Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 1436–1462.
Park, M., Hastie, T., 2007. l1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69, 659–667.
Rosset, S., Zhu, J., 2007. Piecewise linear regularized solution paths. The Annals of Statistics 35.
Tibshirani, R.J., 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B 58, 267–288.
Wang, H., Leng, C., 2007. Unified lasso estimation by least squares approximation. Journal of the American Statistical Association 102, 1039–1048.
Wang, H., Li, R., Tsai, C., 2007. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94, 553–568.
Wu, Y., Liu, Y., 2009. Variable selection in quantile regression. Statistica Sinica 19, 801–817.
Zhao, P., Yu, B., 2006. On model selection consistency of lasso. Journal of Machine Learning Research 7, 2541–2563.
Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zou, H., Li, R., 2008. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36, 1509–1533.