Forward-Backward Greedy Algorithms for General Convex Smooth Functions over A Cardinality Constraint

Abstract

We consider forward-backward greedy algorithms for solving sparse feature selection problems with general convex smooth functions. A state-of-the-art greedy method, the Forward-Backward greedy algorithm (FoBa-obj), requires solving a large number of optimization problems, so it is not scalable to large problems. The FoBa-gdt algorithm, which uses gradient information for feature selection at each forward iteration, significantly improves the efficiency of FoBa-obj. In this paper, we systematically analyze the theoretical properties of both forward-backward greedy algorithms. Our main contributions are: 1) we derive better theoretical bounds than existing analyses of FoBa-obj for general smooth convex functions; 2) we show that FoBa-gdt achieves the same theoretical performance as FoBa-obj under the same condition, the restricted strong convexity condition. Our new bounds are consistent with the bounds of a special case (least squares) and fill a previously existing theoretical gap for general convex smooth functions; 3) we show that the restricted strong convexity condition is satisfied if the number of independent samples is more than k̄ log d, where k̄ is the sparsity number and d is the dimension of the variable; 4) we apply FoBa-gdt (with the conditional random field objective) to the sensor selection problem for human indoor activity recognition, and our results show that FoBa-gdt outperforms other methods (including the ones based on forward greedy selection and L1-regularization).

1. Introduction

Feature selection has been one of the most significant issues in machine learning and data mining. Following the success of Lasso (Tibshirani, 1994), learning algorithms with sparse regularization (a.k.a. sparse learning) have recently received significant attention. A classical problem is to estimate a signal β* ∈ R^d from a feature matrix X ∈ R^{n×d} and an observation y = Xβ* + noise ∈ R^n, under the assumption that β* is sparse (i.e., β* has k̄ ≪ d nonzero elements). Previous studies have proposed many powerful tools to estimate β*. In addition, in certain applications, reducing the number of features has significant practical value (e.g., sensor selection in our case).

The general sparse learning problem can be formulated as follows (Jalali et al., 2011):

    β̄ := argmin_β Q(β; X, y)   s.t.   ‖β‖_0 ≤ k̄,    (1)

where Q(β; X, y) is a convex smooth function in terms of β, such as the least square loss (Tropp, 2004) (regression), the Gaussian MLE (or log-determinant divergence) (Ravikumar et al., 2011) (covariance selection), and the logistic loss (Kleinbaum & Klein, 2010) (classification). ‖β‖_0 denotes the ℓ0-norm, that is, the number of nonzero entries of β ∈ R^d. Hereinafter, we denote Q(β; X, y) simply as Q(β).
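To make the setup concrete, the following minimal Python sketch instantiates Q as the least square loss and checks the cardinality constraint; the function names, the toy data, and the sparsity level k_bar are placeholder choices of ours, not quantities from the paper.

```python
import numpy as np

def Q_least_squares(beta, X, y):
    """Least square loss Q(beta) = 0.5 * ||X beta - y||^2."""
    r = X @ beta - y
    return 0.5 * r @ r

def grad_Q_least_squares(beta, X, y):
    """Gradient of the least square loss: X^T (X beta - y)."""
    return X.T @ (X @ beta - y)

def is_feasible(beta, k_bar):
    """Cardinality constraint ||beta||_0 <= k_bar."""
    return np.count_nonzero(beta) <= k_bar

# Toy instance (placeholder data): n = 50 samples, d = 20 features, k_bar = 3.
rng = np.random.default_rng(0)
n, d, k_bar = 50, 20, 3
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:k_bar] = 1.0
y = X @ beta_true + 0.01 * rng.standard_normal(n)

print(Q_least_squares(beta_true, X, y), is_feasible(beta_true, k_bar))
```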

From an algorithmic viewpoint, we are mainly interested in three aspects of the estimator β̂: 1) the estimation error ‖β̂ − β̄‖; 2) the objective error Q(β̂) − Q(β̄); and 3) the feature selection error, that is, the difference between supp(β̂) and F̄ := supp(β̄), where supp(β) is the index set corresponding to the nonzero elements of β. Since the constraint defines a non-convex feasible region, the problem is non-convex and generally NP-hard.

There are two types of approaches to this problem in the literature. Convex-relaxation approaches replace the ℓ0-norm by the ℓ1-norm as a sparsity penalty. Such approaches include Lasso (Tibshirani, 1994), the Dantzig selector (Candès & Tao, 2007), and L1-regularized logistic regression (Kleinbaum & Klein, 2010). Alternative greedy-optimization approaches include orthogonal matching pursuit (OMP) (Tropp, 2004; Zhang, 2009), backward elimination, and the forward-backward greedy method (FoBa) (Zhang, 2011a), which use a greedy heuristic procedure to estimate sparse signals. Both types of algorithms have been well studied from both theoretical and empirical perspectives.


FoBa has been shown to give better theoretical properties than LASSO and the Dantzig selector for the least square loss function Q(β) = ½‖Xβ − y‖² (Zhang, 2011a). Jalali et al. (2011) recently extended it to general convex smooth functions. Their method and analysis, however, pose computational and theoretical issues. First, since FoBa solves a large number of single variable optimization problems in every forward selection step, it is computationally expensive for general convex functions if the sub-problems have no closed form solution. Second, though they have empirically shown that FoBa performs well for general smooth convex functions, their theoretical results are weaker than those for the least square case (Zhang, 2011a). More precisely, their upper bound for the estimation error is looser and their analysis requires more restrictive conditions for feature selection and signal recovery consistency. The question of whether or not FoBa can achieve the same theoretical bound in the general case as in the least square case motivates this work.

This paper addresses the theoretical and computational issues associated with the standard FoBa algorithm (hereinafter referred to as FoBa-obj because it solves single variable problems to minimize the objective in each forward selection step). We study a new algorithm referred to as "gradient" FoBa (FoBa-gdt), which significantly improves the computational efficiency of FoBa-obj. The key difference is that FoBa-gdt only evaluates gradient information in individual forward selection steps rather than solving a large number of single variable optimization problems. Our contributions are summarized as follows.

Theoretical Analysis of FoBa-obj and FoBa-gdt. This paper presents three main theoretical contributions. First, we derive better theoretical bounds for the estimation error, objective error, and feature selection error than existing analyses of FoBa-obj for general smooth convex functions (Jalali et al., 2011) under the same condition: the restricted strong convexity condition. Second, we show that FoBa-gdt achieves the same theoretical performance as FoBa-obj. Our new bounds are consistent with the bounds of a special case, i.e., the least square case, and fill in the theoretical gap between the general loss case (Jalali et al., 2011) and the least square loss case (Zhang, 2011a). Our result also implies an interesting consequence: when the signal-to-noise ratio is big enough, the NP-hard problem (1) can be solved by using FoBa-obj or FoBa-gdt. Third, we show that the restricted strong convexity condition is satisfied for a class of commonly used machine learning objectives, e.g., the logistic loss and the least square loss, if the number of independent samples is greater than k̄ log d, where k̄ is the sparsity number and d is the dimension of the variable.

Application to Sensor Selection. We have applied FoBa-gdt with the CRF loss function (referred to as FoBa-gdt-CRF) to sensor selection from time-series binary location signals (captured by pyroelectric sensors) for human activity recognition at home, which is a fundamental problem in smart home systems and home energy management systems. In comparison with forward greedy and L1-regularized CRFs (referred to as L1-CRF), FoBa-gdt-CRF requires the smallest number of sensors to achieve comparable recognition accuracy.

1.1. Notation

Denote e_j ∈ R^d as the j-th natural basis vector of R^d. The set difference A − B returns the elements that are in A but not in B. Given any integer s > 0, the restricted strong convexity constants (RSCC) ρ−(s) and ρ+(s) are defined as follows: for any ‖t‖_0 ≤ s with t = β′ − β, we require

    (ρ−(s)/2)‖t‖² ≤ Q(β′) − Q(β) − ⟨∇Q(β), t⟩ ≤ (ρ+(s)/2)‖t‖².    (2)

Similar definitions can be found in (Bahmani et al., 2011; Jalali et al., 2011; Negahban et al., 2010; Zhang, 2009). If the objective function takes the quadratic form Q(β) = ½‖Xβ − y‖², then the above definition is equivalent to the restricted isometry property (RIP) (Candès & Tao, 2005):

    ρ−(s)‖t‖² ≤ ‖Xt‖² ≤ ρ+(s)‖t‖²,

where the well-known RIP constant can be defined as δ = max{1 − ρ−(s), ρ+(s) − 1}. To give tighter values for ρ+(·) and ρ−(·), we only require (2) to hold for all β ∈ Ds := {‖β‖_0 ≤ s | Q(β) ≤ Q(0)} throughout this paper. Finally, we define β̂(F) as β̂(F) := argmin_{supp(β)⊂F} Q(β). Note that this problem is convex as long as Q(β) is a convex function. Denote F̄ := supp(β̄) and k̄ := |F̄|.
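For intuition, the sketch below numerically evaluates ρ−(s) and ρ+(s) for the quadratic case, where inequality (2) reduces to extreme eigenvalues of the size-s Gram submatrices X_S^T X_S. The helper name rscc_quadratic, the brute-force enumeration (exponential in d, for illustration only), and the toy data are our own choices, not the paper's; for the quadratic loss the Hessian is constant, so the restriction to Ds plays no role here.

```python
import itertools
import numpy as np

def rscc_quadratic(X, s):
    """For Q(beta) = 0.5 * ||X beta - y||^2, inequality (2) reads
    rho_-(s) * ||t||^2 <= ||X t||^2 <= rho_+(s) * ||t||^2 for ||t||_0 <= s,
    so the tightest rho_-(s) / rho_+(s) are the min / max eigenvalues of
    X_S^T X_S over all supports S with |S| = s (brute force, illustration only)."""
    d = X.shape[1]
    rho_minus, rho_plus = np.inf, -np.inf
    for S in itertools.combinations(range(d), s):
        cols = X[:, list(S)]
        w = np.linalg.eigvalsh(cols.T @ cols)   # eigenvalues in ascending order
        rho_minus = min(rho_minus, w[0])
        rho_plus = max(rho_plus, w[-1])
    return rho_minus, rho_plus

rng = np.random.default_rng(0)
n, d, s = 200, 10, 3
X = rng.standard_normal((n, d)) / np.sqrt(n)    # roughly isotropic, unit-norm columns
rho_m, rho_p = rscc_quadratic(X, s)
print(rho_m, rho_p, rho_p / rho_m)              # kappa(s) = rho_+(s) / rho_-(s)
```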

We make use of order notation throughout this paper. If a and b are both positive quantities that depend on n or p, we write a = O(b) if a can be bounded by a fixed multiple of b for all sufficiently large dimensions. We write a = o(b) if for any positive constant φ > 0 we have a ≤ φb for all sufficiently large dimensions. We write a = Ω(b) if both a = O(b) and b = O(a) hold.

2. Related Work

Tropp (2004) investigated the behavior of the orthogonal matching pursuit (OMP) algorithm for the least square case, and proposed a sufficient condition (an ℓ∞-type condition) for guaranteed feature selection consistency. Zhang (2009) generalized this analysis to the case of measurement noise. In statistics, OMP is known as boosting (Buhlmann, 2006), and similar ideas have been explored in Bayesian network learning (Chickering & Boutilier, 2002).


Shalev-Shwartz et al. (2010) extended OMP to general convex smooth functions and studied the relationship between objective value reduction and output sparsity. Other greedy methods such as ROMP (Needell & Vershynin, 2009) and CoSaMP (Needell & Tropp, 2008) have been studied and shown to have theoretical properties similar to those of OMP. Zhang (2011a) proposed a forward-backward (FoBa) greedy algorithm for the least square case, which is an extension of OMP but has stronger theoretical guarantees as well as better empirical performance: feature selection consistency is guaranteed under the sparse eigenvalue condition, an ℓ2-type condition weaker than the ℓ∞-type condition. Note that if the data matrix is a Gaussian random matrix, the ℓ2-type condition requires the number of measurements n to be of the order of O(s log d), where s is the sparsity of the true solution and d is the number of features, while the ℓ∞-type condition requires n = O(s² log d); see (Zhang & Zhang, 2012; Liu et al., 2012). Jalali et al. (2011) and Johnson et al. (2012) extended the FoBa algorithm to general convex functions and applied it to sparse inverse covariance estimation problems.

Convex methods, such as LASSO (Zhao & Yu, 2006) and the Dantzig selector (Candès & Tao, 2007), have been proposed for sparse learning. The basic idea behind these methods is to use the ℓ1-norm to approximate the ℓ0-norm in order to transform problem (1) into a convex optimization problem. They usually require restrictive conditions, referred to as irrepresentable conditions (stronger than the RIP condition), for guaranteed feature selection consistency (Zhang, 2011a). A multi-stage procedure on LASSO and the Dantzig selector (Liu et al., 2012) relaxes such conditions, but the resulting condition is still stronger than RIP.

3. The Gradient FoBa Algorithm

This section introduces the standard FoBa algorithm, that is, FoBa-obj, and its variant FoBa-gdt. Both algorithms start from an empty feature pool F and follow the same procedure in every iteration, consisting of two steps: a forward step and a backward step. The forward step evaluates the "goodness" of all features outside of the current feature set F, selects the best feature to add to the current feature pool F, and then optimizes the corresponding coefficients of all features in the current feature pool F to obtain a new β. The elements of β in F are nonzero and the rest are zeros. The backward step iteratively evaluates the "badness" of all features in the current feature set F, removes "bad" features from the current feature pool F, and recomputes the optimal β over the current feature set F. Both algorithms use the same definition of "badness" for a feature: the increase of the objective after removing this feature. Specifically, for any feature i in the current feature pool F, the "badness" is defined as Q(β − β_i e_i) − Q(β), which is a positive number.

Algorithm 1 FoBa (FoBa-obj / FoBa-gdt)

Require: δ > 0 (FoBa-obj) / ε > 0 (FoBa-gdt)
Ensure: β(k)
1: Let F(0) = ∅, β(0) = 0, k = 0
2: while TRUE do
3:   %% stopping determination
4:   if Q(β(k)) − min_{α, j∉F(k)} Q(β(k) + αe_j) < δ (FoBa-obj) / ‖∇Q(β(k))‖∞ < ε (FoBa-gdt) then
5:     break
6:   end if
7:   %% forward step
8:   i(k) = argmin_{i∉F(k)} {min_α Q(β(k) + αe_i)} (FoBa-obj) / i(k) = argmax_{i∉F(k)} |∇Q(β(k))_i| (FoBa-gdt)
9:   F(k+1) = F(k) ∪ {i(k)}
10:  β(k+1) = β̂(F(k+1))
11:  δ(k+1) = Q(β(k)) − Q(β(k+1))
12:  k = k + 1
13:  %% backward step
14:  while TRUE do
15:    if min_{i∈F(k)} Q(β(k) − β(k)_i e_i) − Q(β(k)) ≥ δ(k)/2 then
16:      break
17:    end if
18:    i(k) = argmin_i Q(β(k) − β(k)_i e_i)
19:    k = k − 1
20:    F(k) = F(k+1) − {i(k+1)}
21:    β(k) = β̂(F(k))
22:  end while
23: end while

It is worth noting that the forward step selects one and only one feature, while the backward step may remove zero, one, or more features. Finally, both algorithms terminate when no "good" feature can be identified in the forward step, that is, when the "goodness" of every feature outside of F is smaller than a threshold.

The main difference between FoBa-obj and FoBa-gdt lies in the definition of "goodness" in the forward step and in their respective stopping criteria. FoBa-obj evaluates the goodness of a feature by its maximal reduction of the objective function. Specifically, the "goodness" of feature i is defined as Q(β) − min_α Q(β + αe_i) (a larger value indicates a better feature). This is a direct way to evaluate "goodness", since our goal is to decrease the objective as much as possible under the cardinality constraint. However, it may be computationally expensive, since it requires solving a large number of one-dimensional optimization problems, which may or may not have a closed form solution. To improve computational efficiency in such situations, FoBa-gdt uses the partial derivative of Q with respect to individual coordinates (features) as its "goodness" measure: specifically, the "goodness" of feature i is defined as |∇Q(β)_i|. Note that both measures of "goodness" are always nonnegative. If feature i is already in the current feature set F, its "goodness" score is always zero, no matter which measure is used. We summarize the details of FoBa-obj and FoBa-gdt in Algorithm 1: the plain steps are common to both algorithms, and the steps marked FoBa-obj and FoBa-gdt correspond to their individual parts. The superscript (k) denotes the k-th iteration, incremented in forward steps and decremented in backward steps.

Gradient-based feature selection has been used in a forward greedy method (Zhang, 2011b). FoBa-gdt extends it to a forward-backward procedure (we present a detailed theoretical analysis of it in the next section). The main workload in the forward step of FoBa-obj is on Step 4, whose complexity is O(TD), where T represents the number of iterations needed to solve min_α Q(β(k) + αe_j) and D is the number of features outside of the current feature pool F(k). In comparison, the complexity of Step 4 in FoBa-gdt is just O(D). When T is large, we expect FoBa-gdt to be much more computationally efficient. The backward steps of both algorithms are identical. The computational costs of the backward step and the forward step are comparable in FoBa-gdt (but not in FoBa-obj), because their main workloads are on Step 10 and Step 21 respectively (both solve β̂(·)), and Step 21 runs at most as many times as Step 10.
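To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 in its FoBa-gdt form. The helper names (restricted_min, foba_gdt), the use of SciPy's L-BFGS-B to compute β̂(F), and the toy data at the end are our own illustrative choices under stated assumptions, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def restricted_min(Q, grad_Q, F, d):
    """beta_hat(F) := argmin over beta with supp(beta) in F of Q(beta),
    solved by optimizing only the coordinates in F (convex if Q is convex)."""
    F = list(F)
    def obj(z):
        b = np.zeros(d); b[F] = z
        return Q(b)
    def jac(z):
        b = np.zeros(d); b[F] = z
        return grad_Q(b)[F]
    res = minimize(obj, np.zeros(len(F)), jac=jac, method="L-BFGS-B")
    beta = np.zeros(d); beta[F] = res.x
    return beta

def foba_gdt(Q, grad_Q, d, eps=1e-3, max_features=None):
    """Minimal sketch of Algorithm 1, FoBa-gdt variant."""
    F, beta, deltas = [], np.zeros(d), []
    max_features = max_features if max_features is not None else d
    while len(F) < max_features:
        g = grad_Q(beta)
        g[F] = 0.0                              # features in F have zero "goodness"
        if np.max(np.abs(g)) < eps:             # stopping criterion (Step 4)
            break
        F.append(int(np.argmax(np.abs(g))))     # forward step (Step 8)
        new_beta = restricted_min(Q, grad_Q, F, d)
        deltas.append(Q(beta) - Q(new_beta))    # delta(k) (Step 11)
        beta = new_beta
        while len(F) > 1:                       # backward step (Steps 14-22)
            incs = [Q(np.where(np.arange(d) == j, 0.0, beta)) - Q(beta) for j in F]
            if min(incs) >= deltas[-1] / 2:     # removing any feature hurts too much
                break
            F.remove(F[int(np.argmin(incs))])
            deltas.pop()
            beta = restricted_min(Q, grad_Q, F, d)
    return beta, F

# Usage on a toy least squares instance (placeholder data).
rng = np.random.default_rng(0)
n, d = 100, 30
X = rng.standard_normal((n, d))
beta_star = np.zeros(d); beta_star[[2, 7, 11]] = [1.0, -2.0, 1.5]
y = X @ beta_star + 0.01 * rng.standard_normal(n)
Q = lambda b: 0.5 * np.sum((X @ b - y) ** 2)
grad_Q = lambda b: X.T @ (X @ b - y)
beta_hat, F_hat = foba_gdt(Q, grad_Q, d, eps=1.0)
print(sorted(F_hat))
```

The FoBa-obj variant would replace the gradient-based goodness in the forward step by Q(β) − min_α Q(β + αe_i), which is exactly the extra per-feature one-dimensional optimization that makes it more expensive.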

4. Theoretical Analysis

This section first gives the termination conditions of Algorithm 1 with FoBa-obj and FoBa-gdt, because the number of iterations directly affects the values of the RSCC constants (ρ+(·), ρ−(·), and their ratio), which are the key factors in our main results. We then discuss the values of the RSCC constants for a class of commonly used machine learning objectives. Next we present the main results of this paper, including upper bounds on the objective, estimation, and feature selection errors for both FoBa-obj and FoBa-gdt. We compare our results to those of existing analyses of FoBa-obj and show that our results fill the theoretical gap between the least square loss case and the general case.

4.1. Upper Bounds on Objective, Estimation, and Feature Selection Errors

We first study the termination conditions of FoBa-obj and FoBa-gdt, as summarized in Theorems 1 and 2 respectively.

Theorem 1. Take δ > 4ρ+(1)/ρ−(s)² ‖∇Q(β̄)‖²∞ in Algorithm 1 with FoBa-obj, where s can be any positive integer satisfying s ≤ n and

    (s − k̄) > (k̄ + 1) [ (√(ρ+(s)/ρ−(s)) + 1) (2ρ+(1)/ρ−(s)) ]².    (3)

Then the algorithm terminates at some k ≤ s − k̄.

Theorem 2. Take ε > 2√2 (ρ+(1)/ρ−(s)) ‖∇Q(β̄)‖∞ in Algorithm 1 with FoBa-gdt, where s can be any positive integer satisfying s ≤ n and Eq. (3). Then the algorithm terminates at some k ≤ s − k̄.

To simplify the results, we first assume that the condition number κ(s) := ρ+(s)/ρ−(s) is bounded (so is ρ+(1)/ρ−(s), because ρ+(s) ≥ ρ+(1)). Then both FoBa-obj and FoBa-gdt terminate at some k proportional to the sparsity k̄, similar to OMP (Zhang, 2011b) and FoBa-obj (Jalali et al., 2011; Zhang, 2011a). Note that the value of k in Algorithm 1 is exactly the cardinality of F(k) and the sparsity of β(k). Therefore, Theorems 1 and 2 imply that if κ(s) is bounded, FoBa-obj and FoBa-gdt output a solution with sparsity proportional to that of the true solution β̄.

Most existing works simply assume that κ(s) is bounded, or make similar assumptions. We make our analysis more complete by discussing the values of ρ+(s), ρ−(s), and their ratio κ(s). Apparently, if Q(β) is strongly convex and Lipschitzian, then ρ−(s) is bounded from below and ρ+(s) is bounded from above, thus bounding the ratio κ(s). To see that ρ+(s), ρ−(s), and κ(s) may still be bounded under milder conditions, we consider a common structure for Q(β) used in many machine learning formulations:

    Q(β) = (1/n) Σ_{i=1}^n l_i(X_i·β, y_i) + R(β),    (4)

where (X_i·, y_i) is the i-th training sample with X_i· ∈ R^d and y_i ∈ R, l_i(·, ·) is convex with respect to the first argument and could be different for different i, and both l_i(·, ·) and R(·) are twice differentiable functions. l_i(·, ·) is typically the loss function, e.g., the quadratic loss l_i(u, v) = (u − v)² in regression problems and the logistic loss l_i(u, v) = log(1 + exp{−uv}) in classification problems. R(β) is typically the regularization, e.g., R(β) = (µ/2)‖β‖².

Theorem 3. Let s be a positive integer less than n, and let λ−, λ+, λ−_R, and λ+_R be nonnegative numbers satisfying

    λ− ≤ ∇²₁ l_i(X_i·β, y_i) ≤ λ+,    λ−_R I ⪯ ∇²R(β) ⪯ λ+_R I

(∇²₁ l_i(·, ·) is the second derivative with respect to the first argument) for any i and β ∈ Ds. Assume that λ−_R + 0.5λ− > 0 and that the sample matrix X ∈ R^{n×d} has independent sub-Gaussian isotropic random rows or columns (in the case of columns, all columns should also satisfy ‖X·j‖ = √n). If the number of samples satisfies n ≥ Cs log d, then

    ρ+(s) ≤ λ+_R + 1.5λ+,    (5a)
    ρ−(s) ≥ λ−_R + 0.5λ−,    (5b)
    κ(s) ≤ (λ+_R + 1.5λ+) / (λ−_R + 0.5λ−) =: κ    (5c)

hold with high probability¹, where C is a fixed constant. Furthermore, define k̄, β̄, and δ (or ε) in Algorithm 1 with FoBa-obj (or FoBa-gdt) as in Theorem 1 (or Theorem 2). Let

    s = k̄ + 4κ²(√κ + 1)²(k̄ + 1)    (6)

and n ≥ Cs log d. Then s satisfies (3), and Algorithm 1 with FoBa-obj (or FoBa-gdt) terminates within at most 4κ²(√κ + 1)²(k̄ + 1) iterations with high probability.

¹ "With high probability" means that the probability converges to 1 as the problem size approaches infinity.

Roughly speaking, if the number of training samples is large enough, i.e., n ≥ Ω(k̄ log d) (which can still be much smaller than the dimension d of the training data), then the following hold with high probability: Algorithm 1 with FoBa-obj or FoBa-gdt outputs a solution with sparsity at most Ω(k̄) (this result will be improved when the nonzero elements of β̄ are strong enough, as shown in Theorems 4 and 5); s is bounded by Ω(k̄); and ρ+(s), ρ−(s), and κ(s) are bounded by constants. One important assumption is that the sample matrix X has independent sub-Gaussian isotropic random rows or columns. In fact, this assumption is satisfied by many natural examples, including Gaussian and Bernoulli matrices, and general bounded random matrices whose entries are independent bounded random variables with zero mean and unit variance. Note that from the definition of "sub-Gaussian isotropic random vectors" (Vershynin, 2011, Definitions 19 and 22), dependence within rows or within columns is even allowed, but not within both. Another important assumption is λ−_R + 0.5λ− > 0, which means that either λ−_R or λ− is positive (both of them are nonnegative from the convexity assumption). One can simply verify that 1) for the quadratic case Q(β) = (1/n)Σ_{i=1}^n (X_i·β − y_i)², we have λ− = 1 and λ−_R = 0; and 2) for the logistic case with a bounded data matrix X, that is, Q(β) = (1/n)Σ_{i=1}^n log(1 + exp{−X_i·β y_i}) + (µ/2)‖β‖², we have λ−_R = µ > 0 and λ− > 0 because Ds is bounded in this case.
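As a quick numerical illustration of the quantities in Theorem 3, the sketch below evaluates the second derivative of the logistic loss along Xβ for an ℓ2-regularized logistic objective and forms a bound in the style of Eq. (5c); the data, the point β, and the choice µ = 0.1 are placeholders, and the computed λ− and λ+ are empirical values at one point rather than uniform bounds over Ds.

```python
import numpy as np

# Q(beta) = (1/n) sum_i log(1 + exp(-X_i.beta * y_i)) + (mu/2) ||beta||^2,
# so l_i(u, v) = log(1 + exp(-u v)) with d^2/du^2 l_i = sigma(uv)(1 - sigma(uv))
# for labels v in {-1, +1}, while the regularizer gives lambda_R^- = lambda_R^+ = mu.
def second_derivative_logistic(u, v):
    p = 1.0 / (1.0 + np.exp(-u * v))
    return p * (1.0 - p)

rng = np.random.default_rng(0)
n, d, mu = 500, 50, 0.1
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
beta = 0.1 * rng.standard_normal(d)        # a placeholder point standing in for D_s

h = second_derivative_logistic(X @ beta, y)
lam_minus, lam_plus = h.min(), h.max()     # empirical curvature bounds at this point
kappa = (mu + 1.5 * lam_plus) / (mu + 0.5 * lam_minus)   # Eq. (5c)-style bound
print(lam_minus, lam_plus, kappa)
```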

Now we are ready to present the main results: the upper bounds on the estimation error, objective error, and feature selection error for both algorithms. ρ+(s), ρ+(1), and ρ−(s) are involved in all bounds below. One can simply treat them as constants when interpreting the following results, since we are mainly interested in the scenario where the number of training samples is large enough. We omit proofs due to space limitations (the proofs are provided in the Supplement). The main results for FoBa-obj and FoBa-gdt are presented in Theorems 4 and 5 respectively.

Theorem 4. Let s be any number that satisfies (3) and choose δ as in Theorem 1 for Algorithm 1 with FoBa-obj. Consider the output β(k) and its support set F(k). We have

    ‖β(k) − β̄‖² ≤ (16ρ+(1)δ / ρ−(s)²) ∆̄,
    Q(β(k)) − Q(β̄) ≤ (2ρ+(1)δ / ρ−(s)) ∆̄,
    (ρ−(s)² / 8ρ+(1)²) |F(k) − F̄| ≤ |F̄ − F(k)| ≤ 2∆̄,

where γ = 4√(ρ+(1)δ) / ρ−(s) and ∆̄ := |{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

Theorem 5. Let s be any number that satisfies (3) and choose ε as in Theorem 2 for Algorithm 1 with FoBa-gdt. Consider the output β(k) and its support set F(k). We have

    ‖β(k) − β̄‖² ≤ (8ε² / ρ−(s)²) ∆̄,
    Q(β(k)) − Q(β̄) ≤ (ε² / ρ−(s)) ∆̄,
    (ρ−(s)² / 8ρ+(1)²) |F(k) − F̄| ≤ |F̄ − F(k)| ≤ 2∆̄,

where γ = 2√2 ε / ρ−(s) and ∆̄ := |{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

Although FoBa-obj and FoBa-gdt use different criteria to evaluate the "goodness" of each feature, they actually guarantee the same properties. Choose ε² and δ on the order of Ω(‖∇Q(β̄)‖²∞). For both algorithms, the estimation error ‖β(k) − β̄‖² and the objective error Q(β(k)) − Q(β̄) are then bounded by Ω(∆̄‖∇Q(β̄)‖²∞), and the feature selection errors |F(k) − F̄| and |F̄ − F(k)| are bounded by Ω(∆̄). ‖∇Q(β̄)‖∞ and ∆̄ are the two key factors in these bounds. ‖∇Q(β̄)‖∞ roughly represents the noise level². ∆̄ counts the weak channels of the true solution β̄ in F̄. One can see that if all channels of β̄ on F̄ are strong enough, that is, |β̄_j| > Ω(‖∇Q(β̄)‖∞) for all j ∈ F̄, then ∆̄ turns out to be 0. In other words, all errors (estimation error, objective error, and feature selection error) become 0 when the signal-to-noise ratio is big enough. Note that under this condition the original NP-hard problem (1) is solved exactly, which is summarized in the following corollary:

² To see this, consider the least square case (with the standard noise assumption and each column of the measurement matrix X ∈ R^{n×d} normalized to 1): ‖∇Q(β̄)‖∞ ≤ Ω(√(n⁻¹ log d) σ) holds with high probability, where σ is the standard deviation.


Corollary 1. Let s be any number that satisfies (3) and choose δ (or ε) as in Theorem 1 (or 2) for Algorithm 1 with FoBa-obj (or FoBa-gdt). If

    |β̄_j| / ‖∇Q(β̄)‖∞ ≥ 8ρ+(1) / ρ−(s)²    ∀ j ∈ F̄,

then problem (1) can be solved exactly.

One may argue that since it is difficult to set δ or ε, it is still hard to solve (1). In practice, one does not have to set δ or ε: one only needs to run Algorithm 1 without checking the stopping condition until all features are selected. Then the most recent β(k̄), i.e., the last iterate with sparsity k̄, gives the solution to (1).

    4.2. Comparison for the General Convex Case

Jalali et al. (2011) analyzed FoBa-obj for general convex smooth functions, and here we compare our results to theirs. They chose the true model β* as the target rather than the true solution β̄. In order to simplify the comparison, we assume that the distance between the true solution and the true model is not too great³, that is, we have β* ≈ β̄, supp(β*) = supp(β̄), and ‖∇Q(β*)‖∞ ≈ ‖∇Q(β̄)‖∞. We compare our results from Section 4.1 with the results in (Jalali et al., 2011). For the estimation error, our results give

    ‖β(k) − β*‖ ≈ ‖β(k) − β̄‖ ≤ Ω(∆̄^{1/2} ‖∇Q(β̄)‖∞) ≈ Ω(∆̄^{1/2} ‖∇Q(β*)‖∞),

while the results in (Jalali et al., 2011) give ‖β(k) − β*‖ ≤ Ω(k̄ ‖∇Q(β*)‖∞). Note that ∆̄^{1/2} ≤ k̄^{1/2} ≪ k̄. Therefore, under our assumptions with respect to β* and β̄, our analysis gives a tighter bound. Notably, when there are a large number of strong channels in β̄ (or approximately β*), we will have ∆̄ ≪ k̄.

Let us next consider the condition required for feature selection consistency, that is, supp(F(k)) = supp(F̄) = supp(β*). Our results require

    |β̄_j| ≥ Ω(‖∇Q(β̄)‖∞)    ∀ j ∈ supp(β*),

while the results in (Jalali et al., 2011) require

    |β*_j| ≥ Ω(k̄ ‖∇Q(β*)‖∞)    ∀ j ∈ supp(β*).

When β* ≈ β̄ and ‖∇Q(β*)‖∞ ≈ ‖∇Q(β̄)‖∞, our results guarantee feature selection consistency under a weaker condition.

³ This assumption is not absolutely fair, but it holds in many cases, such as the least square case, as will be made clear in Section 4.3.

    4.3. A Special Case: Least Square Loss

We next consider the least square case, Q(β) = ½‖Xβ − y‖², and show that our analysis of the two algorithms in Section 4.1 fills in the theoretical gap between this special case and the general convex smooth case.

Following previous studies (Candès & Tao, 2007; Zhang, 2011b; Zhao & Yu, 2006), we assume that y = Xβ* + ε, where the entries of ε are independent random sub-Gaussian variables, β* is the true model with support set F̄ and sparsity number k̄ := |F̄|, and X ∈ R^{n×d} is normalized so that ‖X·i‖² = 1 for all columns i = 1, ..., d. We then have the following inequalities with high probability (Zhang, 2009):

    ‖∇Q(β*)‖∞ = ‖X^T ε‖∞ ≤ Ω(√(n⁻¹ log d)),    (7)
    ‖∇Q(β̄)‖∞ ≤ Ω(√(n⁻¹ log d)),    (8)
    ‖β̄ − β*‖² ≤ Ω(n⁻¹ k̄),    (9)
    ‖β̄ − β*‖∞ ≤ Ω(√(n⁻¹ log k̄)),    (10)

implying that β̄ and β* are quite close when the true model is really sparse, that is, when k̄ ≪ n.

An analysis of FoBa-obj in the least square case (Zhang, 2011a) has indicated that the following estimation error bound holds with high probability:

    ‖β(k) − β*‖² ≤ Ω( n⁻¹ ( k̄ + log d · |{j ∈ F̄ : |β*_j| ≤ Ω(√(n⁻¹ log d))}| ) ),    (11)

as well as the following condition for feature selection consistency:

    supp(β(k)) = supp(β*),   if |β*_j| ≥ Ω(√(n⁻¹ log d)) ∀ j ∈ F̄.    (12)

Applying the analysis for the general convex smooth case in (Jalali et al., 2011) to the least square case, one obtains the following estimation error bound from Eq. (7):

    ‖β(k) − β*‖² ≤ Ω(k̄² ‖∇Q(β*)‖²∞) ≤ Ω(n⁻¹ k̄² log d),

and the following condition for feature selection consistency:

    supp(β(k)) = supp(β*),   if |β*_j| ≥ Ω(√(k̄ n⁻¹ log d)) ∀ j ∈ F̄.

One can observe that the general analysis gives a looser bound for the estimation error and requires a stronger condition for feature selection consistency than the analysis for the special case.

Our results in Theorems 4 and 5 bridge this gap when combined with Eqs. (9) and (10). The first inequalities in Theorems 4 and 5 indicate that

    ‖β(k) − β*‖² ≤ (‖β(k) − β̄‖ + ‖β̄ − β*‖)²
        ≤ Ω( n⁻¹ ( k̄ + log d · |{j ∈ F̄ − F(k) : |β̄_j| < Ω(n^{−1/2} √(log d))}| ) )    [from Eq. (9)]
        ≤ Ω( n⁻¹ ( k̄ + log d · |{j ∈ F̄ − F(k) : |β*_j| < Ω(n^{−1/2} √(log d))}| ) ),   [from Eq. (10)]

which is consistent with the result in Eq. (11). The last inequality in Theorem 5 also implies that feature selection consistency is guaranteed as well, as long as |β̄_j| > Ω(√(n⁻¹ log d)) (or |β*_j| > Ω(√(n⁻¹ log d))) for all j ∈ F̄. This requirement agrees with the result in Eq. (12).

5. Application: Sensor Selection for Human Activity Recognition

Machine learning technologies for smart home systems and home energy management systems have recently attracted much attention. Among many promising applications such as optimal energy control, emergency alerts for elderly persons living alone, and automatic life-logging, a fundamental challenge is to recognize human activity at home with the smallest number of sensors. The data mining task here is to minimize the number of sensors without significantly worsening recognition accuracy. We used pyroelectric sensors, which return binary signals in reaction to human motion.

Fig. 1 shows our experimental room layout and sensor locations. The numbers represent sensors, and the ellipsoid around each represents the area it covers. We used 40 sensors, i.e., we observe a 40-dimensional binary time series. A single person lives in the room for roughly one month, and data is collected by manually tagging his activities into the 14 pre-determined categories summarized in Table 1. For data preparation reasons, we use the first 20% (roughly one week) of the samples in the data, and divide them into 10% for training and 10% for testing. The numbers of training and test samples are given in Table 1.

Figure 1. Room layout and sensor locations.

Table 1. Activities in the sensor data set

ID  Activity                          train/test samples
1   Sleeping                          81442/87433
2   Out of Home (OH)                  66096/42478
3   Using Computer                    64640/46254
4   Relaxing                          25290/65312
5   Eating                            6403/6016
6   Cooking                           5249/4633
7   Showering (Bathing)               3930/4977
8   No Event                          3443/3486
9   Using Toilet                      2464/2625
10  Hygiene (brushing teeth, etc.)    1614/1617
11  Dishwashing                       1495/1831
12  Beverage Preparation              1388/1441
13  Bath Cleaning/Preparation         492/320
14  Others                            6549/2072
Total                                 270495/270495

Pyroelectric sensors are preferable to cameras for two practical reasons: cameras tend to create a psychological barrier, and pyroelectric sensors are much cheaper and easier to install in homes. Such sensors only observe noisy binary location information. This means that, for high recognition accuracy, history (sequence) information must be taken into account. The binary time series data follows a linear-chain conditional random field (CRF) (Lafferty et al., 2001; Sutton & McCallum, 2006). A linear-chain CRF gives a smooth and convex loss function; see Supplement D.2 for more details of the CRF.
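The paper's CRF parameterization is detailed in Supplement D.2, which is not reproduced here; purely for intuition, the following is a generic linear-chain CRF negative log-likelihood sketch (the arrays scores and transitions are placeholder parameters) illustrating why the loss is smooth and, since the scores enter linearly, convex in the parameters.

```python
import numpy as np
from scipy.special import logsumexp

def crf_negative_log_likelihood(scores, transitions, labels):
    """Negative log-likelihood of one label sequence under a linear-chain CRF.
    scores:      (T, K) per-position label scores (e.g. weights applied to
                 the sensor/activity indicator features),
    transitions: (K, K) activity-activity transition scores,
    labels:      (T,) observed activity ids in [0, K)."""
    T, K = scores.shape
    # Unnormalized log-score of the observed label path.
    path = scores[0, labels[0]] + sum(
        transitions[labels[t - 1], labels[t]] + scores[t, labels[t]] for t in range(1, T)
    )
    # Log-partition function via the forward recursion (log-sum-exp for stability).
    alpha = scores[0]                                    # shape (K,)
    for t in range(1, T):
        alpha = scores[t] + logsumexp(alpha[:, None] + transitions, axis=0)
    log_Z = logsumexp(alpha)
    return log_Z - path

# Toy check with placeholder parameters.
rng = np.random.default_rng(0)
T, K = 6, 3
scores = rng.standard_normal((T, K))
transitions = rng.standard_normal((K, K))
labels = rng.integers(0, K, size=T)
print(crf_negative_log_likelihood(scores, transitions, labels))
```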

Our task then is sensor selection on the basis of noisy binary time series data, and to do this we apply our FoBa-gdt-CRF (FoBa-gdt with the CRF objective function). Since it is very expensive to evaluate the CRF objective value and its gradient, FoBa-obj becomes impractical in this case (the large number of optimization problems in the forward step makes it computationally very expensive). Here, we consider a sensor to have been "used" if at least one feature related to it is used in the CRF. Note that we have 14 activity-signal binary features (i.e., indicators of simultaneous sensor/activity activations) for each single sensor, and therefore we have 40 × 14 = 560 such features in total. In addition, we have 14 × 14 = 196 activity-activity binary features (i.e., indicators of the activities at times t − 1 and t). As explained in Section D.2, we only enforce sparsity on the first type of features. A sketch of how such indicator features can be assembled is given below.
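As an illustration of this feature layout only (the exact construction is in the paper's supplement, which we do not reproduce), the following sketch builds the 560 sensor-activity indicator features and the 196 activity-activity transition features from a binary sensor matrix and an activity label sequence; the function name, array names, and shapes are our assumptions.

```python
import numpy as np

def crf_features(signals, labels, n_sensors=40, n_activities=14):
    """signals: (T, n_sensors) binary array; labels: (T,) activity ids in [0, n_activities).
    Returns per-time-step feature vectors:
      - sensor/activity indicators (n_sensors * n_activities = 560 features),
      - activity/activity transition indicators (n_activities ** 2 = 196 features)."""
    T = signals.shape[0]
    obs = np.zeros((T, n_sensors * n_activities))
    trans = np.zeros((T, n_activities * n_activities))
    for t in range(T):
        a = labels[t]
        # indicator of "sensor s fires while activity a is happening"
        obs[t, a * n_sensors:(a + 1) * n_sensors] = signals[t]
        if t > 0:
            # indicator of the transition (labels[t-1] -> labels[t])
            trans[t, labels[t - 1] * n_activities + a] = 1.0
    return obs, trans

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
T = 100
signals = (rng.random((T, 40)) < 0.05).astype(float)
labels = rng.integers(0, 14, size=T)
obs, trans = crf_features(signals, labels)
print(obs.shape, trans.shape)  # (100, 560) (100, 196)
```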

First we compare FoBa-gdt-CRF with Forward-gdt-CRF (Forward-gdt with the CRF loss function) and L1-CRF⁴ in terms of test recognition error as a function of the number of sensors selected (see the top of Fig. 2).

⁴ L1-CRF solves the optimization problem with the CRF loss plus L1 regularization. Since it is difficult to search the whole space of L1 regularization parameter values, we investigated a number of discrete values.

Figure 2. Top: comparison of FoBa-gdt-CRF, Forward-gdt-CRF, and L1-CRF. Bottom: test error rates (FoBa-gdt-CRF) for individual activities.

We can observe that:

• The performance of all methods improves as the number of sensors increases.

• FoBa-gdt-CRF and Forward-gdt-CRF achieve comparable performance. However, FoBa-gdt-CRF reduces the error rate slightly faster in terms of the number of sensors.

• FoBa-gdt-CRF achieves its best performance with 14-15 sensors, while Forward-gdt-CRF needs 17-18 sensors to achieve the same error level. We obtain sufficient accuracy using fewer than 40 sensors.

• FoBa-gdt-CRF consistently requires fewer features than Forward-gdt-CRF to achieve the same error level when using the same number of sensors.

We also analyze the test error rates of FoBa-gdt-CRF for individual activities. We consider two cases, with the number of sensors being 10 and 15, and report the test error rates for each individual activity at the bottom of Fig. 2. We observe that:

• The high frequency activities (e.g., activities {1, 2, 3, 4}) are well recognized in both cases. In other words, FoBa-gdt-CRF is likely to select sensors (features) which contribute to the discrimination of high frequency activities.

• The error rates for activities {5, 7, 9} improve significantly when the number of sensors increases from 10 to 15. Activities 7 and 9 are Showering and Using Toilet, and the use of the additional sensors {36, 37, 40} seems to have contributed to this improvement. Also, a dinner table was located near sensor 2, which is why the error rate for activity 5 (Eating) decreases significantly from the 10-sensor case to the 15-sensor case, which includes sensor 2.

Table 2. Sensor IDs selected by FoBa-gdt-CRF.

# of sensors = 10: {1, 4, 5, 9, 10, 13, 19, 28, 34, 38}
# of sensors = 15: {# of sensors = 10} + {2, 7, 36, 37, 40}

6. Additional Experiments

Although this paper mainly focuses on the theoretical analysis of FoBa-obj and FoBa-gdt, we conduct additional experiments to study the behaviors of FoBa-obj and FoBa-gdt in the Supplement (part D). We mainly consider two models, logistic regression and linear-chain CRFs, and compare four algorithms (FoBa-gdt, FoBa-obj, and their corresponding forward greedy algorithms) on both synthetic data and real world data. We empirically show, in application to logistic regression and conditional random fields (CRFs) (Lafferty et al., 2001), that FoBa-gdt achieves empirical performance similar to FoBa-obj with a much lower computational cost. Note that comparisons of FoBa-obj and L1-regularization methods can be found in previous studies (Jalali et al., 2011; Zhang, 2011a).

7. Conclusion

This paper considers two forward-backward greedy methods, the state-of-the-art greedy method FoBa-obj and its variant FoBa-gdt, which is more efficient than FoBa-obj, for solving sparse feature selection problems with general convex smooth functions. We systematically analyze the theoretical properties of both algorithms. Our main contributions include: 1) we derive better theoretical bounds for FoBa-obj and FoBa-gdt than the existing analysis of FoBa-obj in (Jalali et al., 2011) for general smooth convex functions; our result also suggests that the NP-hard problem (1) can be solved by FoBa-obj and FoBa-gdt if the signal-to-noise ratio is big enough; 2) our new bounds are consistent with the bounds of a special case (least squares) (Zhang, 2011a) and fill a previously existing theoretical gap for general convex smooth functions (Jalali et al., 2011); 3) we show that the restricted strong convexity condition is satisfied in commonly used machine learning problems given enough independent samples; 4) we apply FoBa-gdt (with the conditional random field objective) to the sensor selection problem for human indoor activity recognition, and our results show that FoBa-gdt can successfully remove unnecessary sensors and is able to select more valuable sensors than other methods (including the ones based on forward greedy selection and L1-regularization). As future work, we plan to extend the FoBa algorithms to minimize a general convex smooth function over a low rank constraint.


References

Bahmani, S., Boufounos, P., and Raj, B. Greedy sparsity-constrained optimization. ASILOMAR, pp. 1148–1152, 2011.

Buhlmann, P. Boosting for high-dimensional linear models. Annals of Statistics, 34:559–583, 2006.

Candès, E. J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

Candès, E. J. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

Chickering, D. M. and Boutilier, C. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

Jalali, A., Johnson, C. C., and Ravikumar, P. D. On learning discrete graphical models using greedy methods. NIPS, 2011.

Johnson, C. C., Jalali, A., and Ravikumar, P. D. High-dimensional sparse inverse covariance estimation using greedy methods. Journal of Machine Learning Research - Proceedings Track, 22:574–582, 2012.

Kleinbaum, D. G. and Klein, M. Logistic regression. Statistics for Biology and Health, pp. 103–127, 2010.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, pp. 282–289, 2001.

Liu, J., Wonka, P., and Ye, J. A multi-stage framework for Dantzig selector and LASSO. Journal of Machine Learning Research, 13:1189–1219, 2012.

Needell, D. and Tropp, J. A. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26:301–321, 2008.

Needell, D. and Vershynin, R. Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9(3):317–334, 2009.

Negahban, S., Ravikumar, P. D., Wainwright, M. J., and Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. CoRR, abs/1010.2731, 2010.

Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

Shalev-Shwartz, S., Srebro, N., and Zhang, T. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.

Sutton, C. and McCallum, A. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, pp. 93–128, 2006.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

Tropp, J. A. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2011.

Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science, 27(4), 2012.

Zhang, T. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.

Zhang, T. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011a.

Zhang, T. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9):6215–6221, 2011b.

Zhao, P. and Yu, B. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.


Supplemental Material: Proofs

This supplemental material provides the proofs of our main results, including the properties of FoBa-obj (Theorems 1 and 4) and FoBa-gdt (Theorems 2 and 5), and the analysis of the RSCC condition (Theorem 3).

Our proofs for Theorems 1, 2, 4, and 5 borrow many tools from the earlier literature, including Zhang (2011b), Jalali et al. (2011), Johnson et al. (2012), Zhang (2009), and Zhang (2011a). The key novelty in our proof is to develop a key property of the backward step (see Lemmas 3 and 7), which gives an upper bound on the number of wrong features included in the feature pool. By taking full advantage of this upper bound, the presented proofs of our main results (i.e., Theorems 1, 2, 4, and 5) are significantly simplified. This avoids the complicated induction procedure in the proof for the forward greedy method (Zhang, 2011b) and also improves the existing analysis of the same problem in Jalali et al. (2011).

A. Proofs of FoBa-obj

First we prove the properties of the FoBa-obj algorithm, particularly Theorems 4 and 1.

Lemmas 1 and 2 build up the dependence between the objective function change δ and the gradient |∇Q(β)_j|. Lemma 3 studies the effect of the backward step. Basically, it gives an upper bound on |F(k) − F̄|, which matches our intuition that the backward step is used to control the size of the feature pool. Lemma 4 shows that if δ is big enough, which means that the algorithm terminates early, then Q(β(k)) cannot be smaller than Q(β̄). Lemma 5 studies the forward step of FoBa-obj. It shows that the objective decreases sufficiently in every forward step, which together with Lemma 4 (that is, Q(β(k)) > Q(β̄)) implies that the algorithm will terminate within a limited number of iterations. Based on these results, we provide the complete proofs of Theorems 4 and 1 at the end of this section.

Lemma 1. If Q(β) − min_η Q(β + ηe_j) ≥ δ, then we have

    |∇Q(β)_j| ≥ √(2ρ−(1)δ).

Proof. From the condition, we have

    −δ ≥ min_η Q(β + ηe_j) − Q(β)
       ≥ min_η ⟨ηe_j, ∇Q(β)⟩ + (ρ−(1)/2)η²    (from the definition of ρ−(1))
       = min_η (ρ−(1)/2)(η + ∇Q(β)_j/ρ−(1))² − |∇Q(β)_j|²/(2ρ−(1))
       = −|∇Q(β)_j|²/(2ρ−(1)).

It indicates that |∇Q(β)_j| ≥ √(2ρ−(1)δ).

Lemma 2. If Q(β) − min_{j,η} Q(β + ηe_j) ≤ δ, then we have

    ‖∇Q(β)‖∞ ≤ √(2ρ+(1)δ).

Proof. From the condition, we have

    δ ≥ Q(β) − min_{η,j} Q(β + ηe_j)
      = max_{η,j} Q(β) − Q(β + ηe_j)
      ≥ max_{η,j} −⟨ηe_j, ∇Q(β)⟩ − (ρ+(1)/2)η²
      = max_{η,j} −(ρ+(1)/2)(η + ∇Q(β)_j/ρ+(1))² + |∇Q(β)_j|²/(2ρ+(1))
      = max_j |∇Q(β)_j|²/(2ρ+(1))
      = ‖∇Q(β)‖²∞/(2ρ+(1)).

It indicates that ‖∇Q(β)‖∞ ≤ √(2ρ+(1)δ).

Lemma 3. (General backward criteria.) Consider β(k) with support F(k) at the beginning of each iteration of Algorithm 1 with FoBa-obj (here β(k) is not necessarily the output of this algorithm). For any β̄ ∈ R^d with support F̄, we have

    ‖β(k)_{F(k)−F̄}‖² = ‖(β(k) − β̄)_{F(k)−F̄}‖² ≥ (δ(k)/ρ+(1)) |F(k) − F̄| ≥ (δ/ρ+(1)) |F(k) − F̄|.    (13)

Proof. We have

    |F(k) − F̄| min_{j∈F(k)} Q(β(k) − β(k)_j e_j)
        ≤ Σ_{j∈F(k)−F̄} Q(β(k) − β(k)_j e_j)
        ≤ Σ_{j∈F(k)−F̄} [ Q(β(k)) − ∇Q(β(k))_j β(k)_j + (ρ+(1)/2)(β(k)_j)² ]
        ≤ |F(k) − F̄| Q(β(k)) + (ρ+(1)/2) ‖(β(k) − β̄)_{F(k)−F̄}‖².

The second inequality is due to ∇Q(β(k))_j = 0 for any j ∈ F(k) − F̄, and the third inequality is from β̄_{F(k)−F̄} = 0. It follows that

    (ρ+(1)/2) ‖(β(k) − β̄)_{F(k)−F̄}‖² ≥ |F(k) − F̄| ( min_{j∈F(k)} Q(β(k) − β(k)_j e_j) − Q(β(k)) ) ≥ |F(k) − F̄| δ(k)/2,

which implies the claim using δ(k) ≥ δ.

Lemma 4. Let β̄ = argmin_{supp(β)⊂F̄} Q(β). Consider β(k) at the beginning of each iteration of Algorithm 1 with FoBa-obj, and denote its support by F(k). Let s be any integer larger than |F(k) − F̄|. If one takes δ > 4ρ+(1)/ρ−(s)² ‖∇Q(β̄)‖²∞ in FoBa-obj, then we have Q(β(k)) ≥ Q(β̄).

Proof. We have

    Q(β(k)) − Q(β̄)
      ≥ ⟨∇Q(β̄), β(k) − β̄⟩ + (ρ−(s)/2)‖β(k) − β̄‖²
      ≥ ⟨∇Q(β̄)_{F(k)−F̄}, (β(k) − β̄)_{F(k)−F̄}⟩ + (ρ−(s)/2)‖β(k) − β̄‖²
      ≥ −‖∇Q(β̄)‖∞ ‖(β(k) − β̄)_{F(k)−F̄}‖₁ + (ρ−(s)/2)‖β(k) − β̄‖²
      ≥ −‖∇Q(β̄)‖∞ √|F(k) − F̄| ‖β(k) − β̄‖ + (ρ−(s)/2)‖β(k) − β̄‖²
      ≥ −‖∇Q(β̄)‖∞ ‖β(k) − β̄‖² √(ρ+(1)/δ) + (ρ−(s)/2)‖β(k) − β̄‖²    (from Lemma 3)
      = ( ρ−(s)/2 − ‖∇Q(β̄)‖∞ √(ρ+(1)/δ) ) ‖β(k) − β̄‖²
      > 0.    (from δ > 4ρ+(1)/ρ−(s)² ‖∇Q(β̄)‖²∞)

This proves the claim.

    Proof of Theorem 4

    Proof. From Lemma 4, we only need to consider the case Q(β(k)) ≥ Q(β̄). We have

    0 ≥ Q(β̄)−Q(β(k)) ≥ 〈OQ(β(k)), β̄ − β(k)〉+ ρ−(s)2‖β̄ − β(k)‖2,

    which implies that

    ρ−(s)

    2‖β̄ − β(k)‖2 ≤− 〈OQ(β(k)), β̄ − β(k)〉

    =− 〈OQ(β(k))F̄−F (k) , (β̄ − β(k))F̄−F (k)〉≤‖OQ(β(k))F̄−F (k)‖∞‖(β̄ − β(k))F̄−F (k)‖1

    ≤√

    2ρ+(1)δ√|F̄ − F (k)||β̄ − β(k)‖. (from Lemma 2)

    We obtain

    ‖β̄ − β(k)‖ ≤2√

    2ρ+(1)δ

    ρ−(s)

    √|F̄ − F (k)|. (14)

    It follows that

    8ρ+(1)δ

    ρ−(s)2|F̄ − F (k)| ≥‖β̄ − β(k)‖2

    ≥‖(β̄ − β(k))F̄−F (k)‖2

    =‖β̄F̄−F (k)‖2

    ≥γ2|{j ∈ F̄ − F (k) : |β̄j | ≥ γ}|

    =16ρ+(1)δ

    ρ−(s)2|{j ∈ F̄ − F (k) : |β̄j | ≥ γ}|,

    which implies that

    |F̄ − F (k)| ≥ 2|{j ∈ F̄ − F (k) : |β̄j | ≥ γ}| = 2(|F̄ − F (k)| − |{j ∈ F̄ − F (k) : |β̄j | < γ}|)⇒|F̄ − F (k)| ≤ 2|{j ∈ F̄ − F (k) : |β̄j | < γ}|.

  • 1320132113221323132413251326132713281329133013311332133313341335133613371338133913401341134213431344134513461347134813491350135113521353135413551356135713581359136013611362136313641365136613671368136913701371137213731374

    1375137613771378137913801381138213831384138513861387138813891390139113921393139413951396139713981399140014011402140314041405140614071408140914101411141214131414141514161417141814191420142114221423142414251426142714281429

    Forward-Backward Greedy Algorithms for General Convex Smooth Functions over A Cardinality Constraint

    The first inequality is obtained from Eq. (14)

    ‖β̄ − β(k)‖ ≤2√

    2ρ+(1)δ

    ρ−(s)

    √|F̄ − F (k)| ≤

    4√ρ+(1)δ

    ρ−(s)

    √|j ∈ F̄ − F (k) : |β̄j | < γ|.

Next, we consider

Q(β̄) − Q(β(k))
≥ ⟨∇Q(β(k)), β̄ − β(k)⟩ + (ρ−(s)/2) ‖β̄ − β(k)‖²
= ⟨∇Q(β(k))_{F̄−F(k)}, (β̄ − β(k))_{F̄−F(k)}⟩ + (ρ−(s)/2) ‖β̄ − β(k)‖²   (from ∇Q(β(k))_{F(k)} = 0)
≥ −‖∇Q(β(k))‖∞ ‖β̄ − β(k)‖₁ + (ρ−(s)/2) ‖β̄ − β(k)‖²
≥ −√(2ρ+(1)δ) √|F̄ − F(k)| ‖β̄ − β(k)‖ + (ρ−(s)/2) ‖β̄ − β(k)‖²
≥ (ρ−(s)/2) ( ‖β̄ − β(k)‖ − √(2ρ+(1)δ) √|F̄ − F(k)|/ρ−(s) )² − 2ρ+(1)δ |F̄ − F(k)|/(2ρ−(s))
≥ −2ρ+(1)δ |F̄ − F(k)|/(2ρ−(s))
≥ −(2ρ+(1)δ/ρ−(s)) |{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

This proves the second inequality. The last inequality in this theorem is obtained from the following:

(δ/ρ+(1)) |F(k) − F̄| ≤ ‖(β(k) − β̄)_{F(k)−F̄}‖²   (from Lemma 3)
≤ ‖β(k) − β̄‖²
≤ (8ρ+(1)δ/ρ−(s)²) |F̄ − F(k)|
≤ (16ρ+(1)δ/ρ−(s)²) |{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

This completes the proof.

Lemma 5. (General forward step of FoBa-obj) Let β = arg min_{supp(β)⊂F} Q(β). For any β′ with support F′ and any i ∈ {i : Q(β) − min_η Q(β + ηe_i) ≥ Q(β) − min_{j,η} Q(β + ηe_j)}, we have

|F′ − F| ( Q(β) − min_η Q(β + ηe_i) ) ≥ (ρ−(s)/ρ+(1)) ( Q(β) − Q(β′) ).


Proof. We have that the following holds for any η:

|F′ − F| min_{j∈F′−F, η} Q(β + η(β′_j − β_j)e_j)
≤ Σ_{j∈F′−F} Q(β + η(β′_j − β_j)e_j)
≤ Σ_{j∈F′−F} [ Q(β) + η(β′_j − β_j)∇Q(β)_j + (ρ+(1)/2)(β_j − β′_j)²η² ]
≤ |F′ − F| Q(β) + η⟨(β′ − β)_{F′−F}, ∇Q(β)_{F′−F}⟩ + (ρ+(1)/2) ‖β′ − β‖²η²
= |F′ − F| Q(β) + η⟨β′ − β, ∇Q(β)⟩ + (ρ+(1)/2) ‖β′ − β‖²η²
≤ |F′ − F| Q(β) + η( Q(β′) − Q(β) − (ρ−(s)/2) ‖β − β′‖² ) + (ρ+(1)/2) ‖β′ − β‖²η².

Minimizing the right-hand side over η, we obtain

|F′ − F| ( min_{j∈F′−F, η} Q(β + η(β′_j − β_j)e_j) − Q(β) )
≤ min_η η( Q(β′) − Q(β) − (ρ−(s)/2) ‖β − β′‖² ) + (ρ+(1)/2) ‖β′ − β‖²η²
≤ −( Q(β′) − Q(β) − (ρ−(s)/2) ‖β − β′‖² )²/(2ρ+(1) ‖β′ − β‖²)
≤ −4( Q(β) − Q(β′) )(ρ−(s)/2) ‖β − β′‖²/(2ρ+(1) ‖β′ − β‖²)   (from (a + b)² ≥ 4ab)
= (ρ−(s)/ρ+(1)) ( Q(β′) − Q(β) ).   (15)

It follows that

|F′ − F| ( Q(β) − min_η Q(β + ηe_i) )
≥ |F′ − F| ( Q(β) − min_{j∈F′−F, η} Q(β + ηe_j) )
≥ |F′ − F| ( Q(β) − min_{j∈F′−F, η} Q(β + η(β′_j − β_j)e_j) )
≥ (ρ−(s)/ρ+(1)) ( Q(β) − Q(β′) ),   (from Eq. (15))

which proves the claim.

    Proof of Theorem 1

Proof. Assume that the algorithm terminates at some value of k larger than s − k̄, and consider the first time k = s − k̄. Denote the support of β(k) by F(k). Let F′ = F̄ ∪ F(k) and β′ = arg min_{supp(β)⊂F′} Q(β). One can easily verify that


|F̄ ∪ F(k)| ≤ s. We consider the most recent step, k − 1, and have

δ(k) = Q(β(k−1)) − min_{η,i} Q(β(k−1) + ηe_i)
≥ (ρ−(s)/(ρ+(1)|F′ − F(k−1)|)) ( Q(β(k−1)) − Q(β′) )
≥ (ρ−(s)/(ρ+(1)|F′ − F(k−1)|)) ( Q(β(k)) − Q(β′) )
≥ (ρ−(s)/(ρ+(1)|F′ − F(k−1)|)) ( ⟨∇Q(β′), β(k) − β′⟩ + (ρ−(s)/2) ‖β(k) − β′‖² )
≥ (ρ−(s)²/(2ρ+(1)|F′ − F(k−1)|)) ‖β(k) − β′‖²,   (from ∇Q(β′)_{F′} = 0)   (16)

where the first inequality is due to Lemma 5 and the second inequality is due to Q(β(k)) ≤ Q(β(k−1)). From Lemma 3, we have

δ(k) ≤ (2ρ+(1)/|F(k) − F̄|) ‖β(k) − β̄‖² = (2ρ+(1)/|F′ − F̄|) ‖β(k) − β̄‖².   (17)

Combining Eq. (16) and (17), we obtain

‖β(k) − β̄‖² ≥ (ρ−(s)/(2ρ+(1)))² ( |F′ − F̄|/|F′ − F(k−1)| ) ‖β(k) − β′‖²,

which implies that ‖β(k) − β̄‖ ≥ t‖β(k) − β′‖, where

t := (ρ−(s)/(2ρ+(1))) √( |F′ − F̄|/|F′ − F(k−1)| )
= (ρ−(s)/(2ρ+(1))) √( |F(k) − F(k) ∩ F̄|/(|F′ − F(k)| + 1) )
= (ρ−(s)/(2ρ+(1))) √( |F(k) − F(k) ∩ F̄|/(|F̄ − F(k) ∩ F̄| + 1) )
= (ρ−(s)/(2ρ+(1))) √( (|F(k)| − |F(k) ∩ F̄|)/(|F̄| − |F(k) ∩ F̄| + 1) )
= (ρ−(s)/(2ρ+(1))) √( (k − |F(k) ∩ F̄|)/(k̄ − |F(k) ∩ F̄| + 1) )
≥ (ρ−(s)/(2ρ+(1))) √( (s − k̄)/(k̄ + 1) )   (from k = s − k̄ and s ≥ 2k̄ + 1; the ratio is nondecreasing in |F(k) ∩ F̄| when k ≥ k̄ + 1, so it is minimized at |F(k) ∩ F̄| = 0)
≥ √(ρ+(s)/ρ−(s)) + 1.   (from the assumption on s)

It follows that

t‖β(k) − β′‖ ≤ ‖β(k) − β̄‖ ≤ ‖β(k) − β′‖ + ‖β′ − β̄‖,   (from Eq. (3))

which implies that

(t − 1) ‖β(k) − β′‖ ≤ ‖β′ − β̄‖.   (18)


Next, we have

Q(β(k)) − Q(β̄) = Q(β(k)) − Q(β′) + Q(β′) − Q(β̄)
≤ (ρ+(s)/2) ‖β(k) − β′‖² − (ρ−(s)/2) ‖β′ − β̄‖²
≤ ((ρ+(s) − ρ−(s)(t − 1)²)/2) ‖β(k) − β′‖²
≤ 0.

Since the sequence Q(β(k)) is strictly decreasing, we know that Q(β(k+1)) < Q(β(k)) ≤ Q(β̄). However, from Lemma 4 we also have Q(β̄) ≤ Q(β(k+1)), which leads to a contradiction. This indicates that the algorithm terminates at some integer not greater than s − k̄.

B. Proofs of FoBa-gdt

Next we consider Algorithm 1 with FoBa-gdt; specifically, we will prove Theorem 5 and Theorem 2. Our proof strategy is similar to that for FoBa-obj.

Lemma 6. If |∇Q(β)_j| ≥ ε, then we have

Q(β) − min_η Q(β + ηe_j) ≥ ε²/(2ρ+(1)).

Proof. Consider the left-hand side:

Q(β) − min_η Q(β + ηe_j)
= max_η Q(β) − Q(β + ηe_j)
≥ max_η −⟨∇Q(β), ηe_j⟩ − (ρ+(1)/2) η²
= max_η −η∇Q(β)_j − (ρ+(1)/2) η²
= |∇Q(β)_j|²/(2ρ+(1))   (attained at η = −∇Q(β)_j/ρ+(1))
≥ ε²/(2ρ+(1)).

This completes the proof.

This lemma implies that δ(k0) ≥ ε²/(2ρ+(1)) for all k0 = 1, · · · , k, and that

Q(β(k0−1)) − min_η Q(β(k0−1) + ηe_j) ≥ δ(k0) ≥ ε²/(2ρ+(1)),

if |∇Q(β(k0−1))_j| ≥ ε.
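As a quick numerical sanity check of Lemma 6 (this is not part of the original analysis), the following Python sketch verifies the bound on a toy quadratic objective, where the one-coordinate minimization is available in closed form; the choice of objective and all variable names are illustrative assumptions.

```python
import numpy as np

# Toy smooth objective Q(b) = 0.5 * b^T A b - c^T b with A positive definite.
# For a quadratic, the restricted smoothness constant rho_+(1) over 1-sparse
# directions is max_j A[j, j].
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 5))
A = M.T @ M / 20 + 0.1 * np.eye(5)
c = rng.standard_normal(5)

Q = lambda b: 0.5 * b @ A @ b - c @ b
grad = lambda b: A @ b - c

rho_plus_1 = np.max(np.diag(A))
beta = rng.standard_normal(5)
g = grad(beta)

for j in range(5):
    eta_star = -g[j] / A[j, j]                       # exact minimizer of Q(beta + eta * e_j)
    decrease = Q(beta) - Q(beta + eta_star * np.eye(5)[j])
    assert decrease >= g[j] ** 2 / (2 * rho_plus_1) - 1e-12   # the Lemma 6 bound
```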

Lemma 7. (General backward criteria) Consider β(k) with support F(k) at the beginning of each iteration of Algorithm 1 with FoBa-gdt. For any β̄ ∈ Rd with support F̄, we have

‖β(k)_{F(k)−F̄}‖² = ‖(β(k) − β̄)_{F(k)−F̄}‖² ≥ (δ(k)/ρ+(1)) |F(k) − F̄| ≥ (ε²/(2ρ+(1)²)) |F(k) − F̄|.   (19)


Proof. We have

|F(k) − F̄| min_{j∈F(k)} Q(β(k) − β(k)_j e_j) ≤ Σ_{j∈F(k)−F̄} Q(β(k) − β(k)_j e_j)
≤ Σ_{j∈F(k)−F̄} [ Q(β(k)) − ∇Q(β(k))_j β(k)_j + (ρ+(1)/2)(β(k)_j)² ]
≤ |F(k) − F̄| Q(β(k)) + (ρ+(1)/2) ‖(β(k) − β̄)_{F(k)−F̄}‖².

The second inequality is due to ∇Q(β(k))_j = 0 for any j ∈ F(k) − F̄, and the third inequality is from β̄_{F(k)−F̄} = 0. It follows that

(ρ+(1)/2) ‖(β(k) − β̄)_{F(k)−F̄}‖² ≥ |F(k) − F̄| ( min_{j∈F(k)} Q(β(k) − β(k)_j e_j) − Q(β(k)) ) ≥ |F(k) − F̄| δ(k)/2,

which implies the claim using δ(k) ≥ ε²/(2ρ+(1)).

Lemma 8. Let β̄ = arg min_{supp(β)⊆F̄} Q(β). Consider β(k) at the beginning of each iteration of Algorithm 1 with FoBa-gdt and denote its support by F(k). Let s be any integer larger than |F(k) − F̄|. If ε > (2√2 ρ+(1)/ρ−(s)) ‖∇Q(β̄)‖∞ is chosen in FoBa-gdt, then we have Q(β(k)) ≥ Q(β̄).

Proof. Suppose Q(β(k)) < Q(β̄). Then we have

0 > Q(β(k)) − Q(β̄)
≥ ⟨∇Q(β̄), β(k) − β̄⟩ + (ρ−(s)/2) ‖β(k) − β̄‖²
≥ ⟨∇Q(β̄)_{F(k)−F̄}, (β(k) − β̄)_{F(k)−F̄}⟩ + (ρ−(s)/2) ‖β(k) − β̄‖²
≥ −‖∇Q(β̄)‖∞ ‖(β(k) − β̄)_{F(k)−F̄}‖₁ + (ρ−(s)/2) ‖β(k) − β̄‖²
≥ −‖∇Q(β̄)‖∞ √|F(k) − F̄| ‖β(k) − β̄‖ + (ρ−(s)/2) ‖β(k) − β̄‖²
≥ −‖∇Q(β̄)‖∞ ‖β(k) − β̄‖² √2 ρ+(1)/ε + (ρ−(s)/2) ‖β(k) − β̄‖²   (from Lemma 7)
= ( ρ−(s)/2 − ‖∇Q(β̄)‖∞ √2 ρ+(1)/ε ) ‖β(k) − β̄‖²
> 0,   (from ε > (2√2 ρ+(1)/ρ−(s)) ‖∇Q(β̄)‖∞)

which is a contradiction. This proves the claim.

    Proof of Theorem 5

    Proof. From Lemma 8, we only need to consider the case Q(β(k)) ≥ Q(β̄). We have

0 ≥ Q(β̄) − Q(β(k)) ≥ ⟨∇Q(β(k)), β̄ − β(k)⟩ + (ρ−(s)/2) ‖β̄ − β(k)‖²,


which implies that

(ρ−(s)/2) ‖β̄ − β(k)‖² ≤ −⟨∇Q(β(k)), β̄ − β(k)⟩
= −⟨∇Q(β(k))_{F̄−F(k)}, (β̄ − β(k))_{F̄−F(k)}⟩
≤ ‖∇Q(β(k))_{F̄−F(k)}‖∞ ‖(β̄ − β(k))_{F̄−F(k)}‖₁
≤ ε √|F̄ − F(k)| ‖β̄ − β(k)‖.

We obtain

‖β̄ − β(k)‖ ≤ (2ε/ρ−(s)) √|F̄ − F(k)|.   (20)

It follows that

(4ε²/ρ−(s)²) |F̄ − F(k)| ≥ ‖β̄ − β(k)‖²
≥ ‖(β̄ − β(k))_{F̄−F(k)}‖²
= ‖β̄_{F̄−F(k)}‖²
≥ γ² |{j ∈ F̄ − F(k) : |β̄_j| ≥ γ}|
= (8ε²/ρ−(s)²) |{j ∈ F̄ − F(k) : |β̄_j| ≥ γ}|,

which implies that

|F̄ − F(k)| ≥ 2|{j ∈ F̄ − F(k) : |β̄_j| ≥ γ}| = 2(|F̄ − F(k)| − |{j ∈ F̄ − F(k) : |β̄_j| < γ}|)
⇒ |F̄ − F(k)| ≤ 2|{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

The first inequality is obtained from Eq. (20):

‖β̄ − β(k)‖ ≤ (2ε/ρ−(s)) √|F̄ − F(k)| ≤ (2√2 ε/ρ−(s)) √|{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

Next, we consider

Q(β̄) − Q(β(k))
≥ ⟨∇Q(β(k)), β̄ − β(k)⟩ + (ρ−(s)/2) ‖β̄ − β(k)‖²
= ⟨∇Q(β(k))_{F̄−F(k)}, (β̄ − β(k))_{F̄−F(k)}⟩ + (ρ−(s)/2) ‖β̄ − β(k)‖²   (from ∇Q(β(k))_{F(k)} = 0)
≥ −‖∇Q(β(k))‖∞ ‖β̄ − β(k)‖₁ + (ρ−(s)/2) ‖β̄ − β(k)‖²
≥ −ε √|F̄ − F(k)| ‖β̄ − β(k)‖ + (ρ−(s)/2) ‖β̄ − β(k)‖²
≥ (ρ−(s)/2) ( ‖β̄ − β(k)‖ − ε √|F̄ − F(k)|/ρ−(s) )² − ε² |F̄ − F(k)|/(2ρ−(s))
≥ −ε² |F̄ − F(k)|/(2ρ−(s))
≥ −(ε²/ρ−(s)) |{j ∈ F̄ − F(k) : |β̄_j| < γ}|.


This proves the second inequality. The last inequality in this theorem is obtained from

(ε²/(2ρ+(1)²)) |F(k) − F̄| ≤ ‖(β(k) − β̄)_{F(k)−F̄}‖²   (from Lemma 7)
≤ ‖β(k) − β̄‖²
≤ (4ε²/ρ−(s)²) |F̄ − F(k)|
≤ (8ε²/ρ−(s)²) |{j ∈ F̄ − F(k) : |β̄_j| < γ}|.

This completes the proof.

Next we study the upper bound on "k" when Algorithm 1 terminates.

Lemma 9. (General forward step) Let β = arg min_{supp(β)⊂F} Q(β). For any β′ with support F′ and any i ∈ {i : |∇Q(β)_i| ≥ max_j |∇Q(β)_j|}, we have

|F′ − F| ( Q(β) − min_η Q(β + ηe_i) ) ≥ (ρ−(s)/ρ+(1)) ( Q(β) − Q(β′) ).

Proof. The following proof follows the idea of Lemma A.3 in (Zhang, 2011b). Denote i* = arg max_i |∇Q(β)_i|. For all j ∈ {1, · · · , d}, we define

P_j(η) = η sgn(β′_j) ∇Q(β)_j + (ρ+(1)/2) η².

It is easy to verify that min_η P_j(η) = −|∇Q(β)_j|²/(2ρ+(1)) and

min_η P_i(η) ≤ min_η P_{i*}(η) = min_{j,η} P_j(η) ≤ min_j P_j(η).   (21)

Denoting u = ‖β′_{F′−F}‖₁, we obtain that

u min_j P_j(η) ≤ Σ_{j∈F′−F} |β′_j| P_j(η)
= η Σ_{j∈F′−F} β′_j ∇Q(β)_j + (uρ+(1)/2) η²
= η⟨β′_{F′−F}, ∇Q(β)_{F′−F}⟩ + (uρ+(1)/2) η²
= η⟨β′ − β, ∇Q(β)⟩ + (uρ+(1)/2) η²   (from ∇Q(β)_F = 0)
≤ η( Q(β′) − Q(β) − (ρ−(s)/2) ‖β′ − β‖² ) + (uρ+(1)/2) η².

Taking η = (Q(β) − Q(β′) + (ρ−(s)/2) ‖β′ − β‖²)/(uρ+(1)) on both sides, we get

u min_j P_j(η) ≤ −( Q(β) − Q(β′) + (ρ−(s)/2) ‖β − β′‖² )²/(2ρ+(1)u) ≤ −( Q(β) − Q(β′) ) ρ−(s) ‖β − β′‖²/(ρ+(1)u),

where we use the inequality (a + b)² ≥ 4ab. From Eq. (21), we have

min_η P_i(η) ≤ min_j P_j(η)
≤ −( Q(β) − Q(β′) ) ρ−(s) ‖β − β′‖²/(ρ+(1)u²)
≤ −(ρ−(s)/(ρ+(1)|F′ − F|)) ( Q(β) − Q(β′) ),


where we use ‖β − β′‖²/u² = ‖β − β′‖²/‖(β − β′)_{F′−F}‖₁² ≥ 1/|F′ − F|. Finally, we use

min_η Q(β + ηe_i) − Q(β) ≤ min_η Q(β + η sgn(β′_i)e_i) − Q(β) ≤ min_η P_i(η)

to prove the claim.

    Proof of Theorem 2

    Proof. Please refer to the proof of Theorem 1.

C. Proofs of RSCC

Proof of Theorem 3

Proof. First, from random matrix theory (Vershynin, 2011), we have that for any random matrix X ∈ Rn×d with independent sub-gaussian isotropic random rows or columns, there exists a constant C0 such that

√n − C0 √(s log d) ≤ min_{‖h‖0≤s} ‖Xh‖/‖h‖ ≤ max_{‖h‖0≤s} ‖Xh‖/‖h‖ ≤ √n + C0 √(s log d)   (22)

holds with high probability.

From the mean value theorem, for any β′, β ∈ Ds, there exists a point θ on the segment between β′ and β satisfying

Q(β′) − Q(β) − ⟨∇Q(β), β′ − β⟩ = (1/2)(β′ − β)ᵀ ∇²Q(θ) (β′ − β)
= (1/2)(β′ − β)ᵀ ( (1/n) Σ_{i=1}^n X_{i·}ᵀ ∇²₁l_i(X_{i·}θ, y_i) X_{i·} + ∇²R(θ) ) (β′ − β)
≥ (1/2)(β′ − β)ᵀ ( (1/n) Σ_{i=1}^n λ− X_{i·}ᵀ X_{i·} + λ−_R I ) (β′ − β)   (from θ ∈ Ds)
= (1/2)(β′ − β)ᵀ ( (λ−/n) XᵀX + λ−_R I ) (β′ − β)
= (λ−/(2n)) ‖X(β′ − β)‖² + (λ−_R/2) ‖β′ − β‖².   (23)

Similarly, one can easily obtain

Q(β′) − Q(β) − ⟨∇Q(β), β′ − β⟩ ≤ (λ+/(2n)) ‖X(β′ − β)‖² + (λ+_R/2) ‖β′ − β‖².   (24)

Using (22), we have

(√n − C0√(s log d))² ‖β′ − β‖² ≤ ‖X(β′ − β)‖² ≤ (√n + C0√(s log d))² ‖β′ − β‖².

Together with (23) and (24), we obtain

(1/2)[ λ−_R + λ− (1 − C0√(s log d/n))² ] ‖β′ − β‖² ≤ Q(β′) − Q(β) − ⟨∇Q(β), β′ − β⟩ ≤ (1/2)[ λ+_R + λ+ (1 + C0√(s log d/n))² ] ‖β′ − β‖²,


which implies claims (5a), (5b), and (5c) by taking n ≥ Cs log d with C = C0²/(1 − √2/2)²; this choice of n ensures C0√(s log d/n) ≤ 1 − √2/2, so that the factor (1 − C0√(s log d/n))² is at least 1/2.

Next we apply this result to prove the remaining part. In Algorithm 1, since we choose n ≥ Cs log d, the quantities ρ−(s), ρ+(s), and κ(s) in Algorithm 1 with FoBa-obj satisfy the bounds in (5). To see why the value of s chosen in (6) satisfies the condition in (3), we verify the right-hand side of (3):

RHS = (k̄ + 1) [ (√(ρ+(s)/ρ−(s)) + 1) · 2ρ+(1)/ρ−(s) ]² ≤ 4(k̄ + 1)(√κ + 1)²κ² ≤ s − k̄.

Therefore, Algorithm 1 terminates within at most 4κ²(√κ + 1)²(k̄ + 1) iterations with high probability, from Theorem 1. The claims for Algorithm 1 with FoBa-gdt can be proven similarly.


Supplemental Material: Additional Experiments

D. Additional Experiments

We conduct additional experiments using two models: logistic regression and linear-chain CRFs. Four algorithms (FoBa-gdt, FoBa-obj, and their corresponding forward greedy algorithms, i.e., Forward-obj and Forward-gdt) are compared on both synthetic data and real-world data. Note that comparisons of FoBa-obj and L1-regularization methods can be found in previous studies (Jalali et al., 2011; Zhang, 2011a).

It is hard to evaluate ρ−(s) and ρ+(s) in practice, so we typically employ the following strategies to set δ and ε in FoBa-obj and FoBa-gdt:

• If we have knowledge of F̄ (e.g., in synthetic simulations), we can compute the true solution β̄; δ and ε are then given as δ = Q(β̄) − min_{α, j∉F̄} Q(β̄ + αe_j) and ε = ‖∇Q(β̄)‖∞.

• We can change the stopping criteria in both FoBa-obj and FoBa-gdt to be based on the sparsity of β(k): when the number of nonzero entries in β(k) exceeds a given sparsity level, the algorithm terminates (a code sketch of this rule follows below).

We employed the first strategy only in synthetic simulations for logistic regression, and the second strategy in all other cases.
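The following is a minimal Python sketch (not the authors' implementation) of Algorithm 1 with FoBa-gdt using the sparsity-based stopping rule. The helper names, the generic L-BFGS refits, and the simplified backward threshold (a fraction backward_factor of the most recent forward gain, in place of the paper's δ(k)/2 bookkeeping) are our assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def foba_gdt(Q, grad_Q, d, max_nonzeros, backward_factor=0.5):
    """Simplified FoBa-gdt sketch with a sparsity-based stopping rule."""
    beta = np.zeros(d)
    support = []
    deltas = []                                   # forward gains, used by the backward test

    def refit(supp):
        # Minimize Q restricted to the support `supp` (other coordinates fixed at zero).
        supp = list(supp)
        def obj(x):
            b = np.zeros(d); b[supp] = x
            return Q(b)
        res = minimize(obj, beta[supp], method="L-BFGS-B")
        b = np.zeros(d); b[supp] = res.x
        return b

    while len(support) < max_nonzeros:
        # Forward step: pick the coordinate with the largest gradient magnitude.
        g = grad_Q(beta)
        g[support] = 0.0
        i = int(np.argmax(np.abs(g)))
        old_val = Q(beta)
        support.append(i)
        beta = refit(support)
        deltas.append(old_val - Q(beta))          # objective decrease of this forward step

        # Backward step: drop features whose removal barely increases the objective.
        while len(support) > 1:
            increases = [Q(refit([j for j in support if j != r])) - Q(beta) for r in support]
            r_best = int(np.argmin(increases))
            if increases[r_best] >= backward_factor * deltas[-1]:
                break
            support.pop(r_best)
            deltas.pop()                           # keep the gain history roughly in sync
            beta = refit(support)

    return beta, support
```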

    D.1. Experiments on Logistic Regression

L2-regularized logistic regression is a typical example of a model with a smooth convex loss function:

Q(β) = (1/n) Σ_i log(1 + exp(−y_i X_{i·} β)) + (λ/2) ‖β‖²,

where X_{i·} ∈ Rd (the ith row of X) is the ith sample and y_i ∈ {1, −1} is the corresponding label. This is a binary classification problem: we seek a hyperplane defined by β that separates the two classes of data in X. The term log(1 + exp(·)) penalizes misclassification, and the ℓ2-norm regularization mitigates overfitting. The gradient of Q(β) is computed by

∂Q(β)/∂β = (1/n) Σ_i [ −y_i exp(−y_i X_{i·} β)/(1 + exp(−y_i X_{i·} β)) ] X_{i·}ᵀ + λβ.
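For concreteness, here is a small NumPy sketch of this objective and gradient; the function names are ours, and in practice a numerically stabilized evaluation of the logistic terms would be preferable for large margins.

```python
import numpy as np

def logistic_objective(beta, X, y, lam):
    """Q(beta) = (1/n) * sum_i log(1 + exp(-y_i * X_i beta)) + (lam/2) * ||beta||^2."""
    margins = y * (X @ beta)                       # y_i * X_i. beta
    loss = np.mean(np.log1p(np.exp(-margins)))     # average logistic loss
    return loss + 0.5 * lam * np.dot(beta, beta)

def logistic_gradient(beta, X, y, lam):
    """Gradient of Q with respect to beta."""
    margins = y * (X @ beta)
    weights = -y * np.exp(-margins) / (1.0 + np.exp(-margins))   # per-sample factor
    return X.T @ weights / X.shape[0] + lam * beta
```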

    D.1.1. SYNTHETIC DATA

We first compare FoBa-obj and FoBa-gdt with Forward-obj and Forward-gdt5 using synthetic data generated as follows. The data matrix X ∈ Rn×d consists of two classes of data (each row is a sample) drawn i.i.d. from the Gaussian distributions N(β*, I_{d×d}) and N(−β*, I_{d×d}), respectively, where the true parameter β* is sparse and its nonzero entries follow the i.i.d. uniform distribution U[0, 1]. The two classes are of the same size. The norm of β* is normalized to 5 and its sparseness ranges from 5 to 14. The vector y is the class label vector with values 1 or −1. We set λ = 0.01, n = 100, and d = 500. The results are averaged over 50 random trials.
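A sketch of this generation procedure under the stated settings is given below; the function name, the equal class split, and the random seed handling are our assumptions.

```python
import numpy as np

def make_synthetic(n=100, d=500, k=10, seed=0):
    """Two Gaussian classes N(+beta*, I) and N(-beta*, I) with a sparse beta*."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    beta_star[support] = rng.uniform(0.0, 1.0, size=k)
    beta_star *= 5.0 / np.linalg.norm(beta_star)         # normalize ||beta*|| to 5

    y = np.concatenate([np.ones(n // 2), -np.ones(n - n // 2)])
    X = rng.standard_normal((n, d)) + y[:, None] * beta_star   # row mean is +/- beta*
    return X, y, beta_star
```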

Four evaluation measures are considered in our comparison: F-measure (2|F(k) ∩ F̄|/(|F(k)| + |F̄|)), estimation error (‖β(k) − β̄‖/‖β̄‖), objective value, and the number of nonzero elements in β(k) (the value of k when the algorithm terminates). The results in Fig. 3 suggest that:

• The FoBa algorithms outperform the forward algorithms overall, especially in F-measure (feature selection) performance;

• The FoBa algorithms are likely to yield sparser solutions than the forward algorithms, since they allow for the removal of bad features;

    5Forward-gdt is equivalent to the orthogonal matching pursuit in (Zhang, 2009).


Figure 3. Comparison of F-measure, estimation error, objective value, and the number of nonzero elements for FoBa-gdt, Forward-gdt, FoBa-obj, and Forward-obj. The horizontal axis is the sparsity number of the true model, ranging from 5 to 14.

Table 3. Computation time (seconds) on synthetic data (top: logistic regression; bottom: CRF).

S    FoBa-gdt   Forward-gdt   FoBa-obj   Forward-obj
5    1.96e-2    0.65e-2       1.25e+1    1.24e+1
8    1.16e-2    0.76e-2       1.52e+1    1.52e+1
11   1.53e-2    1.05e-2       1.83e+1    1.80e+1
14   1.51e-2    1.00e-2       2.29e+1    2.25e+1

10   1.02e+0    0.52e+0       3.95e+1    1.88e+1
15   1.85e+0    1.05e+0       5.53e+1    2.65e+1
20   2.51e+0    1.53e+0       6.89e+1    3.33e+1
25   3.86e+0    2.00e+0       8.06e+1    3.91e+1
30   5.13e+0    3.00e+0       9.04e+1    4.39e+1

• There are minor differences in estimation errors and objective values between the FoBa algorithms and the forward algorithms;

• FoBa-obj and FoBa-gdt perform similarly: the former gives better objective values and the latter is better in estimation error;

• FoBa-obj and Forward-obj decrease objective values more quickly than FoBa-gdt and Forward-gdt since they employ the direct "goodness" measure (reduction of the objective value).

As shown in Table 3, FoBa-gdt and Forward-gdt are much more efficient than FoBa-obj and Forward-obj, because FoBa-obj and Forward-obj solve a large number of optimization problems in each iteration.

    D.1.2. UCI DATA

We next compare FoBa-obj and FoBa-gdt on UCI data sets. We use λ = 10−4 in all experiments. Note that the sparsity numbers here are different from those in the previous experiment: here "S" denotes the number of nonzero entries in the output. We use three evaluation measures: 1) training classification error, 2) test classification error, and 3) training objective value (i.e., the value of Q(β(k))). The datasets are available online6. From Table 4, we observe that

• FoBa-obj and FoBa-gdt obtain similar training errors in all cases;

• FoBa-obj usually obtains a lower objective value than FoBa-gdt, but this does not imply that FoBa-obj selects better features. Indeed, FoBa-gdt achieves lower test errors in several cases.

We also compared the computation time of FoBa-obj and FoBa-gdt and found that FoBa-gdt was empirically at least 10 times faster than FoBa-obj (detailed results omitted due to space limitations). In summary, FoBa-gdt has the same theoretical

6http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets



Table 4. Comparison of different algorithms in terms of training error, testing error, and objective function value (from left to right) on the a1a, a2a, and w1a data sets (from top to bottom). The values of n denote the numbers of training/testing samples.

S    objective    training error    testing error

a1a (FoBa-obj/FoBa-gdt), n = 1605/30956, d = 123
10   556/558      0.169/0.17        0.165/0.164
20   517/529      0.163/0.16        0.163/0.160
30   499/515      0.151/0.145       0.162/0.158
40   490/509      0.150/0.146       0.162/0.159
50   484/494      0.146/0.141       0.162/0.161
60   481/484      0.14/0.141        0.166/0.161
70   480/480      0.138/0.139       0.166/0.163

a2a (FoBa-obj/FoBa-gdt), n = 2205/30296, d = 123
10   870/836      0.191/0.185       0.174/0.160
20   801/810      0.18/0.177        0.163/0.157
30   775/790      0.172/0.168       0.165/0.156
40   758/776      0.162/0.167       0.163/0.155
50   749/764      0.163/0.166       0.162/0.157
60   743/750      0.162/0.161       0.163/0.160
70   740/742      0.162/0.158       0.163/0.161

w1a (FoBa-obj/FoBa-gdt), n = 2477/47272, d = 300
10   574/595      0.135/0.125       0.142/0.127
20   424/487      0.0969/0.0959     0.11/0.104
30   341/395      0.0813/0.0805     0.0993/0.0923
40   288/337      0.0704/0.0715     0.089/0.0853
50   238/282      0.06/0.0658       0.0832/0.0825
60   215/226      0.0547/0.0553     0.0814/0.0818
70   198/206      0.0511/0.0511     0.0776/0.0775

guarantee as FoBa-obj and achieves competitive empirical performance, while being much more efficient in terms of computation time.

    D.2. Simulations on Linear-Chain CRFs

Another typical class of models with smooth convex loss functions is conditional random fields (CRFs) (Lafferty et al., 2001; Sutton & McCallum, 2006), which are most commonly used in segmentation and tagging problems over sequential data. Let Xt ∈ RD, t = 1, 2, · · · , T, be a sequence of random vectors. The corresponding labels of the Xt's are stored in a vector y ∈ RT. Let β ∈ RM be a parameter vector and fm(y, y′, xt), m = 1, · · · , M, be a set of real-valued feature functions. A linear-chain CRF is then a distribution Pβ(y|X) that takes the form

Pβ(y|X) = (1/Z(X)) ∏_{t=1}^T exp{ Σ_{m=1}^M β_m f_m(y_t, y_{t−1}, X_t) },

    where Z(X) is an instance-specific normalization function

Z(X) = Σ_y ∏_{t=1}^T exp{ Σ_{m=1}^M β_m f_m(y_t, y_{t−1}, X_t) }.   (25)

In order to simplify the presentation below, we assume that all Xt's are discrete random vectors with finitely many states. Given training data Xi and yi (i = 1, · · · , I), we can estimate the parameter β by maximizing the likelihood


∏_i Pβ(yi|Xi), which is equivalent to minimizing the negative log-likelihood:

min_β : Q(β) := −log( ∏_{i=1}^I Pβ(yi|Xi) ) = −Σ_{i=1}^I Σ_{t=1}^T Σ_{m=1}^M β_m f_m(y^i_t, y^i_{t−1}, x^i_t) + Σ_{i=1}^I log Z(Xi).

The gradient is computed by

∂Q(β)/∂β_m = −Σ_i Σ_t f_m(y^i_t, y^i_{t−1}, X^i_t) + Σ_i (1/Z(Xi)) ∂Z(Xi)/∂β_m.

Note that it would be extremely expensive (the complexity would be O(2^T)) to calculate Z(Xi) directly from (25), since the sample space of y is too large; the same holds for ∂Z(Xi)/∂β_m. The sum-product algorithm reduces the complexity to polynomial time in T (see (Sutton & McCallum, 2006) for details).
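To illustrate why the cost drops to polynomial time, here is a minimal sketch of the forward (sum-product) recursion for log Z(X); it assumes the feature scores have already been aggregated into a T × K × K array of log-potentials (K label states), and the array layout and the dummy start-state convention are our assumptions rather than the interface of any particular CRF package.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(log_potentials):
    """Forward recursion for a linear chain.

    log_potentials: array of shape (T, K, K); entry [t, a, b] is the log-potential
    sum_m beta_m * f_m(y_t = b, y_{t-1} = a, X_t). For t = 0 only the row a = 0 is
    used as the initial score (a dummy start state). Returns log Z(X) in
    O(T * K^2) time instead of O(K^T).
    """
    T, K, _ = log_potentials.shape
    alpha = log_potentials[0, 0, :]                    # scores of the first position
    for t in range(1, T):
        # alpha_new[b] = logsumexp_a( alpha[a] + log_potentials[t, a, b] )
        alpha = logsumexp(alpha[:, None] + log_potentials[t], axis=0)
    return logsumexp(alpha)
```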

In this experiment, we use a data generator provided in Matlab CRF7. We omit a detailed description of the data generator here because of space limitations. We generate a data matrix X ∈ {1, 2, 3, 4, 5}^{T×D} with T = 800 and D = 4 and the corresponding label vector y ∈ {1, 2, 3, 4}^T (four classes)8. Because of our target application, i.e., sensor selection, we only enforced sparseness on the features related to observations (56 in all); the features w.r.t. transitions remained dense. Fig. 4 shows the performance, averaged over 20 runs, with respect to training error, test error, and objective value. We observe from Fig. 4 that:

• FoBa-obj and Forward-obj give lower objective values, training errors, and test errors than FoBa-gdt and Forward-gdt when the sparsity level is low;

• The performance gap between the different algorithms becomes smaller as the sparsity level increases;

• The performance of FoBa-obj is quite close to that of Forward-obj, while FoBa-gdt outperforms Forward-gdt, particularly when the sparsity level is low.

The bottom of Table 3 shows the computation time of all four algorithms. Similar to the logistic regression case, the gradient-based algorithms (FoBa-gdt and Forward-gdt) are much more efficient than the objective-based algorithms FoBa-obj and Forward-obj. From Table 3, FoBa-obj is computationally more expensive than Forward-obj, while they are comparable in the logistic regression case. The main reason for this is that the evaluation of objective values in the CRF case is more expensive than that in the logistic regression case. These results demonstrate the potential of FoBa-gdt.

[Figure 4: training error, test error, and objective value versus sparsity level (10 to 30) for FoBa-gdt, Forward-gdt, FoBa-obj, and Forward-obj on the linear-chain CRF simulation.]