Research Article

An Interior Point Method for $L_{1/2}$-SVM and Application to Feature Selection in Classification

Lan Yao^1, Xiongji Zhang^2, Dong-Hui Li^2, Feng Zeng^3, and Haowen Chen^1

^1 College of Mathematics and Econometrics, Hunan University, Changsha 410082, China
^2 School of Mathematical Sciences, South China Normal University, Guangzhou 510631, China
^3 School of Software, Central South University, Changsha 410083, China

Correspondence should be addressed to Dong-Hui Li; dhli@scnu.edu.cn

Received 4 November 2013; Revised 12 February 2014; Accepted 18 February 2014; Published 10 April 2014

Academic Editor: Frank Werner

Copyright © 2014 Lan Yao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper studies feature selection for the support vector machine (SVM). By the use of the $L_{1/2}$ regularization technique, we propose a new model, the $L_{1/2}$-SVM. To solve this nonconvex and non-Lipschitz optimization problem, we first transform it into an equivalent quadratically constrained optimization model with a linear objective function and then develop an interior point algorithm. We establish the convergence of the proposed algorithm. Our experiments with artificial data and real data demonstrate that the $L_{1/2}$-SVM model works well and that the proposed algorithm is more effective than some popular methods in selecting relevant features and improving classification performance.
1. Introduction
Feature selection plays an important role in solving classification problems with high-dimensional features, such as text categorization [1, 2], gene expression array analysis [3-5], and combinatorial chemistry [6, 7]. The advantages of feature selection include the following: (i) ignoring noisy or irrelevant features can prevent overfitting and improve generalization performance; (ii) a sparse classifier reduces the computational cost; (iii) a small set of important features is desirable for interpretability.
We address embedded feature selection methods in the context of linear support vector machines (SVMs). Existing feature selection methods embedded in SVMs fall into three approaches [8]. In the first approach, greedy search strategies are applied to iteratively add or remove features from the data. Guyon et al. [3] developed a recursive feature elimination (RFE) algorithm, which has shown good performance on gene selection for microarray data. Beginning with the full feature set, SVM-RFE trains an SVM at each iteration and then eliminates the feature that decreases the margin the least. Rakotomamonjy et al. [9] extended this method by using other ranking criteria, including the radius-margin bound and the span-estimate.
The second approach is to optimize a scaling parameter vector $\sigma \in [0,1]^n$ that indicates the importance of each feature. Weston et al. [10] proposed an iterative method to optimize the scaling parameters by minimizing bounds on the leave-one-out error. Peleg and Meir [11] learned the scaling factors based on the global minimization of a data-dependent generalization error bound.
The third category of approaches is to minimize the number of features by adding a sparsity term to the SVM formulation. Though the standard SVM based on $\|w\|_2$ can be solved easily by convex quadratic programming, its solution may not be a desirably sparse one. A popular way to deal with this problem is the use of the $L_p$ regularization technique, which results in an $L_p$-SVM. It is to minimize $\|w\|_p$ subject to some linear constraints, where $p \in [0, 1]$. When $p \in (0, 1]$,

$$\|w\|_p = \Big( \sum_{j=1}^{m} |w_j|^p \Big)^{1/p}. \quad (1)$$

When $p = 0$, $\|w\|_0 = \sum_{j=1}^{m} I(w_j \neq 0)$.
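As a concrete illustration (ours, not part of the original text), the following NumPy snippet evaluates the penalty $\|w\|_p^p$ for $p \in (0, 1]$ and the cardinality $\|w\|_0$ for $p = 0$; the function name and the zero tolerance are our own choices:

```python
import numpy as np

def lp_penalty(w, p, tol=1e-12):
    """Return ||w||_p^p for p in (0, 1], or the cardinality ||w||_0 for p = 0."""
    w = np.asarray(w, dtype=float)
    if p == 0:
        return int(np.sum(np.abs(w) > tol))   # number of nonzero components
    return float(np.sum(np.abs(w) ** p))      # sum_j |w_j|^p

w = np.array([0.9, 0.01, 0.0])
print(lp_penalty(w, 1), lp_penalty(w, 0.5), lp_penalty(w, 0))  # 0.91, ~1.049, 2
```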
The $L_0$-SVM can find the sparsest classifier by minimizing $\|w\|_0$, the number of nonzero elements in $w$. However, the problem is discrete and NP-hard. From a computational point of view, it is very difficult
to develop efficient numerical methods to solve the problem. A widely used technique for dealing with the $L_0$-SVM is to use a smoothing technique so that the discrete model is approximated by a smooth problem [4, 12, 13]. However, as the function $\|w\|_0$ is not even continuous, it is not to be expected that a smoothing-based method will work well. Chan et al. [14] explored a convex relaxation of the cardinality constraint and obtained a relaxed convex problem that is close to, but different from, the previous $L_0$-SVM. An alternative method is to minimize the convex envelope of $\|w\|_0$, as in the $L_1$-SVM. The $L_1$-SVM is a convex problem and can yield sparse solutions. It is equivalent to a linear program and hence can be solved efficiently. Indeed, $L_1$ regularization has become quite popular in SVMs [12, 15, 16] and is well known as the LASSO [17] in the statistics literature. However, the $L_1$ regularization problem often leads to suboptimal sparsity in practice [18]. In many cases, the solutions yielded by the $L_1$-SVM are less sparse than those of the $L_0$-SVM. The $L_p$ problem with $0 < p < 1$ can find sparser solutions than the $L_1$ problem, which has been evidenced in extensive computational studies [19-21]. It has become a popular strategy in sparse SVMs [22-27].
In this paper, we focus on $L_{1/2}$ regularization and propose a novel $L_{1/2}$-SVM. Recently, Xu et al. [28] showed that the sparsity-promoting ability of the $L_{1/2}$ problem is strongest among the $L_p$ minimization problems with $p \in [1/2, 1)$ and similar for $p \in (0, 1/2]$. So the $L_{1/2}$ problem can be taken as a representative of the $L_p$ ($0 < p < 1$) problems. However, as proved by Ge et al. [29], finding the global minimum of the $L_{1/2}$ problem is still strongly NP-hard, but computing a local minimizer of the problem can be done in polynomial time. The contributions of this paper are twofold. The first is to derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM. The objective function of the reformulated problem is linear, and the constraints are quadratic and linear. We establish the equivalence between the constrained problem and the $L_{1/2}$-SVM. We also show the existence of KKT points of the constrained problem. Our second contribution is to develop an interior point method to solve the constrained optimization reformulation and to establish its global convergence. We also test and verify the effectiveness of the proposed method on artificial data and real data.
The rest of this paper is organized as follows. In Section 2, we first briefly introduce the model of the standard SVM ($L_2$-SVM) and the sparse regularization SVMs. We then reformulate the $L_{1/2}$-SVM into a smooth constrained optimization problem. We propose an interior point method to solve the constrained optimization reformulation and establish its global convergence in Section 3. In Section 4, we report numerical experiments with the proposed method. Section 5 gives concluding remarks.
2. A Smooth Constrained Optimization Reformulation of the $L_{1/2}$-SVM

In this section, after briefly reviewing the model of the standard SVM ($L_2$-SVM) and the sparse regularization SVMs, we derive an equivalent smooth optimization problem for the $L_{1/2}$-SVM model. The smooth optimization problem is to minimize a linear function subject to some simple quadratic constraints and linear constraints.
2.1. Standard SVM. In a two-class classification problem, we are given a training data set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in R^m$ is the feature vector and $y_i \in \{+1, -1\}$ is the class label. The linear classifier is to construct the following decision function:

$$f(x) = w^T x + b, \quad (2)$$

where $w = (w_1, w_2, \ldots, w_m)$ is the weight vector and $b$ is the bias. The prediction label is $+1$ if $f(x) > 0$ and $-1$ otherwise. The standard SVM ($L_2$-SVM) [30] aims to find the separating hyperplane $f(x) = 0$ between the two classes with maximal margin $2/\|w\|_2^2$ and minimal training errors, which leads to the following convex optimization problem:

$$\begin{aligned} \min \quad & \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \ge 0, \end{aligned} \quad (3)$$

where $\|w\|_2 = (\sum_{j=1}^{m} w_j^2)^{1/2}$ is the $L_2$ norm of $w$, $\xi_i$ is the loss that allows training errors for data that may not be linearly separable, and $C$ is a user-specified parameter that balances the margin and the losses. As the problem is a convex quadratic program, it can be solved efficiently by existing methods such as interior point methods and active set methods.
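To make the model concrete, here is a minimal sketch (ours) that solves the primal problem (3) as a convex quadratic program in the stacked variables $(w, b, \xi)$, assuming the CVXOPT package is available; the function name and data layout are our choices:

```python
import numpy as np
from cvxopt import matrix, solvers

def l2_svm_primal(X, y, C):
    """Solve (3): min 0.5||w||^2 + C*sum(xi) s.t. y_i(w'x_i + b) >= 1 - xi_i, xi >= 0."""
    n, m = X.shape
    d = m + 1 + n                                   # variables stacked as (w, b, xi)
    P = np.zeros((d, d)); P[:m, :m] = np.eye(m)     # quadratic term 0.5 w'w
    q = np.hstack([np.zeros(m + 1), C * np.ones(n)])
    # inequality constraints written as G z <= h
    G = np.vstack([
        np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)]),  # -y_i(w'x_i+b)-xi_i <= -1
        np.hstack([np.zeros((n, m + 1)), -np.eye(n)]),          # -xi_i <= 0
    ])
    h = np.hstack([-np.ones(n), np.zeros(n)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:m], z[m], z[m + 1:]                   # w, b, xi
```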
2.2. Sparse SVM. The $L_2$-SVM is a nonsparse regularizer in the sense that the learned decision hyperplane often utilizes all the features. In practice, a sparse SVM is often preferred, so that only a few features are used to make a decision. For this purpose, the following $L_p$-SVM becomes attractive:

$$\begin{aligned} \min \quad & \|w\|_p^p + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \ge 0, \end{aligned} \quad (4)$$

where $p \in [0, 1]$, $\|w\|_0$ stands for the number of nonzero elements of $w$, and for $p \in (0, 1]$, $\|w\|_p$ is defined by (1). Problem (4) is obtained by replacing the $L_2$ penalty ($\|w\|_2^2$) with the $L_p$ penalty ($\|w\|_p^p$) in (3). The standard SVM (3) corresponds to model (4) with $p = 2$.

Figure 1 plots the $L_p$ penalty in one dimension. We can see from the figure that the smaller $p$ is, the larger the penalties imposed on small coefficients ($|w| < 1$). Therefore, the $L_p$ penalties with $p < 1$ may achieve sparser solutions than the $L_1$ penalty. In addition, the $L_1$ penalty imposes large penalties on large coefficients, which may lead to biased estimation for large coefficients. Consequently, the $L_p$ ($0 < p < 1$) penalties become attractive due to their good properties of sparsity, unbiasedness [31], and oracle [32]. We are particularly interested in the $L_{1/2}$ penalty. Recently, Xu et al. [21] revealed the representative role of the $L_{1/2}$ penalty in $L_p$ regularization with $p \in (0, 1)$. We will apply the $L_{1/2}$ penalty to the SVM to perform feature selection and classification jointly.
[Figure 1: The $L_p$ penalty $|w|^p$ in one dimension, for $p = 2$, $p = 1$, and $p = 0.5$.]
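Figure 1 is easy to reproduce; the following short matplotlib sketch (ours, not part of the original paper) plots $|w|^p$ for the three values of $p$:

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-2, 2, 401)
for p in (2, 1, 0.5):
    plt.plot(w, np.abs(w) ** p, label=f"p = {p}")   # the one-dimensional penalty |w|^p
plt.xlabel("w"); plt.ylabel("Lp penalty"); plt.legend(); plt.show()
```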
2.3. $L_{1/2}$-SVM Model. We pay particular attention to the $L_{1/2}$-SVM, namely, problem (4) with $p = 1/2$. We will derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM so that it is relatively easy to design numerical methods. We first specify the $L_{1/2}$-SVM:

$$\begin{aligned} \min \quad & \sum_{j=1}^{m} |w_j|^{1/2} + C \sum_{i=1}^{n} \xi_i \triangleq \phi(w, \xi, b) \\ \text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \ge 0, \quad i = 1, \ldots, n. \end{aligned} \quad (5)$$
Denote $u = (w, \xi, b)$ and let $\mathcal{D}$ be the feasible region of the problem, that is,

$$\mathcal{D} = \{ (w, \xi, b) \mid y_i (w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n \}. \quad (6)$$

Then the $L_{1/2}$-SVM can be written in the compact form

$$\min \; \phi(u), \quad u \in \mathcal{D}. \quad (7)$$

It is a nonconvex and non-Lipschitz problem. Due to the terms $|w_i|^{1/2}$, the objective function is not even directionally differentiable at a point with some $w_i = 0$, which makes the problem very difficult to solve. Existing numerical methods that are very efficient for smooth problems cannot be used directly. One possible way to develop numerical methods for solving (7) is to smooth the term $|w_j|^{1/2}$ using some smoothing function, such as $\phi_\epsilon(w_j) = (w_j^2 + \epsilon^2)^{1/4}$ with some $\epsilon > 0$. However, it is easy to see that the derivative of $\phi_\epsilon(w_j)$ becomes unbounded as $w_j \to 0$ and $\epsilon \to 0$. Consequently, it is not to be expected that smoothing-based numerical methods will work well.
Recently, Tian and Yang [33] proposed an interior point $L_{1/2}$-penalty function method to solve general nonlinear programming problems by using a quadratic relaxation scheme for their $L_{1/2}$-lower order penalty problems. We will follow the idea of [33] to develop an interior point method for solving the $L_{1/2}$-SVM. To this end, in the next subsection, we reformulate problem (7) as a smooth constrained optimization problem.
2.4. A Reformulation of the $L_{1/2}$-SVM Model. Consider the following constrained optimization problem:

$$\begin{aligned} \min \quad & \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \triangleq f(w, \xi, t, b) \\ \text{s.t.} \quad & t_j^2 - w_j \ge 0, \quad j = 1, \ldots, m, \\ & t_j^2 + w_j \ge 0, \quad j = 1, \ldots, m, \\ & t_j \ge 0, \quad j = 1, \ldots, m, \\ & y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \ge 0, \quad i = 1, \ldots, n. \end{aligned} \quad (8)$$

It is obtained by letting $t_j = |w_j|^{1/2}$ in the objective function of (7) and adding the constraints $t_j^2 - w_j \ge 0$ and $t_j^2 + w_j \ge 0$, $j = 1, \ldots, m$. Denote by $\mathcal{F}$ the feasible region of the problem, that is,

$$\begin{aligned} \mathcal{F} = {} & \{ (w, \xi, t, b) \mid t_j^2 - w_j \ge 0, \; t_j^2 + w_j \ge 0, \; t_j \ge 0, \; j = 1, \ldots, m \} \\ & \cap \{ (w, \xi, t, b) \mid y_i (w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n \}. \end{aligned} \quad (9)$$
Let $z = (w, \xi, t, b)$. Then the above problem can be written as

$$\min \; f(z), \quad z \in \mathcal{F}. \quad (10)$$
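The reformulation only lifts $w$ into the auxiliary variable $t$. Below is a small sketch (function names ours) of the map $t_j = |w_j|^{1/2}$ used in Theorem 1 below, together with a numerical check of membership in $\mathcal{F}$:

```python
import numpy as np

def lift(w, xi, b):
    """Map u = (w, xi, b) of (7) to z = (w, xi, t, b) of (10) with t_j = |w_j|^(1/2)."""
    return w, xi, np.sqrt(np.abs(w)), b

def in_feasible_region(w, xi, t, b, X, y, tol=1e-10):
    """Check the constraints defining F in (9), up to a small tolerance."""
    return bool(np.all(t**2 - w >= -tol) and np.all(t**2 + w >= -tol)
                and np.all(t >= -tol)
                and np.all(y * (X @ w + b) - 1 + xi >= -tol)
                and np.all(xi >= -tol))
```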
The following theorem establishes the equivalence between the $L_{1/2}$-SVM and (10).

Theorem 1. If $u^* = (w^*, \xi^*, b^*) \in R^{m+n+1}$ is a solution of the $L_{1/2}$-SVM (7), then $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in R^{m+n+m+1}$ is a solution of the optimization problem (10). Conversely, if $(w^*, \xi^*, t^*, b^*)$ is a solution of the optimization problem (10), then $(w^*, \xi^*, b^*)$ is a solution of the $L_{1/2}$-SVM (7).
Proof. Let $u^* = (w^*, \xi^*, b^*)$ be a solution of the $L_{1/2}$-SVM (7) and let $\bar{z} = (\bar{w}, \bar{\xi}, \bar{t}, \bar{b})$ be a solution of the constrained
optimization problem (10). It is clear that $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in \mathcal{F}$. Moreover, we have $\bar{t}_j^2 \ge |\bar{w}_j|$ for all $j = 1, 2, \ldots, m$, and hence

$$f(\bar{z}) = \sum_{j=1}^{m} \bar{t}_j + C\sum_{i=1}^{n} \bar{\xi}_i \ \ge\ \sum_{j=1}^{m} |\bar{w}_j|^{1/2} + C\sum_{i=1}^{n} \bar{\xi}_i \ \ge\ \sum_{j=1}^{m} |w^*_j|^{1/2} + C\sum_{i=1}^{n} \xi^*_i = \phi(u^*), \quad (11)$$

where the second inequality holds because $(\bar{w}, \bar{\xi}, \bar{b}) \in \mathcal{D}$ and $u^*$ solves (7). Since $z^* \in \mathcal{F}$, we have

$$\phi(u^*) = f(z^*) \ge f(\bar{z}). \quad (12)$$

This, together with (11), implies that $f(\bar{z}) = \phi(u^*)$. The proof is complete.
It is clear that the constraint functions of (10) are convex. Consequently, at any feasible point, the set of all feasible directions coincides with the set of all linearized feasible directions (see Lemma A.1). As a result, KKT points exist. The KKT system of problem (10) can be written as the following system of nonlinear equations:
$$R(w, \xi, t, b, \lambda) = \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T \lambda^{(4)} \\ \{\min\{t_j^2 - w_j, \; \lambda^{(1)}_j\}\}_{j=1,2,\ldots,m} \\ \{\min\{t_j^2 + w_j, \; \lambda^{(2)}_j\}\}_{j=1,2,\ldots,m} \\ \{\min\{t_j, \; \lambda^{(3)}_j\}\}_{j=1,2,\ldots,m} \\ \{\min\{p_i, \; \lambda^{(4)}_i\}\}_{i=1,2,\ldots,n} \\ \{\min\{\xi_i, \; \lambda^{(5)}_i\}\}_{i=1,2,\ldots,n} \end{pmatrix} = 0, \quad (13)$$
where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $X = (x_1, x_2, \ldots, x_n)^T$, $Y = \mathrm{diag}(y)$ is a diagonal matrix, and $p_i = y_i(w^T x_i + b) - (1 - \xi_i)$, $i = 1, 2, \ldots, n$.

For the sake of simplicity, the properties of the reformulation of the $L_{1/2}$-SVM are given in Appendix A.
3. An Interior Point Method

In this section, we develop an interior point method to solve the equivalent constrained problem (10) of the $L_{1/2}$-SVM (7).
3.1. Auxiliary Function. Following the idea of interior point methods, the constrained problem (10) can be solved by minimizing a sequence of logarithmic barrier functions as follows:

$$\begin{aligned} \min \quad & \Phi_\mu(w, \xi, t, b) \triangleq \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i - \mu \sum_{j=1}^{m} \big[ \log(t_j^2 - w_j) + \log(t_j^2 + w_j) + \log t_j \big] - \mu \sum_{i=1}^{n} \big( \log p_i + \log \xi_i \big) \\ \text{s.t.} \quad & t_j^2 - w_j > 0, \quad j = 1, 2, \ldots, m, \\ & t_j^2 + w_j > 0, \quad j = 1, 2, \ldots, m, \\ & t_j > 0, \quad j = 1, 2, \ldots, m, \\ & p_i = y_i(w^T x_i + b) + \xi_i - 1 > 0, \quad i = 1, 2, \ldots, n, \\ & \xi_i > 0, \quad i = 1, 2, \ldots, n, \end{aligned} \quad (14)$$

where $\mu$ is the barrier parameter, which converges to zero from above.
The KKT system of problem (14) is the following system of nonlinear equations:

$$\begin{aligned} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} &= 0, \\ C e_n - \lambda^{(4)} - \lambda^{(5)} &= 0, \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} &= 0, \\ y^T \lambda^{(4)} &= 0, \\ (T^2 - W)\lambda^{(1)} - \mu e_m &= 0, \\ (T^2 + W)\lambda^{(2)} - \mu e_m &= 0, \\ T\lambda^{(3)} - \mu e_m &= 0, \\ P\lambda^{(4)} - \mu e_n &= 0, \\ \Xi\lambda^{(5)} - \mu e_n &= 0, \end{aligned} \quad (15)$$

where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $T = \mathrm{diag}(t)$, $W = \mathrm{diag}(w)$, $\Xi = \mathrm{diag}(\xi)$, and $P = \mathrm{diag}(Y(Xw + b e_n) + \xi - e_n)$ are diagonal matrices, and $e_m \in R^m$ and $e_n \in R^n$ stand for the vectors whose elements are all ones.
3.2. Newton's Method. We apply Newton's method to solve the nonlinear system (15) in the variables $w$, $\xi$, $t$, $b$, and $\lambda$. The subproblem of the method is the following system of linear equations:
$$M \begin{pmatrix} \Delta w \\ \Delta \xi \\ \Delta t \\ \Delta b \\ \Delta\lambda^{(1)} \\ \Delta\lambda^{(2)} \\ \Delta\lambda^{(3)} \\ \Delta\lambda^{(4)} \\ \Delta\lambda^{(5)} \end{pmatrix} = - \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T \lambda^{(4)} \\ (T^2 - W)\lambda^{(1)} - \mu e_m \\ (T^2 + W)\lambda^{(2)} - \mu e_m \\ T\lambda^{(3)} - \mu e_m \\ P\lambda^{(4)} - \mu e_n \\ \Xi\lambda^{(5)} - \mu e_n \end{pmatrix}, \quad (16)$$

where $M$ is the Jacobian of the left-hand side of (15) and takes the form
$$M = \begin{pmatrix} 0 & 0 & 0 & 0 & I & -I & 0 & -X^T Y^T & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & -I & -I \\ 0 & 0 & -2(D_1 + D_2) & 0 & -2T & -2T & -I & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & y^T & 0 \\ -D_1 & 0 & 2TD_1 & 0 & T^2 - W & 0 & 0 & 0 & 0 \\ D_2 & 0 & 2TD_2 & 0 & 0 & T^2 + W & 0 & 0 & 0 \\ 0 & 0 & D_3 & 0 & 0 & 0 & T & 0 & 0 \\ D_4 Y X & D_4 & 0 & D_4 y & 0 & 0 & 0 & P & 0 \\ 0 & D_5 & 0 & 0 & 0 & 0 & 0 & 0 & \Xi \end{pmatrix}, \quad (17)$$

where $D_1 = \mathrm{diag}(\lambda^{(1)})$, $D_2 = \mathrm{diag}(\lambda^{(2)})$, $D_3 = \mathrm{diag}(\lambda^{(3)})$, $D_4 = \mathrm{diag}(\lambda^{(4)})$, and $D_5 = \mathrm{diag}(\lambda^{(5)})$.
We can rewrite (16) as

$$\begin{aligned} (\lambda^{(1)} + \Delta\lambda^{(1)}) - (\lambda^{(2)} + \Delta\lambda^{(2)}) - X^T Y^T (\lambda^{(4)} + \Delta\lambda^{(4)}) &= 0, \\ (\lambda^{(4)} + \Delta\lambda^{(4)}) + (\lambda^{(5)} + \Delta\lambda^{(5)}) &= C e_n, \\ 2(D_1 + D_2)\Delta t + 2T(\lambda^{(1)} + \Delta\lambda^{(1)}) + 2T(\lambda^{(2)} + \Delta\lambda^{(2)}) + (\lambda^{(3)} + \Delta\lambda^{(3)}) &= e_m, \\ y^T (\lambda^{(4)} + \Delta\lambda^{(4)}) &= 0, \\ -D_1 \Delta w + 2TD_1 \Delta t + (T^2 - W)(\lambda^{(1)} + \Delta\lambda^{(1)}) &= \mu e_m, \\ D_2 \Delta w + 2TD_2 \Delta t + (T^2 + W)(\lambda^{(2)} + \Delta\lambda^{(2)}) &= \mu e_m, \\ D_3 \Delta t + T(\lambda^{(3)} + \Delta\lambda^{(3)}) &= \mu e_m, \\ D_4 Y X \Delta w + D_4 \Delta\xi + D_4 y \Delta b + P(\lambda^{(4)} + \Delta\lambda^{(4)}) &= \mu e_n, \\ D_5 \Delta\xi + \Xi(\lambda^{(5)} + \Delta\lambda^{(5)}) &= \mu e_n. \end{aligned} \quad (18)$$
It follows from the last five equations that the vector $\bar{\lambda} \triangleq \lambda + \Delta\lambda$ can be expressed as

$$\begin{aligned} \bar{\lambda}^{(1)} &= (T^2 - W)^{-1} (\mu e_m + D_1 \Delta w - 2TD_1 \Delta t), \\ \bar{\lambda}^{(2)} &= (T^2 + W)^{-1} (\mu e_m - D_2 \Delta w - 2TD_2 \Delta t), \\ \bar{\lambda}^{(3)} &= T^{-1} (\mu e_m - D_3 \Delta t), \\ \bar{\lambda}^{(4)} &= P^{-1} (\mu e_n - D_4 Y X \Delta w - D_4 \Delta\xi - D_4 y \Delta b), \\ \bar{\lambda}^{(5)} &= \Xi^{-1} (\mu e_n - D_5 \Delta\xi). \end{aligned} \quad (19)$$
Substituting (19) into the first four equations of (18), we obtain

$$S \begin{pmatrix} \Delta w \\ \Delta \xi \\ \Delta t \\ \Delta b \end{pmatrix} = \begin{pmatrix} -\mu(T^2 - W)^{-1} e_m + \mu(T^2 + W)^{-1} e_m + \mu X^T Y^T P^{-1} e_n \\ -C e_n + \mu P^{-1} e_n + \mu \Xi^{-1} e_n \\ -e_m + 2\mu T(T^2 - W)^{-1} e_m + 2\mu T(T^2 + W)^{-1} e_m + \mu T^{-1} e_m \\ \mu y^T P^{-1} e_n \end{pmatrix}, \quad (20)$$
where the matrix $S$ takes the form

$$S = \begin{pmatrix} S_{11} & S_{12} & S_{13} & S_{14} \\ S_{21} & S_{22} & S_{23} & S_{24} \\ S_{31} & S_{32} & S_{33} & S_{34} \\ S_{41} & S_{42} & S_{43} & S_{44} \end{pmatrix} \quad (21)$$
with blocks

$$\begin{aligned} S_{11} &= U + V + X^T Y^T P^{-1} D_4 Y X, & S_{12} &= X^T Y^T P^{-1} D_4, \\ S_{13} &= -2(U - V)T, & S_{14} &= X^T Y^T P^{-1} D_4 y, \\ S_{21} &= P^{-1} D_4 Y X, & S_{22} &= P^{-1} D_4 + \Xi^{-1} D_5, \\ S_{23} &= 0, & S_{24} &= P^{-1} D_4 y, \\ S_{31} &= -2(U - V)T, & S_{32} &= 0, \\ S_{33} &= 4T(U + V)T + T^{-1} D_3 - 2(D_1 + D_2), & S_{34} &= 0, \\ S_{41} &= y^T P^{-1} D_4 Y X, & S_{42} &= y^T P^{-1} D_4, \\ S_{43} &= 0, & S_{44} &= y^T P^{-1} D_4 y, \end{aligned} \quad (22)$$

and $U = (T^2 - W)^{-1} D_1$ and $V = (T^2 + W)^{-1} D_2$.
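For readers who want to reproduce the linear algebra, the following NumPy sketch (ours) assembles $S$ from (21)-(22) and solves (20) for the primal direction. Diagonal matrices are stored as vectors; variable names mirror the paper's notation but are otherwise our assumptions:

```python
import numpy as np

def newton_direction(w, xi, t, b, lam, mu, X, y, C):
    """Solve the reduced Newton system (20) for (dw, dxi, dt, db)."""
    l1, l2, l3, l4, l5 = lam                 # the five multiplier blocks
    m, n = w.size, xi.size
    YX = y[:, None] * X                      # Y X
    p = y * (X @ w + b) + xi - 1             # diagonal of P
    gm, gp = t**2 - w, t**2 + w              # diagonals of T^2 - W and T^2 + W
    U, V = l1 / gm, l2 / gp                  # U = (T^2-W)^{-1} D1, V = (T^2+W)^{-1} D2
    A = YX.T * (l4 / p)                      # X^T Y^T P^{-1} D4, shape (m, n)

    S = np.zeros((2 * m + n + 1, 2 * m + n + 1))
    iw, ix, it = slice(0, m), slice(m, m + n), slice(m + n, 2 * m + n)
    S[iw, iw] = np.diag(U + V) + A @ YX
    S[iw, ix] = A
    S[iw, it] = np.diag(-2 * (U - V) * t)
    S[iw, -1] = A @ y
    S[ix, iw] = (l4 / p)[:, None] * YX
    S[ix, ix] = np.diag(l4 / p + l5 / xi)
    S[ix, -1] = l4 / p * y
    S[it, iw] = np.diag(-2 * (U - V) * t)
    S[it, it] = np.diag(4 * t**2 * (U + V) + l3 / t - 2 * (l1 + l2))
    S[-1, iw] = (y * l4 / p) @ YX
    S[-1, ix] = y * l4 / p
    S[-1, -1] = y @ (l4 / p * y)

    rhs = np.concatenate([
        -mu / gm + mu / gp + mu * (YX.T @ (1 / p)),       # first block of (20)
        -C + mu / p + mu / xi,                            # second block
        -1 + 2 * mu * t / gm + 2 * mu * t / gp + mu / t,  # third block
        [mu * (y @ (1 / p))],                             # last (scalar) row
    ])
    d = np.linalg.solve(S, rhs)
    return d[iw], d[ix], d[it], d[-1]
```

Once $(\Delta w, \Delta\xi, \Delta t, \Delta b)$ is available, the multipliers $\bar{\lambda}$ follow from the closed forms in (19).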
3.3. The Interior Point Algorithm. Let $z = (w, \xi, t, b)$ and $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$. We first present the interior point algorithm for solving the barrier problem (14) and then discuss the details of the algorithm.

Algorithm 2 (the interior point algorithm (IPA)).

Step 0. Given tolerance $\epsilon_\mu$, set $\tau_1 \in (0, 1/2)$, $l > 0$, $\gamma_1 > 1$, $\beta \in (0, 1)$. Let $k = 0$.

Step 1. Stop if the KKT condition (15) holds.

Step 2. Compute $\Delta z^k$ from (20) and $\bar{\lambda}^{k+1}$ from (19). Compute $z^{k+1} = z^k + \alpha_k \Delta z^k$. Update the Lagrangian multipliers to obtain $\lambda^{k+1}$.

Step 3. Let $k = k + 1$. Go to Step 1.
In Step 2, a step length $\alpha_k$ is used to calculate $z^{k+1}$. We estimate $\alpha_k$ by an Armijo line search [34], in which $\alpha_k = \max\{\beta^j \mid j = 0, 1, 2, \ldots\}$ for some $\beta \in (0, 1)$ such that the following inequalities are satisfied:

$$\begin{aligned} & (t^{k+1})^2 - w^{k+1} > 0, \\ & (t^{k+1})^2 + w^{k+1} > 0, \\ & t^{k+1} > 0, \\ & \Phi_\mu(z^{k+1}) - \Phi_\mu(z^k) \le \tau_1 \alpha_k \big( \nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta\xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k \big), \end{aligned} \quad (23)$$

where $\tau_1 \in (0, 1/2)$.
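A backtracking implementation of this rule might look as follows (a sketch under our naming; `phi`, `grad_phi`, and `strictly_feasible` are user-supplied callables for $\Phi_\mu$, its gradient, and the three strict-feasibility tests in (23)):

```python
def armijo_step(phi, grad_phi, strictly_feasible, z, dz, tau1=0.25, beta=0.5,
                max_backtracks=50):
    """Largest alpha = beta^j satisfying the inequalities (23)."""
    slope = grad_phi(z) @ dz          # directional derivative; < 0 when dz != 0
    alpha = 1.0
    for _ in range(max_backtracks):
        z_new = z + alpha * dz
        if strictly_feasible(z_new) and phi(z_new) - phi(z) <= tau1 * alpha * slope:
            return alpha, z_new
        alpha *= beta
    raise RuntimeError("Armijo backtracking failed to find a step")
```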
To avoid ill-conditioned growth of $\bar{\lambda}^k$ and to guarantee strict dual feasibility, the Lagrangian multipliers $\lambda^k$ should be kept sufficiently positive and bounded from above. Following a similar idea of [33], we first update the dual multipliers by

$$\hat{\lambda}^{(r)(k+1)}_i = \begin{cases} \dfrac{\min\{l, \mu\}}{g^{(r)}_i}, & \text{if } \bar{\lambda}^{(r)(k+1)}_i < \dfrac{\min\{l, \mu\}}{g^{(r)}_i}, \\[2mm] \bar{\lambda}^{(r)(k+1)}_i, & \text{if } \dfrac{\min\{l, \mu\}}{g^{(r)}_i} \le \bar{\lambda}^{(r)(k+1)}_i \le \dfrac{\mu\gamma_1}{g^{(r)}_i}, \\[2mm] \dfrac{\mu\gamma_1}{g^{(r)}_i}, & \text{if } \bar{\lambda}^{(r)(k+1)}_i > \dfrac{\mu\gamma_1}{g^{(r)}_i}, \end{cases} \qquad r = 1, 2, 3, 4, \quad (24)$$

where the constraint values at $z^k$ are $g^{(1)}_i = (t^k_i)^2 - w^k_i$, $g^{(2)}_i = (t^k_i)^2 + w^k_i$, $g^{(3)}_i = t^k_i$, and $g^{(4)}_i = p^k_i$ with $p_i = y_i(w^T x_i + b) + \xi_i - 1$. The multipliers $\bar{\lambda}^{(5)(k+1)}$ are updated by the same rule with $g^{(5)}_i = \xi^k_i$:

$$\hat{\lambda}^{(5)(k+1)}_i = \begin{cases} \dfrac{\min\{l, \mu\}}{\xi^k_i}, & \text{if } \bar{\lambda}^{(5)(k+1)}_i < \dfrac{\min\{l, \mu\}}{\xi^k_i}, \\[2mm] \bar{\lambda}^{(5)(k+1)}_i, & \text{if } \dfrac{\min\{l, \mu\}}{\xi^k_i} \le \bar{\lambda}^{(5)(k+1)}_i \le \dfrac{\mu\gamma_1}{\xi^k_i}, \\[2mm] \dfrac{\mu\gamma_1}{\xi^k_i}, & \text{if } \bar{\lambda}^{(5)(k+1)}_i > \dfrac{\mu\gamma_1}{\xi^k_i}, \end{cases} \quad (25)$$

where the parameters $l$ and $\gamma_1$ satisfy $l > 0$ and $\gamma_1 > 1$.
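Componentwise, the safeguard (24)-(25) is just a clip of each multiplier estimate into an interval scaled by the corresponding constraint value; a NumPy sketch (ours):

```python
import numpy as np

def safeguard(lam_bar, g, mu, l=1e-6, gamma1=1e5):
    """Clip lam_bar into [min(l, mu)/g_i, gamma1*mu/g_i] as in (24)-(25);
    g holds the (positive) constraint values t^2 - w, t^2 + w, t, p, or xi."""
    return np.clip(lam_bar, min(l, mu) / g, gamma1 * mu / g)
```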
Since positive definiteness of the matrix $S$ is required by this method, the Lagrangian multipliers $\hat{\lambda}^{k+1}$ should satisfy the following condition:

$$\lambda^{(3)}_j - 2(\lambda^{(1)}_j + \lambda^{(2)}_j) t_j \ge 0, \quad j = 1, \ldots, m. \quad (26)$$

For the sake of simplicity, the proof is given in Appendix B. Therefore, if $\hat{\lambda}^{k+1}$ satisfies (26), we let $\lambda^{k+1} = \hat{\lambda}^{k+1}$. Otherwise, we further update it by the following setting:

$$\lambda^{(1)(k+1)} = \gamma_2 \hat{\lambda}^{(1)(k+1)}, \quad \lambda^{(2)(k+1)} = \gamma_2 \hat{\lambda}^{(2)(k+1)}, \quad \lambda^{(3)(k+1)} = \gamma_3 \hat{\lambda}^{(3)(k+1)}, \quad (27)$$

where the constants $\gamma_2 \in (0, 1)$ and $\gamma_3 \ge 1$ satisfy

$$\frac{\gamma_3}{\gamma_2} = \max_{i \in E} \frac{2 t^{k+1}_i \big( \hat{\lambda}^{(1)(k+1)}_i + \hat{\lambda}^{(2)(k+1)}_i \big)}{\hat{\lambda}^{(3)(k+1)}_i}, \quad (28)$$

with $E = \{1, 2, \ldots, m\}$. It is not difficult to see that the vector $(\lambda^{(1)(k+1)}, \lambda^{(2)(k+1)}, \lambda^{(3)(k+1)})$ determined by (27) satisfies (26).

In practice, the KKT conditions (15) are allowed to be satisfied within a tolerance $\epsilon_\mu$. That is, the iterative process stops when the following inequalities hold:
$$\mathrm{Res}(z, \lambda, \mu) = \left\| \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T \lambda^{(4)} \\ (T^2 - W)\lambda^{(1)} - \mu e_m \\ (T^2 + W)\lambda^{(2)} - \mu e_m \\ T\lambda^{(3)} - \mu e_m \\ P\lambda^{(4)} - \mu e_n \\ \Xi\lambda^{(5)} - \mu e_n \end{pmatrix} \right\| < \epsilon_\mu, \quad (29)$$

$$\lambda \ge -\epsilon_\mu e, \quad (30)$$

where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.
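The stopping test (29) is inexpensive to evaluate; a sketch (ours) of the residual norm:

```python
import numpy as np

def kkt_residual(w, xi, t, b, lam, mu, X, y, C):
    """Norm of the perturbed KKT system (15), used in the test (29)."""
    l1, l2, l3, l4, l5 = lam
    p = y * (X @ w + b) + xi - 1
    r = np.concatenate([
        l1 - l2 - X.T @ (y * l4),      # stationarity in w
        C - l4 - l5,                   # stationarity in xi
        1 - 2 * t * (l1 + l2) - l3,    # stationarity in t
        [y @ l4],                      # stationarity in b
        (t**2 - w) * l1 - mu,          # perturbed complementarity conditions
        (t**2 + w) * l2 - mu,
        t * l3 - mu,
        p * l4 - mu,
        xi * l5 - mu,
    ])
    return np.linalg.norm(r)
```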
Since the function $\Phi_\mu$ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let $z^k$ be strictly feasible for problem (10). If $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$. If $\Delta z^k \neq 0$, then there exists an $\bar{\alpha}_k \in (0, 1]$ such that (23) holds for all $\alpha_k \in (0, \bar{\alpha}_k]$.
The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence $\{\mu_k\}$. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\beta \in (0, 1)$. Finally, we test optimality for problem (10) by means of the residual norm $\mathrm{Res}(z, \lambda, 0)$.
Here we present the whole algorithm to solve the $L_{1/2}$-SVM problem (10).

Algorithm 4 (algorithm for solving the $L_{1/2}$-SVM problem).

Step 0. Set $w^0 \in R^m$, $b^0 \in R$, $\xi^0 \in R^n$, $\lambda^0 = \hat{\lambda}^0 > 0$, and $t^0_i \ge \sqrt{|w^0_i| + 1/2}$, $i \in \{1, 2, \ldots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\beta \in (0, 1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\mathrm{Res}(z^j, \hat{\lambda}^j, 0) \le \epsilon$ and $\hat{\lambda}^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\hat{\lambda}^{j+1} = \hat{\lambda}^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \beta\mu_j$ and $\epsilon_{\mu_{j+1}} = \beta\epsilon_{\mu_j}$. Let $j = j + 1$ and go to Step 1.
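The outer loop of Algorithm 4 is a straightforward barrier continuation; the sketch below (ours) assumes callables `inner_solve` implementing Algorithm 2 and `residual` implementing (29):

```python
import numpy as np

def l12_svm(z0, lam0, inner_solve, residual, mu0=1.0, eps_mu0=1e-1,
            beta=0.5, eps=1e-6, max_outer=100):
    """Algorithm 4: solve barrier subproblems (14) for decreasing mu."""
    z, lam, mu, eps_mu = z0, lam0, mu0, eps_mu0
    for _ in range(max_outer):
        if residual(z, lam, 0.0) <= eps and all(np.min(l) >= 0 for l in lam):
            break                                  # optimality test for (10)
        z, lam = inner_solve(z, lam, mu, eps_mu)   # Algorithm 2 on (14)
        mu, eps_mu = beta * mu, beta * eps_mu      # Step 3: shrink mu and eps_mu
    return z, lam
```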
In Algorithm 4, the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2.
The convergence of the proposed interior point method can be proved. We state the theorems here and give the proofs in Appendix C.

Theorem 5. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then any limit point of the sequence $\{(z^k, \hat{\lambda}^k)\}$ generated by Algorithm 2 satisfies the first-order optimality conditions (15).

Theorem 6. Let $\{(z^j, \hat{\lambda}^j)\}$ be generated by Algorithm 4, ignoring its termination condition. Then the following statements are true.

(i) Any limit point of $\{(z^j, \hat{\lambda}^j)\}$ satisfies the first-order optimality conditions (13).

(ii) The limit point $z^*$ of a convergent subsequence $\{z^j\}_{\mathcal{J}} \subseteq \{z^j\}$ with unbounded multipliers $\{\hat{\lambda}^j\}_{\mathcal{J}}$ is a Fritz-John point [35] of problem (10).
4. Experiments

In this section, we test the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compare the performance of the $L_{1/2}$-SVM with the $L_2$-SVM [30], $L_1$-SVM [12], and $L_0$-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml). These four problems were solved in the primal, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard problem $L_0$-SVM, a commonly cited approximation method, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB of RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as $\beta = 0.6$ in the line search, $\tau_1 = 0.5$, $l = 10^{-6}$, $\gamma_1 = 10^{5}$, and the reduction factor $\beta = 0.5$ in Algorithm 4. The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_j| / \max_i(|w_i|) \ge 10^{-4}$ [14] were set to zero. The cardinality of the hyperplane was then computed as the number of nonzero weights.
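The pruning rule can be written in two lines; a sketch (ours) of the criterion of [14] used to compute the reported cardinalities:

```python
import numpy as np

def prune_and_count(w, rel_tol=1e-4):
    """Zero out weights with |w_j| / max_i |w_i| < 1e-4 and count the rest."""
    w = np.where(np.abs(w) >= rel_tol * np.max(np.abs(w)), w, 0.0)
    return w, int(np.count_nonzero(w))
```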
4.1. Artificial Data. First, we consider an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The probabilities of $y = 1$ and $y = -1$ are equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $\{x_1, x_2, x_3\}$ were drawn as $x_i = y N(i, 1)$ and the second three features $\{x_4, x_5, x_6\}$ as $x_i = N(0, 1)$. Otherwise, the first three were drawn as $x_i = N(0, 1)$ and the second three as $x_i = y N(i - 3, 1)$. The remaining features are noise, $x_i = N(0, 20)$, $i = 7, \ldots, m$. Here $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
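For reproducibility, the following NumPy sketch (ours) generates one trial of the artificial data exactly as described above; we read $N(0, 20)$ as a normal with standard deviation 20, which is an assumption:

```python
import numpy as np

def make_artificial(n, m=30, seed=0):
    """One trial of the synthetic problem: 6 relevant (redundant) features + noise."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)            # P(y = 1) = P(y = -1) = 1/2
    X = np.empty((n, m))
    mode1 = rng.random(n) < 0.7                    # 70% of samples: first pattern
    for i in range(3):                             # features x1..x3: y*N(i, 1) or N(0, 1)
        X[mode1, i] = y[mode1] * rng.normal(i + 1, 1, mode1.sum())
        X[~mode1, i] = rng.normal(0, 1, (~mode1).sum())
    for i in range(3, 6):                          # features x4..x6: N(0, 1) or y*N(i-3, 1)
        X[mode1, i] = rng.normal(0, 1, mode1.sum())
        X[~mode1, i] = y[~mode1] * rng.normal(i - 2, 1, (~mode1).sum())
    X[:, 6:] = rng.normal(0, 20, (n, m - 6))       # noise features
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # zero mean, unit std
    return X, y
```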
In the first experiment, we consider the cases with fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs, the $L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM, can achieve sparse solutions, while the $L_2$-SVM uses almost all features in each data set. Furthermore, the solutions of the $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of the $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and 20, the $L_0$-SVM has average cardinalities of 1.43 and 1.87, respectively. This means that the $L_0$-SVM sometimes selects only one feature on small-sample data sets and may ignore some truly relevant features. Consequently, with cardinalities between 2.05 and 2.90, the $L_{1/2}$-SVM gives more reliable solutions than the $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.
Figure 2 (right) plots the trend of prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The $L_1$-SVM has the best prediction performance in all cases, slightly better than the $L_{1/2}$-SVM. The $L_{1/2}$-SVM is more accurate in classification than the $L_2$-SVM and $L_0$-SVM, especially for $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the $L_{1/2}$-SVM is 88.05%, while the results of the $L_2$-SVM and $L_0$-SVM are 84.65% and 77.65%, respectively. Compared with the $L_2$-SVM and $L_0$-SVM, the $L_{1/2}$-SVM has the average accuracy increased by 3.4% and 10.4%, respectively, which can be explained as follows. In the $L_2$-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. In the $L_0$-SVM, few features are selected, and some relevant features are not included, which puts a negative impact on the prediction result. As a tradeoff between the $L_2$-SVM and $L_0$-SVM, the $L_{1/2}$-SVM performs better than both.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the $L_{1/2}$-SVM is 0.87% lower than that of the $L_1$-SVM, while the number of features selected by the $L_{1/2}$-SVM is 74.14% smaller than that of the $L_1$-SVM. This indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than the $L_1$-SVM at a small cost in accuracy. Moreover, the average accuracy of the $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of the $L_0$-SVM, with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.
To further evaluate the feature selection performance of the $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should have these two features ranked on top according to the absolute values of their weights $|w_j|$. In the experiment, we select the top 2 features with the maximal $|w_j|$ for each method and count the frequency with which the top 2 features are $x_3$ and $x_6$ over 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of the $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. As $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of the $L_1$-SVM and $L_0$-SVM are 22 and 25, respectively, while the result of the $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, the $L_1$-SVM is not as good as the $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than the $L_{1/2}$-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.

In the second simulation, we consider the cases with various dimensions of the feature space, $m = 20, 40, \ldots, 180, 200$, and a fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that as the dimension increases from 20 to 200, the number of features selected by the $L_1$-SVM increases from 8.26 to 23.1. However, the cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM remain stable (from 2.2 to 2.95). This indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than the $L_1$-SVM. Figure 3 (right) shows that with the increase of noise features, the accuracy of the $L_2$-SVM drops significantly (from 98.68% to 87.47%).
[Figure 2: Results comparison on 10 artificial data sets with various training samples: cardinality (left) and accuracy (right) versus the number of training samples, for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.]
Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

n | L2-SVM Card / Acc | L1-SVM Card / Acc | L0-SVM Card / Acc | L1/2-SVM Card / Acc
10 | 29.97 / 84.65 | 6.93 / 88.55 | 1.43 / 77.65 | 2.25 / 88.05
20 | 30.00 / 92.21 | 8.83 / 96.79 | 1.87 / 93.17 | 2.35 / 95.50
30 | 30.00 / 94.27 | 8.47 / 98.23 | 2.03 / 94.67 | 2.05 / 97.41
40 | 29.97 / 95.99 | 9.00 / 98.75 | 2.07 / 95.39 | 2.10 / 97.75
50 | 30.00 / 96.61 | 9.50 / 98.91 | 2.13 / 96.57 | 2.45 / 97.68
60 | 29.97 / 97.25 | 10.00 / 98.94 | 2.20 / 97.61 | 2.45 / 97.98
70 | 30.00 / 97.41 | 11.00 / 98.98 | 2.20 / 97.56 | 2.40 / 98.23
80 | 30.00 / 97.68 | 10.03 / 99.09 | 2.23 / 97.61 | 2.60 / 98.50
90 | 29.97 / 97.89 | 9.20 / 99.21 | 2.30 / 97.41 | 2.90 / 98.30
100 | 30.00 / 98.13 | 10.27 / 99.09 | 2.50 / 97.93 | 2.55 / 98.36
On average | 29.99 / 95.21 | 9.32 / 97.65 | 2.10 / 94.56 | 2.41 / 96.78

"n" is the number of training samples, "Card" represents the number of selected features, and "Acc" is the classification accuracy (%).
Table 2: Frequency of the most important features (x3, x6) being ranked in the top 2 over 30 runs.

n | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
L1-SVM | 7 | 14 | 16 | 18 | 22 | 23 | 20 | 22 | 21 | 22
L0-SVM | 3 | 11 | 12 | 15 | 19 | 22 | 22 | 23 | 22 | 25
L1/2-SVM | 9 | 16 | 19 | 24 | 24 | 23 | 27 | 26 | 26 | 27
Table 3: Average results over 10 artificial data sets with various dimensions.

 | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
Cardinality | 109.95 | 15.85 | 2.38 | 2.41
Accuracy (%) | 92.93 | 99.07 | 97.76 | 98.26
Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No. | UCI data set | n | m | L2-SVM Mean (Std) | L1-SVM Mean (Std) | L0-SVM Mean (Std) | L1/2-SVM Mean (Std)
1 | Pima diabetes | 768 | 8 | 8.00 (0.00) | 7.40 (0.89) | 6.20 (1.10) | 7.60 (1.52)
2 | Breast cancer | 683 | 9 | 9.00 (0.00) | 8.60 (1.22) | 6.40 (2.07) | 8.40 (1.67)
3 | Wine (3) | 178 | 13 | 13.00 (0.00) | 9.73 (0.69) | 3.73 (0.28) | 4.07 (0.49)
4 | Image (7) | 2100 | 19 | 19.00 (0.00) | 8.20 (0.53) | 3.66 (0.97) | 5.11 (0.62)
5 | SPECT | 267 | 22 | 22.00 (0.00) | 17.40 (5.27) | 9.80 (4.27) | 9.00 (1.17)
6 | WDBC | 569 | 30 | 30.00 (0.00) | 9.80 (1.52) | 9.00 (3.03) | 5.60 (2.05)
7 | Ionosphere | 351 | 34 | 33.20 (0.45) | 25.00 (3.35) | 24.60 (10.21) | 21.80 (8.44)
8 | SPECTF | 267 | 44 | 44.00 (0.00) | 38.80 (10.98) | 24.00 (11.32) | 25.60 (7.75)
9 | Sonar | 208 | 60 | 60.00 (0.00) | 31.40 (13.05) | 27.00 (7.18) | 13.60 (10.06)
10 | Musk1 | 476 | 166 | 166.00 (0.00) | 85.80 (18.85) | 48.60 (4.28) | 41.00 (14.16)
Average cardinality | | | 40.50 | 40.42 | 24.21 | 16.30 | 14.18
Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No. | UCI data set | L2-SVM Mean (Std) | L1-SVM Mean (Std) | L0-SVM Mean (Std) | L1/2-SVM Mean (Std)
1 | Pima diabetes | 76.99 (3.47) | 76.73 (0.85) | 76.73 (2.94) | 77.12 (1.52)
2 | Breast cancer | 96.18 (2.23) | 97.06 (0.96) | 96.32 (1.38) | 96.47 (1.59)
3 | Wine (3) | 97.71 (3.73) | 98.29 (1.28) | 93.14 (2.39) | 97.14 (2.56)
4 | Image (7) | 88.94 (1.49) | 91.27 (1.77) | 86.67 (4.58) | 91.36 (1.94)
5 | SPECT | 83.40 (2.15) | 78.49 (2.31) | 78.11 (3.38) | 82.34 (5.10)
6 | WDBC | 96.11 (2.11) | 96.46 (1.31) | 95.58 (1.15) | 96.46 (1.31)
7 | Ionosphere | 83.43 (5.44) | 88.00 (3.70) | 84.57 (5.38) | 87.43 (5.94)
8 | SPECTF | 78.49 (3.91) | 74.34 (4.50) | 73.87 (3.25) | 78.49 (1.89)
9 | Sonar | 73.66 (5.29) | 75.61 (5.82) | 74.15 (7.98) | 76.10 (4.35)
10 | Musk1 | 84.84 (1.73) | 82.11 (2.42) | 76.00 (5.64) | 82.32 (3.38)
Average accuracy | | 85.97 | 85.84 | 83.51 | 86.52
In contrast, for the other three sparse SVMs, there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of the $L_{1/2}$-SVM is much sparser than that of the $L_1$-SVM, with a slightly better accuracy than the $L_0$-SVM.
4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. The data was then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the numbers of classes are marked behind their names. Sparsity is defined as card/$m$, and a small value of sparsity is preferred. The data sets are ordered by dimension. The lowest cardinality and the best accuracy rate for each problem are bolded.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while remaining roughly identical in accuracy to the $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the $L_{1/2}$-SVM has the best performance both in feature selection and in classification among the three sparse SVMs. Compared with the $L_1$-SVM, the $L_{1/2}$-SVM achieves sparser solutions in nine data sets.
[Figure 3: Results comparison on the artificial data sets with various dimensions: cardinality (left) and accuracy (right) versus the dimension, for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.]
For example, in data set 8 (SPECTF), the features selected by the $L_{1/2}$-SVM are 34% fewer than those of the $L_1$-SVM; at the same time, the accuracy of the $L_{1/2}$-SVM is 4.1% higher than that of the $L_1$-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of the $L_{1/2}$-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in data set 3 (Wine) is the accuracy of the $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by the $L_{1/2}$-SVM leads to a 58.2% improvement over the $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than the $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the $L_0$-SVM, the $L_{1/2}$-SVM has improved classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of the $L_{1/2}$-SVM is at least 4.0% higher than that of the $L_0$-SVM. Especially in data set 10 (Musk1), the $L_{1/2}$-SVM gives a 6.3% rise in accuracy over the $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that the $L_{1/2}$-SVM selects fewer features than the $L_0$-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of the $L_{1/2}$-SVM are 37.8% and 49.6% less than those of the $L_0$-SVM, respectively. In summary, the $L_{1/2}$-SVM presents better classification performance than the $L_0$-SVM while being effective in choosing relevant features.
5. Conclusions

In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, and it is relatively easy to develop numerical methods for it. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The $L_{1/2}$-SVM achieves sparser solutions than the $L_1$-SVM with comparable classification accuracy. Furthermore, the $L_{1/2}$-SVM achieves more accurate classification results than the $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.
Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has

$$FD(z, D) = LFD(z, D). \quad (A.1)$$
[Figure 4: Results comparison on the UCI data sets: sparsity (left) and accuracy (right) for each data set, for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.]
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is

$$\begin{aligned} L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = {} & C\sum_{i=1}^{n} \xi_i + \sum_{j=1}^{m} t_j - \sum_{j=1}^{m} \lambda_j (t_j^2 - w_j) - \sum_{j=1}^{m} \mu_j (t_j^2 + w_j) \\ & - \sum_{j=1}^{m} \nu_j t_j - \sum_{i=1}^{n} p_i \big( y_i w^T x_i + y_i b + \xi_i - 1 \big) - \sum_{i=1}^{n} q_i \xi_i, \end{aligned} \quad (A.2)$$

where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:

$$\begin{aligned} & \frac{\partial L}{\partial w} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, 2, \ldots, m, \\ & \frac{\partial L}{\partial t} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \ldots, m, \\ & \frac{\partial L}{\partial \xi} = C - p_i - q_i = 0, \quad i = 1, 2, \ldots, n, \\ & \frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0, \\ & \lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j (t_j^2 - w_j) = 0, \quad j = 1, 2, \ldots, m, \\ & \mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j (t_j^2 + w_j) = 0, \quad j = 1, 2, \ldots, m, \\ & \nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \ldots, m, \\ & p_i \ge 0, \quad y_i (w^T x_i + b) + \xi_i - 1 \ge 0, \quad p_i \big( y_i (w^T x_i + b) + \xi_i - 1 \big) = 0, \quad i = 1, 2, \ldots, n, \\ & q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \ldots, n. \end{aligned} \quad (A.3)$$
The following theorem shows that the level sets are bounded.

Theorem A.3. For any given constant $c > 0$, the level set

$$W_c = \{ z = (w, t, \xi, b) \mid f(z) \le c \} \cap D \quad (A.4)$$

is bounded.

Proof. For any $(w, t, \xi, b) \in W_c$, we have

$$\sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \quad (A.5)$$
Combining this with $t_i \ge 0$, $i = 1, \ldots, m$, and $\xi_j \ge 0$, $j = 1, \ldots, n$, we have

$$0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \quad (A.6)$$

Moreover, for any $(w, t, \xi, b) \in D$,

$$|w_i| \le t_i^2, \quad i = 1, \ldots, m. \quad (A.7)$$

Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition

$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \quad (A.8)$$

we have

$$\begin{aligned} w^T x_i + b \ge 1 - \xi_i, & \quad \text{if } y_i = 1, \\ w^T x_i + b \le -1 + \xi_i, & \quad \text{if } y_i = -1. \end{aligned} \quad (A.9)$$

Thus, if the feasible region $D$ is not empty, then we have

$$\max_{y_i = 1} \big( 1 - \xi_i - w^T x_i \big) \le b \le \min_{y_i = -1} \big( -1 + \xi_i - w^T x_i \big). \quad (A.10)$$

Hence $b$ is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.
Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \neq 0$,

$$\begin{aligned} & \big( v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)} \big) \, S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix} \\ &\quad = v^{(1)T} \big( U + V + X^T Y^T P^{-1} D_4 Y X \big) v^{(1)} + v^{(1)T} X^T Y^T P^{-1} D_4 v^{(2)} - 2 v^{(1)T} (U - V) T v^{(3)} + v^{(1)T} X^T Y^T P^{-1} D_4 y \, v^{(4)} \\ &\qquad + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T} \big( P^{-1} D_4 + \Xi^{-1} D_5 \big) v^{(2)} + v^{(2)T} P^{-1} D_4 y \, v^{(4)} - 2 v^{(3)T} T (U - V) v^{(1)} \\ &\qquad + v^{(3)T} \big( 4T(U + V)T + T^{-1} D_3 - 2(D_1 + D_2) \big) v^{(3)} + v^{(4)} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)} y^T P^{-1} D_4 v^{(2)} + v^{(4)} y^T P^{-1} D_4 y \, v^{(4)} \\ &\quad = v^{(1)T} U v^{(1)} - 4 v^{(3)T} T U v^{(1)} + 4 v^{(3)T} T U T v^{(3)} + v^{(1)T} V v^{(1)} + 4 v^{(3)T} T V v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\ &\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \big( T^{-1} D_3 - 2D_1 - 2D_2 \big) v^{(3)} + \big( Y X v^{(1)} + v^{(2)} + v^{(4)} y \big)^T P^{-1} D_4 \big( Y X v^{(1)} + v^{(2)} + v^{(4)} y \big) \\ &\quad = e_m^T U \big( D_{v_1} - 2TD_{v_3} \big)^2 e_m + e_m^T V \big( D_{v_1} + 2TD_{v_3} \big)^2 e_m + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \big( T^{-1} D_3 - 2D_1 - 2D_2 \big) v^{(3)} \\ &\qquad + \big( Y X v^{(1)} + v^{(2)} + v^{(4)} y \big)^T P^{-1} D_4 \big( Y X v^{(1)} + v^{(2)} + v^{(4)} y \big) \\ &\quad > 0, \end{aligned} \quad (B.1)$$

where $D_{v_1} = \mathrm{diag}(v^{(1)})$, $D_{v_2} = \mathrm{diag}(v^{(2)})$, $D_{v_3} = \mathrm{diag}(v^{(3)})$, and $D_{v_4} = \mathrm{diag}(v^{(4)})$.
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \neq 0$. We have

$$\begin{aligned} & \nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta\xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k \\ &\quad = - \big( \Delta w^{kT}, \Delta\xi^{kT}, \Delta t^{kT}, \Delta b^k \big) \, S \begin{pmatrix} \Delta w^k \\ \Delta\xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \end{aligned} \quad (B.2)$$

Since the matrix $S$ is positive definite and $\Delta z^k \neq 0$, the last equation implies

$$\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta\xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \quad (B.3)$$

Consequently, there exists an $\bar{\alpha}_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha\Delta z^k$ will be feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then $\{z^k\}$ are strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ are strictly feasible. Suppose, on the contrary, that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, \ldots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$, and since $f(z) = \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n} \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for every $i \in \{1, \ldots, 3m + 2n\}$, $\{g_i(z^k)\}$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)-(27).
Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat\lambda^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal K}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\mathcal K}$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). Suppose, on the contrary, that there exists an infinite subset $\mathcal K' \subseteq \mathcal K$ such that the subsequence $\{\|(\Delta z^k, \hat\lambda^{k+1})\|\}_{\mathcal K'}$ tends to infinity. Let $\{z^k\}_{\mathcal K'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal K} \subseteq \mathcal K'$ such that $\{\lambda^k\}_{\bar{\mathcal K}} \to \lambda^*$ and $g_i(z^*) > 0$, $\forall i \in \{1, 2, \ldots, 3m + 2n\}$. Thus we have

$$ S^k \longrightarrow S^* \tag{C1} $$

as $k \to \infty$ with $k \in \bar{\mathcal K}$. By Proposition B.1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\bar{\mathcal K}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat\lambda^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal K}$ is a convergent subsequence of $\{z^k\}$, then $\{\Delta z^k\}_{\mathcal K} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal K}$ is bounded. Suppose, on the contrary, that there exists an infinite subset $\mathcal K' \subseteq \mathcal K$ such that $\{\Delta z^k\}_{\mathcal K'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal K'}$, $\{\lambda^k\}_{\mathcal K'}$, and $\{\hat\lambda^{k+1}\}_{\mathcal K'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat\lambda^*$, as well as an infinite index set $\bar{\mathcal K} \subseteq \mathcal K'$, such that $\{z^k\}_{\bar{\mathcal K}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal K}} \to \lambda^*$, and $\{\hat\lambda^{k+1}\}_{\bar{\mathcal K}} \to \hat\lambda^*$. By Lemma C.1 we have $g_i(z^*) > 0$, $\forall i \in \{1, 2, \ldots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get

$$
\nabla_w\Phi_\mu(z^*)^{\top}\Delta w^* + \nabla_t\Phi_\mu(z^*)^{\top}\Delta t^* + \nabla_\xi\Phi_\mu(z^*)^{\top}\Delta\xi^* + \nabla_b\Phi_\mu(z^*)^{\top}\Delta b^*
= -\left(\Delta w^{*\top}, \Delta\xi^{*\top}, \Delta t^{*\top}, \Delta b^*\right) S \begin{pmatrix}\Delta w^* \\ \Delta\xi^* \\ \Delta t^* \\ \Delta b^*\end{pmatrix} < 0.
\tag{C2}
$$

Since $g_i(z^*) > 0$, $\forall i \in \{1, 2, \ldots, 3m + 2n\}$, there exists a $\bar\beta \in (0, 1]$ such that, for all $\alpha \in (0, \bar\beta]$,

$$ g_i(z^* + \alpha\Delta z^*) > 0, \quad \forall i \in \{1, 2, \ldots, 3m + 2n\}. \tag{C3} $$

Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\hat\alpha \in (0, \bar\beta]$ such that the following inequality holds for all $\alpha \in (0, \hat\alpha]$:

$$ \Phi_\mu(z^* + \alpha\Delta z^*) - \Phi_\mu(z^*) \le 1.1\,\tau_1\alpha\,\nabla_z\Phi_\mu(z^*)^{\top}\Delta z^*. \tag{C4} $$

Let $m^* = \min\{j \mid \beta^j \in (0, \hat\alpha],\ j = 0, 1, \ldots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C3) that the following inequality is satisfied for all $k \in \bar{\mathcal K}$ sufficiently large and any $i \in \{1, 2, \ldots, 3m + 2n\}$:

$$ g_i(z^k + \alpha^*\Delta z^k) > 0. \tag{C5} $$

Moreover, we have

$$
\begin{aligned}
\Phi_\mu(z^k + \alpha^*\Delta z^k) - \Phi_\mu(z^k) ={}& \Phi_\mu(z^* + \alpha^*\Delta z^*) - \Phi_\mu(z^*) \\
& + \Phi_\mu(z^k + \alpha^*\Delta z^k) - \Phi_\mu(z^* + \alpha^*\Delta z^*) + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
\le{}& 1.05\,\tau_1\alpha^*\nabla_z\Phi_\mu(z^*)^{\top}\Delta z^* \\
\le{}& \tau_1\alpha^*\nabla_z\Phi_\mu(z^k)^{\top}\Delta z^k.
\end{aligned}
\tag{C6}
$$

By the backtracking line search rule, the last inequality together with (C5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal K}$ large enough. Consequently, when $k \in \bar{\mathcal K}$ is sufficiently large, we have from (23) and (C3) that

$$
\begin{aligned}
\Phi_\mu(z^{k+1}) &\le \Phi_\mu(z^k) + \tau_1\alpha_k\nabla_z\Phi_\mu(z^k)^{\top}\Delta z^k \\
&\le \Phi_\mu(z^k) + \tau_1\alpha^*\nabla_z\Phi_\mu(z^k)^{\top}\Delta z^k \\
&\le \Phi_\mu(z^k) + \frac{1}{2}\tau_1\alpha^*\nabla_z\Phi_\mu(z^*)^{\top}\Delta z^*.
\end{aligned}
\tag{C7}
$$

This shows that $\{\Phi_\mu(z^k)\}_{\mathcal K} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\mathcal K}$ is bounded from below. The proof is complete.
We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat\lambda^*)$ be a limit point of $\{(z^k, \hat\lambda^k)\}$, and let the subsequence $\{(z^k, \hat\lambda^k)\}_{\mathcal K}$ converge to $(z^*, \hat\lambda^*)$. Recall that $(\Delta z^k, \hat\lambda^k)$ solves (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal K$, and using Lemmas C.1 and C.3, we obtain

$$
\begin{gathered}
\hat\lambda^{*(1)} - \hat\lambda^{*(2)} - X^{\top}Y^{\top}\hat\lambda^{*(4)} = 0, \\
C e_n - \hat\lambda^{*(4)} - \hat\lambda^{*(5)} = 0, \\
e_m - 2T^*\hat\lambda^{*(1)} - 2T^*\hat\lambda^{*(2)} - \hat\lambda^{*(3)} = 0, \\
y^{\top}\hat\lambda^{*(4)} = 0, \\
\left(T^{*2} - W^*\right)\hat\lambda^{*(1)} - \mu e_m = 0, \\
\left(T^{*2} + W^*\right)\hat\lambda^{*(2)} - \mu e_m = 0, \\
T^*\hat\lambda^{*(3)} - \mu e_m = 0, \\
P^*\hat\lambda^{*(4)} - \mu e_n = 0, \\
\Xi^*\hat\lambda^{*(5)} - \mu e_n = 0.
\end{gathered}
\tag{C8}
$$

This shows that $(z^*, \hat\lambda^*)$ satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat\lambda^j)\}_{\mathcal J}$ converges to some point $(z^*, \hat\lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that

$$ \operatorname{Res}(z^j, \hat\lambda^j, \mu_j) \longrightarrow \operatorname{Res}(z^*, \hat\lambda^*, 0) = 0, \qquad \{\hat\lambda^j\}_{\mathcal J} \longrightarrow \hat\lambda^* \ge 0. \tag{C9} $$

Consequently, $(z^*, \hat\lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat\lambda^j\|_\infty, 1\}$ and $\bar\lambda^j = \xi_j^{-1}\hat\lambda^j$. Obviously, $\{\bar\lambda^j\}$ is bounded. Hence there exists an infinite subset $\mathcal J' \subseteq \mathcal J$ such that $\{\bar\lambda^j\}_{\mathcal J'} \to \bar\lambda \ne 0$ and $\|\bar\lambda^j\|_\infty = 1$ for large $j \in \mathcal J'$. From (30) we know that $\bar\lambda \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal J'$, we get

$$
\begin{gathered}
\bar\lambda^{(1)} - \bar\lambda^{(2)} - X^{\top}Y^{\top}\bar\lambda^{(4)} = 0, \\
\bar\lambda^{(4)} + \bar\lambda^{(5)} = 0, \\
2T^*\bar\lambda^{(1)} + 2T^*\bar\lambda^{(2)} + \bar\lambda^{(3)} = 0, \\
y^{\top}\bar\lambda^{(4)} = 0, \\
\left(T^{*2} - W^*\right)\bar\lambda^{(1)} = 0, \\
\left(T^{*2} + W^*\right)\bar\lambda^{(2)} = 0, \\
T^*\bar\lambda^{(3)} = 0, \\
P^*\bar\lambda^{(4)} = 0, \\
\Xi^*\bar\lambda^{(5)} = 0.
\end{gathered}
\tag{C10}
$$

Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz John point of problem (10).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers—robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$l_p$-$l_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $L_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
to develop efficient numerical methods to solve the problem. A widely used technique in dealing with the $L_0$-SVM is to apply a smoothing technique so that the discrete model is approximated by a smooth problem [4, 12, 13]. However, as the function $\|w\|_0$ is not even continuous, a smoothing-based method can hardly be expected to work well. Chan et al. [14] explored a convex relaxation to the cardinality constraint and obtained a relaxed convex problem that is close to, but different from, the original $L_0$-SVM. An alternative method is to minimize the convex envelope of $\|w\|_0$, which leads to the $L_1$-SVM. The $L_1$-SVM is a convex problem and can yield sparse solutions; it is equivalent to a linear program and hence can be solved efficiently. Indeed, the $L_1$ regularization has become quite popular in SVMs [12, 15, 16] and is well known as the LASSO [17] in the statistics literature. However, the $L_1$ regularization often leads to suboptimal sparsity in practice [18]: in many cases, the solutions yielded by the $L_1$-SVM are less sparse than those of the $L_0$-SVM. The $L_p$ problem with $0 < p < 1$ can find sparser solutions than the $L_1$ problem, as evidenced by extensive computational studies [19–21]. It has therefore become a welcome strategy in sparse SVMs [22–27].
In this paper, we focus on the $L_{1/2}$ regularization and propose a novel $L_{1/2}$-SVM. Recently, Xu et al. [28] justified that the sparsity-promoting ability of the $L_{1/2}$ problem is strongest among the $L_p$ minimization problems with $p \in [1/2, 1)$ and is similar for $p \in (0, 1/2]$, so the $L_{1/2}$ problem can be taken as a representative of the $L_p$ ($0 < p < 1$) problems. However, as proved by Ge et al. [29], finding the global minimum of the $L_{1/2}$ problem is still strongly NP-hard, although computing a local minimizer can be done in polynomial time. Our contributions in this paper are twofold. The first is to derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM, whose objective function is linear and whose constraints are quadratic and linear. We establish the equivalence between the constrained problem and the $L_{1/2}$-SVM, and we also show the existence of KKT points of the constrained problem. Our second contribution is to develop an interior point method for solving the constrained optimization reformulation and to establish its global convergence. We also test and verify the effectiveness of the proposed method using artificial data and real data.
The rest of this paper is organized as follows. In Section 2, we briefly introduce the standard SVM ($L_2$-SVM) and the sparse regularization SVMs, and we then reformulate the $L_{1/2}$-SVM into a smooth constrained optimization problem. We propose an interior point method to solve the constrained optimization reformulation and establish its global convergence in Section 3. In Section 4, we report numerical experiments with the proposed method. Section 5 gives the concluding remarks.
2. A Smooth Constrained Optimization Reformulation to the $L_{1/2}$-SVM

In this section, after briefly reviewing the standard SVM ($L_2$-SVM) and the sparse regularization SVMs, we derive an equivalent smooth optimization problem to the $L_{1/2}$-SVM model. The smooth optimization problem minimizes a linear function subject to simple quadratic and linear constraints.
2.1. Standard SVM. In a two-class classification problem, we are given a training data set $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in R^m$ is the feature vector and $y_i \in \{-1, +1\}$ is the class label. A linear classifier constructs the decision function

$$ f(x) = w^{\top}x + b, \tag{2} $$

where $w = (w_1, w_2, \ldots, w_m)$ is the weight vector and $b$ is the bias. The predicted label is $+1$ if $f(x) > 0$ and $-1$ otherwise. The standard SVM ($L_2$-SVM) [30] aims to find the separating hyperplane $f(x) = 0$ between the two classes with maximal margin $2/\|w\|_2$ and minimal training errors, which leads to the following convex optimization problem:

$$
\begin{aligned}
\min\quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i \\
\text{s.t.}\quad & y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,
\end{aligned}
\tag{3}
$$

where $\|w\|_2 = (\sum_{j=1}^m w_j^2)^{1/2}$ is the $L_2$ norm of $w$, the slack variables $\xi_i$ allow training errors for data that may not be linearly separable, and $C$ is a user-specified parameter balancing the margin and the losses. As problem (3) is a convex quadratic program, it can be solved efficiently by existing methods such as interior point methods and active set methods.
2.2. Sparse SVM. The $L_2$-SVM is a nonsparse regularizer in the sense that the learned decision hyperplane usually utilizes all of the features. In practice, sparse SVMs are often preferred, so that only a few features are needed to make a decision. For this purpose, the following $L_p$-SVM is very attractive:

$$
\begin{aligned}
\min\quad & \|w\|_p^p + C\sum_{i=1}^n \xi_i \\
\text{s.t.}\quad & y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,
\end{aligned}
\tag{4}
$$

where $p \in [0, 1]$, $\|w\|_0$ stands for the number of nonzero elements of $w$, and, for $p \in (0, 1]$, $\|w\|_p$ is defined by (1). Problem (4) is obtained by replacing the $L_2$ penalty $\|w\|_2^2$ with the $L_p$ penalty $\|w\|_p^p$ in (3); the standard SVM (3) corresponds to model (4) with $p = 2$.

Figure 1 plots the $L_p$ penalty in one dimension. We can see from the figure that the smaller $p$ is, the larger the penalties imposed on small coefficients ($|w| < 1$); therefore the $L_p$ penalties with $p < 1$ may achieve sparser solutions than the $L_1$ penalty. In addition, the $L_1$ penalty imposes large penalties on large coefficients, which may lead to biased estimation of large coefficients. Consequently, the $L_p$ ($0 < p < 1$) penalties become attractive due to their good properties in sparsity, unbiasedness [31], and the oracle property [32]. We are particularly interested in the $L_{1/2}$ penalty. Recently, Xu et al. [21] revealed the representative role of the $L_{1/2}$ penalty in $L_p$ regularization with $p \in (0, 1)$. We will apply the $L_{1/2}$ penalty to the SVM to perform feature selection and classification jointly.
Figure 1: The $L_p$ penalty $|w|^p$ in one dimension for $p = 2$, $p = 1$, and $p = 0.5$.
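To make the comparison in Figure 1 easy to reproduce, the short script below plots the one-dimensional penalty $|w|^p$ for $p = 2, 1, 0.5$. It is an illustrative sketch added here, not part of the paper's original experiments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the one-dimensional L_p penalty |w|^p for p = 2, 1, 1/2 (cf. Figure 1).
w = np.linspace(-2.0, 2.0, 401)
for p, style in [(2.0, "-"), (1.0, "--"), (0.5, ":")]:
    plt.plot(w, np.abs(w) ** p, style, label=f"p = {p}")
plt.xlabel("w")
plt.ylabel("L_p penalty")
plt.legend()
plt.show()
```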
2.3. $L_{1/2}$-SVM Model. We pay particular attention to the $L_{1/2}$-SVM, namely, problem (4) with $p = 1/2$. We will derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM so that it is relatively easy to design numerical methods. We first specify the $L_{1/2}$-SVM:

$$
\begin{aligned}
\min\quad & \sum_{j=1}^m |w_j|^{1/2} + C\sum_{i=1}^n \xi_i \triangleq \phi(w, \xi, b) \\
\text{s.t.}\quad & y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\
& \xi_i \ge 0, \quad i = 1, \ldots, n.
\end{aligned}
\tag{5}
$$

Denote $u = (w, \xi, b)$ and let $\mathcal D$ be the feasible region of the problem; that is,

$$ \mathcal D = \left\{(w, \xi, b) \mid y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, n\right\}. \tag{6} $$

Then the $L_{1/2}$-SVM can be written in the compact form

$$ \min\ \phi(u), \quad u \in \mathcal D. \tag{7} $$

It is a nonconvex and non-Lipschitz problem. Due to the presence of the terms $|w_j|^{1/2}$, the objective function is not even directionally differentiable at a point where some $w_j = 0$, which makes the problem very difficult to solve. Existing numerical methods that are very efficient for smooth problems cannot be used directly. One possible way to develop numerical methods for solving (7) is to smooth the term $|w_j|^{1/2}$ using some smoothing function such as $\phi_\epsilon(w_j) = (w_j^2 + \epsilon^2)^{1/4}$ with some $\epsilon > 0$. However, it is easy to see that the derivative of $\phi_\epsilon(w_j)$ becomes unbounded as $w_j \to 0$ and $\epsilon \to 0$; consequently, smoothing-based numerical methods cannot be expected to work well.
Recently, Tian and Yang [33] proposed an interior point $L_{1/2}$-penalty function method for general nonlinear programming problems, using a quadratic relaxation scheme for their $L_{1/2}$-lower-order penalty problems. We follow the idea of [33] and develop an interior point method for solving the $L_{1/2}$-SVM. To this end, in the next subsection we reformulate problem (7) as a smooth constrained optimization problem.
2.4. A Reformulation of the $L_{1/2}$-SVM Model. Consider the following constrained optimization problem:

$$
\begin{aligned}
\min\quad & \sum_{j=1}^m t_j + C\sum_{i=1}^n \xi_i \triangleq f(w, \xi, t, b) \\
\text{s.t.}\quad & t_j^2 - w_j \ge 0, \quad j = 1, \ldots, m, \\
& t_j^2 + w_j \ge 0, \quad j = 1, \ldots, m, \\
& t_j \ge 0, \quad j = 1, \ldots, m, \\
& y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\
& \xi_i \ge 0, \quad i = 1, \ldots, n.
\end{aligned}
\tag{8}
$$

It is obtained by letting $t_j = |w_j|^{1/2}$ in the objective function of (7) and adding the constraints $t_j^2 - w_j \ge 0$ and $t_j^2 + w_j \ge 0$, $j = 1, \ldots, m$. Denote by $\mathcal F$ the feasible region of the problem; that is,

$$
\begin{aligned}
\mathcal F ={}& \left\{(w, \xi, t, b) \mid t_j^2 - w_j \ge 0,\ t_j^2 + w_j \ge 0,\ t_j \ge 0,\ j = 1, \ldots, m\right\} \\
& \cap \left\{(w, \xi, t, b) \mid y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, n\right\}.
\end{aligned}
\tag{9}
$$

Let $z = (w, \xi, t, b)$. Then the above problem can be written as

$$ \min\ f(z), \quad z \in \mathcal F. \tag{10} $$
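For concreteness, the following sketch evaluates the linear objective $f(z)$ of (8) and the constraint values of (9) at a point $z = (w, \xi, t, b)$. The function name and the NumPy conventions (1-D arrays for $w$, $\xi$, $t$, a scalar $b$, the $n \times m$ data matrix `X`, and the label vector `y`) are our own illustrative choices, not from the paper.

```python
import numpy as np

def objective_and_constraints(w, xi, t, b, X, y, C):
    """Evaluate f(z) of (8) and the constraint values of (9) at z = (w, xi, t, b).

    X is the n-by-m data matrix and y the vector of +/-1 labels. All returned
    constraint values must be nonnegative for z to belong to F."""
    f = t.sum() + C * xi.sum()
    g = np.concatenate([
        t**2 - w,                     # t_j^2 - w_j >= 0
        t**2 + w,                     # t_j^2 + w_j >= 0
        t,                            # t_j >= 0
        y * (X @ w + b) - (1 - xi),   # y_i (w^T x_i + b) >= 1 - xi_i
        xi,                           # xi_i >= 0
    ])
    return f, g
```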
The following theorem establishes the equivalence between the $L_{1/2}$-SVM and (10).

Theorem 1. If $u^* = (w^*, \xi^*, b^*) \in R^{m+n+1}$ is a solution of the $L_{1/2}$-SVM (7), then $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in R^{m+n+m+1}$ is a solution of the optimization problem (10). Conversely, if $(w^*, \xi^*, t^*, b^*)$ is a solution of the optimization problem (10), then $(w^*, \xi^*, b^*)$ is a solution of the $L_{1/2}$-SVM (7).
Proof. Let $u^* = (w^*, \xi^*, b^*)$ be a solution of the $L_{1/2}$-SVM (7), and let $\bar z = (\bar w, \bar\xi, \bar t, \bar b)$ be a solution of the constrained optimization problem (10). It is clear that $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in \mathcal F$. Moreover, we have $\bar t_j^2 \ge |\bar w_j|$, $\forall j = 1, 2, \ldots, m$, and hence

$$ f(\bar z) = \sum_{j=1}^m \bar t_j + C\sum_{i=1}^n \bar\xi_i \ge \sum_{j=1}^m |\bar w_j|^{1/2} + C\sum_{i=1}^n \bar\xi_i \ge \sum_{j=1}^m |w_j^*|^{1/2} + C\sum_{i=1}^n \xi_i^* = \phi(u^*). \tag{11} $$

Since $z^* \in \mathcal F$, we have

$$ \phi(u^*) = f(z^*) \ge f(\bar z). \tag{12} $$

This, together with (11), implies that $f(\bar z) = \phi(u^*)$. The proof is complete.
It is clear that the constraint functions of (10) are convex. Consequently, at any feasible point, the set of all feasible directions is the same as the set of all linearized feasible directions, and as a result KKT points exist. The KKT system of problem (10) can be written as the following system of nonlinear equations:

$$
R(w, \xi, t, b, \lambda) = \begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^{\top}Y^{\top}\lambda^{(4)} \\
C e_n - \lambda^{(4)} - \lambda^{(5)} \\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\
y^{\top}\lambda^{(4)} \\
\left\{\min\{t_j^2 - w_j,\ \lambda_j^{(1)}\}\right\}_{j=1,\ldots,m} \\
\left\{\min\{t_j^2 + w_j,\ \lambda_j^{(2)}\}\right\}_{j=1,\ldots,m} \\
\left\{\min\{t_j,\ \lambda_j^{(3)}\}\right\}_{j=1,\ldots,m} \\
\left\{\min\{p_i,\ \lambda_i^{(4)}\}\right\}_{i=1,\ldots,n} \\
\left\{\min\{\xi_i,\ \lambda_i^{(5)}\}\right\}_{i=1,\ldots,n}
\end{pmatrix} = 0,
\tag{13}
$$

where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $X = (x_1, x_2, \ldots, x_n)^{\top}$, $Y = \operatorname{diag}(y)$ and $T = \operatorname{diag}(t)$ are diagonal matrices, and $p_i = y_i(w^{\top}x_i + b) - (1 - \xi_i)$, $i = 1, 2, \ldots, n$.

For the sake of simplicity, the properties of the reformulation of the $L_{1/2}$-SVM are given in Appendix A.
3. An Interior Point Method

In this section, we develop an interior point method to solve the equivalent constrained problem (10) of the $L_{1/2}$-SVM (7).
3.1. Auxiliary Function. Following the idea of interior point methods, the constrained problem (10) can be solved by minimizing a sequence of logarithmic barrier functions as follows:

$$
\begin{aligned}
\min\quad & \Phi_\mu(w, \xi, t, b) \triangleq \sum_{j=1}^m t_j + C\sum_{i=1}^n \xi_i - \mu\sum_{j=1}^m\left[\log\left(t_j^2 - w_j\right) + \log\left(t_j^2 + w_j\right) + \log t_j\right] - \mu\sum_{i=1}^n\left(\log p_i + \log\xi_i\right) \\
\text{s.t.}\quad & t_j^2 - w_j > 0, \quad t_j^2 + w_j > 0, \quad t_j > 0, \quad j = 1, 2, \ldots, m, \\
& p_i = y_i\left(w^{\top}x_i + b\right) + \xi_i - 1 > 0, \quad \xi_i > 0, \quad i = 1, 2, \ldots, n,
\end{aligned}
\tag{14}
$$

where $\mu$ is the barrier parameter, which converges to zero from above.
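Under the same (assumed) NumPy conventions as before, a direct evaluation of the barrier function $\Phi_\mu$ of (14) can be sketched as follows; returning infinity outside the strict interior is our own convenience for the line search used later.

```python
import numpy as np

def barrier_value(w, xi, t, b, X, y, C, mu):
    """Logarithmic barrier function Phi_mu of (14).

    Returns +inf outside the strict interior, which is convenient for the
    backtracking line search of Algorithm 2."""
    p = y * (X @ w + b) + xi - 1.0
    blocks = [t**2 - w, t**2 + w, t, p, xi]
    if any((v <= 0.0).any() for v in blocks):
        return np.inf
    phi = t.sum() + C * xi.sum()
    phi -= mu * (np.log(t**2 - w).sum() + np.log(t**2 + w).sum() + np.log(t).sum())
    phi -= mu * (np.log(p).sum() + np.log(xi).sum())
    return phi
```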
The KKT system of problem (14) is the following system of nonlinear equations:

$$
\begin{gathered}
\lambda^{(1)} - \lambda^{(2)} - X^{\top}Y^{\top}\lambda^{(4)} = 0, \\
C e_n - \lambda^{(4)} - \lambda^{(5)} = 0, \\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} = 0, \\
y^{\top}\lambda^{(4)} = 0, \\
\left(T^2 - W\right)\lambda^{(1)} - \mu e_m = 0, \\
\left(T^2 + W\right)\lambda^{(2)} - \mu e_m = 0, \\
T\lambda^{(3)} - \mu e_m = 0, \\
P\lambda^{(4)} - \mu e_n = 0, \\
\Xi\lambda^{(5)} - \mu e_n = 0,
\end{gathered}
\tag{15}
$$

where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers; $T = \operatorname{diag}(t)$, $W = \operatorname{diag}(w)$, $\Xi = \operatorname{diag}(\xi)$, and $P = \operatorname{diag}(Y(Xw + b\,e_n) + \xi - e_n)$ are diagonal matrices; and $e_m \in R^m$ and $e_n \in R^n$ stand for the vectors whose elements are all ones.
3.2. Newton's Method. We apply Newton's method to solve the nonlinear system (15) in the variables $w$, $\xi$, $t$, $b$, and $\lambda$. The subproblem of the method is the following system of linear equations:

$$
M\begin{pmatrix} \Delta w \\ \Delta\xi \\ \Delta t \\ \Delta b \\ \Delta\lambda^{(1)} \\ \Delta\lambda^{(2)} \\ \Delta\lambda^{(3)} \\ \Delta\lambda^{(4)} \\ \Delta\lambda^{(5)} \end{pmatrix}
= -\begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^{\top}Y^{\top}\lambda^{(4)} \\
C e_n - \lambda^{(4)} - \lambda^{(5)} \\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\
y^{\top}\lambda^{(4)} \\
\left(T^2 - W\right)\lambda^{(1)} - \mu e_m \\
\left(T^2 + W\right)\lambda^{(2)} - \mu e_m \\
T\lambda^{(3)} - \mu e_m \\
P\lambda^{(4)} - \mu e_n \\
\Xi\lambda^{(5)} - \mu e_n
\end{pmatrix},
\tag{16}
$$

where $M$ is the Jacobian of the left-hand side of (15) and takes the form
$$
M = \begin{pmatrix}
0 & 0 & 0 & 0 & I & -I & 0 & -X^{\top}Y^{\top} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -I & -I \\
0 & 0 & -2(D_1 + D_2) & 0 & -2T & -2T & -I & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & y^{\top} & 0 \\
-D_1 & 0 & 2TD_1 & 0 & T^2 - W & 0 & 0 & 0 & 0 \\
D_2 & 0 & 2TD_2 & 0 & 0 & T^2 + W & 0 & 0 & 0 \\
0 & 0 & D_3 & 0 & 0 & 0 & T & 0 & 0 \\
D_4 YX & D_4 & 0 & D_4 y & 0 & 0 & 0 & P & 0 \\
0 & D_5 & 0 & 0 & 0 & 0 & 0 & 0 & \Xi
\end{pmatrix},
\tag{17}
$$

where $D_1 = \operatorname{diag}(\lambda^{(1)})$, $D_2 = \operatorname{diag}(\lambda^{(2)})$, $D_3 = \operatorname{diag}(\lambda^{(3)})$, $D_4 = \operatorname{diag}(\lambda^{(4)})$, and $D_5 = \operatorname{diag}(\lambda^{(5)})$.
We can rewrite (16) as

$$
\begin{gathered}
\left(\lambda^{(1)} + \Delta\lambda^{(1)}\right) - \left(\lambda^{(2)} + \Delta\lambda^{(2)}\right) - X^{\top}Y^{\top}\left(\lambda^{(4)} + \Delta\lambda^{(4)}\right) = 0, \\
\left(\lambda^{(4)} + \Delta\lambda^{(4)}\right) + \left(\lambda^{(5)} + \Delta\lambda^{(5)}\right) = C e_n, \\
2(D_1 + D_2)\Delta t + 2T\left(\lambda^{(1)} + \Delta\lambda^{(1)}\right) + 2T\left(\lambda^{(2)} + \Delta\lambda^{(2)}\right) + \left(\lambda^{(3)} + \Delta\lambda^{(3)}\right) = e_m, \\
y^{\top}\left(\lambda^{(4)} + \Delta\lambda^{(4)}\right) = 0, \\
-D_1\Delta w + 2TD_1\Delta t + \left(T^2 - W\right)\left(\lambda^{(1)} + \Delta\lambda^{(1)}\right) = \mu e_m, \\
D_2\Delta w + 2TD_2\Delta t + \left(T^2 + W\right)\left(\lambda^{(2)} + \Delta\lambda^{(2)}\right) = \mu e_m, \\
D_3\Delta t + T\left(\lambda^{(3)} + \Delta\lambda^{(3)}\right) = \mu e_m, \\
D_4 YX\Delta w + D_4\Delta\xi + D_4 y\,\Delta b + P\left(\lambda^{(4)} + \Delta\lambda^{(4)}\right) = \mu e_n, \\
D_5\Delta\xi + \Xi\left(\lambda^{(5)} + \Delta\lambda^{(5)}\right) = \mu e_n.
\end{gathered}
\tag{18}
$$
It follows from the last five equations that the vector $\hat\lambda \triangleq \lambda + \Delta\lambda$ can be expressed as

$$
\begin{aligned}
\hat\lambda^{(1)} &= \left(T^2 - W\right)^{-1}\left(\mu e_m + D_1\Delta w - 2TD_1\Delta t\right), \\
\hat\lambda^{(2)} &= \left(T^2 + W\right)^{-1}\left(\mu e_m - D_2\Delta w - 2TD_2\Delta t\right), \\
\hat\lambda^{(3)} &= T^{-1}\left(\mu e_m - D_3\Delta t\right), \\
\hat\lambda^{(4)} &= P^{-1}\left(\mu e_n - D_4 YX\Delta w - D_4\Delta\xi - D_4 y\,\Delta b\right), \\
\hat\lambda^{(5)} &= \Xi^{-1}\left(\mu e_n - D_5\Delta\xi\right).
\end{aligned}
\tag{19}
$$
Substituting (19) into the first four equations of (18), we obtain

$$
S\begin{pmatrix} \Delta w \\ \Delta\xi \\ \Delta t \\ \Delta b \end{pmatrix}
= \begin{pmatrix}
-\mu\left(T^2 - W\right)^{-1}e_m + \mu\left(T^2 + W\right)^{-1}e_m + \mu X^{\top}Y^{\top}P^{-1}e_n \\
-C e_n + \mu P^{-1}e_n + \mu\Xi^{-1}e_n \\
-e_m + 2\mu T\left(T^2 - W\right)^{-1}e_m + 2\mu T\left(T^2 + W\right)^{-1}e_m + \mu T^{-1}e_m \\
\mu y^{\top}P^{-1}e_n
\end{pmatrix},
\tag{20}
$$
where the matrix $S$ takes the form

$$
S = \begin{pmatrix}
S_{11} & S_{12} & S_{13} & S_{14} \\
S_{21} & S_{22} & S_{23} & S_{24} \\
S_{31} & S_{32} & S_{33} & S_{34} \\
S_{41} & S_{42} & S_{43} & S_{44}
\end{pmatrix},
\tag{21}
$$

with blocks

$$
\begin{aligned}
S_{11} &= U + V + X^{\top}Y^{\top}P^{-1}D_4 YX, & S_{12} &= X^{\top}Y^{\top}P^{-1}D_4, \\
S_{13} &= -2(U - V)T, & S_{14} &= X^{\top}Y^{\top}P^{-1}D_4 y, \\
S_{21} &= P^{-1}D_4 YX, & S_{22} &= P^{-1}D_4 + \Xi^{-1}D_5, \\
S_{23} &= 0, & S_{24} &= P^{-1}D_4 y, \\
S_{31} &= -2T(U - V), & S_{32} &= 0, \\
S_{33} &= 4T(U + V)T + T^{-1}D_3 - 2(D_1 + D_2), & S_{34} &= 0, \\
S_{41} &= y^{\top}P^{-1}D_4 YX, & S_{42} &= y^{\top}P^{-1}D_4, \\
S_{43} &= 0, & S_{44} &= y^{\top}P^{-1}D_4 y,
\end{aligned}
\tag{22}
$$

and $U = (T^2 - W)^{-1}D_1$ and $V = (T^2 + W)^{-1}D_2$.
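The reduced system (20)–(22) only involves a matrix of order $2m + n + 1$. The sketch below assembles $S$ and the right-hand side of (20), solves for $(\Delta w, \Delta\xi, \Delta t, \Delta b)$, and recovers $\hat\lambda$ from (19). It stores all diagonal matrices as 1-D arrays and is meant as an illustration of the linear algebra under our assumed conventions, not as the authors' implementation.

```python
import numpy as np

def newton_direction(w, xi, t, b, lam, X, y, C, mu):
    """Assemble the reduced system (20)-(21) with the blocks (22), solve for
    (dw, dxi, dt, db), and recover lambda-hat from (19).

    lam = (l1, l2, l3, l4, l5) holds lambda^(1),...,lambda^(5); the diagonal
    matrices T, W, P, Xi, D_1,...,D_5 are stored as 1-D arrays."""
    m, n = w.size, xi.size
    l1, l2, l3, l4, l5 = lam
    p = y * (X @ w + b) + xi - 1.0             # diagonal of P
    gm, gp = t**2 - w, t**2 + w                # diagonals of T^2 - W, T^2 + W
    U, V = l1 / gm, l2 / gp                    # U, V of (22)
    YX = y[:, None] * X                        # Y X
    A = YX.T * (l4 / p)                        # X^T Y^T P^{-1} D_4

    N = 2 * m + n + 1
    S = np.zeros((N, N))
    iw, ix, it = slice(0, m), slice(m, m + n), slice(m + n, 2 * m + n)
    ib = N - 1
    S[iw, iw] = np.diag(U + V) + A @ YX                                   # S11
    S[iw, ix] = A                                                         # S12
    S[iw, it] = np.diag(-2.0 * (U - V) * t)                               # S13
    S[iw, ib] = A @ y                                                     # S14
    S[ix, iw] = (l4 / p)[:, None] * YX                                    # S21
    S[ix, ix] = np.diag(l4 / p + l5 / xi)                                 # S22
    S[ix, ib] = l4 / p * y                                                # S24
    S[it, iw] = np.diag(-2.0 * t * (U - V))                               # S31
    S[it, it] = np.diag(4.0 * t**2 * (U + V) + l3 / t - 2.0 * (l1 + l2))  # S33
    S[ib, iw] = (l4 / p * y) @ YX                                         # S41
    S[ib, ix] = l4 / p * y                                                # S42
    S[ib, ib] = y @ (l4 / p * y)                                          # S44

    rhs = np.zeros(N)                          # right-hand side of (20)
    rhs[iw] = -mu / gm + mu / gp + mu * (YX.T @ (1.0 / p))
    rhs[ix] = -C + mu / p + mu / xi
    rhs[it] = -1.0 + 2.0 * mu * t / gm + 2.0 * mu * t / gp + mu / t
    rhs[ib] = mu * (y @ (1.0 / p))

    d = np.linalg.solve(S, rhs)
    dw, dxi, dt, db = d[iw], d[ix], d[it], d[ib]

    lh1 = (mu + l1 * dw - 2.0 * t * l1 * dt) / gm          # recover (19)
    lh2 = (mu - l2 * dw - 2.0 * t * l2 * dt) / gp
    lh3 = (mu - l3 * dt) / t
    lh4 = (mu - l4 * (YX @ dw) - l4 * dxi - l4 * y * db) / p
    lh5 = (mu - l5 * dxi) / xi
    return (dw, dxi, dt, db), (lh1, lh2, lh3, lh4, lh5)
```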
3.3. The Interior Point Algorithm. Let $z = (w, \xi, t, b)$ and $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$. We first present the interior point algorithm for solving the barrier problem (14) and then discuss the details of the algorithm.

Algorithm 2 (the interior point algorithm, IPA).

Step 0. Given a tolerance $\epsilon_\mu$, set $\tau_1 \in (0, 1/2)$, $l > 0$, $\gamma_1 > 1$, and $\beta \in (0, 1)$. Let $k = 0$.

Step 1. Stop if the KKT condition (15) holds.

Step 2. Compute $\Delta z^k$ from (20) and $\hat\lambda^{k+1}$ from (19). Compute $z^{k+1} = z^k + \alpha_k\Delta z^k$. Update the Lagrangian multipliers to obtain $\lambda^{k+1}$.

Step 3. Let $k = k + 1$ and go to Step 1.
In Step 2, a step length $\alpha_k$ is used to calculate $z^{k+1}$. We determine $\alpha_k$ by the Armijo line search [34], in which $\alpha_k = \max\{\beta^j \mid j = 0, 1, 2, \ldots\}$ for some $\beta \in (0, 1)$ such that the following inequalities are satisfied:

$$
\begin{gathered}
\left(t^{k+1}\right)^2 - w^{k+1} > 0, \qquad \left(t^{k+1}\right)^2 + w^{k+1} > 0, \qquad t^{k+1} > 0, \\
\Phi_\mu(z^{k+1}) - \Phi_\mu(z^k) \le \tau_1\alpha_k\left(\nabla_w\Phi_\mu(z^k)^{\top}\Delta w^k + \nabla_t\Phi_\mu(z^k)^{\top}\Delta t^k + \nabla_\xi\Phi_\mu(z^k)^{\top}\Delta\xi^k + \nabla_b\Phi_\mu(z^k)^{\top}\Delta b^k\right),
\end{gathered}
\tag{23}
$$

where $\tau_1 \in (0, 1/2)$.
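A generic backtracking implementation of this Armijo rule might look as follows. Here `phi`, `grad_phi`, and `strictly_feasible` are caller-supplied callbacks (our own abstraction), with `strictly_feasible` checking the first three inequalities of (23).

```python
def armijo_step(z, dz, phi, grad_phi, strictly_feasible,
                tau1=0.25, beta=0.5, max_backtracks=50):
    """Backtracking Armijo search for (23): try alpha = beta^j, j = 0, 1, ...,
    and accept the first step that stays strictly feasible and achieves
    sufficient decrease of Phi_mu. tau1 must lie in (0, 1/2)."""
    slope = grad_phi(z) @ dz        # negative whenever dz != 0, by Lemma 3
    alpha = 1.0
    for _ in range(max_backtracks):
        z_new = z + alpha * dz
        if strictly_feasible(z_new) and \
                phi(z_new) - phi(z) <= tau1 * alpha * slope:
            return alpha, z_new
        alpha *= beta
    raise RuntimeError("Armijo line search failed")
```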
To avoid ill-conditioned growth of $\hat\lambda^k$ and to guarantee strict dual feasibility, the Lagrangian multipliers $\hat\lambda^k$ should be kept sufficiently positive and bounded from above. Following a similar idea in [33], we first update the dual multipliers by the componentwise projection

$$
\hat\lambda_i^{(1)(k+1)} \leftarrow
\begin{cases}
\min\left\{l, \dfrac{\mu}{(t_i^k)^2 - w_i^k}\right\}, & \text{if } \hat\lambda_i^{(1)(k+1)} < \min\left\{l, \dfrac{\mu}{(t_i^k)^2 - w_i^k}\right\}, \\[2ex]
\hat\lambda_i^{(1)(k+1)}, & \text{if } \min\left\{l, \dfrac{\mu}{(t_i^k)^2 - w_i^k}\right\} \le \hat\lambda_i^{(1)(k+1)} \le \dfrac{\mu\gamma_1}{(t_i^k)^2 - w_i^k}, \\[2ex]
\dfrac{\mu\gamma_1}{(t_i^k)^2 - w_i^k}, & \text{if } \hat\lambda_i^{(1)(k+1)} > \dfrac{\mu\gamma_1}{(t_i^k)^2 - w_i^k},
\end{cases}
\tag{24}
$$

and analogously for the other multiplier groups, with the constraint value $(t_i^k)^2 - w_i^k$ replaced by $(t_i^k)^2 + w_i^k$ for $\hat\lambda^{(2)(k+1)}$, by $t_i^k$ for $\hat\lambda^{(3)(k+1)}$, and by $p_i^k$ for $\hat\lambda^{(4)(k+1)}$, where $p_i = y_i(w^{\top}x_i + b) + \xi_i - 1$; finally,

$$
\hat\lambda_i^{(5)(k+1)} \leftarrow
\begin{cases}
\min\left\{l, \dfrac{\mu}{\xi_i^k}\right\}, & \text{if } \hat\lambda_i^{(5)(k+1)} < \min\left\{l, \dfrac{\mu}{\xi_i^k}\right\}, \\[2ex]
\hat\lambda_i^{(5)(k+1)}, & \text{if } \min\left\{l, \dfrac{\mu}{\xi_i^k}\right\} \le \hat\lambda_i^{(5)(k+1)} \le \dfrac{\mu\gamma_1}{\xi_i^k}, \\[2ex]
\dfrac{\mu\gamma_1}{\xi_i^k}, & \text{if } \hat\lambda_i^{(5)(k+1)} > \dfrac{\mu\gamma_1}{\xi_i^k},
\end{cases}
\tag{25}
$$

where the parameters satisfy $l > 0$ and $\gamma_1 > 1$.
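Because (24) and (25) apply the same interval projection to each multiplier group, they can be implemented by a single vectorized helper; the default values $l = 10^{-6}$ and $\gamma_1 = 10^5$ below match the parameter settings reported in Section 4. It would be called once per group, with `g` equal to $t^2 - w$, $t^2 + w$, $t$, $p$, and $\xi$ in turn.

```python
import numpy as np

def safeguard_multipliers(lam_hat, g, mu, l=1e-6, gamma1=1e5):
    """Project one group of multiplier estimates onto the safeguarding
    interval [min(l, mu / g_i), gamma1 * mu / g_i] of (24)-(25), where
    g_i > 0 is the corresponding constraint value at the current iterate."""
    lower = np.minimum(l, mu / g)
    upper = gamma1 * mu / g
    return np.clip(lam_hat, lower, upper)
```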
Since positive definiteness of the matrix $S$ is required in this method, the Lagrangian multipliers $\hat\lambda^{k+1}$ should satisfy the condition

$$ \lambda^{(3)} - 2T\left(\lambda^{(1)} + \lambda^{(2)}\right) \ge 0 \tag{26} $$

(for the sake of simplicity, the proof is given in Appendix B). Therefore, if $\hat\lambda^{k+1}$ satisfies (26), we let $\lambda^{k+1} = \hat\lambda^{k+1}$; otherwise, we further update it by setting

$$ \lambda^{(1)(k+1)} = \gamma_2\hat\lambda^{(1)(k+1)}, \qquad \lambda^{(2)(k+1)} = \gamma_2\hat\lambda^{(2)(k+1)}, \qquad \lambda^{(3)(k+1)} = \gamma_3\hat\lambda^{(3)(k+1)}, \tag{27} $$

where the constants $\gamma_2 \in (0, 1)$ and $\gamma_3 \ge 1$ satisfy

$$ \frac{\gamma_3}{\gamma_2} = \max_{i\in E}\frac{2 t_i^{k+1}\left(\hat\lambda_i^{(1)(k+1)} + \hat\lambda_i^{(2)(k+1)}\right)}{\hat\lambda_i^{(3)(k+1)}}, \tag{28} $$

with $E = \{1, 2, \ldots, m\}$. It is not difficult to see that the vector $(\lambda^{(1)(k+1)}, \lambda^{(2)(k+1)}, \lambda^{(3)(k+1)})$ determined by (27) satisfies (26).

In practice, the KKT conditions (15) are allowed to be satisfied within a tolerance $\epsilon_\mu$; that is, the iterative process stops when the following inequalities are met:

$$
\operatorname{Res}(z, \lambda, \mu) = \left\|\begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^{\top}Y^{\top}\lambda^{(4)} \\
C e_n - \lambda^{(4)} - \lambda^{(5)} \\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\
y^{\top}\lambda^{(4)} \\
\left(T^2 - W\right)\lambda^{(1)} - \mu e_m \\
\left(T^2 + W\right)\lambda^{(2)} - \mu e_m \\
T\lambda^{(3)} - \mu e_m \\
P\lambda^{(4)} - \mu e_n \\
\Xi\lambda^{(5)} - \mu e_n
\end{pmatrix}\right\| < \epsilon_\mu,
\tag{29}
$$

$$ \lambda \ge -\epsilon_\mu e, \tag{30} $$

where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.
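A direct evaluation of the residual norm in (29) is sketched below, under the same assumed NumPy conventions as before; with $\mu = 0$ it also serves as the optimality test used by the outer algorithm.

```python
import numpy as np

def kkt_residual(w, xi, t, b, lam, X, y, C, mu):
    """Euclidean norm of the perturbed KKT residual Res(z, lambda, mu) of
    (29); mu = 0 gives the optimality measure for problem (10)."""
    l1, l2, l3, l4, l5 = lam
    p = y * (X @ w + b) + xi - 1.0
    YX = y[:, None] * X
    blocks = [
        l1 - l2 - YX.T @ l4,                 # stationarity in w
        C - l4 - l5,                         # stationarity in xi
        1.0 - 2.0 * t * (l1 + l2) - l3,      # stationarity in t
        np.atleast_1d(y @ l4),               # stationarity in b
        (t**2 - w) * l1 - mu,                # perturbed complementarity
        (t**2 + w) * l2 - mu,
        t * l3 - mu,
        p * l4 - mu,
        xi * l5 - mu,
    ]
    return np.linalg.norm(np.concatenate(blocks))
```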
Since the function $\Phi_\mu$ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let $z^k$ be strictly feasible for problem (10). If $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$. If $\Delta z^k \ne 0$, then there exists an $\bar\alpha_k \in (0, 1]$ such that (23) holds for all $\alpha_k \in (0, \bar\alpha_k]$.
The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence $\{\mu_j\}$. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\beta \in (0, 1)$. Finally, we test optimality for problem (10) by means of the residual norm $\operatorname{Res}(z, \lambda, 0)$. We now present the whole algorithm for solving the $L_{1/2}$-SVM problem (10).

Algorithm 4 (algorithm for solving the $L_{1/2}$-SVM problem).

Step 0. Set $w^0 \in R^m$, $b^0 \in R$, $\xi^0 \in R^n$, $\lambda^0 = \hat\lambda^0 > 0$, and $t_i^0 \ge \sqrt{|w_i^0| + 1/2}$, $i \in \{1, 2, \ldots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\beta \in (0, 1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\operatorname{Res}(z^j, \hat\lambda^j, 0) \le \epsilon$ and $\hat\lambda^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\hat\lambda^{j+1} = \hat\lambda^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \beta\mu_j$ and $\epsilon_{\mu_{j+1}} = \beta\epsilon_{\mu_j}$. Let $j = j + 1$ and go to Step 1.
In Algorithm 4, the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2.
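The outer loop can thus be summarized in a few lines. In the sketch below, `initial_strictly_feasible_point`, `solve_barrier_subproblem` (the inner Algorithm 2), and `kkt_residual_at` are hypothetical helper routines standing in for Steps 0–2.

```python
def l_half_svm(X, y, C, mu0=1.0, eps_mu0=1.0, beta=0.5, eps=1e-6):
    """Outer loop of Algorithm 4 (sketch). The three helpers referenced
    below are hypothetical: initial_strictly_feasible_point implements
    Step 0, solve_barrier_subproblem runs Algorithm 2 on (14) to tolerance
    eps_mu, and kkt_residual_at evaluates Res(z, lambda-hat, 0) as in (29)."""
    z, lam = initial_strictly_feasible_point(X, y)
    mu, eps_mu = mu0, eps_mu0
    while kkt_residual_at(z, lam, mu=0.0) > eps:                        # Step 1
        z, lam = solve_barrier_subproblem(z, lam, mu, eps_mu, X, y, C)  # Step 2
        mu, eps_mu = beta * mu, beta * eps_mu                           # Step 3
    return z, lam
```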
The convergence of the proposed interior point method can be proved. We state the theorems here and give the proofs in Appendix C.
Theorem 5. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then any limit point of the sequence $\{(z^k, \hat\lambda^k)\}$ satisfies the first-order optimality conditions (15).

Theorem 6. Let $\{(z^j, \hat\lambda^j)\}$ be generated by Algorithm 4, ignoring its termination condition. Then the following statements are true:

(i) any limit point of $\{(z^j, \hat\lambda^j)\}$ satisfies the first-order optimality conditions (13);

(ii) the limit point $z^*$ of a convergent subsequence $\{z^j\}_{\mathcal J} \subseteq \{z^j\}$ with unbounded multipliers $\{\hat\lambda^j\}_{\mathcal J}$ is a Fritz John point [35] of problem (10).
4. Experiments

In this section, we test the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compare the performance of the $L_{1/2}$-SVM with the $L_2$-SVM [30], the $L_1$-SVM [12], and the $L_0$-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml). All four problems were solved in the primal, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard $L_0$-SVM, a commonly cited approximation method, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All experiments were run on a personal computer (1.6 GHz CPU, 4 GB RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as $\tau_1 = 0.5$, $l = 10^{-6}$, $\gamma_1 = 10^5$, the line search parameter $\beta = 0.6$, and the barrier reduction factor $\beta = 0.5$. The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_j| / \max_i(|w_i|) \ge 10^{-4}$ [14] were set to zero; the cardinality of the hyperplane was then computed as the number of nonzero weights.
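This post-training thresholding rule can be expressed as the following small helper; the function name is ours, and the relative tolerance $10^{-4}$ is the criterion of [14] quoted above.

```python
import numpy as np

def sparsify(w, rel_tol=1e-4):
    """Set w_j to zero when |w_j| / max_i |w_i| < rel_tol (criterion of [14])
    and return the pruned weights together with their cardinality."""
    w = np.where(np.abs(w) >= rel_tol * np.abs(w).max(), w, 0.0)
    return w, int(np.count_nonzero(w))
```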
4.1. Artificial Data. First, we take an artificial binary linear classification problem as an example, similar to that in [13]. The probabilities of $y = 1$ and $y = -1$ are equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $\{x_1, x_2, x_3\}$ are drawn as $x_i = y\,N(i, 1)$ and the second three features $\{x_4, x_5, x_6\}$ as $x_i = N(0, 1)$; otherwise, the first three are drawn as $x_i = N(0, 1)$ and the second three as $x_i = y\,N(i - 3, 1)$. The remaining features are noise, $x_i = N(0, 20)$, $i = 7, \ldots, m$, where $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
In the first experiment, we consider the cases with fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs ($L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM) achieve sparse solutions, while the $L_2$-SVM uses almost all features in each data set. Furthermore, the solutions of the $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of the $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and 20, the $L_0$-SVM has average cardinalities of 1.42 and 1.87, respectively; that is, the $L_0$-SVM sometimes selects only one feature on small samples and may ignore a really relevant feature. Consequently, with cardinalities between 2.05 and 2.9, the $L_{1/2}$-SVM gives more reliable solutions than the $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.

Figure 2 (right) plots the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The $L_1$-SVM has the best prediction performance in all cases, slightly better than the $L_{1/2}$-SVM. The $L_{1/2}$-SVM is more accurate than the $L_2$-SVM and $L_0$-SVM, especially for $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the $L_{1/2}$-SVM is 88.05%, while the results of the $L_2$-SVM and $L_0$-SVM are 84.65% and 77.65%, respectively; compared with these two methods, the $L_{1/2}$-SVM improves the average accuracy by 3.4% and 10.4%, respectively. This can be explained as follows. The $L_2$-SVM selects all features without discrimination, so its predictions can be misled by the irrelevant features; the $L_0$-SVM selects too few features, and the missing relevant features hurt its predictions. As a tradeoff between the $L_2$-SVM and the $L_0$-SVM, the $L_{1/2}$-SVM performs better than both.

The averages over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the $L_{1/2}$-SVM is 0.87% lower than that of the $L_1$-SVM, while the number of features selected by the $L_{1/2}$-SVM is 74.14% less than that of the $L_1$-SVM. This indicates that the $L_{1/2}$-SVM achieves a much sparser solution than the $L_1$-SVM at little cost in accuracy. Moreover, the average accuracy of the $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of the $L_0$-SVM with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of the $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Because our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should rank these two features on top according to the absolute values of their weights $|w_j|$. In the experiment, we select the top 2 features with maximal $|w_j|$ for each method and count the frequency with which the top 2 features are $x_3$ and $x_6$ in 30 runs. The results are listed in Table 2. When the training sample size is very small, it is difficult for all sparse SVMs to discriminate the two most important features; for example, when $n = 10$, the hit frequencies of the $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. As $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases; for example, when $n = 100$, the hit frequencies of the $L_1$-SVM and $L_0$-SVM are 22 and 25, respectively, while that of the $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may affect the ranking to some extent; therefore the $L_1$-SVM is not as good as the $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than the $L_{1/2}$-SVM, probably due to the excessively small feature subsets it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.

In the second simulation, we consider cases with various dimensions of the feature space, $m = 20, 40, \ldots, 180, 200$, and a fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the $L_1$-SVM increases from 8.26 to 23.1, whereas the cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM remain stable (from 2.2 to 2.95); this indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than the $L_1$-SVM. Figure 3 (right) shows that, as the number of noise features increases, the accuracy of the $L_2$-SVM drops significantly (from 98.68% to 87.47%).
Figure 2: Results comparison on 10 artificial data sets with various training sample sizes: average cardinality (left) and classification accuracy (right) for the $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.
Table 1: Results comparison on 10 artificial data sets with various training sample sizes. "$n$" is the number of training samples, "Card" the number of selected features, and "Acc" the classification accuracy (%).

n | L2-SVM Card / Acc | L1-SVM Card / Acc | L0-SVM Card / Acc | L1/2-SVM Card / Acc
10 | 29.97 / 84.65 | 6.93 / 88.55 | 1.43 / 77.65 | 2.25 / 88.05
20 | 30.00 / 92.21 | 8.83 / 96.79 | 1.87 / 93.17 | 2.35 / 95.50
30 | 30.00 / 94.27 | 8.47 / 98.23 | 2.03 / 94.67 | 2.05 / 97.41
40 | 29.97 / 95.99 | 9.00 / 98.75 | 2.07 / 95.39 | 2.10 / 97.75
50 | 30.00 / 96.61 | 9.50 / 98.91 | 2.13 / 96.57 | 2.45 / 97.68
60 | 29.97 / 97.25 | 10.00 / 98.94 | 2.20 / 97.61 | 2.45 / 97.98
70 | 30.00 / 97.41 | 11.00 / 98.98 | 2.20 / 97.56 | 2.40 / 98.23
80 | 30.00 / 97.68 | 10.03 / 99.09 | 2.23 / 97.61 | 2.60 / 98.50
90 | 29.97 / 97.89 | 9.20 / 99.21 | 2.30 / 97.41 | 2.90 / 98.30
100 | 30.00 / 98.13 | 10.27 / 99.09 | 2.50 / 97.93 | 2.55 / 98.36
On average | 29.99 / 95.21 | 9.32 / 97.65 | 2.10 / 94.56 | 2.41 / 96.78
Table 2: Frequency with which the two most important features ($x_3$, $x_6$) are ranked in the top 2 over 30 runs.

n | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
L1-SVM | 7 | 14 | 16 | 18 | 22 | 23 | 20 | 22 | 21 | 22
L0-SVM | 3 | 11 | 12 | 15 | 19 | 22 | 22 | 23 | 22 | 25
L1/2-SVM | 9 | 16 | 19 | 24 | 24 | 23 | 27 | 26 | 26 | 27
Table 3: Average results over the 10 artificial data sets with various dimensions.

 | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
Cardinality | 109.95 | 15.85 | 2.38 | 2.41
Accuracy (%) | 92.93 | 99.07 | 97.76 | 98.26
Table 4: Feature selection performance comparison on UCI data sets (cardinality: mean ± std).

No. | UCI data set | n | m | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
1 | Pima diabetes | 768 | 8 | 8.00 ± 0.00 | 7.40 ± 0.89 | 6.20 ± 1.10 | 7.60 ± 1.52
2 | Breast cancer | 683 | 9 | 9.00 ± 0.00 | 8.60 ± 1.22 | 6.40 ± 2.07 | 8.40 ± 1.67
3 | Wine (3) | 178 | 13 | 13.00 ± 0.00 | 9.73 ± 0.69 | 3.73 ± 0.28 | 4.07 ± 0.49
4 | Image (7) | 2100 | 19 | 19.00 ± 0.00 | 8.20 ± 0.53 | 3.66 ± 0.97 | 5.11 ± 0.62
5 | SPECT | 267 | 22 | 22.00 ± 0.00 | 17.40 ± 5.27 | 9.80 ± 4.27 | 9.00 ± 1.17
6 | WDBC | 569 | 30 | 30.00 ± 0.00 | 9.80 ± 1.52 | 9.00 ± 3.03 | 5.60 ± 2.05
7 | Ionosphere | 351 | 34 | 33.20 ± 0.45 | 25.00 ± 3.35 | 24.60 ± 10.21 | 21.80 ± 8.44
8 | SPECTF | 267 | 44 | 44.00 ± 0.00 | 38.80 ± 10.98 | 24.00 ± 11.32 | 25.60 ± 7.75
9 | Sonar | 208 | 60 | 60.00 ± 0.00 | 31.40 ± 13.05 | 27.00 ± 7.18 | 13.60 ± 10.06
10 | Musk1 | 476 | 166 | 166.00 ± 0.00 | 85.80 ± 18.85 | 48.60 ± 4.28 | 41.00 ± 14.16
Average cardinality | | | 40.50 | 40.42 | 24.21 | 16.30 | 14.18
Table 5: Classification performance comparison on UCI data sets (accuracy %: mean ± std).

No. | UCI data set | L2-SVM | L1-SVM | L0-SVM | L1/2-SVM
1 | Pima diabetes | 76.99 ± 3.47 | 76.73 ± 0.85 | 76.73 ± 2.94 | 77.12 ± 1.52
2 | Breast cancer | 96.18 ± 2.23 | 97.06 ± 0.96 | 96.32 ± 1.38 | 96.47 ± 1.59
3 | Wine (3) | 97.71 ± 3.73 | 98.29 ± 1.28 | 93.14 ± 2.39 | 97.14 ± 2.56
4 | Image (7) | 88.94 ± 1.49 | 91.27 ± 1.77 | 86.67 ± 4.58 | 91.36 ± 1.94
5 | SPECT | 83.40 ± 2.15 | 78.49 ± 2.31 | 78.11 ± 3.38 | 82.34 ± 5.10
6 | WDBC | 96.11 ± 2.11 | 96.46 ± 1.31 | 95.58 ± 1.15 | 96.46 ± 1.31
7 | Ionosphere | 83.43 ± 5.44 | 88.00 ± 3.70 | 84.57 ± 5.38 | 87.43 ± 5.94
8 | SPECTF | 78.49 ± 3.91 | 74.34 ± 4.50 | 73.87 ± 3.25 | 78.49 ± 1.89
9 | Sonar | 73.66 ± 5.29 | 75.61 ± 5.82 | 74.15 ± 7.98 | 76.10 ± 4.35
10 | Musk1 | 84.84 ± 1.73 | 82.11 ± 2.42 | 76.00 ± 5.64 | 82.32 ± 3.38
Average accuracy | | 85.97 | 85.84 | 83.51 | 86.52
On the contrary, for the three sparse SVMs there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of the $L_{1/2}$-SVM is much sparser than that of the $L_1$-SVM, and its accuracy is slightly better than that of the $L_0$-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36], comprising 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and instances with missing values were deleted. The data were then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
Tables 4 and 5 summarize the feature selection andclassification performance of the numerical experiments onUCI data sets respectively Here 119899 is the numbers of samplesand 119898 is the number of the input features For the two
multiclass data sets the numbers of the classes are markedbehind their names Sparsity is defined as card119898 and thesmall value of sparsity is preferredThe data sets are arrangedin descending order according to the dimension The lowestcardinality and the best accuracy rate for each problem arebolded
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while maintaining accuracy roughly identical to the $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), while the $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and the classification accuracy (right) of each classifier on the UCI data sets. On three data sets (6, 9, 10) the $L_{1/2}$-SVM has the best performance among the three sparse SVMs in both feature selection and classification. Compared with the $L_1$-SVM, the $L_{1/2}$-SVM achieves a sparser solution on nine data sets.
Figure 3: Results comparison on artificial data sets with various dimensions. Left: cardinality versus dimension (20-200); right: accuracy (%) versus dimension, for the L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.
For example, on the data set "8 SPECTF", the $L_{1/2}$-SVM selects 34% fewer features than the $L_1$-SVM, while its accuracy is 4.1% higher. On most data sets (4, 5, 6, 9, 10) the cardinality of the $L_{1/2}$-SVM drops significantly (by at least 37%) with equal or slightly better accuracy. Only on the data set "3 Wine" is the accuracy of the $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by the $L_{1/2}$-SVM is a 58.2% improvement over the $L_1$-SVM. On the remaining three data sets (1, 2, 7) the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than the $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the $L_0$-SVM, the $L_{1/2}$-SVM improves the classification accuracy on all data sets. For instance, on four data sets (3, 4, 8, 10) the classification accuracy of the $L_{1/2}$-SVM is at least 4.0% higher than that of the $L_0$-SVM. In particular, on the data set "10 Musk1" the $L_{1/2}$-SVM gives a 6.3% rise in accuracy over the $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that the $L_{1/2}$-SVM selects fewer features than the $L_0$-SVM on five data sets (5, 6, 7, 9, 10). For example, on data sets 6 and 9 the cardinalities of the $L_{1/2}$-SVM are 37.8% and 49.6% lower than those of the $L_0$-SVM, respectively. In summary, the $L_{1/2}$-SVM presents better classification performance than the $L_0$-SVM while remaining effective in choosing relevant features.
5. Conclusions

In this paper we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, which makes it relatively easy to develop numerical methods. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method: the $L_{1/2}$-SVM obtains sparser solutions than the $L_1$-SVM with comparable classification accuracy, and it achieves more accurate classification results than the $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, some interesting topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness. Some of these are under our current investigation.
Appendix

A. Properties of the Reformulation of the L1/2-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has
$$FD(z, D) = LFD(z, D). \tag{A.1}$$
Figure 4: Results comparison on UCI data sets. Left: sparsity (%) per data set (1-10); right: accuracy (%) per data set, for the L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM.
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$
L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j - \sum_{j=1}^{m}\lambda_j\bigl(t_j^2 - w_j\bigr) - \sum_{j=1}^{m}\mu_j\bigl(t_j^2 + w_j\bigr) - \sum_{j=1}^{m}\nu_j t_j - \sum_{i=1}^{n}p_i\bigl(y_i w^T x_i + y_i b + \xi_i - 1\bigr) - \sum_{i=1}^{n}q_i\xi_i, \tag{A.2}
$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there exist Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, 2, \dots, m,\\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \dots, m,\\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \dots, n,\\
&\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0,\\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\bigl(t_j^2 - w_j\bigr) = 0, \quad j = 1, 2, \dots, m,\\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\bigl(t_j^2 + w_j\bigr) = 0, \quad j = 1, 2, \dots, m,\\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \dots, m,\\
&p_i \ge 0, \quad y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 \ge 0, \quad p_i\bigl(y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1\bigr) = 0, \quad i = 1, 2, \dots, n,\\
&q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \dots, n.
\end{aligned} \tag{A.3}
$$
The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant $c > 0$, the level set
$$W_c = \{z = (w, t, \xi, b) \mid f(z) \le c\} \cap D \tag{A.4}$$
is bounded.
Proof. For any $(w, t, \xi, b) \in W_c$, we have
$$\sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \le c. \tag{A.5}$$
Combining this with $t_j \ge 0$, $j = 1, \dots, m$, and $\xi_i \ge 0$, $i = 1, \dots, n$, we have
$$0 \le t_j \le c, \quad j = 1, \dots, m, \qquad 0 \le \xi_i \le \frac{c}{C}, \quad i = 1, \dots, n. \tag{A.6}$$
Moreover, for any $(w, t, \xi, b) \in D$,
$$|w_j| \le t_j^2, \quad j = 1, \dots, m. \tag{A.7}$$
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
$$y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{A.8}$$
we have
$$w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1. \tag{A.9}$$
Thus, if the feasible region $D$ is not empty, then
$$\max_{y_i = 1}\bigl(1 - \xi_i - w^T x_i\bigr) \le b \le \min_{y_i = -1}\bigl(-1 + \xi_i - w^T x_i\bigr). \tag{A.10}$$
Hence $b$ is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.
Proof. By an elementary deduction we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, and $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$, expanding blockwise with the $S_{kl}$ of (22),
$$
\begin{aligned}
\bigl(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\bigr)\, S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
&= \sum_{k=1}^{4}\sum_{l=1}^{4} v^{(k)T} S_{kl}\, v^{(l)}\\
&= v^{(1)T}\bigl(U + V + X^T Y^T P^{-1} D_4 Y X\bigr) v^{(1)} + 2\, v^{(1)T} X^T Y^T P^{-1} D_4\, v^{(2)} - 4\, v^{(1)T} (U - V)\, T\, v^{(3)}\\
&\quad + 2\, v^{(4)} v^{(1)T} X^T Y^T P^{-1} D_4\, y + v^{(2)T}\bigl(P^{-1} D_4 + \Xi^{-1} D_5\bigr) v^{(2)} + 2\, v^{(4)} v^{(2)T} P^{-1} D_4\, y\\
&\quad + v^{(3)T}\bigl(4\, T (U + V)\, T + T^{-1} D_3 - 2(D_1 + D_2)\bigr) v^{(3)} + \bigl(v^{(4)}\bigr)^2 y^T P^{-1} D_4\, y\\
&= e_m^T\, U \bigl(D_{v^{(1)}} - 2\, T D_{v^{(3)}}\bigr)^2 e_m + e_m^T\, V \bigl(D_{v^{(1)}} + 2\, T D_{v^{(3)}}\bigr)^2 e_m + v^{(2)T} \Xi^{-1} D_5\, v^{(2)}\\
&\quad + v^{(3)T}\bigl(T^{-1} D_3 - 2 D_1 - 2 D_2\bigr) v^{(3)} + \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)^T P^{-1} D_4 \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)\\
&> 0,
\end{aligned} \tag{B.1}
$$
where $D_{v^{(1)}} = \operatorname{diag}(v^{(1)})$ and $D_{v^{(3)}} = \operatorname{diag}(v^{(3)})$. The last inequality holds because inequality (26) makes the diagonal matrix $T^{-1}D_3 - 2D_1 - 2D_2$ positive semidefinite, all remaining terms are nonnegative, and they cannot all vanish for a nonzero $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)})$.
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k
= -\bigl(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\bigr)\, S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \tag{B.2}
$$
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \tag{B.3}
$$
Consequently, there exists an $\bar\alpha_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar\alpha_k]$.

On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha\,\Delta z^k$ will be strictly feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C.1. Let $(z^k, \lambda^k)$ be generated by Algorithm 2. Then the $z^k$ are strictly feasible for problem (10), and the Lagrangian multipliers $\lambda^k$ are bounded from above.
Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that the $z^k$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $K$ and an index $i \in \{1, 2, \dots, 3m + 2n\}$ such that $\{g_i(z^k)\}_K \downarrow 0$. By the definition of $\Phi_\mu(z)$ and since $f(z) = \sum_{j=1}^m t_j + C\sum_{i=1}^n \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_K \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for every $i \in \{1, 2, \dots, 3m + 2n\}$, $g_i(z^k)$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)-(27).
Lemma C.2. Let $(z^k, \lambda^k)$ and $(\Delta z^k, \hat\lambda^{k+1})$ be generated by Algorithm 2. If $\{z^k\}_K$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat\lambda^{k+1})\}_K$ is bounded.
Proof. Again we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of problem (10). Suppose on the contrary that there exists an infinite subset $K' \subseteq K$ such that the subsequence $\{(\Delta z^k, \hat\lambda^{k+1})\}_{K'}$ tends to infinity. Let $\{z^k\}_{K'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar K \subseteq K'$ such that $\{\lambda^k\}_{\bar K} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Thus we have
$$S_k \to S^* \tag{C.1}$$
as $k \to \infty$ with $k \in \bar K$. By Proposition B.1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\bar K}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let $(z^k, \lambda^k)$ and $(\Delta z^k, \hat\lambda^{k+1})$ be generated by Algorithm 2. If $\{z^k\}_K$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_K \to 0$.
Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_K$ is bounded. Suppose on the contrary that there exists an infinite subset $K' \subseteq K$ such that $\{\Delta z^k\}_{K'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{K'}$, $\{\lambda^k\}_{K'}$, and $\{\hat\lambda^k\}_{K'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat\lambda^*$ as well as an infinite index set $\bar K \subseteq K'$ such that $\{z^k\}_{\bar K} \to z^*$, $\{\lambda^k\}_{\bar K} \to \lambda^*$, and $\{\hat\lambda^k\}_{\bar K} \to \hat\lambda^*$. By Lemma C.1 we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$
\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^*
= -\bigl(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\bigr)\, S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0. \tag{C.2}
$$
Since $g_i(z^*) > 0$ for all $i$, there exists an $\bar\alpha \in (0, 1]$ such that, for all $\alpha \in (0, \bar\alpha]$,
$$g_i(z^* + \alpha \Delta z^*) > 0, \quad i = 1, 2, \dots, 3m + 2n. \tag{C.3}$$
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\hat\alpha \in (0, \bar\alpha]$ such that the following inequality holds for all $\alpha \in (0, \hat\alpha]$:
$$\Phi_\mu(z^* + \alpha \Delta z^*) - \Phi_\mu(z^*) \le 1.1\,\tau_1\, \alpha\, \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C.4}$$
Let $m^* = \min\{j \mid \beta^j \in (0, \hat\alpha],\ j = 0, 1, \dots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar K$ sufficiently large and any $i \in \{1, 2, \dots, 3m + 2n\}$:
$$g_i(z^k + \alpha^* \Delta z^k) > 0. \tag{C.5}$$
Moreover, for all sufficiently large $k \in \bar K$ we have
$$
\begin{aligned}
\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^k)
&= \Phi_\mu(z^* + \alpha^* \Delta z^*) - \Phi_\mu(z^*)\\
&\quad + \Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^* + \alpha^* \Delta z^*) + \Phi_\mu(z^*) - \Phi_\mu(z^k)\\
&\le 1.05\,\tau_1\, \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*\\
&\le \tau_1\, \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k.
\end{aligned} \tag{C.6}
$$
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar K$ large enough. Consequently, when $k \in \bar K$ is sufficiently large, we have from (23) and (C.3) that
$$
\Phi_\mu(z^{k+1}) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \frac{1}{2}\tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C.7}
$$
This shows that $\{\Phi_\mu(z^k)\}_K \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_K$ is bounded from below. The proof is complete.
We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat\lambda^*)$ be a limit point of $\{(z^k, \hat\lambda^k)\}$ and let the subsequence $\{(z^k, \hat\lambda^k)\}_K$ converge to $(z^*, \hat\lambda^*)$. Recall that $(\Delta z^k, \hat\lambda^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in K$, by the use of Lemma C.1 we obtain
$$
\begin{aligned}
&\hat\lambda^{*(1)} - \hat\lambda^{*(2)} - X^T Y^T \hat\lambda^{*(4)} = 0,\\
&C e_n - \hat\lambda^{*(4)} - \hat\lambda^{*(5)} = 0,\\
&e_m - 2 T^* \hat\lambda^{*(1)} - 2 T^* \hat\lambda^{*(2)} - \hat\lambda^{*(3)} = 0,\\
&y^T \hat\lambda^{*(4)} = 0,\\
&\bigl(T^{*2} - W^*\bigr)\hat\lambda^{*(1)} - \mu e_m = 0,\\
&\bigl(T^{*2} + W^*\bigr)\hat\lambda^{*(2)} - \mu e_m = 0,\\
&T^* \hat\lambda^{*(3)} - \mu e_m = 0,\\
&P^* \hat\lambda^{*(4)} - \mu e_n = 0,\\
&\Xi^* \hat\lambda^{*(5)} - \mu e_n = 0.
\end{aligned} \tag{C.8}
$$
This shows that $(z^*, \hat\lambda^*)$ satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat\lambda^j)\}_J$ converges to some point $(z^*, \hat\lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
$$\operatorname{Res}(z^j, \hat\lambda^j, \mu_j) \xrightarrow{J} \operatorname{Res}(z^*, \hat\lambda^*, 0) = 0, \qquad \{\hat\lambda^j\}_J \to \hat\lambda^* \ge 0. \tag{C.9}$$
Consequently, $(z^*, \hat\lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat\lambda^j\|_\infty, 1\}$ and $\tilde\lambda^j = \xi_j^{-1}\hat\lambda^j$. Obviously $\{\tilde\lambda^j\}$ is bounded. Hence there exists an infinite subset $J' \subseteq J$ such that $\{\tilde\lambda^j\}_{J'} \to \tilde\lambda \ne 0$ and $\|\tilde\lambda^j\|_\infty = 1$ for large $j \in J'$. From (30) we know that $\tilde\lambda \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in J'$, we get
$$
\begin{aligned}
&\tilde\lambda^{(1)} - \tilde\lambda^{(2)} - X^T Y^T \tilde\lambda^{(4)} = 0,\\
&\tilde\lambda^{(4)} + \tilde\lambda^{(5)} = 0,\\
&2 T^* \tilde\lambda^{(1)} + 2 T^* \tilde\lambda^{(2)} + \tilde\lambda^{(3)} = 0,\\
&y^T \tilde\lambda^{(4)} = 0,\\
&\bigl(T^{*2} - W^*\bigr)\tilde\lambda^{(1)} = 0,\\
&\bigl(T^{*2} + W^*\bigr)\tilde\lambda^{(2)} = 0,\\
&T^* \tilde\lambda^{(3)} = 0,\\
&P^* \tilde\lambda^{(4)} = 0,\\
&\Xi^* \tilde\lambda^{(5)} = 0.
\end{aligned} \tag{C.10}
$$
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz John point of problem (10).
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412-420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764-771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823-1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357-1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668-674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82-90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145-153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185-202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49-56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081-1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707-710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I-293-I-296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L_{1/2} regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159-1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417-2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253-2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380-6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100-107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307-1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937-945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L_{1/2} regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285-299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928-961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point L_{1/2}-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1-3, 1966.
[35] F. John, "Extremum problems with inequalities as side conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187-204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
Figure 1: The L_p penalty |w|^p in one dimension, for p = 2, p = 1, and p = 0.5.
2.3. The L1/2-SVM Model. We pay particular attention to the $L_{1/2}$-SVM, namely, problem (4) with $p = 1/2$. We will derive a smooth constrained optimization reformulation of the $L_{1/2}$-SVM so that it is relatively easy to design numerical methods. We first specify the $L_{1/2}$-SVM:
$$
\begin{aligned}
\min\ & \sum_{j=1}^{m} |w_j|^{1/2} + C \sum_{i=1}^{n} \xi_i \triangleq \phi(w, \xi, b)\\
\text{s.t.}\ & y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n,\\
& \xi_i \ge 0, \quad i = 1, \dots, n.
\end{aligned} \tag{5}
$$
Denote by $u = (w, \xi, b)$ and by $\mathcal{D}$ the feasible region of the problem; that is,
$$\mathcal{D} = \bigl\{(w, \xi, b) \mid y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n\bigr\}. \tag{6}$$
Then the $L_{1/2}$-SVM can be written in the compact form
$$\min\ \phi(u), \quad u \in \mathcal{D}. \tag{7}$$
It is a nonconvex and non-Lipschitz problem. Due to the existence of the term $|w_j|^{1/2}$, the objective function is not even directionally differentiable at a point with some $w_j = 0$, which makes the problem very difficult to solve. Existing numerical methods that are very efficient for smooth problems cannot be used directly. One possible way to develop numerical methods for solving (7) is to smooth the term $|w_j|^{1/2}$ using some smoothing function, such as $\phi_\epsilon(w_j) = (w_j^2 + \epsilon^2)^{1/4}$ with some $\epsilon > 0$. However, it is easy to see that the derivative of $\phi_\epsilon(w_j)$ becomes unbounded as $w_j \to 0$ and $\epsilon \to 0$. Consequently, it is not to be expected that numerical methods based on such a smoothing will work well.
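This unboundedness can be made concrete by a one-line computation (ours, added for completeness):
$$\phi_\epsilon'(w_j) = \frac{w_j}{2\,(w_j^2 + \epsilon^2)^{3/4}}, \qquad \phi_\epsilon'(\epsilon) = \frac{1}{2^{7/4}\,\epsilon^{1/2}} \to \infty \quad \text{as } \epsilon \downarrow 0,$$
so the gradient blows up at points of order $\epsilon$ away from zero, which is exactly where a smoothing method needs reliable derivative information.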
Recently, Tian and Yang [33] proposed an interior point $L_{1/2}$-penalty function method to solve general nonlinear programming problems by using a quadratic relaxation scheme for their $L_{1/2}$-lower-order penalty problems. We follow the idea of [33] to develop an interior point method for solving the $L_{1/2}$-SVM. To this end, in the next subsection we reformulate problem (7) as a smooth constrained optimization problem.
2.4. A Reformulation of the L1/2-SVM Model. Consider the following constrained optimization problem:
$$
\begin{aligned}
\min\ & \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n} \xi_i \triangleq f(w, \xi, t, b)\\
\text{s.t.}\ & t_j^2 - w_j \ge 0, \quad j = 1, \dots, m,\\
& t_j^2 + w_j \ge 0, \quad j = 1, \dots, m,\\
& t_j \ge 0, \quad j = 1, \dots, m,\\
& y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n,\\
& \xi_i \ge 0, \quad i = 1, \dots, n.
\end{aligned} \tag{8}
$$
It is obtained by letting $t_j = |w_j|^{1/2}$ in the objective function and adding the constraints $t_j^2 - w_j \ge 0$ and $t_j^2 + w_j \ge 0$, $j = 1, \dots, m$, to (7). Denote by $\mathcal{F}$ the feasible region of the problem; that is,
$$
\mathcal{F} = \bigl\{(w, \xi, t, b) \mid t_j^2 - w_j \ge 0,\ t_j^2 + w_j \ge 0,\ t_j \ge 0,\ j = 1, \dots, m\bigr\} \cap \bigl\{(w, \xi, t, b) \mid y_i\bigl(w^T x_i + b\bigr) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n\bigr\}. \tag{9}
$$
Let $z = (w, \xi, t, b)$. Then the above problem can be written as
$$\min f(z), \quad z \in \mathcal{F}. \tag{10}$$
The following theorem establishes the equivalence between the $L_{1/2}$-SVM and (10).

Theorem 1. If $u^* = (w^*, \xi^*, b^*) \in R^{m+n+1}$ is a solution of the $L_{1/2}$-SVM (7), then $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in R^{m+n+m+1}$ is a solution of the optimization problem (10). Conversely, if $(w^*, \xi^*, t^*, b^*)$ is a solution of the optimization problem (10), then $(w^*, \xi^*, b^*)$ is a solution of the $L_{1/2}$-SVM (7).
Proof. Let $u^* = (w^*, \xi^*, b^*)$ be a solution of the $L_{1/2}$-SVM (7) and let $\bar z = (\bar w, \bar\xi, \bar t, \bar b)$ be a solution of the constrained optimization problem (10). It is clear that $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in \mathcal{F}$. Moreover, we have $\bar t_j^2 \ge |\bar w_j|$ for all $j = 1, 2, \dots, m$, and hence
$$
f(\bar z) = \sum_{j=1}^{m} \bar t_j + C \sum_{i=1}^{n} \bar\xi_i \ge \sum_{j=1}^{m} |\bar w_j|^{1/2} + C \sum_{i=1}^{n} \bar\xi_i \ge \sum_{j=1}^{m} |w_j^*|^{1/2} + C \sum_{i=1}^{n} \xi_i^* = \phi(u^*). \tag{11}
$$
Since $z^* \in \mathcal{F}$, we have
$$\phi(u^*) = f(z^*) \ge f(\bar z). \tag{12}$$
This together with (11) implies that $f(\bar z) = \phi(u^*)$. The proof is complete.
It is clear that the constraint functions of (10) are convex. Consequently, at any feasible point, the set of all feasible directions is the same as the set of all linearized feasible directions. As a result, a KKT point exists. The KKT system of problem (10) can be written as the following system of nonlinear equations:
$$
R(w, \xi, t, b, \lambda) =
\begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)}\\
C e_n - \lambda^{(4)} - \lambda^{(5)}\\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)}\\
y^T \lambda^{(4)}\\
\bigl\{\min\{t_j^2 - w_j,\ \lambda_j^{(1)}\}\bigr\}_{j=1,\dots,m}\\
\bigl\{\min\{t_j^2 + w_j,\ \lambda_j^{(2)}\}\bigr\}_{j=1,\dots,m}\\
\bigl\{\min\{t_j,\ \lambda_j^{(3)}\}\bigr\}_{j=1,\dots,m}\\
\bigl\{\min\{p_i,\ \lambda_i^{(4)}\}\bigr\}_{i=1,\dots,n}\\
\bigl\{\min\{\xi_i,\ \lambda_i^{(5)}\}\bigr\}_{i=1,\dots,n}
\end{pmatrix} = 0, \tag{13}
$$
where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $X = (x_1, x_2, \dots, x_n)^T$, $Y = \operatorname{diag}(y)$ is a diagonal matrix, and $p_i = y_i(w^T x_i + b) - (1 - \xi_i)$, $i = 1, 2, \dots, n$.

For the sake of simplicity, the properties of the reformulation of the $L_{1/2}$-SVM are given in Appendix A.
3. An Interior Point Method

In this section we develop an interior point method to solve the equivalent constrained problem (10) of the $L_{1/2}$-SVM (7).
3.1. Auxiliary Function. Following the idea of the interior point method, the constrained problem (10) can be solved by minimizing a sequence of logarithmic barrier functions as follows:
$$
\begin{aligned}
\min\ \Phi_\mu(w, \xi, t, b) \triangleq\ & \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n} \xi_i - \mu \sum_{j=1}^{m} \bigl[\log\bigl(t_j^2 - w_j\bigr) + \log\bigl(t_j^2 + w_j\bigr) + \log t_j\bigr] - \mu \sum_{i=1}^{n} \bigl(\log p_i + \log \xi_i\bigr)\\
\text{s.t.}\ & t_j^2 - w_j > 0, \quad t_j^2 + w_j > 0, \quad t_j > 0, \quad j = 1, 2, \dots, m,\\
& p_i = y_i\bigl(w^T x_i + b\bigr) + \xi_i - 1 > 0, \quad \xi_i > 0, \quad i = 1, 2, \dots, n,
\end{aligned} \tag{14}
$$
where $\mu$ is the barrier parameter converging to zero from above.
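A minimal NumPy transcription of (14), assuming the data are held as arrays; the function name and the +inf convention are ours, chosen so the evaluation can be reused directly inside a backtracking line search:

```python
import numpy as np

def barrier_value(w, xi, t, b, X, y, C, mu):
    """Logarithmic barrier function Phi_mu of (14).

    Returns +inf outside the strictly feasible region, which makes the
    function convenient to use inside a backtracking line search."""
    p = y * (X @ w + b) + xi - 1.0           # slacks of the margin constraints
    terms = [t**2 - w, t**2 + w, t, p, xi]   # quantities kept strictly positive
    if any((v <= 0).any() for v in terms):
        return np.inf
    obj = t.sum() + C * xi.sum()
    barrier = (np.log(t**2 - w).sum() + np.log(t**2 + w).sum()
               + np.log(t).sum() + np.log(p).sum() + np.log(xi).sum())
    return obj - mu * barrier
```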
The KKT system of problem (14) is the following system of nonlinear equations:
$$
\begin{aligned}
&\lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} = 0,\\
&C e_n - \lambda^{(4)} - \lambda^{(5)} = 0,\\
&e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} = 0,\\
&y^T \lambda^{(4)} = 0,\\
&\bigl(T^2 - W\bigr)\lambda^{(1)} - \mu e_m = 0,\\
&\bigl(T^2 + W\bigr)\lambda^{(2)} - \mu e_m = 0,\\
&T\lambda^{(3)} - \mu e_m = 0,\\
&P\lambda^{(4)} - \mu e_n = 0,\\
&\Xi\lambda^{(5)} - \mu e_n = 0,
\end{aligned} \tag{15}
$$
where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $T = \operatorname{diag}(t)$, $W = \operatorname{diag}(w)$, $\Xi = \operatorname{diag}(\xi)$, and $P = \operatorname{diag}(Y(Xw + b e_n) + \xi - e_n)$ are diagonal matrices, and $e_m \in R^m$ and $e_n \in R^n$ stand for vectors whose elements are all ones.
3.2. Newton's Method. We apply Newton's method to solve the nonlinear system (15) in the variables $w, \xi, t, b$, and $\lambda$. The subproblem of the method is the following system of linear equations:
$$
M \begin{pmatrix} \Delta w\\ \Delta\xi\\ \Delta t\\ \Delta b\\ \Delta\lambda^{(1)}\\ \Delta\lambda^{(2)}\\ \Delta\lambda^{(3)}\\ \Delta\lambda^{(4)}\\ \Delta\lambda^{(5)} \end{pmatrix}
= - \begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)}\\
C e_n - \lambda^{(4)} - \lambda^{(5)}\\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)}\\
e_n^T Y \lambda^{(4)}\\
\bigl(T^2 - W\bigr)\lambda^{(1)} - \mu e_m\\
\bigl(T^2 + W\bigr)\lambda^{(2)} - \mu e_m\\
T\lambda^{(3)} - \mu e_m\\
P\lambda^{(4)} - \mu e_n\\
\Xi\lambda^{(5)} - \mu e_n
\end{pmatrix}, \tag{16}
$$
where $M$ is the Jacobian of the function on the left-hand side of (15) and takes the form
$$
M = \begin{pmatrix}
0 & 0 & 0 & 0 & I & -I & 0 & -X^T Y^T & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -I & -I\\
0 & 0 & -2(D_1 + D_2) & 0 & -2T & -2T & -I & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & y^T & 0\\
-D_1 & 0 & 2TD_1 & 0 & T^2 - W & 0 & 0 & 0 & 0\\
D_2 & 0 & 2TD_2 & 0 & 0 & T^2 + W & 0 & 0 & 0\\
0 & 0 & D_3 & 0 & 0 & 0 & T & 0 & 0\\
D_4 Y X & D_4 & 0 & D_4 y & 0 & 0 & 0 & P & 0\\
0 & D_5 & 0 & 0 & 0 & 0 & 0 & 0 & \Xi
\end{pmatrix}, \tag{17}
$$
where $D_1 = \operatorname{diag}(\lambda^{(1)})$, $D_2 = \operatorname{diag}(\lambda^{(2)})$, $D_3 = \operatorname{diag}(\lambda^{(3)})$, $D_4 = \operatorname{diag}(\lambda^{(4)})$, and $D_5 = \operatorname{diag}(\lambda^{(5)})$.
We can rewrite (16) as
$$
\begin{aligned}
&\bigl(\lambda^{(1)} + \Delta\lambda^{(1)}\bigr) - \bigl(\lambda^{(2)} + \Delta\lambda^{(2)}\bigr) - X^T Y^T\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) = 0,\\
&\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) + \bigl(\lambda^{(5)} + \Delta\lambda^{(5)}\bigr) = C e_n,\\
&2(D_1 + D_2)\Delta t + 2T\bigl(\lambda^{(1)} + \Delta\lambda^{(1)}\bigr) + 2T\bigl(\lambda^{(2)} + \Delta\lambda^{(2)}\bigr) + \bigl(\lambda^{(3)} + \Delta\lambda^{(3)}\bigr) = e_m,\\
&y^T\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) = 0,\\
&-D_1\Delta w + 2TD_1\Delta t + \bigl(T^2 - W\bigr)\bigl(\lambda^{(1)} + \Delta\lambda^{(1)}\bigr) = \mu e_m,\\
&D_2\Delta w + 2TD_2\Delta t + \bigl(T^2 + W\bigr)\bigl(\lambda^{(2)} + \Delta\lambda^{(2)}\bigr) = \mu e_m,\\
&D_3\Delta t + T\bigl(\lambda^{(3)} + \Delta\lambda^{(3)}\bigr) = \mu e_m,\\
&D_4 Y X \Delta w + D_4 \Delta\xi + D_4 y\, \Delta b + P\bigl(\lambda^{(4)} + \Delta\lambda^{(4)}\bigr) = \mu e_n,\\
&D_5 \Delta\xi + \Xi\bigl(\lambda^{(5)} + \Delta\lambda^{(5)}\bigr) = \mu e_n.
\end{aligned} \tag{18}
$$
It follows from the last five equations that the vector $\hat\lambda \triangleq \lambda + \Delta\lambda$ can be expressed as
$$
\begin{aligned}
&\hat\lambda^{(1)} = \bigl(T^2 - W\bigr)^{-1}\bigl(\mu e_m + D_1\Delta w - 2TD_1\Delta t\bigr),\\
&\hat\lambda^{(2)} = \bigl(T^2 + W\bigr)^{-1}\bigl(\mu e_m - D_2\Delta w - 2TD_2\Delta t\bigr),\\
&\hat\lambda^{(3)} = T^{-1}\bigl(\mu e_m - D_3\Delta t\bigr),\\
&\hat\lambda^{(4)} = P^{-1}\bigl(\mu e_n - D_4 Y X \Delta w - D_4\Delta\xi - D_4 y\, \Delta b\bigr),\\
&\hat\lambda^{(5)} = \Xi^{-1}\bigl(\mu e_n - D_5\Delta\xi\bigr).
\end{aligned} \tag{19}
$$
Substituting (19) into the first four equations of (18), we obtain
$$
S \begin{pmatrix}\Delta w\\ \Delta\xi\\ \Delta t\\ \Delta b\end{pmatrix}
= \begin{pmatrix}
-\mu\bigl(T^2 - W\bigr)^{-1} e_m + \mu\bigl(T^2 + W\bigr)^{-1} e_m + \mu X^T Y^T P^{-1} e_n\\
-C e_n + \mu P^{-1} e_n + \mu \Xi^{-1} e_n\\
-e_m + 2\mu T\bigl(T^2 - W\bigr)^{-1} e_m + 2\mu T\bigl(T^2 + W\bigr)^{-1} e_m + \mu T^{-1} e_m\\
\mu y^T P^{-1} e_n
\end{pmatrix}, \tag{20}
$$
where the matrix $S$ takes the form
$$
S = \begin{pmatrix} S_{11} & S_{12} & S_{13} & S_{14}\\ S_{21} & S_{22} & S_{23} & S_{24}\\ S_{31} & S_{32} & S_{33} & S_{34}\\ S_{41} & S_{42} & S_{43} & S_{44}\end{pmatrix} \tag{21}
$$
with blocks
$$
\begin{aligned}
&S_{11} = U + V + X^T Y^T P^{-1} D_4 Y X, \qquad S_{12} = X^T Y^T P^{-1} D_4, \qquad S_{13} = -2(U - V)T, \qquad S_{14} = X^T Y^T P^{-1} D_4\, y,\\
&S_{21} = P^{-1} D_4 Y X, \qquad S_{22} = P^{-1} D_4 + \Xi^{-1} D_5, \qquad S_{23} = 0, \qquad S_{24} = P^{-1} D_4\, y,\\
&S_{31} = -2(U - V)T, \qquad S_{32} = 0, \qquad S_{33} = 4T(U + V)T + T^{-1} D_3 - 2(D_1 + D_2), \qquad S_{34} = 0,\\
&S_{41} = y^T P^{-1} D_4 Y X, \qquad S_{42} = y^T P^{-1} D_4, \qquad S_{43} = 0, \qquad S_{44} = y^T P^{-1} D_4\, y,
\end{aligned} \tag{22}
$$
and $U = (T^2 - W)^{-1} D_1$ and $V = (T^2 + W)^{-1} D_2$.
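The following is a direct NumPy transcription (ours, not the authors' MATLAB code) of the reduced Newton step: it assembles S from (21)-(22), solves (20), and recovers the multiplier estimate via (19). All diagonal matrices are kept as vectors; the function name and argument layout are illustrative:

```python
import numpy as np

def newton_direction(w, xi, t, b, lam, X, y, C, mu):
    """Solve (20) for (dw, dxi, dt, db) and recover lam_hat from (19).
    lam is the tuple (lam1, ..., lam5) of multiplier blocks."""
    lam1, lam2, lam3, lam4, lam5 = lam
    m, n = len(w), len(y)
    p = y * (X @ w + b) + xi - 1.0
    YX = y[:, None] * X                        # the matrix Y X
    u = lam1 / (t**2 - w)                      # diagonal of U
    v = lam2 / (t**2 + w)                      # diagonal of V
    d4p = lam4 / p                             # diagonal of P^{-1} D_4

    S11 = np.diag(u + v) + YX.T @ (d4p[:, None] * YX)
    S12 = YX.T * d4p                           # X^T Y^T P^{-1} D_4
    S13 = np.diag(-2.0 * (u - v) * t)          # diagonal, so S31 = S13
    S14 = (YX.T @ (d4p * y))[:, None]
    S21 = d4p[:, None] * YX
    S22 = np.diag(d4p + lam5 / xi)
    S24 = (d4p * y)[:, None]
    S33 = np.diag(4.0 * t * (u + v) * t + lam3 / t - 2.0 * (lam1 + lam2))
    S41 = (y * d4p) @ YX
    S42 = y * d4p
    S44 = np.array([[y @ (d4p * y)]])

    S = np.block([
        [S11, S12, S13, S14],
        [S21, S22, np.zeros((n, m)), S24],
        [S13, np.zeros((m, n)), S33, np.zeros((m, 1))],
        [S41[None, :], S42[None, :], np.zeros((1, m)), S44],
    ])
    rhs = np.concatenate([
        -mu / (t**2 - w) + mu / (t**2 + w) + mu * X.T @ (y / p),
        -C * np.ones(n) + mu / p + mu / xi,
        -np.ones(m) + 2 * mu * t / (t**2 - w) + 2 * mu * t / (t**2 + w) + mu / t,
        [mu * y @ (1.0 / p)],
    ])
    d = np.linalg.solve(S, rhs)
    dw, dxi, dt, db = d[:m], d[m:m + n], d[m + n:2 * m + n], d[-1]
    lam_hat = (
        (mu + lam1 * dw - 2 * t * lam1 * dt) / (t**2 - w),      # (19), block 1
        (mu - lam2 * dw - 2 * t * lam2 * dt) / (t**2 + w),      # block 2
        (mu - lam3 * dt) / t,                                   # block 3
        (mu - lam4 * (YX @ dw) - lam4 * dxi - lam4 * y * db) / p,  # block 4
        (mu - lam5 * dxi) / xi,                                 # block 5
    )
    return dw, dxi, dt, db, lam_hat
```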
3.3. The Interior Point Algorithm. Let $z = (w, \xi, t, b)$ and $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$. We first present the interior point algorithm to solve the barrier problem (14) and then discuss the details of the algorithm.
Algorithm 2 (the interior point algorithm (IPA)).

Step 0. Given tolerance $\epsilon_\mu$, set $\tau_1 \in (0, 1/2)$, $l > 0$, $\gamma_1 > 1$, $\beta \in (0, 1)$. Let $k = 0$.

Step 1. Stop if the KKT condition (15) holds.

Step 2. Compute $\Delta z^k$ from (20) and $\hat\lambda^{k+1}$ from (19). Compute $z^{k+1} = z^k + \alpha_k \Delta z^k$. Update the Lagrangian multipliers to obtain $\lambda^{k+1}$.

Step 3. Let $k = k + 1$. Go to Step 1.
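Collecting the pieces, the inner iteration can be organized as below. This is our schematic reading of Algorithm 2, not the authors' code: the four callables stand for the routines specified by (19)-(20), (23), (24)-(27), and (29), some of which are sketched later in this section.

```python
def ipa(z0, lam0, mu, eps_mu, newton_step, line_search, safeguard, residual):
    """Skeleton of Algorithm 2 (IPA) for the barrier subproblem (14).
    The callables are placeholders supplied by the user."""
    z, lam = z0, lam0
    while residual(z, lam, mu) >= eps_mu:        # Step 1: inexact KKT test
        dz, lam_hat = newton_step(z, lam, mu)    # Step 2: search direction
        alpha = line_search(z, dz, mu)           #         step length (23)
        z = z + alpha * dz
        lam = safeguard(lam_hat, z, mu)          #         bounded multipliers
    return z, lam
```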
In Step 2, a step length $\alpha_k$ is used to calculate $z^{k+1}$. We estimate $\alpha_k$ by an Armijo line search [34], in which $\alpha_k = \max\{\beta^j \mid j = 0, 1, 2, \dots\}$ for some $\beta \in (0, 1)$ such that the following inequalities are satisfied:
$$
\begin{aligned}
&\bigl(t^{k+1}\bigr)^2 - w^{k+1} > 0, \qquad \bigl(t^{k+1}\bigr)^2 + w^{k+1} > 0, \qquad t^{k+1} > 0,\\
&\Phi_\mu\bigl(z^{k+1}\bigr) - \Phi_\mu\bigl(z^k\bigr) \le \tau_1 \alpha_k \bigl(\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k\bigr),
\end{aligned} \tag{23}
$$
where $\tau_1 \in (0, 1/2)$.
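A sketch of this backtracking rule, assuming a barrier evaluator in the style of barrier_value above (which returns +inf when strict feasibility fails, so the feasibility tests in (23) come for free); the names and the iteration cap are ours:

```python
def armijo(phi_mu, z, dz, slope, tau1=0.25, beta=0.5, max_back=50):
    """Backtracking rule of (23). `phi_mu` evaluates Phi_mu at a stacked
    iterate; `slope` is grad Phi_mu(z)^T dz, negative for dz != 0 by Lemma 3."""
    alpha, phi0 = 1.0, phi_mu(z)
    for _ in range(max_back):                  # alpha = beta**j, j = 0, 1, ...
        if phi_mu(z + alpha * dz) - phi0 <= tau1 * alpha * slope:
            return alpha
        alpha *= beta                          # beta in (0, 1); tau1 in (0, 1/2)
    raise RuntimeError("line search failed")   # cannot happen by Lemma 3
```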
To avoid ill-conditioned growth of $\hat\lambda^k$ and to guarantee strict dual feasibility, the Lagrangian multipliers $\lambda^k$ should be sufficiently positive and bounded from above. Following a similar idea to [33], we first update the dual multipliers by
$$
\tilde\lambda^{(r)(k+1)}_i =
\begin{cases}
\dfrac{l\mu}{g^{(r)}_i(z^k)}, & \text{if } \hat\lambda^{(r)(k+1)}_i < \dfrac{l\mu}{g^{(r)}_i(z^k)},\\[6pt]
\hat\lambda^{(r)(k+1)}_i, & \text{if } \dfrac{l\mu}{g^{(r)}_i(z^k)} \le \hat\lambda^{(r)(k+1)}_i \le \dfrac{\mu\gamma_1}{g^{(r)}_i(z^k)},\\[6pt]
\dfrac{\mu\gamma_1}{g^{(r)}_i(z^k)}, & \text{if } \hat\lambda^{(r)(k+1)}_i > \dfrac{\mu\gamma_1}{g^{(r)}_i(z^k)},
\end{cases}
\qquad r = 1, 2, 3, 4, \tag{24}
$$
where $g^{(1)}_i(z^k) = (t^k_i)^2 - w^k_i$, $g^{(2)}_i(z^k) = (t^k_i)^2 + w^k_i$, $g^{(3)}_i(z^k) = t^k_i$, and $g^{(4)}_i(z^k) = p^k_i$ with $p_i = y_i(w^T x_i + b) + \xi_i - 1$, and, similarly,
$$
\tilde\lambda^{(5)(k+1)}_i =
\begin{cases}
\dfrac{l\mu}{\xi^k_i}, & \text{if } \hat\lambda^{(5)(k+1)}_i < \dfrac{l\mu}{\xi^k_i},\\[6pt]
\hat\lambda^{(5)(k+1)}_i, & \text{if } \dfrac{l\mu}{\xi^k_i} \le \hat\lambda^{(5)(k+1)}_i \le \dfrac{\mu\gamma_1}{\xi^k_i},\\[6pt]
\dfrac{\mu\gamma_1}{\xi^k_i}, & \text{if } \hat\lambda^{(5)(k+1)}_i > \dfrac{\mu\gamma_1}{\xi^k_i},
\end{cases} \tag{25}
$$
where the parameters satisfy $l > 0$ and $\gamma_1 > 1$.
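Under one consistent reading of (24)-(25), each multiplier estimate is simply projected so that the complementarity product $g_i \lambda_i$ stays within the band $[l\mu, \gamma_1\mu]$. A one-function sketch of that reading (ours; the name and defaults, taken from the experimental settings in Section 4, are illustrative):

```python
import numpy as np

def clip_multiplier(lam_hat, g, mu, l=1e-6, gamma1=1e5):
    """Safeguard (24)-(25): project each multiplier estimate onto
    [l*mu/g_i, gamma1*mu/g_i], where g_i > 0 is the associated
    constraint value (e.g. t_i^2 - w_i for lambda^(1))."""
    return np.clip(lam_hat, l * mu / g, gamma1 * mu / g)
```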
Since positive definiteness of the matrix $S$ is required by this method, the Lagrangian multipliers $\tilde\lambda^{k+1}$ should also satisfy the condition
$$\lambda^{(3)} - 2\bigl(\lambda^{(1)} + \lambda^{(2)}\bigr)t \ge 0 \tag{26}$$
(for the sake of simplicity, the proof is given in Appendix B). Therefore, if $\tilde\lambda^{k+1}$ satisfies (26), we let $\lambda^{k+1} = \tilde\lambda^{k+1}$; otherwise, we further update it by
$$\lambda^{(1)(k+1)} = \gamma_2\tilde\lambda^{(1)(k+1)}, \qquad \lambda^{(2)(k+1)} = \gamma_2\tilde\lambda^{(2)(k+1)}, \qquad \lambda^{(3)(k+1)} = \gamma_3\tilde\lambda^{(3)(k+1)}, \tag{27}$$
where the constants $\gamma_2 \in (0, 1)$ and $\gamma_3 \ge 1$ satisfy
$$\frac{\gamma_3}{\gamma_2} = \max_{i \in E} \frac{2 t^{k+1}_i\bigl(\tilde\lambda^{(1)(k+1)}_i + \tilde\lambda^{(2)(k+1)}_i\bigr)}{\tilde\lambda^{(3)(k+1)}_i}, \tag{28}$$
with $E = \{1, 2, \dots, m\}$. It is not difficult to see that the vector $(\lambda^{(1)(k+1)}, \lambda^{(2)(k+1)}, \lambda^{(3)(k+1)})$ determined by (27) satisfies (26).

In practice, the KKT conditions (15) are allowed to be satisfied within a tolerance $\epsilon_\mu$; that is, the iterative process stops when the following inequalities are met:
$$
\operatorname{Res}(z, \lambda, \mu) = \left\|
\begin{pmatrix}
\lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)}\\
C e_n - \lambda^{(4)} - \lambda^{(5)}\\
e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)}\\
e_n^T Y \lambda^{(4)}\\
\bigl(T^2 - W\bigr)\lambda^{(1)} - \mu e_m\\
\bigl(T^2 + W\bigr)\lambda^{(2)} - \mu e_m\\
T\lambda^{(3)} - \mu e_m\\
P\lambda^{(4)} - \mu e_n\\
\Xi\lambda^{(5)} - \mu e_n
\end{pmatrix}\right\| < \epsilon_\mu, \tag{29}
$$
$$\lambda \ge -\epsilon_\mu e, \tag{30}$$
where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.
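The residual norm of (29) is straightforward to evaluate; a NumPy sketch (ours), with lam the tuple of five multiplier blocks:

```python
import numpy as np

def kkt_residual(w, xi, t, b, lam, X, y, C, mu):
    """Residual norm Res(z, lambda, mu) of (29)."""
    lam1, lam2, lam3, lam4, lam5 = lam
    p = y * (X @ w + b) + xi - 1.0
    r = np.concatenate([
        lam1 - lam2 - X.T @ (y * lam4),                 # dual feasibility in w
        C * np.ones_like(xi) - lam4 - lam5,             # dual feasibility in xi
        np.ones_like(w) - 2 * t * (lam1 + lam2) - lam3, # dual feasibility in t
        np.atleast_1d(y @ lam4),                        # dual feasibility in b
        (t**2 - w) * lam1 - mu,                         # perturbed complementarity
        (t**2 + w) * lam2 - mu,
        t * lam3 - mu,
        p * lam4 - mu,
        xi * lam5 - mu,
    ])
    return np.linalg.norm(r)
```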
Since functionΦ120583in (14) is convex we have the following
lemma which shows that Algorithm 2 is well defined (theproof is given in Appendix B)
Lemma 3 Let 119911119896 be strictly feasible for problem (10) If Δ119911119896=
0 then (23) is satisfied for all 120572119896
ge 0 If Δ119911119896
= 0 then thereexists a
119896isin (0 1] such that (23) holds for all 120572
119896isin (0
119896]
The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence $\{\mu_j\}$. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\beta \in (0, 1)$. Finally, we test optimality for problem (10) by means of the residual norm $\operatorname{Res}(z, \hat\lambda, 0)$. Here we present the whole algorithm to solve the $L_{1/2}$-SVM problem (10).
Algorithm 4 (algorithm for solving the $L_{1/2}$-SVM problem).

Step 0. Set $w^0 \in R^m$, $b^0 \in R$, $\xi^0 \in R^n$, $\lambda^0 = \hat\lambda^0 > 0$, and $t^0_i \ge \sqrt{|w^0_i| + 1/2}$, $i \in \{1, 2, \dots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\beta \in (0, 1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\operatorname{Res}(z^j, \hat\lambda^j, 0) \le \epsilon$ and $\hat\lambda^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\hat\lambda^{j+1} = \hat\lambda^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \beta\mu_j$ and $\epsilon_{\mu_{j+1}} = \beta\epsilon_{\mu_j}$. Let $j = j + 1$ and go to Step 1.
In Algorithm 4, the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2. The convergence of the proposed interior point method can be proved; we state the theorems here and give the proofs in Appendix C.
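The outer loop is then a few lines around the inner solver; again a schematic sketch (ours), where `ipa` is the inner routine sketched after Algorithm 2 and `residual` evaluates Res of (29):

```python
import numpy as np

def l12_svm_solve(z0, lam0, ipa, residual, mu0=1.0, eps0=1.0, beta=0.6, eps=1e-6):
    """Skeleton of Algorithm 4: drive the barrier parameter to zero."""
    z, lam, mu, eps_mu = z0, lam0, mu0, eps0
    while residual(z, lam, 0.0) > eps or np.any(lam < 0):  # Step 1
        z, lam = ipa(z, lam, mu, eps_mu)                   # Step 2: solve (14)
        mu, eps_mu = beta * mu, beta * eps_mu              # Step 3
    return z, lam
```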
Theorem 5. Let $(z^k, \lambda^k)$ be generated by Algorithm 2. Then any limit point of the sequence $(z^k, \hat\lambda^k)$ satisfies the first-order optimality conditions (15).

Theorem 6. Let $(z^j, \hat\lambda^j)$ be generated by Algorithm 4, ignoring its termination condition. Then the following statements are true.

(i) Any limit point of $(z^j, \hat\lambda^j)$ satisfies the first-order optimality conditions (13).

(ii) The limit point $z^*$ of a convergent subsequence $\{z^j\}_J \subseteq \{z^j\}$ with unbounded multipliers $\{\hat\lambda^j\}_J$ is a Fritz John point [35] of problem (10).
4. Experiments

In this section we test the constrained optimization reformulation of the $L_{1/2}$-SVM and the proposed interior point method. We compare the performance of the $L_{1/2}$-SVM with the $L_2$-SVM [30], the $L_1$-SVM [12], and the $L_0$-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml). All four problems were solved in the primal form, referencing the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider). The $L_2$-SVM and $L_1$-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard problem $L_0$-SVM, a commonly cited approximation method, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All experiments were run on a personal computer (1.6 GHz CPU, 4 GB RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4) we set the parameters as $\beta = 0.6$ (barrier reduction factor), $\tau_1 = 0.5$, $l = 10^{-6}$, $\gamma_1 = 10^5$, and $\beta = 0.5$ (line search factor). The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \dots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_j| / \max_i(|w_i|) \ge 10^{-4}$ [14] were set to zero. Then the cardinality of the hyperplane was computed as the number of nonzero weights.
4.1. Artificial Data. First we consider an artificial binary linear classification problem as an example, similar to that in [13]. The probabilities of $y = 1$ and $y = -1$ are equal. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $\{x_1, x_2, x_3\}$ were drawn as $x_l = y\,N(l, 1)$ and the second three features $\{x_4, x_5, x_6\}$ as $x_l = N(0, 1)$; otherwise, the first three were drawn as $x_l = N(0, 1)$ and the second three as $x_l = y\,N(l - 3, 1)$. The remaining features are noise, $x_l = N(0, 20)$, $l = 7, \dots, m$. Here $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
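A sketch of this generator (ours), reading the 70% as a per-sample probability as in [13] and the noise parameter 20 as a standard deviation; both readings are assumptions:

```python
import numpy as np

def make_artificial(n, m, rng):
    """Artificial data of Section 4.1: 6 relevant but redundant features,
    m - 6 noise features; returns (X, y) before standardization."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = np.empty((n, m))
    mode_a = rng.random(n) < 0.7                  # 70% of the samples
    for i in range(n):
        if mode_a[i]:
            X[i, :3] = y[i] * rng.normal([1.0, 2.0, 3.0], 1.0)   # x_l = y N(l, 1)
            X[i, 3:6] = rng.normal(0.0, 1.0, 3)                  # x_l = N(0, 1)
        else:
            X[i, :3] = rng.normal(0.0, 1.0, 3)
            X[i, 3:6] = y[i] * rng.normal([1.0, 2.0, 3.0], 1.0)  # x_l = y N(l-3, 1)
    X[:, 6:] = rng.normal(0.0, 20.0, (n, m - 6))  # noise, std 20 assumed
    return X, y
```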
In the first experiment we consider cases with a fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \dots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs ($L_1$-SVM, $L_{1/2}$-SVM, and $L_0$-SVM) achieve sparse solutions, while the $L_2$-SVM uses almost all features on every data set. Furthermore, the solutions of the $L_{1/2}$-SVM and $L_0$-SVM are much sparser than those of the $L_1$-SVM. As shown in Table 1, the $L_1$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the $L_{1/2}$-SVM and $L_0$-SVM are similar and close to 2. However, when $n = 10$ and $20$, the $L_0$-SVM has average cardinalities of 1.43 and 1.87, respectively. This means that the $L_0$-SVM sometimes selects only one feature on small samples and may ignore a really relevant feature. Consequently, with cardinalities between 2.05 and 2.90, the $L_{1/2}$-SVM gives a more reliable solution than the $L_0$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves better than the other three methods.
Figure 2 (right) plots the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The $L_1$-SVM has the best prediction performance in all cases, slightly better than the $L_{1/2}$-SVM. The $L_{1/2}$-SVM is more accurate in classification than the $L_2$-SVM and $L_0$-SVM, especially for $n = 10, \dots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the $L_{1/2}$-SVM is 88.05%, while the results of the $L_2$-SVM and $L_0$-SVM are 84.65% and 77.65%, respectively. Compared with the $L_2$-SVM and $L_0$-SVM, the $L_{1/2}$-SVM has the average accuracy increased by 3.4% and 10.4%, respectively. This can be explained as follows. For the $L_2$-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. For the $L_0$-SVM, few features are selected, and some relevant features are not included, which puts a negative impact on the prediction result. As the tradeoff between the $L_2$-SVM and $L_0$-SVM, the $L_{1/2}$-SVM performs better than both.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the $L_{1/2}$-SVM is 0.87% lower than that of the $L_1$-SVM, while the features selected by the $L_{1/2}$-SVM are 74.14% fewer than those of the $L_1$-SVM. This indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than the $L_1$-SVM at little cost in accuracy. Moreover, the average accuracy of the $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of the $L_0$-SVM, with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.
To further evaluate the feature selection performance of the $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should rank these two features on top according to the absolute values of their weights $|w_j|$. In the experiment we select the top 2 features with the maximal $|w_j|$ for each method and count the frequency with which the top 2 features are $x_3$ and $x_6$ in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of the $L_1$-SVM, the $L_0$-SVM, and the $L_{1/2}$-SVM are 7, 3, and 9, respectively. As $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of the $L_1$-SVM and the $L_0$-SVM are 22 and 25, respectively, while the result of the $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent; therefore, the $L_1$-SVM is not as good as the $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than the $L_{1/2}$-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.
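This hit-counting evaluation is straightforward to reproduce. Below is a minimal Python sketch, assuming the trained weight vector of each of the 30 runs is available; the function name and the 0-based indices (2, 5) standing for ($x_3$, $x_6$) are ours, not from the paper's code:

    import numpy as np

    def top2_hit_frequency(weight_vectors, relevant=(2, 5)):
        """Count runs whose two largest-|w_j| features are exactly the
        relevant pair (0-based indices 2 and 5, i.e., x_3 and x_6)."""
        hits = 0
        for w in weight_vectors:
            top2 = set(np.argsort(np.abs(w))[-2:])   # indices of the 2 largest |w_j|
            if top2 == set(relevant):
                hits += 1
        return hits

    # Toy check with one hit and one miss:
    ws = [np.array([0.0, 0.1, 2.0, 0.0, 0.2, -1.5]),   # top 2 = {x_3, x_6} -> hit
          np.array([1.0, 0.1, 0.3, 0.0, 0.2, -1.5])]   # top 2 = {x_1, x_6} -> miss
    print(top2_hit_frequency(ws))  # 1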
In the second simulation we consider cases with various dimensions of the feature space, $m = 20, 40, \dots, 180, 200$, and a fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the $L_1$-SVM increases from 8.26 to 23.1, whereas the cardinalities of the $L_{1/2}$-SVM and the $L_0$-SVM remain stable (from 2.2 to 2.95). This indicates that the $L_{1/2}$-SVM and the $L_0$-SVM are more suitable for feature selection than the $L_1$-SVM.

Figure 3 (right) shows that, with the increase of noisy features, the accuracy of the $L_2$-SVM drops significantly (from 98.68% to 87.47%).
[Figure 2: Results comparison on 10 artificial data sets with various training samples. Left panel: cardinality vs. number of training samples; right panel: accuracy (%) vs. number of training samples; curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]
Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

    n      L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
           Card    Acc     Card    Acc     Card    Acc     Card    Acc
    10     29.97   84.65   6.93    88.55   1.43    77.65   2.25    88.05
    20     30.00   92.21   8.83    96.79   1.87    93.17   2.35    95.50
    30     30.00   94.27   8.47    98.23   2.03    94.67   2.05    97.41
    40     29.97   95.99   9.00    98.75   2.07    95.39   2.10    97.75
    50     30.00   96.61   9.50    98.91   2.13    96.57   2.45    97.68
    60     29.97   97.25   10.00   98.94   2.20    97.61   2.45    97.98
    70     30.00   97.41   11.00   98.98   2.20    97.56   2.40    98.23
    80     30.00   97.68   10.03   99.09   2.23    97.61   2.60    98.50
    90     29.97   97.89   9.20    99.21   2.30    97.41   2.90    98.30
    100    30.00   98.13   10.27   99.09   2.50    97.93   2.55    98.36
    Avg.   29.99   95.21   9.32    97.65   2.10    94.56   2.41    96.78

"n" is the number of training samples, "Card" represents the number of selected features, and "Acc" is the classification accuracy (%).
Table 2: Frequency with which the most important features ($x_3$, $x_6$) are ranked in the top 2 over 30 runs.

    n           10   20   30   40   50   60   70   80   90   100
    L1-SVM       7   14   16   18   22   23   20   22   21    22
    L0-SVM       3   11   12   15   19   22   22   23   22    25
    L1/2-SVM     9   16   19   24   24   23   27   26   26    27
Table 3: Average results over 10 artificial data sets with various dimensions.

                   L2-SVM   L1-SVM   L0-SVM   L1/2-SVM
    Cardinality    109.95   15.85    2.38     2.41
    Accuracy (%)   92.93    99.07    97.76    98.26
Table 4: Feature selection performance comparison on UCI data sets (cardinality).

    No.  UCI data set    n     m     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                                     Mean    Std     Mean    Std     Mean    Std     Mean    Std
    1    Pima diabetes   768   8     8.00    0.00    7.40    0.89    6.20    1.10    7.60    1.52
    2    Breast Cancer   683   9     9.00    0.00    8.60    1.22    6.40    2.07    8.40    1.67
    3    Wine (3)        178   13    13.00   0.00    9.73    0.69    3.73    0.28    4.07    0.49
    4    Image (7)       2100  19    19.00   0.00    8.20    0.53    3.66    0.97    5.11    0.62
    5    SPECT           267   22    22.00   0.00    17.40   5.27    9.80    4.27    9.00    1.17
    6    WDBC            569   30    30.00   0.00    9.80    1.52    9.00    3.03    5.60    2.05
    7    Ionosphere      351   34    33.20   0.45    25.00   3.35    24.60   10.21   21.80   8.44
    8    SPECTF          267   44    44.00   0.00    38.80   10.98   24.00   11.32   25.60   7.75
    9    Sonar           208   60    60.00   0.00    31.40   13.05   27.00   7.18    13.60   10.06
    10   Musk1           476   166   166.00  0.00    85.80   18.85   48.60   4.28    41.00   14.16
    Average cardinality        40.50 40.42           24.21           16.30           14.18
Table 5: Classification performance comparison on UCI data sets (accuracy, %).

    No.  UCI data set    L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                         Mean    Std     Mean    Std     Mean    Std     Mean    Std
    1    Pima diabetes   76.99   3.47    76.73   0.85    76.73   2.94    77.12   1.52
    2    Breast cancer   96.18   2.23    97.06   0.96    96.32   1.38    96.47   1.59
    3    Wine (3)        97.71   3.73    98.29   1.28    93.14   2.39    97.14   2.56
    4    Image (7)       88.94   1.49    91.27   1.77    86.67   4.58    91.36   1.94
    5    SPECT           83.40   2.15    78.49   2.31    78.11   3.38    82.34   5.10
    6    WDBC            96.11   2.11    96.46   1.31    95.58   1.15    96.46   1.31
    7    Ionosphere      83.43   5.44    88.00   3.70    84.57   5.38    87.43   5.94
    8    SPECTF          78.49   3.91    74.34   4.50    73.87   3.25    78.49   1.89
    9    Sonar           73.66   5.29    75.61   5.82    74.15   7.98    76.10   4.35
    10   Musk1           84.84   1.73    82.11   2.42    76.00   5.64    82.32   3.38
    Average accuracy     85.97           85.84           83.51           86.52
In contrast, for the other three sparse SVMs there is little change in accuracy. This reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of the $L_{1/2}$-SVM is much sparser than that of the $L_1$-SVM, and its accuracy is slightly better than that of the $L_0$-SVM.
4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and instances with missing values were deleted. The data was then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
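As a concrete illustration of this protocol, the following is a minimal Python sketch; the helper names, the NaN convention for missing values, and the fixed random seed are our assumptions, not the paper's MATLAB code, while the relative threshold $10^{-4}$ for counting nonzero weights follows the pruning rule $|w_j| / \max_i |w_i| \ge 10^{-4}$ used in the experiments:

    import numpy as np

    rng = np.random.default_rng(0)

    def preprocess_and_split(X, y, train_frac=0.8):
        """Drop instances with missing values, standardize each feature to
        zero mean and unit variance, and split randomly into train/test."""
        keep = ~np.isnan(X).any(axis=1)
        X, y = X[keep], y[keep]
        sd = X.std(axis=0)
        X = (X - X.mean(axis=0)) / np.where(sd > 0, sd, 1.0)
        idx = rng.permutation(len(y))
        n_tr = int(train_frac * len(y))
        return X[idx[:n_tr]], y[idx[:n_tr]], X[idx[n_tr:]], y[idx[n_tr:]]

    def one_vs_rest_labels(y, cls):
        """Binary +1/-1 labels for one class in a one-against-rest scheme."""
        return np.where(y == cls, 1.0, -1.0)

    def cardinality(w, tol=1e-4):
        """Count weights surviving the relative threshold |w_j|/max_i|w_i| >= tol."""
        a = np.abs(np.asarray(w, dtype=float))
        return int(np.sum(a / a.max() >= tol))

    # Example: X = np.loadtxt(...); y = ...; Xtr, ytr, Xte, yte = preprocess_and_split(X, y)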
Tables 4 and 5 summarize the feature selection and classification performance, respectively, of the numerical experiments on the UCI data sets. Here $n$ is the number of samples and $m$ is the number of input features; for the two multiclass data sets the number of classes is given in parentheses after the name. Sparsity is defined as card/$m$, and a small sparsity value is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are shown in bold in the original tables.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while keeping accuracy roughly comparable to the $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), whereas the $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. On three data sets (6, 9, 10) the $L_{1/2}$-SVM has the best performance among the three sparse SVMs in both feature selection and classification. Compared with the $L_1$-SVM, the $L_{1/2}$-SVM achieves a sparser solution on nine data sets.
[Figure 3: Results comparison on artificial data sets with various dimensions. Left panel: cardinality vs. dimension; right panel: accuracy (%) vs. dimension; curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]
For example, on data set "8 SPECTF" the $L_{1/2}$-SVM selects 34% fewer features than the $L_1$-SVM, while its accuracy is 4.1% higher. On most data sets (4, 5, 6, 9, 10) the cardinality of the $L_{1/2}$-SVM drops significantly (by at least 37%) with equal or slightly better accuracy. Only on data set "3 Wine" is the accuracy of the $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity gain of the $L_{1/2}$-SVM is a 58.2% improvement over the $L_1$-SVM. On the remaining three data sets (1, 2, 7) the two methods give similar results in both feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than the $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the $L_0$-SVM, the $L_{1/2}$-SVM improves the classification accuracy on all data sets. For instance, on four data sets (3, 4, 8, 10) the classification accuracy of the $L_{1/2}$-SVM is at least 4.0% higher than that of the $L_0$-SVM. In particular, on data set "10 Musk1" the $L_{1/2}$-SVM gives a 6.3% rise in accuracy over the $L_0$-SVM. Meanwhile, Figure 4 (left) shows that the $L_{1/2}$-SVM selects fewer features than the $L_0$-SVM on five data sets (5, 6, 7, 9, 10); for example, on data sets 6 and 9 the cardinalities of the $L_{1/2}$-SVM are 37.8% and 49.6% smaller than those of the $L_0$-SVM, respectively. In summary, the $L_{1/2}$-SVM presents better classification performance than the $L_0$-SVM while remaining effective in choosing relevant features.
5. Conclusions

In this paper we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, for which it is relatively easy to develop numerical methods. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method: the $L_{1/2}$-SVM obtains sparser solutions than the $L_1$-SVM with comparable classification accuracy, and it achieves more accurate classification results than the $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, some interesting topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, exploring various applications of the $L_{1/2}$-SVM, and further validating its effectiveness. Some of these topics are under our current investigation.
Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has
$$FD(z, D) = LFD(z, D). \tag{A.1}$$
[Figure 4: Results comparison on UCI data sets. Left panel: sparsity (%) vs. data set number; right panel: accuracy (%) vs. data set number; curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM.]
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q) ={}& C \sum_{i=1}^{n} \xi_i + \sum_{j=1}^{m} t_j - \sum_{j=1}^{m} \lambda_j \left(t_j^2 - w_j\right) - \sum_{j=1}^{m} \mu_j \left(t_j^2 + w_j\right) \\
&- \sum_{j=1}^{m} \nu_j t_j - \sum_{i=1}^{n} p_i \left(y_i w^T x_i + y_i b + \xi_i - 1\right) - \sum_{i=1}^{n} q_i \xi_i,
\end{aligned} \tag{A.2}
$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, 2, \dots, m, \\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \dots, m, \\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \dots, n, \\
&\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0, \\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j \left(t_j^2 - w_j\right) = 0, \quad j = 1, 2, \dots, m, \\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j \left(t_j^2 + w_j\right) = 0, \quad j = 1, 2, \dots, m, \\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \dots, m, \\
&p_i \ge 0, \quad y_i \left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i \left(y_i \left(w^T x_i + b\right) + \xi_i - 1\right) = 0, \quad i = 1, 2, \dots, n, \\
&q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \dots, n.
\end{aligned} \tag{A.3}
$$
The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant $c > 0$, the level set
$$W_c = \left\{ z = (w, t, \xi, b) \mid f(z) \le c \right\} \cap D \tag{A.4}$$
is bounded.

Proof. For any $(w, t, \xi, b) \in W_c$, we have
$$\sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \tag{A.5}$$
Combining this with $t_i \ge 0$, $i = 1, \dots, m$, and $\xi_j \ge 0$, $j = 1, \dots, n$, we have
$$0 \le t_i \le c, \quad i = 1, \dots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \dots, n. \tag{A.6}$$
Moreover, for any $(w, t, \xi, b) \in D$,
$$\left|w_i\right| \le t_i^2, \quad i = 1, \dots, m. \tag{A.7}$$
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
$$y_i \left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{A.8}$$
we have
$$w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1. \tag{A.9}$$
Thus, if the feasible region $D$ is not empty, then we have
$$\max_{y_i = 1} \left(1 - \xi_i - w^T x_i\right) \le b \le \min_{y_i = -1} \left(-1 + \xi_i - w^T x_i\right). \tag{A.10}$$
Hence $b$ is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite and thus ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.
Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,
$$
\begin{aligned}
&\left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix} \\
&= v^{(1)T} S_{11} v^{(1)} + v^{(1)T} S_{12} v^{(2)} + v^{(1)T} S_{13} v^{(3)} + v^{(1)T} S_{14} v^{(4)} \\
&\quad + v^{(2)T} S_{21} v^{(1)} + v^{(2)T} S_{22} v^{(2)} + v^{(2)T} S_{23} v^{(3)} + v^{(2)T} S_{24} v^{(4)} \\
&\quad + v^{(3)T} S_{31} v^{(1)} + v^{(3)T} S_{32} v^{(2)} + v^{(3)T} S_{33} v^{(3)} + v^{(3)T} S_{34} v^{(4)} \\
&\quad + v^{(4)T} S_{41} v^{(1)} + v^{(4)T} S_{42} v^{(2)} + v^{(4)T} S_{43} v^{(3)} + v^{(4)T} S_{44} v^{(4)} \\
&= v^{(1)T} \left(U + V + X^T Y^T P^{-1} D_4 Y X\right) v^{(1)} + v^{(1)T} \left(X^T Y^T P^{-1} D_4\right) v^{(2)} \\
&\quad + v^{(1)T} \left(-2 (U - V) T\right) v^{(3)} + v^{(1)T} \left(X^T Y^T P^{-1} D_4 y\right) v^{(4)} \\
&\quad + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T} \left(P^{-1} D_4 + \Xi^{-1} D_5\right) v^{(2)} + v^{(2)T} P^{-1} D_4 y \, v^{(4)} \\
&\quad - 2 v^{(3)T} (U - V) T v^{(1)} + v^{(3)T} \left(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\right) v^{(3)} \\
&\quad + v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)T} y^T P^{-1} D_4 v^{(2)} + v^{(4)T} y^T P^{-1} D_4 y \, v^{(4)} \\
&= v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)} \\
&\quad + v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\quad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\quad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&= e_m^T U \left(D_{v_1} - 2 T D_{v_3}\right)^2 e_m + e_m^T V \left(D_{v_1} + 2 T D_{v_3}\right)^2 e_m \\
&\quad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\quad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&> 0,
\end{aligned} \tag{B.1}
$$
where $D_{v_1} = \operatorname{diag}(v^{(1)})$, $D_{v_2} = \operatorname{diag}(v^{(2)})$, $D_{v_3} = \operatorname{diag}(v^{(3)})$, and $D_{v_4} = \operatorname{diag}(v^{(4)})$.
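As a quick numerical companion to Proposition B.1, positive definiteness of a concrete matrix $S$ assembled from (21)-(22) can be checked by attempting a Cholesky factorization, which succeeds exactly for symmetric positive definite matrices. A minimal Python sketch with generic matrices (ours, not tied to the paper's data):

    import numpy as np

    def is_positive_definite(S):
        """Cholesky succeeds exactly for symmetric positive definite matrices."""
        S = 0.5 * (S + S.T)            # symmetrize away rounding noise
        try:
            np.linalg.cholesky(S)
            return True
        except np.linalg.LinAlgError:
            return False

    # Quick check: A A^T + I is positive definite for any real A.
    A = np.random.default_rng(1).standard_normal((5, 5))
    print(is_positive_definite(A @ A.T + np.eye(5)))   # True
    print(is_positive_definite(-np.eye(5)))            # False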
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k = -\left(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\right) S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \tag{B.2}
$$
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \tag{B.3}
$$
Consequently, there exists an $\bar{\alpha}_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $(x^k, t^k)$ is strictly feasible, the point $(x^k, t^k) + \alpha (\Delta x^k, \Delta t^k)$ is feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
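For illustration, the backtracking rule of Step 2 can be sketched in Python as follows. This is a minimal sketch, not the authors' code: phi stands for the barrier function $\Phi_\mu$, grad for its gradient, and feasible for strict feasibility of the constraints in (14); all three are assumptions supplied by the caller.

    import numpy as np

    def armijo_backtracking(phi, grad, feasible, z, dz,
                            tau1=0.4, beta=0.5, max_backtracks=50):
        """Return a step alpha = beta^j keeping z + alpha*dz strictly feasible
        and satisfying the sufficient-decrease inequality in (23)."""
        slope = float(grad(z) @ dz)    # directional derivative; negative for a descent step
        alpha = 1.0
        for _ in range(max_backtracks):
            z_new = z + alpha * dz
            if feasible(z_new) and phi(z_new) <= phi(z) + tau1 * alpha * slope:
                return alpha, z_new
            alpha *= beta
        raise RuntimeError("no acceptable step length found")

    # Toy usage: minimize phi(z) = ||z||^2 over the strictly feasible set {z : z > 0}.
    phi = lambda z: float(z @ z)
    grad = lambda z: 2.0 * z
    feasible = lambda z: bool(np.all(z > 0))
    z0 = np.array([1.0, 2.0])
    alpha, z1 = armijo_backtracking(phi, grad, feasible, z0, -grad(z0))
    print(alpha, z1)   # alpha = 0.25, z1 = [0.5, 1.0]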
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).
Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then $\{z^k\}$ is strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ is strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \dots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$ and the fact that $f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for any $i \in \{1, 2, \dots, 3m + 2n\}$, $\{g_i(x^k, t^k)\}$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)-(27).
Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Thus we have
$$S_k \to S_* \tag{C.1}$$
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S_*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ would imply that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$, as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$, such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat{\lambda}^k\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^*$. By Lemma C.1 we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$
\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^* = -\left(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\right) S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0. \tag{C.2}
$$
Since $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,
$$g_i(z^* + \alpha \Delta z^*) > 0, \quad \forall i \in \{1, 2, \dots, 3m + 2n\}. \tag{C.3}$$
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\hat{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \hat{\alpha}]$:
$$\Phi_\mu(z^* + \alpha \Delta z^*) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C.4}$$
Let $m^* = \min\{ j \mid \beta^j \in (0, \hat{\alpha}],\ j = 0, 1, \dots \}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, 2, \dots, 3m + 2n\}$:
$$g_i(z^k + \alpha^* \Delta z^k) > 0. \tag{C.5}$$
Moreover, we have
$$
\begin{aligned}
\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^k) &= \Phi_\mu(z^* + \alpha^* \Delta z^*) - \Phi_\mu(z^*) \\
&\quad + \Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^* + \alpha^* \Delta z^*) + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^* \\
&\le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k.
\end{aligned} \tag{C.6}
$$
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that
$$
\Phi_\mu(z^{k+1}) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \frac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C.7}
$$
This shows that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}$ is bounded from below. The proof is complete.
We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat{\lambda}^*)$ be a limit point of $\{(z^k, \hat{\lambda}^k)\}$, and let the subsequence $\{(z^k, \hat{\lambda}^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat{\lambda}^*)$. Recall that $(\Delta z^k, \hat{\lambda}^k)$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C.1 and C.3 we obtain
$$
\begin{aligned}
&\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0, \\
&C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0, \\
&e_m - 2 T_* \hat{\lambda}^{*(1)} - 2 T_* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \\
&y^T \hat{\lambda}^{*(4)} = 0, \\
&\left(T_*^2 - W_*\right) \hat{\lambda}^{*(1)} - \mu e_m = 0, \\
&\left(T_*^2 + W_*\right) \hat{\lambda}^{*(2)} - \mu e_m = 0, \\
&T_* \hat{\lambda}^{*(3)} - \mu e_m = 0, \\
&P_* \hat{\lambda}^{*(4)} - \mu e_n = 0, \\
&\Xi_* \hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{aligned} \tag{C.8}
$$
This shows that $(z^*, \hat{\lambda}^*)$ satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat{\lambda}^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \hat{\lambda}^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
$$
\operatorname{Res}(z^j, \hat{\lambda}^j, \mu_j) \xrightarrow{\mathcal{J}} \operatorname{Res}(z^*, \hat{\lambda}^*, 0) = 0, \qquad \{\hat{\lambda}^j\}_{\mathcal{J}} \to \hat{\lambda}^* \ge 0. \tag{C.9}
$$
Consequently, $(z^*, \hat{\lambda}^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat{\lambda}^j\|_\infty, 1\}$ and $\bar{\lambda}^j = \xi_j^{-1} \hat{\lambda}^j$. Obviously, $\{\bar{\lambda}^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\bar{\lambda}^j\}_{\mathcal{J}'} \to \bar{\lambda} \ne 0$, with $\|\bar{\lambda}^j\|_\infty = 1$ for large $j \in \mathcal{J}'$. From (30) we know that $\bar{\lambda} \ge 0$. Dividing both sides of the first equality of (18) with $(z, \lambda) = (z^k, \lambda^k)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
$$
\begin{aligned}
&\bar{\lambda}^{(1)} - \bar{\lambda}^{(2)} - X^T Y^T \bar{\lambda}^{(4)} = 0, \\
&\bar{\lambda}^{(4)} + \bar{\lambda}^{(5)} = 0, \\
&2 T_* \bar{\lambda}^{(1)} + 2 T_* \bar{\lambda}^{(2)} + \bar{\lambda}^{(3)} = 0, \\
&y^T \bar{\lambda}^{(4)} = 0, \\
&\left(T_*^2 - W_*\right) \bar{\lambda}^{(1)} = 0, \\
&\left(T_*^2 + W_*\right) \bar{\lambda}^{(2)} = 0, \\
&T_* \bar{\lambda}^{(3)} = 0, \\
&P_* \bar{\lambda}^{(4)} = 0, \\
&\Xi_* \bar{\lambda}^{(5)} = 0.
\end{aligned} \tag{C.10}
$$
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz-John point of problem (10).
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412-420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764-771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823-1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357-1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668-674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82-90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145-153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185-202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49-56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081-1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707-710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293-I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159-1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417-2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253-2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380-6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100-107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$l_p$-$l_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307-1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937-945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285-299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928-961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $L_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1-3, 1966.
[35] F. John, "Extremum problems with inequalities as side conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187-204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
Table 1 the 1198711-SVM selects more than 6 features in all cases
which implies that some redundant or irrelevant featuresare selected The average cardinalities of 119871
12-SVM and 119871
0-
SVM are similar and close to 2 However when 119899 = 10
and 20 the 1198710-SVM has the average cardinalities of 142
and 187 respectively It means that the 1198710-SVM sometimes
selects only one feature in low sample data set and maybeignores some really relevant feature Consequently with thecardinalities between 205 and 29 119871
12-SVM has the more
reliable solution than 1198710-SVM In short as far as the number
of selected features is concerned the 11987112
-SVM behavesbetter than the other three methods
Figure 2 (right) plots the trend of the prediction accuracyversus the size of the training sample The classificationperformance of all methods is generally improved with theincreasing of the training sample size 119899 119871
1-SVM has the best
prediction performance in all cases and a slightly better than11987112
-SVM 11987112
-SVM shows more accuracy in classificationthan 119871
2-SVM and 119871
0-SVM especially in the case of 119899 =
10 50 As shown in Table 1 when there are only 10 train-ing samples the average accuracy of 119871
12-SVM is 8805
while the results of 1198712-SVM and 119871
0-SVM are 8465 and
7765 respectively Compared with 1198712-SVM and 119871
0-SVM
11987112
-SVM has the average accuracy increased by 34 and104 respectively as can be explained in what follows Tothe 1198712-SVM all features are selected without discrimination
and the prediction would bemisled by the irrelevant featuresTo the 119871
0-SVM few features are selected and some relevant
features are not included which would put negative impacton the prediction result As the tradeoff between119871
2-SVMand
1198710-SVM 119871
12-SVM has better performance than the two
The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of $L_{1/2}$-SVM is 0.87% lower than that of $L_1$-SVM, while the number of features selected by $L_{1/2}$-SVM is 74.14% smaller than for $L_1$-SVM. This indicates that the $L_{1/2}$-SVM can achieve a much sparser solution than $L_1$-SVM at little cost in accuracy. Moreover, the average accuracy of $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of $L_0$-SVM at a similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.
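These percentages follow directly from the averages in the bottom row of Table 1, for example:
$$\frac{9.32 - 2.41}{9.32} \approx 74.14\%, \qquad 97.65\% - 96.78\% = 0.87\%, \qquad 96.78\% - 94.56\% = 2.22\%.$$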
To further evaluate the feature selection performance of $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_2$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should have these two features ranked on top according to the absolute values of their weights $|w_j|$. In the experiment, we select the top 2 features with the maximal $|w_j|$ for each method and count how often the top 2 features are $x_3$ and $x_6$ in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM are 7, 3, and 9, respectively. When $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of $L_1$-SVM and $L_0$-SVM are 22 and 25, respectively, while the result of $L_{1/2}$-SVM is 27. The $L_1$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent; therefore, $L_1$-SVM is not as good as $L_{1/2}$-SVM at distinguishing the critical features. The $L_0$-SVM has a lower hit frequency than $L_{1/2}$-SVM, which is probably due to the excessively small feature subsets it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.
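The hit counts in Table 2 can be reproduced with a few lines of Python/NumPy (again an illustrative sketch; the zero-based indices 2 and 5 stand for $x_3$ and $x_6$):

```python
import numpy as np

def top2_hit(w, best=(2, 5)):
    """Return True if the two truly relevant features rank first by |w_j|."""
    order = np.argsort(-np.abs(np.asarray(w, dtype=float)))
    return set(order[:2].tolist()) == set(best)

# Frequency over 30 runs: sum(top2_hit(w) for w in trained_weight_vectors)
```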
In the second simulation, we consider the cases with various dimensions of the feature space, $m = 20, 40, \dots, 180, 200$, and the fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by $L_1$-SVM increases from 8.26 to 23.1. However, the cardinalities of $L_{1/2}$-SVM and $L_0$-SVM stay stable (from 2.2 to 2.95). This indicates that the $L_{1/2}$-SVM and $L_0$-SVM are more suitable for feature selection than $L_1$-SVM.

Figure 3 (right) shows that, with the increase of the noise features, the accuracy of $L_2$-SVM drops significantly (from 98.68% to 87.47%). On the contrary, for the other three sparse
Figure 2: Results comparison on 10 artificial data sets with various training samples (left: cardinality versus number of training samples; right: accuracy (%) versus number of training samples; curves: $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, $L_{1/2}$-SVM).
Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

          L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
n         Card    Acc     Card    Acc     Card    Acc     Card    Acc
10        29.97   84.65    6.93   88.55    1.43   77.65    2.25   88.05
20        30.00   92.21    8.83   96.79    1.87   93.17    2.35   95.50
30        30.00   94.27    8.47   98.23    2.03   94.67    2.05   97.41
40        29.97   95.99    9.00   98.75    2.07   95.39    2.10   97.75
50        30.00   96.61    9.50   98.91    2.13   96.57    2.45   97.68
60        29.97   97.25   10.00   98.94    2.20   97.61    2.45   97.98
70        30.00   97.41   11.00   98.98    2.20   97.56    2.40   98.23
80        30.00   97.68   10.03   99.09    2.23   97.61    2.60   98.50
90        29.97   97.89    9.20   99.21    2.30   97.41    2.90   98.30
100       30.00   98.13   10.27   99.09    2.50   97.93    2.55   98.36
Average   29.99   95.21    9.32   97.65    2.10   94.56    2.41   96.78

"n" is the number of training samples, "Card" represents the number of selected features, and "Acc" is the classification accuracy (%).
Table 2: Frequency of the most important features ($x_3$, $x_6$) ranked on top 2 in 30 runs.

n           10   20   30   40   50   60   70   80   90   100
L1-SVM       7   14   16   18   22   23   20   22   21    22
L0-SVM       3   11   12   15   19   22   22   23   22    25
L1/2-SVM     9   16   19   24   24   23   27   26   26    27
Table 3: Average results over 10 artificial data sets with varying dimensions.

              L2-SVM   L1-SVM   L0-SVM   L1/2-SVM
Cardinality   109.95    15.85     2.38      2.41
Accuracy       92.93    99.07    97.76     98.26
Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No.  UCI data set     n      m      L2-SVM         L1-SVM         L0-SVM         L1/2-SVM
                                    Mean    Std    Mean    Std    Mean    Std    Mean    Std
1    Pima diabetes    768    8       8.00   0.00    7.40   0.89    6.20   1.10    7.60   1.52
2    Breast cancer    683    9       9.00   0.00    8.60   1.22    6.40   2.07    8.40   1.67
3    Wine (3)         178    13     13.00   0.00    9.73   0.69    3.73   0.28    4.07   0.49
4    Image (7)        2100   19     19.00   0.00    8.20   0.53    3.66   0.97    5.11   0.62
5    SPECT            267    22     22.00   0.00   17.40   5.27    9.80   4.27    9.00   1.17
6    WDBC             569    30     30.00   0.00    9.80   1.52    9.00   3.03    5.60   2.05
7    Ionosphere       351    34     33.20   0.45   25.00   3.35   24.60  10.21   21.80   8.44
8    SPECTF           267    44     44.00   0.00   38.80  10.98   24.00  11.32   25.60   7.75
9    Sonar            208    60     60.00   0.00   31.40  13.05   27.00   7.18   13.60  10.06
10   Musk1            476    166   166.00   0.00   85.80  18.85   48.60   4.28   41.00  14.16
Average cardinality          40.50  40.42          24.21          16.30          14.18
Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No.  UCI data set     L2-SVM         L1-SVM         L0-SVM         L1/2-SVM
                      Mean    Std    Mean    Std    Mean    Std    Mean    Std
1    Pima diabetes    76.99   3.47   76.73   0.85   76.73   2.94   77.12   1.52
2    Breast cancer    96.18   2.23   97.06   0.96   96.32   1.38   96.47   1.59
3    Wine (3)         97.71   3.73   98.29   1.28   93.14   2.39   97.14   2.56
4    Image (7)        88.94   1.49   91.27   1.77   86.67   4.58   91.36   1.94
5    SPECT            83.40   2.15   78.49   2.31   78.11   3.38   82.34   5.10
6    WDBC             96.11   2.11   96.46   1.31   95.58   1.15   96.46   1.31
7    Ionosphere       83.43   5.44   88.00   3.70   84.57   5.38   87.43   5.94
8    SPECTF           78.49   3.91   74.34   4.50   73.87   3.25   78.49   1.89
9    Sonar            73.66   5.29   75.61   5.82   74.15   7.98   76.10   4.35
10   Musk1            84.84   1.73   82.11   2.42   76.00   5.64   82.32   3.38
Average accuracy      85.97          85.84          83.51          86.52
SVMs, there is little change in the accuracy. This reveals that SVMs can benefit from the feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of $L_{1/2}$-SVM is much sparser than that of $L_1$-SVM, with slightly better accuracy than $L_0$-SVM.
4.2. UCI Data Sets

We further tested the reformulation and the proposed interior point method for $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then the data was randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
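The preprocessing and the one-against-rest construction can be sketched as follows (illustrative Python/NumPy; `fit_binary` stands for any of the four SVM trainers and is our placeholder, not part of the paper):

```python
import numpy as np

def preprocess_split(X, y, rng, train_frac=0.8):
    """Drop rows with missing values, standardize features, split 80/20."""
    keep = ~np.isnan(X).any(axis=1)
    X, y = X[keep], y[keep]
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                       # guard constant features
    X = (X - X.mean(axis=0)) / std
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    return X[idx[:n_tr]], y[idx[:n_tr]], X[idx[n_tr:]], y[idx[n_tr:]]

def one_vs_rest_predict(X_tr, y_tr, X_te, fit_binary):
    """Train one binary classifier per class; predict by max decision value."""
    classes = np.unique(y_tr)
    scores = []
    for c in classes:
        w, b = fit_binary(X_tr, np.where(y_tr == c, 1.0, -1.0))
        scores.append(X_te @ w + b)             # decision values w^T x + b
    return classes[np.argmax(np.stack(scores), axis=0)]
```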
Tables 4 and 5 summarize the feature selection and the classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the numbers of classes are marked behind their names. Sparsity is defined as card/$m$, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are marked in bold in the original tables.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while maintaining roughly the same accuracy as $L_2$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and the classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the $L_{1/2}$-SVM has the best performance in both feature selection and classification among the three sparse SVMs. Compared with $L_1$-SVM, $L_{1/2}$-SVM achieves sparser solutions in nine data sets. For example, in the data set "8 SPECTF", the features selected by $L_{1/2}$-SVM are 34% fewer than those of $L_1$-SVM, while at the same time the accuracy of $L_{1/2}$-SVM is 4.1% higher than that of $L_1$-SVM.
Figure 3: Results comparison on artificial data sets with various dimensions (left: cardinality versus dimension; right: accuracy (%) versus dimension; curves: $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, $L_{1/2}$-SVM).
In most data sets (4, 5, 6, 9, 10), the cardinality of $L_{1/2}$-SVM drops significantly (by at least 37%), with an equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by $L_{1/2}$-SVM is a 58.2% improvement over $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower dimensional representation than $L_1$-SVM with competitive prediction performance.
Figure 4 (right) shows that, compared with $L_0$-SVM, $L_{1/2}$-SVM improves the classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of $L_{1/2}$-SVM is at least 4.0% higher than that of $L_0$-SVM. Especially in the data set "10 Musk1", $L_{1/2}$-SVM gives a 6.3% rise in accuracy over $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that $L_{1/2}$-SVM selects fewer features than $L_0$-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of $L_{1/2}$-SVM are 37.8% and 49.6% less than those of $L_0$-SVM, respectively. In summary, $L_{1/2}$-SVM presents better classification performance than $L_0$-SVM while being effective in choosing relevant features.
5. Conclusions

In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure for which it is relatively easy to develop numerical methods. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The $L_{1/2}$-SVM obtains sparser solutions than $L_1$-SVM with comparable classification accuracy; furthermore, it achieves more accurate classification results than $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, some interesting topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness. Some of them are under our current investigation.
Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and of all linearized feasible directions of $D$ at $z$. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point $z$ of (10), one has
$$FD(z, D) = LFD(z, D). \tag{A.1}$$
Figure 4: Results comparison on UCI data sets (left: sparsity (%) versus data set number; right: accuracy (%) versus data set number; curves: $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, $L_{1/2}$-SVM).
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q) ={}& C\sum_{i=1}^{n}\xi_i + \sum_{j=1}^{m}t_j - \sum_{j=1}^{m}\lambda_j\left(t_j^2 - w_j\right) - \sum_{j=1}^{m}\mu_j\left(t_j^2 + w_j\right) \\
&- \sum_{j=1}^{m}\nu_j t_j - \sum_{i=1}^{n}p_i\left(y_i w^T x_i + y_i b + \xi_i - 1\right) - \sum_{i=1}^{n}q_i\xi_i,
\end{aligned} \tag{A.2}$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$\begin{gathered}
\frac{\partial L}{\partial w} = \lambda_j - \mu_j - \sum_{i=1}^{n}p_i y_i x_{ij} = 0, \quad j = 1, 2, \dots, m, \\
\frac{\partial L}{\partial t} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \dots, m, \\
\frac{\partial L}{\partial \xi} = C - p_i - q_i = 0, \quad i = 1, 2, \dots, n, \\
\frac{\partial L}{\partial b} = \sum_{i=1}^{n}p_i y_i = 0, \\
\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j\left(t_j^2 - w_j\right) = 0, \quad j = 1, 2, \dots, m, \\
\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j\left(t_j^2 + w_j\right) = 0, \quad j = 1, 2, \dots, m, \\
\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \dots, m, \\
p_i \ge 0, \quad y_i\left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i\left(y_i\left(w^T x_i + b\right) + \xi_i - 1\right) = 0, \quad i = 1, 2, \dots, n, \\
q_i \ge 0, \quad \xi_i \ge 0, \quad q_i\xi_i = 0, \quad i = 1, 2, \dots, n.
\end{gathered} \tag{A.3}$$
The following theorem shows that the level sets are bounded.

Theorem A.3. For any given constant $c > 0$, the level set
$$W_c = \left\{z = (w, t, \xi, b) \mid f(z) \le c\right\} \cap D \tag{A.4}$$
is bounded.
Proof. For any $(w, t, \xi, b) \in W_c$, we have
$$\sum_{i=1}^{m}t_i + C\sum_{j=1}^{n}\xi_j \le c. \tag{A.5}$$
Combining this with $t_i \ge 0$, $i = 1, \dots, m$, and $\xi_j \ge 0$, $j = 1, \dots, n$, we have
$$0 \le t_i \le c, \quad i = 1, \dots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \dots, n. \tag{A.6}$$
Moreover, for any $(w, t, \xi, b) \in D$,
$$\left|w_i\right| \le t_i^2, \quad i = 1, \dots, m. \tag{A.7}$$
Consequently, $w$, $t$, $\xi$ are bounded. What is more, from the condition
$$y_i\left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{A.8}$$
we have
$$w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1. \tag{A.9}$$
Thus, if the feasible region $D$ is not empty, then we have
$$\max_{y_i = 1}\left(1 - \xi_i - w^T x_i\right) \le b \le \min_{y_i = -1}\left(-1 + \xi_i - w^T x_i\right). \tag{A.10}$$
Hence $b$ is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite and hence ensures that Algorithm 2 is well defined.

Proposition B.1. Let the inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,
$$\begin{aligned}
&\left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
= \sum_{k=1}^{4}\sum_{l=1}^{4} v^{(k)T} S_{kl} v^{(l)} \\
&\quad= v^{(1)T}\left(U + V + X^T Y^T P^{-1} D_4 Y X\right)v^{(1)} + v^{(1)T} X^T Y^T P^{-1} D_4 v^{(2)} - 2v^{(1)T}\left(U - V\right)^T v^{(3)} \\
&\qquad+ v^{(1)T} X^T Y^T P^{-1} D_4 y\, v^{(4)} + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T}\left(P^{-1} D_4 + \Xi^{-1} D_5\right)v^{(2)} \\
&\qquad+ v^{(2)T} P^{-1} D_4 y\, v^{(4)} - 2v^{(3)T}\left(U - V\right)^T v^{(1)} + v^{(3)T}\left(4T\left(U + V\right)T + T^{-1} D_3 - 2\left(D_1 + D_2\right)\right)v^{(3)} \\
&\qquad+ v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)T} y^T P^{-1} D_4 v^{(2)} + v^{(4)T} y^T P^{-1} D_4 y\, v^{(4)} \\
&\quad= e_m^T U\left(D_{v_1} - 2TD_{v_3}\right)^2 e_m + e_m^T V\left(D_{v_1} + 2TD_{v_3}\right)^2 e_m + v^{(2)T}\Xi^{-1} D_5 v^{(2)} \\
&\qquad+ v^{(3)T}\left(T^{-1} D_3 - 2D_1 - 2D_2\right)v^{(3)} + \left(YXv^{(1)} + v^{(2)} + v^{(4)}y\right)^T P^{-1} D_4 \left(YXv^{(1)} + v^{(2)} + v^{(4)}y\right) \\
&\quad> 0,
\end{aligned} \tag{B.1}$$
where $D_{v_1} = \operatorname{diag}(v^{(1)})$, $D_{v_2} = \operatorname{diag}(v^{(2)})$, $D_{v_3} = \operatorname{diag}(v^{(3)})$, and $D_{v_4} = \operatorname{diag}(v^{(4)})$.
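Numerically, the conclusion of Proposition B.1 can be sanity-checked by attempting a Cholesky factorization of an assembled $S$. This is a generic test we add purely for illustration; it is not part of the proof:

```python
import numpy as np

def is_positive_definite(S):
    """A symmetric matrix is positive definite iff Cholesky succeeds."""
    try:
        np.linalg.cholesky((S + S.T) / 2.0)   # symmetrize against round-off
        return True
    except np.linalg.LinAlgError:
        return False
```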
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
$$\nabla_w\Phi_\mu\left(z^k\right)^T\Delta w^k + \nabla_t\Phi_\mu\left(z^k\right)^T\Delta t^k + \nabla_\xi\Phi_\mu\left(z^k\right)^T\Delta\xi^k + \nabla_b\Phi_\mu\left(z^k\right)^T\Delta b^k = -\left(\Delta w^{kT}, \Delta\xi^{kT}, \Delta t^{kT}, \Delta b^k\right) S \begin{pmatrix} \Delta w^k \\ \Delta\xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \tag{B.2}$$
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
$$\nabla_w\Phi_\mu\left(z^k\right)^T\Delta w^k + \nabla_t\Phi_\mu\left(z^k\right)^T\Delta t^k + \nabla_\xi\Phi_\mu\left(z^k\right)^T\Delta\xi^k + \nabla_b\Phi_\mu\left(z^k\right)^T\Delta b^k < 0. \tag{B.3}$$
Consequently, there exists an $\bar\alpha_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar\alpha_k]$.

On the other hand, since $(x^k, t^k)$ is strictly feasible, the point $(x^k, t^k) + \alpha(\Delta x^k, \Delta t^k)$ is feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
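For reference, the backtracking rule of (23) can be sketched as follows (schematic Python; `phi` plays the role of $\Phi_\mu$, `feasible` checks the strict inequalities $(t^{k+1})^2 \pm w^{k+1} > 0$ and $t^{k+1} > 0$, and the parameter names are ours, with $\tau_1 \in (0, 1/2)$):

```python
def armijo_step(phi, feasible, z, dz, grad_dot_dz,
                beta=0.5, tau1=0.25, max_back=50):
    """Find alpha = beta^j satisfying strict feasibility and the
    sufficient-decrease condition
        Phi(z + alpha*dz) - Phi(z) <= tau1 * alpha * grad(Phi)^T dz.
    """
    phi0 = phi(z)
    alpha = 1.0
    for _ in range(max_back):
        z_new = z + alpha * dz
        if feasible(z_new) and phi(z_new) - phi0 <= tau1 * alpha * grad_dot_dz:
            return alpha, z_new
        alpha *= beta
    raise RuntimeError("backtracking line search failed")
```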
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then the $z^k$ are strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that the $z^k$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \dots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$, and since $f(z) = \sum_{j=1}^{m}t_j + C\sum_{i=1}^{n}\xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing. So we get a contradiction. Consequently, for any $i \in \{1, 2, \dots, 3m + 2n\}$, $g_i(z^k)$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)–(27).
Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat\lambda^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Thus we have
$$S^k \longrightarrow S^* \tag{C.1}$$
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat\lambda^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat\lambda^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat\lambda^*$, as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$, such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat\lambda^k\}_{\bar{\mathcal{K}}} \to \hat\lambda^*$. By Lemma C.1, we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$\nabla_w\Phi_\mu\left(z^*\right)^T\Delta w^* + \nabla_t\Phi_\mu\left(z^*\right)^T\Delta t^* + \nabla_\xi\Phi_\mu\left(z^*\right)^T\Delta\xi^* + \nabla_b\Phi_\mu\left(z^*\right)^T\Delta b^* = -\left(\Delta w^{*T}, \Delta\xi^{*T}, \Delta t^{*T}, \Delta b^*\right) S \begin{pmatrix} \Delta w^* \\ \Delta\xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0. \tag{C.2}$$
Since $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$, there exists an $\hat\alpha \in (0, 1]$ such that, for all $\alpha \in (0, \hat\alpha]$,
$$g_i\left(z^* + \alpha\Delta z^*\right) > 0, \quad i \in \{1, 2, \dots, 3m + 2n\}. \tag{C.3}$$
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\bar\alpha \in (0, \hat\alpha]$ such that the following inequality holds for all $\alpha \in (0, \bar\alpha]$:
$$\Phi_\mu\left(z^* + \alpha\Delta z^*\right) - \Phi_\mu\left(z^*\right) \le 1.1\,\tau_1\alpha\nabla_z\Phi_\mu\left(z^*\right)^T\Delta z^*. \tag{C.4}$$
Let $m^* = \min\{j \mid \beta^j \in (0, \bar\alpha],\ j = 0, 1, \dots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, 2, \dots, 3m + 2n\}$:
$$g_i\left(z^k + \alpha^*\Delta z^k\right) > 0. \tag{C.5}$$
Moreover, we have
$$\begin{aligned}
\Phi_\mu\left(z^k + \alpha^*\Delta z^k\right) - \Phi_\mu\left(z^k\right) ={}& \Phi_\mu\left(z^* + \alpha^*\Delta z^*\right) - \Phi_\mu\left(z^*\right) \\
&+ \Phi_\mu\left(z^k + \alpha^*\Delta z^k\right) - \Phi_\mu\left(z^* + \alpha^*\Delta z^*\right) + \Phi_\mu\left(z^*\right) - \Phi_\mu\left(z^k\right) \\
\le{}& 1.05\,\tau_1\alpha^*\nabla_z\Phi_\mu\left(z^*\right)^T\Delta z^* \\
\le{}& \tau_1\alpha^*\nabla_z\Phi_\mu\left(z^k\right)^T\Delta z^k.
\end{aligned} \tag{C.6}$$
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that
$$\Phi_\mu\left(z^{k+1}\right) \le \Phi_\mu\left(z^k\right) + \tau_1\alpha_k\nabla_z\Phi_\mu\left(z^k\right)^T\Delta z^k \le \Phi_\mu\left(z^k\right) + \tau_1\alpha^*\nabla_z\Phi_\mu\left(z^k\right)^T\Delta z^k \le \Phi_\mu\left(z^k\right) + \frac{1}{2}\tau_1\alpha^*\nabla_z\Phi_\mu\left(z^*\right)^T\Delta z^*. \tag{C.7}$$
This shows that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}$ is bounded from below. The proof is complete.
We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat\lambda^*)$ be a limit point of $\{(z^k, \hat\lambda^k)\}$, and let the subsequence $\{(z^k, \hat\lambda^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat\lambda^*)$. Recall that $(\Delta z^k, \hat\lambda^k)$ solves (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C.1 and C.3 we obtain
$$\begin{gathered}
\hat\lambda^{*(1)} - \hat\lambda^{*(2)} - X^T Y^T \hat\lambda^{*(4)} = 0, \\
C e_n - \hat\lambda^{*(4)} - \hat\lambda^{*(5)} = 0, \\
e_m - 2T^*\hat\lambda^{*(1)} - 2T^*\hat\lambda^{*(2)} - \hat\lambda^{*(3)} = 0, \\
y^T\hat\lambda^{*(4)} = 0, \\
\left(T^{*2} - W^*\right)\hat\lambda^{*(1)} - \mu e_m = 0, \\
\left(T^{*2} + W^*\right)\hat\lambda^{*(2)} - \mu e_m = 0, \\
T^*\hat\lambda^{*(3)} - \mu e_m = 0, \\
P^*\hat\lambda^{*(4)} - \mu e_n = 0, \\
\Xi^*\hat\lambda^{*(5)} - \mu e_n = 0.
\end{gathered} \tag{C.8}$$
This shows that $(z^*, \hat\lambda^*)$ satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat\lambda^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \hat\lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
$$\operatorname{Res}\left(z^j, \hat\lambda^j, \mu_j\right) \overset{\mathcal{J}}{\longrightarrow} \operatorname{Res}\left(z^*, \hat\lambda^*, 0\right) = 0, \qquad \{\hat\lambda^j\}_{\mathcal{J}} \longrightarrow \hat\lambda^* \ge 0. \tag{C.9}$$
Consequently, $(z^*, \hat\lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat\lambda^j\|_\infty, 1\}$ and $\tilde\lambda^j = \xi_j^{-1}\hat\lambda^j$. Obviously, $\{\tilde\lambda^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\tilde\lambda^j\}_{\mathcal{J}'} \to \tilde\lambda \ne 0$ and $\|\tilde\lambda^j\|_\infty = 1$ for large $j \in \mathcal{J}'$. From (30) we know that $\tilde\lambda \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
$$\begin{gathered}
\tilde\lambda^{(1)} - \tilde\lambda^{(2)} - X^T Y^T\tilde\lambda^{(4)} = 0, \\
\tilde\lambda^{(4)} + \tilde\lambda^{(5)} = 0, \\
2T^*\tilde\lambda^{(1)} + 2T^*\tilde\lambda^{(2)} + \tilde\lambda^{(3)} = 0, \\
y^T\tilde\lambda^{(4)} = 0, \\
\left(T^{*2} - W^*\right)\tilde\lambda^{(1)} = 0, \\
\left(T^{*2} + W^*\right)\tilde\lambda^{(2)} = 0, \\
T^*\tilde\lambda^{(3)} = 0, \\
P^*\tilde\lambda^{(4)} = 0, \\
\Xi^*\tilde\lambda^{(5)} = 0.
\end{gathered} \tag{C.10}$$
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz John point [35] of problem (10).
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$\ell_p$-$\ell_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $L_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
4 [14] were set tozeroThen the cardinality of the hyperplane was computed asthe number of the nonzero weights
41 Artificial Data First we took an artificial binary linearclassification problem as an example The problem is similarto that in [13] The probability of 119910 = 1 or minus1 is equal Thefirst 6 features are relevant but redundant In 70 samplesthe first three features 119909
1 1199092 1199093were drawn as 119909
119894= 119910119873(119894 1)
and the second three features 1199094 1199095 1199096 as 119909
119894= 119873(0 1)
Otherwise the first three were drawn as 119909119894= 119873(0 1) and the
second three as 119909119894= 119910119873(119894 minus 3 1) The rest features are noise
119909119894= 119873(0 20) 119894 = 7 119898 Here119898 is the dimension of input
features The inputs were scaled to mean zero and standarddeviation In each trial 500 points were generated for testingand the average results were estimated over 30 trials
In the first experiment we consider the cases with thefixed feature size 119898 = 30 and different training sample sizes119899 = 10 20 100 The average results over the 30 trialsare shown in Table 1 and Figure 2 Figure 2 (left) plots theaverage cardinality of each classifier Since the artificial datasets have 2 relevant and nonredundant features the idealaverage cardinality is 2 Figure 2 (left) shows that the threesparse SVMs 119871
1-SVM 119871
12-SVM and 119871
0-SVM can achieve
sparse solution while the 1198712-SVM almost uses full features
in each data set Furthermore the solutions of 11987112
-SVMand 119871
0-SVM are much sparser than 119871
1-SVM As shown in
Table 1 the 1198711-SVM selects more than 6 features in all cases
which implies that some redundant or irrelevant featuresare selected The average cardinalities of 119871
12-SVM and 119871
0-
SVM are similar and close to 2 However when 119899 = 10
and 20 the 1198710-SVM has the average cardinalities of 142
and 187 respectively It means that the 1198710-SVM sometimes
selects only one feature in low sample data set and maybeignores some really relevant feature Consequently with thecardinalities between 205 and 29 119871
12-SVM has the more
reliable solution than 1198710-SVM In short as far as the number
of selected features is concerned the 11987112
-SVM behavesbetter than the other three methods
Figure 2 (right) plots the trend of the prediction accuracyversus the size of the training sample The classificationperformance of all methods is generally improved with theincreasing of the training sample size 119899 119871
1-SVM has the best
prediction performance in all cases and a slightly better than11987112
-SVM 11987112
-SVM shows more accuracy in classificationthan 119871
2-SVM and 119871
0-SVM especially in the case of 119899 =
10 50 As shown in Table 1 when there are only 10 train-ing samples the average accuracy of 119871
12-SVM is 8805
while the results of 1198712-SVM and 119871
0-SVM are 8465 and
7765 respectively Compared with 1198712-SVM and 119871
0-SVM
11987112
-SVM has the average accuracy increased by 34 and104 respectively as can be explained in what follows Tothe 1198712-SVM all features are selected without discrimination
and the prediction would bemisled by the irrelevant featuresTo the 119871
0-SVM few features are selected and some relevant
features are not included which would put negative impacton the prediction result As the tradeoff between119871
2-SVMand
1198710-SVM 119871
12-SVM has better performance than the two
The average results over ten artificial data sets in the firstexperiment are shown in the bottom of Table 1 On averagethe accuracy of 119871
12-SVM is 087 lower than the 119871
1-SVM
while the features selected by 11987112
-SVM are 7414 less than1198711-SVM It indicates that the 119871
12-SVM can achieve much
sparser solution than 1198711-SVM with little cost of accuracy
Moreover the average accuracy of 11987112
-SVMover 10 data setsis 222 higher than 119871
0-SVM with the similar cardinality
To sum up the 11987112
-SVM provides the best balance betweenaccuracy and sparsity among the three sparse SVMs
To further evaluate the feature selection performance of11987112
-SVM we investigate whether the features are correctlyselected For the119871
2-SVM is not designed for feature selection
it is not included in this comparison Since our artificial datasets have 2 best features (119909
3 1199096) the best result should have
the two features (1199093 1199096) ranking on the top according to their
absolute values of weights |119908119895| In the experiment we select
the top 2 features with the maximal |119908119895| for each method
and calculate the frequency that the top 2 features are 1199093
and 1199096in 30 runs The results are listed in Table 2 When the
training sample size is too small it is difficult to discriminatethe two most important features for all sparse SVMs Forexample when 119899 = 10 the selected frequencies of 119871
1-SVM
1198710-SVM and 119871
12-SVM are 7 3 and 9 respectively When 119899
increases all methods tend to make more correct selectionMoreover Table 2 shows that the 119871
12-SVM outperforms the
other two methods in all cases For example when 119899 = 100the selected frequencies of 119871
1-SVM and119871
0-SVM are 22 and
25 respectively and the result of 11987112
-SVM is 27 The 1198711-
SVMselects toomany redundant or irrelevant features whichmay influence the ranking in some extent Therefore 119871
1-
SVM is not so good as 11987112
-SVM at distinguishing the criticalfeatures The 119871
0-SVM has the lower hit frequency than 119871
12-
SVM which is probably due to the excessive small featuresubset it obtained Above all Tables 1 and 2 and Figure 2clearly show that the 119871
12-SVM is a promising sparsity driven
classification methodIn the second simulation we consider the cases with
various dimensions of feature space 119898 = 20 40 180 200
and the fixed training sample size 119899 = 100 The averageresults over 30 trials are shown in Figure 3 and Table 3 Sincethere are only 6 relevant features yet the larger 119898 meansthe more noisy features Figure 3 (left) shows that as thedimension increases from 20 to 200 the number of featuresselected by 119871
1-SVM increases from 826 to 231 However the
cardinalities of 11987112
-SVM and 1198710-SVM keep stable (from 22
to 295) It indicates that the 11987112
-SVM and 1198710-SVM aremore
suitable for feature selection than 1198711-SVM
Figure 3 (right) shows thatwith the increasing of the noisefeatures the accuracy of 119871
2-SVM drops significantly (from
9868 to 8747) On the contrary to the other three sparse
Journal of Applied Mathematics 9
0 20 40 60 80 1000
5
10
15
20
25
30
Number of training samples
Card
inal
ity
0 20 40 60 80 10075
80
85
90
95
100
Number of training samples
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 2 Results comparison on 10 artificial data sets with various training samples
Table 1: Results comparison on 10 artificial data sets with various training sample sizes. Here "n" is the number of training samples, "Card" is the number of selected features, and "Acc" is the classification accuracy (%).

         L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
n        Card    Acc     Card    Acc     Card   Acc      Card   Acc
10       29.97   84.65    6.93   88.55   1.43   77.65    2.25   88.05
20       30.00   92.21    8.83   96.79   1.87   93.17    2.35   95.50
30       30.00   94.27    8.47   98.23   2.03   94.67    2.05   97.41
40       29.97   95.99    9.00   98.75   2.07   95.39    2.10   97.75
50       30.00   96.61    9.50   98.91   2.13   96.57    2.45   97.68
60       29.97   97.25   10.00   98.94   2.20   97.61    2.45   97.98
70       30.00   97.41   11.00   98.98   2.20   97.56    2.40   98.23
80       30.00   97.68   10.03   99.09   2.23   97.61    2.60   98.50
90       29.97   97.89    9.20   99.21   2.30   97.41    2.90   98.30
100      30.00   98.13   10.27   99.09   2.50   97.93    2.55   98.36
Average  29.99   95.21    9.32   97.65   2.10   94.56    2.41   96.78
Table 2: Frequency with which the most important features ($x_3$, $x_6$) are ranked in the top 2 over 30 runs.

n          10   20   30   40   50   60   70   80   90   100
L1-SVM      7   14   16   18   22   23   20   22   21    22
L0-SVM      3   11   12   15   19   22   22   23   22    25
L1/2-SVM    9   16   19   24   24   23   27   26   26    27
Table 3: Average results over the 10 artificial data sets with varying dimensions.

              L2-SVM   L1-SVM   L0-SVM   L1/2-SVM
Cardinality   109.95    15.85     2.38      2.41
Accuracy (%)   92.93    99.07    97.76     98.26
Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No.  Data set        n     m     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                                 Mean     Std    Mean     Std    Mean     Std    Mean     Std
1    Pima diabetes   768   8      8.00    0.00    7.40    0.89    6.20    1.10    7.60    1.52
2    Breast cancer   683   9      9.00    0.00    8.60    1.22    6.40    2.07    8.40    1.67
3    Wine (3)        178   13    13.00    0.00    9.73    0.69    3.73    0.28    4.07    0.49
4    Image (7)       2100  19    19.00    0.00    8.20    0.53    3.66    0.97    5.11    0.62
5    SPECT           267   22    22.00    0.00   17.40    5.27    9.80    4.27    9.00    1.17
6    WDBC            569   30    30.00    0.00    9.80    1.52    9.00    3.03    5.60    2.05
7    Ionosphere      351   34    33.20    0.45   25.00    3.35   24.60   10.21   21.80    8.44
8    SPECTF          267   44    44.00    0.00   38.80   10.98   24.00   11.32   25.60    7.75
9    Sonar           208   60    60.00    0.00   31.40   13.05   27.00    7.18   13.60   10.06
10   Musk1           476   166  166.00    0.00   85.80   18.85   48.60    4.28   41.00   14.16
Average cardinality        40.50 40.42           24.21           16.30           14.18
Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No.  Data set        L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                     Mean     Std    Mean     Std    Mean     Std    Mean     Std
1    Pima diabetes   76.99    3.47   76.73    0.85   76.73    2.94   77.12    1.52
2    Breast cancer   96.18    2.23   97.06    0.96   96.32    1.38   96.47    1.59
3    Wine (3)        97.71    3.73   98.29    1.28   93.14    2.39   97.14    2.56
4    Image (7)       88.94    1.49   91.27    1.77   86.67    4.58   91.36    1.94
5    SPECT           83.40    2.15   78.49    2.31   78.11    3.38   82.34    5.10
6    WDBC            96.11    2.11   96.46    1.31   95.58    1.15   96.46    1.31
7    Ionosphere      83.43    5.44   88.00    3.70   84.57    5.38   87.43    5.94
8    SPECTF          78.49    3.91   74.34    4.50   73.87    3.25   78.49    1.89
9    Sonar           73.66    5.29   75.61    5.82   74.15    7.98   76.10    4.35
10   Musk1           84.84    1.73   82.11    2.42   76.00    5.64   82.32    3.38
Average accuracy     85.97           85.84           83.51           86.52
In contrast, the other three sparse SVMs show little change in accuracy, which reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of $L_{1/2}$-SVM is much sparser than that of $L_1$-SVM, with slightly better accuracy than $L_0$-SVM.
4.2 UCI Data Sets. We further tested the reformulation and the proposed interior point method for $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and instances with missing values were deleted. The data was then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times; the average results are shown in Tables 4 and 5 and Figure 4.
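The evaluation protocol just described can be summarized in code. The following sketch is ours, not the paper's implementation: `fit_svm` is a hypothetical stand-in for any of the four SVM trainers, and we standardize using training-split statistics:

```python
import numpy as np

def evaluate(X, y, fit_svm, runs=10, train_frac=0.8, seed=0):
    """Repeat the protocol: random 80/20 split, standardize features,
    train, and score test accuracy. `fit_svm(Xtr, ytr) -> (w, b)` is a
    hypothetical stand-in for any of the four SVM trainers compared."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        idx = rng.permutation(len(y))
        cut = int(train_frac * len(y))
        tr, te = idx[:cut], idx[cut:]
        mu = X[tr].mean(axis=0)
        sd = X[tr].std(axis=0) + 1e-12          # guard constant features
        Xtr, Xte = (X[tr] - mu) / sd, (X[te] - mu) / sd
        w, b = fit_svm(Xtr, y[tr])
        accs.append(np.mean(np.sign(Xte @ w + b) == y[te]))
    return float(np.mean(accs))
```

For the two multiclass problems, this loop would be wrapped in a one-against-rest scheme, training one such binary classifier per class and predicting the class whose classifier gives the largest decision value.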
Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features; for the two multiclass data sets, the number of classes is given in parentheses after the name. Sparsity is defined as card/$m$, and a small sparsity value is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are shown in bold in the original tables.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity on all data sets while retaining accuracy roughly identical to $L_2$-SVM. Among the three sparse methods, $L_{1/2}$-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), whereas $L_1$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and $L_0$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. On three data sets (6, 9, 10), $L_{1/2}$-SVM performs best among the three sparse SVMs in both feature selection and classification. Compared with $L_1$-SVM, $L_{1/2}$-SVM achieves a sparser solution on nine data sets.
[Figure 3: Results comparison on the artificial data sets with various dimensions. Left: cardinality versus dimension; right: accuracy (%) versus dimension; curves for $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.]
For example, on data set 8 (SPECTF), $L_{1/2}$-SVM selects 34% fewer features than $L_1$-SVM, while its accuracy is 4.1% higher than that of $L_1$-SVM. On most data sets (4, 5, 6, 9, 10), the cardinality of $L_{1/2}$-SVM drops significantly (by at least 37%) with equal or slightly better accuracy. Only on data set 3 (Wine) is the accuracy of $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by $L_{1/2}$-SVM is a 58.2% improvement over $L_1$-SVM. On the remaining three data sets (1, 2, 7), the two methods give similar results in feature selection and classification. As seen above, $L_{1/2}$-SVM can provide a lower-dimensional representation than $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with $L_0$-SVM, $L_{1/2}$-SVM improves the classification accuracy on all data sets. For instance, on four data sets (3, 4, 8, 10), the classification accuracy of $L_{1/2}$-SVM is at least 4.0% higher than that of $L_0$-SVM; on data set 10 (Musk1) in particular, $L_{1/2}$-SVM gives a 6.3% rise in accuracy over $L_0$-SVM. Meanwhile, Figure 4 (left) shows that $L_{1/2}$-SVM selects fewer features than $L_0$-SVM on five data sets (5, 6, 7, 9, 10); for example, on data sets 6 and 9, the cardinalities of $L_{1/2}$-SVM are 37.8% and 49.6% lower than those of $L_0$-SVM, respectively. In summary, $L_{1/2}$-SVM presents better classification performance than $L_0$-SVM while remaining effective at choosing relevant features.
5 Conclusions

In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure for which it is relatively easy to develop numerical methods. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method: the $L_{1/2}$-SVM obtains sparser solutions than $L_1$-SVM with comparable classification accuracy, and it achieves more accurate classification results than $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of $L_{1/2}$-SVM, some interesting topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of $L_{1/2}$-SVM to further validate its effectiveness. Some of these are under our current investigation.
Appendix

A Properties of the Reformulation of $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, let $FD(z, D)$ and $LFD(z, D)$ denote the sets of all feasible directions and linearized feasible directions of $D$ at $z$, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A1. For any feasible point $z$ of (10), one has
$$FD(z, D) = LFD(z, D). \quad (A1)$$
[Figure 4: Results comparison on the UCI data sets. Left: sparsity (%) per data set number; right: accuracy (%) per data set number; for $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.]
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$
L(w, t, \xi, b, \lambda, \mu, \nu, p, q) = C \sum_{i=1}^{n} \xi_i + \sum_{j=1}^{m} t_j - \sum_{j=1}^{m} \lambda_j \left(t_j^2 - w_j\right) - \sum_{j=1}^{m} \mu_j \left(t_j^2 + w_j\right) - \sum_{j=1}^{m} \nu_j t_j - \sum_{i=1}^{n} p_i \left(y_i w^T x_i + y_i b + \xi_i - 1\right) - \sum_{i=1}^{n} q_i \xi_i, \quad (A2)
$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A1, we immediately have the following theorem on the first-order necessary condition.
Theorem A2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, \ldots, m,\\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, \ldots, m,\\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, \ldots, n,\\
&\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} p_i y_i = 0,\\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j \left(t_j^2 - w_j\right) = 0, \quad j = 1, \ldots, m,\\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j \left(t_j^2 + w_j\right) = 0, \quad j = 1, \ldots, m,\\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, \ldots, m,\\
&p_i \ge 0, \quad y_i \left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i \left(y_i \left(w^T x_i + b\right) + \xi_i - 1\right) = 0, \quad i = 1, \ldots, n,\\
&q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, \ldots, n.
\end{aligned} \quad (A3)
$$
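For a numerical sanity check of a candidate solution, the violations of (A3) can be measured directly. A minimal sketch (our illustration, not part of the paper's method; it assumes $y \in \{-1, +1\}^n$, $X$ stored row-wise as an $n \times m$ array, and nonnegative multipliers):

```python
import numpy as np

def kkt_residual(w, t, xi, b, lam, mu, nu, p, q, X, y, C):
    """Maximum violation of the stationarity and complementarity
    conditions in (A3) at a candidate point."""
    margins = y * (X @ w + b) + xi - 1.0
    violations = [
        np.abs(lam - mu - X.T @ (p * y)).max(),         # stationarity in w
        np.abs(1.0 - 2.0 * (lam + mu) * t - nu).max(),  # stationarity in t
        np.abs(C - p - q).max(),                        # stationarity in xi
        abs(float(np.dot(p, y))),                       # stationarity in b
        np.abs(lam * (t**2 - w)).max(),                 # complementarity
        np.abs(mu * (t**2 + w)).max(),
        np.abs(nu * t).max(),
        np.abs(p * margins).max(),
        np.abs(q * xi).max(),
    ]
    return max(violations)
```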
The following theorem shows that the level set is bounded.

Theorem A3. For any given constant $c > 0$, the level set
$$W_c = \left\{z = (w, t, \xi, b) \mid f(z) \le c\right\} \cap D \quad (A4)$$
is bounded.

Proof. For any $(w, t, \xi, b) \in W_c$, we have
$$\sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \quad (A5)$$
Combining this with $t_i \ge 0$, $i = 1, \ldots, m$, and $\xi_j \ge 0$, $j = 1, \ldots, n$, we have
$$0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \quad (A6)$$
Moreover, for any $(w, t, \xi, b) \in D$,
$$|w_i| \le t_i^2, \quad i = 1, \ldots, m. \quad (A7)$$
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
$$y_i \left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \quad (A8)$$
we have
$$w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1. \quad (A9)$$
Thus, if the feasible region $D$ is not empty, then
$$\max_{y_i = 1} \left(1 - \xi_i - w^T x_i\right) \le b \le \min_{y_i = -1} \left(-1 + \xi_i - w^T x_i\right). \quad (A10)$$
Hence $b$ is also bounded. The proof is complete.
B Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.
Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,
$$
\begin{aligned}
\left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
&= \sum_{a=1}^{4} \sum_{b=1}^{4} v^{(a)T} S_{ab}\, v^{(b)} \\
&= v^{(1)T} \left(U + V + X^T Y^T P^{-1} D_4 Y X\right) v^{(1)} + v^{(1)T} \left(X^T Y^T P^{-1} D_4\right) v^{(2)} \\
&\quad + v^{(1)T} \left(-2 (U - V) T\right) v^{(3)} + v^{(1)T} \left(X^T Y^T P^{-1} D_4 y\right) v^{(4)} \\
&\quad + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T} \left(P^{-1} D_4 + \Xi^{-1} D_5\right) v^{(2)} + v^{(2)T} P^{-1} D_4 y\, v^{(4)} \\
&\quad - 2 v^{(3)T} (U - V) T v^{(1)} + v^{(3)T} \left(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\right) v^{(3)} \\
&\quad + v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)T} y^T P^{-1} D_4 v^{(2)} + v^{(4)T} y^T P^{-1} D_4 y\, v^{(4)} \\
&= v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)} \\
&\quad + v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\quad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\quad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&= e_m^T U \left(D_{v_1} - 2 T D_{v_3}\right)^2 e_m + e_m^T V \left(D_{v_1} + 2 T D_{v_3}\right)^2 e_m \\
&\quad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\quad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) > 0,
\end{aligned} \quad (B1)
$$
where $D_{v_1} = \operatorname{diag}(v^{(1)})$, $D_{v_2} = \operatorname{diag}(v^{(2)})$, $D_{v_3} = \operatorname{diag}(v^{(3)})$, and $D_{v_4} = \operatorname{diag}(v^{(4)})$.
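Numerically, positive definiteness of an assembled matrix $S$ can be confirmed by attempting a Cholesky factorization, which succeeds exactly for symmetric positive definite matrices. A generic sketch (ours, not tied to the block structure in (21)):

```python
import numpy as np

def is_positive_definite(S, sym_tol=1e-10):
    """Cholesky succeeds iff the symmetric matrix S is positive definite."""
    assert np.abs(S - S.T).max() < sym_tol, "S must be symmetric"
    try:
        np.linalg.cholesky(S)
        return True
    except np.linalg.LinAlgError:
        return False
```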
B.1 Proof of Lemma 3
Proof. It is easy to see that, if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k
= -\left(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\right) S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \quad (B2)
$$
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \quad (B3)
$$
Consequently, there exists an $\hat{\alpha}_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \hat{\alpha}_k]$. On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha\, \Delta z^k$ is feasible for all sufficiently small $\alpha > 0$. The proof is complete.
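The backtracking rule used here is the Armijo line search [34] combined with the strict feasibility requirement in (23). A simplified sketch of this logic (ours; `phi`, `dphi`, and `strictly_feasible` are assumed callables for $\Phi_\mu$, its directional derivative, and the strict inequality checks, and the default constants are illustrative):

```python
def armijo_backtracking(z, dz, phi, dphi, strictly_feasible,
                        tau1=0.25, beta=0.5, max_iter=50):
    """Backtracking Armijo line search: return the largest alpha = beta**j
    that keeps the trial point strictly feasible and achieves the
    sufficient decrease required by (23)."""
    phi0 = phi(z)
    slope = dphi(z, dz)          # negative for a descent direction, cf. (B3)
    alpha = 1.0
    for _ in range(max_iter):
        trial = z + alpha * dz
        if strictly_feasible(trial) and phi(trial) <= phi0 + tau1 * alpha * slope:
            return alpha
        alpha *= beta
    raise RuntimeError("Armijo line search did not terminate")
```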
C Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then $\{z^k\}$ is strictly feasible for problem (10), and the Lagrangian multipliers $\{\hat{\lambda}^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ is strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, \ldots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$, and since $f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for every $i \in \{1, \ldots, 3m + 2n\}$, $g_i(z^k)$ is bounded away from zero. The boundedness of $\{\hat{\lambda}^k\}$ then follows from (24)–(27).
Lemma C2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, \ldots, 3m + 2n\}$. Thus we have
$$S_k \to S_* \quad (C1)$$
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B1, $S_*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat{\lambda}^k\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^*$. By Lemma C1, we have $g_i(z^*) > 0$ for all $i \in \{1, \ldots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$
\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^*
= -\left(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\right) S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0. \quad (C2)
$$
Since $g_i(z^*) > 0$ for all $i \in \{1, \ldots, 3m + 2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,
$$g_i\left(z^* + \alpha \Delta z^*\right) > 0, \quad \forall i \in \{1, \ldots, 3m + 2n\}. \quad (C3)$$
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists a $\tilde{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \tilde{\alpha}]$:
$$\Phi_\mu\left(z^* + \alpha \Delta z^*\right) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \quad (C4)$$
Let $m^* = \min\{j \mid \beta^j \in (0, \tilde{\alpha}],\ j = 0, 1, \ldots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, \ldots, 3m + 2n\}$:
$$g_i\left(z^k + \alpha^* \Delta z^k\right) > 0. \quad (C5)$$
Moreover, we have
$$
\begin{aligned}
\Phi_\mu\left(z^k + \alpha^* \Delta z^k\right) - \Phi_\mu(z^k)
&= \Phi_\mu\left(z^* + \alpha^* \Delta z^*\right) - \Phi_\mu(z^*) \\
&\quad + \Phi_\mu\left(z^k + \alpha^* \Delta z^k\right) - \Phi_\mu\left(z^* + \alpha^* \Delta z^*\right) + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^* \\
&\le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k. \quad (C6)
\end{aligned}
$$
By the backtracking line search rule, the last inequality together with (C5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C3) that
$$
\Phi_\mu\left(z^{k+1}\right) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\le \Phi_\mu(z^k) + \frac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \quad (C7)
$$
This shows that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}$ is bounded from below. The proof is complete.

Next we establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
C.1 Proof of Theorem 5

Proof. Let $(z^*, \hat{\lambda}^*)$ be a limit point of $\{(z^k, \hat{\lambda}^k)\}$, and let the subsequence $\{(z^k, \hat{\lambda}^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat{\lambda}^*)$. Recall that $(\Delta z^k, \hat{\lambda}^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C1 and C3, we obtain
$$
\begin{aligned}
&\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0,\\
&C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0,\\
&e_m - 2 T_* \hat{\lambda}^{*(1)} - 2 T_* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0,\\
&y^T \hat{\lambda}^{*(4)} = 0,\\
&\left(T_*^2 - W_*\right) \hat{\lambda}^{*(1)} - \mu e_m = 0,\\
&\left(T_*^2 + W_*\right) \hat{\lambda}^{*(2)} - \mu e_m = 0,\\
&T_* \hat{\lambda}^{*(3)} - \mu e_m = 0,\\
&P_* \hat{\lambda}^{*(4)} - \mu e_n = 0,\\
&\Xi_* \hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{aligned} \quad (C8)
$$
This shows that $(z^*, \hat{\lambda}^*)$ satisfies the first-order optimality conditions (15).
C.2 Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat{\lambda}^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \hat{\lambda}^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
$$
\operatorname{Res}\left(z^j, \hat{\lambda}^j, \mu_j\right) \xrightarrow{\mathcal{J}} \operatorname{Res}\left(z^*, \hat{\lambda}^*, 0\right) = 0, \qquad \hat{\lambda}^j \xrightarrow{\mathcal{J}} \hat{\lambda}^* \ge 0. \quad (C9)
$$
Consequently, $(z^*, \hat{\lambda}^*)$ satisfies the KKT conditions (13).

(ii) Let $\hat{\xi}_j = \max\{\|\hat{\lambda}^j\|_\infty, 1\}$ and $\bar{\lambda}^j = \hat{\xi}_j^{-1} \hat{\lambda}^j$. Obviously, $\{\bar{\lambda}^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\bar{\lambda}^j\}_{\mathcal{J}'} \to \bar{\lambda} \ne 0$ and $\|\bar{\lambda}^j\|_\infty = 1$ for large $j \in \mathcal{J}'$. From (30) we know that $\bar{\lambda} \ge 0$. Dividing both sides of the first equation of (18) with $(z, \lambda) = (z^k, \lambda^k)$ by $\hat{\xi}_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
$$
\begin{aligned}
&\bar{\lambda}^{(1)} - \bar{\lambda}^{(2)} - X^T Y^T \bar{\lambda}^{(4)} = 0,\\
&\bar{\lambda}^{(4)} + \bar{\lambda}^{(5)} = 0,\\
&2 T_* \bar{\lambda}^{(1)} + 2 T_* \bar{\lambda}^{(2)} + \bar{\lambda}^{(3)} = 0,\\
&y^T \bar{\lambda}^{(4)} = 0,\\
&\left(T_*^2 - W_*\right) \bar{\lambda}^{(1)} = 0,\\
&\left(T_*^2 + W_*\right) \bar{\lambda}^{(2)} = 0,\\
&T_* \bar{\lambda}^{(3)} = 0,\\
&P_* \bar{\lambda}^{(4)} = 0,\\
&\Xi_* \bar{\lambda}^{(5)} = 0.
\end{aligned} \quad (C10)
$$
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz-John point of problem (10).
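The outer iteration analyzed in Theorem 6 is a standard barrier continuation: each subproblem (14) is solved to tolerance $\epsilon_{\mu_j}$, after which both $\mu$ and the tolerance are reduced by the same factor $\beta \in (0, 1)$. A schematic sketch (our paraphrase of Algorithm 4; `solve_barrier_subproblem` and `residual` are hypothetical stand-ins for Algorithm 2 and $\operatorname{Res}(z, \lambda, \mu)$):

```python
def interior_point(z0, lam0, solve_barrier_subproblem, residual,
                   mu=1.0, eps_mu=1.0, beta=0.5, tol=1e-6, max_outer=100):
    """Barrier continuation: drive mu and the inner tolerance to zero
    together, as in Algorithm 4."""
    z, lam = z0, lam0
    for _ in range(max_outer):
        # Inner solve: Algorithm 2 applied to (14) with fixed mu,
        # stopped once Res(z, lam, mu) < eps_mu.
        z, lam = solve_barrier_subproblem(z, lam, mu, eps_mu)
        if residual(z, lam, 0.0) <= tol:    # test optimality for (10)
            return z, lam
        mu, eps_mu = beta * mu, beta * eps_mu
    return z, lam
```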
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).
References
[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers—robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$l_p$-$l_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $L_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays, Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
12-SVM has better performance than the two
The average results over ten artificial data sets in the firstexperiment are shown in the bottom of Table 1 On averagethe accuracy of 119871
12-SVM is 087 lower than the 119871
1-SVM
while the features selected by 11987112
-SVM are 7414 less than1198711-SVM It indicates that the 119871
12-SVM can achieve much
sparser solution than 1198711-SVM with little cost of accuracy
Moreover the average accuracy of 11987112
-SVMover 10 data setsis 222 higher than 119871
0-SVM with the similar cardinality
To sum up the 11987112
-SVM provides the best balance betweenaccuracy and sparsity among the three sparse SVMs
To further evaluate the feature selection performance of11987112
-SVM we investigate whether the features are correctlyselected For the119871
2-SVM is not designed for feature selection
it is not included in this comparison Since our artificial datasets have 2 best features (119909
3 1199096) the best result should have
the two features (1199093 1199096) ranking on the top according to their
absolute values of weights |119908119895| In the experiment we select
the top 2 features with the maximal |119908119895| for each method
and calculate the frequency that the top 2 features are 1199093
and 1199096in 30 runs The results are listed in Table 2 When the
training sample size is too small it is difficult to discriminatethe two most important features for all sparse SVMs Forexample when 119899 = 10 the selected frequencies of 119871
1-SVM
1198710-SVM and 119871
12-SVM are 7 3 and 9 respectively When 119899
increases all methods tend to make more correct selectionMoreover Table 2 shows that the 119871
12-SVM outperforms the
other two methods in all cases For example when 119899 = 100the selected frequencies of 119871
1-SVM and119871
0-SVM are 22 and
25 respectively and the result of 11987112
-SVM is 27 The 1198711-
SVMselects toomany redundant or irrelevant features whichmay influence the ranking in some extent Therefore 119871
1-
SVM is not so good as 11987112
-SVM at distinguishing the criticalfeatures The 119871
0-SVM has the lower hit frequency than 119871
12-
SVM which is probably due to the excessive small featuresubset it obtained Above all Tables 1 and 2 and Figure 2clearly show that the 119871
12-SVM is a promising sparsity driven
classification methodIn the second simulation we consider the cases with
various dimensions of feature space 119898 = 20 40 180 200
and the fixed training sample size 119899 = 100 The averageresults over 30 trials are shown in Figure 3 and Table 3 Sincethere are only 6 relevant features yet the larger 119898 meansthe more noisy features Figure 3 (left) shows that as thedimension increases from 20 to 200 the number of featuresselected by 119871
1-SVM increases from 826 to 231 However the
cardinalities of 11987112
-SVM and 1198710-SVM keep stable (from 22
to 295) It indicates that the 11987112
-SVM and 1198710-SVM aremore
suitable for feature selection than 1198711-SVM
Figure 3 (right) shows thatwith the increasing of the noisefeatures the accuracy of 119871
2-SVM drops significantly (from
9868 to 8747) On the contrary to the other three sparse
Journal of Applied Mathematics 9
0 20 40 60 80 1000
5
10
15
20
25
30
Number of training samples
Card
inal
ity
0 20 40 60 80 10075
80
85
90
95
100
Number of training samples
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 2 Results comparison on 10 artificial data sets with various training samples
Table 1 Results comparison on 10 artificial data sets with various training sample sizes
1198991198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Card Acc Card Acc Card Acc Card Acc10 2997 8465 693 8855 143 7765 225 880520 3000 9221 883 9679 187 9317 235 955030 3000 9427 847 9823 203 9467 205 974140 2997 9599 900 9875 207 9539 210 977550 3000 9661 950 9891 213 9657 245 976860 2997 9725 1000 9894 220 9761 245 979870 3000 9741 1100 9898 220 9756 240 982380 3000 9768 1003 9909 223 9761 260 985090 2997 9789 920 9921 230 9741 290 9830100 3000 9813 1027 9909 250 9793 255 9836On average 2999 9521 932 9765 210 9456 241 9678ldquo119899rdquo is the number of training samples ldquoCardrdquo represents the number of selected features and ldquoAccrdquo is the classification accuracy
Table 2 Frequency of the most important features (1199093 1199096) ranked on top 2 in 30 runs
119899 10 20 30 40 50 60 70 80 90 1001198711-SVM 7 14 16 18 22 23 20 22 21 22
1198710-SVM 3 11 12 15 19 22 22 23 22 25
11987112-SVM 9 16 19 24 24 23 27 26 26 27
Table 3 Average results over 10 artificial data sets with varies dimension
1198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Cardinality 10995 1585 238 241Accuracy 9293 9907 9776 9826
10 Journal of Applied Mathematics
Table 4 Feature selection performance comparison on UCI data sets (Cardinality)
No UCI Data Set 119899 1198981198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 768 8 800 000 740 089 620 110 760 1522 Breast Cancer 683 9 900 000 860 122 640 207 840 1673 Wine (3) 178 13 1300 000 973 069 373 028 407 0494 Image (7) 2100 19 1900 000 820 053 366 097 511 0625 SPECT 267 22 2200 000 1740 527 980 427 9 1176 WDBC 569 30 3000 000 980 152 900 303 560 2057 Ionosphere 351 34 3320 045 2500 335 2460 1021 2180 8448 SPECTF 267 44 4400 000 3880 1098 2400 1132 2560 7759 Sonar 208 60 6000 000 3140 1305 2700 718 1360 100610 Musk1 476 166 16600 000 8580 1885 4860 428 41 1416
Average cardinality 4050 4042 2421 1630 1418
Table 5 Classification performance comparison on UCI data sets (accuaracy)
No UCI Data Set 1198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 7699 347 7673 085 7673 294 7712 1522 Breast cancer 9618 223 9706 096 9632 138 9647 1593 Wine (3) 9771 373 9829 128 9314 239 9714 2564 Image (7) 8894 149 9127 177 8667 458 9136 1945 SPECT 8340 215 7849 231 7811 338 8234 5106 WDBC 9611 211 9646 131 9558 115 9646 1317 Ionosphere 8343 544 8800 370 8457 538 8743 5948 SPECTF 7849 391 7434 450 7387 325 7849 1899 Sonar 7366 529 7561 582 7415 798 7610 43510 Musk1 8484 173 8211 242 7600 564 8232 338
Average accuracy 8597 8584 8351 8652
SVMs there is little change in the accuracy It reveals thatSVMs can benefit from the features reduction
Table 3 shows the average results over all data sets inthe second experiment On average the solution of 119871
12-
SVM yields much sparser than 1198711-SVM and a slightly better
accuracy than 1198710-SVM
42 UCI Data Sets We further tested the reformulation andthe proposed interior point methods to 119871
12-SVM on 10 UCI
data sets [36] There are 8 binary classification problems and2multiclass problems (wine image) Each feature of the inputdata was normalized to zero mean and unit variance and theinstances with missing value were deletedThen the data wasrandomly split into training set (80) and testing set (20)For the two multiclass problems a one-against-rest methodwas applied to construct a binary classifier for each class Werepeated the training and testing procedure 10 times and theaverage results were shown in Tables 4 and 5 and Figure 4
Tables 4 and 5 summarize the feature selection andclassification performance of the numerical experiments onUCI data sets respectively Here 119899 is the numbers of samplesand 119898 is the number of the input features For the two
multiclass data sets the numbers of the classes are markedbehind their names Sparsity is defined as card119898 and thesmall value of sparsity is preferredThe data sets are arrangedin descending order according to the dimension The lowestcardinality and the best accuracy rate for each problem arebolded
As shown in Tables 4 and 5 the three sparse SVMs canencourage sparsity in all data sets while remaining roughlyidentical accuracy with 119871
2-SVM Among the three sparse
methods the 11987112
-SVM has the lowest cardinality (1418)and the highest classification accuracy (8652) on averageWhile the 119871
1-SVM has the worst feature selection perform-
ance with the highest average cardinality (2421) and the1198710-SVM has the lowest average classification accuracy
(8351)Figure 4 plots the sparsity (left) and classification accu-
racy (right) of each classifier on UCI data sets In three datasets (6 9 10) the 119871
12-SVM has the best performance both
in feature selection and classification among the three sparseSVMs Compared with 119871
1-SVM 119871
12-SVM can achieve
sparser solution in nine data sets For example in the dataset ldquo8 SPECTFrdquo the features selected by 119871
12-SVM are 34
Journal of Applied Mathematics 11
50 100 150 2000
20
40
60
80
100
120
140
160
180
200
Dimension
Card
inal
ity
50 100 150 20080
82
84
86
88
90
92
94
96
98
100
Dimension
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 3 Results comparison on artificial data set with various dimensions
For example, in the data set "8 SPECTF", the features selected by $L_{1/2}$-SVM are 34% fewer than those selected by $L_1$-SVM, while at the same time the accuracy of $L_{1/2}$-SVM is 4.1% higher than that of $L_1$-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of $L_{1/2}$-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by $L_{1/2}$-SVM is a 58.2% improvement over $L_1$-SVM. In the remaining three data sets (1, 2, 7), the two methods give similar results in both feature selection and classification. As seen above, the $L_{1/2}$-SVM can provide a lower-dimensional representation than $L_1$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with $L_0$-SVM, $L_{1/2}$-SVM improves the classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of $L_{1/2}$-SVM is at least 4.0% higher than that of $L_0$-SVM; in particular, in the data set "10 Musk1", $L_{1/2}$-SVM gives a 6.3% rise in accuracy over $L_0$-SVM. Meanwhile, it can be observed from Figure 4 (left) that $L_{1/2}$-SVM selects fewer features than $L_0$-SVM in five data sets (5, 6, 7, 9, 10); for example, in data sets 6 and 9, the cardinalities of $L_{1/2}$-SVM are 37.8% and 49.6% lower than those of $L_0$-SVM, respectively. In summary, $L_{1/2}$-SVM presents better classification performance than $L_0$-SVM while remaining effective in choosing relevant features.
5. Conclusions

In this paper, we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM into an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, and it is relatively easy to develop numerical methods for it. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method: the $L_{1/2}$-SVM obtains sparser solutions than $L_1$-SVM with comparable classification accuracy, and it achieves more accurate classification results than $L_0$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, some interesting topics deserve further research, for example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness. Some of them are under our current investigation.
Appendix
A. Properties of the Reformulation of the $L_{1/2}$-SVM
Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.
Lemma A.1. For any feasible point $z$ of (10), one has
$$FD(z, D) = LFD(z, D). \tag{A.1}$$
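For ease of reference, the appendix repeatedly appeals to the constraints of problem (10). Reading them off from the Lagrangian (A.2) below, the reformulated problem can be written as
$$
\begin{aligned}
\min_{w, t, \xi, b}\quad & \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & t_j^2 - w_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad t_j \ge 0, \qquad j = 1, \ldots, m, \\
& y_i (w^T x_i + b) + \xi_i - 1 \ge 0, \quad \xi_i \ge 0, \qquad i = 1, \ldots, n,
\end{aligned}
$$
so that any feasible point satisfies $|w_j| \le t_j^2$; at a solution $t_j^2 = |w_j|$, and $\sum_j t_j$ realizes the $L_{1/2}$ penalty $\sum_j |w_j|^{1/2}$. This restatement is our reading of (10), recorded here only to keep the appendix self-contained; it gives $3m + 2n$ inequality constraints in total, matching the multiplier count below.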
[Figure 4: Results comparison on UCI data sets. Left: sparsity versus data set number; right: accuracy (%) versus data set number; curves for $L_2$-SVM, $L_1$-SVM, $L_0$-SVM, and $L_{1/2}$-SVM.]
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$
\begin{aligned}
L(w, t, \xi, b; \lambda, \mu, \nu, p, q)
&= C \sum_{i=1}^{n} \xi_i + \sum_{j=1}^{m} t_j
 - \sum_{j=1}^{m} \lambda_j (t_j^2 - w_j)
 - \sum_{j=1}^{m} \mu_j (t_j^2 + w_j) \\
&\quad - \sum_{j=1}^{m} \nu_j t_j
 - \sum_{i=1}^{n} p_i (y_i w^T x_i + y_i b + \xi_i - 1)
 - \sum_{i=1}^{n} q_i \xi_i,
\end{aligned} \tag{A.2}
$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in R^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, && j = 1, \ldots, m, \\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, && j = 1, \ldots, m, \\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, && i = 1, \ldots, n, \\
&\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0, \\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j (t_j^2 - w_j) = 0, && j = 1, \ldots, m, \\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j (t_j^2 + w_j) = 0, && j = 1, \ldots, m, \\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, && j = 1, \ldots, m, \\
&p_i \ge 0, \quad y_i (w^T x_i + b) + \xi_i - 1 \ge 0, \quad p_i \big(y_i (w^T x_i + b) + \xi_i - 1\big) = 0, && i = 1, \ldots, n, \\
&q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, && i = 1, \ldots, n.
\end{aligned} \tag{A.3}
$$
The following theorem shows that the level sets of the objective over the feasible region are bounded.

Theorem A.3. For any given constant $c > 0$, the level set
$$W_c = \{z = (w, t, \xi, b) \mid f(z) \le c\} \cap D \tag{A.4}$$
is bounded.
Proof. For any $(w, t, \xi, b) \in W_c$, we have
$$\sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \le c. \tag{A.5}$$
Combining this with $t_j \ge 0$, $j = 1, \ldots, m$, and $\xi_i \ge 0$, $i = 1, \ldots, n$, we have
$$0 \le t_j \le c, \quad j = 1, \ldots, m, \qquad 0 \le \xi_i \le \frac{c}{C}, \quad i = 1, \ldots, n. \tag{A.6}$$
Moreover, for any $(w, t, \xi, b) \in D$,
$$|w_j| \le t_j^2, \quad j = 1, \ldots, m. \tag{A.7}$$
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \tag{A.8}$$
we have
$$w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1. \tag{A.9}$$
Thus, if the feasible region $D$ is not empty, then we have
$$\max_{y_i = 1} \big(1 - \xi_i - w^T x_i\big) \le b \le \min_{y_i = -1} \big({-1} + \xi_i - w^T x_i\big). \tag{A.10}$$
Hence $b$ is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite; this is what ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.
Proof. By an elementary deduction we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, and $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \neq 0$,
$$
\begin{aligned}
&\big(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\big)\, S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
 = \sum_{k,l=1}^{4} v^{(k)T} S_{kl}\, v^{(l)} \\
&\quad = v^{(1)T} \big(U + V + X^T Y^T P^{-1} D_4 Y X\big) v^{(1)}
 + v^{(1)T} \big(X^T Y^T P^{-1} D_4\big) v^{(2)} \\
&\qquad - 2\, v^{(1)T} (U - V)\, T\, v^{(3)}
 + v^{(1)T} \big(X^T Y^T P^{-1} D_4\, y\big) v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_4 Y X\, v^{(1)}
 + v^{(2)T} \big(P^{-1} D_4 + \Xi^{-1} D_5\big) v^{(2)}
 + v^{(2)T} P^{-1} D_4\, y\, v^{(4)} \\
&\qquad - 2\, v^{(3)T} T (U - V)\, v^{(1)}
 + v^{(3)T} \big(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\big) v^{(3)} \\
&\qquad + v^{(4)T} y^T P^{-1} D_4 Y X\, v^{(1)}
 + v^{(4)T} y^T P^{-1} D_4\, v^{(2)}
 + v^{(4)T} y^T P^{-1} D_4\, y\, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4\, v^{(3)T} T U v^{(1)} + 4\, v^{(3)T} T U T v^{(3)} \\
&\qquad + v^{(1)T} V v^{(1)} + 4\, v^{(3)T} T V v^{(1)} + 4\, v^{(3)T} T V T v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_5\, v^{(2)}
 + v^{(3)T} \big(T^{-1} D_3 - 2 D_1 - 2 D_2\big) v^{(3)} \\
&\qquad + \big(Y X v^{(1)} + v^{(2)} + v^{(4)} y\big)^T P^{-1} D_4 \big(Y X v^{(1)} + v^{(2)} + v^{(4)} y\big) \\
&\quad = e_m^T U \big(D_{v_1} - 2 T D_{v_3}\big)^2 e_m
 + e_m^T V \big(D_{v_1} + 2 T D_{v_3}\big)^2 e_m
 + v^{(2)T} \Xi^{-1} D_5\, v^{(2)} \\
&\qquad + v^{(3)T} \big(T^{-1} D_3 - 2 D_1 - 2 D_2\big) v^{(3)}
 + \big(Y X v^{(1)} + v^{(2)} + v^{(4)} y\big)^T P^{-1} D_4 \big(Y X v^{(1)} + v^{(2)} + v^{(4)} y\big) \\
&\quad > 0,
\end{aligned} \tag{B.1}
$$
where $D_{v_1} = \operatorname{diag}(v^{(1)})$, $D_{v_2} = \operatorname{diag}(v^{(2)})$, $D_{v_3} = \operatorname{diag}(v^{(3)})$, and $D_{v_4} = \operatorname{diag}(v^{(4)})$. Indeed, all five terms in the last expression are nonnegative (the fourth by inequality (26)), and they cannot vanish simultaneously unless $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) = 0$.
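The positive definiteness claimed by Proposition B.1 can also be checked numerically. The following sketch is our illustration, not part of the paper: it assembles $S$ from the blocks appearing in (B.1) with randomly generated positive diagonal matrices $U, V, D_1, \ldots, D_5$, $T$, $P$, $\Xi$ and random data $X$, $y$, enforces condition (26) by construction, and verifies positive definiteness via a Cholesky factorization.

```python
# Sanity check (illustrative, randomly generated quantities) that the
# matrix S in (21) is positive definite when condition (26) holds.
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
X = rng.standard_normal((n, m))                  # data matrix, rows x_i^T
y = np.where(rng.standard_normal(n) >= 0, 1.0, -1.0)
Y = np.diag(y)                                   # Y^T = Y, so X^T Y^T = X^T Y
t  = rng.uniform(0.5, 2.0, m);  T = np.diag(t)
U  = np.diag(rng.uniform(0.1, 1.0, m))
V  = np.diag(rng.uniform(0.1, 1.0, m))
d1 = rng.uniform(0.1, 1.0, m);  d2 = rng.uniform(0.1, 1.0, m)
d3 = 2.0 * (d1 + d2) * t + 0.1                   # enforces (26) strictly
d4 = rng.uniform(0.1, 1.0, n);  d5 = rng.uniform(0.1, 1.0, n)
p  = rng.uniform(0.5, 2.0, n);  xi = rng.uniform(0.5, 2.0, n)

PD4 = np.diag(d4 / p)                            # P^{-1} D_4
S11 = U + V + X.T @ Y @ PD4 @ Y @ X
S12 = X.T @ Y @ PD4
S13 = -2.0 * (U - V) @ T
S14 = (X.T @ Y @ PD4 @ y).reshape(m, 1)
S22 = PD4 + np.diag(d5 / xi)                     # P^{-1}D_4 + Xi^{-1}D_5
S24 = (PD4 @ y).reshape(n, 1)
S33 = 4.0 * T @ (U + V) @ T + np.diag(d3 / t) - 2.0 * np.diag(d1 + d2)
S44 = np.array([[y @ PD4 @ y]])
Z23 = np.zeros((n, m)); Z34 = np.zeros((m, 1))
S = np.block([[S11,   S12,   S13,   S14],
              [S12.T, S22,   Z23,   S24],
              [S13.T, Z23.T, S33,   Z34],
              [S14.T, S24.T, Z34.T, S44]])
np.linalg.cholesky(S)          # raises LinAlgError iff S is not positive definite
print("S is positive definite")
```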
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \neq 0$. We have
$$
\begin{aligned}
&\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k
 + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k \\
&\quad = -\big(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\big)\, S
 \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}.
\end{aligned} \tag{B.2}
$$
Since the matrix $S$ is positive definite and $\Delta z^k \neq 0$, the last equation implies
$$\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \tag{B.3}$$
Consequently, there exists an $\bar\alpha_k \in (0, 1]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar\alpha_k]$.

On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha \Delta z^k$ remains strictly feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
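The step-size selection that this argument relies on is a standard Armijo backtracking rule [34] combined with an interior-point feasibility test. A minimal sketch, under our reading of Algorithm 2 and with illustrative parameter values ($\tau_1 \in (0, 1/2)$, $\beta \in (0, 1)$), is:

```python
# Generic Armijo backtracking with a strict-feasibility test; a sketch of
# the step-size rule (23) in Algorithm 2.  Names and defaults are
# illustrative, not the paper's.
import numpy as np

def backtrack(phi, g, z, dz, slope, beta=0.6, tau1=0.25, max_backtracks=60):
    """phi(z): barrier objective Phi_mu; g(z): vector of constraint values,
    required to stay strictly positive; slope = grad Phi_mu(z)^T dz."""
    assert slope < 0, "dz must be a descent direction (cf. (B.3))"
    alpha = 1.0
    for _ in range(max_backtracks):
        z_new = z + alpha * dz
        if np.all(g(z_new) > 0) and phi(z_new) <= phi(z) + tau1 * alpha * slope:
            return alpha, z_new        # step accepted
        alpha *= beta                  # shrink: alpha runs through beta^j
    raise RuntimeError("no acceptable step length found")
```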
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when it is applied to the barrier subproblem (14) with a fixed $\mu$.
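Throughout this appendix, $\Phi_\mu$ denotes the barrier function of subproblem (14). Assuming the standard logarithmic barrier form (consistent with the perturbed complementarity conditions in (15)), it is
$$\Phi_\mu(z) = f(z) - \mu \sum_{i=1}^{3m+2n} \log g_i(z), \qquad f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i,$$
so that $\Phi_\mu(z^k) \to \infty$ whenever some constraint value $g_i(z^k)$ tends to zero while $f$ remains bounded below; this is the mechanism exploited in the proof of Lemma C.1 below.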
Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then the $z^k$ are strictly feasible for problem (10), and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.
Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \ldots, 3m+2n$, to denote the constraint functions of the constrained problem (10).

We first show that the $z^k$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \ldots, 3m+2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$, and since $f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for every $i \in \{1, 2, \ldots, 3m+2n\}$, $g_i(z^k)$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)–(27).
Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $(\Delta z^k, \hat\lambda^{k+1})$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\mathcal{K}}$ is bounded.
Proof. Again we use $g_i(z)$, $i = 1, 2, \ldots, 3m+2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{\|(\Delta z^k, \hat\lambda^{k+1})\|\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \ldots, 3m+2n\}$. Thus we have
$$S_k \to S^* \tag{C.1}$$
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat\lambda^{k+1})\}_{\bar{\mathcal{K}}}$ would imply that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $(\Delta z^k, \hat\lambda^{k+1})$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.
Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \neq 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat\lambda^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat\lambda^*$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat\lambda^k\}_{\bar{\mathcal{K}}} \to \hat\lambda^*$. By Lemma C.1 we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \ldots, 3m+2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$
\begin{aligned}
&\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^*
 + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^* \\
&\quad = -\big(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\big)\, S^*
 \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0.
\end{aligned} \tag{C.2}
$$
Since $g_i(z^*) > 0$ for all $i \in \{1, 2, \ldots, 3m+2n\}$, there exists an $\hat\alpha \in (0, 1]$ such that, for all $\alpha \in (0, \hat\alpha]$,
$$g_i(z^* + \alpha \Delta z^*) > 0, \quad \forall i \in \{1, 2, \ldots, 3m+2n\}. \tag{C.3}$$
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\bar\alpha \in (0, \hat\alpha]$ such that the following inequality holds for all $\alpha \in (0, \bar\alpha]$:
$$\Phi_\mu(z^* + \alpha \Delta z^*) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C.4}$$
Let $m^* = \min\{j \mid \beta^j \in (0, \bar\alpha],\ j = 0, 1, \ldots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all sufficiently large $k \in \bar{\mathcal{K}}$ and any $i \in \{1, 2, \ldots, 3m+2n\}$:
$$g_i(z^k + \alpha^* \Delta z^k) > 0. \tag{C.5}$$
Moreover, we have
$$
\begin{aligned}
\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^k)
&= \Phi_\mu(z^* + \alpha^* \Delta z^*) - \Phi_\mu(z^*) \\
&\quad + \Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^* + \alpha^* \Delta z^*)
 + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*
 \le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k
\end{aligned} \tag{C.6}
$$
for all sufficiently large $k \in \bar{\mathcal{K}}$. By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and the convergence $\nabla_z \Phi_\mu(z^k)^T \Delta z^k \to \nabla_z \Phi_\mu(z^*)^T \Delta z^* < 0$ that
$$
\begin{aligned}
\Phi_\mu(z^{k+1})
&\le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k
 \le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k \\
&\le \Phi_\mu(z^k) + \frac{1}{2}\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*.
\end{aligned} \tag{C.7}
$$
Since the whole sequence $\{\Phi_\mu(z^k)\}$ is decreasing, this shows that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}$ is bounded from below. The proof is complete.
We can now establish the convergence of Algorithms 2 and 4, that is, prove Theorems 5 and 6 stated in Section 3.
C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat\lambda^*)$ be a limit point of $\{(z^k, \hat\lambda^k)\}$, and let the subsequence $\{(z^k, \hat\lambda^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat\lambda^*)$.

Recall that $(\Delta z^k, \hat\lambda^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C.1 and C.3 we obtain
$$
\begin{aligned}
&\hat\lambda^{*(1)} - \hat\lambda^{*(2)} - X^T Y^T \hat\lambda^{*(4)} = 0, \qquad
 C e_n - \hat\lambda^{*(4)} - \hat\lambda^{*(5)} = 0, \\
&e_m - 2 T^* \hat\lambda^{*(1)} - 2 T^* \hat\lambda^{*(2)} - \hat\lambda^{*(3)} = 0, \qquad
 y^T \hat\lambda^{*(4)} = 0, \\
&\big(T^{*2} - W^*\big) \hat\lambda^{*(1)} - \mu e_m = 0, \qquad
 \big(T^{*2} + W^*\big) \hat\lambda^{*(2)} - \mu e_m = 0, \\
&T^* \hat\lambda^{*(3)} - \mu e_m = 0, \qquad
 P^* \hat\lambda^{*(4)} - \mu e_n = 0, \qquad
 \Xi^* \hat\lambda^{*(5)} - \mu e_n = 0.
\end{aligned} \tag{C.8}
$$
This shows that $(z^*, \hat\lambda^*)$ satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat\lambda^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \hat\lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j \to 0$ and $\epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
$$\big\{\mathrm{Res}(z^j, \hat\lambda^j, \mu_j)\big\}_{\mathcal{J}} \to \mathrm{Res}(z^*, \hat\lambda^*, 0) = 0, \qquad \{\hat\lambda^j\}_{\mathcal{J}} \to \hat\lambda^* \ge 0. \tag{C.9}$$
Consequently, $(z^*, \hat\lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat\lambda^j\|_\infty, 1\}$ and $\bar\lambda^j = \xi_j^{-1} \hat\lambda^j$. Obviously, $\{\bar\lambda^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\bar\lambda^j\}_{\mathcal{J}'} \to \bar\lambda \neq 0$ and $\|\bar\lambda^j\|_\infty = 1$ for all large $j \in \mathcal{J}'$. From (30) we know that $\bar\lambda \ge 0$. Dividing both sides of the first equation of (18) with $(z, \lambda) = (z^j, \lambda^j)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
$$
\begin{aligned}
&\bar\lambda^{(1)} - \bar\lambda^{(2)} - X^T Y^T \bar\lambda^{(4)} = 0, \qquad
 \bar\lambda^{(4)} + \bar\lambda^{(5)} = 0, \\
&2 T^* \bar\lambda^{(1)} + 2 T^* \bar\lambda^{(2)} + \bar\lambda^{(3)} = 0, \qquad
 y^T \bar\lambda^{(4)} = 0, \\
&\big(T^{*2} - W^*\big) \bar\lambda^{(1)} = 0, \qquad
 \big(T^{*2} + W^*\big) \bar\lambda^{(2)} = 0, \\
&T^* \bar\lambda^{(3)} = 0, \qquad
 P^* \bar\lambda^{(4)} = 0, \qquad
 \Xi^* \bar\lambda^{(5)} = 0.
\end{aligned} \tag{C.10}
$$
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz-John point [35] of problem (10).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.
References
[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $L_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)
+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)
+ Φ120583(119911lowast) minus Φ120583(119911119896)
le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
le 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
(C6)
By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572
119896ge 120572lowastfor all 119896 isin K
large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that
Φ120583(119911119896+1
) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) +
1
21205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
(C7)
This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the
fact that Φ120583(119911119896)K is bounded from below The proof is
complete
Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3
C1 Proof of Theorem 5
Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911
lowast lowast)
Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =
(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =
(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we
obtain
lowast(1)
minus lowast(2)
minus 119883119879119884119879lowast(4)
= 0
119862 lowast 119890119899minus lowast(4)
minus lowast(5)
= 0
119890119898minus 2119879lowastlowast(1)
minus 2119879lowastlowast(2)
minus lowast(3)
= 0
119910119879lowast(4)
= 0
(119879lowast2
minus 119882lowast) lowast(1)
minus 120583119890119898
= 0
(119879lowast2
+ 119882lowast) lowast(2)
minus 120583119890119898
= 0
119879lowastlowast(3)
minus 120583119890119898
= 0
119875lowastlowast(4)
minus 120583119890119899= 0
Ξlowastlowast(5)
minus 120583119890119899= 0
(C8)
This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)
C2 Proof of Theorem 6
Proof (i) Without loss of generality we suppose that thebounded subsequence (119911
119895 119895)J converges to some point
(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since
120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that
Res (119911119895 119895 120583119895)
J997888rarr Res (119911lowast lowast 0) = 0
119895J
997888rarr lowastge 0
(C9)
Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585
119895= max119895
infin 1 and
119895= 120585minus1
119895119895 Obviously
119895 is bounded Hence there exists an infinite subsetJ1015840 sube J
such that 1198951015840J rarr = 0 and 119895infin
= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911
119896 120582119896) by 120585119895 then taking
limits as 119895 rarr infin with 119895 isin J we get
(1)
minus (2)
minus 119883119879119884119879(4)
= 0
(4)
+ (5)
= 0
2119879lowast(1)
+ 2119879lowast(2)
+ (3)
= 0
119910119879(4)
= 0
(119879lowast2
minus 119882lowast) (1)
= 0
(119879lowast2
+ 119882lowast) (2)
= 0
119879lowast(3)
= 0
119875lowast(4)
= 0
Ξlowast(5)
= 0
(C10)
Since 119911lowast is feasible the above equations has shown that 119911lowast is
a Fritz-John point of problem (10)
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China
References
[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th
16 Journal of Applied Mathematics
International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997
[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003
[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002
[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006
[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007
[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003
[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004
[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006
[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003
[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000
[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004
[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998
[13] JWeston A Elisseeff B Scholkopf andM Tipping ldquoUse of thezero-norm with linear models and kernel methodsrdquo Journal ofMachine Learning Research vol 3 pp 1439ndash1461 2003
[14] A B Chan N Vasconcelos and G R G Lanckriet ldquoDirectconvex relaxations of sparse SVMrdquo in Proceedings of the 24thInternational Conference on Machine Learning (ICML rsquo07) pp145ndash153 Corvallis Ore USA June 2007
[15] GM Fung andO L Mangasarian ldquoA feature selection Newtonmethod for support vector machine classificationrdquo Computa-tional Optimization and Applications vol 28 no 2 pp 185ndash2022004
[16] J Zhu S Rosset T Hastie and R Tibshirani ldquo1-norm supportvector machinesrdquo Advances in Neural Information ProcessingSystems vol 16 no 1 pp 49ndash56 2004
[17] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society B Methodological vol 58no 1 pp 267ndash288 1996
[18] T Zhang ldquoAnalysis of multi-stage convex relaxation for sparseregularizationrdquo Journal of Machine Learning Research vol 11pp 1081ndash1107 2010
[19] R Chartrand ldquoExact reconstruction of sparse signals via non-convexminimizationrdquo IEEE Signal Processing Letters vol 14 no10 pp 707ndash710 2007
[20] R Chartrand ldquoNonconvex regularizaron for shape preserva-tionrdquo inProceedings of the 14th IEEE International Conference onImage Processing (ICIP rsquo07) vol 1 pp I293ndashI296 San AntonioTex USA September 2007
[21] Z Xu H Zhang Y Wang X Chang and Y Liang ldquo11987112
regularizationrdquo Science China Information Sciences vol 53 no6 pp 1159ndash1169 2010
[22] W J Chen and Y J Tian ldquoLp-norm proximal support vectormachine and its applicationsrdquo Procedia Computer Science vol1 no 1 pp 2417ndash2423 2010
[23] J Liu J Li W Xu and Y Shi ldquoA weighted L119902adaptive least
squares support vector machine classifiers-Robust and sparseapproximationrdquo Expert Systems with Applications vol 38 no 3pp 2253ndash2259 2011
[24] Y Liu H H Zhang C Park and J Ahn ldquoSupport vector mach-ines with adaptive 119871
119902penaltyrdquo Computational Statistics amp Data
Analysis vol 51 no 12 pp 6380ndash6394 2007[25] Z Liu S Lin andM Tan ldquoSparse support vectormachines with
Lp penalty for biomarker identificationrdquo IEEEACM Transac-tions on Computational Biology and Bioinformatics vol 7 no1 pp 100ndash107 2010
[26] A Rakotomamonjy R Flamary G Gasso and S Canu ldquol119901-l119902
penalty for sparse linear and sparse multiple kernel multitasklearningrdquo IEEE Transactions on Neural Networks vol 22 no 8pp 1307ndash1320 2011
[27] J Y Tan Z Zhang L Zhen C H Zhang and N Y DengldquoAdaptive feature selection via a new version of support vectormachinerdquo Neural Computing and Applications vol 23 no 3-4pp 937ndash945 2013
[28] Z B Xu X Y Chang F M Xu and H Zhang ldquo11987112
regu-larization an iterative half thresholding algorithmrdquo IEEETrans-actions on Neural Networks and Learning Systems vol 23 no 7pp 1013ndash1027 2012
[29] D Ge X Jiang and Y Ye ldquoA note on the complexity of 119871119901
minimizationrdquo Mathematical Programming vol 129 no 2 pp285ndash299 2011
[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995
[31] J Fan and H Peng ldquoNonconcave penalized likelihood with adiverging number of parametersrdquo The Annals of Statistics vol32 no 3 pp 928ndash961 2004
[32] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoThe Annals of Statistics vol 28 no 5 pp 1356ndash1378 2000
[33] B S Tian and X Q Yang ldquoAn interior-point l 11987112-penalty-
method for nonlinear programmingrdquo Technical ReportDepartment of Applied Mathematics Hong Kong PolytechnicUniversity 2013
[34] L Armijo ldquoMinimization of functions having Lipschitz con-tinuous first partial derivativesrdquo Pacific Journal of Mathematicsvol 16 pp 1ndash3 1966
[35] F John ldquoExtremum problems with inequalities as side-con-ditionsrdquo in Studies and Essays Courant Anniversary Volume pp187ndash1204 1948
[36] D J Newman S Hettich C L Blake and C J Merz ldquoUCIrepository of machine learning databasesrdquo Technical Report9702 Department of Information and Computer Science Uni-verisity of California Irvine Calif USA 1998 httparchiveicsucieduml
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
8 Journal of Applied Mathematics
problem was solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB of RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set the parameters as $\tau_{1} = 0.5$, $l = 0.000001$, and $\gamma_{1} = 100000$, with $\beta = 0.6$ in Algorithm 2 and $\beta = 0.5$ in Algorithm 4. The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the criterion $|w_{j}| / \max_{i}(|w_{i}|) \ge 10^{-4}$ [14] were set to zero. The cardinality of the hyperplane was then computed as the number of nonzero weights.
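As a concrete illustration of this post-processing step, the sketch below prunes a trained weight vector and counts the surviving features. The helper name and example weights are ours, not the paper's code; only the $10^{-4}$ relative threshold and the grid of candidate $C$ values come from the text above.

```python
import numpy as np

# Candidate values of the balance parameter C for 5-fold cross-validation.
C_grid = 2.0 ** np.arange(-10, 11)   # 2^-10, 2^-9, ..., 2^10

def hyperplane_cardinality(w, rel_tol=1e-4):
    """Zero out weights failing |w_j| / max_i |w_i| >= rel_tol; return the
    pruned vector and the cardinality (number of nonzero weights)."""
    w = np.asarray(w, dtype=float)
    scale = np.abs(w).max()
    if scale == 0.0:
        return w, 0
    keep = np.abs(w) / scale >= rel_tol
    return np.where(keep, w, 0.0), int(keep.sum())

# Example: the two tiny weights are pruned, leaving cardinality 2.
w_pruned, card = hyperplane_cardinality([0.8, -0.5, 3e-6, -2e-7])
```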
4.1. Artificial Data. First we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The labels $y = 1$ and $y = -1$ occur with equal probability. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $x_{1}, x_{2}, x_{3}$ were drawn as $x_{i} = y N(i, 1)$ and the second three features $x_{4}, x_{5}, x_{6}$ as $x_{i} = N(0, 1)$; otherwise, the first three were drawn as $x_{i} = N(0, 1)$ and the second three as $x_{i} = y N(i - 3, 1)$. The remaining features are noise, $x_{i} = N(0, 20)$, $i = 7, \ldots, m$, where $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
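Under our reading of this construction (treating the second argument of $N(\cdot, \cdot)$ as the variance and drawing the informative block with probability 0.7), a minimal generator might look as follows; the function name and seeding are illustrative.

```python
import numpy as np

def make_artificial(n, m, rng):
    """Sample n points with m features following the scheme described above."""
    y = rng.choice([-1.0, 1.0], size=n)               # P(y = 1) = P(y = -1)
    X = rng.normal(0.0, np.sqrt(20.0), size=(n, m))   # noise features ~ N(0, 20)
    means = np.array([1.0, 2.0, 3.0])
    informative_first = rng.random(n) < 0.7           # 70% of the samples
    for i in range(n):
        if informative_first[i]:
            X[i, 0:3] = y[i] * rng.normal(means, 1.0)  # x_1..x_3 ~ y N(i, 1)
            X[i, 3:6] = rng.normal(0.0, 1.0, size=3)   # x_4..x_6 ~ N(0, 1)
        else:
            X[i, 0:3] = rng.normal(0.0, 1.0, size=3)
            X[i, 3:6] = y[i] * rng.normal(means, 1.0)  # x_4..x_6 ~ y N(i-3, 1)
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # zero mean, unit std
    return X, y

X_train, y_train = make_artificial(100, 30, np.random.default_rng(0))
```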
In the first experiment we consider the cases with the fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs, $L_{1}$-SVM, $L_{1/2}$-SVM, and $L_{0}$-SVM, achieve sparse solutions, while the $L_{2}$-SVM uses almost all features in each data set. Furthermore, the solutions of the $L_{1/2}$-SVM and the $L_{0}$-SVM are much sparser than those of the $L_{1}$-SVM. As shown in Table 1, the $L_{1}$-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the $L_{1/2}$-SVM and the $L_{0}$-SVM are similar and close to 2. However, when $n = 10$ and $20$, the $L_{0}$-SVM has average cardinalities of 1.43 and 1.87, respectively. This means that the $L_{0}$-SVM sometimes selects only one feature on small-sample data sets and may ignore some truly relevant features. Consequently, with cardinalities between 2.05 and 2.90, the $L_{1/2}$-SVM gives a more reliable solution than the $L_{0}$-SVM. In short, as far as the number of selected features is concerned, the $L_{1/2}$-SVM behaves best among the four methods.
Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The $L_{1}$-SVM has the best prediction performance in all cases, slightly better than the $L_{1/2}$-SVM. The $L_{1/2}$-SVM classifies more accurately than the $L_{2}$-SVM and the $L_{0}$-SVM, especially for $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the $L_{1/2}$-SVM is 88.05%, while the results of the $L_{2}$-SVM and the $L_{0}$-SVM are 84.65% and 77.65%, respectively. Compared with the $L_{2}$-SVM and the $L_{0}$-SVM, the $L_{1/2}$-SVM improves the average accuracy by 3.4% and 10.4%, respectively, which can be explained as follows. In the $L_{2}$-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. In the $L_{0}$-SVM, few features are selected and some relevant features are excluded, which has a negative impact on the prediction results. As the tradeoff between the $L_{2}$-SVM and the $L_{0}$-SVM, the $L_{1/2}$-SVM performs better than both.
The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the $L_{1/2}$-SVM is 0.87% lower than that of the $L_{1}$-SVM, while the $L_{1/2}$-SVM selects 74.14% fewer features than the $L_{1}$-SVM. This indicates that the $L_{1/2}$-SVM achieves a much sparser solution than the $L_{1}$-SVM at little cost in accuracy. Moreover, the average accuracy of the $L_{1/2}$-SVM over the 10 data sets is 2.22% higher than that of the $L_{0}$-SVM, with similar cardinality. To sum up, the $L_{1/2}$-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.
To further evaluate the feature selection performance of the $L_{1/2}$-SVM, we investigate whether the features are correctly selected. Since the $L_{2}$-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_{3}$, $x_{6}$), the best result should have these two features ranked on top according to the absolute values of their weights $|w_{j}|$. In the experiment we select the top 2 features with the maximal $|w_{j}|$ for each method and count the frequency with which the top 2 features are $x_{3}$ and $x_{6}$ over the 30 runs; a small counting sketch follows this paragraph. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of the $L_{1}$-SVM, the $L_{0}$-SVM, and the $L_{1/2}$-SVM are 7, 3, and 9, respectively. As $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the $L_{1/2}$-SVM outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of the $L_{1}$-SVM and the $L_{0}$-SVM are 22 and 25, respectively, while the result of the $L_{1/2}$-SVM is 27. The $L_{1}$-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore the $L_{1}$-SVM is not as good as the $L_{1/2}$-SVM at distinguishing the critical features. The $L_{0}$-SVM has a lower hit frequency than the $L_{1/2}$-SVM, probably because of the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the $L_{1/2}$-SVM is a promising sparsity-driven classification method.
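The hit counts in Table 2 can be reproduced from stored weight vectors with a few lines; the sketch below is ours, with features $x_{3}$ and $x_{6}$ at 0-based indices 2 and 5.

```python
import numpy as np

def top2_hits(weight_vectors, planted={2, 5}):
    """Count runs whose two largest-|w| features are exactly the planted pair."""
    hits = 0
    for w in weight_vectors:
        top2 = set(np.argsort(np.abs(w))[-2:])
        hits += (top2 == planted)
    return hits
```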
In the second simulation we consider cases with various dimensions of the feature space, $m = 20, 40, \ldots, 180, 200$, and a fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are only 6 relevant features, a larger $m$ means more noisy features. Figure 3 (left) shows that, as the dimension increases from 20 to 200, the number of features selected by the $L_{1}$-SVM increases from 8.26 to 23.10, whereas the cardinalities of the $L_{1/2}$-SVM and the $L_{0}$-SVM stay stable (from 2.20 to 2.95). This indicates that the $L_{1/2}$-SVM and the $L_{0}$-SVM are more suitable for feature selection than the $L_{1}$-SVM.

Figure 3 (right) shows that, with the increase of the noise features, the accuracy of the $L_{2}$-SVM drops significantly (from 98.68% to 87.47%).
[Figure 2: Results comparison on the 10 artificial data sets with various training sample sizes. Left: average cardinality versus the number of training samples; right: classification accuracy (%). Curves: $L_{2}$-SVM, $L_{1}$-SVM, $L_{0}$-SVM, and $L_{1/2}$-SVM.]
Table 1: Results comparison on the 10 artificial data sets with various training sample sizes ("n" is the number of training samples, "Card" the number of selected features, and "Acc" the classification accuracy in %).

n        L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
         Card    Acc     Card    Acc     Card    Acc     Card    Acc
10       29.97   84.65    6.93   88.55    1.43   77.65    2.25   88.05
20       30.00   92.21    8.83   96.79    1.87   93.17    2.35   95.50
30       30.00   94.27    8.47   98.23    2.03   94.67    2.05   97.41
40       29.97   95.99    9.00   98.75    2.07   95.39    2.10   97.75
50       30.00   96.61    9.50   98.91    2.13   96.57    2.45   97.68
60       29.97   97.25   10.00   98.94    2.20   97.61    2.45   97.98
70       30.00   97.41   11.00   98.98    2.20   97.56    2.40   98.23
80       30.00   97.68   10.03   99.09    2.23   97.61    2.60   98.50
90       29.97   97.89    9.20   99.21    2.30   97.41    2.90   98.30
100      30.00   98.13   10.27   99.09    2.50   97.93    2.55   98.36
Average  29.99   95.21    9.32   97.65    2.10   94.56    2.41   96.78
Table 2: Frequency with which the two most important features (x3, x6) are ranked in the top 2 over 30 runs.

n          10  20  30  40  50  60  70  80  90  100
L1-SVM      7  14  16  18  22  23  20  22  21   22
L0-SVM      3  11  12  15  19  22  22  23  22   25
L1/2-SVM    9  16  19  24  24  23  27  26  26   27
Table 3: Average results over the 10 artificial data sets with various dimensions.

             L2-SVM   L1-SVM   L0-SVM   L1/2-SVM
Cardinality  109.95    15.85     2.38       2.41
Accuracy      92.93    99.07    97.76      98.26
Table 4: Feature selection performance comparison on the UCI data sets (cardinality; mean ± std over the 10 repetitions).

No.  UCI data set    n     m     L2-SVM           L1-SVM           L0-SVM           L1/2-SVM
1    Pima diabetes   768   8       8.00 ± 0.00      7.40 ± 0.89      6.20 ± 1.10      7.60 ± 1.52
2    Breast cancer   683   9       9.00 ± 0.00      8.60 ± 1.22      6.40 ± 2.07      8.40 ± 1.67
3    Wine (3)        178   13     13.00 ± 0.00      9.73 ± 0.69      3.73 ± 0.28      4.07 ± 0.49
4    Image (7)       2100  19     19.00 ± 0.00      8.20 ± 0.53      3.66 ± 0.97      5.11 ± 0.62
5    SPECT           267   22     22.00 ± 0.00     17.40 ± 5.27      9.80 ± 4.27      9.00 ± 1.17
6    WDBC            569   30     30.00 ± 0.00      9.80 ± 1.52      9.00 ± 3.03      5.60 ± 2.05
7    Ionosphere      351   34     33.20 ± 0.45     25.00 ± 3.35     24.60 ± 10.21    21.80 ± 8.44
8    SPECTF          267   44     44.00 ± 0.00     38.80 ± 10.98    24.00 ± 11.32    25.60 ± 7.75
9    Sonar           208   60     60.00 ± 0.00     31.40 ± 13.05    27.00 ± 7.18     13.60 ± 10.06
10   Musk1           476   166   166.00 ± 0.00     85.80 ± 18.85    48.60 ± 4.28     41.00 ± 14.16
Average cardinality        40.50  40.42            24.21            16.30            14.18
Table 5: Classification performance comparison on the UCI data sets (accuracy in %; mean ± std over the 10 repetitions).

No.  UCI data set    L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
1    Pima diabetes   76.99 ± 3.47    76.73 ± 0.85    76.73 ± 2.94    77.12 ± 1.52
2    Breast cancer   96.18 ± 2.23    97.06 ± 0.96    96.32 ± 1.38    96.47 ± 1.59
3    Wine (3)        97.71 ± 3.73    98.29 ± 1.28    93.14 ± 2.39    97.14 ± 2.56
4    Image (7)       88.94 ± 1.49    91.27 ± 1.77    86.67 ± 4.58    91.36 ± 1.94
5    SPECT           83.40 ± 2.15    78.49 ± 2.31    78.11 ± 3.38    82.34 ± 5.10
6    WDBC            96.11 ± 2.11    96.46 ± 1.31    95.58 ± 1.15    96.46 ± 1.31
7    Ionosphere      83.43 ± 5.44    88.00 ± 3.70    84.57 ± 5.38    87.43 ± 5.94
8    SPECTF          78.49 ± 3.91    74.34 ± 4.50    73.87 ± 3.25    78.49 ± 1.89
9    Sonar           73.66 ± 5.29    75.61 ± 5.82    74.15 ± 7.98    76.10 ± 4.35
10   Musk1           84.84 ± 1.73    82.11 ± 2.42    76.00 ± 5.64    82.32 ± 3.38
Average accuracy     85.97           85.84           83.51           86.52
On the contrary, the accuracy of the other three sparse SVMs changes very little, which shows that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the solution of the $L_{1/2}$-SVM is much sparser than that of the $L_{1}$-SVM, with slightly better accuracy than the $L_{0}$-SVM.
4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the $L_{1/2}$-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then the data was randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
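The evaluation protocol just described can be sketched as follows; `train_binary_svm` stands in for the $L_{1/2}$-SVM interior point solver and, like the helper names, is an assumption of ours rather than the paper's code.

```python
import numpy as np

def standardize_and_split(X, y, train_frac=0.8, rng=None):
    """Normalize each feature to zero mean/unit variance, then split 80/20."""
    rng = rng or np.random.default_rng()
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

def one_against_rest(X_tr, y_tr, X_te, train_binary_svm):
    """Train one binary classifier per class; predict by the largest score."""
    classes = np.unique(y_tr)
    scores = np.empty((len(X_te), len(classes)))
    for k, c in enumerate(classes):
        y_bin = np.where(y_tr == c, 1.0, -1.0)
        w, b = train_binary_svm(X_tr, y_bin)   # e.g., the L1/2-SVM solver
        scores[:, k] = X_te @ w + b
    return classes[np.argmax(scores, axis=1)]
```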
Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here $n$ is the number of samples and $m$ is the number of input features. For the two multiclass data sets, the number of classes is marked behind the name. Sparsity is defined as card/$m$, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are bolded.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while retaining roughly the same accuracy as the $L_{2}$-SVM. Among the three sparse methods, the $L_{1/2}$-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, while the $L_{1}$-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the $L_{0}$-SVM has the lowest average classification accuracy (83.51%).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10) the $L_{1/2}$-SVM has the best performance among the three sparse SVMs both in feature selection and in classification. Compared with the $L_{1}$-SVM, the $L_{1/2}$-SVM achieves a sparser solution in nine data sets.
[Figure 3: Results comparison on the artificial data sets with various dimensions. Left: average cardinality versus the feature dimension; right: classification accuracy (%). Curves: $L_{2}$-SVM, $L_{1}$-SVM, $L_{0}$-SVM, and $L_{1/2}$-SVM.]
For example, in the data set "8 SPECTF", the features selected by the $L_{1/2}$-SVM are 34% fewer than those of the $L_{1}$-SVM; at the same time, the accuracy of the $L_{1/2}$-SVM is 4.1% higher than that of the $L_{1}$-SVM. In most data sets (4, 5, 6, 9, 10), the cardinality of the $L_{1/2}$-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of the $L_{1/2}$-SVM decreased, by 1.1%, but the sparsity provided by the $L_{1/2}$-SVM is a 58.2% improvement over the $L_{1}$-SVM. In the remaining three data sets (1, 2, 7), the two methods give similar results in feature selection and classification. As seen above, the $L_{1/2}$-SVM provides a lower dimensional representation than the $L_{1}$-SVM with competitive prediction performance.

Figure 4 (right) shows that, compared with the $L_{0}$-SVM, the $L_{1/2}$-SVM improves the classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of the $L_{1/2}$-SVM is at least 4.0% higher than that of the $L_{0}$-SVM. In particular, in the data set "10 Musk1" the $L_{1/2}$-SVM gives a 6.3% rise in accuracy over the $L_{0}$-SVM. Meanwhile, it can be observed from Figure 4 (left) that the $L_{1/2}$-SVM selects fewer features than the $L_{0}$-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of the $L_{1/2}$-SVM are 37.8% and 49.6% less than those of the $L_{0}$-SVM, respectively. In summary, the $L_{1/2}$-SVM presents better classification performance than the $L_{0}$-SVM while being effective in choosing relevant features.

[Figure 4: Results comparison on the UCI data sets. Left: sparsity of each classifier by data set number; right: classification accuracy (%). Curves: $L_{2}$-SVM, $L_{1}$-SVM, $L_{0}$-SVM, and $L_{1/2}$-SVM.]
5. Conclusions

In this paper we proposed an $L_{1/2}$ regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the $L_{1/2}$-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure and is relatively easy to develop numerical methods for. By the use of this reformulation, we proposed an interior point method and established its convergence. Our numerical results support the reformulation and the proposed method. The $L_{1/2}$-SVM obtains sparser solutions than the $L_{1}$-SVM with comparable classification accuracy. Furthermore, the $L_{1/2}$-SVM achieves more accurate classification results than the $L_{0}$-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the $L_{1/2}$-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear $L_{1/2}$-SVM, and exploring various applications of the $L_{1/2}$-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.
Appendix

A. Properties of the Reformulation of the $L_{1/2}$-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$. Since the constraint functions of (10) are all convex, we immediately have the following lemma.
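For reference, the standard definitions of these two cones read as follows in our notation (this is a paraphrase; the paper's formal definitions accompany problem (10) in the main text):
$$
FD(z, D) = \{ d \mid z + \alpha d \in D \ \text{for all sufficiently small } \alpha > 0 \}, \qquad
LFD(z, D) = \{ d \mid \nabla g_{i}(z)^{T} d \ge 0 \ \text{for every } i \ \text{with } g_{i}(z) = 0 \},
$$
where $g_{i}(z) \ge 0$, $i = 1, 2, \ldots, 3m + 2n$, denote the constraints of (10).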
Lemma A.1. For any feasible point $z$ of (10), one has
$$
FD(z, D) = LFD(z, D).
\tag{A.1}
$$
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
$$
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q)
={}& C \sum_{i=1}^{n} \xi_{i} + \sum_{j=1}^{m} t_{j}
- \sum_{j=1}^{m} \lambda_{j} \left( t_{j}^{2} - w_{j} \right)
- \sum_{j=1}^{m} \mu_{j} \left( t_{j}^{2} + w_{j} \right) \\
&- \sum_{j=1}^{m} \nu_{j} t_{j}
- \sum_{i=1}^{n} p_{i} \left( y_{i} w^{T} x_{i} + y_{i} b + \xi_{i} - 1 \right)
- \sum_{i=1}^{n} q_{i} \xi_{i},
\end{aligned}
\tag{A.2}
$$
where $\lambda, \mu, \nu, p, q$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let $(w, t, \xi, b)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, q) \in \mathbb{R}^{m+m+m+n+n}$ such that the following KKT conditions hold:
$$
\begin{gathered}
\frac{\partial L}{\partial w} = \lambda - \mu - \sum_{i=1}^{n} p_{i} y_{i} x_{i} = 0, \\
\frac{\partial L}{\partial t_{j}} = 1 - 2 \lambda_{j} t_{j} - 2 \mu_{j} t_{j} - \nu_{j} = 0, \quad j = 1, 2, \ldots, m, \\
\frac{\partial L}{\partial \xi_{i}} = C - p_{i} - q_{i} = 0, \quad i = 1, 2, \ldots, n, \\
\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_{i} y_{i} = 0, \\
\lambda_{j} \ge 0, \quad t_{j}^{2} - w_{j} \ge 0, \quad \lambda_{j} \left( t_{j}^{2} - w_{j} \right) = 0, \quad j = 1, 2, \ldots, m, \\
\mu_{j} \ge 0, \quad t_{j}^{2} + w_{j} \ge 0, \quad \mu_{j} \left( t_{j}^{2} + w_{j} \right) = 0, \quad j = 1, 2, \ldots, m, \\
\nu_{j} \ge 0, \quad t_{j} \ge 0, \quad \nu_{j} t_{j} = 0, \quad j = 1, 2, \ldots, m, \\
p_{i} \ge 0, \quad y_{i} \left( w^{T} x_{i} + b \right) + \xi_{i} - 1 \ge 0, \quad p_{i} \left( y_{i} \left( w^{T} x_{i} + b \right) + \xi_{i} - 1 \right) = 0, \quad i = 1, 2, \ldots, n, \\
q_{i} \ge 0, \quad \xi_{i} \ge 0, \quad q_{i} \xi_{i} = 0, \quad i = 1, 2, \ldots, n.
\end{gathered}
\tag{A.3}
$$
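A direct numerical check of these conditions is sometimes useful when debugging a solver. The sketch below is our own helper (it is not the residual function Res of Section 3); it returns the largest violation of the stationarity and complementarity equations in (A.3) for given primal variables and multipliers, using the paper's reported parameters only through $C$.

```python
import numpy as np

def kkt_residual(w, t, xi, b, lam, mu, nu, p, q, X, y, C):
    """Max-norm violation of the stationarity/complementarity parts of (A.3).

    Shapes: X is n-by-m; y, xi, p, q have length n; w, t, lam, mu, nu length m.
    """
    parts = [
        lam - mu - X.T @ (p * y),          # dL/dw = 0
        1.0 - 2.0 * (lam + mu) * t - nu,   # dL/dt = 0
        C - p - q,                         # dL/dxi = 0
        np.array([p @ y]),                 # dL/db = 0
        lam * (t**2 - w),                  # complementarity terms
        mu * (t**2 + w),
        nu * t,
        p * (y * (X @ w + b) + xi - 1.0),
        q * xi,
    ]
    return max(np.max(np.abs(v)) for v in parts)
```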
The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant $c > 0$, the level set
$$
W_{c} = \{ z = (w, t, \xi) \mid f(z) \le c \} \cap D
\tag{A.4}
$$
is bounded.

Proof. For any $(w, t, \xi) \in W_{c}$, we have
$$
\sum_{i=1}^{m} t_{i} + C \sum_{j=1}^{n} \xi_{j} \le c.
\tag{A.5}
$$
Combining this with $t_{i} \ge 0$, $i = 1, \ldots, m$, and $\xi_{j} \ge 0$, $j = 1, \ldots, n$, we have
$$
0 \le t_{i} \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_{j} \le \frac{c}{C}, \quad j = 1, \ldots, n.
\tag{A.6}
$$
Moreover, for any $(w, t, \xi) \in D$,
$$
\left| w_{i} \right| \le t_{i}^{2}, \quad i = 1, \ldots, m.
\tag{A.7}
$$
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
$$
y_{i} \left( w^{T} x_{i} + b \right) \ge 1 - \xi_{i}, \quad i = 1, \ldots, n,
\tag{A.8}
$$
we have
$$
w^{T} x_{i} + b \ge 1 - \xi_{i} \quad \text{if } y_{i} = 1, \qquad
w^{T} x_{i} + b \le -1 + \xi_{i} \quad \text{if } y_{i} = -1.
\tag{A.9}
$$
Thus, if the feasible region $D$ is not empty, then we have
$$
\max_{y_{i} = 1} \left( 1 - \xi_{i} - w^{T} x_{i} \right) \le b \le \min_{y_{i} = -1} \left( -1 + \xi_{i} - w^{T} x_{i} \right).
\tag{A.10}
$$
Hence $b$ is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite and hence ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.
Proof. By an elementary deduction, we have, for any $v^{(1)} \in \mathbb{R}^{m}$, $v^{(2)} \in \mathbb{R}^{n}$, $v^{(3)} \in \mathbb{R}^{m}$, and $v^{(4)} \in \mathbb{R}$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \neq 0$,
$$
\begin{aligned}
&\left( v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T} \right) S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
= \sum_{r=1}^{4} \sum_{s=1}^{4} v^{(r)T} S_{rs} v^{(s)} \\
&\quad = v^{(1)T} \left( U + V + X^{T} Y^{T} P^{-1} D_{4} Y X \right) v^{(1)}
+ v^{(1)T} X^{T} Y^{T} P^{-1} D_{4} v^{(2)}
- 2 v^{(1)T} (U - V)^{T} v^{(3)}
+ v^{(1)T} X^{T} Y^{T} P^{-1} D_{4} y \, v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_{4} Y X v^{(1)}
+ v^{(2)T} \left( P^{-1} D_{4} + \Xi^{-1} D_{5} \right) v^{(2)}
+ v^{(2)T} P^{-1} D_{4} y \, v^{(4)} \\
&\qquad - 2 v^{(3)T} (U - V)^{T} v^{(1)}
+ v^{(3)T} \left( 4 T (U + V) T + T^{-1} D_{3} - 2 (D_{1} + D_{2}) \right) v^{(3)} \\
&\qquad + v^{(4)T} y^{T} P^{-1} D_{4} Y X v^{(1)}
+ v^{(4)T} y^{T} P^{-1} D_{4} v^{(2)}
+ v^{(4)T} y^{T} P^{-1} D_{4} y \, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4 v^{(3)T} T U v^{(1)} + 4 v^{(3)T} T U T v^{(3)}
+ v^{(1)T} V v^{(1)} + 4 v^{(3)T} T V v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_{5} v^{(2)}
+ v^{(3)T} \left( T^{-1} D_{3} - 2 D_{1} - 2 D_{2} \right) v^{(3)}
+ \left( Y X v^{(1)} + v^{(2)} + v^{(4)} y \right)^{T} P^{-1} D_{4} \left( Y X v^{(1)} + v^{(2)} + v^{(4)} y \right) \\
&\quad = e_{m}^{T} U \left( D_{v_{1}} - 2 T D_{v_{3}} \right)^{2} e_{m}
+ e_{m}^{T} V \left( D_{v_{1}} + 2 T D_{v_{3}} \right)^{2} e_{m}
+ v^{(2)T} \Xi^{-1} D_{5} v^{(2)} \\
&\qquad + v^{(3)T} \left( T^{-1} D_{3} - 2 D_{1} - 2 D_{2} \right) v^{(3)}
+ \left( Y X v^{(1)} + v^{(2)} + v^{(4)} y \right)^{T} P^{-1} D_{4} \left( Y X v^{(1)} + v^{(2)} + v^{(4)} y \right) \\
&\quad > 0,
\end{aligned}
\tag{B.1}
$$
where $D_{v_{1}} = \operatorname{diag}(v^{(1)})$, $D_{v_{2}} = \operatorname{diag}(v^{(2)})$, $D_{v_{3}} = \operatorname{diag}(v^{(3)})$, and $D_{v_{4}} = \operatorname{diag}(v^{(4)})$.
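The last equality in (B.1) is a diagonal completing of squares; written out for the $U$-terms (our elaboration, with $u_{j}$ and $t_{j}$ the diagonal entries of $U$ and $T$):
$$
v^{(1)T} U v^{(1)} - 4 v^{(3)T} T U v^{(1)} + 4 v^{(3)T} T U T v^{(3)}
= \sum_{j=1}^{m} u_{j} \left( v^{(1)}_{j} - 2 t_{j} v^{(3)}_{j} \right)^{2}
= e_{m}^{T} U \left( D_{v_{1}} - 2 T D_{v_{3}} \right)^{2} e_{m},
$$
and analogously for the $V$-terms with the opposite sign inside the square. Every summand in the final expression of (B.1) is then nonnegative, with inequality (26) supplying the positivity of the $T^{-1} D_{3} - 2 D_{1} - 2 D_{2}$ term, which is precisely its role in the proposition.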
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^{k} = 0$, then (23) holds for all $\alpha_{k} \ge 0$.

Suppose $\Delta z^{k} \neq 0$. We have
$$
\nabla_{w} \Phi_{\mu}(z^{k})^{T} \Delta w^{k} + \nabla_{t} \Phi_{\mu}(z^{k})^{T} \Delta t^{k}
+ \nabla_{\xi} \Phi_{\mu}(z^{k})^{T} \Delta \xi^{k} + \nabla_{b} \Phi_{\mu}(z^{k})^{T} \Delta b^{k}
= - \left( \Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^{k} \right) S \begin{pmatrix} \Delta w^{k} \\ \Delta \xi^{k} \\ \Delta t^{k} \\ \Delta b^{k} \end{pmatrix}.
\tag{B.2}
$$
Since the matrix $S$ is positive definite and $\Delta z^{k} \neq 0$, the last equation implies
$$
\nabla_{w} \Phi_{\mu}(z^{k})^{T} \Delta w^{k} + \nabla_{t} \Phi_{\mu}(z^{k})^{T} \Delta t^{k}
+ \nabla_{\xi} \Phi_{\mu}(z^{k})^{T} \Delta \xi^{k} + \nabla_{b} \Phi_{\mu}(z^{k})^{T} \Delta b^{k} < 0.
\tag{B.3}
$$
Consequently, there exists an $\bar{\alpha}_{k} \in (0, \hat{\alpha}_{k}]$ such that the fourth inequality in (23) is satisfied for all $\alpha_{k} \in (0, \bar{\alpha}_{k}]$.

On the other hand, since $(x^{k}, t^{k})$ is strictly feasible, the point $(x^{k}, t^{k}) + \alpha (\Delta x^{k}, \Delta t^{k})$ will be feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
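For intuition, condition (23) combines an Armijo-type sufficient-decrease test [34] with strict feasibility of the trial point. A generic backtracking loop of this shape, using the paper's reported $\beta$ and $\tau_{1}$ as defaults, looks as follows; this is our sketch with illustrative names (the exact test is condition (23) in Section 3), and `z`, `dz` are assumed to be NumPy arrays.

```python
def backtracking(phi, dphi_dot_dz, z, dz, strictly_feasible, beta=0.6, tau1=0.5):
    """Shrink the step by beta until the trial point is strictly feasible and
    passes the Armijo test. Assumes dz is a descent direction (Lemma 3), so
    the loop terminates."""
    alpha, phi0 = 1.0, phi(z)
    while True:
        trial = z + alpha * dz
        if strictly_feasible(trial) and phi(trial) <= phi0 + tau1 * alpha * dphi_dot_dz:
            return alpha, trial
        alpha *= beta  # try a shorter step
```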
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).
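Throughout, it helps to keep in mind that the barrier subproblem (14) has the usual logarithmic-barrier shape; in our paraphrase (see Section 3 for the precise statement),
$$
\min_{z} \ \Phi_{\mu}(z) = f(z) - \mu \sum_{i=1}^{3m+2n} \log g_{i}(z), \qquad
f(z) = \sum_{j=1}^{m} t_{j} + C \sum_{i=1}^{n} \xi_{i},
$$
so $\Phi_{\mu}(z^{k}) \to \infty$ whenever some $g_{i}(z^{k}) \downarrow 0$ while $f$ stays bounded below; this is the mechanism exploited in the proof of Lemma C.1 below.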
Lemma C.1. Let $\{(z^{k}, \lambda^{k})\}$ be generated by Algorithm 2. Then the $z^{k}$ are strictly feasible for problem (10), and the Lagrangian multipliers $\lambda^{k}$ are bounded from above.

Proof. For the sake of convenience, we use $g_{i}(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that the $z^{k}$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \ldots, 3m + 2n\}$ such that $\{g_{i}(z^{k})\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_{\mu}(z)$ and since $f(z) = \sum_{j=1}^{m} t_{j} + C \sum_{i=1}^{n} \xi_{i}$ is bounded from below on the feasible set, it must hold that $\{\Phi_{\mu}(z^{k})\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_{\mu}(z^{k})\}$ is decreasing. So we get a contradiction. Consequently, for any $i \in \{1, 2, \ldots, 3m + 2n\}$, $g_{i}(z^{k})$ is bounded away from zero. The boundedness of $\lambda^{k}$ then follows from (24)–(27).
Lemma C.2. Let $\{(z^{k}, \lambda^{k})\}$ and $\{(\Delta z^{k}, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^{k}\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^{k}\}$, then the sequence $\{(\Delta z^{k}, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again we use $g_{i}(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^{k}, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^{k}\}_{\mathcal{K}'} \to z^{*}$. It follows from Lemma C.1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^{k}\}_{\bar{\mathcal{K}}} \to \lambda^{*}$ and $g_{i}(z^{*}) > 0$ for all $i \in \{1, 2, \ldots, 3m + 2n\}$. Thus we have
$$
S_{k} \longrightarrow S_{*}
\tag{C.1}
$$
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B.1, $S_{*}$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^{k}, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let $\{(z^{k}, \lambda^{k})\}$ and $\{(\Delta z^{k}, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^{k}\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^{k}\}$, then one has $\{\Delta z^{k}\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^{k}\}_{\mathcal{K}}$ is bounded. Suppose on the contrary that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^{k}\}_{\mathcal{K}'} \to \Delta z^{*} \neq 0$. Since the subsequences $\{z^{k}\}_{\mathcal{K}'}$, $\{\lambda^{k}\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^{k+1}\}_{\mathcal{K}'}$ are all bounded, there are points $z^{*}$, $\lambda^{*}$, and $\hat{\lambda}^{*}$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^{k}\}_{\bar{\mathcal{K}}} \to z^{*}$, $\{\lambda^{k}\}_{\bar{\mathcal{K}}} \to \lambda^{*}$, and $\{\hat{\lambda}^{k+1}\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^{*}$. By Lemma C.1, we have $g_{i}(z^{*}) > 0$ for all $i \in \{1, 2, \ldots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$
\nabla_{w} \Phi_{\mu}(z^{*})^{T} \Delta w^{*} + \nabla_{t} \Phi_{\mu}(z^{*})^{T} \Delta t^{*}
+ \nabla_{\xi} \Phi_{\mu}(z^{*})^{T} \Delta \xi^{*} + \nabla_{b} \Phi_{\mu}(z^{*})^{T} \Delta b^{*}
= - \left( \Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^{*} \right) S \begin{pmatrix} \Delta w^{*} \\ \Delta \xi^{*} \\ \Delta t^{*} \\ \Delta b^{*} \end{pmatrix} < 0.
\tag{C.2}
$$
Since $g_{i}(z^{*}) > 0$ for all $i \in \{1, 2, \ldots, 3m + 2n\}$, there exists an $\hat{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \hat{\alpha}]$,
$$
g_{i}(z^{*} + \alpha \Delta z^{*}) > 0, \quad \forall i \in \{1, 2, \ldots, 3m + 2n\}.
\tag{C.3}
$$
Taking into account $\tau_{1} \in (0, 1/2)$, we claim that there exists an $\bar{\alpha} \in (0, \hat{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \bar{\alpha}]$:
$$
\Phi_{\mu}(z^{*} + \alpha \Delta z^{*}) - \Phi_{\mu}(z^{*}) \le 1.1 \, \tau_{1} \alpha \nabla_{z} \Phi_{\mu}(z^{*})^{T} \Delta z^{*}.
\tag{C.4}
$$
Let $m^{*} = \min \{ j \mid \beta^{j} \in (0, \bar{\alpha}], \ j = 0, 1, \ldots \}$ and $\alpha^{*} = \beta^{m^{*}}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, 2, \ldots, 3m + 2n\}$:
$$
g_{i}(z^{k} + \alpha^{*} \Delta z^{k}) > 0.
\tag{C.5}
$$
Moreover, we have
$$
\begin{aligned}
\Phi_{\mu}(z^{k} + \alpha^{*} \Delta z^{k}) - \Phi_{\mu}(z^{k})
&= \Phi_{\mu}(z^{*} + \alpha^{*} \Delta z^{*}) - \Phi_{\mu}(z^{*})
+ \Phi_{\mu}(z^{k} + \alpha^{*} \Delta z^{k}) - \Phi_{\mu}(z^{*} + \alpha^{*} \Delta z^{*})
+ \Phi_{\mu}(z^{*}) - \Phi_{\mu}(z^{k}) \\
&\le 1.05 \, \tau_{1} \alpha^{*} \nabla_{z} \Phi_{\mu}(z^{*})^{T} \Delta z^{*} \\
&\le \tau_{1} \alpha^{*} \nabla_{z} \Phi_{\mu}(z^{k})^{T} \Delta z^{k}.
\end{aligned}
\tag{C.6}
$$
By the backtracking line search rule, the last inequality together with (C.5) yields $\alpha_{k} \ge \alpha^{*}$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C.3) that
$$
\begin{aligned}
\Phi_{\mu}(z^{k+1}) &\le \Phi_{\mu}(z^{k}) + \tau_{1} \alpha_{k} \nabla_{z} \Phi_{\mu}(z^{k})^{T} \Delta z^{k}
\le \Phi_{\mu}(z^{k}) + \tau_{1} \alpha^{*} \nabla_{z} \Phi_{\mu}(z^{k})^{T} \Delta z^{k} \\
&\le \Phi_{\mu}(z^{k}) + \frac{1}{2} \tau_{1} \alpha^{*} \nabla_{z} \Phi_{\mu}(z^{*})^{T} \Delta z^{*}.
\end{aligned}
\tag{C.7}
$$
This shows that $\{\Phi_{\mu}(z^{k})\}_{\bar{\mathcal{K}}} \to -\infty$, contradicting the fact that $\{\Phi_{\mu}(z^{k})\}$ is bounded from below. The proof is complete.
We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

Proof. Let $(z^{*}, \hat{\lambda}^{*})$ be a limit point of $\{(z^{k}, \hat{\lambda}^{k})\}$, and let the subsequence $\{(z^{k}, \hat{\lambda}^{k})\}_{\mathcal{K}}$ converge to $(z^{*}, \hat{\lambda}^{*})$. Recall that $(\Delta z^{k}, \hat{\lambda}^{k})$ solves (18) with $(z, \lambda) = (z^{k}, \lambda^{k})$. Taking limits on both sides of (18) with $(z, \lambda) = (z^{k}, \lambda^{k})$ as $k \to \infty$ with $k \in \mathcal{K}$, by the use of Lemmas C.1 and C.3 we obtain
$$
\begin{gathered}
\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^{T} Y^{T} \hat{\lambda}^{*(4)} = 0, \qquad
C e_{n} - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0, \\
e_{m} - 2 T_{*} \hat{\lambda}^{*(1)} - 2 T_{*} \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \qquad
y^{T} \hat{\lambda}^{*(4)} = 0, \\
\left( T_{*}^{2} - W_{*} \right) \hat{\lambda}^{*(1)} - \mu e_{m} = 0, \qquad
\left( T_{*}^{2} + W_{*} \right) \hat{\lambda}^{*(2)} - \mu e_{m} = 0, \\
T_{*} \hat{\lambda}^{*(3)} - \mu e_{m} = 0, \qquad
P_{*} \hat{\lambda}^{*(4)} - \mu e_{n} = 0, \qquad
\Xi_{*} \hat{\lambda}^{*(5)} - \mu e_{n} = 0.
\end{gathered}
\tag{C.8}
$$
This shows that $(z^{*}, \hat{\lambda}^{*})$ satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^{j}, \hat{\lambda}^{j})\}_{\mathcal{J}}$ converges to some point $(z^{*}, \hat{\lambda}^{*})$. It is clear that $z^{*}$ is a feasible point of (10). Since $\mu_{j}, \epsilon \mu_{j} \to 0$, it follows from (29) and (30) that
$$
\mathrm{Res}\left( z^{j}, \hat{\lambda}^{j}, \mu_{j} \right) \stackrel{\mathcal{J}}{\longrightarrow} \mathrm{Res}\left( z^{*}, \hat{\lambda}^{*}, 0 \right) = 0, \qquad
\hat{\lambda}^{j} \stackrel{\mathcal{J}}{\longrightarrow} \hat{\lambda}^{*} \ge 0.
\tag{C.9}
$$
Consequently, $(z^{*}, \hat{\lambda}^{*})$ satisfies the KKT conditions (13).

(ii) Let $\xi_{j} = \max \{ \| \hat{\lambda}^{j} \|_{\infty}, 1 \}$ and $\tilde{\lambda}^{j} = \xi_{j}^{-1} \hat{\lambda}^{j}$. Obviously, $\{\tilde{\lambda}^{j}\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\tilde{\lambda}^{j}\}_{\mathcal{J}'} \to \tilde{\lambda} \neq 0$ and $\| \tilde{\lambda}^{j} \|_{\infty} = 1$ for large $j \in \mathcal{J}'$. From (30) we know that $\tilde{\lambda} \ge 0$. Dividing both sides of the first equality of (18) with $(z, \lambda) = (z^{j}, \lambda^{j})$ by $\xi_{j}$ and taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
$$
\begin{gathered}
\tilde{\lambda}^{(1)} - \tilde{\lambda}^{(2)} - X^{T} Y^{T} \tilde{\lambda}^{(4)} = 0, \qquad
\tilde{\lambda}^{(4)} + \tilde{\lambda}^{(5)} = 0, \\
2 T_{*} \tilde{\lambda}^{(1)} + 2 T_{*} \tilde{\lambda}^{(2)} + \tilde{\lambda}^{(3)} = 0, \qquad
y^{T} \tilde{\lambda}^{(4)} = 0, \\
\left( T_{*}^{2} - W_{*} \right) \tilde{\lambda}^{(1)} = 0, \qquad
\left( T_{*}^{2} + W_{*} \right) \tilde{\lambda}^{(2)} = 0, \\
T_{*} \tilde{\lambda}^{(3)} = 0, \qquad
P_{*} \tilde{\lambda}^{(4)} = 0, \qquad
\Xi_{*} \tilde{\lambda}^{(5)} = 0.
\end{gathered}
\tag{C.10}
$$
Since $z^{*}$ is feasible, the above equations show that $z^{*}$ is a Fritz John point of problem (10).
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Natural Science Foundation of China (Grants 11371154, 11071087, 11271069, and 61103202).
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers - robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point L1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Applied Mathematics 9
0 20 40 60 80 1000
5
10
15
20
25
30
Number of training samples
Card
inal
ity
0 20 40 60 80 10075
80
85
90
95
100
Number of training samples
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 2 Results comparison on 10 artificial data sets with various training samples
Table 1 Results comparison on 10 artificial data sets with various training sample sizes
1198991198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Card Acc Card Acc Card Acc Card Acc10 2997 8465 693 8855 143 7765 225 880520 3000 9221 883 9679 187 9317 235 955030 3000 9427 847 9823 203 9467 205 974140 2997 9599 900 9875 207 9539 210 977550 3000 9661 950 9891 213 9657 245 976860 2997 9725 1000 9894 220 9761 245 979870 3000 9741 1100 9898 220 9756 240 982380 3000 9768 1003 9909 223 9761 260 985090 2997 9789 920 9921 230 9741 290 9830100 3000 9813 1027 9909 250 9793 255 9836On average 2999 9521 932 9765 210 9456 241 9678ldquo119899rdquo is the number of training samples ldquoCardrdquo represents the number of selected features and ldquoAccrdquo is the classification accuracy
Table 2: Frequency with which the most important features (x3, x6) are ranked in the top 2 over 30 runs.

n           10   20   30   40   50   60   70   80   90   100
L1-SVM       7   14   16   18   22   23   20   22   21    22
L0-SVM       3   11   12   15   19   22   22   23   22    25
L1/2-SVM     9   16   19   24   24   23   27   26   26    27
Table 3 Average results over 10 artificial data sets with varies dimension
1198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Cardinality 10995 1585 238 241Accuracy 9293 9907 9776 9826
Table 4: Feature selection performance comparison on UCI data sets (cardinality).

No.  UCI data set      n     m    L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                                  Mean     Std    Mean     Std    Mean     Std    Mean     Std
1    Pima diabetes    768     8     8.00   0.00    7.40   0.89    6.20    1.10    7.60    1.52
2    Breast cancer    683     9     9.00   0.00    8.60   1.22    6.40    2.07    8.40    1.67
3    Wine (3)         178    13    13.00   0.00    9.73   0.69    3.73    0.28    4.07    0.49
4    Image (7)       2100    19    19.00   0.00    8.20   0.53    3.66    0.97    5.11    0.62
5    SPECT            267    22    22.00   0.00   17.40   5.27    9.80    4.27    9.00    1.17
6    WDBC             569    30    30.00   0.00    9.80   1.52    9.00    3.03    5.60    2.05
7    Ionosphere       351    34    33.20   0.45   25.00   3.35   24.60   10.21   21.80    8.44
8    SPECTF           267    44    44.00   0.00   38.80  10.98   24.00   11.32   25.60    7.75
9    Sonar            208    60    60.00   0.00   31.40  13.05   27.00    7.18   13.60   10.06
10   Musk1            476   166   166.00   0.00   85.80  18.85   48.60    4.28   41.00   14.16
Average cardinality      (m: 40.50)  40.42          24.21           16.30           14.18
Table 5: Classification performance comparison on UCI data sets (accuracy, %).

No.  UCI data set     L2-SVM          L1-SVM          L0-SVM          L1/2-SVM
                      Mean    Std     Mean    Std     Mean    Std     Mean    Std
1    Pima diabetes    76.99   3.47    76.73   0.85    76.73   2.94    77.12   1.52
2    Breast cancer    96.18   2.23    97.06   0.96    96.32   1.38    96.47   1.59
3    Wine (3)         97.71   3.73    98.29   1.28    93.14   2.39    97.14   2.56
4    Image (7)        88.94   1.49    91.27   1.77    86.67   4.58    91.36   1.94
5    SPECT            83.40   2.15    78.49   2.31    78.11   3.38    82.34   5.10
6    WDBC             96.11   2.11    96.46   1.31    95.58   1.15    96.46   1.31
7    Ionosphere       83.43   5.44    88.00   3.70    84.57   5.38    87.43   5.94
8    SPECTF           78.49   3.91    74.34   4.50    73.87   3.25    78.49   1.89
9    Sonar            73.66   5.29    75.61   5.82    74.15   7.98    76.10   4.35
10   Musk1            84.84   1.73    82.11   2.42    76.00   5.64    82.32   3.38
Average accuracy      85.97           85.84           83.51           86.52
Compared with the L2-SVM, the sparse SVMs use far fewer features, yet there is little change in the accuracy. This reveals that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, the L1/2-SVM yields much sparser solutions than the L1-SVM and slightly better accuracy than the L0-SVM.

Figure 3: Results comparison on artificial data sets with various dimensions (left: cardinality versus dimension; right: accuracy (%) versus dimension; curves for L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM).
4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for the L1/2-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (wine, image). Each feature of the input data was normalized to zero mean and unit variance, and instances with missing values were deleted. The data were then randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times; the average results are reported in Tables 4 and 5 and Figure 4.
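For concreteness, the following minimal Python sketch reproduces this evaluation protocol (normalization, 80/20 split, 10 repetitions, and cardinality/accuracy bookkeeping). It uses scikit-learn's L1-penalized linear SVM only as a stand-in classifier, since the paper's interior point L1/2-SVM solver is not reproduced here; the synthetic data set, the value of C, and the zero-threshold are illustrative assumptions.

```python
# Minimal sketch of the evaluation protocol above; LinearSVC(penalty='l1')
# stands in for the L_1/2-SVM solver, and the data set is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=267, n_features=44,
                           n_informative=5, random_state=0)

cards, accs = [], []
for run in range(10):                                  # 10 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=run)         # 80% / 20%
    scaler = StandardScaler().fit(X_tr)                # zero mean, unit variance
    clf = LinearSVC(penalty='l1', dual=False, C=1.0)
    clf.fit(scaler.transform(X_tr), y_tr)
    w = clf.coef_.ravel()
    cards.append(int(np.sum(np.abs(w) > 1e-6)))        # cardinality of w
    accs.append(clf.score(scaler.transform(X_te), y_te))

print(f"cardinality {np.mean(cards):.2f} +/- {np.std(cards):.2f}, "
      f"accuracy {100 * np.mean(accs):.2f}%")
```

For the two multiclass problems, scikit-learn's LinearSVC applies a one-vs-rest scheme by default, which matches the one-against-rest construction described above.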
Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here n is the number of samples and m is the number of input features. For the two multiclass data sets, the number of classes is marked after the name. Sparsity is defined as card/m, and a small value of sparsity is preferred (e.g., the average sparsity of the L1/2-SVM in Table 4 is 14.18/40.50 ≈ 0.35). The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are shown in bold.
As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while maintaining accuracy roughly identical to that of the L2-SVM. Among the three sparse methods, the L1/2-SVM has the lowest average cardinality (14.18) and the highest average classification accuracy (86.52%), while the L1-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and the L0-SVM has the lowest average classification accuracy (83.51%).
Figure 4: Results comparison on UCI data sets (left: sparsity versus data set number; right: accuracy (%) versus data set number; curves for L2-SVM, L1-SVM, L0-SVM, and L1/2-SVM).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), the L1/2-SVM has the best performance among the three sparse SVMs in both feature selection and classification. Compared with the L1-SVM, the L1/2-SVM achieves sparser solutions in nine data sets. For example, in the data set "8 SPECTF" the L1/2-SVM selects 34% fewer features than the L1-SVM, and at the same time its accuracy is 4.1% higher. In most data sets (4, 5, 6, 9, 10), the cardinality of the L1/2-SVM drops significantly (by at least 37%) with equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of the L1/2-SVM decreased, by 1.1%, but the sparsity provided by the L1/2-SVM is a 58.2% improvement over the L1-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, the L1/2-SVM can provide a lower-dimensional representation than the L1-SVM with competitive prediction performance.
Figure 4 (right) shows that, compared with the L0-SVM, the L1/2-SVM improves the classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10) the classification accuracy of the L1/2-SVM is at least 4.0% higher than that of the L0-SVM; in particular, in the data set "10 Musk1" the L1/2-SVM gives a 6.3% rise in accuracy over the L0-SVM. Meanwhile, it can be observed from Figure 4 (left) that the L1/2-SVM selects fewer features than the L0-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of the L1/2-SVM are 37.8% and 49.6% lower than those of the L0-SVM, respectively. In summary, the L1/2-SVM presents better classification performance than the L0-SVM while remaining effective in choosing relevant features.
5. Conclusions

In this paper, we proposed an L1/2 regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the L1/2-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure and is relatively easy to develop numerical methods for. By the use of this interesting reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The L1/2-SVM can obtain sparser solutions than the L1-SVM with comparable classification accuracy; furthermore, the L1/2-SVM can achieve more accurate classification results than the L0-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of the L1/2-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying the nonlinear L1/2-SVM, and exploring various applications of the L1/2-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.
Appendix

A. Properties of the Reformulation of the L1/2-SVM

Let D be the set of all feasible points of problem (10). For any z ∈ D, we let FD(z, D) and LFD(z, D) be the sets of feasible directions and linearized feasible directions of D at z, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point z of (10), one has
\[
FD(z, D) = LFD(z, D). \tag{A.1}
\]
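As a side note (not part of the original argument, but a useful sanity check on the reformulation), the role of the auxiliary variables t_j can be verified in one line: the constraints t_j² − w_j ≥ 0 and t_j² + w_j ≥ 0 together say t_j² ≥ |w_j|, so minimizing the objective forces t_j = |w_j|^{1/2} at any solution, and
\[
\sum_{j=1}^{m} t_j = \sum_{j=1}^{m} |w_j|^{1/2} = \|w\|_{1/2}^{1/2},
\]
which is exactly the L1/2 penalty term.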
Based on the above lemma, we can easily derive a first-order necessary condition for (10). The Lagrangian function of (10) is
\[
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, q)
&= C \sum_{i=1}^{n} \xi_i + \sum_{j=1}^{m} t_j
 - \sum_{j=1}^{m} \lambda_j \left(t_j^2 - w_j\right)
 - \sum_{j=1}^{m} \mu_j \left(t_j^2 + w_j\right) \\
&\quad - \sum_{j=1}^{m} \nu_j t_j
 - \sum_{i=1}^{n} p_i \left(y_i w^T x_i + y_i b + \xi_i - 1\right)
 - \sum_{i=1}^{n} q_i \xi_i,
\end{aligned} \tag{A.2}
\]
where λ, μ, ν, p, q are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first-order necessary condition.
Theorem A.2. Let (w, t, ξ, b) be a local solution of (10). Then there are Lagrangian multipliers (λ, μ, ν, p, q) ∈ R^{m+m+m+n+n} such that the following KKT conditions hold:
\[
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, \quad j = 1, 2, \ldots, m, \\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, \quad j = 1, 2, \ldots, m, \\
&\frac{\partial L}{\partial \xi_i} = C - p_i - q_i = 0, \quad i = 1, 2, \ldots, n, \\
&\frac{\partial L}{\partial b} = \sum_{i=1}^{n} p_i y_i = 0, \\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j \left(t_j^2 - w_j\right) = 0, \quad j = 1, 2, \ldots, m, \\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j \left(t_j^2 + w_j\right) = 0, \quad j = 1, 2, \ldots, m, \\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, \quad j = 1, 2, \ldots, m, \\
&p_i \ge 0, \quad y_i \left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i \left(y_i \left(w^T x_i + b\right) + \xi_i - 1\right) = 0, \quad i = 1, 2, \ldots, n, \\
&q_i \ge 0, \quad \xi_i \ge 0, \quad q_i \xi_i = 0, \quad i = 1, 2, \ldots, n.
\end{aligned} \tag{A.3}
\]
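As an illustration only, the conditions (A.3) can be checked numerically for a candidate solution. The sketch below assumes a solver has already produced the primal variables and multipliers (they are not computed here); it returns the largest violation of the stationarity and complementarity equalities, while the sign constraints and primal feasibility should be checked separately.

```python
# Hedged sketch: numerical residual of the KKT system (A.3).
# X is n-by-m, y in {-1, +1}^n; all inputs are assumed to come from an
# external solver and are illustrative.
import numpy as np

def kkt_residual(X, y, C, w, t, xi, b, lam, mu, nu, p, q):
    """Largest absolute violation of the equalities in (A.3)."""
    residuals = [
        lam - mu - X.T @ (p * y),            # stationarity in w
        1.0 - 2.0 * (lam + mu) * t - nu,     # stationarity in t
        C - p - q,                           # stationarity in xi
        np.atleast_1d(p @ y),                # stationarity in b
        lam * (t**2 - w),                    # complementary slackness
        mu * (t**2 + w),
        nu * t,
        p * (y * (X @ w + b) + xi - 1.0),
        q * xi,
    ]
    return max(float(np.max(np.abs(r))) for r in residuals)
```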
The following theorem shows that the level set is bounded.

Theorem A.3. For any given constant c > 0, the level set
\[
W_c = \{ z = (w, t, \xi) \mid f(z) \le c \} \cap D \tag{A.4}
\]
is bounded.

Proof. For any (w, t, ξ) ∈ W_c, we have
\[
\sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \tag{A.5}
\]
Combining this with t_i ≥ 0, i = 1, …, m, and ξ_j ≥ 0, j = 1, …, n, we have
\[
0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \tag{A.6}
\]
Moreover, for any (w, t, ξ) ∈ D,
\[
\left|w_i\right| \le t_i^2, \quad i = 1, \ldots, m. \tag{A.7}
\]
Consequently, w, t, and ξ are bounded. What is more, from the condition
\[
y_i \left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \tag{A.8}
\]
we have
\[
w^T x_i + b \ge 1 - \xi_i \quad \text{if } y_i = 1, \qquad
w^T x_i + b \le -1 + \xi_i \quad \text{if } y_i = -1. \tag{A.9}
\]
Thus, if the feasible region D is not empty, then we have
\[
\max_{y_i = 1} \left(1 - \xi_i - w^T x_i\right) \le b \le \min_{y_i = -1} \left(-1 + \xi_i - w^T x_i\right). \tag{A.10}
\]
Hence b is also bounded. The proof is complete.
B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix S defined by (21) is always positive definite and thereby ensures that Algorithm 2 is well defined.

Proposition B.1. Let inequality (26) hold. Then the matrix S defined by (21) is positive definite.
Proof. By an elementary deduction, we have, for any v^{(1)} ∈ R^m, v^{(2)} ∈ R^n, v^{(3)} ∈ R^m, v^{(4)} ∈ R with (v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) ≠ 0,
\[
\begin{aligned}
&\left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S
 \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix} \\
&\quad = v^{(1)T} S_{11} v^{(1)} + v^{(1)T} S_{12} v^{(2)} + v^{(1)T} S_{13} v^{(3)} + v^{(1)T} S_{14} v^{(4)} \\
&\qquad + v^{(2)T} S_{21} v^{(1)} + v^{(2)T} S_{22} v^{(2)} + v^{(2)T} S_{23} v^{(3)} + v^{(2)T} S_{24} v^{(4)} \\
&\qquad + v^{(3)T} S_{31} v^{(1)} + v^{(3)T} S_{32} v^{(2)} + v^{(3)T} S_{33} v^{(3)} + v^{(3)T} S_{34} v^{(4)} \\
&\qquad + v^{(4)T} S_{41} v^{(1)} + v^{(4)T} S_{42} v^{(2)} + v^{(4)T} S_{43} v^{(3)} + v^{(4)T} S_{44} v^{(4)} \\
&\quad = v^{(1)T} \left(U + V + X^T Y^T P^{-1} D_4 Y X\right) v^{(1)}
 + v^{(1)T} \left(X^T Y^T P^{-1} D_4\right) v^{(2)} \\
&\qquad + v^{(1)T} \left(-2 (U - V) T\right) v^{(3)}
 + v^{(1)T} \left(X^T Y^T P^{-1} D_4 y\right) v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_4 Y X v^{(1)}
 + v^{(2)T} \left(P^{-1} D_4 + \Xi^{-1} D_5\right) v^{(2)}
 + v^{(2)T} P^{-1} D_4 y\, v^{(4)} \\
&\qquad - 2 v^{(3)T} (U - V) T v^{(1)}
 + v^{(3)T} \left(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\right) v^{(3)} \\
&\qquad + v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)}
 + v^{(4)T} y^T P^{-1} D_4 v^{(2)}
 + v^{(4)T} y^T P^{-1} D_4 y\, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)} \\
&\qquad + v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)}
 + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\qquad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4
 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&\quad = e_m^T U \left(D_{v_1} - 2 T D_{v_3}\right)^2 e_m
 + e_m^T V \left(D_{v_1} + 2 T D_{v_3}\right)^2 e_m \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)}
 + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\qquad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4
 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&\quad > 0,
\end{aligned} \tag{B.1}
\]
where D_{v_1} = diag(v^{(1)}), D_{v_2} = diag(v^{(2)}), D_{v_3} = diag(v^{(3)}), and D_{v_4} = diag(v^{(4)}).
B.1. Proof of Lemma 3

Proof. It is easy to see that if Δz^k = 0, then (23) holds for all α_k ≥ 0.

Suppose Δz^k ≠ 0. We have
\[
\begin{aligned}
&\nabla_w \Phi_\mu (z^k)^T \Delta w^k + \nabla_t \Phi_\mu (z^k)^T \Delta t^k
 + \nabla_\xi \Phi_\mu (z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu (z^k)^T \Delta b^k \\
&\quad = -\left(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\right)
 S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}.
\end{aligned} \tag{B.2}
\]
Since the matrix S is positive definite and Δz^k ≠ 0, the last equation implies
\[
\nabla_w \Phi_\mu (z^k)^T \Delta w^k + \nabla_t \Phi_\mu (z^k)^T \Delta t^k
 + \nabla_\xi \Phi_\mu (z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu (z^k)^T \Delta b^k < 0. \tag{B.3}
\]
Consequently, there exists an ᾱ_k ∈ (0, α̂_k] such that the fourth inequality in (23) is satisfied for all α_k ∈ (0, ᾱ_k].

On the other hand, since (x^k, t^k) is strictly feasible, the point (x^k, t^k) + α(Δx^k, Δt^k) will be feasible for all α > 0 sufficiently small. The proof is complete.
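The backtracking rule invoked in this proof can be sketched generically as follows. This is a schematic, not the paper's Algorithm 2: phi stands for the merit function Φ_μ, slope for the directional derivative ∇_zΦ_μ(z)^TΔz, and g_list for the constraint functions; the parameter values are illustrative.

```python
# Generic feasibility-preserving Armijo backtracking (cf. condition (23)).
import numpy as np

def backtracking_step(phi, slope, g_list, z, dz, tau1=0.25, beta=0.5,
                      alpha0=1.0, max_backtracks=50):
    """Find alpha = alpha0 * beta**j giving Armijo decrease while keeping
    every constraint g(z + alpha*dz) strictly positive."""
    alpha = alpha0
    for _ in range(max_backtracks):
        z_new = z + alpha * dz
        if (all(g(z_new) > 0 for g in g_list)
                and phi(z_new) <= phi(z) + tau1 * alpha * slope):
            return alpha
        alpha *= beta                     # shrink the trial step
    raise RuntimeError("line search failed")

# Toy usage: minimize phi(x) = x^2 along dz = -phi'(x), keeping x > -1.
phi = lambda x: float(x @ x)
x, dz = np.array([0.8]), np.array([-1.6])   # dz = -gradient
slope = -float(dz @ dz)                     # grad^T dz = -|dz|^2
alpha = backtracking_step(phi, slope, [lambda z: z[0] + 1.0], x, dz)
print(alpha, x + alpha * dz)                # accepted step and new point
```

The strict-feasibility test mirrors the argument above: for a fixed search direction, feasibility holds for all sufficiently small α, so the loop terminates.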
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed μ is applied to the barrier subproblem (14).
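To make the two-level structure concrete, here is a runnable one-dimensional toy (an assumption chosen purely for illustration; it is not the SVM subproblem (14)): the inner loop minimizes the log-barrier function for a fixed μ by damped Newton steps, and the outer loop drives μ to zero, tracing the central path x(μ) = 1 + μ toward the solution x = 1 of min{x : x ≥ 1}.

```python
# Toy barrier scheme: inner Newton loop for fixed mu, outer loop mu -> 0.
def newton_barrier(x, mu, iters=50):
    """Minimize Phi_mu(x) = x - mu*log(x - 1) by damped Newton steps."""
    for _ in range(iters):
        grad = 1.0 - mu / (x - 1.0)
        hess = mu / (x - 1.0) ** 2
        dx = -grad / hess
        step = 1.0
        while x + step * dx <= 1.0:        # damping keeps x strictly feasible
            step *= 0.5
        x += step * dx
    return x

x, mu = 2.0, 1.0
while mu > 1e-8:
    x = newton_barrier(x, mu)              # inner: solve barrier subproblem
    print(f"mu = {mu:9.2e}   x(mu) = {x:.8f}")
    mu *= 0.1                              # outer: shrink barrier parameter
```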
Lemma C.1. Let {(z^k, λ^k)} be generated by Algorithm 2. Then the {z^k} are strictly feasible for problem (10), and the Lagrangian multipliers {λ^k} are bounded from above.

Proof. For the sake of convenience, we use g_i(z), i = 1, 2, …, 3m + 2n, to denote the constraint functions of the constrained problem (10).

We first show that the {z^k} are strictly feasible. Suppose on the contrary that there exist an infinite index subset K and an index i ∈ {1, 2, …, 3m + 2n} such that {g_i(z^k)}_K ↓ 0. By the definition of Φ_μ(z), and since f(z) = ∑_{j=1}^{m} t_j + C ∑_{i=1}^{n} ξ_i is bounded from below on the feasible set, it must hold that {Φ_μ(z^k)}_K → ∞. However, the line search rule implies that the sequence {Φ_μ(z^k)} is decreasing, so we get a contradiction. Consequently, for any i ∈ {1, 2, …, 3m + 2n}, g_i(x^k, t^k) is bounded away from zero. The boundedness of {λ^k} then follows from (24)–(27).
Lemma C.2. Let {(z^k, λ^k)} and {(Δz^k, λ̂^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then the sequence {(Δz^k, λ̂^{k+1})}_K is bounded.

Proof. Again, we use g_i(z), i = 1, 2, …, 3m + 2n, to denote the constraint functions of the constrained problem (10). We suppose on the contrary that there exists an infinite subset K′ ⊆ K such that the subsequence {(Δz^k, λ̂^{k+1})}_{K′} tends to infinity. Let {z^k}_{K′} → z*. It follows from Lemma C.1 that there is an infinite subset K̄ ⊆ K′ such that {λ^k}_{K̄} → λ* and g_i(z*) > 0 for all i ∈ {1, 2, …, 3m + 2n}. Thus we have
\[
S_k \longrightarrow S^* \tag{C.1}
\]
as k → ∞ with k ∈ K̄. By Proposition B.1, S* is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of {(Δz^k, λ̂^{k+1})}_{K̄} implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C.3. Let {(z^k, λ^k)} and {(Δz^k, λ̂^{k+1})} be generated by Algorithm 2. If {z^k}_K is a convergent subsequence of {z^k}, then one has {Δz^k}_K → 0.

Proof. It follows from Lemma C.2 that the sequence {Δz^k}_K is bounded. Suppose on the contrary that there exists an infinite subset K′ ⊆ K such that {Δz^k}_{K′} → Δz* ≠ 0. Since the subsequences {z^k}_{K′}, {λ^k}_{K′}, and {λ̂^k}_{K′} are all bounded, there are points z*, λ*, and λ̂* as well as an infinite index set K̄ ⊆ K′ such that {z^k}_{K̄} → z*, {λ^k}_{K̄} → λ*, and {λ̂^k}_{K̄} → λ̂*. By Lemma C.1, we have g_i(z*) > 0 for all i ∈ {1, 2, …, 3m + 2n}. Similar to the proof of Lemma 3, it is not difficult to get
\[
\begin{aligned}
&\nabla_w \Phi_\mu (z^*)^T \Delta w^* + \nabla_t \Phi_\mu (z^*)^T \Delta t^*
 + \nabla_\xi \Phi_\mu (z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu (z^*)^T \Delta b^* \\
&\quad = -\left(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\right)
 S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0.
\end{aligned} \tag{C.2}
\]
Since g_i(z*) > 0 for all i ∈ {1, 2, …, 3m + 2n}, there exists an ᾱ ∈ (0, 1] such that, for all α ∈ (0, ᾱ],
\[
g_i \left(z^* + \alpha \Delta z^*\right) > 0, \quad \forall i \in \{1, 2, \ldots, 3m + 2n\}. \tag{C.3}
\]
Taking into account τ_1 ∈ (0, 1/2), we claim that there exists an α̃ ∈ (0, ᾱ] such that the following inequality holds for all α ∈ (0, α̃]:
\[
\Phi_\mu \left(z^* + \alpha \Delta z^*\right) - \Phi_\mu (z^*)
 \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu (z^*)^T \Delta z^*. \tag{C.4}
\]
Let m* = min{j | β^j ∈ (0, α̃], j = 0, 1, …} and α* = β^{m*}. It follows from (C.3) that the following inequality is satisfied for all k ∈ K̄ sufficiently large and any i ∈ {1, 2, …, 3m + 2n}:
\[
g_i \left(z^k + \alpha^* \Delta z^k\right) > 0. \tag{C.5}
\]
Moreover, we have
\[
\begin{aligned}
\Phi_\mu \left(z^k + \alpha^* \Delta z^k\right) - \Phi_\mu (z^k)
&= \Phi_\mu \left(z^* + \alpha^* \Delta z^*\right) - \Phi_\mu (z^*) \\
&\quad + \Phi_\mu \left(z^k + \alpha^* \Delta z^k\right)
 - \Phi_\mu \left(z^* + \alpha^* \Delta z^*\right)
 + \Phi_\mu (z^*) - \Phi_\mu (z^k) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu (z^*)^T \Delta z^* \\
&\le \tau_1 \alpha^* \nabla_z \Phi_\mu (z^k)^T \Delta z^k.
\end{aligned} \tag{C.6}
\]
By the backtracking line search rule, the last inequality together with (C.5) yields α_k ≥ α* for all k ∈ K̄ large enough. Consequently, when k ∈ K̄ is sufficiently large, we have from (23) and (C.3) that
\[
\begin{aligned}
\Phi_\mu \left(z^{k+1}\right)
&\le \Phi_\mu (z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu (z^k)^T \Delta z^k \\
&\le \Phi_\mu (z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu (z^k)^T \Delta z^k \\
&\le \Phi_\mu (z^k) + \frac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu (z^*)^T \Delta z^*.
\end{aligned} \tag{C.7}
\]
This shows that {Φ_μ(z^k)}_{K̄} → −∞, contradicting the fact that {Φ_μ(z^k)}_{K̄} is bounded from below. The proof is complete.

We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
C.1. Proof of Theorem 5

Proof. Let (z*, λ̂*) be a limit point of {(z^k, λ̂^k)}, and let the subsequence {(z^k, λ̂^k)}_K converge to (z*, λ̂*). Recall that (Δz^k, λ̂^{k+1}) is the solution of (18) with (z, λ) = (z^k, λ^k). Taking limits on both sides of (18) with (z, λ) = (z^k, λ^k) as k → ∞ with k ∈ K, by the use of Lemmas C.1–C.3 we obtain
\[
\begin{aligned}
&\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} = 0, \\
&C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} = 0, \\
&e_m - 2 T^* \hat{\lambda}^{*(1)} - 2 T^* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} = 0, \\
&y^T \hat{\lambda}^{*(4)} = 0, \\
&\left(T^{*2} - W^*\right) \hat{\lambda}^{*(1)} - \mu e_m = 0, \\
&\left(T^{*2} + W^*\right) \hat{\lambda}^{*(2)} - \mu e_m = 0, \\
&T^* \hat{\lambda}^{*(3)} - \mu e_m = 0, \\
&P^* \hat{\lambda}^{*(4)} - \mu e_n = 0, \\
&\Xi^* \hat{\lambda}^{*(5)} - \mu e_n = 0.
\end{aligned} \tag{C.8}
\]
This shows that (z*, λ̂*) satisfies the first-order optimality conditions (15).
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence {(z^j, λ̂^j)}_J converges to some point (z*, λ̂*). It is clear that z* is a feasible point of (10). Since μ_j, ϵ_{μ_j} → 0, it follows from (29) and (30) that
\[
\operatorname{Res}\left(z^j, \hat{\lambda}^j, \mu_j\right)
 \stackrel{J}{\longrightarrow} \operatorname{Res}\left(z^*, \hat{\lambda}^*, 0\right) = 0,
\qquad
\hat{\lambda}^j \stackrel{J}{\longrightarrow} \hat{\lambda}^* \ge 0. \tag{C.9}
\]
Consequently, (z*, λ̂*) satisfies the KKT conditions (13).

(ii) Let ξ_j = max{‖λ̂^j‖_∞, 1} and λ̄^j = ξ_j^{−1} λ̂^j. Obviously, {λ̄^j} is bounded. Hence there exists an infinite subset J′ ⊆ J such that {λ̄^j}_{J′} → λ̄ ≠ 0 and ‖λ̄^j‖_∞ = 1 for large j ∈ J. From (30) we know that λ̄ ≥ 0. Dividing both sides of the first equation of (18) with (z, λ) = (z^j, λ^j) by ξ_j and then taking limits as j → ∞ with j ∈ J′, we get
\[
\begin{aligned}
&\bar{\lambda}^{(1)} - \bar{\lambda}^{(2)} - X^T Y^T \bar{\lambda}^{(4)} = 0, \\
&\bar{\lambda}^{(4)} + \bar{\lambda}^{(5)} = 0, \\
&2 T^* \bar{\lambda}^{(1)} + 2 T^* \bar{\lambda}^{(2)} + \bar{\lambda}^{(3)} = 0, \\
&y^T \bar{\lambda}^{(4)} = 0, \\
&\left(T^{*2} - W^*\right) \bar{\lambda}^{(1)} = 0, \\
&\left(T^{*2} + W^*\right) \bar{\lambda}^{(2)} = 0, \\
&T^* \bar{\lambda}^{(3)} = 0, \\
&P^* \bar{\lambda}^{(4)} = 0, \\
&\Xi^* \bar{\lambda}^{(5)} = 0.
\end{aligned} \tag{C.10}
\]
Since z* is feasible, the above equations show that z* is a Fritz John point of problem (10).
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (Grants 11371154, 11071087, 11271069, and 61103202).
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers—robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "lp-lq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point l1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
10 Journal of Applied Mathematics
Table 4 Feature selection performance comparison on UCI data sets (Cardinality)
No UCI Data Set 119899 1198981198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 768 8 800 000 740 089 620 110 760 1522 Breast Cancer 683 9 900 000 860 122 640 207 840 1673 Wine (3) 178 13 1300 000 973 069 373 028 407 0494 Image (7) 2100 19 1900 000 820 053 366 097 511 0625 SPECT 267 22 2200 000 1740 527 980 427 9 1176 WDBC 569 30 3000 000 980 152 900 303 560 2057 Ionosphere 351 34 3320 045 2500 335 2460 1021 2180 8448 SPECTF 267 44 4400 000 3880 1098 2400 1132 2560 7759 Sonar 208 60 6000 000 3140 1305 2700 718 1360 100610 Musk1 476 166 16600 000 8580 1885 4860 428 41 1416
Average cardinality 4050 4042 2421 1630 1418
Table 5 Classification performance comparison on UCI data sets (accuaracy)
No UCI Data Set 1198712-SVM 119871
1-SVM 119871
0-SVM 119871
12-SVM
Mean Std Mean Std Mean Std Mean Std1 Pima diabetes 7699 347 7673 085 7673 294 7712 1522 Breast cancer 9618 223 9706 096 9632 138 9647 1593 Wine (3) 9771 373 9829 128 9314 239 9714 2564 Image (7) 8894 149 9127 177 8667 458 9136 1945 SPECT 8340 215 7849 231 7811 338 8234 5106 WDBC 9611 211 9646 131 9558 115 9646 1317 Ionosphere 8343 544 8800 370 8457 538 8743 5948 SPECTF 7849 391 7434 450 7387 325 7849 1899 Sonar 7366 529 7561 582 7415 798 7610 43510 Musk1 8484 173 8211 242 7600 564 8232 338
Average accuracy 8597 8584 8351 8652
SVMs there is little change in the accuracy It reveals thatSVMs can benefit from the features reduction
Table 3 shows the average results over all data sets inthe second experiment On average the solution of 119871
12-
SVM yields much sparser than 1198711-SVM and a slightly better
accuracy than 1198710-SVM
42 UCI Data Sets We further tested the reformulation andthe proposed interior point methods to 119871
12-SVM on 10 UCI
data sets [36] There are 8 binary classification problems and2multiclass problems (wine image) Each feature of the inputdata was normalized to zero mean and unit variance and theinstances with missing value were deletedThen the data wasrandomly split into training set (80) and testing set (20)For the two multiclass problems a one-against-rest methodwas applied to construct a binary classifier for each class Werepeated the training and testing procedure 10 times and theaverage results were shown in Tables 4 and 5 and Figure 4
Tables 4 and 5 summarize the feature selection andclassification performance of the numerical experiments onUCI data sets respectively Here 119899 is the numbers of samplesand 119898 is the number of the input features For the two
multiclass data sets the numbers of the classes are markedbehind their names Sparsity is defined as card119898 and thesmall value of sparsity is preferredThe data sets are arrangedin descending order according to the dimension The lowestcardinality and the best accuracy rate for each problem arebolded
As shown in Tables 4 and 5 the three sparse SVMs canencourage sparsity in all data sets while remaining roughlyidentical accuracy with 119871
2-SVM Among the three sparse
methods the 11987112
-SVM has the lowest cardinality (1418)and the highest classification accuracy (8652) on averageWhile the 119871
1-SVM has the worst feature selection perform-
ance with the highest average cardinality (2421) and the1198710-SVM has the lowest average classification accuracy
(8351)Figure 4 plots the sparsity (left) and classification accu-
racy (right) of each classifier on UCI data sets In three datasets (6 9 10) the 119871
12-SVM has the best performance both
in feature selection and classification among the three sparseSVMs Compared with 119871
1-SVM 119871
12-SVM can achieve
sparser solution in nine data sets For example in the dataset ldquo8 SPECTFrdquo the features selected by 119871
12-SVM are 34
Journal of Applied Mathematics 11
50 100 150 2000
20
40
60
80
100
120
140
160
180
200
Dimension
Card
inal
ity
50 100 150 20080
82
84
86
88
90
92
94
96
98
100
Dimension
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 3 Results comparison on artificial data set with various dimensions
less than 1198711-SVM at the same time the accuracy of 119871
12-
SVM is 41 higher than 1198711-SVM In most data sets (4
5 6 9 10) the cardinality of 11987112
-SVM drops significantly(at least 37) with the equal or a slightly better result inaccuracy Only in the data set ldquo3 Winerdquo the accuracy of11987112
-SVM is decreased by 11 but the sparsity provided by11987112
-SVM leads to 582 improvement over 1198711-SVM In the
rest three data sets (1 2 7) the two methods have similarresults in feature selection and classification As seen abovethe 119871
12-SVM can provide lower dimension representation
than 1198711-SVM with the competitive prediction perform-
anceFigure 4 (right) shows that compared with 119871
0-SVM
11987112
-SVM has the classification accuracy improved in alldata sets For instance in four data sets (3 4 8 10) theclassification accuracy of 119871
12-SVM is at least 40 higher
than 1198710-SVM Especially in the data set ldquo10Muskrdquo 119871
12-SVM
gives a 63 rise in accuracy over 1198710-SVM Meanwhile it
can be observed from Figure 4 (left) that 11987112
-SVM selectsfewer feature than 119871
0-SVM in five data sets (5 6 7 9 10) For
example in the data sets 6 and 9 the cardinalities of 11987112
-SVM are 378 and 496 less than 119871
0-SVM respectively
In summary 11987112
-SVM presents better classification perfor-mance than 119871
0-SVM while it is effective in choosing relevant
features
5 Conclusions
In this paper we proposed a 11987112
regularization techniquefor simultaneous feature selection and classification in theSVMWe have reformulated the 119871
12-SVM into an equivalent
smooth constrained optimization problem The problem
possesses a very simple structure and is relatively easy todevelop numerical methods By the use of this interestingreformulation we proposed an interior point method andestablished its convergence Our numerical results supportedthe reformulation and the proposed method The 119871
12-
SVM can get more sparsity solution than 1198711-SVM with
the comparable classification accuracy Furthermore the11987112
-SVM can achieve more accuracy classification resultsthan 119871
0-SVM (FSV)
Inspired by the good performance of the smooth opti-mization reformulation of 119871
12-SVM there are some inter-
esting topics deserving further research For examples todevelop more efficient algorithms for solving the reformu-lation to study nonlinear 119871
12-SVM and to explore varies
applications of the 11987112
-SVM and further validate its effectiveare interesting research topics Some of them are under ourcurrent investigation
Appendix
A Properties of the Reformulation to11987112
-SVM
Let 119863 be set of all feasible points of the problem (10) Forany 119911 isin 119863 we let 119865119863(119911119863) and 119871119865119863(119911119863) be the set of allfeasible directions and linearized feasible directions of 119863 at119911 Since the constraint functions of (10) are all convex weimmediately have the following lemma
Lemma A1 For any feasible point 119911 of (10) one has119865119863 (119911119863) = 119871119865119863 (119911119863) (A1)
12 Journal of Applied Mathematics
1 2 3 4 5 6 7 8 9 1010
20
30
40
50
60
70
80
90
100
Data set number Data set number
Spar
sity
1 2 3 4 5 6 7 8 9 1070
75
80
85
90
95
100
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 4 Results comparison on UCI data sets
Based on the above Lemma we can easily derive a firstorder necessary condition for (10) The Lagrangian functionof (10) is
119871 (119908 119905 120585 119887 120582 120583 ] 119901 119902)
= 119862
119899
sum
119894=1
120585119894+
119898
sum
119895=1
119905119895minus
119898
sum
119895=1
120582119895(1199052
119895minus 119908119895) minus
119898
sum
119895=1
120583119895(1199052
119895+ 119908119895)
minus
119898
sum
119895=1
]119895119905119895minus
119899
sum
119894=1
119901119894(119910119894119908119879119909119894+ 119910119894119887 + 120585119894minus 1) minus
119899
sum
119894=1
119902119894120585119894
(A2)
where 120582 120583 ] 119901 119902 are the Lagrangian multipliers By the useof Lemma A1 we immediately have the following theoremabout the first order necessary condition
Theorem A2 Let (119908 119905 120585 119887) be a local solution of (10) Thenthere are Lagrangian multipliers (120582 120583 ] 119901 119902) isin 119877
119898+119898+119898+119899+119899
such that the following KKT conditions hold
120597119871
120597119908= 120582119895minus 120583119895minus 119901119894
119899
sum
119894=1
119910119894119909119894= 0 119894 = 1 2 119899
120597119871
120597119905= 1 minus 2120582
119895119905119895minus 2120583119895119905119895minus ]119895= 0 119895 = 1 2 119898
120597119871
120597120585= 119862 minus 119901
119894minus 119902119894= 0 119894 = 1 2 119899
120597119871
120597119887=
119899
sum
119894=1
119901119894119910119894= 0 119894 = 1 2 119899
120582119895ge 0 119905
2
119895minus 119908119895ge 0 120582
119895(1199052
119895minus 119908119895) = 0
119895 = 1 2 119898
120583119895ge 0 119905
2
119895+ 119908119895ge 0 120583
119895(1199052
119895+ 119908119895) = 0
119895 = 1 2 119898
]119895ge 0 119905
119895ge 0 ]
119895119905119895= 0 119895 = 1 2 119898
119901119894ge 0 119910
119894(119908119879119909119894+ 119887) + 120585
119894minus 1 ge 0
119901119894(119910119894(119908119879119909119894+ 119887) + 120585
119894minus 1) = 0 119894 = 1 2 119899
119902119894ge 0 120585
119894ge 0 119902
119894120585119894= 0 119894 = 1 2 119899
(A3)
The following theorem shows that the level set will bebounded
Theorem A3 For any given constant 119888 gt 0 the level set
119882119888= 119911 = (119908 119905 120585) | 119891 (119911) le 119888 cap 119863 (A4)
is bounded
Proof For any (119908 119905 120585) isin Ω119888 we have
119898
sum
119894
119905119894+ 119862
119899
sum
119895
120585119895le 119888 (A5)
Journal of Applied Mathematics 13
Combining with 119905119894ge 0 119894 = 1 119898 and 120585
119895ge 0 119895 = 1 119899
we have
0 le 119905119894le 119888 119894 = 1 119898
0 le 120585119895le
119888
119862 119895 = 1 119899
(A6)
Moreover for any (119908 119905 120585) isin 119863
10038161003816100381610038161199081198941003816100381610038161003816 le 1199052
119894 119894 = 1 119899 (A7)
Consequently 119908 119905 120585 are bounded What is more from thecondition
119910119894(119908119879119909119894+ 119887) ge 1 minus 120585
119894 119894 = 1 119898 (A8)
we have
119908119879119909119894+ 119887 ge 1 minus 120585
119894 if 119910
119894= 1 119894 = 1 119898
119908119879119909119894+ 119887 le minus1 + 120585
119894 if 119910
119894= minus1 119894 = 1 119898
(A9)
Thus if the feasible region119863 is not empty then we have
max119910119894=1
(1 minus 120585119894minus 119908119879119909119894) le 119887 le min
119910119894=minus1
(minus1 + 120585119894minus 119908119879119909119894) (A10)
Hence 119887 is also bounded The proof is complete
B Proof of Lemma 3
Lemma 3 (see Section 3) shows that Algorithm 2 is welldefinedWe first introduce the following proposition to showthat the matrix 119878 defined by (21) is always positive definitewhich ensures Algorithm 2 to be well defined
Proposition B1 Let the inequality (26) hold then the matrix119878 defined by (21) is positive definite
Proof By an elementary deduction we have for any V(1) isin
119877119898 V(2) isin 119877
119899 V(3) isin 119877119898 V(4) isin 119877 with (V(1) V(2) V(3) V(4)) = 0
(V(1)119879 V(2)119879 V(3)119879 V(4)119879) 119878(
V(1)
V(2)
V(3)
V(4))
= V(1)11987911987811V(1) + V(1)119879119878
12V(2) + V(1)119879119878
13V(3)
+ V(1)11987911987814V(4) + V(2)119879119878
21V(1) + V(2)119879119878
22V(2)
+ V(2)11987911987823V(3) + V(2)119879119878
24V(4)
+ V(3)11987911987831V(1) + V(3)119879119878
32V(2)
+ V(3)11987911987833V(3) + V(3)119879119878
34V(4)
+ V(4)11987911987841V(1) + V(4)119879119878
42V(2)
+ V(4)11987911987843V(3) + V(4)119879119878
44V(4)
= V(1)119879 (119880 + 119881 + 119883119879119884119879119875minus11198634119884119883) V(1)
+ V(1)119879 (119883119879119884119879119875minus11198634) V(2)
+ V(1)119879 (minus2 (119880 minus 119881)119879) V(3)
+ V(1)119879 (119883119879119884119879119875minus11198634119910) V(4)
+ V(2)119879119875minus11198634119884119883V(1) + V(2)119879 (119875minus1119863
4+ Ξminus11198635) V(2)
+ V(2)119879119875minus11198634119910V(4)
minus 2V(3)119879 (119880 minus 119881)119879V(1)
+ V(3)119879 (4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863
1+ 1198632)) V(3)
+ V(4)119879119910119879119875minus11198634119884119883V(1)
+ V(4)119879119910119879119875minus11198634V(2) + V(4)119879119910119879119875minus1119863
4119910V(4)
= V(1)119879119880V(1) minus 4V(3)119879119880119879V(1) + 4V(3)119879119879119880119879V(3)
+ V(1)119879119881V(1) + 4V(3)119879119881119879V(1) + 4V(3)119879119879119881119879V(3)
+ V(2)119879Ξminus11198635V(2) + V(3)119879 (119879minus1119863
3minus 21198631minus 21198632) V(3)
+ (119884119883V(1) + V(2) + V(4)119910)119879
119875minus11198634
times (119884119883V(1) + V(2) + V(4)119910)
= 119890119879
119898119880(119863V
1minus 2119879119863V
3)2
119890119898+ 119890119879
119898119881(119863V
1+ 2119879119863V
3)2
119890119898
+ V(2)119879Ξminus11198635V(2)
+ V(3)119879 (119879minus11198633minus 21198631minus 21198632) V(3)
+ (119884119883V(1) + V(2) + V(4)119910)119879
119875minus11198634(119884119883V(1) + V(2) + V(4)119910)
gt 0
(B1)
where 119863V1= diag(V(1)) 119863V
2= diag(V(2)) 119863V
3= diag(V(3))
and119863V4= diag(V(4))
14 Journal of Applied Mathematics
B1 Proof of Lemma 3
Proof It is easy to see that if Δ119911119896= 0 which implies that (23)
holds for all 120572119896ge 0
Suppose Δ119911119896
= 0 We have
nabla119908Φ120583(119911119896)119879
Δ119908119896+ nabla119905Φ120583(119911119896)119879
Δ119905119896
+ nabla120585Φ120583(119911119896)119879
Δ120585119896+ nabla119887Φ120583(119911119896)119879
Δ119887119896
= minus (Δ119908119896119879
Δ120585119896119879
Δ119905119896119879
Δ119887119896) 119878(
Δ119908119896
Δ120585119896
Δ119905119896
Δ119887119896
)
(B2)
Since matrix 119878 is positive definite and Δ119911119896
= 0 the lastequation implies
nabla119908Φ120583(119911119896)119879
Δ119908119896+ nabla119905Φ120583(119911119896)119879
Δ119905119896
+ nabla120585Φ120583(119911119896)119879
Δ120585119896+ nabla119887Φ120583(119911119896)119879
Δ119887119896lt 0
(B3)
Consequently there exists a 119896isin (0
119896] such that the fourth
inequality in (23) is satisfied for all 120572119896isin (0
119896]
On the other hand since (119909119896 119905119896) is strictly feasible the
point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0
sufficiently small The proof is complete
C Convergence Analysis
This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)
Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian
multipliers 120582119896 are bounded from above
Proof For the sake of convenience we use 119892119894(119911) 119894 =
1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)
We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892
119894(119911119896)K darr 0
By the definition of Φ120583(119911) and 119891(119911) = sum
119898
119895=1119905119894+ 119862sum
119899
119894=1120585119894
being bounded from below in the feasible set it must holdthat Φ
120583(119911119896)K rarr infin
However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-
quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded
away from zero The boundedness of 120582119896 then follows from(24)ndash(27)
Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1
) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911
119896 119896+1
)K is bounded
Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the
constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite
subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1
)K1015840
tends to infinity Let 119911119896K1015840 rarr 119911
lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582
lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus
we have
119878119896997888rarr 119878lowast (C1)
as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911
119896 119896+1
)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete
Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1
) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911
119896K rarr 0
Proof It follows from Lemma C1 that the sequence Δ119911119896K
is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911
119896K1015840 rarr Δ119911
lowast= 0 Since
subsequences 119911119896K1015840 120582
119896K1015840 and
119896K1015840 are all bounded
there are points 119911lowast 120582lowast and
lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911
lowast 120582119896K rarr 120582lowast and
119896K rarr
lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin
1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get
nabla119908Φ120583(119911lowast)119879
Δ119908lowast+ nabla119905Φ120583(119911lowast)119879
Δ119905lowast
+ nabla120585Φ120583(119911lowast)119879
Δ120585lowast+ nabla119887Φ120583(119911lowast)119879
Δ119887lowast
= minus (Δ119908lowast119879
Δ120585lowast119879
Δ119905lowast119879
Δ119887lowast) 119878(
Δ119908lowast
Δ120585lowast
Δ119905lowast
Δ119887lowast
) lt 0
(C2)
Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a
isin (0 1] such that for all 120572 isin (0 ]
119892119894(119911lowast+ 120572Δ119911
lowast) gt 0 forall119894 isin 1 2 3119899 (C3)
Taking into account 1205911isin (0 12) we claim that there exists
a isin (0 ] such that the following inequality holds for all120572 isin (0 ]
Φ120583(119911lowast+ 120572Δ119911
lowast) minus Φ120583(119911lowast) le 11120591
1120572nabla119911Φ120583(119911lowast)119879
Δ119911lowast (C4)
Let 119898lowast
= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572
lowast= 120573119898lowast
It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899
119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)
Journal of Applied Mathematics 15
Moreover we have
Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)
= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)
+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)
+ Φ120583(119911lowast) minus Φ120583(119911119896)
le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
le 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
(C6)
By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572
119896ge 120572lowastfor all 119896 isin K
large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that
Φ120583(119911119896+1
) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) +
1
21205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
(C7)
This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the
fact that Φ120583(119911119896)K is bounded from below The proof is
complete
Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3
C1 Proof of Theorem 5
Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911
lowast lowast)
Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =
(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =
(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we
obtain
lowast(1)
minus lowast(2)
minus 119883119879119884119879lowast(4)
= 0
119862 lowast 119890119899minus lowast(4)
minus lowast(5)
= 0
119890119898minus 2119879lowastlowast(1)
minus 2119879lowastlowast(2)
minus lowast(3)
= 0
119910119879lowast(4)
= 0
(119879lowast2
minus 119882lowast) lowast(1)
minus 120583119890119898
= 0
(119879lowast2
+ 119882lowast) lowast(2)
minus 120583119890119898
= 0
119879lowastlowast(3)
minus 120583119890119898
= 0
119875lowastlowast(4)
minus 120583119890119899= 0
Ξlowastlowast(5)
minus 120583119890119899= 0
(C8)
This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)
C2 Proof of Theorem 6
Proof (i) Without loss of generality we suppose that thebounded subsequence (119911
119895 119895)J converges to some point
(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since
120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that
Res (119911119895 119895 120583119895)
J997888rarr Res (119911lowast lowast 0) = 0
119895J
997888rarr lowastge 0
(C9)
Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585
119895= max119895
infin 1 and
119895= 120585minus1
119895119895 Obviously
119895 is bounded Hence there exists an infinite subsetJ1015840 sube J
such that 1198951015840J rarr = 0 and 119895infin
= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911
119896 120582119896) by 120585119895 then taking
limits as 119895 rarr infin with 119895 isin J we get
(1)
minus (2)
minus 119883119879119884119879(4)
= 0
(4)
+ (5)
= 0
2119879lowast(1)
+ 2119879lowast(2)
+ (3)
= 0
119910119879(4)
= 0
(119879lowast2
minus 119882lowast) (1)
= 0
(119879lowast2
+ 119882lowast) (2)
= 0
119879lowast(3)
= 0
119875lowast(4)
= 0
Ξlowast(5)
= 0
(C10)
Since 119911lowast is feasible the above equations has shown that 119911lowast is
a Fritz-John point of problem (10)
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China
References
[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412-420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764-771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823-1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357-1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668-674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82-90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145-153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185-202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49-56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081-1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707-710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293-I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L_{1/2} regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159-1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417-2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted L_q adaptive least squares support vector machine classifiers - robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253-2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive L_q penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380-6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100-107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "l_p-l_q penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307-1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937-945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L_{1/2} regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of L_p minimization," Mathematical Programming, vol. 129, no. 2, pp. 285-299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928-961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point L_{1/2}-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1-3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187-204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
12 Journal of Applied Mathematics
1 2 3 4 5 6 7 8 9 1010
20
30
40
50
60
70
80
90
100
Data set number Data set number
Spar
sity
1 2 3 4 5 6 7 8 9 1070
75
80
85
90
95
100
Accu
racy
L2-SVML1-SVM
L0-SVML12-SVM
L2-SVML1-SVM
L0-SVML12-SVM
Figure 4 Results comparison on UCI data sets
Based on the above Lemma we can easily derive a firstorder necessary condition for (10) The Lagrangian functionof (10) is
119871 (119908 119905 120585 119887 120582 120583 ] 119901 119902)
= 119862
119899
sum
119894=1
120585119894+
119898
sum
119895=1
119905119895minus
119898
sum
119895=1
120582119895(1199052
119895minus 119908119895) minus
119898
sum
119895=1
120583119895(1199052
119895+ 119908119895)
minus
119898
sum
119895=1
]119895119905119895minus
119899
sum
119894=1
119901119894(119910119894119908119879119909119894+ 119910119894119887 + 120585119894minus 1) minus
119899
sum
119894=1
119902119894120585119894
(A2)
where 120582 120583 ] 119901 119902 are the Lagrangian multipliers By the useof Lemma A1 we immediately have the following theoremabout the first order necessary condition
Theorem A2 Let (119908 119905 120585 119887) be a local solution of (10) Thenthere are Lagrangian multipliers (120582 120583 ] 119901 119902) isin 119877
119898+119898+119898+119899+119899
such that the following KKT conditions hold
120597119871
120597119908= 120582119895minus 120583119895minus 119901119894
119899
sum
119894=1
119910119894119909119894= 0 119894 = 1 2 119899
120597119871
120597119905= 1 minus 2120582
119895119905119895minus 2120583119895119905119895minus ]119895= 0 119895 = 1 2 119898
120597119871
120597120585= 119862 minus 119901
119894minus 119902119894= 0 119894 = 1 2 119899
120597119871
120597119887=
119899
sum
119894=1
119901119894119910119894= 0 119894 = 1 2 119899
120582119895ge 0 119905
2
119895minus 119908119895ge 0 120582
119895(1199052
119895minus 119908119895) = 0
119895 = 1 2 119898
120583119895ge 0 119905
2
119895+ 119908119895ge 0 120583
119895(1199052
119895+ 119908119895) = 0
119895 = 1 2 119898
]119895ge 0 119905
119895ge 0 ]
119895119905119895= 0 119895 = 1 2 119898
119901119894ge 0 119910
119894(119908119879119909119894+ 119887) + 120585
119894minus 1 ge 0
119901119894(119910119894(119908119879119909119894+ 119887) + 120585
119894minus 1) = 0 119894 = 1 2 119899
119902119894ge 0 120585
119894ge 0 119902
119894120585119894= 0 119894 = 1 2 119899
(A3)
The following theorem shows that the level set will bebounded
Theorem A3 For any given constant 119888 gt 0 the level set
119882119888= 119911 = (119908 119905 120585) | 119891 (119911) le 119888 cap 119863 (A4)
is bounded
Proof For any (119908 119905 120585) isin Ω119888 we have
119898
sum
119894
119905119894+ 119862
119899
sum
119895
120585119895le 119888 (A5)
Journal of Applied Mathematics 13
Combining with 119905119894ge 0 119894 = 1 119898 and 120585
119895ge 0 119895 = 1 119899
we have
0 le 119905119894le 119888 119894 = 1 119898
0 le 120585119895le
119888
119862 119895 = 1 119899
(A6)
Moreover for any (119908 119905 120585) isin 119863
10038161003816100381610038161199081198941003816100381610038161003816 le 1199052
119894 119894 = 1 119899 (A7)
Consequently 119908 119905 120585 are bounded What is more from thecondition
119910119894(119908119879119909119894+ 119887) ge 1 minus 120585
119894 119894 = 1 119898 (A8)
we have
119908119879119909119894+ 119887 ge 1 minus 120585
119894 if 119910
119894= 1 119894 = 1 119898
119908119879119909119894+ 119887 le minus1 + 120585
119894 if 119910
119894= minus1 119894 = 1 119898
(A9)
Thus if the feasible region119863 is not empty then we have
max119910119894=1
(1 minus 120585119894minus 119908119879119909119894) le 119887 le min
119910119894=minus1
(minus1 + 120585119894minus 119908119879119909119894) (A10)
Hence 119887 is also bounded The proof is complete
B Proof of Lemma 3
Lemma 3 (see Section 3) shows that Algorithm 2 is welldefinedWe first introduce the following proposition to showthat the matrix 119878 defined by (21) is always positive definitewhich ensures Algorithm 2 to be well defined
Proposition B1 Let the inequality (26) hold then the matrix119878 defined by (21) is positive definite
Proof By an elementary deduction we have for any V(1) isin
119877119898 V(2) isin 119877
119899 V(3) isin 119877119898 V(4) isin 119877 with (V(1) V(2) V(3) V(4)) = 0
(V(1)119879 V(2)119879 V(3)119879 V(4)119879) 119878(
V(1)
V(2)
V(3)
V(4))
= V(1)11987911987811V(1) + V(1)119879119878
12V(2) + V(1)119879119878
13V(3)
+ V(1)11987911987814V(4) + V(2)119879119878
21V(1) + V(2)119879119878
22V(2)
+ V(2)11987911987823V(3) + V(2)119879119878
24V(4)
+ V(3)11987911987831V(1) + V(3)119879119878
32V(2)
+ V(3)11987911987833V(3) + V(3)119879119878
34V(4)
+ V(4)11987911987841V(1) + V(4)119879119878
42V(2)
+ V(4)11987911987843V(3) + V(4)119879119878
44V(4)
= V(1)119879 (119880 + 119881 + 119883119879119884119879119875minus11198634119884119883) V(1)
+ V(1)119879 (119883119879119884119879119875minus11198634) V(2)
+ V(1)119879 (minus2 (119880 minus 119881)119879) V(3)
+ V(1)119879 (119883119879119884119879119875minus11198634119910) V(4)
+ V(2)119879119875minus11198634119884119883V(1) + V(2)119879 (119875minus1119863
4+ Ξminus11198635) V(2)
+ V(2)119879119875minus11198634119910V(4)
minus 2V(3)119879 (119880 minus 119881)119879V(1)
+ V(3)119879 (4119879 (119880 + 119881)119879 + 119879minus11198633minus 2 (119863
1+ 1198632)) V(3)
+ V(4)119879119910119879119875minus11198634119884119883V(1)
+ V(4)119879119910119879119875minus11198634V(2) + V(4)119879119910119879119875minus1119863
4119910V(4)
= V(1)119879119880V(1) minus 4V(3)119879119880119879V(1) + 4V(3)119879119879119880119879V(3)
+ V(1)119879119881V(1) + 4V(3)119879119881119879V(1) + 4V(3)119879119879119881119879V(3)
+ V(2)119879Ξminus11198635V(2) + V(3)119879 (119879minus1119863
3minus 21198631minus 21198632) V(3)
+ (119884119883V(1) + V(2) + V(4)119910)119879
119875minus11198634
times (119884119883V(1) + V(2) + V(4)119910)
= 119890119879
119898119880(119863V
1minus 2119879119863V
3)2
119890119898+ 119890119879
119898119881(119863V
1+ 2119879119863V
3)2
119890119898
+ V(2)119879Ξminus11198635V(2)
+ V(3)119879 (119879minus11198633minus 21198631minus 21198632) V(3)
+ (119884119883V(1) + V(2) + V(4)119910)119879
119875minus11198634(119884119883V(1) + V(2) + V(4)119910)
gt 0
(B1)
where 119863V1= diag(V(1)) 119863V
2= diag(V(2)) 119863V
3= diag(V(3))
and119863V4= diag(V(4))
14 Journal of Applied Mathematics
B1 Proof of Lemma 3
Proof It is easy to see that if Δ119911119896= 0 which implies that (23)
holds for all 120572119896ge 0
Suppose Δ119911119896
= 0 We have
nabla119908Φ120583(119911119896)119879
Δ119908119896+ nabla119905Φ120583(119911119896)119879
Δ119905119896
+ nabla120585Φ120583(119911119896)119879
Δ120585119896+ nabla119887Φ120583(119911119896)119879
Δ119887119896
= minus (Δ119908119896119879
Δ120585119896119879
Δ119905119896119879
Δ119887119896) 119878(
Δ119908119896
Δ120585119896
Δ119905119896
Δ119887119896
)
(B2)
Since matrix 119878 is positive definite and Δ119911119896
= 0 the lastequation implies
nabla119908Φ120583(119911119896)119879
Δ119908119896+ nabla119905Φ120583(119911119896)119879
Δ119905119896
+ nabla120585Φ120583(119911119896)119879
Δ120585119896+ nabla119887Φ120583(119911119896)119879
Δ119887119896lt 0
(B3)
Consequently there exists a 119896isin (0
119896] such that the fourth
inequality in (23) is satisfied for all 120572119896isin (0
119896]
On the other hand since (119909119896 119905119896) is strictly feasible the
point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0
sufficiently small The proof is complete
C Convergence Analysis
This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)
Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian
multipliers 120582119896 are bounded from above
Proof For the sake of convenience we use 119892119894(119911) 119894 =
1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)
We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892
119894(119911119896)K darr 0
By the definition of Φ120583(119911) and 119891(119911) = sum
119898
119895=1119905119894+ 119862sum
119899
119894=1120585119894
being bounded from below in the feasible set it must holdthat Φ
120583(119911119896)K rarr infin
However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-
quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded
away from zero The boundedness of 120582119896 then follows from(24)ndash(27)
Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1
) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911
119896 119896+1
)K is bounded
Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the
constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite
subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1
)K1015840
tends to infinity Let 119911119896K1015840 rarr 119911
lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582
lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus
we have
119878119896997888rarr 119878lowast (C1)
as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911
119896 119896+1
)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete
Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1
) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911
119896K rarr 0
Proof It follows from Lemma C1 that the sequence Δ119911119896K
is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911
119896K1015840 rarr Δ119911
lowast= 0 Since
subsequences 119911119896K1015840 120582
119896K1015840 and
119896K1015840 are all bounded
there are points 119911lowast 120582lowast and
lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911
lowast 120582119896K rarr 120582lowast and
119896K rarr
lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin
1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get
nabla119908Φ120583(119911lowast)119879
Δ119908lowast+ nabla119905Φ120583(119911lowast)119879
Δ119905lowast
+ nabla120585Φ120583(119911lowast)119879
Δ120585lowast+ nabla119887Φ120583(119911lowast)119879
Δ119887lowast
= minus (Δ119908lowast119879
Δ120585lowast119879
Δ119905lowast119879
Δ119887lowast) 119878(
Δ119908lowast
Δ120585lowast
Δ119905lowast
Δ119887lowast
) lt 0
(C2)
Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a
isin (0 1] such that for all 120572 isin (0 ]
119892119894(119911lowast+ 120572Δ119911
lowast) gt 0 forall119894 isin 1 2 3119899 (C3)
Taking into account 1205911isin (0 12) we claim that there exists
a isin (0 ] such that the following inequality holds for all120572 isin (0 ]
Φ120583(119911lowast+ 120572Δ119911
lowast) minus Φ120583(119911lowast) le 11120591
1120572nabla119911Φ120583(119911lowast)119879
Δ119911lowast (C4)
Let 119898lowast
= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572
lowast= 120573119898lowast
It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899
119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)
Journal of Applied Mathematics 15
Moreover we have
Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)
= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)
+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)
+ Φ120583(119911lowast) minus Φ120583(119911119896)
le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
le 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
(C6)
By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572
119896ge 120572lowastfor all 119896 isin K
large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that
Φ120583(119911119896+1
) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) +
1
21205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
(C7)
This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the
fact that Φ120583(119911119896)K is bounded from below The proof is
complete
Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3
C1 Proof of Theorem 5
Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911
lowast lowast)
Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =
(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =
(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we
obtain
lowast(1)
minus lowast(2)
minus 119883119879119884119879lowast(4)
= 0
119862 lowast 119890119899minus lowast(4)
minus lowast(5)
= 0
119890119898minus 2119879lowastlowast(1)
minus 2119879lowastlowast(2)
minus lowast(3)
= 0
119910119879lowast(4)
= 0
(119879lowast2
minus 119882lowast) lowast(1)
minus 120583119890119898
= 0
(119879lowast2
+ 119882lowast) lowast(2)
minus 120583119890119898
= 0
119879lowastlowast(3)
minus 120583119890119898
= 0
119875lowastlowast(4)
minus 120583119890119899= 0
Ξlowastlowast(5)
minus 120583119890119899= 0
(C8)
This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)
C2 Proof of Theorem 6
Proof (i) Without loss of generality we suppose that thebounded subsequence (119911
119895 119895)J converges to some point
(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since
120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that
Res (119911119895 119895 120583119895)
J997888rarr Res (119911lowast lowast 0) = 0
119895J
997888rarr lowastge 0
(C9)
Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585
119895= max119895
infin 1 and
119895= 120585minus1
119895119895 Obviously
119895 is bounded Hence there exists an infinite subsetJ1015840 sube J
such that 1198951015840J rarr = 0 and 119895infin
= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911
119896 120582119896) by 120585119895 then taking
limits as 119895 rarr infin with 119895 isin J we get
(1)
minus (2)
minus 119883119879119884119879(4)
= 0
(4)
+ (5)
= 0
2119879lowast(1)
+ 2119879lowast(2)
+ (3)
= 0
119910119879(4)
= 0
(119879lowast2
minus 119882lowast) (1)
= 0
(119879lowast2
+ 119882lowast) (2)
= 0
119879lowast(3)
= 0
119875lowast(4)
= 0
Ξlowast(5)
= 0
(C10)
Since 119911lowast is feasible the above equations has shown that 119911lowast is
a Fritz-John point of problem (10)
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China
References
[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th
16 Journal of Applied Mathematics
International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997
[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003
[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002
[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006
[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007
[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003
[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004
[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006
[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003
[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000
[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004
[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998
Combining this with $t_i \ge 0$, $i = 1, \dots, m$, and $\xi_j \ge 0$, $j = 1, \dots, n$, we have
$$0 \le t_i \le c, \quad i = 1, \dots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \dots, n. \tag{A6}$$
Moreover, for any $(w, t, \xi) \in D$,
$$|w_i| \le t_i^2, \quad i = 1, \dots, n. \tag{A7}$$
Consequently, $w$, $t$, and $\xi$ are bounded. What is more, from the condition
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, m, \tag{A8}$$
we have
$$w^T x_i + b \ge 1 - \xi_i \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 + \xi_i \ \text{ if } y_i = -1, \quad i = 1, \dots, m. \tag{A9}$$
Thus, if the feasible region $D$ is not empty, then
$$\max_{y_i = 1} \bigl(1 - \xi_i - w^T x_i\bigr) \le b \le \min_{y_i = -1} \bigl(-1 + \xi_i - w^T x_i\bigr). \tag{A10}$$
Hence $b$ is also bounded. The proof is complete.
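For clarity, the step behind (A6) is elementary: assuming the level-set bound $\sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \le c$ obtained just above (our reading of the preceding argument), nonnegativity of every summand gives, for each $i$ and $j$,
$$t_i \le \sum_{j=1}^{m} t_j \le c, \qquad C\,\xi_j \le C \sum_{i=1}^{n} \xi_i \le c, \ \text{ that is, } \ \xi_j \le \frac{c}{C}.$$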
B. Proof of Lemma 3

Lemma 3 (see Section 3) states that Algorithm 2 is well defined. We first establish the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite; this is what guarantees that Algorithm 2 is well defined.
Proposition B1. Let inequality (26) hold. Then the matrix $S$ defined by (21) is positive definite.

Proof. By an elementary deduction, we have, for any $v^{(1)} \in R^m$, $v^{(2)} \in R^n$, $v^{(3)} \in R^m$, and $v^{(4)} \in R$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \ne 0$,
$$\begin{aligned}
&\bigl(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\bigr)\, S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix} \\
&\quad = v^{(1)T} S_{11} v^{(1)} + v^{(1)T} S_{12} v^{(2)} + v^{(1)T} S_{13} v^{(3)} + v^{(1)T} S_{14} v^{(4)} \\
&\qquad + v^{(2)T} S_{21} v^{(1)} + v^{(2)T} S_{22} v^{(2)} + v^{(2)T} S_{23} v^{(3)} + v^{(2)T} S_{24} v^{(4)} \\
&\qquad + v^{(3)T} S_{31} v^{(1)} + v^{(3)T} S_{32} v^{(2)} + v^{(3)T} S_{33} v^{(3)} + v^{(3)T} S_{34} v^{(4)} \\
&\qquad + v^{(4)T} S_{41} v^{(1)} + v^{(4)T} S_{42} v^{(2)} + v^{(4)T} S_{43} v^{(3)} + v^{(4)T} S_{44} v^{(4)} \\
&\quad = v^{(1)T} \bigl(U + V + X^T Y^T P^{-1} D_4 Y X\bigr) v^{(1)} + v^{(1)T} \bigl(X^T Y^T P^{-1} D_4\bigr) v^{(2)} \\
&\qquad + v^{(1)T} \bigl(-2 (U - V) T\bigr) v^{(3)} + v^{(1)T} \bigl(X^T Y^T P^{-1} D_4 y\bigr) v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T} \bigl(P^{-1} D_4 + \Xi^{-1} D_5\bigr) v^{(2)} + v^{(2)T} P^{-1} D_4 y\, v^{(4)} \\
&\qquad - 2 v^{(3)T} (U - V) T v^{(1)} + v^{(3)T} \bigl(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\bigr) v^{(3)} \\
&\qquad + v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)T} y^T P^{-1} D_4 v^{(2)} + v^{(4)T} y^T P^{-1} D_4 y\, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)} \\
&\qquad + v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \bigl(T^{-1} D_3 - 2 D_1 - 2 D_2\bigr) v^{(3)} \\
&\qquad + \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)^T P^{-1} D_4 \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr) \\
&\quad = e_m^T U \bigl(D_{v^{(1)}} - 2 T D_{v^{(3)}}\bigr)^2 e_m + e_m^T V \bigl(D_{v^{(1)}} + 2 T D_{v^{(3)}}\bigr)^2 e_m \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \bigl(T^{-1} D_3 - 2 D_1 - 2 D_2\bigr) v^{(3)} \\
&\qquad + \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr)^T P^{-1} D_4 \bigl(Y X v^{(1)} + v^{(2)} + v^{(4)} y\bigr) \\
&\quad > 0,
\end{aligned} \tag{B1}$$
where $D_{v^{(1)}} = \mathrm{diag}(v^{(1)})$, $D_{v^{(2)}} = \mathrm{diag}(v^{(2)})$, $D_{v^{(3)}} = \mathrm{diag}(v^{(3)})$, and $D_{v^{(4)}} = \mathrm{diag}(v^{(4)})$. The last inequality holds because every term is nonnegative and inequality (26) ensures that the only possibly indefinite term, $v^{(3)T}(T^{-1} D_3 - 2 D_1 - 2 D_2) v^{(3)}$, is positive whenever $v^{(3)} \ne 0$; a case check on which block of $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)})$ is nonzero then gives strict positivity.
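The completing-the-square step in (B1) is easy to check numerically. The following short script (ours, not part of the paper; it assumes only that $U$ and $T$ are nonnegative diagonal matrices, as in the text) verifies the identity $v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)} = e_m^T U (D_{v^{(1)}} - 2 T D_{v^{(3)}})^2 e_m$ on random data:

# Numerical check of the completing-the-square identity used in (B1).
# For diagonal U, T >= 0 and arbitrary v1, v3:
#   v1' U v1 - 4 v3' T U v1 + 4 v3' T U T v3 = e' U (Dv1 - 2 T Dv3)^2 e,
# where Dv1 = diag(v1), Dv3 = diag(v3), and e is the all-ones vector.
import numpy as np

rng = np.random.default_rng(0)
m = 6
U = np.diag(rng.random(m))          # nonnegative diagonal
T = np.diag(rng.random(m))          # nonnegative diagonal
v1 = rng.standard_normal(m)
v3 = rng.standard_normal(m)

lhs = v1 @ U @ v1 - 4 * v3 @ T @ U @ v1 + 4 * v3 @ T @ U @ T @ v3
D = np.diag(v1) - 2 * T @ np.diag(v3)
rhs = np.ones(m) @ U @ D @ D @ np.ones(m)
assert np.isclose(lhs, rhs)
# Each summand is u_i * (v1_i - 2 t_i v3_i)^2 >= 0, which is why the
# U- and V-terms in (B1) are nonnegative.

The analogous identity with $V$ and $+2 T D_{v^{(3)}}$ covers the $V$-terms.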
B.1. Proof of Lemma 3

Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k \ge 0$.

Suppose $\Delta z^k \ne 0$. We have
$$\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k = -\bigl(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\bigr)\, S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}. \tag{B2}$$
Since the matrix $S$ is positive definite and $\Delta z^k \ne 0$, the last equation implies
$$\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0. \tag{B3}$$
Consequently, there exists an $\bar{\alpha}_k \in (0, \hat{\alpha}_k]$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $(x^k, t^k)$ is strictly feasible, the point $(x^k, t^k) + \alpha (\Delta x^k, \Delta t^k)$ is feasible for all $\alpha > 0$ sufficiently small. The proof is complete.
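The line search in Algorithm 2 can be summarized as a standard backtracking loop: shrink $\alpha$ by the factor $\beta$ until the trial point is strictly feasible and the Armijo-type decrease in (23) holds. Below is a schematic sketch (our notation; `phi`, `g`, and the constants are placeholders, and the acceptance test stands in for the full set of conditions in (23)):

# Backtracking line search: accept the largest alpha = beta^j that keeps the
# trial point strictly feasible and achieves sufficient (Armijo-type) decrease.
import numpy as np

def backtrack(phi, g, z, dz, grad_dot_dz, tau1=0.25, beta=0.5, max_tries=60):
    """phi(z): barrier objective; g(z): vector of constraint values,
    g(z) > 0 componentwise means strict feasibility; grad_dot_dz < 0."""
    phi0, alpha = phi(z), 1.0
    for _ in range(max_tries):
        z_new = z + alpha * dz
        if np.all(g(z_new) > 0) and phi(z_new) <= phi0 + tau1 * alpha * grad_dot_dz:
            return alpha, z_new                 # accepted step
        alpha *= beta                           # alpha = beta^j, as in Lemma C3
    raise RuntimeError("no acceptable step length found")

This is exactly the mechanism the proofs below exploit: acceptance forces $\{\Phi_\mu(z^k)\}$ to decrease, and strict feasibility is preserved at every iterate.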
C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).
Lemma C1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then the iterates $z^k$ are strictly feasible for problem (10), and the Lagrangian multipliers $\lambda^k$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that the $z^k$ are strictly feasible. Suppose, on the contrary, that there exist an infinite index subset $\mathcal{K}$ and an index $i \in \{1, 2, \dots, 3m + 2n\}$ such that $\{g_i(z^k)\}_{\mathcal{K}} \downarrow 0$. By the definition of $\Phi_\mu(z)$, and since $f(z) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_{\mathcal{K}} \to \infty$. However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing, so we get a contradiction. Consequently, for every $i \in \{1, 2, \dots, 3m + 2n\}$, $\{g_i(z^k)\}$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)–(27).
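To make the last step concrete: if, as is typical for primal-dual interior point updates (and as our reading of (24)–(27), which are not reproduced in this appendix, suggests), each multiplier estimate has the form $\lambda_i = \mu / g_i(z)$, then boundedness is immediate:
$$g_i(z^k) \ge \delta > 0 \ \text{ for all } k \quad \Longrightarrow \quad \lambda_i^k = \frac{\mu}{g_i(z^k)} \le \frac{\mu}{\delta}.$$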
Lemma C2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}}$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, 2, \dots, 3m + 2n$, to denote the constraint functions of the constrained problem (10). Suppose, on the contrary, that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that the subsequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\mathcal{K}'}$ tends to infinity. Let $\{z^k\}_{\mathcal{K}'} \to z^*$. It follows from Lemma C1 that there is an infinite subset $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$ and $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Thus we have
$$S_k \longrightarrow S^* \tag{C1}$$
as $k \to \infty$ with $k \in \bar{\mathcal{K}}$. By Proposition B1, $S^*$ is positive definite. Since the right-hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{\mathcal{K}}}$ would imply that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete.
Lemma C3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_{\mathcal{K}}$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_{\mathcal{K}} \to 0$.

Proof. It follows from Lemma C2 that the sequence $\{\Delta z^k\}_{\mathcal{K}}$ is bounded. Suppose, on the contrary, that there exists an infinite subset $\mathcal{K}' \subseteq \mathcal{K}$ such that $\{\Delta z^k\}_{\mathcal{K}'} \to \Delta z^* \ne 0$. Since the subsequences $\{z^k\}_{\mathcal{K}'}$, $\{\lambda^k\}_{\mathcal{K}'}$, and $\{\hat{\lambda}^k\}_{\mathcal{K}'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$ as well as an infinite index set $\bar{\mathcal{K}} \subseteq \mathcal{K}'$ such that $\{z^k\}_{\bar{\mathcal{K}}} \to z^*$, $\{\lambda^k\}_{\bar{\mathcal{K}}} \to \lambda^*$, and $\{\hat{\lambda}^k\}_{\bar{\mathcal{K}}} \to \hat{\lambda}^*$. By Lemma C1, we have $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get
$$\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^* = -\bigl(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\bigr)\, S \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0. \tag{C2}$$
Since $g_i(z^*) > 0$ for all $i \in \{1, 2, \dots, 3m + 2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,
$$g_i(z^* + \alpha \Delta z^*) > 0, \quad \forall i \in \{1, 2, \dots, 3m + 2n\}. \tag{C3}$$
Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\tilde{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \tilde{\alpha}]$:
$$\Phi_\mu(z^* + \alpha \Delta z^*) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C4}$$
Let $m^* = \min\{j \mid \beta^j \in (0, \tilde{\alpha}],\ j = 0, 1, \dots\}$ and $\alpha^* = \beta^{m^*}$. It follows from (C3) that the following inequality is satisfied for all $k \in \bar{\mathcal{K}}$ sufficiently large and any $i \in \{1, 2, \dots, 3m + 2n\}$:
$$g_i(z^k + \alpha^* \Delta z^k) > 0. \tag{C5}$$
Moreover, we have
$$\begin{aligned}
\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^k) &= \Phi_\mu(z^* + \alpha^* \Delta z^*) - \Phi_\mu(z^*) \\
&\quad + \Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^* + \alpha^* \Delta z^*) + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^* \\
&\le \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k.
\end{aligned} \tag{C6}$$
By the backtracking line search rule, the last inequality together with (C5) yields $\alpha_k \ge \alpha^*$ for all $k \in \bar{\mathcal{K}}$ large enough. Consequently, when $k \in \bar{\mathcal{K}}$ is sufficiently large, we have from (23) and (C3) that
$$\Phi_\mu(z^{k+1}) \le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k \le \Phi_\mu(z^k) + \frac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C7}$$
This shows that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\bar{\mathcal{K}}}$ is bounded from below. The proof is complete.
We now establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.
C.1. Proof of Theorem 5

Proof. Let $(z^*, \hat{\lambda}^*)$ be a limit point of $\{(z^k, \hat{\lambda}^k)\}$, and let the subsequence $\{(z^k, \hat{\lambda}^k)\}_{\mathcal{K}}$ converge to $(z^*, \hat{\lambda}^*)$. Recall that $(\Delta z^k, \hat{\lambda}^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in \mathcal{K}$, and using Lemma C3 (so that $\Delta z^k \to 0$ on $\mathcal{K}$), we obtain
$$\begin{aligned}
\hat{\lambda}^{*(1)} - \hat{\lambda}^{*(2)} - X^T Y^T \hat{\lambda}^{*(4)} &= 0, \\
C e_n - \hat{\lambda}^{*(4)} - \hat{\lambda}^{*(5)} &= 0, \\
e_m - 2 T^* \hat{\lambda}^{*(1)} - 2 T^* \hat{\lambda}^{*(2)} - \hat{\lambda}^{*(3)} &= 0, \\
y^T \hat{\lambda}^{*(4)} &= 0, \\
\bigl(T^{*2} - W^*\bigr)\, \hat{\lambda}^{*(1)} - \mu e_m &= 0, \\
\bigl(T^{*2} + W^*\bigr)\, \hat{\lambda}^{*(2)} - \mu e_m &= 0, \\
T^* \hat{\lambda}^{*(3)} - \mu e_m &= 0, \\
P^* \hat{\lambda}^{*(4)} - \mu e_n &= 0, \\
\Xi^* \hat{\lambda}^{*(5)} - \mu e_n &= 0.
\end{aligned} \tag{C8}$$
This shows that $(z^*, \hat{\lambda}^*)$ satisfies the first-order optimality conditions (15).
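The system (C8) is also what a practical stopping test measures. Below is a hedged sketch of the residual $\mathrm{Res}(z, \lambda, \mu)$ suggested by (C8): it is our construction, assuming $W = \mathrm{diag}(w)$, $T = \mathrm{diag}(t)$, $\Xi = \mathrm{diag}(\xi)$, and $P = \mathrm{diag}(Y(Xw + b e) - e + \xi)$ (the margin-constraint slacks), none of which are restated in this excerpt.

# Residual of the perturbed first-order conditions (C8); lam is a tuple of
# the five multiplier blocks (lam1..lam5). Appendix convention: w, t in R^m
# (features), xi in R^n (samples), X is the n-by-m data matrix.
import numpy as np

def kkt_residual(w, t, xi, b, lam, X, y, C, mu):
    lam1, lam2, lam3, lam4, lam5 = lam
    m, n = w.size, xi.size
    t2 = t * t                              # diagonal of T^2
    e_m, e_n = np.ones(m), np.ones(n)
    p = y * (X @ w + b) - 1.0 + xi          # assumed margin slacks (diag of P)
    blocks = [
        lam1 - lam2 - X.T @ (y * lam4),     # stationarity in w
        C * e_n - lam4 - lam5,              # stationarity in xi
        e_m - 2 * t * lam1 - 2 * t * lam2 - lam3,   # stationarity in t
        np.array([y @ lam4]),               # stationarity in b
        (t2 - w) * lam1 - mu * e_m,         # perturbed complementarity
        (t2 + w) * lam2 - mu * e_m,
        t * lam3 - mu * e_m,
        p * lam4 - mu * e_n,
        xi * lam5 - mu * e_n,
    ]
    return max(np.linalg.norm(r, np.inf) for r in blocks)

Driving this residual below $\epsilon_{\mu_j}$ and then sending $\mu_j \to 0$ is precisely the outer-loop structure whose limit points Theorem 6 characterizes.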
C.2. Proof of Theorem 6

Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \hat{\lambda}^j)\}_{\mathcal{J}}$ converges to some point $(z^*, \hat{\lambda}^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j, \epsilon_{\mu_j} \to 0$, it follows from (29) and (30) that
$$\mathrm{Res}(z^j, \hat{\lambda}^j, \mu_j) \overset{\mathcal{J}}{\longrightarrow} \mathrm{Res}(z^*, \hat{\lambda}^*, 0) = 0, \qquad \{\hat{\lambda}^j\}_{\mathcal{J}} \longrightarrow \hat{\lambda}^* \ge 0. \tag{C9}$$
Consequently, $(z^*, \hat{\lambda}^*)$ satisfies the KKT conditions (13).

(ii) Let $\xi_j = \max\{\|\hat{\lambda}^j\|_\infty, 1\}$ and $\tilde{\lambda}^j = \xi_j^{-1} \hat{\lambda}^j$. Obviously, $\{\tilde{\lambda}^j\}$ is bounded. Hence there exists an infinite subset $\mathcal{J}' \subseteq \mathcal{J}$ such that $\{\tilde{\lambda}^j\}_{\mathcal{J}'} \to \tilde{\lambda} \ne 0$, with $\|\tilde{\lambda}^j\|_\infty = 1$ for large $j \in \mathcal{J}$. From (30), we know that $\tilde{\lambda} \ge 0$. Dividing both sides of (18) with $(z, \lambda) = (z^j, \lambda^j)$ by $\xi_j$ and then taking limits as $j \to \infty$ with $j \in \mathcal{J}'$, we get
$$\begin{aligned}
\tilde{\lambda}^{(1)} - \tilde{\lambda}^{(2)} - X^T Y^T \tilde{\lambda}^{(4)} &= 0, \\
\tilde{\lambda}^{(4)} + \tilde{\lambda}^{(5)} &= 0, \\
2 T^* \tilde{\lambda}^{(1)} + 2 T^* \tilde{\lambda}^{(2)} + \tilde{\lambda}^{(3)} &= 0, \\
y^T \tilde{\lambda}^{(4)} &= 0, \\
\bigl(T^{*2} - W^*\bigr)\, \tilde{\lambda}^{(1)} &= 0, \\
\bigl(T^{*2} + W^*\bigr)\, \tilde{\lambda}^{(2)} &= 0, \\
T^* \tilde{\lambda}^{(3)} &= 0, \\
P^* \tilde{\lambda}^{(4)} &= 0, \\
\Xi^* \tilde{\lambda}^{(5)} &= 0.
\end{aligned} \tag{C10}$$
Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz John point of problem (10).
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation of China (NSF Grants 11371154, 11071087, 11271069, and 61103202).
References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Scholkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I-293–I-296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with $L_p$ penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$\ell_p$-$\ell_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $\ell_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as subsidiary conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.
14 Journal of Applied Mathematics
B1 Proof of Lemma 3
Proof It is easy to see that if Δ119911119896= 0 which implies that (23)
holds for all 120572119896ge 0
Suppose Δ119911119896
= 0 We have
nabla119908Φ120583(119911119896)119879
Δ119908119896+ nabla119905Φ120583(119911119896)119879
Δ119905119896
+ nabla120585Φ120583(119911119896)119879
Δ120585119896+ nabla119887Φ120583(119911119896)119879
Δ119887119896
= minus (Δ119908119896119879
Δ120585119896119879
Δ119905119896119879
Δ119887119896) 119878(
Δ119908119896
Δ120585119896
Δ119905119896
Δ119887119896
)
(B2)
Since matrix 119878 is positive definite and Δ119911119896
= 0 the lastequation implies
nabla119908Φ120583(119911119896)119879
Δ119908119896+ nabla119905Φ120583(119911119896)119879
Δ119905119896
+ nabla120585Φ120583(119911119896)119879
Δ120585119896+ nabla119887Φ120583(119911119896)119879
Δ119887119896lt 0
(B3)
Consequently there exists a 119896isin (0
119896] such that the fourth
inequality in (23) is satisfied for all 120572119896isin (0
119896]
On the other hand since (119909119896 119905119896) is strictly feasible the
point (119909119896 119905119896) + 120572(Δ119909 119905119896 Δ119905119896) will be feasible for all 120572 gt 0
sufficiently small The proof is complete
C Convergence Analysis
This appendix is devoted to the global convergence of theinterior point method We first show the convergence ofAlgorithm 2 when a fixed 120583 is applied to the barrier subprob-lem (14)
Lemma C1 Let (119911119896 120582119896) be generated by Algorithm 2 Then119911119896 are strictly feasible for problem (10) and the Lagrangian
multipliers 120582119896 are bounded from above
Proof For the sake of convenience we use 119892119894(119911) 119894 =
1 2 3119898 + 2119899 to denote the constraint functions of theconstrained problem (10)
We first show that 119911119896 are strictly feasible Suppose onthe contrary that there exists an infinite index subset K andan index 119894 isin 1 2 3119898 + 2119899 such that 119892
119894(119911119896)K darr 0
By the definition of Φ120583(119911) and 119891(119911) = sum
119898
119895=1119905119894+ 119862sum
119899
119894=1120585119894
being bounded from below in the feasible set it must holdthat Φ
120583(119911119896)K rarr infin
However the line search rule implies that the sequenceΦ120583(119911119896) is decreasing So we get a contradiction Conse-
quently for any 119894 isin 1 2 3119898+2119899 119892119894(119909119896 119905119896) is bounded
away from zero The boundedness of 120582119896 then follows from(24)ndash(27)
Lemma C2 Let (119911119896 120582119896) and (Δ119911119896 119896+1
) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenthe sequence (Δ119911
119896 119896+1
)K is bounded
Proof Again we use 119892119894(119911) 119894 = 1 2 3119898+2119899 to denote the
constraint functions of the constrained problem (10)We suppose on the contrary that there exists an infinite
subsetK1015840 sube K such that the subsequence (Δ119911119896 119896+1
)K1015840
tends to infinity Let 119911119896K1015840 rarr 119911
lowast It follows fromLemma C1 that there is an infinite subset K sube K1015840 such that120582119896K rarr 120582
lowast and 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 Thus
we have
119878119896997888rarr 119878lowast (C1)
as 119896 rarr infin with 119896 isin K By Proposition B1 119878lowast is positivedefinite Since the right hand size of (18) is bounded andcontinuous the unboundedness of (Δ119911
119896 119896+1
)K impliesthat the limit of the coefficient matrices of (18) is singularwhich yields a contradiction The proof is complete
Lemma C3 Let (119911119896 120582119896) and (Δ119911119896 119896+1
) be generated byAlgorithm 2 If 119911119896K is a convergent subsequence of 119911119896 thenone has Δ119911
119896K rarr 0
Proof It follows from Lemma C1 that the sequence Δ119911119896K
is bounded Suppose on the contrary that there exists aninfinite subsetK1015840 sube K such that Δ119911
119896K1015840 rarr Δ119911
lowast= 0 Since
subsequences 119911119896K1015840 120582
119896K1015840 and
119896K1015840 are all bounded
there are points 119911lowast 120582lowast and
lowast as well as an infinite indexset K sube K1015840 such that 119911119896K rarr 119911
lowast 120582119896K rarr 120582lowast and
119896K rarr
lowast By Lemma C1 we have 119892119894(119911lowast) gt 0 forall119894 isin
1 2 3119898 + 2119899 Similar to the proof of Lemma 3 it is notdifficult to get
nabla119908Φ120583(119911lowast)119879
Δ119908lowast+ nabla119905Φ120583(119911lowast)119879
Δ119905lowast
+ nabla120585Φ120583(119911lowast)119879
Δ120585lowast+ nabla119887Φ120583(119911lowast)119879
Δ119887lowast
= minus (Δ119908lowast119879
Δ120585lowast119879
Δ119905lowast119879
Δ119887lowast) 119878(
Δ119908lowast
Δ120585lowast
Δ119905lowast
Δ119887lowast
) lt 0
(C2)
Since 119892119894(119911lowast) gt 0 forall119894 isin 1 2 3119898 + 2119899 there exists a
isin (0 1] such that for all 120572 isin (0 ]
119892119894(119911lowast+ 120572Δ119911
lowast) gt 0 forall119894 isin 1 2 3119899 (C3)
Taking into account 1205911isin (0 12) we claim that there exists
a isin (0 ] such that the following inequality holds for all120572 isin (0 ]
Φ120583(119911lowast+ 120572Δ119911
lowast) minus Φ120583(119911lowast) le 11120591
1120572nabla119911Φ120583(119911lowast)119879
Δ119911lowast (C4)
Let 119898lowast
= min119895 | 120573119895isin (0 ] 119895 = 0 1 and 120572
lowast= 120573119898lowast
It follows from (C3) that the following inequality is satisfiedfor all 119896 isin K sufficient large and any 119894 isin 1 2 3119898 + 2119899
119892119894(119911119896+ 120572lowastΔ119911119896) gt 0 (C5)
Journal of Applied Mathematics 15
Moreover we have
Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)
= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)
+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)
+ Φ120583(119911lowast) minus Φ120583(119911119896)
le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
le 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
(C6)
By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572
119896ge 120572lowastfor all 119896 isin K
large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that
Φ120583(119911119896+1
) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) +
1
21205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
(C7)
This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the
fact that Φ120583(119911119896)K is bounded from below The proof is
complete
Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3
C1 Proof of Theorem 5
Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911
lowast lowast)
Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =
(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =
(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we
obtain
lowast(1)
minus lowast(2)
minus 119883119879119884119879lowast(4)
= 0
119862 lowast 119890119899minus lowast(4)
minus lowast(5)
= 0
119890119898minus 2119879lowastlowast(1)
minus 2119879lowastlowast(2)
minus lowast(3)
= 0
119910119879lowast(4)
= 0
(119879lowast2
minus 119882lowast) lowast(1)
minus 120583119890119898
= 0
(119879lowast2
+ 119882lowast) lowast(2)
minus 120583119890119898
= 0
119879lowastlowast(3)
minus 120583119890119898
= 0
119875lowastlowast(4)
minus 120583119890119899= 0
Ξlowastlowast(5)
minus 120583119890119899= 0
(C8)
This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)
C2 Proof of Theorem 6
Proof (i) Without loss of generality we suppose that thebounded subsequence (119911
119895 119895)J converges to some point
(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since
120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that
Res (119911119895 119895 120583119895)
J997888rarr Res (119911lowast lowast 0) = 0
119895J
997888rarr lowastge 0
(C9)
Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585
119895= max119895
infin 1 and
119895= 120585minus1
119895119895 Obviously
119895 is bounded Hence there exists an infinite subsetJ1015840 sube J
such that 1198951015840J rarr = 0 and 119895infin
= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911
119896 120582119896) by 120585119895 then taking
limits as 119895 rarr infin with 119895 isin J we get
(1)
minus (2)
minus 119883119879119884119879(4)
= 0
(4)
+ (5)
= 0
2119879lowast(1)
+ 2119879lowast(2)
+ (3)
= 0
119910119879(4)
= 0
(119879lowast2
minus 119882lowast) (1)
= 0
(119879lowast2
+ 119882lowast) (2)
= 0
119879lowast(3)
= 0
119875lowast(4)
= 0
Ξlowast(5)
= 0
(C10)
Since 119911lowast is feasible the above equations has shown that 119911lowast is
a Fritz-John point of problem (10)
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China
References
[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th
16 Journal of Applied Mathematics
International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997
[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003
[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002
[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006
[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007
[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003
[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004
[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006
[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003
[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000
[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004
[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998
[13] JWeston A Elisseeff B Scholkopf andM Tipping ldquoUse of thezero-norm with linear models and kernel methodsrdquo Journal ofMachine Learning Research vol 3 pp 1439ndash1461 2003
[14] A B Chan N Vasconcelos and G R G Lanckriet ldquoDirectconvex relaxations of sparse SVMrdquo in Proceedings of the 24thInternational Conference on Machine Learning (ICML rsquo07) pp145ndash153 Corvallis Ore USA June 2007
[15] GM Fung andO L Mangasarian ldquoA feature selection Newtonmethod for support vector machine classificationrdquo Computa-tional Optimization and Applications vol 28 no 2 pp 185ndash2022004
[16] J Zhu S Rosset T Hastie and R Tibshirani ldquo1-norm supportvector machinesrdquo Advances in Neural Information ProcessingSystems vol 16 no 1 pp 49ndash56 2004
[17] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society B Methodological vol 58no 1 pp 267ndash288 1996
[18] T Zhang ldquoAnalysis of multi-stage convex relaxation for sparseregularizationrdquo Journal of Machine Learning Research vol 11pp 1081ndash1107 2010
[19] R Chartrand ldquoExact reconstruction of sparse signals via non-convexminimizationrdquo IEEE Signal Processing Letters vol 14 no10 pp 707ndash710 2007
[20] R Chartrand ldquoNonconvex regularizaron for shape preserva-tionrdquo inProceedings of the 14th IEEE International Conference onImage Processing (ICIP rsquo07) vol 1 pp I293ndashI296 San AntonioTex USA September 2007
[21] Z Xu H Zhang Y Wang X Chang and Y Liang ldquo11987112
regularizationrdquo Science China Information Sciences vol 53 no6 pp 1159ndash1169 2010
[22] W J Chen and Y J Tian ldquoLp-norm proximal support vectormachine and its applicationsrdquo Procedia Computer Science vol1 no 1 pp 2417ndash2423 2010
[23] J Liu J Li W Xu and Y Shi ldquoA weighted L119902adaptive least
squares support vector machine classifiers-Robust and sparseapproximationrdquo Expert Systems with Applications vol 38 no 3pp 2253ndash2259 2011
[24] Y Liu H H Zhang C Park and J Ahn ldquoSupport vector mach-ines with adaptive 119871
119902penaltyrdquo Computational Statistics amp Data
Analysis vol 51 no 12 pp 6380ndash6394 2007[25] Z Liu S Lin andM Tan ldquoSparse support vectormachines with
Lp penalty for biomarker identificationrdquo IEEEACM Transac-tions on Computational Biology and Bioinformatics vol 7 no1 pp 100ndash107 2010
[26] A Rakotomamonjy R Flamary G Gasso and S Canu ldquol119901-l119902
penalty for sparse linear and sparse multiple kernel multitasklearningrdquo IEEE Transactions on Neural Networks vol 22 no 8pp 1307ndash1320 2011
[27] J Y Tan Z Zhang L Zhen C H Zhang and N Y DengldquoAdaptive feature selection via a new version of support vectormachinerdquo Neural Computing and Applications vol 23 no 3-4pp 937ndash945 2013
[28] Z B Xu X Y Chang F M Xu and H Zhang ldquo11987112
regu-larization an iterative half thresholding algorithmrdquo IEEETrans-actions on Neural Networks and Learning Systems vol 23 no 7pp 1013ndash1027 2012
[29] D Ge X Jiang and Y Ye ldquoA note on the complexity of 119871119901
minimizationrdquo Mathematical Programming vol 129 no 2 pp285ndash299 2011
[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995
[31] J Fan and H Peng ldquoNonconcave penalized likelihood with adiverging number of parametersrdquo The Annals of Statistics vol32 no 3 pp 928ndash961 2004
[32] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoThe Annals of Statistics vol 28 no 5 pp 1356ndash1378 2000
[33] B S Tian and X Q Yang ldquoAn interior-point l 11987112-penalty-
method for nonlinear programmingrdquo Technical ReportDepartment of Applied Mathematics Hong Kong PolytechnicUniversity 2013
[34] L Armijo ldquoMinimization of functions having Lipschitz con-tinuous first partial derivativesrdquo Pacific Journal of Mathematicsvol 16 pp 1ndash3 1966
[35] F John ldquoExtremum problems with inequalities as side-con-ditionsrdquo in Studies and Essays Courant Anniversary Volume pp187ndash1204 1948
[36] D J Newman S Hettich C L Blake and C J Merz ldquoUCIrepository of machine learning databasesrdquo Technical Report9702 Department of Information and Computer Science Uni-verisity of California Irvine Calif USA 1998 httparchiveicsucieduml
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Applied Mathematics 15
Moreover we have
Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911119896)
= Φ120583(119911lowast+ 120572lowastΔ119911lowast) minus Φ120583(119911lowast)
+ Φ120583(119911119896+ 120572lowastΔ119911119896) minus Φ120583(119911lowast+ 120572lowastΔ119911lowast)
+ Φ120583(119911lowast) minus Φ120583(119911119896)
le 1051205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
le 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
(C6)
By the backtracking line search rule the last inequalitytogether with (C5) yields the inequality120572
119896ge 120572lowastfor all 119896 isin K
large enough Consequently when 119896 isin K is sufficiently largewe have from (23) and (C3) that
Φ120583(119911119896+1
) le Φ120583(119911119896) + 1205911120572119896nabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) + 1205911120572lowastnabla119911Φ120583(119911119896)119879
Δ119911119896
le Φ120583(119911119896) +
1
21205911120572lowastnabla119911Φ120583(119911lowast)119879
Δ119911lowast
(C7)
This shows that Φ120583(119911119896)K rarr minusinfin contradicting with the
fact that Φ120583(119911119896)K is bounded from below The proof is
complete
Then we will establish the convergence of Algorithms 2and 4 that is Theorems 5 and 6 in Section 3
C1 Proof of Theorem 5
Proof We let (119911lowast lowast) be a limit point of (119911119896 119896) and thesubsequence (119911119896 119896)K converges to (119911
lowast lowast)
Recall that (Δ119911119896 119896) are the solution of (18) with (119911 120582) =
(119911119896 120582119896) Taking limits in both sides of (18) with (119911 120582) =
(119911119896 120582119896) as 119896 rarr infinwith 119896 isin K by the use of Lemma C1 we
obtain
lowast(1)
minus lowast(2)
minus 119883119879119884119879lowast(4)
= 0
119862 lowast 119890119899minus lowast(4)
minus lowast(5)
= 0
119890119898minus 2119879lowastlowast(1)
minus 2119879lowastlowast(2)
minus lowast(3)
= 0
119910119879lowast(4)
= 0
(119879lowast2
minus 119882lowast) lowast(1)
minus 120583119890119898
= 0
(119879lowast2
+ 119882lowast) lowast(2)
minus 120583119890119898
= 0
119879lowastlowast(3)
minus 120583119890119898
= 0
119875lowastlowast(4)
minus 120583119890119899= 0
Ξlowastlowast(5)
minus 120583119890119899= 0
(C8)
This shows that (119911lowast lowast) satisfies the first-order optimalityconditions (15)
C2 Proof of Theorem 6
Proof (i) Without loss of generality we suppose that thebounded subsequence (119911
119895 119895)J converges to some point
(119911lowast lowast) It is clear that 119911lowast is a feasible point of (10) Since
120583119895 120598120583119895 rarr 0 it follows from (29) and (30) that
Res (119911119895 119895 120583119895)
J997888rarr Res (119911lowast lowast 0) = 0
119895J
997888rarr lowastge 0
(C9)
Consequently the (119911lowast lowast) satisfies the KKT conditions (13)(ii) Let 120585
119895= max119895
infin 1 and
119895= 120585minus1
119895119895 Obviously
119895 is bounded Hence there exists an infinite subsetJ1015840 sube J
such that 1198951015840J rarr = 0 and 119895infin
= 1 for large 119895 isin JFrom (30) we know that ge 0 Dividing both sides of thefirst inequality of (18) with (119911 120582) = (119911
119896 120582119896) by 120585119895 then taking
limits as 119895 rarr infin with 119895 isin J we get
(1)
minus (2)
minus 119883119879119884119879(4)
= 0
(4)
+ (5)
= 0
2119879lowast(1)
+ 2119879lowast(2)
+ (3)
= 0
119910119879(4)
= 0
(119879lowast2
minus 119882lowast) (1)
= 0
(119879lowast2
+ 119882lowast) (2)
= 0
119879lowast(3)
= 0
119875lowast(4)
= 0
Ξlowast(5)
= 0
(C10)
Since 119911lowast is feasible the above equations has shown that 119911lowast is
a Fritz-John point of problem (10)
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
The authors would like to acknowledge support for thisproject from the National Science Foundation (NSF Grants11371154 11071087 11271069 and 61103202) of China
References
[1] Y M Yang and J O Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14th
16 Journal of Applied Mathematics
International Conference on Machine Learning (ICML rsquo97) vol97 pp 412ndash420 1997
[2] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquoThe Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003
[3] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002
[4] H H Zhang J Ahn X Lin and C Park ldquoGene selection usingsupport vector machines with non-convex penaltyrdquo Bioinfor-matics vol 22 no 1 pp 88ndash95 2006
[5] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007
[6] J Weston F Perez-Cruz O Bousquet O Chapelle A Elisseeffand B Scholkopf ldquoFeature selection and transduction forprediction of molecular bioactivity for drug designrdquo Bioinfor-matics vol 19 no 6 pp 764ndash771 2003
[7] Y Liu ldquoA comparative study on feature selection methods fordrug discoveryrdquo Journal of Chemical Information and ComputerSciences vol 44 no 5 pp 1823ndash1828 2004
[8] I Guyon J Weston S Barnhill and V Vapnik Eds FeatureExtraction Foundations and Applications vol 207 Springer2006
[9] A Rakotomamonjy ldquoVariable selection using SVM-based cri-teriardquo Journal of Machine Learning Research vol 3 no 7-8 pp1357ndash1370 2003
[10] J Weston S Mukherjee O Chapelle M Pontil T Poggio andV Vapnik ldquoFeature selection for SVMsrdquo in Proceedings of theConference onNeural Information Processing Systems vol 13 pp668ndash674 Denver Colo USA 2000
[11] D Peleg and R Meir ldquoA feature selection algorithm based onthe global minimization of a generalization error boundrdquo inProceedings of the Conference on Neural Information ProcessingSystems Vancouver Canada 2004
[12] P S Bradley and O L Mangasarian ldquoFeature selection via con-cave minimization and support vector machinesrdquo in Proceed-ings of the International Conference on Machine Learning pp82ndash90 San Francisco Calif USA 1998
[13] JWeston A Elisseeff B Scholkopf andM Tipping ldquoUse of thezero-norm with linear models and kernel methodsrdquo Journal ofMachine Learning Research vol 3 pp 1439ndash1461 2003
[14] A B Chan N Vasconcelos and G R G Lanckriet ldquoDirectconvex relaxations of sparse SVMrdquo in Proceedings of the 24thInternational Conference on Machine Learning (ICML rsquo07) pp145ndash153 Corvallis Ore USA June 2007
[15] GM Fung andO L Mangasarian ldquoA feature selection Newtonmethod for support vector machine classificationrdquo Computa-tional Optimization and Applications vol 28 no 2 pp 185ndash2022004
[16] J Zhu S Rosset T Hastie and R Tibshirani ldquo1-norm supportvector machinesrdquo Advances in Neural Information ProcessingSystems vol 16 no 1 pp 49ndash56 2004
[17] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society B Methodological vol 58no 1 pp 267ndash288 1996
[18] T Zhang ldquoAnalysis of multi-stage convex relaxation for sparseregularizationrdquo Journal of Machine Learning Research vol 11pp 1081ndash1107 2010
[19] R Chartrand ldquoExact reconstruction of sparse signals via non-convexminimizationrdquo IEEE Signal Processing Letters vol 14 no10 pp 707ndash710 2007
[20] R Chartrand ldquoNonconvex regularizaron for shape preserva-tionrdquo inProceedings of the 14th IEEE International Conference onImage Processing (ICIP rsquo07) vol 1 pp I293ndashI296 San AntonioTex USA September 2007
[21] Z Xu H Zhang Y Wang X Chang and Y Liang ldquo11987112
regularizationrdquo Science China Information Sciences vol 53 no6 pp 1159ndash1169 2010
[22] W J Chen and Y J Tian ldquoLp-norm proximal support vectormachine and its applicationsrdquo Procedia Computer Science vol1 no 1 pp 2417ndash2423 2010
[23] J Liu J Li W Xu and Y Shi ldquoA weighted L119902adaptive least
squares support vector machine classifiers-Robust and sparseapproximationrdquo Expert Systems with Applications vol 38 no 3pp 2253ndash2259 2011
[24] Y Liu H H Zhang C Park and J Ahn ldquoSupport vector mach-ines with adaptive 119871
119902penaltyrdquo Computational Statistics amp Data
Analysis vol 51 no 12 pp 6380ndash6394 2007[25] Z Liu S Lin andM Tan ldquoSparse support vectormachines with
Lp penalty for biomarker identificationrdquo IEEEACM Transac-tions on Computational Biology and Bioinformatics vol 7 no1 pp 100ndash107 2010
[26] A Rakotomamonjy R Flamary G Gasso and S Canu ldquol119901-l119902
penalty for sparse linear and sparse multiple kernel multitasklearningrdquo IEEE Transactions on Neural Networks vol 22 no 8pp 1307ndash1320 2011
[27] J Y Tan Z Zhang L Zhen C H Zhang and N Y DengldquoAdaptive feature selection via a new version of support vectormachinerdquo Neural Computing and Applications vol 23 no 3-4pp 937ndash945 2013
[28] Z B Xu X Y Chang F M Xu and H Zhang ldquo11987112
regu-larization an iterative half thresholding algorithmrdquo IEEETrans-actions on Neural Networks and Learning Systems vol 23 no 7pp 1013ndash1027 2012
[29] D Ge X Jiang and Y Ye ldquoA note on the complexity of 119871119901
minimizationrdquo Mathematical Programming vol 129 no 2 pp285ndash299 2011
[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995
[31] J Fan and H Peng ldquoNonconcave penalized likelihood with adiverging number of parametersrdquo The Annals of Statistics vol32 no 3 pp 928ndash961 2004
[32] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoThe Annals of Statistics vol 28 no 5 pp 1356ndash1378 2000
[33] B S Tian and X Q Yang ldquoAn interior-point l 11987112-penalty-
method for nonlinear programmingrdquo Technical ReportDepartment of Applied Mathematics Hong Kong PolytechnicUniversity 2013
[34] L Armijo ldquoMinimization of functions having Lipschitz con-tinuous first partial derivativesrdquo Pacific Journal of Mathematicsvol 16 pp 1ndash3 1966
[35] F John ldquoExtremum problems with inequalities as side-con-ditionsrdquo in Studies and Essays Courant Anniversary Volume pp187ndash1204 1948
[36] D J Newman S Hettich C L Blake and C J Merz ldquoUCIrepository of machine learning databasesrdquo Technical Report9702 Department of Information and Computer Science Uni-verisity of California Irvine Calif USA 1998 httparchiveicsucieduml
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
16 Journal of Applied Mathematics
International Conference on Machine Learning (ICML '97), vol. 97, pp. 412–420, 1997.
[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88–95, 2006.
[5] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[6] J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764–771, 2003.
[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823–1828, 2004.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.
[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357–1370, 2003.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668–674, Denver, Colo, USA, 2000.
[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.
[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82–90, San Francisco, Calif, USA, 1998.
[13] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439–1461, 2003.
[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145–153, Corvallis, Ore, USA, June 2007.
[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185–202, 2004.
[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49–56, 2004.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081–1107, 2010.
[19] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293–I296, San Antonio, Tex, USA, September 2007.
[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "$L_{1/2}$ regularization," Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
[22] W. J. Chen and Y. J. Tian, "$L_p$-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417–2423, 2010.
[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted $L_q$ adaptive least squares support vector machine classifiers: robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253–2259, 2011.
[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive $L_q$ penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380–6394, 2007.
[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with $L_p$ penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100–107, 2010.
[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "$l_p$-$l_q$ penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307–1320, 2011.
[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937–945, 2013.
[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "$L_{1/2}$ regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of $L_p$ minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928–961, 2004.
[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[33] B. S. Tian and X. Q. Yang, "An interior-point $L_{1/2}$-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.
[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays: Courant Anniversary Volume, pp. 187–204, 1948.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml.