
Neurocomputing 139 (2014) 345–356. http://dx.doi.org/10.1016/j.neucom.2014.02.022

A kernel-based sparsity preserving method for semi-supervised classification

Nannan Gu a, Di Wang b,*, Mingyu Fan b, Deyu Meng c

a School of Statistics, Capital University of Economics and Business, Beijing 100070, China
b Institute of Intelligent Systems and Decision, Wenzhou University, Wenzhou 325000, China
c Institute for Information and System Sciences and Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China
* Corresponding author. E-mail addresses: [email protected] (N. Gu), [email protected] (D. Wang), [email protected] (M. Fan), [email protected] (D. Meng).

Article info

Article history: Received 7 January 2013; Received in revised form 12 January 2014; Accepted 7 February 2014; Available online 8 April 2014. Communicated by Steven Hoi.

Keywords: Sparse representation; Feature extraction; Manifold regularization; Semi-supervised learning; Semi-supervised classification

Abstract

In this paper, we propose an effective approach to semi-supervised classification through kernel-based sparse representation. The new method computes the sparse representation of data in the feature space, and then the learner is subject to a cost function which aims to preserve the sparse representing coefficients. By mapping the data into the feature space, the so-called "l2-norm problem" that may be encountered when directly applying sparse representations to non-image data classification tasks is naturally alleviated; meanwhile, the label of a data point can be reconstructed more precisely from the labels of other data points using the sparse representing coefficients. Inherited from sparse representation, our method can adaptively establish the relationship between data points and has high discriminative ability. Furthermore, the new method has a natural explicit multi-class expression for new samples. Experimental results on several benchmark data sets are provided to show the effectiveness of our method.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Semi-supervised classification (SSC), which aims to utilize both labeled and unlabeled data simultaneously to train classifiers, has emerged in recent years, since labeled samples are usually scarce and time-consuming to obtain, while unlabeled data are abundant and relatively easy to get. Moreover, under certain assumptions, it has been shown that the information conveyed by the marginal distribution of the unlabeled samples can help to boost the classification performance. Since SSC requires less human effort and can give higher accuracy, it has attracted considerable attention in the fields of data mining and machine learning. So far, many SSC methods have been proposed, such as the Expectation-Maximization algorithm for semi-supervised generative mixture models [1,2], self-training [3,4], co-training [5], and transductive support vector machines [6,7].

Among the various kinds of SSC approaches, graph-based approaches have become one of the most active research areas. They first model the whole data set as a graph and then perform classification on the graph based on certain assumptions, such as the cluster assumption [8], which states that points are likely to belong to the same class if they lie in the same cluster. Some representative methods include the Gaussian Random Field (GRF) [9,10], Local and Global Consistency (LGC) [11], and Laplacian Regularized Least Square Classification (LapRLSC) [12]. Many graph-based SSC methods can be unified into the Manifold Regularization (MR) framework [12], which consists of a fitting term for the labeled points, a regularization term to control the complexity of the classifier, and another regularization term to control the smoothness of the classifier with respect to the geometric distribution of the data.

Though graph-based SSC methods have been successfully applied in many fields [13], certain issues remain unsolved. For example, the selection of the neighborhood size for the adjacency graph is a difficult parameter selection problem, the manifold structure assumed by many graph-based methods lacks convincing evidence, and explicit multi-class classifiers cannot be obtained for many graph-based methods.

To address the above issues, we propose to utilize the power of the sparse representation of data. Sparse representation refers to the reconstruction of a data point by a linear combination of a small number of elementary points from a dictionary [14,15]. It has the advantages of high discriminative power and of adaptively establishing the relationship between data points [16]. Recently, several classification methods based on sparse representations have emerged [17–19].


These methods have shown great discriminative ability on image data sets; however, the following two issues still need to be settled:

(1) The l2-norm problem: For general natural data sets, where different points may have different l2-norms, the sparse representation of a data point may be inclined to select points with larger l2-norms rather than points from the same class (the reason is given in Section 3.1). Moreover, normalizing all points to have unit l2-norm may change the structure of the data set.

(2) The imprecise reconstruction of labels by sparse representations: Sparsity-based methods [18,19] usually rely on the assumption that the label of a data point can be reconstructed from the labels of other points using the coefficients of the sparse representation. However, the classification function is usually nonlinear, so the reconstruction is not precise, which may in turn decrease the classification accuracy of the obtained classifier.

In view of this, we propose the Kernel-based Sparse Regularization (KSR) approach to semi-supervised learning. First, the data points are projected into the kernel space by the kernel trick, and then the sparse representation of each projected point is computed in the kernel space. Finally, inspired by the MR framework, a multi-class classifier is constructed based on the discriminative ability of the obtained sparse representation. In this way, the projected data points in the kernel space naturally have unit l2-norm, which overcomes the l2-norm problem described above. Meanwhile, the classification function is linear in the kernel space, so the sparse representing coefficients can be preserved more precisely. The proposed KSR approach not only has an explicit formulation for multi-class classification problems and can easily be applied to any multi-class (including binary) case, but also inherits from sparse representation the advantages of high discriminative ability and of adaptively establishing the relationship between data points. Experiments on real-world data sets demonstrate the effectiveness and high discriminative ability of our approach.

The rest of the paper is organized as follows. Some previous work is introduced in Section 2. The proposed KSR approach and the derived Kernel-based Sparse Regularized Least Square Classification (KSR-LSC) algorithm are presented in Section 3. In Section 4, experiments on benchmark real-world data sets are reported. Final conclusions are given in Section 5.

2. Previous work

In this paper, we represent the training data set of the semi-supervised classification problem as $\{(x_i, z_i), x_{l+j};\ i = 1, \dots, l,\ j = 1, \dots, u\}$, where $l$ and $u$ are the numbers of labeled and unlabeled data points, respectively, $x_i \in \mathbb{R}^d$ is a data point, $z_i \in \{1, \dots, C\}$ is the class label of $x_i$, and $C$ is the total number of classes. Throughout this paper, data points and the corresponding label vectors are column vectors. All vectors are denoted by lower-case letters and all matrices by capital letters.

2.1. The MR framework

The MR framework [12] was proposed for learning based on the theory of Reproducing Kernel Hilbert Space (RKHS). It can be expressed in the form

$$f^{*} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{l}\sum_{i=1}^{l} V(x_i, z_i, f) + \gamma_K \|f\|_K^2 + \gamma_I \|f\|_I^2 \right\}, \qquad (1)$$

where $f$ is the desired classification function, $V$ is some loss function, $\|f\|_K^2$ is the squared norm of $f$ in the RKHS $\mathcal{H}_K$, which regularizes the complexity of the classifier, and $\|f\|_I^2$ is another regularization term that controls the smoothness of the classifier on the data manifold. If we define

$$\|f\|_I^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} w_{ij}\, [f(x_i) - f(x_j)]^2,$$

where $w_{ij}$ is the edge weight in the data adjacency graph, then the solution of (1) admits the representation [12]

$$f^{*}(x) = \sum_{i=1}^{l+u} \alpha_i\, k(x_i, x),$$

where $k(\cdot,\cdot)$ is some Mercer kernel function associated with the RKHS $\mathcal{H}_K$.
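To make the smoothness regularizer concrete, the following minimal numpy sketch evaluates the penalty $\|f\|_I^2$ defined above via the standard graph-Laplacian identity $\sum_{i,j} w_{ij}[f(x_i)-f(x_j)]^2 = 2 f^T L f$. The dense Gaussian affinity and all variable names are our own illustration, not part of the paper.

```python
import numpy as np

def smoothness_penalty(f_vals, W, l, u):
    """Graph smoothness term (1/(l+u)^2) * sum_ij w_ij (f(x_i) - f(x_j))^2,
    evaluated via the Laplacian identity sum_ij w_ij (f_i - f_j)^2 = 2 f^T L f."""
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    return 2.0 * f_vals @ L @ f_vals / (l + u) ** 2

# toy usage: 5 points on a line with a dense Gaussian affinity as a stand-in graph
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
W = np.exp(-(X - X.T) ** 2)                 # w_ij = exp(-(x_i - x_j)^2)
print(smoothness_penalty(X.ravel(), W, l=2, u=3))
```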

Many SSC algorithms can be unified into the MR framework. However, some issues remain to be addressed. For example, a fixed neighborhood size is usually adopted to define the adjacency graph, which makes parameter selection difficult and cannot adapt to uneven data. Besides, these methods are often based on the assumption that high-dimensional data lie on a low-dimensional manifold, which may not hold in many cases. For these reasons, several classification methods have been proposed based on sparse representation; these methods can adaptively establish the relationship between data points and have high discriminative ability.

2.2. Sparse representation based classification methods

Sparse representation of data refers to reconstructing each data point as a linear combination of a small number of elementary points from a dictionary. For image data, if sufficient training samples are available for each class, it is possible to represent a sample as a linear combination of just the samples from the same class. This representation is naturally sparse, involving only a small fraction of the overall training samples, i.e., the samples from the same class. The sparse representation of data is therefore discriminative and can be utilized to train the classifier.

Based on the discriminative ability of the sparse representation of data, Wright et al. proposed a general classification algorithm named SRC for (image-based) object recognition [17], where each test sample is represented over an over-complete dictionary whose base elements are the training samples themselves. Specifically, the sparse representation of data is obtained by the l1-norm minimization problem [20–22]

$$\alpha^{*} = \arg\min_{\alpha \in \mathbb{R}^N} \|\alpha\|_1 \quad \text{subject to} \quad x^{*} = X\alpha, \qquad (2)$$

where $x^{*}$ is a test sample, $\alpha^{*}$ is the desired sparse representation of $x^{*}$, $X = [X_1, \dots, X_C] \in \mathbb{R}^{d \times N}$ is the training data matrix of the $C$ classes, and $\|\cdot\|_1$ is the l1-norm of a vector. For $\alpha^{*} \in \mathbb{R}^N$, denote by $\delta_i(\alpha^{*}) \in \mathbb{R}^N$ the vector whose only nonzero entries are the entries of $\alpha^{*}$ associated with class $i$. Then $x^{*}$ is classified as

$$\operatorname{identity}(x^{*}) = \arg\min_{i \in \{1,\dots,C\}} \|x^{*} - X\,\delta_i(\alpha^{*})\|_2.$$
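The SRC decision rule can be sketched as follows. This is our own illustration, not the authors' code: an ISTA-style Lasso solver stands in for the exact equality-constrained problem (2), and the function names (`ista_l1`, `src_classify`) are hypothetical.

```python
import numpy as np

def ista_l1(X, x, lam=0.01, n_iter=500):
    """Proximal-gradient (ISTA) solver for min_a 0.5*||x - X a||_2^2 + lam*||a||_1,
    used as a simple stand-in for the equality-constrained problem (2)."""
    a = np.zeros(X.shape[1])
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)               # 1 / Lipschitz constant
    for _ in range(n_iter):
        a -= step * (X.T @ (X @ a - x))                            # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)   # soft threshold
    return a

def src_classify(X, labels, x, n_classes, lam=0.01):
    """SRC decision rule: assign x to the class with the smallest reconstruction residual."""
    a = ista_l1(X, x, lam)
    residuals = [np.linalg.norm(x - X[:, labels == c] @ a[labels == c])
                 for c in range(n_classes)]
    return int(np.argmin(residuals))
```

The Lasso relaxation is used only because it is short and self-contained; any l1 solver (such as those cited in [20–22]) could be substituted.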

Fan et al. proposed an approach (named S-RLSC) to semi-supervised learning based on sparse representation [18]. In [18], the best sparse linear reconstruction coefficient vector $\alpha_i \in \mathbb{R}^{l+u}$ is first computed for each data point $x_i$ $(i = 1, \dots, l+u)$ via the l1-norm minimization (2). The sparse representation is naturally discriminative, so it is meaningful to require that the classification function $f$ on the data points preserve the sparse representing coefficients. Hence, the label of a data point can be reconstructed from the labels of other data points using the sparse representing coefficients.

Based on this, the approach defines a penalty term

$$\|f\|_I^2 = \frac{1}{(l+u)^2} \sum_{i=1}^{l+u} \Big\| f(x_i) - \sum_{j=1}^{l+u} \alpha_{ij}\, f(x_j) \Big\|_2^2, \qquad (3)$$

where $\alpha_{ij}$ is the j-th element of the sparse representing coefficient vector $\alpha_i$. The classifier is then derived from the MR framework (1) with $\|f\|_I^2$ defined by (3).

These methods have shown great discriminative ability on image data sets, where all points can be normalized to have unit l2-norm; this normalization does not deteriorate the classification results. However, for general natural data sets, these methods may suffer from the l2-norm problem. Moreover, the S-RLSC method is based on the assumption that the label of a data point can be reconstructed from the labels of other points using the sparse representing coefficients. Since the classification function $f$ is usually nonlinear, $\|f\|_I^2$ defined by (3) may not be a good criterion for evaluating how well $f$ preserves the sparse representing coefficients.

3. Kernel-based sparse regularization

In this section, we first explain the l2-norm problem in detail and then present the kernel-based sparse representation of data. Finally, we propose the Kernel-based Sparse Regularization (KSR) approach and derive the associated Kernel-based Sparse Regularized Least Square Classification (KSR-LSC) algorithm.

3.1. Sparse representation of data and the l2-norm problem

There have been various models proposed for recognition; one simple approach models the samples from the same class as lying on a linear subspace. Suppose that there are sufficient training samples of the k-th class, denoted as $A_k = [x_{k,1}, \dots, x_{k,n_k}] \in \mathbb{R}^{d \times n_k}$, where $n_k$ is the number of samples of the k-th class. Any data point $x_{k,i}$ from the k-th class approximately lies in the linear span of the other data points of that class, which can be written as

$$x_{k,i} = \alpha_{k,1} x_{k,1} + \dots + \alpha_{k,i-1} x_{k,i-1} + \alpha_{k,i+1} x_{k,i+1} + \dots + \alpha_{k,n_k} x_{k,n_k}.$$

Here, $\{\alpha_{k,j}\}_{j=1}^{n_k}$ are the reconstruction coefficients. Further, we can represent $x_{k,i}$ as a linear combination of all training data,

$$x_{k,i} = X\alpha, \qquad (4)$$

where $X = [A_1, \dots, A_C] \in \mathbb{R}^{d \times (l+u)}$ is the training data matrix of the $C$ classes, and $\alpha = [0, \dots, 0, \alpha_{k,1}, \dots, \alpha_{k,i-1}, 0, \alpha_{k,i+1}, \dots, \alpha_{k,n_k}, 0, \dots, 0]^T \in \mathbb{R}^{l+u}$ is a coefficient vector whose nonzero entries are associated only with the k-th class.

If $d < l+u$, (4) is underdetermined, and this problem can be solved by the following sparse representation problem:

$$\alpha_i^{*} = \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \|\alpha_i\|_1 \quad \text{s.t.} \quad x_i = X\alpha_i,$$

where $\alpha_i \in \mathbb{R}^{l+u}$ is the sparse representing coefficient vector of the training point $x_i$. Thus, seeking the sparsest representation can evidently help us to automatically discriminate between the classes in the training set, and sparse representations of data can thereby facilitate the targeted classification tasks.

However, consider the experiment shown in Fig. 1(a), where the data set consists of 300 data points from two classes, C1 (blue dots) and C2 (black dots). For the point $x^{*} = (1.5, 1.5)$ (green dot), we compute its sparse representing coefficient vector $\alpha^{*} = [\alpha^{*}_1, \dots, \alpha^{*}_{300}]$ with the corresponding software package. The 20 largest elements of $\alpha^{*}$ are 0.47515, 0.00127, 0.00124, 0.00113, 0.00108, 0.00085, 0.00077, 0.00073, 0.00073, 0.00072, 0.00071, 0.00069, 0.00069, 0.00069, 0.00065, 0.00059, 0.00058, 0.00054, 0.00051 and 0.00050. The remaining elements of $\alpha^{*}$ are smaller than the 20th value (0.00050), so the contribution of the corresponding data points to the representation of $x^{*}$ can be neglected. The 20 data points corresponding to these coefficients are marked with red squares in Fig. 1(a). We can see that the sparse representation of $x^{*}$ mainly involves points from the other class; in this case the sparse representation is not discriminative.

We call the above problem "the l2-norm problem" in sparse representation of data. The reason for this problem is that the data points in the data set may have different l2-norms, so the sparse representation of one data point may be inclined to select the data points with larger l2-norms whenever possible. For the above data set, the l2-norms of the 20 points involved in the sparse representation of $x^{*}$ are 4.3326, 4.1141, 3.9516, 4.0432, 4.0275, 4.3761, 4.0087, 4.0552, 4.4528, 3.8453, 3.8493, 4.0424, 4.2506, 4.1497, 4.1273, 4.0062, 4.4120, 4.0805, 4.1214 and 4.3070. By comparison, the l2-norm of $x^{*}$ is 2.12, and the l2-norms of points in C1 range from 1.414 to 2.828. Therefore, the sparse representation of $x^{*}$ tends to select the data points with the larger l2-norms in the overall norm distribution. Consider another simple example, shown in Fig. 1(b). There are six data points $\{x_1, \dots, x_6\}$, with $x_1, x_2, x_3$ (red dots) belonging to the first class and $x_4, x_5, x_6$ (blue squares) belonging to the second class. Their coordinates are $x_1 = (1, 1)$, $x_2 = (0.9, 1.1)$, $x_3 = (1.1, 0.9)$, $x_4 = (2.05, 1.95)$, $x_5 = (1.9, 2.1)$, $x_6 = (2.1, 1.9)$. Then we have

$$x_1 = \tfrac{1}{2}x_2 + \tfrac{1}{2}x_3 = \tfrac{1}{4}x_5 + \tfrac{1}{4}x_6 = X\,[0, \tfrac{1}{2}, \tfrac{1}{2}, 0, 0, 0]^T = X\,[0, 0, 0, 0, \tfrac{1}{4}, \tfrac{1}{4}]^T,$$

where $X = [x_1, \dots, x_6]$ is the data matrix. From this and (2) we can see that the sparse representation of $x_1$ will select $x_5$ and $x_6$ (which lie in a different class from $x_1$), but not $x_2$ and $x_3$ (which lie in the same class as $x_1$).

Fig. 1. Two illustrative examples for the l2-norm problem.

This illustrates that, when a data point can be expressed both as a linear combination of a group of data points with smaller l2-norms and as a linear combination of a group of data points with larger l2-norms, the sparse representation of this data point will select the combination that uses the group with larger l2-norms.
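A small numpy check of the arithmetic in the six-point example (our own illustration): both coefficient vectors reconstruct $x_1$ exactly, but the one supported on the larger-norm class has the smaller l1-norm, so it is the one an l1-minimizer prefers.

```python
import numpy as np

# Six points from the example above; columns of X are x1, ..., x6.
X = np.array([[1.0, 0.9, 1.1, 2.05, 1.9, 2.1],
              [1.0, 1.1, 0.9, 1.95, 2.1, 1.9]])
x1 = X[:, 0]

# Two exact representations of x1 (same-class vs. larger-norm class).
a_same  = np.array([0.0, 0.5, 0.5, 0.0, 0.0, 0.0])
a_other = np.array([0.0, 0.0, 0.0, 0.0, 0.25, 0.25])

for a in (a_same, a_other):
    print(np.allclose(X @ a, x1),          # both reconstruct x1 exactly
          np.round(np.abs(a).sum(), 3))    # l1-norms: 1.0 vs. 0.5
# The l1 minimization in (2) therefore prefers a_other, i.e. the points
# with larger l2-norms from the wrong class.
```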

Image data do not suffer from the l2-norm problem if they are normalized to have unit l2-norm. Here, normalization means that each data point $x \in \mathbb{R}^d$ is replaced by $\hat{x} = x / \|x\|_2$, which has unit l2-norm. This preprocessing changes the brightness of the images and is usually adopted in image classification tasks in order to eliminate the negative effects caused by differing brightness. After normalization, the image data points lie on a unit sphere $S^d \subset \mathbb{R}^d$, so the l2-norm problem does not occur. To illustrate this, take the two-dimensional case (d = 2) as an example. Suppose that there are two data classes, M1 and M2 (shown in Fig. 2). Assume that the data point $x_1 \in M_1$ can be expressed as a linear combination of $x_2 \in M_1$ and $x_3 \in M_1$. Denote by $q_1$ the intersection of the line through $x_2, x_3$ with the line through $x_1$ and the origin, and by $b_1$ the L2-distance between the origin and $q_1$. Then we have

$$x_1 = \frac{1}{b_1} q_1 = \frac{1}{b_1}(\beta_2 x_2 + \beta_3 x_3) = [x_1, x_2, \dots, x_n]\,\Big[0, \frac{\beta_2}{b_1}, \frac{\beta_3}{b_1}, 0, \dots, 0\Big]^T,$$

where $\beta_2$ and $\beta_3$ are positive real numbers satisfying $\beta_2 + \beta_3 = 1$; that is,

$$x_1 = X\alpha_1,$$

where $X = [x_1, x_2, \dots, x_n]$ is the data matrix, $\alpha_1 = [0, \beta_2/b_1, \beta_3/b_1, 0, \dots, 0]^T$ and $\|\alpha_1\|_1 = 1/b_1$. Similarly, if $x_1$ can be expressed as a linear combination of $x_2$ and $x_4 \in M_2$, then we also have

$$x_1 = X\hat{\alpha}_1,$$

where $\hat{\alpha}_1 = [0, \hat{\beta}_2/\hat{b}_1, 0, \hat{\beta}_4/\hat{b}_1, 0, \dots, 0]^T$ and $\|\hat{\alpha}_1\|_1 = 1/\hat{b}_1$. Here $\hat{b}_1$ is the distance between the origin and $q_2$, the intersection of the line through $x_2, x_4$ with the line through $x_1$ and the origin. From Fig. 2 it is easy to see that $b_1 > \hat{b}_1$, so $\|\alpha_1\|_1 < \|\hat{\alpha}_1\|_1$. Thus, the sparse representation of $x_1$ will be $x_1 = X\alpha_1$. This means that, when all data points have the same l2-norm, the sparse representation of a data point is apt to select points from the same class; that is, the sparse representation is discriminative. Therefore, owing to the normalization preprocessing, image data do not suffer from the l2-norm problem. For general natural data sets, however, the normalization preprocessing may distort the intrinsic structure of the data set, and is thus not a feasible strategy for solving the l2-norm problem.

3.2. The kernel-based sparse representation of data

The kernel trick [23] maps data from the original input space to another, higher (possibly infinite) dimensional Hilbert space, $\phi : x \mapsto \phi(x) \in \mathcal{H}_k$, so as to make the data points linearly separable in the new feature space. The most commonly used kernel function is the Gaussian kernel,

$$k(w, v) = \langle \phi(w), \phi(v) \rangle_{\mathcal{H}_k} = \exp\!\Big(-\frac{\|w - v\|^2}{2\sigma^2}\Big),$$

where $\mathcal{H}_k$ is the RKHS associated with the kernel $k(\cdot,\cdot)$ and $\sigma \in \mathbb{R}$ is a parameter. For any data point $x$ we have $\|\phi(x)\|_2^2 = k(x, x) = 1$, so the mapped point $\phi(x)$ naturally has unit l2-norm.

To overcome the l2-norm problem described in the last section, we propose to take advantage of the kernel trick to obtain a kernel-based sparse representation of the data. We first map the data into the feature space $\mathcal{H}_k$ by the mapping $\phi$ and then compute the sparse representation of each mapped point in the new feature space. In this way, all the data in $\mathcal{H}_k$ have unit l2-norm, and thus the obtained sparse representations are discriminative. Besides, since features of the same type are more easily grouped together and linearly separable in the feature space, the sparse representation of the data can be found more easily, and the reconstruction error may be reduced as well.

Remark 1. By the kernel trick, some data that are linearly inseparable in the original input space become linearly separable in the feature space. In other words, points of the same class should be grouped together, and points of different classes should be located in different regions of the unit sphere $S^D \subset \mathcal{H}_k$. If a point $\phi(x_i)$ can be represented by a point $\phi(x_j)$ of a different class, then $\phi(x_j)$ should be far from $\phi(x_i)$ on the sphere. Just as in the example shown in Fig. 2, we infer that the l1-norm of the corresponding representing coefficient vector also tends to be large. Therefore, the kernel-based solution also tends to alleviate the l2-norm problem for high-dimensional problems to a certain extent.

Remark 2. If data points of the same class are clustered into a number of disjoint clusters, our method will still be effective. From the analysis in Remark 1 we can see that, by the kernel trick, points which belong to different clusters of the same class should be located in different regions of the unit sphere $S^D$ in the kernel space. The sparse representation of a point $\phi(x)$ then tends to involve points from the same cluster of the same class. This shows that the sparse representation remains discriminative when the data points of a class form several disjoint clusters, and the proposed method is still effective in this case.

Remark 3. In this paper we propose to use the Gaussian kernel. However, other kernels can also be chosen, provided that the chosen kernel satisfies $k(x, x) = C$ for all $x \in \mathcal{X}$, where $C$ is a constant. This holds when the kernel is

(1) an isotropic kernel $k(x, y) = K(\|x - y\|)$ (e.g., the Gaussian kernel); or
(2) any normalized kernel $k(x, y) = K(x, y) / \sqrt{K(x, x)}\sqrt{K(y, y)}$.

A small numerical check of case (2) is sketched below.
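The sketch normalizes an arbitrary Gram matrix so that $k(x, x) = 1$; the polynomial kernel and the helper name are our own choices, not from the paper.

```python
import numpy as np

def normalized_kernel(K_raw):
    """Given any raw Gram matrix K_raw, return the normalized kernel
    k(x, y) = K(x, y) / (sqrt(K(x, x)) * sqrt(K(y, y))), so that k(x, x) = 1."""
    d = np.sqrt(np.diag(K_raw))
    return K_raw / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K_poly = (X @ X.T + 1.0) ** 2               # a polynomial kernel: k(x, x) is not constant
K_norm = normalized_kernel(K_poly)
print(np.allclose(np.diag(K_norm), 1.0))    # True: unit norm in the feature space
```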

For each training point $x_i$ (labeled or unlabeled), we can sparsely represent it by the remaining training points (labeled and unlabeled) in $\mathcal{H}_k$.

Fig. 2. An example to illustrate why image data do not suffer from the l2-norm problem.

Specifically, given sufficient data points, the sparse representation of $\phi(x_i)$ $(i = 1, \dots, l+u)$ can be found through the stable l1-norm minimization problem

$$\alpha_i^{*} = \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \|\alpha_i\|_1 \quad \text{s.t.} \quad \phi(x_i) = \phi(X)\alpha_i, \qquad (5)$$

where $\phi(X) = [\phi(x_1), \dots, \phi(x_l), \dots, \phi(x_{l+u})] \in \mathbb{R}^{D \times (l+u)}$ is the mapped training data matrix in $\mathcal{H}_k$ and $\alpha_i = [\alpha_{i,1}, \dots, \alpha_{i,i-1}, 0, \alpha_{i,i+1}, \dots, \alpha_{i,l+u}]^T \in \mathbb{R}^{l+u}$ is the reconstruction coefficient vector to be determined. It should be pointed out that the process of finding the kernel-based sparse representation is unsupervised, without using any label information. Therefore, the kernel-based sparse representation of $\phi(x_i)$ involves both labeled and unlabeled points.

Taking into account the effects of noise, insufficient training points and outliers, the optimization problem (5) can be reformulated as

$$\alpha_i^{*} = \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \|\alpha_i\|_1 \quad \text{s.t.} \quad \|\phi(x_i) - \phi(X)\alpha_i\|_2 \le \epsilon,$$

where $\epsilon$ is a predefined bound on the sparse representation bias. As explained in Section 3.1, when all data points have unit l2-norm, the training data points of the same class as the represented one are preferred in the sparse sense. Therefore, using the mapped training data points $\phi(x_i)$ $(i = 1, \dots, l+u)$ in the kernel space $\mathcal{H}_k$ as base elements, the sparse representation of a test point in the kernel space has strong discriminative ability. Moreover, if there are sufficient training data (labeled and unlabeled), the sparse representation bias will be very small and its effect can be neglected. However, in the small-sample case, where training data points are insufficient, the kernel-based sparse representation bias may decrease the classification accuracy of the obtained classifier.

The above sparse optimization problem is equivalent to the following Lagrangian form:

$$
\begin{aligned}
\alpha_i^{*} &= \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \Big\{ \|\phi(x_i) - \phi(X)\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big\} \\
&= \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \Big\{ \Big\|\phi(x_i) - \sum_{j=1}^{l+u} \phi(x_j)\alpha_{ij}\Big\|_2^2 + \lambda \|\alpha_i\|_1 \Big\} \\
&= \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \Big\{ \phi(x_i)^T\phi(x_i) - 2\sum_{j=1}^{l+u} \alpha_{ij}\,\phi(x_i)^T\phi(x_j) + \sum_{j_1=1}^{l+u}\sum_{j_2=1}^{l+u} \alpha_{ij_1}\alpha_{ij_2}\,\phi(x_{j_1})^T\phi(x_{j_2}) + \lambda \|\alpha_i\|_1 \Big\} \\
&= \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \Big\{ 1 - 2k_i^T\alpha_i + \alpha_i^T K \alpha_i + \lambda \|\alpha_i\|_1 \Big\},
\end{aligned}
$$

where $K = \{k(x_{j_1}, x_{j_2})\} \in \mathbb{R}^{(l+u)\times(l+u)}$ is the kernel matrix and $k_i = [k(x_i, x_1), \dots, k(x_i, x_{l+u})]^T \in \mathbb{R}^{l+u}$ is the i-th kernel vector. Thus, the sparse representation of $\phi(x_i)$ can be obtained by solving the optimization problem

$$\alpha_i^{*} = \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \Big\{ \alpha_i^T K \alpha_i - 2k_i^T\alpha_i + \lambda \|\alpha_i\|_1 \Big\}. \qquad (6)$$

This problem can be easily solved using Matlab sparse toolboxes such as TFOCS (http://tfocs.stanford.edu/) proposed by Candès et al. [24].
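The sketch below is a simple proximal-gradient (ISTA) stand-in for solving (6); the paper uses TFOCS, so the solver, the explicit zeroing of the self-coefficient, and all names here are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix of the Gaussian kernel k(w, v) = exp(-||w - v||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_sparse_code(K, i, lam=0.01, n_iter=500):
    """ISTA-style solver for problem (6):
        min_a  a^T K a - 2 k_i^T a + lam * ||a||_1,   with a[i] kept at 0
    (a point is not allowed to represent itself)."""
    a = np.zeros(K.shape[0])
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) + 1e-12)               # 1 / Lipschitz constant
    for _ in range(n_iter):
        a -= step * 2.0 * (K @ a - K[:, i])                         # gradient of smooth part
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)    # soft threshold
        a[i] = 0.0                                                  # exclude self-representation
    return a

# toy usage: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])
K = gaussian_kernel(X, sigma=1.0)
A = np.vstack([kernel_sparse_code(K, i) for i in range(len(X))])    # rows are the alpha_i
```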

3.3. Kernel-based sparse regularization and KSR-LSC algorithm

For the $C$-class classification problem, our classification function $F(\cdot)$ is defined as the vector-valued function

$$F(x) = [f_1(x), f_2(x), \dots, f_C(x)]^T.$$

The class of $x$ is then determined by the component that takes the maximum value among $\{f_s(x),\ s = 1, 2, \dots, C\}$. Correspondingly, we define the label vector $y_i$ to be a $C$-dimensional vector with elements 0 or 1: if $x_i$ belongs to the k-th class, the k-th component of $y_i$ equals 1 and the remaining components equal 0.
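A minimal sketch of the label encoding and decision rule just described (one-hot label vectors and a component-wise argmax). Note that, unlike the paper's column convention for the label matrix, this illustration stores one label vector per row; the helper names are our own.

```python
import numpy as np

def one_hot(z, n_classes):
    """Label vectors y_i: 1 in the class position, 0 elsewhere (one row per sample here)."""
    Y = np.zeros((len(z), n_classes))
    Y[np.arange(len(z)), z] = 1.0
    return Y

def decide(F_vals):
    """Class decision: index of the largest component of F(x), one row per sample."""
    return np.argmax(F_vals, axis=1)

print(one_hot(np.array([0, 2, 1]), 3))
print(decide(np.array([[0.2, 0.7, 0.1]])))   # -> [1]
```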

Many existing graph-based semi-supervised learning algorithms are based on the manifold assumption that the data points lie on an intrinsic low-dimensional manifold and that, if two points are close on the manifold, their labels are similar; that is, the label function varies smoothly along the intrinsic manifold. However, this assumption is rather restrictive and usually requires setting the neighborhood-size parameter manually in order to establish the relationship between data points. This leads to a difficult parameter selection problem, since a fixed neighborhood size cannot adapt to uneven data.

On the other hand, we have shown that the kernel-based sparse representation is discriminative. Moreover, from Proposition 1 and Remark 4 we know that, if the components of the classification function $F$ can be expressed as

$$f_s(x) = \sum_{i=1}^{l+u} b_{si}\, k(x_i, x) \quad (s = 1, \dots, C),$$

then $F$ should preserve the kernel-based sparse representing coefficients. Inspired by the MR framework (1), which uses a regularization term $\|f\|_I^2$ to control the smoothness of the classifier on the data manifold, we define a sparsity regularization term to control the complexity of $F$, as measured by the preservation of the kernel-based sparse structure of the data:

$$\|F\|_R^2 = \frac{1}{(l+u)^2} \sum_{i=1}^{l+u} \Big\| F(x_i) - \sum_{j=1}^{l+u} \alpha_{ij}\, F(x_j) \Big\|_2^2, \qquad (7)$$

where $\alpha_i = [\alpha_{i,1}, \dots, \alpha_{i,i-1}, 0, \alpha_{i,i+1}, \dots, \alpha_{i,l+u}]^T \in \mathbb{R}^{l+u}$ is the kernel-based sparse representing coefficient vector.

Proposition 1. If $F(x) = \sum_{j=1}^{l+u} b_j\, k(x_j, x)$ with $b_j = [b_{1j}, \dots, b_{Cj}]^T \in \mathbb{R}^C$, then the solutions of the optimization problems (5) and (8) are equal, where (8) is

$$\beta_i^{*} = \arg\min_{\beta_i \in \mathbb{R}^{l+u}} \|\beta_i\|_1 \quad \text{s.t.} \quad F(x_i) = F(X)\beta_i. \qquad (8)$$

Proof. If $\phi(x_i) = \phi(X)\alpha_i$ holds, then

$$
\begin{aligned}
F(x_i) &= \sum_{j=1}^{l+u} b_j\, k(x_i, x_j) = \sum_{j=1}^{l+u} b_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} \\
&= \sum_{j=1}^{l+u} b_j \Big\langle \sum_{p=1}^{l+u} \alpha_{ip}\phi(x_p), \phi(x_j) \Big\rangle_{\mathcal{H}} = \sum_{j=1}^{l+u} b_j \sum_{p=1}^{l+u} \alpha_{ip} \langle \phi(x_p), \phi(x_j) \rangle_{\mathcal{H}} \\
&= \sum_{j=1}^{l+u} b_j \sum_{p=1}^{l+u} \alpha_{ip}\, k(x_p, x_j) = \sum_{p=1}^{l+u} \alpha_{ip} \sum_{j=1}^{l+u} b_j\, k(x_p, x_j) = \sum_{p=1}^{l+u} \alpha_{ip}\, F(x_p).
\end{aligned}
$$

Therefore, problem (5) can be transformed into

$$\alpha_i^{*} = \arg\min_{\alpha_i \in \mathbb{R}^{l+u}} \|\alpha_i\|_1 \quad \text{s.t.} \quad F(x_i) = F(X)\alpha_i,$$

which is the same as problem (8). □
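The identity used in the proof, $F(x_i) = \sum_p \alpha_{ip} F(x_p)$ whenever $\phi(x_i) = \phi(X)\alpha_i$, can be checked numerically with an explicit finite-dimensional feature map; the construction below (a rank-deficient Gram matrix and a minimum-norm coefficient vector) is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, C = 3, 8, 2
Phi = rng.normal(size=(D, n))      # explicit feature vectors phi(x_j) as columns, D < n
K = Phi.T @ Phi                    # rank-deficient Gram matrix k(x_j1, x_j2)
B = rng.normal(size=(C, n))        # arbitrary expansion coefficients b_j

# A non-trivial alpha_i with phi(x_i) = Phi @ alpha_i (minimum-norm solution,
# which generically differs from the trivial unit vector e_i because D < n).
i = 0
alpha_i = np.linalg.pinv(Phi) @ Phi[:, i]

F_vals = B @ K                     # column p is F(x_p) for F(x) = sum_j b_j k(x_j, x)
print(np.allclose(F_vals[:, i], F_vals @ alpha_i))   # True: F(x_i) = sum_p alpha_ip F(x_p)
```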

Remark 4. Proposition 1 indicates that, in the case $F(x) = \sum_{j=1}^{l+u} b_j\, k(x_j, x)$ (which will be proved in Proposition 2), if a point $\phi(x_i)$ can be sparsely represented by the other points in the feature space, then the label $F(x_i)$ can also be sparsely represented by the labels of the other points using the same kernel-based sparse representing coefficients. Since the kernel-based sparse representation of data is discriminative, using (7) as a criterion to evaluate how well $F$ preserves the kernel-based sparse representing coefficients tends to further enhance the classification ability.

Given a Mercer kernel function $k(w, v)$, there is an associated RKHS $\mathcal{H}_K$ of functions $f_s$ with corresponding norm $\|f_s\|_K$. For the vector function $F(x) = [f_1(x), f_2(x), \dots, f_C(x)]^T$, we can then define a penalty term measuring the complexity of the classification function $F$ as

$$\|F\|_K^2 = \sum_{s=1}^{C} \|f_s\|_K^2. \qquad (9)$$

Inspired by the MR framework (1), we take (9) to control the complexity of the classifier $F$ in the ambient space, and replace the smoothness regularization term of (1) with the sparsity regularization term (7), which controls the complexity of $F$ as measured by the preservation of the kernel-based sparse structure. The proposed KSR approach can then be formulated as follows:

$$F^{*} = \arg\min_{\substack{f_s \in \mathcal{H}_K \\ s=1,\dots,C}} \left\{ \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, F) + \gamma_K \sum_{s=1}^{C} \|f_s\|_K^2 + \frac{\gamma_R}{(l+u)^2} \sum_{i=1}^{l+u} \Big\| F(x_i) - \sum_{j=1}^{l+u} \alpha_{ij}\, F(x_j) \Big\|_2^2 \right\}. \qquad (10)$$

Here, $f_s$ is to be determined, $\alpha_{ij}$ is the kernel-based sparse representing coefficient obtained by (6), $\gamma_K$ and $\gamma_R$ are regularization parameters, and $V$ is some loss function.

For convenience, we take the loss function $V$ in (10) to be the squared loss, $V(x, y, F) = \|y - F(x)\|_2^2$. Our KSR-LSC algorithm can then be written as

$$F^{*} = \arg\min_{\substack{f_s \in \mathcal{H}_K \\ s=1,\dots,C}} \left\{ \frac{1}{l}\sum_{i=1}^{l} \|y_i - F(x_i)\|_2^2 + \gamma_K \sum_{s=1}^{C} \|f_s\|_K^2 + \frac{\gamma_R}{(l+u)^2} \sum_{i=1}^{l+u} \Big\| F(x_i) - \sum_{j=1}^{l+u} \alpha_{ij}\, F(x_j) \Big\|_2^2 \right\}. \qquad (11)$$

Proposition 2. The minimizer of the optimization problem (11) admits an expansion

$$F(x) = \sum_{i=1}^{l+u} b_i\, k(x_i, x), \qquad (12)$$

where $b_i = [b_{1i}, \dots, b_{Ci}]^T \in \mathbb{R}^C$.

Proof. The proof is based on a simple orthogonality argument [25]. Any function $f_s \in \mathcal{H}_K$ can be uniquely decomposed into a component $(f_s)_{\parallel}$ in the linear subspace spanned by the kernel functions $\{k(x_i, \cdot)\}_{i=1}^{l+u}$ and a component $(f_s)_{\perp}$ orthogonal to it. Thus,

$$f_s = (f_s)_{\parallel} + (f_s)_{\perp} = \sum_{i=1}^{l+u} b_{si}\, k(x_i, \cdot) + (f_s)_{\perp},$$

and hence

$$F = [f_1, \dots, f_C]^T = \sum_{i=1}^{l+u} b_i\, k(x_i, \cdot) + F_{\perp},$$

where $b_i = [b_{1i}, \dots, b_{Ci}]^T \in \mathbb{R}^C$ and $F_{\perp} = [(f_1)_{\perp}, \dots, (f_C)_{\perp}]^T$. For any data point $x_j$ $(1 \le j \le l+u)$, we have

$$F(x_j) = \big[\langle f_1, k(x_j, \cdot)\rangle, \dots, \langle f_C, k(x_j, \cdot)\rangle\big]^T = \bigg[\Big\langle \sum_{i=1}^{l+u} b_{1i}\, k(x_i, \cdot), k(x_j, \cdot)\Big\rangle + \langle (f_1)_{\perp}, k(x_j, \cdot)\rangle, \dots, \Big\langle \sum_{i=1}^{l+u} b_{Ci}\, k(x_i, \cdot), k(x_j, \cdot)\Big\rangle + \langle (f_C)_{\perp}, k(x_j, \cdot)\rangle\bigg]^T.$$

Since $\langle (f_s)_{\perp}, k(x_j, \cdot)\rangle = 0$ and $\langle k(x_i, \cdot), k(x_j, \cdot)\rangle = k(x_i, x_j)$, it follows that $F(x_j) = \sum_{i=1}^{l+u} b_i\, k(x_i, x_j)$. This means that $F(x_j)$ is independent of the orthogonal component $F_{\perp}$; in other words, the loss term and the sparsity regularization term in (11) depend only on the coefficients $\{b_i\}_{i=1}^{l+u}$ and the Gram matrix of the kernel function.

In fact, the orthogonal component $F_{\perp}$ only increases the complexity regularization term $\|F\|_K^2$, since

$$\|F\|_K^2 = \Big\| \sum_{i=1}^{l+u} b_i\, k(x_i, \cdot) \Big\|_K^2 + \|F_{\perp}\|_K^2 \ge \Big\| \sum_{i=1}^{l+u} b_i\, k(x_i, \cdot) \Big\|_K^2.$$

Thus, the minimizer of problem (11) must have zero orthogonal component, $F_{\perp} = 0$, and therefore admits a representation $F(\cdot) = \sum_{i=1}^{l+u} b_i\, k(x_i, \cdot)$. □

From the expansion (12), together with (7) and (9), we obtain

$$f_s = \sum_{i=1}^{l+u} b_{si}\, k(x_i, \cdot), \qquad \|f_s\|_K^2 = \beta_s K \beta_s^T, \qquad (13)$$

$$\|F\|_R^2 = \frac{1}{(l+u)^2}\, \mathrm{tr}\big(B K (I - A)^T (I - A) K B^T\big), \qquad (14)$$

$$\|F\|_K^2 = \sum_{s=1}^{C} \|f_s\|_K^2 = \sum_{s=1}^{C} \beta_s K \beta_s^T = \mathrm{tr}\big(B K B^T\big). \qquad (15)$$

Here, $A = [\alpha_1, \dots, \alpha_{l+u}]^T \in \mathbb{R}^{(l+u)\times(l+u)}$ is the kernel-based sparse representing coefficient matrix with $\alpha_i = [\alpha_{i,1}, \dots, \alpha_{i,l+u}]^T$, $B = [b_1, \dots, b_{l+u}] = [\beta_1^T, \dots, \beta_C^T]^T \in \mathbb{R}^{C\times(l+u)}$ is the kernel coefficient matrix, $b_i$ is the i-th column of $B$, $\beta_s$ is the s-th row of $B$, and $K = \{k(x_i, x_j)\} \in \mathbb{R}^{(l+u)\times(l+u)}$ is the kernel matrix.
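The trace forms (14) and (15) can be sketched directly in numpy as below; the second helper re-evaluates (7) in its summation form as a consistency check. Variable names follow the matrices defined above, but the code itself is our own illustration.

```python
import numpy as np

def ksr_penalties(B, K, A):
    """Trace forms (14) and (15): sparsity-preservation penalty and RKHS complexity."""
    n = K.shape[0]                                        # n = l + u
    M = np.eye(n) - A                                     # I - A
    pen_R = np.trace(B @ K @ M.T @ M @ K @ B.T) / n ** 2  # ||F||_R^2, Eq. (14)
    pen_K = np.trace(B @ K @ B.T)                         # ||F||_K^2, Eq. (15)
    return pen_R, pen_K

def penalty_R_sum(B, K, A):
    """Consistency check: Eq. (7) in its summation form (columns of B @ K are F(x_i))."""
    F_vals = B @ K
    diffs = F_vals - F_vals @ A.T                         # F(x_i) - sum_j alpha_ij F(x_j)
    return (diffs ** 2).sum() / K.shape[0] ** 2
```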

Let $Y = [y_1, \dots, y_l, 0, \dots, 0] \in \mathbb{R}^{C\times(l+u)}$ denote the label matrix, where $0 \in \mathbb{R}^C$ is the zero vector, and let $J \in \mathbb{R}^{(l+u)\times(l+u)}$ be the diagonal matrix whose first $l$ diagonal elements are 1 and whose remaining elements are 0. Substituting (12), (14) and (15) into (11) yields the following convex optimization problem:

$$B^{*} = \arg\min_{B \in \mathbb{R}^{C\times(l+u)}} \left\{ \frac{1}{l}\, \mathrm{tr}\big((Y - BKJ)(Y - BKJ)^T\big) + \gamma_K\, \mathrm{tr}\big(BKB^T\big) + \frac{\gamma_R}{(l+u)^2}\, \mathrm{tr}\big(BK(I-A)^T(I-A)KB^T\big) \right\}. \qquad (16)$$

The solution of the optimization problem (16) is given by

$$B^{*} = Y \left( KJ + \gamma_K l\, I + \frac{\gamma_R\, l}{(l+u)^2}\, K(I-A)^T(I-A) \right)^{-1}, \qquad (17)$$

where $I$ is the identity matrix of order $l+u$. The solution of problem (11) is then obtained as

$$F^{*}(x) = \big(f_1^{*}(x), f_2^{*}(x), \dots, f_C^{*}(x)\big)^T = \sum_{i=1}^{l+u} b_i^{*}\, k(x_i, x), \qquad (18)$$

where $b_i^{*}$ is the i-th column of $B^{*}$. Therefore, the KSR-LSC classifier is obtained as

$$\operatorname{identity}(x) = i^{*} = \arg\max_{i \in \{1,2,\dots,C\}} f_i^{*}(x).$$

Based on the ideas discussed above, the corresponding KSR-LSC algorithm can be summarized as follows.

Algorithm 1. The KSR-LSC algorithm.

Input: Data set $\{(x_i, y_i), x_{l+j};\ i = 1, \dots, l,\ j = 1, \dots, u\}$, regularization parameters $\gamma_K$, $\gamma_R$ and the kernel parameter $\sigma$.


1: Compute the kernel-based sparse representing coefficients for each point in $\{x_i\}_{i=1}^{l+u}$ by solving the l1-norm minimization problem (6).
2: Compute the kernel matrix $K = \{k(x_i, x_j)\}$.
3: Compute the kernel expansion coefficients $B^{*}$ by (17).
4: Compute the discriminative function $F^{*}(x)$ by (18).
Output: The class label $\operatorname{identity}(x) = i^{*} = \arg\max_{i \in \{1,2,\dots,C\}} f_i^{*}(x)$.
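Putting the pieces together, the following compact sketch walks through steps 1–4 of Algorithm 1 on a toy two-class problem. It reuses the ISTA stand-in for problem (6) and therefore only approximates the sparse codes; the Gaussian kernel, the parameter values, and all function names are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_sparse_code(K, i, lam=0.01, n_iter=500):
    """ISTA stand-in for problem (6); see the sketch in Section 3.2."""
    a = np.zeros(K.shape[0])
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) + 1e-12)
    for _ in range(n_iter):
        a -= step * 2.0 * (K @ a - K[:, i])                        # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)   # soft threshold
        a[i] = 0.0                                                 # no self-representation
    return a

def ksr_lsc_fit(X, z_labeled, n_classes, sigma, gamma_K, gamma_R, lam=0.01):
    """Steps 1-3 of Algorithm 1. The first len(z_labeled) rows of X are the labeled points."""
    n, l = len(X), len(z_labeled)
    K = gaussian_kernel(X, X, sigma)                                       # step 2
    A = np.vstack([kernel_sparse_code(K, i, lam) for i in range(n)])       # step 1
    Y = np.zeros((n_classes, n)); Y[z_labeled, np.arange(l)] = 1.0         # label matrix
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])
    M = np.eye(n) - A
    B = Y @ np.linalg.inv(K @ J + gamma_K * l * np.eye(n)
                          + (gamma_R * l / n ** 2) * (K @ M.T @ M))        # Eq. (17)
    return B, (X, sigma)

def ksr_lsc_predict(B, model, X_new):
    """Step 4 and the decision rule: F*(x) = sum_i b_i* k(x_i, x), class = argmax component."""
    X_train, sigma = model
    return np.argmax(B @ gaussian_kernel(X_train, X_new, sigma), axis=0)

# toy usage: two Gaussian blobs, 2 labeled points per class, the rest unlabeled
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])
z = np.r_[np.zeros(20, dtype=int), np.ones(20, dtype=int)]
idx = np.r_[0:2, 20:22, np.setdiff1d(np.arange(40), [0, 1, 20, 21])]       # labeled first
X, z = X[idx], z[idx]
B, model = ksr_lsc_fit(X, z[:4], n_classes=2, sigma=1.0, gamma_K=0.005, gamma_R=1.0)
print("training accuracy:", (ksr_lsc_predict(B, model, X) == z).mean())
```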

The proposed semi-supervised KSR-LSC algorithm tries to keep the prior class labels and regularizes the complexity and the smoothness of the classifier. Like the kernel learning method in [26], which states that a good kernel should enable each training point $x_i$ to be well reconstructed from the localized bases $x_j$ weighted by the kernel values, the KSR-LSC algorithm requires that the class label $F(x_i)$ of each point be well reconstructed from the labels $F(x_j)$ of the other points weighted by the kernel-based sparse representation. Another kernel learning method, [27], tries to make the kernel preserve the intrinsic manifold structure of the data, while our method attempts to make the classifier preserve the sparse structure of the data. Different from the method in [28], which computes the kernel-based sparse representation with a dictionary that is to be learned, we use the training data in the kernel space as the dictionary. The classifier in [29] is supervised and utilizes the kernel-based sparse representation of each test point to assign its class, while our method operates in a semi-supervised way and uses the kernel-based sparse representation of the training data to construct an explicit classifier.

Compared with the S-RLSC method in [18], the contribution of this paper mainly lies in three aspects. Firstly, this paper identifies the l2-norm problem from which sparsity-based methods, including S-RLSC, tend to suffer when facing non-image data sets. Secondly, due to the l2-norm problem, the S-RLSC method can only deal with image data sets, while the method proposed in this paper can deal with general natural data sets, which extends its range of applicability. Thirdly, the S-RLSC method is designed on the assumption that the label of a point can be reconstructed from the labels of other points using the coefficients of the sparse representation. In practice, however, the classification function is usually nonlinear, and thus the reconstruction is generally not precise, which may in turn decrease the classification accuracy of the obtained classifier. Conversely, in this paper, from Proposition 1 and Remark 4 we can see that the label of a point can be sparsely reconstructed from the labels of other points using the kernel-based sparse representing coefficients, and thus the classification accuracy of the obtained classifier tends to be improved.

4. Experiments

In this section, experiments are implemented on both image and non-image data sets to show that our KSR-LSC method for semi-supervised classification is effective and robust across different data sets.

4.1. Data sets and compared approaches

To evaluate the performance of the proposed method, we run experiments on nine real-world benchmark data sets: the USPS handwritten digit [30], COIL-20 [31] and Intelligent Traffic System (ITS) [32] data sets; the Ringnorm, Splice and Waveform data sets; the Tic-Tac-Toe Endgame and Pima Indians Diabetes data sets; and the dna data set [33]. The first three are image data sets, and the rest are non-image data sets. The 4th, 5th and 6th data sets come from the IDA Benchmark Repository (http://mldata.org/repository/tags/data/IDA_Benchmark_Repository/); the 7th and 8th data sets come from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html); the last data set comes from the Statlog collection (http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/multiclass.html). The numbers of samples in these data sets are 2500, 1440, 2529, 2000, 2000, 2000, 958, 768 and 3186, respectively. (For some databases, we randomly select a subset for our experiments.) In order to alleviate the negative effect caused by the different scales of different dimensions, for all data sets each row of the training data matrix $X \in \mathbb{R}^{d\times(l+u)}$ is normalized so that its maximum component equals 1.

The following approaches are compared in the experiments: the proposed KSR-LSC method, the LapRLSC method [12], the GRF method [9], the 1-nearest neighbor (1-NN) method, the Linear Neighborhood Propagation (LNP) method [34], the standard and nonstandard SRC methods [16], and the standard and nonstandard S-RLSC methods [18]. Here, the standard and nonstandard SRC/S-RLSC refer, respectively, to the form of SRC/S-RLSC with the columns of the data matrix normalized to have unit l2-norm (named SRC/S-RLSC) and the form with the columns of the data matrix left unnormalized (named SRC/S-RLSC-un). "Normalization" means that each training or test data point $x \in \mathbb{R}^d$ is replaced with $\hat{x} = x/\|x\|_2$, which has unit l2-norm.

4.2. Parameter selection and experimental settings

For parameter selection, the KSR-LSC, LapRLSC, GRF, S-RLSC and S-RLSC-un methods all require the kernel parameter $\sigma$ as a key implementation parameter. In the experiments, we search $\sigma$ over $\{0.01\sigma_0, 0.1\sigma_0, \sigma_0, 10\sigma_0, 100\sigma_0\}$, where $\sigma_0$ is the mean of the pairwise L2-distances on the training data set. The KSR-LSC, LapRLSC, S-RLSC and S-RLSC-un methods have regularization parameters $\gamma_K$ and $\gamma_R$. Let $C_K = \gamma_K l$ and $C_R = \gamma_R l/(l+u)^2$. We find that the algorithms perform well over a wide range of $C_K$ and $C_R$; for convenience, we set $C_K = 0.005$ and $C_R = 1$ for all data sets. KSR-LSC also needs the parameter $\lambda$ to compute the kernel-based sparse representation, and we simply set it to 0.01 in all experiments. For LNP, the parameter $\alpha$, which is the fraction of label information that a point receives from its neighbors, is set to 0.99 as in the original paper [34]. The SRC method needs only one parameter, the error tolerance $\epsilon$; in the original paper [17] this value is set to $\epsilon_0 = 0.05$ throughout all experiments. We therefore use this setting for the SRC method and use $\epsilon_1 = a\,\epsilon_0$ for the SRC-un method, where $a$ is the mean of the l2-norms of the training data points.

The whole experiment is conducted as follows. For each data set $\mathcal{X}$, tenfold cross validation (10-CV) is repeated 5 times. In each fold of 10-CV, denote by $\mathcal{X}_1$ the test set and by $\mathcal{X}_2$ the training set. The training set $\mathcal{X}_2$ is then randomly partitioned into a labeled set $\mathcal{L}$ with $m$ data points and an unlabeled set $\mathcal{U}$. For the supervised algorithms 1-NN, SRC and SRC-un, we use only the labeled set $\mathcal{L}$ to train the classifiers; for the semi-supervised methods LapRLSC, GRF, S-RLSC, S-RLSC-un and our KSR-LSC, the training set is $\mathcal{L} \cup \mathcal{U}$, including both labeled and unlabeled data points. After building the classifiers, classification is first performed on the unlabeled points in $\mathcal{U}$, and then on the test set $\mathcal{X}_1$. For each data set and each specific number $m$ of labeled points, 10-CV is repeated 5 times, and the resulting 50 results are averaged.
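The evaluation protocol just described can be summarized by the following sketch (our own paraphrase of the procedure); `fit` and `predict` are hypothetical callables, e.g. thin wrappers around the KSR-LSC sketch in Section 3.3.

```python
import numpy as np

def evaluate_protocol(X, z, m_labeled, fit, predict, n_repeats=5, n_folds=10, seed=0):
    """10-fold CV repeated n_repeats times; in each fold the training part is split into
    m_labeled labeled points and the rest unlabeled; accuracies on the unlabeled set U
    and on the held-out test fold are averaged over all runs (50 results in total)."""
    rng = np.random.default_rng(seed)
    acc_U, acc_test = [], []
    n = len(X)
    for _ in range(n_repeats):
        for test_idx in np.array_split(rng.permutation(n), n_folds):
            train_idx = rng.permutation(np.setdiff1d(np.arange(n), test_idx))
            lab, unlab = train_idx[:m_labeled], train_idx[m_labeled:]
            model = fit(X[np.r_[lab, unlab]], z[lab])                 # labeled points first
            pred = predict(model, X[np.r_[lab, unlab]])
            acc_U.append((pred[m_labeled:] == z[unlab]).mean())
            acc_test.append((predict(model, X[test_idx]) == z[test_idx]).mean())
    return np.mean(acc_U), np.mean(acc_test)
```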

Besides, using a paired t-test, we compare the 50 accuracy rates obtained by KSR-LSC with the 50 accuracy rates obtained by each of the other 8 algorithms, in order to examine whether the performance differences between our approach and the others are statistically significant.

4.3. Experimental results

For the nine data sets, the recognition results (i.e., the averages over the 5 repetitions of tenfold cross validation) of the compared algorithms on the unlabeled data points $\mathcal{U}$ in the training sets are shown in Figs. 3–11, with the number of labeled data points $m$ varying. As can be seen, the proposed KSR-LSC outperforms the other methods in general, on both image and non-image data sets. The SRC and SRC-un methods perform worst among all methods on some non-image data sets, such as the Ringnorm and Waveform data sets. For the S-RLSC and S-RLSC-un methods, the results on non-image data sets are sometimes still not satisfactory, but we should note that they perform much better than SRC and SRC-un. This is mainly because SRC and SRC-un rely only on the sparse representation, whilst S-RLSC and S-RLSC-un not only rely on the sparse representation but also take account of the complexity of the classifier and the fit on the labeled points. Besides, from the figures we can see that S-RLSC sometimes performs worse than S-RLSC-un. This may be because the normalization preprocessing changes the intrinsic structure of some data sets, which negatively affects the performance of the obtained classifier.

Fig. 3. Classification results of the unlabeled data points in the training set for the USPS data set, where m is the number of labeled data points in the training set.

Fig. 4. Classification results of the unlabeled data points in the training set for the COIL-20 data set, where m is the number of labeled data points in the training set.

Fig. 5. Classification results of the unlabeled data points in the training set for the ITS data set, where m is the number of labeled data points in the training set.

Fig. 6. Classification results of the unlabeled data points in the training set for the Ringnorm data set, where m is the number of labeled data points in the training set.

Fig. 7. Classification results of the unlabeled data points in the training set for the Splice data set, where m is the number of labeled data points in the training set.


The average classification results of the algorithms on the test sets are tabulated in Tables 1–9 for several representative values of m (in the original tables, the best result for each fixed m is shown in boldface). The value in parentheses is the p value of the paired t-test between KSR-LSC and the corresponding method. From the statistical tests, we can see that the discriminative ability of the proposed KSR-LSC is often significantly better than that of the other algorithms, especially when the number of labeled points m is relatively large. Besides, we can see that the proposed KSR-LSC method has comparable performance with S-RLSC on the image data sets (Tables 1–3), but consistently outperforms S-RLSC on the non-image data sets (Tables 4–9). There are two reasons for this phenomenon. On one hand, in the S-RLSC method the data points are normalized, i.e., x is replaced with $\hat{x} = x/\|x\|_2$. For image data, this normalization can eliminate the negative effects caused by differing brightness and is usually adopted in image classification tasks to improve the classification accuracy; KSR-LSC does not adopt this normalization, so for image data sets the differences between the two methods are not very significant. On the other hand, compared with S-RLSC, the most important contribution of this paper is that, due to the l2-norm problem, the S-RLSC method can only deal with image data sets, while KSR-LSC can deal with general natural data sets. We think that the above phenomenon demonstrates the existence of the l2-norm problem and the superiority of the proposed method over S-RLSC for non-image data sets.

In summary, the experimental results show that the proposed KSR-LSC algorithm is more effective than the other compared algorithms, especially on non-image data sets.

5. Conclusion

In this paper, we proposed a novel semi-supervised classification algorithm, the KSR-LSC algorithm. The algorithm first maps the original data into the feature space, so that the mapped data points have unit l2-norm, and then tries to preserve the sparse representing coefficients of the data points in the feature space. Experiments have been conducted on image and non-image data sets, and the results show that our algorithm achieves better recognition results than the other compared algorithms.

In fact, our algorithm mainly focuses on the classification of finite-dimensional examples. When the data dimension is infinite or extremely high, the l2-norm problem will be naturally alleviated due to the well-known curse of dimensionality: the high-dimensional space will always be "far away" from the center or, to put it another way, the high-dimensional unit space will consist almost entirely of the "corners" of the hypercube, with almost no "middle". However, even when the l2-norm problem does not occur, our strategy may still promote the classification accuracy more or less.

Fig. 8. Classification results of the unlabeled data points in the training set for the Waveform data set, where m is the number of labeled data points in the training set.

Fig. 9. Classification results of the unlabeled data points in the training set for the Tic-Tac-Toe Endgame data set, where m is the number of labeled data points in the training set.

Fig. 10. Classification results of the unlabeled data points in the training set for the Pima Indians Diabetes data set, where m is the number of labeled data points in the training set.

Fig. 11. Classification results of the unlabeled data points in the training set for the dna data set, where m is the number of labeled data points in the training set.


Table 1Classification results on the test set of USPS data set (accuracy rates % and p values in the parentheses).

m KSR-LSC S-RLSC S-RLSC-un SRC SRC-un 1-NN GRF LapRLSC LNP

200 88.36 87.33 68.28 60.60 60.33 77.53 45.32 78.11 70.01(0.0002) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

800 95.07 94.65 88.51 91.20 90.48 88.67 80.51 91.11 87.24(0.0113) (0.0000) (0.0000) (0.0000) (0.0000) (0.0036) (0.0000) (0.0028)

1000 95.36 95.31 90.99 92.80 92.35 90.12 90.11 91.96 88.57(0.6871) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0073)

1800 96.80 97.00 95.64 95.77 95.44 92.29 92.57 93.79 91.35(0.1089) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0294)

Table 2Classification results on the test set of COIL-20 data set (accuracy rates % and p values in the parentheses).

m KSR-LSC S-RLSC S-RLSC-un SRC SRC-un 1-NN GRF LapRLSC LNP

400 99.26 99.14 95.74 94.33 94.07 96.18 96.78 91.37 93.80(0.5094) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

600 99.70 99.49 98.47 97.66 97.01 97.99 98.33 93.08 97.69(0.2310) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

700 99.79 99.72 99.14 98.61 98.19 98.80 98.96 93.96 98.82(0.3746) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

900 99.95 99.91 99.70 99.42 99.19 99.44 99.54 95.49 99.68(0.3256) (0.0190) (0.0000) (0.0000) (0.0000) (0.0014) (0.0000) (0.0005)

Table 3Classification results on the test set of ITS data set (accuracy rates % and p values in the parentheses).

m KSR-LSC S-RLSC S-RLSC-un SRC SRC-un 1-NN GRF LapRLSC LNP

200 91.06 91.17 87.76 83.97 84.62 81.34 56.23 74.34 73.30(0.6748) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

600 94.41 94.50 90.42 90.83 88.74 87.70 59.18 84.78 88.70(0.6747) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

1000 95.73 95.57 92.45 93.49 90.93 90.81 66.23 87.31 92.14(0.3665) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

1400 96.02 96.16 93.96 94.65 92.63 92.43 64.73 88.34 93.70(0.4256) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

Table 4Classification results on the test set of Ringnorm data set (accuracy rates % and p values in the parentheses).

m KSR-LSC S-RLSC S-RLSC-un SRC SRC-un 1-NN GRF LapRLSC LNP

800 86.82 77.13 85.98 51.18 50.00 68.68 64.42 79.30 71.23(0.0000) (0.0263) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

1000 88.27 77.43 87.02 50.27 50.00 69.13 63.68 79.93 71.98(0.0000) (0.0018) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

1200 89.52 77.63 88.43 50.02 50.00 69.85 68.10 80.12 72.47(0.0000) (0.0011) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

1400 90.50 77.75 89.65 50.70 50.00 70.48 71.52 80.38 73.23(0.0000) (0.0132) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000) (0.0000)

Table 5. Classification results on the test set of Splice data set (accuracy rates % and p values in the parentheses).

m     KSR-LSC  S-RLSC          S-RLSC-un       SRC             SRC-un          1-NN            GRF             LapRLSC         LNP
600   86.35    83.40 (0.0000)  84.97 (0.0002)  82.45 (0.0000)  82.85 (0.0000)  69.32 (0.0000)  52.62 (0.0000)  71.83 (0.0000)  69.95 (0.0000)
1000  87.58    85.75 (0.0000)  86.92 (0.0336)  84.15 (0.0000)  83.68 (0.0000)  70.93 (0.0000)  58.93 (0.0000)  79.45 (0.0000)  72.95 (0.0000)
1400  88.82    87.13 (0.5094)  88.10 (0.0025)  85.32 (0.0000)  84.87 (0.0000)  72.58 (0.0000)  61.05 (0.0000)  83.28 (0.0000)  75.02 (0.0000)
1600  89.17    87.63 (0.0000)  88.85 (0.1213)  85.90 (0.0000)  85.28 (0.0000)  72.93 (0.0000)  62.77 (0.0000)  84.28 (0.0000)  75.95 (0.0000)
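
The parenthesized p-values in Tables 1-9 accompany each comparison method (KSR-LSC itself carries none) and quantify the significance of its accuracy difference from KSR-LSC over repeated trials. As a hedged sketch of how such entries can be produced — assuming a two-sided paired t-test over per-split accuracies, which is an assumption made here for illustration rather than a statement of the paper's exact protocol — one could compute them as follows:

import numpy as np
from scipy import stats

# Hypothetical per-split accuracies over 20 random labeled/unlabeled splits;
# the values below are synthetic and only illustrate the computation.
rng = np.random.default_rng(0)
acc_proposed = 0.95 + 0.01 * rng.standard_normal(20)
acc_baseline = 0.93 + 0.01 * rng.standard_normal(20)

# Two-sided paired t-test on the per-split accuracy differences.
t_stat, p_value = stats.ttest_rel(acc_proposed, acc_baseline)
print("mean proposed = %.4f, mean baseline = %.4f, p = %.4f"
      % (acc_proposed.mean(), acc_baseline.mean(), p_value))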

In the future, efforts should be devoted to extending our method to higher-dimensional data sets. Besides, our aim can also possibly be realized by a one-vs-one or one-vs-all extension of the binary-classification formulation, and we will try to construct efficient binary classifiers in future work. In addition, from Remark 3 we can see that the chosen kernel is constrained. How to find a method that can utilize general kernels is still an interesting yet challenging issue, and we will address it in our future research.

Acknowledgments

This work is supported by the 2013 Scientific Research Foundation of Capital University of Economics and Business, the National Natural Science Foundation of China (NSFC) under Grants 61305035, 61203241, 11131006, 61373114, and the Natural Science Foundation of Zhejiang Province under Grants LQ13F030009 and LQ12F03004.

References

[1] A. Fujino, N. Ueda, K. Saito, Semi-supervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle, IEEE Trans. Pattern Anal. Mach. Intell. 30 (3) (2008) 424-437.

[2] S.H. Ji, L.T. Watson, L. Carin, Semi-supervised learning of hidden Markov models via a homotopy method, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 275-287.

[3] U. Maulik, D. Chakraborty, A self-trained ensemble with semisupervised SVM: an application to pixel classification of remote sensing imagery, Pattern Recognit. 44 (3) (2011) 615-623.

[4] Y. Li, C. Guan, H. Li, et al., A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system, Pattern Recognit. Lett. 29 (9) (2008) 1285-1294.

Table 6. Classification results on the test set of Waveform data set (accuracy rates % and p values in the parentheses).

m     KSR-LSC  S-RLSC          S-RLSC-un       SRC             SRC-un          1-NN            GRF             LapRLSC         LNP
600   87.03    84.33 (0.0000)  84.65 (0.0000)  54.38 (0.0000)  49.80 (0.0000)  85.00 (0.0018)  86.55 (0.4095)  86.70 (0.1798)  86.73 (0.5268)
1000  88.28    85.12 (0.0000)  85.48 (0.0000)  54.37 (0.0000)  51.32 (0.0000)  85.18 (0.0000)  86.83 (0.0092)  87.45 (0.0019)  87.88 (0.3552)
1400  88.88    85.92 (0.0000)  86.27 (0.0000)  55.83 (0.0000)  51.57 (0.0000)  85.50 (0.0000)  87.25 (0.0007)  87.87 (0.0000)  88.62 (0.4871)
1600  89.08    86.23 (0.0000)  86.42 (0.0000)  56.15 (0.0000)  51.35 (0.0000)  85.53 (0.0000)  87.38 (0.0012)  87.97 (0.0000)  89.02 (0.8665)

Table 7. Classification results on the test set of Tic-Tac-Toe Endgame data set (accuracy rates % and p values in the parentheses).

m     KSR-LSC  S-RLSC          S-RLSC-un       SRC             SRC-un          1-NN            GRF             LapRLSC         LNP
200   82.16    78.57 (0.0000)  79.62 (0.0002)  74.22 (0.0000)  71.75 (0.0000)  75.40 (0.0000)  78.05 (0.0001)  67.46 (0.0000)  81.77 (0.6104)
400   89.98    83.65 (0.0000)  86.93 (0.0000)  76.69 (0.0000)  75.16 (0.0000)  81.77 (0.0000)  82.05 (0.0000)  75.82 (0.0000)  85.01 (0.0000)
600   94.89    86.40 (0.0000)  92.03 (0.0000)  76.38 (0.0000)  78.25 (0.0000)  82.99 (0.0000)  82.64 (0.0000)  79.86 (0.0000)  85.01 (0.0000)
800   97.36    88.45 (0.0000)  93.95 (0.0000)  76.48 (0.0000)  80.10 (0.0000)  83.76 (0.0000)  82.82 (0.0000)  80.91 (0.0000)  84.24 (0.0000)

Table 8. Classification results on the test set of Pima Indians Diabetes data set (accuracy rates % and p values in the parentheses).

m     KSR-LSC  S-RLSC          S-RLSC-un       SRC             SRC-un          1-NN            GRF             LapRLSC         LNP
200   72.91    67.97 (0.0001)  69.79 (0.0005)  66.32 (0.0000)  46.49 (0.0000)  69.05 (0.0037)  55.63 (0.0000)  67.88 (0.0000)  69.97 (0.0215)
300   74.65    68.01 (0.0000)  70.79 (0.0000)  64.41 (0.0000)  45.27 (0.0000)  69.84 (0.0002)  58.62 (0.0001)  69.92 (0.0000)  71.14 (0.0102)
400   75.61    68.32 (0.0000)  72.00 (0.0001)  64.10 (0.0000)  43.53 (0.0000)  69.79 (0.0000)  63.55 (0.0004)  71.57 (0.0000)  71.96 (0.0109)
600   76.83    69.62 (0.0000)  73.52 (0.0002)  65.32 (0.0000)  43.49 (0.0000)  70.01 (0.0000)  71.22 (0.0004)  73.52 (0.0000)  71.78 (0.0002)

Table 9. Classification results on the test set of dna data set (accuracy rates % and p values in the parentheses).

m     KSR-LSC  S-RLSC          S-RLSC-un       SRC             SRC-un          1-NN            GRF             LapRLSC         LNP
200   86.99    75.14 (0.0000)  76.42 (0.0000)  49.46 (0.0000)  48.55 (0.0000)  63.99 (0.0000)  51.91 (0.0000)  52.08 (0.0000)  58.93 (0.0000)
800   93.73    90.12 (0.0000)  92.47 (0.0000)  82.37 (0.0000)  82.72 (0.0000)  69.96 (0.0000)  51.91 (0.0000)  53.41 (0.0000)  73.32 (0.0000)
1600  95.23    93.94 (0.0000)  94.70 (0.0006)  83.91 (0.0000)  84.13 (0.0000)  72.79 (0.0000)  51.91 (0.0000)  57.77 (0.0000)  80.98 (0.0000)
2200  95.69    95.01 (0.0000)  95.46 (0.0034)  83.87 (0.0000)  84.35 (0.0000)  73.78 (0.0000)  51.91 (0.0000)  63.84 (0.0000)  83.56 (0.0000)

[5] M. Li, Z.H. Zhou, Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples, IEEE Trans. Syst. Man Cybern. Part A 37 (6) (2007) 1088-1098.

[6] O. Chapelle, V. Sindhwani, S.S. Keerthi, Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. 9 (2008) 203–233.

[7] O. Chapelle, V. Sindhwani, S.S. Keerthi, Branch and bound for semi-supervised support vector machines, in: Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, 2007, pp. 217-224.

[8] O. Chapelle, J. Weston, B. Schölkopf, Cluster kernels for semi-supervised learning, in: Proceedings of the Neural Information Processing Systems Conference (NIPS 2003), 2003, pp. 585-592.

[9] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the 20th International Conference on Machine Learning (ICML 2003), 2003.

[10] X. Zhu, Z. Ghahramani, Learning from Labeled and Unlabeled Data with Label Propagation, Technical Report CMU-CALD-02-107, Computer Science Department, Carnegie Mellon University, 2002.

[11] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Proceedings of the Neural Information Processing Systems Conference (NIPS 2004), 2004.

[12] M. Belkin, V. Sindhwani, P. Niyogi, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399-2434.

[13] X. Zhu, Semi-Supervised Learning Literature Survey, Technical Report 1530, Computer Science Department, University of Wisconsin, 2006.

[14] A.M. Bruckstein, D.L. Donoho, M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. 51 (1) (2009) 34-81.

[15] D.L. Donoho, M. Elad, Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization, Proc. Natl. Acad. Sci. 100 (5) (2003) 2197-2202.

[16] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (2010) 1031-1044.

[17] J. Wright, A. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210-227.

[18] M. Fan, N. Gu, H. Qiao, B. Zhang, Sparse regularization for semi-supervised classification, Pattern Recognit. 44 (2011) 1777-1784.

[19] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. 43 (2010) 331-341.

[20] D.L. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Commun. Pure Appl. Math. 59 (2006) 797-829.

[21] E. Candès, J. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Commun. Pure Appl. Math. 59 (2006) 1207-1223.

[22] E. Candès, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Trans. Inf. Theory 52 (2006) 5406-5425.

[23] K. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Netw. 12 (2001) 181-201.

[24] S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to sparse signal recovery, Math. Program. Comput. 3 (3) (2010) 165-218.

[25] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.

[26] J. Zhuang, J. Wang, S.C. Hoi, X. Lan, Unsupervised multiple kernel learning, J. Mach. Learn. Res. 20 (2011) 129-144 (Proceedings Track).

[27] J. Zhuang, I.W. Tsang, S.C. Hoi, A family of simple non-parametric kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 1313-1347.

[28] S. Gao, I.W.H. Tsang, L.T. Chia, Kernel sparse representation for image classification and face recognition, in: Proceedings of the European Conference on Computer Vision (ECCV 2010), 2010.

[29] J. Yin, Z. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (1) (2012) 120-128.

[30] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 550-554.

[31] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, 1996.

[32] X. Cao, H. Qiao, J. Keane, A low-cost pedestrian-detection system with a single optical camera, IEEE Trans. Intell. Transp. Syst. 9 (1) (2008) 58-67.

[33] C.W. Hsu, C.J. Lin, A comparison of methods for multi-class support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415-425.

[34] F. Wang, C. Zhang, Label propagation through linear neighborhoods, in: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 2006.

Nannan Gu received the B.Sc. degree in information and computing science from Xi'an Jiaotong University, Xi'an, China, in 2006, the M.Sc. degree in applied mathematics from Xi'an Jiaotong University, Xi'an, China, in 2009, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2012. She is currently a lecturer in the School of Statistics, Capital University of Economics and Business, Beijing, China. Her current research interests include the theory and application of semi-supervised classification, manifold learning and nonlinear dimensionality reduction.

Di Wang received the B.Sc. degree from Shandong University, Jinan, China, and the Ph.D. degree in applied mathematics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, in 2007 and 2012, respectively. He is currently a lecturer with the College of Mathematics and Information Sciences, Wenzhou University, Wenzhou, China. His current research interests include the theory and application of support vector machines.

Mingyu Fan received the B.Sc. degree from the Central University for Nationalities, Beijing, China, in 2006 and the Ph.D. degree in applied mathematics from the Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an associate professor at Wenzhou University, China. His current research interests are the theory and application of manifold learning and nonlinear dimensionality reduction.

Deyu Meng received the B.Sc., M.Sc., and Ph.D. degrees in 2001, 2004, and 2008, respectively, from Xi'an Jiaotong University, Xi'an, China. He is currently an associate professor with the Institute for Information and System Sciences, Faculty of Science, Xi'an Jiaotong University. His current research interests include principal component analysis, nonlinear dimensionality reduction, feature extraction and selection, compressed sensing, and sparse machine learning methods.
