
A Novel Support Vector Classifier for Longitudinal High-Dimensional Data and Its Application to Neuroimaging Data

Shuo Chen∗ and F. DuBois Bowman

Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA

Correspondence to: Shuo Chen ([email protected])

Received 21 February 2011; revised 17 September 2011; accepted 21 September 2011
DOI: 10.1002/sam.10141

Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example, images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g., baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's Disease Neuroimaging Initiative. The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 604–611, 2011

Keywords: longitudinal high-dimensional data; prediction; support vector classifier; fMRI; PET; Alzheimer’s disease

1. INTRODUCTION

Current biomedical technology enables the collection of high-dimensional data (HDD) to gain insights regarding genomic, proteomic, and in vivo neural processing properties. Moreover, such HDD are more commonly being collected longitudinally, potentially revealing changes in biological properties that may provide clues to disease diagnosis, progression, or recovery. Machine learning tools have been widely applied for HDD classification and prediction [1–3]. Support vector machine (SVM) methods are among the most popular machine learning techniques because of their high prediction accuracy and robustness [4–7]. However, most current machine learning methods have been developed for cross-sectional rather than longitudinal high-dimensional data (LHDD) analysis. The 'ideal' methodology for LHDD would take advantage of the additional data to determine temporal trends of features and use them as inputs within machine learning models. However, in practice, the temporal trends are usually unknown, and currently no such model exists for simultaneously determining the temporal trends and building the classification model.

To address classification or prediction objectives in the context of LHDD, one may opt to use data from only a single time point, for example, baseline data. Another potential approach for handling LHDD is a naive procedure of simply combining the longitudinal data as independent sources of information. Using data from only a single time point or using longitudinal data as independent sources of information may lead to substantial information loss and may not fully capitalize on the available data. One may also consider fitting preliminary models, for example, using logistic regression for each feature, and then using the resulting estimates to preset the temporal trends for classification. Because this approach uses the classification outcome of interest in the preliminary modeling stage, presetting temporal parameters for each feature using model-based estimates carries a substantial risk of overfitting and complicates the subsequent feature selection procedure.

In this paper, we propose a novel support vector classifier (SVC) for LHDD that extracts key features of each cross-sectional component as well as temporal trends between these components for the purpose of classification and prediction. The objective function of our new method incorporates two groups of estimands: the decision hyperplane function parameters and the temporal trend parameters that determine an optimal way to combine the longitudinal data. The objective function is derived from maximizing the margin width, with error-tolerated correct classification constraints. Within the framework of the Lagrange (Wolfe) dual of the objective function, we augment the dimension of the Hessian matrix by incorporating the temporal trend parameters. Then, we apply quadratic programming techniques to optimize the classification parameters and temporal trend parameters. With kernels satisfying Mercer's conditions, the objective function is convex, leading to a finite dimensional representation of the decision function. The framework allows feature selection with unknown temporal trend parameters through recursive feature elimination (RFE) procedures.

Generally, our proposed framework is applicable to any type of high dimensional data that are measured longitudinally. For example, in the application to neuroimaging data, our method is applicable to longitudinal/multisession studies collecting fMRI, PET, EEG, and MEG data. The longitudinal property refers to multiple scanning sessions (e.g., images or collections of images acquired on different days). Importantly, for some neuroimaging data (e.g., fMRI data), there may be a series of images measured at different time points within one session. Therefore, we usually use features reflecting various summaries from the original data at each session. For example, the features in our method may include functional connectivity, localized activity summary statistics (first-level analysis results for fMRI data), or frequency domain summary statistics. Hence, with appropriate summary statistics, our approach can handle a range of high-dimensional data modalities.

The rest of the paper is organized as follows. In Section 2, we present the new longitudinal SVC and provide an accompanying computational strategy. Furthermore, we discuss its extension to nonlinear kernels and an RFE-based feature selection algorithm. In Section 3, we examine the classification performance of the proposed method for a data example and in a simulation study. Section 4 concludes the paper with a summary and a discussion of the major strengths of our novel SVC for LHDD.

2. METHODS

2.1. Classical Support Vector Classifier

SVC is a popular kernel machine learning algorithm that is derived to solve classification problems [8]. For one subject indexed by $s$, the $p$-dimensional feature space is denoted as $x_s \in \mathbb{R}^p$, for $s = 1, 2, \ldots, N$, and group indicators $y_s \in \{-1, 1\}$ denote a binary state such as disease status (positive/negative) or treatment response (recovery or not). A classifier is defined by constructing a separating function (or hyperplane) $h(x_s) = w \cdot x_s + b$ and then generating $y_s = \mathrm{sign}(h(x_s))$, if the data are linearly separable. The SVC chooses the unique hyperplane that maximizes the margins, which are the distances between the hyperplane and the support vectors. For cases when data are not linearly separable, a 'soft margin' is introduced that allows some data points to be misclassified. Therefore, the SVC optimizes the following objective function:

$$\min_{w}\ \frac{1}{2}\|w\|^2 + C\sum_{s=1}^{N}\xi_s, \quad s = 1, 2, \ldots, N, \qquad (1)$$

subject to

$$y_s(w \cdot x_s + b) \geq 1 - \xi_s, \quad \text{and} \quad \xi_s \geq 0,$$

where $\xi_s$ is the distance of subject $s$ from its correct side of the margin and the constraint constant $C$ is the tuning parameter governing the tolerance level of misclassification.

Then, we obtain the Lagrange (Wolfe) dual by substituting $w = \sum_{s=1}^{N} y_s \alpha_s x_s$ into the Lagrange primal function of formula (1):

$$\min_{\alpha_s}\ \frac{1}{2}\sum_{s,s'}\alpha_s\alpha_{s'}y_s y_{s'}\langle\Phi(x_s),\Phi(x_{s'})\rangle - \sum_{s=1}^{N}\alpha_s, \quad \text{for } s, s' = 1, 2, \ldots, N, \qquad (2)$$

subject to

$$C \geq \alpha_s \geq 0, \quad \text{and} \quad \sum_{s}\alpha_s y_s = 0,$$

where $K(x_s, x_{s'}) = \langle\Phi(x_s),\Phi(x_{s'})\rangle$ means that we first map data into a higher dimension through the function $\Phi(\cdot)$ and then take the inner product of the mapped vectors. Formula (2) can be expressed as:

$$\min_{\alpha}\ \frac{1}{2}\alpha^T G\alpha - \mathbf{1}'\alpha, \qquad (3)$$

where $G$ is a Gram matrix ($N \times N$) satisfying Mercer's condition (requiring $G$ to be at least positive semidefinite) multiplied by the corresponding group labels, and $\alpha$ is a $1 \times N$ vector of estimands, with $G_{s,s'} = \langle\Phi(x_s),\Phi(x_{s'})\rangle y_s y_{s'}$. The 'Wolfe' dual is well suited for quadratic programming (QP) optimization routines in most software. The objective function in formula (3) can also be considered as the sum of penalty and loss functions in terms of a reproducing kernel Hilbert space with $h(\cdot) = \sum_{s=1}^{N}\alpha_s y_s K(x_s, \cdot) \in \mathcal{H}_K$ [9,10]. Once the separating hyperplane has been determined through quadratic programming optimization, the class label of a new observation $x_{new}$ can be determined by the sign function of $h(x_{new}) = \sum_{s=1}^{N}\alpha_s y_s K(x_s, x_{new}) + b$.
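To make the quadratic programming step concrete, the sketch below solves the dual in formula (3) for a linear kernel with an off-the-shelf QP solver (cvxopt). It is a minimal illustration under our own naming conventions (X, y, C, linear_svc_dual), not the authors' code.

```python
# Minimal sketch: classical SVC dual of formula (3) solved by quadratic
# programming with cvxopt (illustrative names, not the authors' implementation).
import numpy as np
from cvxopt import matrix, solvers

def linear_svc_dual(X, y, C=1.0):
    """X: (N, p) feature matrix; y: (N,) labels in {-1, +1}."""
    N = X.shape[0]
    K = X @ X.T                                        # linear kernel inner products
    P = matrix(np.outer(y, y) * K)                     # G_{s,s'} = y_s y_{s'} <x_s, x_{s'}>
    q = matrix(-np.ones(N))                            # the -1'alpha term
    Gc = matrix(np.vstack([-np.eye(N), np.eye(N)]))    # 0 <= alpha_s <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))         # sum_s alpha_s y_s = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, Gc, h, A, b)['x'])
    w = (alpha * y) @ X                                # w = sum_s alpha_s y_s x_s
    sv = alpha > 1e-6                                  # support vectors
    b0 = np.mean(y[sv] - X[sv] @ w)                    # offset averaged over support vectors
    return w, b0, alpha

# Prediction for a new observation: sign(w . x_new + b0).
```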

2.2. Longitudinal Support Vector Classifier

Consider longitudinal data collected from $N$ subjects at $T$ measurement occasions or scanning sessions, with $p$ features quantified during each session. The expanded feature matrix is then $TN$ by $p$. Let $x_{s,t}$ represent the features collected for one subject $s$ at time $t$. Hence, our aim is to classify each individual $x_s = \{x_{s,1}, x_{s,2}, \ldots, x_{s,T}\}'$ to a certain group $y_s \in \{-1, 1\}$. We characterize linear trends of change: $x_s = x_{s,1} + \beta_1 x_{s,2} + \beta_2 x_{s,3} + \cdots + \beta_{T-1}x_{s,T}$, with unknown parameter vector $\beta = (1, \beta_1, \beta_2, \ldots, \beta_{T-1})'$. The trend information is desired as input to the SVC. A key challenge that we address is how to jointly estimate the parameter vectors $\beta$ and $\alpha$. We propose a novel longitudinal support vector classifier (LSVC) that jointly estimates the separating hyperplane parameters and the temporal trend parameters using quadratic programming. We present our approach using a simple linear kernel, but the ideas naturally extend to other kernel functions.

Let $X_m = [X_{t=1}, X_{t=2}, \ldots, X_{t=T}]'$ be a $p$ by $TN$ matrix, with components $X_{t=k} = (y_1 x_{1,t=k}, y_2 x_{2,t=k}, \ldots, y_N x_{N,t=k})$ representing data from $N$ subjects, each with $p$ features. The corresponding $\beta_m$ is a $TN$ by $N$ matrix.

$$
\begin{aligned}
G &= (X_{t=1} + \beta_1 X_{t=2} + \cdots + \beta_{T-1}X_{t=T})^T (X_{t=1} + \beta_1 X_{t=2} + \cdots + \beta_{T-1}X_{t=T}) \\
  &= (X_m\beta_m)^T(X_m\beta_m) \\
  &= \beta_m^T G_m \beta_m,
\end{aligned}
\qquad (4)
$$

with

$$G_m = \begin{bmatrix} X_{t=1}^T X_{t=1} & \cdots & X_{t=1}^T X_{t=T} \\ \vdots & \ddots & \vdots \\ X_{t=T}^T X_{t=1} & \cdots & X_{t=T}^T X_{t=T} \end{bmatrix}$$

and $\beta_m^T = [I_{N\times N},\ \beta_1 I_{N\times N},\ \beta_2 I_{N\times N},\ \ldots,\ \beta_{T-1}I_{N\times N}]$.

Then, we denote $w_{nv}$ as the estimate of the separating hyperplane parameter in the classical SVC with inputs in the form of $x_s = x_{s,1} + \beta_1 x_{s,2} + \beta_2 x_{s,3} + \cdots + \beta_{T-1}x_{s,T}$. The primal objective function becomes

$$\min_{w_{nv}}\ \frac{1}{2}\|w_{nv}\|^2 + C\sum_{s=1}^{N}\xi_s, \quad s = 1, 2, \ldots, N. \qquad (5)$$

Similarly, substituting $w_{nv} = \sum_{s=1}^{N} y_s\alpha_s(x_s\beta_m)^T$, we can reparameterize the Lagrange (Wolfe) dual function as:

$$\min_{\alpha}\ \frac{1}{2}\alpha_m^T G_m\alpha_m - \mathbf{1}'\alpha, \qquad (6)$$

subject to

$$C \geq \alpha_m(s) \geq 0, \qquad \sum_{t}^{T}\sum_{s}^{N}\alpha_m(s + (t-1)N)\,y_s = 0,$$

for $s = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T - 1$.

In this way, the model augments the dimension of $G_m$ to $TN$ by $TN$, and the augmented kernel is ensured to be positive semidefinite. After $\alpha_m$ is determined, the separating hyperplane parameter becomes

$$w_{nv} = \sum_{s=1}^{n}\alpha_m(s)x_{s,1}y_s + \sum_{s=1}^{n}\alpha_m(s+N)x_{s,2}y_s + \cdots + \sum_{s=1}^{n}\alpha_m(s+N(T-1))x_{s,T}y_s. \qquad (7)$$

Defining the $1 \times T$ vector $\alpha_{m,s} = (\alpha_m(s), \alpha_m(s+N), \ldots, \alpha_m(s+(T-1)N))$, we then have $w_{nv} = \sum_{s=1}^{n} y_s\,\alpha_{m,s}\,x_s$. In either case, we can notice that

$$w_{nv} = \sum_{s=1}^{n} y_s\alpha_m(s)\left(x_{s,1} + \beta_1 x_{s,2} + \beta_2 x_{s,3} + \cdots + \beta_{T-1}x_{s,T}\right).$$

After obtaining $w_{nv}$, we have $b = \frac{1}{N}\sum_{s=1}^{N}\left(w_{nv}\cdot(x_s\beta_m)^T - y_s\right)$, in which $\beta_m$ can be estimated based on $\alpha_m$. Hence, the separating hyperplane is

$$h(x) = w_{nv}\cdot(x\beta_m)^T + b. \qquad (8)$$

The subjects with all $\alpha_m > 0$ are considered as support vectors. Therefore, this method is different from directly applying SVC after stacking up the features at different times as independent features. In fact, this naive expansion of the feature space is a special case of LSVC with all $\beta = 1$.
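As a concrete illustration of the augmented kernel in formulas (4)–(7), the sketch below builds the $TN \times TN$ matrix $G_m$ from session-wise feature matrices under a linear kernel and collapses it with a given trend vector $\beta$. The helper names (build_Gm, combine_with_trend) and the list-of-sessions data layout are our own assumptions, not the paper's code.

```python
# Sketch of the augmented linear-kernel Gram matrix G_m of formula (4): block
# (j, k) holds X_{t=j}^T X_{t=k}, i.e. between-session inner products with the
# group labels absorbed into the data. Names are illustrative only.
import numpy as np

def build_Gm(sessions, y):
    """sessions: list of T arrays, each (N, p); y: (N,) labels in {-1, +1}."""
    T, N = len(sessions), sessions[0].shape[0]
    Z = [y[:, None] * X for X in sessions]        # multiply each subject's row by its label
    Gm = np.zeros((T * N, T * N))
    for j in range(T):
        for k in range(T):
            Gm[j*N:(j+1)*N, k*N:(k+1)*N] = Z[j] @ Z[k].T
    return Gm

def combine_with_trend(Gm, beta, N):
    """Collapse G_m to the N x N matrix beta_m^T G_m beta_m for a trend vector
    beta = (1, beta_1, ..., beta_{T-1})."""
    T = len(beta)
    G = np.zeros((N, N))
    for j in range(T):
        for k in range(T):
            G += beta[j] * beta[k] * Gm[j*N:(j+1)*N, k*N:(k+1)*N]
    return G
```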

Besides estimating the $\alpha$ and $\beta$ vectors from $\alpha_m$, we can alternatively employ an iterative procedure to estimate $\alpha$ and $\beta$ with respect to the objective function of Eq. (6). The algorithm takes $T$ quadratic programming steps for each iteration. We rewrite the first part of the objective function in Eq. (6) as:

$$\alpha G_m^{0,0}\alpha + \alpha\beta_m^T G_m^{0,T}\alpha + \alpha G_m^{0,T}\beta_m\alpha + \alpha G_m^{T,T}\beta_m\alpha, \qquad (9)$$

where we denote

$$G_m = \begin{bmatrix} G_m^{0,0} & G_m^{0,T} \\ G_m^{T,0} & G_m^{T,T} \end{bmatrix}. \qquad (10)$$

For example, $G_m^{0,0}$ ($N \times N$) is the submatrix in the top left corner of the matrix $G_m$ for the baseline data ($X_{t=1}^T X_{t=1}$). As the sum of convex functions is still convex, we only need to prove that the objective function in Eq. (9) is convex with respect to $\alpha$ and $\beta$. We relegate the proof of convexity to the Appendix. The convexity guarantees that the local minimum is also the global minimum and that the solution for that minimum is unique. The algorithm is described as follows: (i) we start with initial values of $\beta$ and use QP to optimize Eq. (9) to obtain $\alpha$; (ii) we use the updated $\alpha$ obtained in step (i) and apply QP again to estimate $\beta$; (iii) we repeat the above two steps until convergence. The uniqueness of the solution leads to the convergence of the iterative algorithm.

2.3. Nonlinear Kernel Functions

Although the above derivations are considered in the context of a linear kernel, it is natural to extend to nonlinear kernels. First, we can denote

$$K(x_s, x_{s'}) = \begin{bmatrix} K(x_{s,1}, x_{s',1}) & \cdots & K(x_{s,1}, x_{s',T}) \\ \vdots & \ddots & \vdots \\ K(x_{s,T}, x_{s',1}) & \cdots & K(x_{s,T}, x_{s',T}) \end{bmatrix} \qquad (11)$$

and we have $\langle\beta K(\cdot, x_{s,t}), K(\cdot, x_{s',t})\rangle = \beta K(x_{s,t}, x_{s',t})$, where $K(\cdot, x_{s,t})$ indicates the reproducing kernel map of $x_{s,t}$. Therefore, the temporal trend is taken on the reproducing kernel mapped space, which may be a set of nonlinear transformations of $x_{s,t}$, say $K(\cdot, x_{s,t=0}) + \beta_1 K(\cdot, x_{s,t=1}) + \cdots + \beta_{T-1}K(\cdot, x_{s,t=T-1})$.

Thus, the reproducing kernel function of the separating hyperplane becomes

$$h(x) = \sum_{s=1}^{N} y_s\alpha_m K(x, x_s)\beta_m^T + b, \qquad (12)$$

where $b$ is obtained by $b = \frac{1}{N}\sum_{s=1}^{N}\sum_{s'=1}^{N}\left(y_s y_{s'}\alpha_m K(x_s, x_{s'})\beta_m^T - y_s\right)$. In this way, the length of the temporal trend parameter vector is increased in accordance with the dimension of the mapped features. Hence, it can also be considered to estimate nonlinear temporal trends of the original features.
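The block kernel matrix in formula (11) can be formed session-pair by session-pair for any Mercer kernel; the short sketch below uses a Gaussian (RBF) kernel as an example, with helper names of our own choosing.

```python
# Sketch of the T x T block kernel matrix K(x_s, x_{s'}) of formula (11) for a
# Gaussian kernel; entry (t, t') compares session t of subject s with session
# t' of subject s'. Function names are illustrative.
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian kernel between two feature vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def block_kernel(xs, xs_prime, gamma=1.0):
    """xs, xs_prime: (T, p) arrays, one subject's features per session."""
    T = xs.shape[0]
    return np.array([[rbf(xs[t], xs_prime[tp], gamma)
                      for tp in range(T)] for t in range(T)])
```

Stacking these subject-pair blocks yields the augmented kernel matrix that plays the role of the linear $G_m$ in the nonlinear case.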

2.4. Feature Selection and Parameter Tuning

Feature selection is a critical step in supervised learning, as it can reduce the dimensionality of the feature space, leading to increased robustness, improved stability of the classifier, and reduced computational load. However, for longitudinal HDD, feature selection based on 'filtering' may not be applicable because the elements of $\beta$ are unknown and thus no statistical test can be conducted for each feature. Nevertheless, 'wrapper' procedures such as the SVC-based recursive feature elimination (RFE) algorithm remain valid under our new LSVC model. The SVC-RFE algorithm was first proposed by [11]; it ranks all the features according to a classifier-based weight function and eliminates one or more features with the lowest weights. This process is repeated until a minimal set of features achieves high classification accuracy. For a linear SVC, the weights are simply summarized from the $p \times 1$ vector $w$. For a nonlinear kernel SVC, the rank of a feature is determined by the impact that its removal has on the variation of $\|w\|^2$. In the context of longitudinal HDD, the rank is determined by:

$$\left|\|w_{nv}\|^2 - \|w_{nv}^{-v}\|^2\right| = \left|\alpha_m^T G_m\alpha_m - (\alpha_m^{-v})^T G_m^{-v}\alpha_m^{-v}\right|, \qquad (13)$$

where $w_{nv}^{-v}$, $\alpha_m^{-v}$, and $G_m^{-v}$ are the estimates and inputs without feature (or features) $v$.

In addition, tuning parameters such as the cost $C$ are also important, as they can affect the estimates of the separating hyperplane parameters as well as the temporal trend parameters. If we consider $C$ as the level of shrinkage, then a large $C$ corresponds to light regularization and a small $C$ to heavy regularization. Therefore, we can use the SVC path algorithm, starting with a large $C$ (low regularization) and increasing it gradually, and observe the path of shrinkage in terms of $\alpha_m(C)$ [12]. This process provides insight concerning the bias–variance trade-off. For LSVC, we can utilize this shrinkage path algorithm to better estimate the temporal trend parameters.
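A wrapper-style elimination loop built on the ranking rule in formula (13) might look like the following. The fitting routine fit_lsvc (returning $\alpha_m$ and the augmented $G_m$ for a given feature subset) is a hypothetical placeholder, and refitting after each removal is one of several possible design choices; this is an illustration, not the authors' implementation.

```python
# Sketch of SVC-RFE adapted to the LSVC criterion of formula (13): repeatedly
# drop the feature whose removal changes alpha_m' G_m alpha_m the least.
# fit_lsvc(sessions, y, C) -> (alpha_m, Gm) is a hypothetical placeholder.
import numpy as np

def lsvc_rfe(sessions, y, n_keep, fit_lsvc, C=1.0):
    """sessions: list of T arrays (N, p); returns indices of retained features."""
    active = list(range(sessions[0].shape[1]))
    while len(active) > n_keep:
        subs = [X[:, active] for X in sessions]
        alpha_m, Gm = fit_lsvc(subs, y, C)           # fit on the current feature set
        full = alpha_m @ Gm @ alpha_m                # ||w_nv||^2 with all active features
        scores = []
        for i in range(len(active)):
            keep = [j for j in range(len(active)) if j != i]
            a_v, G_v = fit_lsvc([X[:, keep] for X in subs], y, C)
            scores.append(abs(full - a_v @ G_v @ a_v))   # formula (13) for feature i
        active.pop(int(np.argmin(scores)))           # eliminate the lowest-ranked feature
    return active
```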

3. RESULTS

We investigate the performance of the proposed method by using simulation data and by using data from a longitudinal neuroimaging study.

3.1. Simulation Study

To evaluate the performance of our proposed LSVC, we generate longitudinal data for 200 subjects and evenly divide them into two groups. Data for each subject include $p = 100$ features at two time points ($T = 2$). We generate a group label $y_s \in \{-1, 1\}$ and features $x_s$ for each subject. We also use a binary variable $z$ to determine the baseline feature expression level, e.g., if $z_s = 1$ then $x_s = 1$, otherwise $x_s = 0$. Within each group, half of the subjects have $z_s = 1$ at the lowest level of separability of the baseline data. If $z_s = y_s$, the baseline data are 100% separable. We then set up the temporal change variable $\Delta$ that depends on the group label $y$ by letting $\Delta_s = 1$ if $y_s = 1$, and $\Delta_s = 0$ otherwise. Thus, different groups have different temporal trends. Therefore, the simulation is generated as follows:

$$x_{s,t=1}\mid z = 0 \sim N(0, I\sigma^2), \qquad x_{s,t=1}\mid z = 1 \sim N(1, I\sigma^2),$$
$$\Delta_s\mid y = -1 \sim N(0, I\tau^2), \qquad \Delta_s\mid y = 1 \sim N(1, I\tau^2),$$

and $x_{s,t=2} = x_{s,t=1} + a\cdot\Delta$, where $a$ is a scalar denoting the magnitude and direction of the change. In this simulation, we use $\sigma^2 = 0.01$, $\tau^2 = 0.001$, and $a = 1$ to generate the data. The generated data are depicted in Fig. 1, with the x-axis indicating the subject number (the first 100 subjects are in group one, the rest are in group two) and the y-axis indicating the feature expression level. The three subplots depict the baseline data, the time one data, and the temporal trend.
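The design above is straightforward to reproduce; the generator below follows the stated distributions with $\sigma^2 = 0.01$, $\tau^2 = 0.001$, and $a = 1$. It is our paraphrase of the setup (including one plausible reading of the separability levels as the proportion of subjects with $z_s$ agreeing with $y_s$), not the authors' script.

```python
# Sketch (our paraphrase) of the Section 3.1 simulation: 200 subjects in two
# groups, p = 100 features, T = 2 time points, group-dependent temporal trend.
import numpy as np

def simulate(N=200, p=100, sigma2=0.01, tau2=0.001, a=1.0, sep=0.5, seed=0):
    """sep: assumed baseline separability, read as P(z_s agrees with y_s)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([-1, 1], N // 2)                          # group labels
    agree = rng.random(N) < sep                             # does z follow y?
    z = np.where(agree, (y + 1) // 2, 1 - (y + 1) // 2)     # binary baseline indicator
    x1 = rng.normal(loc=z[:, None], scale=np.sqrt(sigma2), size=(N, p))
    delta = rng.normal(loc=((y + 1) // 2)[:, None],         # mean 1 if y = 1, else 0
                       scale=np.sqrt(tau2), size=(N, p))
    x2 = x1 + a * delta                                     # second time point
    return x1, x2, y

x1, x2, y = simulate(sep=0.5)   # e.g., 50% separable at baseline, as in Table 1
```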

We test the performance of the model using different parameter and separability conditions. The variance has little influence on the model if the data are not separable, but the separability definitions do impact the results. Therefore, we consider four methods: SVC based on baseline data, SVC based on both baseline and time one data stacked and treated as independent (i.e., no temporal trends), our proposed LSVC, and SVC with a known trend. We test these methods using separability levels of 50%, 60%, and 70%. The separability level between groups could also be considered as a function of the variation between subjects within each group, where a lower level of separability between groups results from higher between-subject (within-group) variation.

Fig. 1 Simulated data set: (a) baseline data, (b) time one data, and (c) temporal change. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


Also, we run an additional simulation by introducing different random 'subject' effects upon the features both at baseline and time one. The random effects are assumed to have zero mean and variance $\sigma^2$, $2\sigma^2$, or $5\sigma^2$ (totally blurring the signal), and a higher level of noise leads to a lower signal-to-noise ratio (SNR). We then evaluate the performance of the classifiers under the different SNR levels (see Table 2). For all cases, we only consider linear kernels for equitable comparisons. In addition, for the tuning parameter $C$, there is no closed form estimator, though cross validation can be applied to assist in determining the best-performing value of $C$ from a pre-specified list of values [12]. We feel that generating prediction results based on different levels of $C$ provides a better evaluation of the SVC's performance when comparing different models.

We present the accuracy results (and standard deviations) for each method and for each simulation setting in Table 1. The results indicate that our LSVC has excellent performance, which is comparable to the 'oracle' model with perfect accuracy in our simulation example. Here, SVC with 'oracle' represents the SVC as if the temporal trend is known and maximal information is obtained from the LHDD. The traditional SVC performs very poorly across all simulation settings.

Similarly, Table 2 shows the results for the simulated data with 50% separability and different levels of noise. When the noise level is low, LSVC performs better than the traditional methods and approximates the 'oracle' model. When we double the original noise level considered, that is, the noise is taken to be $2\sigma^2$, the LSVC still performs quite well and shows marked improvements over the traditional SVC. We also consider a case where the noise saturates the signal, specifically $5\sigma^2$.

Table 1. Simulation classification results with different separability.

Cost (C)    SVC baseline    SVC stack      LSVC           SVC 'oracle'

50% separable at baseline
0.1         0.49 (0.06)     0.51 (0.04)    0.99 (0.01)    1 (0.0)
1           0.52 (0.03)     0.50 (0.03)    1 (0.00)       1 (0.0)
100         0.50 (0.02)     0.53 (0.04)    1 (0.01)       1 (0.0)
10000       0.53 (0.03)     0.48 (0.06)    0.99 (0.03)    1 (0.0)

60% separable at baseline
0.1         0.57 (0.16)     0.71 (0.31)    1 (0.0)        1 (0.0)
1           0.52 (0.23)     0.75 (0.11)    1 (0.0)        1 (0.0)
100         0.58 (0.12)     0.73 (0.14)    1 (0.01)       1 (0.0)
10000       0.63 (0.08)     0.72 (0.26)    1 (0.02)       1 (0.0)

70% separable at baseline
0.1         0.72 (0.13)     0.83 (0.04)    1 (0.01)       1 (0.0)
1           0.78 (0.07)     0.87 (0.03)    1 (0.02)       1 (0.0)
100         0.73 (0.12)     0.82 (0.04)    1 (0.02)       1 (0.0)
10000       0.71 (0.21)     0.81 (0.06)    1 (0.06)       1 (0.0)

Table 2. Simulation classification results with different noise levels.

Cost (C)    SVC baseline    SVC stack      LSVC           SVC 'oracle'

Noise: σ²
0.1         0.47 (0.13)     0.54 (0.07)    0.98 (0.03)    1 (0.0)
1           0.56 (0.09)     0.50 (0.12)    0.99 (0.01)    1 (0.0)
100         0.51 (0.06)     0.61 (0.14)    0.99 (0.01)    1 (0.0)
10000       0.52 (0.10)     0.55 (0.08)    0.99 (0.01)    1 (0.0)

Noise: 2σ²
0.1         0.57 (0.16)     0.55 (0.21)    0.96 (0.03)    1 (0.0)
1           0.52 (0.23)     0.55 (0.14)    0.96 (0.02)    1 (0.0)
100         0.48 (0.12)     0.52 (0.18)    0.98 (0.01)    1 (0.0)
10000       0.55 (0.08)     0.49 (0.21)    0.97 (0.02)    1 (0.0)

Noise: 5σ²
0.1         0.48 (0.33)     0.47 (0.28)    0.56 (0.22)    0.73 (0.11)
1           0.54 (0.17)     0.52 (0.23)    0.61 (0.18)    0.68 (0.20)
100         0.53 (0.22)     0.51 (0.26)    0.54 (0.15)    0.70 (0.17)
10000       0.51 (0.24)     0.49 (0.16)    0.58 (0.06)    0.62 (0.13)

Although this case is not likely to arise in practice, we wanted to evaluate the performance of our method under extreme conditions. Naturally, the performance of our model declines in this setting, but it still outperforms the conventional SVC approaches.

3.2. Data Example

We analyze data from the ADNI database (www.loni.ucla.edu/ADNI), which includes longitudinal PET scans acquired at baseline, six months, and 12 months. We used data from 80 subjects, 40 Alzheimer's disease (AD) patients and 40 healthy controls, ages 62–84. We used SPM5 for data preprocessing. We illustrate our longitudinal SVC procedure using PET scans from baseline and 12 months.

We use 1877 voxels within AD-relevant regions of interest (ROIs) as features, for example the hippocampus and entorhinal cortex (see Fig. 2). On the basis of the voxels within the selected ROIs, we applied our novel longitudinal SVC to discriminate the healthy and AD groups. Our goal here is not to chase perfect classification accuracy through tuning parameters and feature selection; rather, we demonstrate the usage of the proposed method and compare it with the alternatives. For the validation procedure, we choose leave-one-out cross-validation. We tested the data using three classification methods: SVC with the baseline session only, SVC with two sessions stacked independently ($N$ by $2p$), and the proposed LSVC. Also, different kernels are used. The results show that the accuracies for the two alternative methods across all costs are around 50% when polynomial (degree 2, 3, 5, 10) and Gaussian kernels (with various values of $\sigma$) are used. We tune the cost parameter across $C = (0.1, 1, 100, 10000)$. The accuracies based on leave-one-out cross-validation are listed in Table 3 across all costs.


Fig. 2 Voxels in these ROIs are used for analysis. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table 3. ADNI PET data classification results.

Cost (C)    SVC baseline    SVC stack    LSVC
0.1         0.65            0.66         0.78
1           0.66            0.67         0.76
100         0.65            0.67         0.75
10000       0.66            0.66         0.75

In general, the accuracy of the LSVC method is 10–15% higher than that of the other two alternative methods.

Overall, on the basis of the simulation study and the neuroimaging data analysis, our proposed method outperforms the traditional methods.

4. DISCUSSION

In this article, we present a novel support vector classifier for LHDD. Our proposed method estimates decision function parameters and longitudinal parameters simultaneously using quadratic programming. The classifier can be extended to any kernel that satisfies Mercer's condition, in which case the temporal trend is based on nonlinear transformations of the original feature space. The SVC-RFE feature selection procedure can also be conducted in our LSVC, with the ranking weight based on the width of the separating margins.

We apply the proposed method to longitudinal neuroimaging data, which is a type of LHDD with temporal and spatial correlation structure. A growing literature has addressed the issue of temporal and spatial correlation when modeling neuroimaging data as dependent variables [13,14]. However, in our model the LHDD serve as independent variables and the group label for each subject is the dependent variable, and we usually do not explicitly account for correlations among the predictors. Note that we model the temporal trend for the LHDD to account for the temporal correlations introduced by the longitudinal experimental design. For fMRI data, as we use the first-level analysis results as features, the scan-to-scan temporal correlation is addressed in the first-level analysis using conventional approaches such as prewhitening or precoloring.

In our data example, we use biological information to effectively reduce the number of features from around 300,000 to 1877, rather than performing variable selection empirically. When such biological information is not available, some supervised methods are applicable. On the basis of the results from our simulation study and our data example, the LSVC leverages the additional information from longitudinal measurements to achieve higher prediction accuracy. The computational load of our LSVC technique is generally quite manageable; on average, training an LSVC model of 200 subjects with 100 features takes roughly 14 min on a PC with an Intel Core 2 Duo 2.83 GHz CPU and 4 GB of memory.

5. APPENDIX: PROOF

PROPOSITION 1: $f = \alpha_m^T G_m\alpha_m$ is a convex function with respect to $\alpha$ and $\beta$, where $\alpha_m = (\alpha, \beta_1\alpha, \ldots, \beta_{T-1}\alpha)$.

Proof: The second order condition of convexity requires the Hessian matrix $\nabla^2 f$ to be positive semidefinite (p.s.d.), where

$$\nabla^2 f = \begin{bmatrix} \dfrac{\partial^2 f}{\partial\alpha^2} & \dfrac{\partial^2 f}{\partial\alpha\,\partial\beta} \\[2mm] \dfrac{\partial^2 f}{\partial\beta\,\partial\alpha} & \dfrac{\partial^2 f}{\partial\beta^2} \end{bmatrix}.$$

We first present the case of two time points and then extend it to $T$ time points. Therefore,

$$f = \begin{pmatrix}\alpha \\ \beta\alpha\end{pmatrix}^T \begin{bmatrix} G_m^{0,0} & G_m^{0,1} \\ G_m^{1,0} & G_m^{1,1} \end{bmatrix} \begin{pmatrix}\alpha \\ \beta\alpha\end{pmatrix},$$

where $G_m^{0,0} = X_{t=1}^T X_{t=1}$, $G_m^{0,1} = X_{t=1}^T X_{t=2}$, $G_m^{1,0} = X_{t=2}^T X_{t=1}$, and $G_m^{1,1} = X_{t=2}^T X_{t=2}$. Then, the four derivatives are:

$$\frac{\partial^2 f}{\partial\alpha^2}\,(N\times N) = G_m^{0,0} + \beta G_m^{0,1} + \beta G_m^{1,0} + \beta^2 G_m^{1,1},$$
$$\frac{\partial^2 f}{\partial\alpha\,\partial\beta}\,(N\times 1) = (G_m^{0,1} + \beta G_m^{1,1})\alpha,$$
$$\frac{\partial^2 f}{\partial\beta\,\partial\alpha}\,(1\times N) = \alpha^T(G_m^{0,1} + \beta G_m^{1,1}),$$
$$\frac{\partial^2 f}{\partial\beta^2}\,(1\times 1) = \alpha^T G_m^{1,1}\alpha.$$


Next, we need to prove that $\nabla^2 f$ is p.s.d. For any nonzero vector $v$ of length $N$ and scalar $u$,

$$
\begin{aligned}
\begin{pmatrix} v \\ u \end{pmatrix}^T
\begin{bmatrix} \dfrac{\partial^2 f}{\partial\alpha^2} & \dfrac{\partial^2 f}{\partial\alpha\,\partial\beta} \\[2mm] \dfrac{\partial^2 f}{\partial\beta\,\partial\alpha} & \dfrac{\partial^2 f}{\partial\beta^2} \end{bmatrix}
\begin{pmatrix} v \\ u \end{pmatrix}
&= v^T G_m^{0,0}v + v^T G_m^{0,1}\alpha u + \beta v^T G_m^{0,1}v + u\alpha^T G_m^{0,1}v + v^T G_m^{1,0}v\beta \\
&\quad + u\alpha^T G_m^{1,1}v\beta + \beta v^T G_m^{1,1}\alpha u + \beta v^T G_m^{1,1}v\beta + u\alpha^T G_m^{1,1}\alpha u \\
&= [X_{t=2}(\beta v + u\alpha) + X_{t=1}v]^T[X_{t=2}(\beta v + u\alpha) + X_{t=1}v] + \beta v^T G_m^{1,1}v\beta + u\alpha^T G_m^{1,1}\alpha u \geq 0,
\end{aligned}
$$

because $G_m^{1,1}$ is p.s.d.

Similarly, for a data set with $T$ time points, the quadratic form

$$
\left[X_{t=1}v + \sum_{k=1}^{T-1}X_{t=k+1}(\beta_k v + u_k\alpha)\right]^T
\left[X_{t=1}v + \sum_{k=1}^{T-1}X_{t=k+1}(\beta_k v + u_k\alpha)\right]
+ \sum_{k=1}^{T-1}u_k\alpha^T G_m^{k,k}\alpha u_k
+ \sum_{k=1}^{T-1}\beta_k v^T G_m^{1,1}v\beta_k
$$

is nonnegative, so $\nabla^2 f$ is also p.s.d.

Moreover, the objective functions with nonlinear kernels are also convex if each $K(X_{t=k}, X_{t=k})$ satisfies Mercer's condition and is p.s.d. for $k = 1, 2, \ldots, T$.

ACKNOWLEDGMENTS

This research was supported by NIH grant R01-MH079251 (PI: Dr. F. DuBois Bowman).

REFERENCES

[1] T. M. Mitchell, R. Hutchinson, R. S. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman, Learning to decode cognitive states from brain images, Mach Learn 57 (2004), 145–175.

[2] S. LaConte, S. Strother, V. Cherkassky, J. Anderson, and X. Hu, Support vector machines for temporal classification of block design fMRI data, Neuroimage 26 (2005), 317–329.

[3] S. Chen, D. Hong, and Y. Shyr, Wavelets-based procedures for proteomic mass spectrometry data processing, Comput Stat Data Anal 52 (2007), 211–220.

[4] V. Vapnik, Statistical Learning Theory, New York, Wiley, 1998.

[5] J. Mourao-Miranda, A. L. Bokde, C. Born, H. Hampel, and M. Stetter, Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data, Neuroimage 28 (2005), 980–995.

[6] C. H. Y. Fu, J. Mourao-Miranda, S. G. Costafreda, A. Khanna, A. F. Marquand, S. C. R. Williams, and M. J. Brammer, Pattern classification of sad facial processing: toward the development of neurobiological markers in depression, Biol Psychiatry 63 (2008), 656–662.

[7] R. C. Craddock, P. E. Holtzheimer, X. P. Hu, and H. C. Mayberg, Disease state prediction from resting state functional connectivity, Magn Reson Med 62 (2009), 1619–1628.

[8] V. Vapnik, The Nature of Statistical Learning Theory, New York, Springer, 1996, 188.

[9] G. Wahba, Spline Models for Observational Data, Philadelphia, PA, SIAM, 1990.

[10] T. Hastie and R. Tibshirani, Generalized Additive Models, London, Chapman and Hall, 1990.

[11] I. Guyon and A. Elisseeff, Introduction to variable and feature selection, J Mach Learn Res 3 (2003), 1157–1182.

[12] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, The entire regularization path for the support vector machine, J Mach Learn Res 5 (2004), 1391–1415.

[13] F. Bowman, B. Caffo, S. Bassett, and C. Kilts, A Bayesian hierarchical framework for spatial modeling of fMRI data, NeuroImage 39 (2008), 146–156.

[14] G. Derado, F. Bowman, and C. Kilts, Modeling the spatial and temporal dependence in fMRI data, Biometrics 66 (2010), 949–957.
