

Dictionary Learning-Based fMRI Data Analysis for Capturing Common and Individual Neural Activation Maps

Rui Jin, Krishna K. Dontaraju, Seung-Jun Kim, M. A. B. S. Akhonda, Tulay Adali
Dept. of Computer Science & Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250
{rjin1,krishna5,sjkim,mo32,adali}@umbc.edu

Abstract—A novel dictionary learning (DL) method is proposed to estimate sparse neural activations from multi-subject fMRI data sets. By exploiting label information, such as membership in the patient and healthy control groups, the method captures both the activation maps that are commonly shared across the groups and those that can explain the group differences. The proposed method was tested using real fMRI data sets consisting of schizophrenic subjects and healthy controls. The DL approach not only reproduced most of the maps obtained from conventional independent component analysis (ICA), but also identified more maps that are significantly group-different, including a number of novel ones not revealed by ICA. A stability analysis of the DL method and a correlation analysis with separate neuropsychological test scores further strengthen the validity of our analysis.

I. INTRODUCTION

Functional magnetic resonance imaging (fMRI) has become a representative neuroimaging modality, thanks to its contributions to understanding brain functions and pathologies [1]–[4]. FMRI can reveal areas of high neural activity non-invasively based on neurovascular coupling [5]. The spatial resolution offered by fMRI is much finer than that of other existing modalities such as electroencephalography (EEG) and magnetoencephalography (MEG), often providing important complementary views in multi-modal studies [6], [7]. The group analysis of fMRI data can capture common functional networks across multiple subjects [8], [9]. FMRI analysis is also useful for identifying variations present in different subgroups and individuals. In particular, fMRI studies are instrumental for finding potentially distinguishing biomarkers for a variety of brain diseases [1], [2], [7].

The model-driven approaches for fMRI data analysis typically correlate the time-series data with a hypothesized reference signal such as the hemodynamic response function (HRF), as in the general linear model (GLM) implemented in the statistical parametric mapping (SPM) software [10]. However, the results depend on the reliability of the assumptions made. In contrast, data-driven approaches make minimal assumptions, rendering the analysis more robust to modeling errors, adaptive to various nuisances in the data, and accommodating of individual traits in multiple data sets.

This work was supported in part by NSF grants 1631838 and 1618551, as well as a UMBC seed grant.

Blind source separation (BSS) approaches such as independent component analysis (ICA) have been the major data-driven analysis tools for fMRI data [11]. The ICA method aims to extract statistically independent sources from their linear mixture observations. When applied to fMRI data analysis, ICA obtains functionally meaningful spatial maps without imposing any constraints in the temporal domain [8], [9]. ICA can be extended to the case where multiple data sets corresponding to different subjects are to be processed together. Group ICA concatenates the multi-subject data sets across the time dimension before performing ICA jointly [12]. The independent vector analysis (IVA) method captures the dependence across the data sets by forming so-called source component vectors (SCVs). Then, the independence across the SCVs and the dependence within each SCV are maximized [13], [14]. Some of the spatial components identified this way were shown to reveal the differences between the patient and healthy control groups with better performance than the widely used group ICA [13], [15].

More recent approaches include the use of deep learning techniques, which exhibit impressive performance in various machine learning and computer vision applications. A restricted Boltzmann machine (RBM) and a deep belief network (DBN) were employed for fMRI and structural MRI (sMRI) data analysis in [16], and the interpretation of the learned deep network features was attempted using their nonlinear embeddings. Convolutional neural networks (CNNs) were adopted for predicting Alzheimer's disease and autism spectrum disorder (ASD) with good prediction performance [17], [18]. In order to interpret the learned biomarkers, the difference in prediction performance was analyzed when a region of interest (ROI) was corrupted in the input data [18], or a direct sensitivity analysis was performed on the learned network [19]. In summary, while deep learning methods extract useful biomarkers, the interpretation remains challenging, especially as the depth of the network increases.

In this work, a dictionary learning (DL) approach is developed for fMRI data analysis [20], [21]. DL postulates that each data sample can be represented as a linear combination of a small number of atoms in the dictionary, thus capturing a union-of-subspaces structure. Rather than employing a predefined dictionary such as the Fourier or the wavelet basis, DL aims at learning the basis from the data, which can even be overcomplete if the number of measurements is small compared to the dimension of the subspaces. The unsupervised learning method boils down to factorizing the data matrix into a dictionary matrix and a sparse code matrix while minimizing the reconstruction error. The approach is flexible in that one can incorporate various side information into the learning cost in the form of appropriate regularizers. DL has shown state-of-the-art performance in a variety of applications including image denoising and inpainting, as well as object recognition [22]–[27].

This is a preprint. For the final version, please refer to IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/JSTSP.2020.2992430.

The learned dictionary atoms are often interpretable, which is a critical merit in medical image analysis. In [28], a DL formulation was applied to task fMRI data with a minimum description length (MDL) criterion for determining the sparsity level. It was shown that the obtained time courses were highly correlated with the canonical HRFs and the spatial activation maps were localized to the relevant brain areas. DL was employed on whole-brain task fMRI data for brain functional network identification in [29]. A hierarchical probabilistic model for brain activity patterns was proposed in the framework of DL for analyzing multi-subject resting-state fMRI data to estimate subject-specific time courses and the spatial maps that are shared across the population [30]. The performances of the DL and ICA methods were compared using various metrics in [31], where it was shown that the time courses extracted by DL often exhibit spectral patterns that match well with actual neural activities.

The DL framework can be used for supervised learning as well, where the dictionary is trained to capture distinctive patterns in the data such that the corresponding sparse codes can be used for predicting the labels for the data [32], [33]. This can be done by simply employing a cost function that augments the reconstruction error with the label prediction error, or by directly seeking a discriminative dictionary via minimizing the prediction error of the classifier as a function of the dictionary.

In the context of fMRI data analysis, one is not necessarily interested in improving the prediction accuracy itself, but rather in capturing detailed neural activities by incorporating available labels in the data sets. For instance, using multi-subject data sets that have group labels available, one can obtain the patterns that are common across the population, as well as the patterns that are representative of the distinct traits in different groups. A dictionary shared across all subjects and a set of subject-specific dictionaries were learned in [34], where incoherence among the learned dictionaries was ensured by penalizing high pairwise correlations. The task fMRI data parcellated into ROIs for schizophrenic patients and normal controls were analyzed in a DL framework in [35], where it was postulated that the schizophrenic group's responses contain sparse subspace deviations from the control group's population mean, and the patient-specific projections of the deviations were used to predict the patients' genetic risks.

In this work, the DL approach is taken for the joint analysis of multi-subject fMRI data consisting of different subject groups. The goal is to estimate the spatial activation maps that are shared across the population as well as the maps that can explain the group differences. The group labels are incorporated through the Fisher discriminant cost added to the DL objective, which encourages clustering of the sparse codes in the same group, while maximizing the distances between group centroids [33]. A second-level analysis is employed here, where each voxel's time-series is regressed first to appropriate reference signals to obtain a single map per subject [36]–[38]. In order to learn the shared and individual brain spatial component maps, two sub-dictionaries are learned that collect the common and the discriminative bases. Then, the vector of coefficients computed per subject capturing the activations of the discriminative component maps is used as the input to the Fisher criterion for group differentiation.

The idea of estimating shared and discriminative sub-dictionaries was explored for image classification problems in [39], where the sets of shared and discriminative features are learned as sub-dictionaries, and the sparse coefficients corresponding to the discriminative sub-dictionaries are used for classification. In fMRI data analysis, however, it is more natural that the sparsity constraints are imposed on the spatial component maps [40], as it is known that the spatial brain activations are localized and super-Gaussian [41]. These sparse spatial component maps also play the role of the basis set for explaining the data, and the coefficients associated with the discriminative maps are used for group prediction. In a nutshell, the sparsity is imposed on the basis set, rather than the coefficients, in our work. Furthermore, instead of employing per-class discriminative sub-dictionaries as in [39], we adopt only a single discriminative sub-dictionary. This is because in fMRI data analysis, even the discriminative component maps are often highly correlated across groups, and estimating per-group maps complicates their interpretation [42]. Moreover, our choice significantly simplifies parameter selection, such as the selection of the model order, and improves the stability of the estimated maps.

For the evaluation of the proposed method, a real task fMRI data set comprising schizophrenic and healthy control subjects is analyzed. To validate the learned brain activation maps, a systematic comparison is done with the maps obtained from the conventional ICA method. To robustify our analysis against local optima, which emerge from solving the nonconvex DL formulation, a rigorous stability-ensuring procedure is employed as well. The estimated maps show that our proposed method can reproduce most of the ICA maps, and also find novel discriminative and interpretable component maps that are not revealed by ICA. A correlation analysis using a set of behavioral test scores further validates the results.

The rest of the paper is organized as follows. The novel DL formulation is presented in Sec. II. An algorithm to solve the proposed DL problem is derived in Sec. III. The proposed DL method is evaluated and compared with ICA using real task fMRI data in Sec. IV. Finally, the conclusions are provided in Sec. V.

II. PROBLEM FORMULATION

A. Unsupervised Dictionary Learning

The DL model postulates that the signal vector can be represented by a linear combination of a small number of atoms chosen from a dictionary. Thus, the signals reside in a union of subspaces, and the dictionary constitutes an overcomplete set of bases for the subspaces. DL aims at learning the dictionary from the data.

The conventional DL formulation is an unsupervised learning problem. Let $\mathbf{x}_n \in \mathbb{R}^M$ for $n = 1, 2, \ldots, N$ be the $n$-th datum. Collect them in the data matrix $\mathbf{X} := [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$. For a dictionary $\mathbf{D} := [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_K] \in \mathbb{R}^{M \times K}$, which consists of $K$ atoms $\{\mathbf{d}_k \in \mathbb{R}^M\}$, the DL model assumes that each datum $\mathbf{x}_n \approx \sum_{k=1}^K z_{nk}\mathbf{d}_k = \mathbf{D}\mathbf{z}_n$ for a sparse coefficient vector $\mathbf{z}_n = [z_{n1}, \ldots, z_{nK}]^\top$, where $\top$ denotes transposition. Upon defining the sparse matrix $\mathbf{Z} := [\mathbf{z}_1, \ldots, \mathbf{z}_N] \in \mathbb{R}^{K \times N}$, the DL problem can be formulated as [43]

$$\min_{\mathbf{D},\mathbf{Z}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{Z}\|_F^2 + \lambda\|\mathbf{Z}\|_1 \tag{1a}$$
$$\text{subject to } \mathbf{D} \in \mathcal{D} := \{\mathbf{D} : \|\mathbf{d}_k\|_2 \le 1,\; k = 1, 2, \ldots, K\} \tag{1b}$$

where $\|\cdot\|_F$ is the Frobenius norm, $\|\mathbf{Z}\|_1 := \sum_{n,k}|z_{nk}|$ is the $\ell_1$-norm, which promotes sparsity in $\mathbf{Z}$, and $\lambda > 0$ is a parameter that can be varied to adjust the sparsity level of $\mathbf{Z}$. The constraints in (1b) normalize the dictionary atoms, which is necessary due to the scaling ambiguity inherent in the model. That is, scaling the $k$-th column $\mathbf{d}_k$ of $\mathbf{D}$ by $\alpha$ and the $k$-th row $\mathbf{z}_k^\top \in \mathbb{R}^{1 \times N}$ of $\mathbf{Z}$ by $1/\alpha$ will not alter the product. Thus, the formulation essentially seeks a bi-factorization of the data matrix $\mathbf{X}$ into a normalized dictionary matrix $\mathbf{D}$ and a sparse coefficient matrix $\mathbf{Z}$. The DL problem (1) is not a convex optimization problem, but an alternating minimization algorithm can be employed to reach a locally optimal solution [44], [45].
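As a concrete illustration of formulation (1), the following toy sketch (our own illustrative code, not the authors' implementation; the function names and parameter values are hypothetical) alternates between an ISTA sparse-coding step for Z and a per-atom update of D with projection onto the unit ball:

```python
import numpy as np

def soft_threshold(M, alpha):
    # Entry-wise soft-thresholding: sign(m) * max(|m| - alpha, 0)
    return np.sign(M) * np.maximum(np.abs(M) - alpha, 0.0)

def dict_learn(X, K, lam=0.1, n_outer=50, n_inner=100, seed=0):
    """Toy alternating minimization for (1):
    min 0.5*||X - D Z||_F^2 + lam*||Z||_1  s.t.  ||d_k||_2 <= 1."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    D = rng.standard_normal((M, K))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)   # feasible start
    Z = np.zeros((K, N))
    for _ in range(n_outer):
        # Z-step: ISTA on the convex LASSO subproblem
        L = np.linalg.eigvalsh(D.T @ D).max() + 1e-12  # Lipschitz constant
        for _ in range(n_inner):
            G = -D.T @ (X - D @ Z)                     # gradient of smooth part
            Z = soft_threshold(Z - G / L, lam / L)
        # D-step: block coordinate update per atom, projected to the unit ball
        A, B = Z @ Z.T, X @ Z.T
        for k in range(K):
            if A[k, k] < 1e-12:          # atom currently unused; skip
                continue
            u = (B[:, k] - D @ A[:, k]) / A[k, k] + D[:, k]
            D[:, k] = u / max(np.linalg.norm(u), 1.0)
    return D, Z
```

Because the joint problem is nonconvex, such a scheme only reaches a local optimum, consistent with the discussion of (1).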

B. Supervised Dictionary Learning

The DL method can also be used for supervised learning tasks. In this case, rather than learning the dictionary to represent the input data with high fidelity, a discriminative dictionary can be learned, which captures the unique traits in the data that are characteristic of the different classes. In neuroimaging applications, discriminative DL can reveal neural activity patterns that are unique to different (groups of) subjects.

One way to formulate a supervised DL problem is to augment the learning objective with a classification cost. Similar to [33], we advocate employing Fisher's discriminant cost [46]. Suppose that the entire data set $\mathbf{X}$ is partitioned into $C$ classes. Let $\mathcal{N}_c$ with cardinality $N_c$ be the set of sample indices belonging to class $c$, for $c = 1, 2, \ldots, C$. Let $\mathbf{y}_n$ be the feature vector, to be learned by DL, corresponding to the input sample $\mathbf{x}_n$, for all $n = 1, 2, \ldots, N$, and $\mathbf{Y} := [\mathbf{y}_1, \ldots, \mathbf{y}_N]$. The class mean and the overall mean vectors are defined as

$$\mathbf{m}_c := \frac{1}{N_c}\sum_{n \in \mathcal{N}_c}\mathbf{y}_n \tag{2}$$
$$\mathbf{m} := \frac{1}{N}\sum_{n=1}^{N}\mathbf{y}_n \tag{3}$$

respectively. Let us also define the so-called within-class scatter matrix $\mathbf{S}_w$ and the between-class scatter matrix $\mathbf{S}_b$ as

$$\mathbf{S}_w(\mathbf{Y}) := \sum_{c=1}^{C}\sum_{n \in \mathcal{N}_c}(\mathbf{y}_n - \mathbf{m}_c)(\mathbf{y}_n - \mathbf{m}_c)^\top \tag{4}$$
$$\mathbf{S}_b(\mathbf{Y}) := \sum_{c=1}^{C} N_c(\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^\top \tag{5}$$

respectively. The Fisher criterion aims at learning the features such that they are clustered together within the same class, leading to a small intra-class scatter, while at the same time the class means are far away from one another, yielding a large inter-class scatter. Thus, a suitable cost to minimize for the classification task is

$$f(\mathbf{Y}) := \mathrm{tr}\{\mathbf{S}_w(\mathbf{Y})\} - \mathrm{tr}\{\mathbf{S}_b(\mathbf{Y})\} + \|\mathbf{Y}\|_F^2 \tag{6}$$

where the last term ensures the convexity of the cost function with respect to $\mathbf{Y}$ [33], [47].

In fact, upon defining $N$-by-$N$ matrices $\mathbf{H}_1$ and $\mathbf{H}_2$, where the $(n, n')$-entry $h_{1,nn'}$ of $\mathbf{H}_1$ is defined as

$$h_{1,nn'} := \begin{cases} \frac{1}{N_c} & \text{if } n, n' \in \mathcal{N}_c \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

and all entries of $\mathbf{H}_2$ are equal to $1/N$, Fisher's objective in (6) can be re-written as

$$f(\mathbf{Y}) = \|\mathbf{Y}(\mathbf{I} - \mathbf{H}_1)\|_F^2 - \|\mathbf{Y}(\mathbf{H}_1 - \mathbf{H}_2)\|_F^2 + \|\mathbf{Y}\|_F^2. \tag{8}$$

Thus, the Hessian $\nabla^2 f(\mathbf{Y})$ is computed as

$$\mathbf{H} := 2\mathbf{I} - 2\mathbf{H}_1 + \mathbf{H}_2 \tag{9}$$

where the symmetry of $\mathbf{H}_1$ and $\mathbf{H}_2$ as well as the identities $\mathbf{H}_1^2 = \mathbf{H}_1$, $\mathbf{H}_2^2 = \mathbf{H}_2$, and $\mathbf{H}_1\mathbf{H}_2 = \mathbf{H}_2\mathbf{H}_1 = \mathbf{H}_2$ are used. It can be easily proved that $\mathbf{H}$ is positive semi-definite by showing that the eigenvalues of $\mathbf{H}$ are nonnegative [47]. Furthermore, $f(\mathbf{Y})$ can be simply written as

$$f(\mathbf{Y}) = \mathrm{tr}\{\mathbf{Y}\mathbf{H}\mathbf{Y}^\top\}. \tag{10}$$
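The equivalence between the scatter-based cost (6) and the trace form (10) can be checked numerically. The sketch below (illustrative code with hypothetical function names) builds H1 and H2 from class labels as in (7) and evaluates the cost both ways:

```python
import numpy as np

def fisher_cost_scatter(Y, labels):
    # f(Y) = tr{Sw} - tr{Sb} + ||Y||_F^2, from the scatter definitions (4)-(6)
    m = Y.mean(axis=1, keepdims=True)          # overall mean (3)
    f = np.linalg.norm(Y) ** 2                 # Frobenius regularizer
    for c in np.unique(labels):
        Yc = Y[:, labels == c]
        mc = Yc.mean(axis=1, keepdims=True)    # class mean (2)
        f += np.linalg.norm(Yc - mc) ** 2                 # tr{Sw} term
        f -= Yc.shape[1] * np.linalg.norm(mc - m) ** 2    # tr{Sb} term
    return f

def fisher_cost_quadratic(Y, labels):
    # f(Y) = tr{Y H Y^T} with H = 2I - 2*H1 + H2, per (7)-(10)
    N = Y.shape[1]
    H1 = np.zeros((N, N))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        H1[np.ix_(idx, idx)] = 1.0 / len(idx)  # block of 1/N_c within class c
    H2 = np.full((N, N), 1.0 / N)
    H = 2 * np.eye(N) - 2 * H1 + H2
    return np.trace(Y @ H @ Y.T)
```

The two evaluations agree for any Y and labeling, which is exactly the identity used to obtain (10).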

Thus, a supervised DL problem can be posed as

$$\min_{\mathbf{D} \in \mathcal{D},\, \mathbf{Z}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{Z}\|_F^2 + \lambda\|\mathbf{Z}\|_1 + \frac{\mu}{2} f(\mathbf{Z}) \tag{11}$$

where $\mathbf{D}$ now captures the discriminative basis for the data $\{\mathbf{x}_n\}$, the sparse codes $\{\mathbf{z}_n\}$ play the role of the features input to the classification cost $f(\cdot)$, and $\mu$ is a parameter that balances the reconstruction error and the Fisher cost.


[Fig. 1 appears here.]

Fig. 1: Illustration of the proposed DL model. The input data $\mathbf{X}$ is a row-wise concatenation of the fMRI volumes of $N$ voxels from $M$ subjects, belonging to HC and SZ groups. The rows of the sparse matrix $\mathbf{Z}$ capture the sparse spatial component maps, where $\bar{\mathbf{Z}}$ collects the common maps and $\tilde{\mathbf{Z}}$ the discriminative ones. The columns of $\mathbf{D}$ capture subject-wise weights of individual components, i.e., spatial maps. Only the discriminative activations $\tilde{\mathbf{D}}^\top$ are fed to the Fisher's criterion for group differentiation.

C. Capturing Common and Discriminative Features

In medical image analysis, the focus is not merely on extracting discriminative features for improving classification performance, but rather on obtaining features that can explain both common and individual traits in the groups of samples. For this purpose, a structured dictionary is employed, where $\mathbf{D}$ contains both the common dictionary $\bar{\mathbf{D}}$ that is shared across all class data, and the discriminative dictionary $\tilde{\mathbf{D}}$, which captures the features reflecting the class differences. That is, let $\mathbf{D} := [\bar{\mathbf{D}}, \tilde{\mathbf{D}}]$, where $\bar{\mathbf{D}} \in \mathbb{R}^{M \times \bar{K}}$ and $\tilde{\mathbf{D}} \in \mathbb{R}^{M \times \tilde{K}}$. Here, $\bar{K}$ is the number of common atoms, and $\tilde{K}$ the number of discriminative atoms. Likewise, the sparse coefficient matrix $\mathbf{Z}$ is partitioned compatibly as $\mathbf{Z} = [\bar{\mathbf{Z}}^\top, \tilde{\mathbf{Z}}^\top]^\top$, where $\bar{\mathbf{Z}} \in \mathbb{R}^{\bar{K} \times N}$ and $\tilde{\mathbf{Z}} \in \mathbb{R}^{\tilde{K} \times N}$.

Since only the discriminative features would contribute to predicting the class labels, only $\tilde{\mathbf{Z}}$ is input to the Fisher cost. The overall supervised DL formulation can thus be constructed as

$$\min_{\mathbf{D} \in \mathcal{D},\, \mathbf{Z}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{Z}\|_F^2 + \lambda\|\mathbf{Z}\|_1 + \frac{\mu}{2} f(\tilde{\mathbf{Z}}). \tag{12}$$

Existing methods often adopt per-class discriminative sub-dictionaries and impose additional constraints related to the incoherence of the sub-dictionaries, which can improve classification performance in general [33], [34], [39]. Furthermore, to enhance identifiability, a low-rank constraint can be added to the common dictionary for some image classification applications [39]. In this work, our goal is to obtain brain activation maps from fMRI data that faithfully capture the spatial activations and their per-subject contributions. In fMRI data analysis, even the discriminative component maps are often highly correlated across different groups, and estimating per-group maps may hinder their interpretation [42]. Furthermore, proper order selection, which is critical for estimating meaningful maps embedded in noise, becomes more complicated when there are multiple sub-dictionaries. Thus, in this work, we advocate the simple structure involving a single discriminative dictionary along with a common dictionary.

D. Proposed DL Formulation for fMRI Data Analysis

The fMRI data obtained in a scanning session from a subject is a 4-dimensional data set consisting of 3-D scans of $N$ voxels taken at $T$ different time points. One can flatten the 3-D volume at each time point into an $N$-dimensional vector, transforming the data into a $T$-by-$N$ matrix. To facilitate the processing of multi-subject data sets collected from $M$ subjects, involving $M$ such matrices, the DL model is employed for a second-level analysis. That is, the time dimension in each voxel-wise time-series is first regressed away against a reference signal of length $T$, to obtain a single spatial map per subject, i.e., an $N$-dimensional row vector. More detail is given in Sec. IV-A. Thus, the multi-subject fMRI data $\mathbf{X}$ to be analyzed is an $M$-by-$N$ matrix, as shown in Fig. 1.
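The second-level preparation described above can be sketched as follows. This is a simplified illustration with hypothetical names; the actual pipeline in Sec. IV-A uses the SPM GLM with multiple regressors and contrast images:

```python
import numpy as np

def subject_spatial_map(ts, ref):
    """Regress each voxel's time series onto a reference signal and keep the
    slope, yielding one N-dimensional spatial map per subject.
    ts  : (T, N) matrix of voxel-wise time series
    ref : length-T reference signal (e.g., an HRF-based task regressor)"""
    R = np.column_stack([ref, np.ones_like(ref)])   # regressor + intercept
    beta, *_ = np.linalg.lstsq(R, ts, rcond=None)   # least squares, all voxels at once
    return beta[0]                                  # slope w.r.t. the reference

def build_group_matrix(subject_ts, ref):
    # Stack the per-subject maps as rows to form the M x N input X of Fig. 1.
    return np.vstack([subject_spatial_map(ts, ref) for ts in subject_ts])
```

Each subject thus contributes one row to X, so the DL factorization operates on spatial maps rather than raw time series.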

In conventional image classification works, the data vectors belonging to different classes are stacked as columns in the data matrix for DL analysis. For example, (12) can be used with $\mathbf{X}^\top$ as the input, in which case the columns of $\mathbf{D}$ would capture the common and discriminative spatial maps, and the columns of $\mathbf{Z}$ would estimate sparse component activations for each subject. We experimented with this approach in our conference precursor [40]. One issue with this approach is that, for fMRI data, it is more natural to impose sparsity on the component maps, rather than the activations, as it is well-documented that the component maps are super-Gaussian. Existing DL analysis methods for fMRI data also often take this approach [28], [29]. Our own comparison also found that the latter approach tends to obtain more meaningful results [40].

Input: X, D(0), Z(0), λ, µ
Output: D(∞), Z(∞)

1:  Initialize D(0) and Z(0) randomly. Set ℓ = 0.
2:  While not converged
3:    Set i = 0, t(0) = 1, Z(ℓ,0) = Z(ℓ), W(ℓ,0) = Z(ℓ), and L(ℓ) = λmax((D(ℓ))⊤D(ℓ))
      /* Update Z */
4:    While not converged
5:      G(ℓ,i) ← −(D(ℓ))⊤(X − D(ℓ)W(ℓ,i))
6:      Z(ℓ,i+1) ← S_{λ/L(ℓ)}(W(ℓ,i) − G(ℓ,i)/L(ℓ))
7:      t(i+1) ← (1 + √(1 + 4(t(i))²))/2
8:      W(ℓ,i+1) ← Z(ℓ,i+1) + ((t(i) − 1)/t(i+1))(Z(ℓ,i+1) − Z(ℓ,i))
9:      i ← i + 1
10:   End While
11:   Set s = 0, Z(ℓ+1) = Z(ℓ,i), and D(ℓ,0) = D(ℓ)
      /* Update D */
12:   Set A = Z(ℓ+1)(Z(ℓ+1))⊤ and B = X(Z(ℓ+1))⊤
13:   While not converged
14:     For k = 1, 2, …, K̄
15:       u_k ← (1/a_kk)(b_k − D(ℓ,s)a_k) + d_k(ℓ,s)
16:       d_k(ℓ,s+1) ← u_k / max{‖u_k‖₂, 1}
17:       Set the k-th column of D(ℓ,s) to d_k(ℓ,s+1)
18:     End For
19:     For k = K̄ + 1, …, K
20:       u_k ← (a_kk I + µH)⁻¹(b_k − Σ_{k′≠k} a_kk′ d_k′(ℓ,s))
21:       d_k(ℓ,s+1) ← u_k / max{‖u_k‖₂, 1}
22:       Set the k-th column of D(ℓ,s) to d_k(ℓ,s+1)
23:     End For
24:     s ← s + 1
25:   End While
26:   Set D(ℓ+1) = D(ℓ,s)
27:   ℓ ← ℓ + 1
28: End While
29: D(∞) ← D(ℓ) and Z(∞) ← Z(ℓ)

TABLE I: Algorithm for solving (13).

In this work, we focus on stacking the per-subject maps as rows in $\mathbf{X}$, which is then input to DL. Thus, the rows of $\mathbf{Z}$ correspond to the common and discriminative spatial component maps. A row of $\mathbf{D}$ contains the corresponding subject's weights associated with the component maps. A column of $\mathbf{D}$ can be regarded as the weights associated with the corresponding map, where each entry represents one subject's contribution. Since only the discriminative activation coefficients $\tilde{\mathbf{D}}$ would contribute to label prediction, $\tilde{\mathbf{D}}^\top$ is input to the Fisher's criterion. This is summarized in Fig. 1, where the subjects are shown to be partitioned into two classes, namely the healthy control (HC) group and the schizophrenic (SZ) group. The group labels serve as the supervision signals. Thus, our DL formulation for fMRI data analysis is finally given by

$$\min_{\mathbf{D} \in \mathcal{D},\, \mathbf{Z}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{Z}\|_F^2 + \lambda\|\mathbf{Z}\|_1 + \frac{\mu}{2} f(\tilde{\mathbf{D}}^\top). \tag{13}$$

III. ALGORITHM DERIVATION

The proposed formulation (13) is a nonconvex optimization problem, and thus it is difficult to obtain a globally optimal solution efficiently. On the other hand, it is observed that when $\mathbf{D}$ is fixed, the optimization with respect to (w.r.t.) $\mathbf{Z}$ is essentially a convex least absolute shrinkage and selection operator (LASSO) problem. Likewise, when $\mathbf{Z}$ is fixed, the problem for $\mathbf{D}$ is also convex, as $f$ is convex w.r.t. $\mathbf{D}$, the squared error term is convex, and the set $\mathcal{D}$ is convex as well. Thus, the block coordinate descent (BCD) method can be employed by alternating between the blocks $\mathbf{Z}$ and $\mathbf{D}$, which converges to a locally optimal solution.

The update for $\mathbf{Z}$ when $\mathbf{D}$ is fixed to its $\ell$-th iterate $\mathbf{D}^{(\ell)}$ is to solve

$$\mathbf{Z}^{(\ell+1)} := \arg\min_{\mathbf{Z}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}^{(\ell)}\mathbf{Z}\|_F^2 + \lambda\|\mathbf{Z}\|_1. \tag{14}$$

This problem is an instance of the well-known LASSO problem applied to each column of $\mathbf{X}$. There are a variety of methods to solve this problem. In this work, we adopt the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [48], which iteratively solves a proximal problem involving a linear approximation of the differentiable part of the objective. The next iterate is chosen based on the "optimal" first-order method. FISTA thus requires the gradient of the differentiable part $h(\mathbf{Z}; \mathbf{D}^{(\ell)}) := \frac{1}{2}\|\mathbf{X} - \mathbf{D}^{(\ell)}\mathbf{Z}\|_F^2$ and the Lipschitz constant of the gradient, which are given by

$$\frac{\partial h(\mathbf{Z}; \mathbf{D}^{(\ell)})}{\partial \mathbf{Z}} = -(\mathbf{D}^{(\ell)})^\top(\mathbf{X} - \mathbf{D}^{(\ell)}\mathbf{Z}) \tag{15}$$
$$L^{(\ell)} = \lambda_{\max}((\mathbf{D}^{(\ell)})^\top\mathbf{D}^{(\ell)}) \tag{16}$$

respectively, where $\lambda_{\max}(\mathbf{M})$ denotes the largest eigenvalue of a matrix $\mathbf{M}$. The FISTA update is described in lines 4–10 of Table I, where the $(i,j)$-entry of $S_\alpha(\mathbf{M})$ is given by soft-thresholding the $(i,j)$-entry $m_{ij}$ of $\mathbf{M}$, that is, $[S_\alpha(\mathbf{M})]_{ij} := \mathrm{sign}(m_{ij}) \cdot \max\{0, |m_{ij}| - \alpha\}$.
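The Z-update in lines 4–10 of Table I can be sketched as follows. This is illustrative code following the standard FISTA recursion of [48]; the function name, iteration count, and stopping rule are our own choices, not the authors' implementation:

```python
import numpy as np

def soft_threshold(M, alpha):
    # [S_alpha(M)]_ij = sign(m_ij) * max(0, |m_ij| - alpha)
    return np.sign(M) * np.maximum(np.abs(M) - alpha, 0.0)

def fista_lasso(X, D, lam, n_iter=200):
    """FISTA for the Z-update (14): min_Z 0.5*||X - D Z||_F^2 + lam*||Z||_1."""
    K, N = D.shape[1], X.shape[1]
    L = np.linalg.eigvalsh(D.T @ D).max()     # Lipschitz constant, cf. (16)
    Z = np.zeros((K, N))
    W, t = Z.copy(), 1.0                      # momentum point and step counter
    for _ in range(n_iter):
        G = -D.T @ (X - D @ W)                # gradient (15) at the momentum point
        Z_new = soft_threshold(W - G / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        W = Z_new + (t - 1) / t_new * (Z_new - Z)
        Z, t = Z_new, t_new
    return Z
```

Since the subproblem is convex, the iterates approach the global minimizer of (14) at the accelerated $O(1/i^2)$ rate.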

When $\mathbf{Z}$ is fixed at $\mathbf{Z}^{(\ell+1)}$, the update for $\mathbf{D}$ is done by solving [cf. (10)]

$$\mathbf{D}^{(\ell+1)} := \arg\min_{\mathbf{D} \in \mathcal{D}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{Z}^{(\ell+1)}\|_F^2 + \frac{\mu}{2}\mathrm{tr}\{\tilde{\mathbf{D}}^\top\mathbf{H}\tilde{\mathbf{D}}\}. \tag{17}$$

As the constraint $\mathbf{D} \in \mathcal{D}$ decouples across the individual columns $\{\mathbf{d}_k\}$ of $\mathbf{D}$, another layer of BCD can be employed to solve (17) [44]. First, upon defining $\mathbf{A} := \mathbf{Z}^{(\ell+1)}(\mathbf{Z}^{(\ell+1)})^\top$ and $\mathbf{B} := \mathbf{X}(\mathbf{Z}^{(\ell+1)})^\top$, it is noted that $\mathbf{D}^{(\ell+1)}$ in (17) can be obtained from the equivalent formulation given by

$$\mathbf{D}^{(\ell+1)} = \arg\min_{\mathbf{D} \in \mathcal{D}} \; \frac{1}{2}\mathrm{tr}\{\mathbf{A}\mathbf{D}^\top\mathbf{D}\} - \mathrm{tr}\{\mathbf{B}\mathbf{D}^\top\} + \frac{\mu}{2}\mathrm{tr}\{\mathbf{H}\tilde{\mathbf{D}}\tilde{\mathbf{D}}^\top\}. \tag{18}$$

Thus, in iteration $s$, updating the $k$-th column of $\mathbf{D}$ with all other columns fixed at $\mathbf{d}_{k'} = \mathbf{d}_{k'}^{(\ell,s)}$ for $k' = 1, \ldots, k-1, k+1, \ldots, K$ needs to be done differently depending on whether the $k$-th atom belongs to the common dictionary $\bar{\mathbf{D}}$ or the discriminative dictionary $\tilde{\mathbf{D}}$. That is, for $k = 1, 2, \ldots, \bar{K}$, the following steps are used to update $\mathbf{d}_k$:

$$\mathbf{u}_k = \frac{1}{a_{kk}}\Big(\mathbf{b}_k - \sum_{k' \neq k} a_{kk'}\mathbf{d}_{k'}^{(\ell,s)}\Big) \tag{19}$$
$$\mathbf{d}_k^{(\ell,s+1)} = \frac{1}{\max\{\|\mathbf{u}_k\|_2, 1\}}\mathbf{u}_k \tag{20}$$

where $a_{kk'}$ is the $(k, k')$-entry of $\mathbf{A}$, and $\mathbf{a}_k$ and $\mathbf{b}_k$ are the $k$-th columns of $\mathbf{A}$ and $\mathbf{B}$, respectively. Similarly, for $k = \bar{K}+1, \ldots, K$,

$$\mathbf{u}_k = (a_{kk}\mathbf{I} + \mu\mathbf{H})^{-1}\Big(\mathbf{b}_k - \sum_{k' \neq k} a_{kk'}\mathbf{d}_{k'}^{(\ell,s)}\Big) \tag{21}$$
$$\mathbf{d}_k^{(\ell,s+1)} = \frac{1}{\max\{\|\mathbf{u}_k\|_2, 1\}}\mathbf{u}_k. \tag{22}$$

The updates (19)–(22) are repeated over $s$ until convergence, and the converged $\mathbf{D}^{(\ell,s)}$ is the optimal solution to (18). The overall dictionary update steps correspond to lines 12–25 of Table I. The entire algorithm for solving (13) is provided in Table I.
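One sweep of the column-wise updates (19)–(22) can be sketched as follows (our own illustrative code, not the authors' implementation; `K_common` stands for the number of common atoms K̄, `H` is the Fisher matrix from (9), and the diagonal of A is assumed positive):

```python
import numpy as np

def update_dictionary(D, A, B, H, mu, K_common):
    """One BCD sweep over the columns of D, with A = Z Z^T and B = X Z^T.
    The first K_common columns are common atoms (19)-(20); the remaining
    columns are discriminative atoms regularized by mu*H (21)-(22)."""
    D = D.copy()
    M, K = D.shape
    for k in range(K):
        # rhs = b_k - sum_{k' != k} a_{kk'} d_{k'}  (A is symmetric)
        rhs = B[:, k] - D @ A[:, k] + A[k, k] * D[:, k]
        if k < K_common:
            u = rhs / A[k, k]                                   # (19)
        else:
            u = np.linalg.solve(A[k, k] * np.eye(M) + mu * H, rhs)  # (21)
        D[:, k] = u / max(np.linalg.norm(u), 1.0)               # (20)/(22)
    return D
```

Updating the columns sequentially (Gauss–Seidel style, using the freshly updated columns within the same sweep) mirrors lines 14–23 of Table I.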

A remark on the computational complexity of the algorithm is in order. For updating $\mathbf{Z}$ in lines 4–10, $(\mathbf{D}^{(\ell)})^\top\mathbf{X}$ and $(\mathbf{D}^{(\ell)})^\top\mathbf{D}^{(\ell)}$ can be computed outside the while loop. Thus, the dominant operation is the multiplication of $(\mathbf{D}^{(\ell)})^\top\mathbf{D}^{(\ell)}$ with $\mathbf{Z}^{(\ell,i)}$, which can be done in $O(K^2 N)$ operations. For updating $\mathbf{D}$, the most demanding operation is the inversion of an $M$-by-$M$ matrix in line 20, which requires $O(M^3)$ operations. Since this has to be done $\tilde{K}$ times, the computational cost per while iteration is $O(M^3 K)$ (assuming that $\bar{K}$ and $\tilde{K}$ are similar). The overall computational cost also depends on the number of iterations of the while loops. In the computational setup to be explained in Sec. IV, the average number of while iterations for the Z-update was around 200, and for the D-update around 30. Finally, the outer while loop typically took around 150 iterations. Executing the algorithm on an Intel Xeon processor with 8 cores at a 3.4 GHz clock speed typically took 5 to 30 minutes.

IV. EVALUATION

A. FMRI Data Set and Regression Analysis

The proposed method was evaluated using real fMRI data. We used task fMRI data sets published by the MIND Clinical Imaging Consortium (MCIC), which can be accessed publicly at https://coins.trendscenter.org. The data set contains 150 healthy control (HC) subjects and 121 patients with schizophrenia (SZ), collected while the subjects performed the auditory oddball (AOD) task. The subjects were presented with three types of auditory stimuli: the standard, novel, and target stimuli. The standard stimulus was a 1 kHz tone that occurred with probability 0.82. The novel stimulus was randomly generated, complex, and non-repeating digital noise presented with probability 0.09. The target stimulus was a 1.2 kHz tone that occurred with probability 0.09. The subjects were instructed to press a button with their right index fingers when they heard the target tones. For each run, overall 90 stimuli with a duration of 200 ms were played at random intervals. The stimulus sequences were designed to produce orthogonal BOLD responses [49]. Furthermore, the order of the novel and the target stimuli was shuffled between runs to make sure the responses do not depend on the stimulus order. A more detailed description of the data sets can be found in [50]. The preprocessing steps of the individual fMRI data, including motion correction, artifact removal, spatial normalization, smoothing, and subsampling, are explained in [38] and [50].

The regressors were then created by modeling the target and standard stimuli as the convolution of the delta functions capturing the stimulus onsets with the default SPM HRF, in addition to their temporal derivatives [51]. Finally, the voxel-wise time series were regressed in the GLM framework, and the contrast images between the target and the standard stimuli were obtained as the input matrix $X \in \mathbb{R}^{M \times N}$ for the proposed method, where $M = 271$ subjects and


Fig. 2: Averages of the rows of X for (a) the HC group and (b) the SZ group.

N = 48,546 voxels. More details on feature extraction can be found in [52]. The contrast images convey the difference in spatial activations associated with detecting the target tone and pressing the button as instructed, relative to the background state, which involves listening to the standard and novel tones, neither of which requires a response. The task involves a variety of cognitive processes as well as auditory and motor functions. Fig. 2 shows the averages of the rows of X for the HC and the SZ groups. While there are common activations across the groups, group differences are also clearly visible. Note that in order to increase the likelihood of obtaining components in the areas of interest, non-brain voxels were removed through masking, and the number of voxels was reduced to N.

B. Parameter Tuning

The model presented in Sec. II contains some model parameters that must be determined. These are the size $\bar{K}$ of the shared dictionary, the size $\tilde{K}$ of the discriminative dictionary, the parameter $\lambda$ adjusting the sparsity of $Z$, and the weight $\mu$ for the Fisher cost. We employed cross-validation to fix these parameters.

Fig. 3: The empirical CDFs of the maximum T-map values from the proposed algorithm, the unsupervised DL in [28], and the ICA method.

First, a training set and a validation set were created by randomly selecting subjects from the entire data set. As the numbers of HC and SZ subjects are not the same (150 and 121, respectively), a balanced training set was obtained by randomly choosing 85 HC subjects and 85 SZ subjects. Similarly, 36 HC and 36 SZ subjects were picked for the validation set. A total of 100 different training/validation sets were constructed in this way. Then, a grid search was performed over $\bar{K}$, $\tilde{K}$, $\lambda$, and $\mu$ to find the combination that yielded the best classification accuracy averaged over the 100 validation sets.
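The balanced resampling described above can be sketched as follows; this is a minimal illustration with our own function name, using the subject counts stated in the text.

```python
import numpy as np

def balanced_splits(n_hc=150, n_sz=121, n_train=85, n_val=36, n_sets=100, seed=0):
    """Draw balanced training/validation splits: 85 HC + 85 SZ for training,
    36 HC + 36 SZ for validation, repeated 100 times (cf. Sec. IV-B)."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_sets):
        hc = rng.permutation(n_hc)
        sz = rng.permutation(n_sz) + n_hc  # SZ subject indices follow HC indices
        train = np.concatenate([hc[:n_train], sz[:n_train]])
        val = np.concatenate([hc[n_train:n_train + n_val],
                              sz[n_train:n_train + n_val]])
        splits.append((train, val))
    return splits
```

Because each split draws the validation subjects from the remainder of the same permutation, training and validation sets are disjoint by construction.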

The classification was done as follows. Based on the spatial maps $Z$ obtained from the training set, the weights associated with the corresponding maps $D = [\bar{D}, \tilde{D}]$ were computed by least-squares (LS) regression for the validation set. Then, for each subject $m$, the $m$-th row of $D$ was input to the 1-nearest-neighbor classifier. The best average classification accuracy achieved was 68%, with parameters $\bar{K} = 12$, $\tilde{K} = 26$, $\lambda = 0.003$, and $\mu = 0.2$. This accuracy is slightly higher than the roughly 66% reported in a prior study using the same data set [52], although, as in [52], we did not try to optimize the classifier itself.
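The LS regression and 1-nearest-neighbor step can be sketched as below; function names are ours, and the rows of the weight matrices play the role of the per-subject rows of $D$.

```python
import numpy as np

def ls_weights(X, Z):
    """Least-squares weights solving min_D ||X - D Z||_F^2 for fixed maps Z."""
    return X @ np.linalg.pinv(Z)

def one_nn_accuracy(D_train, y_train, D_val, y_val):
    """Classify each validation subject's weight row by its nearest training row."""
    d2 = ((D_val[:, None, :] - D_train[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(y_train[np.argmin(d2, axis=1)] == y_val))
```

For fixed maps $Z$, the LS solution is simply $X Z^{+}$, so only the classifier itself needs the training labels.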

C. Obtaining Stable Maps

As was discussed in Sec. II-D, (13) is a nonconvex optimization problem, and thus only locally optimal solutions can be obtained in practice. Therefore, the algorithm in Table I would yield different solutions, and thus different spatial maps, depending on the initialization. Consequently, it is prudent to use the spatial maps that are more consistently obtained across multiple runs using random initializations, in order to improve the reliability of the conclusions drawn from the analysis.

In this work, the framework proposed in [53] was adapted for our setting. First, it is noted that permuting the columns of $D$ and the rows of $Z$ using the same permutation does not alter the objective function value. Thus, this permutation


(a) Proposed method

20 40 60 80 100

Run index

5

10

15

20

25

30

35

Com

po

nent

ind

ex

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(b) Unsupervised DL

(c) ICA

Fig. 4: Pearson correlation between the component mapsfrom each run and the T -maps.

ambiguity must be resolved in order to compare any two sets of maps. This can be done by solving a linear assignment problem (LAP), which finds the bipartite matching that minimizes the total edge weight of a weighted bipartite graph [54]. Specifically, suppose that one obtained $K$ spatial maps from each of the $R$ runs, and denote the $k$-th map from the $r$-th run as $z_k^{(r)}$, where $k = 1, 2, \ldots, K$ and $r = 1, 2, \ldots, R$. The weight between the $k$-th map $z_k^{(r)}$ from run $r$ and the $k'$-th map $z_{k'}^{(r')}$ from run $r'$ is defined as

$$w^{(r,r')}(k, k') := 1 - \frac{\big|(z_k^{(r)})^\top z_{k'}^{(r')}\big|}{\|z_k^{(r)}\|_2 \, \|z_{k'}^{(r')}\|_2}.$$

The LAP is to find a bijection $f : \{1, \ldots, K\} \to \{1, \ldots, K\}$ that assigns each map $k$ in run $r$ to map $k' = f(k)$ in run $r'$ such that $\sum_{k=1}^{K} w^{(r,r')}(k, f(k))$ is minimized. Let us denote the resulting minimum cost as $w^{(r,r')*}$. The problem can be solved using the Hungarian algorithm [55].
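This matching step can be sketched with SciPy's LAP solver; the function name is ours, and the cost definition is the one stated above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_maps(Z_r, Z_rp):
    """Align the maps of run r' to those of run r by solving the LAP with
    weights w(k, k') = 1 - |z_k^T z_{k'}| / (||z_k|| ||z_{k'}||).
    Z_r, Z_rp: (K, N) arrays, one map per row. Returns (assignment, cost)."""
    Zr = Z_r / np.linalg.norm(Z_r, axis=1, keepdims=True)
    Zp = Z_rp / np.linalg.norm(Z_rp, axis=1, keepdims=True)
    W = 1.0 - np.abs(Zr @ Zp.T)          # (K, K) cost matrix
    row, col = linear_sum_assignment(W)  # Hungarian-type LAP solver
    return col, float(W[row, col].sum())
```

The absolute value in the weight makes the matching insensitive to the sign ambiguity of the maps.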

Then, a weighted graph is constructed where the $R$ runs constitute the nodes, and the edge weight between nodes $r$ and $r'$ is $w^{(r,r')*}$. A minimum spanning tree (MST) is found on this graph, which is defined as the subgraph connecting all nodes of the graph with minimum total weight [56]. Then, the node with the maximum number of edges is selected as the central node. If multiple nodes have the same maximum number of edges, the one with the minimum total edge weight is chosen. The central node is essentially the run that yielded the spatial maps most highly correlated with those of the other runs. Subsequently, the maps in all runs are reordered so that they are aligned with the maps in the central node, to mitigate the permutation ambiguity.
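A sketch of the central-node selection, assuming the pairwise LAP costs have been collected into a symmetric matrix; the function name is ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def central_run(W):
    """Pick the central run from an (R, R) symmetric matrix of pairwise
    LAP costs w^{(r,r')*}: build the MST, choose the node with the largest
    degree, and break ties by the smallest total incident edge weight."""
    T = minimum_spanning_tree(W).toarray()
    T = T + T.T                     # symmetrize the MST adjacency
    deg = (T > 0).sum(axis=1)       # node degrees in the MST
    cand = np.flatnonzero(deg == deg.max())
    return int(cand[np.argmin(T[cand].sum(axis=1))])
```

Note that `minimum_spanning_tree` treats zero entries as absent edges, which is consistent with a cost matrix having a zero diagonal and positive off-diagonal costs.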

After the alignment, the one-sample t-test statistic is computed for each voxel by regarding the $R$ runs as $R$ samples. The resulting T-map captures each voxel's stability in each map. Then, the run $r^*$ whose spatial maps have the highest correlation with this T-map is found. The maps from run $r^*$ are our most stable maps.
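These two steps can be sketched as below, under the assumption that the aligned maps are stacked into a single `(R, K, N)` array; names are ours.

```python
import numpy as np

def stability_tmap(Z_runs):
    """One-sample t-statistic per voxel over R aligned runs.
    Z_runs: (R, K, N). Returns the (K, N) T-map."""
    R = Z_runs.shape[0]
    mean = Z_runs.mean(axis=0)
    std = Z_runs.std(axis=0, ddof=1)
    return mean / (std / np.sqrt(R))

def most_stable_run(Z_runs):
    """Index of the run whose maps correlate best with the T-map,
    averaging the per-component Pearson correlations."""
    T = stability_tmap(Z_runs)
    def rowcorr(A, B):
        A = A - A.mean(axis=1, keepdims=True)
        B = B - B.mean(axis=1, keepdims=True)
        return (A * B).sum(axis=1) / (
            np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    scores = [rowcorr(Z, T).mean() for Z in Z_runs]
    return int(np.argmax(scores))
```

A run corrupted by large initialization-dependent noise will correlate poorly with the T-map and is thus avoided by this selection.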

We applied the aforementioned strategy for analyzing the data set $X$ containing 121 HC and 121 SZ subjects. This data set $X$ is the concatenation of the training and the validation sets that yielded the best classification accuracy during the parameter tuning stage explained in Sec. IV-B. We ran the algorithm in Table I with $R = 100$ random initializations of $D^{(0)}$ and $Z^{(0)}$. For comparison, we also ran the unsupervised DL method proposed in [28] as well as the entropy bound minimization (EBM)-based ICA algorithm in [57] with $R$ random initializations, and applied the same strategy.

To compare the stability of the proposed algorithm with the unsupervised DL method and the ICA algorithm, the maximum value of the T-map (maximized over the voxels) was computed for each map. Although the spatial distribution of the T-values is of course not uniform, it is unlikely that only a few voxels are estimated stably while the others are not. Thus, the maximum T-value for each component is a reasonable indicator of how stable the entire component is. In Fig. 3, the empirical cumulative distribution functions (CDFs) of the resulting maximum T-values are plotted. The number of components was set to $K = \bar{K} + \tilde{K} = 38$ for all the methods. It can be seen that a given percentile of the


Fig. 5: The discriminative maps $\tilde{Z}$ from DL and the matching ICA maps, with per-component p-values; the p-values lower than 0.05 are highlighted in yellow. The ICA maps in the red boxes were found to be split into multiple DL maps (see Fig. 7). It can be seen that the DL maps are not only readily interpretable, but also more focal and group-different with higher significance than the ICA counterparts in many cases.

Fig. 6: The common maps $\bar{Z}$ from DL with matching ICA maps. The DL maps are again cleaner than the ICA maps. Some common maps turn out to be group-different, which may be due to the local optima of the proposed optimization formulation.

maximum T-value is higher in the proposed method than in the unsupervised DL or the ICA algorithms, which means that the proposed method provides maps that are more stable. In particular, it is noted that our method yields maps that are more stable than those from unsupervised DL, indicating that exploiting the label information helps stability.
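The quantities plotted in Fig. 3 can be computed as follows; this is a minimal sketch with our own function names.

```python
import numpy as np

def max_t_values(T):
    """Maximum T-value per component map; T is the (K, N) stability T-map."""
    return T.max(axis=1)

def empirical_cdf(values):
    """Sorted values and the corresponding CDF levels, as plotted in Fig. 3."""
    x = np.sort(np.asarray(values, dtype=float))
    return x, np.arange(1, x.size + 1) / x.size
```

A CDF that rises later (further to the right) indicates larger maximum T-values, i.e., more stable components.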

Fig. 4 shows the actual correlation values of the maps obtained from the individual runs with the T-maps. Fig. 4a corresponds to the proposed method, Fig. 4b to the unsupervised DL, and Fig. 4c to the ICA. It is clearly seen that the proposed method tends to yield more consistent maps than the unsupervised DL and ICA counterparts. Thus, the maps obtained from the proposed method are sufficiently stable for further analysis.

Remark: It should be noted that the ICA and DL methods are inherently different, and with a simpler but less flexible ICA algorithm such as Infomax [58], one would obtain more stable components, but at the expense of a limited approximation to the density of the underlying sources. When more flexible and powerful algorithms like EBM are used, a common practice is to use a scheme for selecting the most consistent run among multiple runs [31], [53].


Fig. 7: The ICA maps that split into multiple DL maps. The red boxes indicate the matching components between DL and ICA, found by solving the LAP. In the blue dashed boxes, group-different ICA components are seen to be split into multiple DL components that include ones that are not group-different, highlighting that the DL method finds more localized discriminative regions without losing common features.

D. Analysis of Obtained Spatial Maps

The common and discriminative brain activation maps obtained in Sec. IV-C are analyzed in this section. It is recalled that the $k$-th column $d_k$ of the learned dictionary $D$ can be interpreted as the weights associated with the $k$-th component map, which is the $k$-th row $z_k$ of $Z$. Thus, the group difference in the activation can be revealed by performing a two-sample t-test on $d_k$, whose test statistic is defined as

$$t_k = \frac{\mu_{\mathrm{HC},k} - \mu_{\mathrm{SZ},k}}{\sqrt{\dfrac{\sigma_{\mathrm{HC},k}^2}{M_{\mathrm{HC}}} + \dfrac{\sigma_{\mathrm{SZ},k}^2}{M_{\mathrm{SZ}}}}} \quad (23)$$

where $M_{\mathrm{HC}}$ and $M_{\mathrm{SZ}}$ are the numbers of the HC and SZ subjects, respectively, which are both equal to 121 in the data set used, and $\mu_{\mathrm{HC},k}$ and $\mu_{\mathrm{SZ},k}$ are the sample means of the entries of $d_k$ corresponding to the HC subjects and the SZ subjects, respectively. The sample variances $\sigma_{\mathrm{HC},k}^2$ and $\sigma_{\mathrm{SZ},k}^2$ are defined in the same way. To determine components with significance, we have used a p-value of 0.05 for the statistic (uncorrected).
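The statistic in (23) is the unequal-variance (Welch) two-sample t-statistic, so it can be computed with SciPy as sketched below; the function name is ours.

```python
import numpy as np
from scipy import stats

def group_difference(d_k, is_hc):
    """Two-sample t-test on the weights d_k, HC vs. SZ entries.
    Welch's unequal-variance form matches the statistic in (23)."""
    t, p = stats.ttest_ind(d_k[is_hc], d_k[~is_hc], equal_var=False)
    return float(t), float(p)
```

With `equal_var=False`, `ttest_ind` divides the mean difference by the square root of the sum of the per-group variance-over-count terms, exactly as in (23).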

In Figs. 5–8, the learned spatial maps are plotted. All the spatial map plots in this work represent thresholded Z-maps, where the Z-values are calculated per voxel by dividing the entries in $z_k$ by the standard deviation of the entries, and then only the Z-values whose absolute values are larger than 2 are plotted. Furthermore, as there is a sign ambiguity between a component map $z_k$ and its activations $d_k$, the sign is fixed such that $t_k$ in (23) is nonnegative. That is, all the component maps shown have higher activations in the HC group than in the SZ group. The red colors in the maps represent positive values in the Z-map, with bright yellow indicating the highest intensity, and the blue colors represent negative values, with bright sky blue indicating the highest intensity.
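The thresholding and sign convention can be sketched as below; this is a minimal illustration under our own function name.

```python
import numpy as np

def threshold_zmap(z_k, t_k, thresh=2.0):
    """Convert a component map to Z-values (divide by the map's standard
    deviation), flip the sign so that t_k in (23) is nonnegative, and keep
    only the entries with |Z| > thresh (the activations d_k would be
    flipped by the same sign)."""
    sign = 1.0 if t_k >= 0 else -1.0
    z = sign * z_k / z_k.std()
    return np.where(np.abs(z) > thresh, z, 0.0)
```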

To assess the performance of the proposed approach, we compared the maps with those obtained from ICA, another data-driven method, and one that is by now well established for such studies¹. We reduced the dimensionality prior to the ICA analysis using an information-theoretic order selection method based on MDL, which also takes sample

¹The shared and specific independent component analysis (SSICA) algorithm proposed in [59] can be used to find the shared and the group-specific components based on structural constraints on the mixing matrix. However, the algorithm finds a separate set of components for each group, whereas our method finds one set for the discriminative components of all groups. Thus, matching and comparing the resulting maps with our DL component maps would entail additional steps. Furthermore, SSICA is originally designed for the first-level analysis, and thus would require additional investigation to adapt it to the second-level analysis done in the present work. For these reasons, the well-established ICA method is adopted for comparison with our DL method.


Fig. 8: DL maps without matching ICA maps: DL#15 (p = 0.72), DL#25 (p = 0.059), and DL#7 (p = 0.054).

Fig. 9: Samples of the maps from the sparse SPM (SSPM#5, SSPM#15) and the corresponding maps from the proposed DL method (DL#33, DL#18, DL#29). Note that DL#18 was not identified from the sparse SPM.

dependence into account [60]. The order for ICA was determined to be 25, thus enabling better generalization performance for ICA. That is, the ICA maps showed the best statistical significance and interpretability around order 25, while deviating from this order resulted in maps with significantly degraded p-values and interpretability. Then, stable maps were determined as explained in Sec. IV-C. For this, the rectangular LAP was solved to match the $K = 38$ component maps from the DL with the 25 maps from the ICA. In addition, the component maps were also further checked by visual inspection to make sure the results were reliable.

Figs. 5 and 6 show the maps learned from DL and ICA that are found matching both from the LAP and from visual inspection. Fig. 5 lists the discriminative maps $\tilde{Z}$ from the DL and the matching ICA maps, whereas Fig. 6 depicts the common maps $\bar{Z}$ and those matching from ICA. The maps in the top row of each figure are from the DL and those in the bottom row from the ICA. The map index $k$ is indicated on top of each map. The indices $k = 1, 2, \ldots, \bar{K} = 12$ correspond to common components, and $k = \bar{K}+1 = 13, \ldots, \bar{K}+\tilde{K} = 38$ to discriminative components. The p-values are provided at the bottom of the maps. When a map is significant (p < 0.05), the p-value is highlighted in yellow. It can be seen that DL identifies 25 component maps that are in good

match with the ICA-based maps, which validates the map estimation performance of the DL algorithm². Furthermore, it is observed from Fig. 5 that many of the maps in $\tilde{Z}$ indeed show significant group difference. In fact, the p-values of the discriminative maps turn out to be smaller than the p-values of the corresponding ICA maps in most cases, which verifies that the supervised DL formulation is working as designed. In particular, DL#24 and DL#22, which can be ascribed to motor and sensory motor functions, respectively, and thus are strongly related to the AOD task, show more significant group difference than their ICA counterparts ICA#9 and ICA#11. Furthermore, the DL maps #16 and #29 are interesting, since these maps are group-different while the matching ICA maps #20 and #4 are not. The map DL#29 shows significant activations in the inferior and middle temporal gyri, which are known to be associated with language and semantic memory processing, visual and facial perception, and multimodal sensory integration. On the other hand, the map DL#16 shows activations in the brainstem area, directly related to controlling respiration, pain modulation, motor function, and cardiac output [61]. These highlight that the proposed DL finds more discriminative maps by incorporating the label information. Many maps in Figs. 5 and 6 are readily interpretable, such as motor (DL#23, DL#24), sensory motor (DL#22), auditory (DL#27), anterior default mode network (DMN) (DL#36), posterior DMN (DL#31), and frontal parietal regions (DL#35). It can be observed that the DL-based maps are often much more localized and cleaner than the ICA counterparts. This is because the DL is formulated to estimate sparse component maps.

As the DL aims at finding sparse maps, we observe splitting of components into multiple sparse maps [62] when compared with the maps estimated using ICA. This is observed in Fig. 7, which depicts some of the ICA maps containing activation regions explained by multiple DL maps. The red boxes in Fig. 7 indicate the matchings found by the LAP, which are also shown in Figs. 5–6. In particular, it is observed that the ICA maps in the blue boxes are group-different and are split into a set of DL maps that includes maps that are not group-different. For example, the component DL#12 shows activation in the frontal parietal region associated with the attention network, while the matching ICA map #7 has activations in other areas that are neither significant nor related to the task. This underlines that the DL is finding more localized regions that are group-different, without losing the common regions, based on our DL formulation. Similar trends can be found in other components shown in Fig. 7. DL map #37 shows activations in the anterior cingulate cortex with a lower p-value, while ICA#19 has activations in insular areas as well. DL map #20 is related to parietal areas and shows very significant activations compared to ICA#15.

²As was explained, the colors in the maps capture the signs of the map values. Since we fixed the sign ambiguity for each map such that the two-sample t-test statistic is positive, the color may be misleading when the t-statistic is not significant. Since the p-values for the three DL maps #31, #32, and #38 are quite high, we decided to neglect the signs for these maps and match them with the ICA maps #12, #16, and #18, respectively, in spite of the color mismatch.


Fig. 10: Bar plot of the p-values of all DL maps, for the common and the discriminative components, with the p = 0.05 level marked.

Fig. 8 shows the DL maps that are not matched with any of the ICA maps. These are the maps that were neither assigned by solving the rectangular LAP nor associated with the split maps in Fig. 7. Although the p-values for components DL#25 and DL#7 are not lower than the hard threshold 0.05, they are quite close to it, potentially contributing meaningfully to discrimination. Furthermore, they correlate well with known functional areas, such as the visual (DL#25) and the insular (DL#7) areas. One component (DL#15) is picking up the white matter area, presumably owing to noise and artifacts present in the data. Note that there are no ICA maps left unmatched with DL maps.

Remark: Note that even though our proposed formulation encourages the discriminative maps to be collected separately from the common maps, the formulation is nonconvex and hence prone to local optima. Thus, even after our effort to ensure stability, the found common maps may turn out to be discriminative, and the found discriminative maps common. However, the general tendency is clearly toward the intended separation. From Fig. 10, it can be seen that most of the p-values for the common maps tend to be much larger than the 0.05 threshold, while for the discriminative maps, the p-values are often quite small. In fact, the five smallest p-values are all associated with the discriminative maps. Some summary statistics derived from the p-values convey the point more clearly. The geometric mean of the p-values of the common maps is computed to be 0.091, whereas that of the discriminative maps is equal to 0.017, indicating that the discriminative maps tend to be more significant in detecting the group differences. Similarly, the geometric median (i.e., $\exp(\mathrm{med}\{\log p_i\})$) for the common maps is equal to 0.11, while that for the discriminative maps is 0.070, again indicating that the discriminative maps are more significantly group-different.
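The two p-value summaries used above can be computed as follows; a minimal sketch with our own function names.

```python
import numpy as np

def geometric_mean(p):
    """exp(mean(log p)) of a set of p-values."""
    return float(np.exp(np.mean(np.log(p))))

def geometric_median(p):
    """exp(med{log p_i}), the geometric-median summary used in the text."""
    return float(np.exp(np.median(np.log(p))))
```

Working in the log domain keeps very small p-values from being drowned out by the larger ones, which is why these summaries are preferred over plain averages here.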

We also compared the spatial maps from our DL method with the maps obtained from the sparse SPM algorithm in [28], which is an unsupervised DL method. The p-values of the maps were again computed using two-sample t-tests. The comparison between the two sets of maps reveals that most of the maps from the two methods match very well. However, it is also observed that some of the maps identified by our proposed DL method show group differences, whereas the corresponding components from the sparse SPM do not. This can be ascribed to the use of the label information in our method. The maps are shown in Fig. 9. Note that DL#18 was not identified by the sparse SPM, but it actually exhibits a highly significant group difference. On average, the geometric mean of the p-values of the maps from the sparse SPM turns out to be 0.036, whereas that for the maps due to our method is 0.028. (If only the discriminative maps $\tilde{Z}$ are considered, the number goes down to 0.017.) In summary, our method faithfully extracts the components of the sparse SPM, but with enhanced group difference in the discriminative maps.

In the conference paper [40], we also performed a comparison with the method in [39], which is a discriminative DL method developed for image classification. Specifically, formulation (P1) in [40] contains a low-rank constraint on the common dictionary, and the sparse coefficients corresponding to the discriminative dictionary are used for classification. It was observed that the resulting spatial maps showed less similarity to the ICA maps than our proposed method. It was also found that the maps from our method (equivalently, formulation (P2) in [40]) were often more interpretable and showed more significant group differences. The detailed estimated maps and discussion can be found in [40].

In conclusion, the proposed algorithm can find many component maps that can be readily cross-validated with ICA. By exploiting sparsity, the maps from DL are cleaner and more localized. Through the incorporation of the label information during the learning process, our method tends to discover more meaningful activations, and in particular those that provide more specific areas with greater sensitivity in differentiating the patients from the healthy controls.

E. Correlation Analysis with Behavioral Variables

The spatial component maps obtained from the DL analysis can be further validated and interpreted by using the behavioral test score data that we have. A total of 105 behavioral variables (BVs) were available for 193 of the 271 subjects. Out of all the BVs, 7 were found to be significantly group-different based on a two-sample t-test. Table II lists the 7 BVs along with their p-values.

Then, we computed the Pearson correlation between the BVs and the weight vector $d_k$ associated with the $k$-th component map learned from the DL analysis (using only the entries for the 193 subjects for whom the BVs are available). Table III lists all the component maps that are significantly correlated with each of the BVs, that is, the components that yield p-values less than 0.05 from one-sample t-tests on the Pearson correlations.
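This screening can be sketched as below; the function name is ours, and the columns of `D` play the role of the weight vectors $d_k$.

```python
import numpy as np
from scipy import stats

def correlated_maps(D, bv, alpha=0.05):
    """For one behavioral variable, return (component index, r, p) for every
    column d_k of D whose Pearson correlation with the BV is significant,
    sorted by descending |correlation|."""
    hits = []
    for k in range(D.shape[1]):
        r, p = stats.pearsonr(D[:, k], bv)
        if p < alpha:
            hits.append((k, float(r), float(p)))
    return sorted(hits, key=lambda h: -abs(h[1]))
```

`pearsonr` already returns the two-sided p-value of the test that the correlation is zero, so no separate t-test code is needed.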

The maps listed in Table III are sorted in the order of descending correlation. It can be observed that the correlated


BV index | Behavioral test                   | p-value
#9       | BVRT                              | 5.6E-13
#23      | FAS                               | 3.3E-12
#32      | WMS-3: Logical Memory I           | 0.0E+00
#37      | TMT                               | 1.1E-05
#53      | WMS-3: Logical Memory II - Delay  | 0.0E+00
#88      | WAIS-3                            | 7.8E-06
#94      | HVLT: Immediate                   | 0.0E+00

TABLE II: Behavioral tests and corresponding p-values.

BV index | DL spatial map index
#9       | DL#16 [p=2E-6], DL#37, DL#22, DL#20, DL#26, DL#25, DL#10, DL#19, DL#32, DL#14 [p=2E-2]
#23      | DL#26 [7E-4], DL#16, DL#37, DL#23 [2E-2]
#32      | DL#16 [3E-4], DL#22, DL#20, DL#10, DL#25, DL#24, DL#19, DL#6, DL#23, DL#12, DL#37, DL#26 [4E-2]
#37      | DL#16 [3E-5], DL#10, DL#22, DL#21, DL#20, DL#14 [4E-2]
#53      | DL#22 [4E-4], DL#16, DL#20, DL#25, DL#6, DL#23, DL#12, DL#18, DL#10, DL#24, DL#26 [4E-2]
#88      | DL#20 [4E-5], DL#16, DL#19, DL#26, DL#33 [3E-2]
#94      | DL#16 [2E-4], DL#20, DL#26, DL#23, DL#25, DL#5, DL#37, DL#22, DL#19, DL#12, DL#10 [4E-2]

TABLE III: BVs and correlated spatial maps with correlation p-value lower than 0.05. The ranges of the p-values are indicated for each BV.

maps are mostly from the discriminative maps $\tilde{Z}$ (i.e., maps with index $k \geq 13$ in Table III), which verifies again that the obtained discriminative maps $\tilde{Z}$ are indeed extracting discriminative features, and the common maps $\bar{Z}$ are indeed capturing shared features. In fact, most of the BVs show significant correlations with the DL maps DL#16 (brainstem), DL#22 (sensory motor and motor), DL#37 (anterior cingulate cortex), and DL#26 (associated with logical condition and item recognition). Since the listed BVs are mainly associated with working and visual memory (BV#9, BV#32, BV#53) and verbal and working memory (BV#88, BV#94), finding highly correlated discriminative neural activations in sensory, anterior cingulate cortex, and logical condition and item recognition related areas is justified and in line with many previous studies [63]–[67]. For instance, DL#22, which shows activations in motor and sensory motor regions related to planning, monitoring, decision making, and execution of motor activity, shows high associations with almost all the BVs. DL#37 is a map for the anterior cingulate cortex, known to be associated with attention, decision making, emotion, performance monitoring, and error detection. It is also worth noting that DL#25, which is identified only by the DL method (see Fig. 8), is highly correlated with BVs #9, #32, #53, and #94, further illustrating the advantage of the proposed discriminative DL method.

V. CONCLUSION

A novel supervised DL method has been proposed for multi-subject fMRI data analysis to extract the brain activation maps that are common across different groups of subjects as well as the maps that are discriminative for predicting group labels. The dictionary and the corresponding sparse matrix have been structured with common and individual submatrices for this purpose, and the labels were incorporated using Fisher's discriminant criterion. Given that spatial functional activations are typically sparse and localized, the sparsity constraint has been imposed on the spatial map factor, and the corresponding weights of different subjects' contributions were used for classification. An optimization algorithm to solve the proposed formulation was derived using alternating minimization based on convex subproblems. The stability of the resulting algorithm was compared to that of the unsupervised DL method, and it was found that our algorithm yielded more stable maps when the same number of components was extracted. The estimated brain maps were also compared carefully with the maps from the ICA. It was observed that the DL formulation not only reproduced most of the ICA components but also extracted components that are more discriminative, including some novel maps that were not discovered by the ICA approach. A correlation analysis with separate behavioral test scores available for the same set of subjects further verified the validity and usefulness of the DL analysis.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Vince Calhoun for providing the feature dataset, and Qunfang Long and Suchita Bhinge for their helpful feedback on the analysis results.

REFERENCES

[1] G. D. Jackson, A. Connelly, J. H. Cross, I. Gordon, and D. G. Gadian, "Functional magnetic resonance imaging of focal seizures," Neurology, vol. 44, no. 5, pp. 850–850, 1994.

[2] J. H. Callicott, M. F. Egan, V. S. Mattay, A. Bertolino, A. D. Bone, B. Verchinksi, and D. R. Weinberger, "Abnormal fMRI response of the dorsolateral prefrontal cortex in cognitively intact siblings of patients with schizophrenia," Am. J. Psychiatry, vol. 160, no. 4, pp. 709–719, 2003.

[3] B. Luna and J. A. Sweeney, "The emergence of collaborative brain function: FMRI studies of the development of response inhibition," Ann. New York Acad. Sci., vol. 1021, no. 1, pp. 296–309, 2004.

[4] J. Taghia, W. Cai, S. Ryali, J. Kochalka, J. Nicholas, T. Chen, and V. Menon, "Uncovering hidden brain state dynamics that regulate performance and decision-making during cognition," Nat. Comm., vol. 9, no. 1, pp. 2505–2523, 2018.

[5] A. A. Phillips, F. H. Chan, M. M. Z. Zheng, A. V. Krassioukov, and P. N. Ainslie, "Neurovascular coupling in humans: Physiology, methodological advances and clinical implications," J. Cerebral Blood Flow & Metabolism, vol. 36, no. 4, pp. 647–664, 2016.

[6] R. J. Huster, S. Debener, T. Eichele, and C. S. Herrmann, "Methods for simultaneous EEG-fMRI: An introductory review," J. Neurosci., vol. 32, no. 18, pp. 6053–6060, 2012.


[7] E. Acar, C. Schenker, Y. Levin-Schwartz, V. D. Calhoun, andT. Adali, “Unraveling diagnostic biomarkers of schizophrenia throughstructure-revealing fusion of multi-modal neuroimaging data,” Front.Neurosci., vol. 13, pp. 416–431, 2019.

[8] J. Xu, M. N. Potenza, and V. D. Calhoun, “Spatial ICA reveals func-tional activity hidden from traditional fMRI GLM-based analyses,”Front. Neurosci., vol. 7, pp. 154–157, 2013.

[9] V. D. Calhoun, T. Adali, L. K. Hansen, J. Larsen, and J. J. Pekar,“ICA of functional MRI data: An overview,” in Proc. Int. WorkshopIndependent Component Analysis and Blind Signal Separation, 2003,pp. 281–288.

[10] S. M. Smith, “Overview of fMRI analysis,” British J. Radiology, vol. 77, no. suppl 2, pp. S167–S175, 2004.

[11] V. D. Calhoun and T. Adali, “Multisubject independent component analysis of fMRI: A decade of intrinsic networks, default mode, and neurodiagnostic discovery,” IEEE Rev. Biomed. Eng., vol. 5, pp. 60–73, 2012.

[12] V. D. Calhoun, T. Adali, G. D. Pearlson, and J. J. Pekar, “A method for making group inferences from functional MRI data using independent component analysis,” Hum. Brain Mapp., vol. 14, no. 3, pp. 140–151, 2001.

[13] T. Adali, M. Anderson, and G. Fu, “Diversity in independent component and vector analyses: Identifiability, algorithms, and applications in medical imaging,” IEEE Signal Process. Mag., vol. 31, no. 3, pp. 18–33, 2014.

[14] T. Kim, I. Lee, and T. Lee, “Independent vector analysis: Definition and algorithms,” in Proc. Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, May 2006, pp. 1393–1396.

[15] A. M. Michael, M. Anderson, R. L. Miller, T. Adalı, and V. D. Calhoun, “Preserving subject variability in group fMRI analysis: Performance evaluation of GICA vs. IVA,” Front. Syst. Neurosci., vol. 8, pp. 106–123, 2014.

[16] S. M. Plis, D. R. Hjelm, R. Salakhutdinov, E. A. Allen, H. J. Bockholt, J. D. Long, H. J. Johnson, J. S. Paulsen, J. A. Turner, and V. D. Calhoun, “Deep learning for neuroimaging: A validation study,” Front. Neurosci., vol. 8, pp. 229–239, 2014.

[17] S. Sarraf and G. Tofighi, “Deep learning-based pipeline to recognize Alzheimer’s disease using fMRI data,” in Proc. Future Tech. Conf., San Francisco, CA, Dec. 2016, pp. 816–820.

[18] X. Li, N. C. Dvornek, J. Zhuang, P. Ventola, and J. S. Duncan, “Brain biomarker interpretation in ASD using deep learning and fMRI,” in Proc. Int. Conf. Med. Imag. Comput. Computer-Assisted Intervention, Granada, Spain, Sep. 2018, pp. 206–214.

[19] S. Koyamada, Y. Shikauchi, K. Nakae, M. Koyama, and S. Ishii, “Deep learning of fMRI big data: A novel approach to subject-transfer decoding,” arXiv preprint arXiv:1502.00093, 2015.

[20] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee, and T. J. Sejnowski, “Dictionary learning algorithms for sparse representation,” Neurocomputing, vol. 15, no. 2, pp. 349–396, 2003.

[21] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neurocomputing, vol. 12, no. 2, pp. 337–365, 2000.

[22] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, 2006.

[23] W. Dong, X. Li, L. Zhang, and G. Shi, “Sparsity-based image denoising via dictionary learning and structural clustering,” in Proc. Conf. Comput. Vis. Pattern Recogn., Providence, RI, Jun. 2011, pp. 457–464.

[24] L. Shao, R. Yan, X. Li, and Y. Liu, “From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms,” IEEE Trans. Cybern., vol. 44, no. 7, pp. 1001–1013, 2013.

[25] H. Hu, B. Wohlberg, and R. Chartrand, “Task-driven dictionary learning for inpainting,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process., Florence, Italy, May 2014, pp. 3543–3547.

[26] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,” in Proc. Conf. Comput. Vis. Pattern Recogn., San Francisco, CA, Jun. 2010, pp. 2691–2698.

[27] Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent K-SVD: Learning a discriminative dictionary for recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2651–2664, 2013.

[28] K. Lee, S. Tak, and J. C. Ye, “A data-driven sparse GLM for fMRI analysis using sparse dictionary learning with MDL criterion,” IEEE Trans. Med. Imaging, vol. 30, no. 5, pp. 1076–1089, May 2011.

[29] J. Lv, X. Jiang, X. Li, D. Zhu, H. Chen, T. Zhang, S. Zhang, X. Hu, J. Han, H. Huang, et al., “Sparse representation of whole-brain fMRI signals for identification of functional networks,” Med. Imag. Anal., vol. 20, no. 1, pp. 112–134, 2015.

[30] G. Varoquaux, A. Gramfort, F. Pedregosa, V. Michel, and B. Thirion, “Multi-subject dictionary learning to segment an atlas of brain spontaneous activity,” in Proc. Biennial Int. Conf. Inf. Process. Med. Imag., Pacific Grove, CA, Jun.–Jul. 2011, pp. 562–573.

[31] Q. Long, S. Bhinge, Y. Levin-Schwartz, Z. Boukouvalas, V. D. Calhoun, and T. Adalı, “The role of diversity in data-driven analysis of multi-subject fMRI data: Comparison of approaches based on independence and sparsity using global performance metrics,” Hum. Brain Mapp., vol. 40, no. 2, pp. 489–504, 2019.

[32] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 791–804, Apr. 2012.

[33] M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in Proc. Int. Conf. Computer Vis., Nov. 2011, pp. 543–550.

[34] A. Iqbal, A.-K. Seghouane, and T. Adali, “Shared and subject-specific dictionary learning (ShSSDL) algorithm for multisubject fMRI data analysis,” IEEE Trans. Biomed. Eng., vol. 65, no. 11, pp. 2519–2528, Nov. 2018.

[35] S. Ghosal, Q. Chen, A. L. Goldman, W. Ulrich, K. F. Berman, D. R. Weinberger, V. S. Mattay, and A. Venkataraman, “A generative-predictive framework to capture altered brain activity in fMRI and its association with genetic risk: Application to schizophrenia,” in Proc. SPIE Med. Imag.: Image Process., San Diego, CA, Jun. 2019, vol. 10949, pp. 1–11.

[36] C. F. Beckmann, M. Jenkinson, and S. M. Smith, “General multilevel linear modeling for group analysis in FMRI,” Neuroimage, vol. 20, no. 2, pp. 1052–1063, 2003.

[37] V. D. Calhoun and T. Adalı, “Feature-based fusion of medical imaging data,” IEEE Trans. Info. Technology in Biomedicine, vol. 13, no. 5, pp. 711–720, 2008.

[38] V. D. Calhoun, T. Adalı, K. A. Kiehl, R. Astur, J. J. Pekar, and G. D. Pearlson, “A method for multitask fMRI data fusion applied to schizophrenia,” Hum. Brain Mapp., vol. 27, no. 7, pp. 598–610, 2006.

[39] T. H. Vu and V. Monga, “Fast low-rank shared dictionary learning for image classification,” IEEE Trans. Image Process., vol. 26, no. 11, pp. 5160–5175, 2017.

[40] K. Dontaraju, S.-J. Kim, M. Akhonda, and T. Adali, “Capturing common and individual components in fMRI data by discriminative dictionary learning,” in Proc. Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, Oct. 2018, pp. 1351–1356.

[41] M. J. McKeown, T.-P. Jung, S. Makeig, G. Brown, S. K. Kindermann, T.-W. Lee, and T. J. Sejnowski, “Spatially independent activity patterns in functional MRI data during the Stroop color-naming task,” Proc. Natl. Acad. Sci., vol. 95, pp. 803–810, Feb. 1998.

[42] V. D. Calhoun, P. K. Maciejewski, G. D. Pearlson, and K. A. Kiehl, “Temporal lobe and default hemodynamic brain modes discriminate between schizophrenia and bipolar disorder,” Hum. Brain Mapp., vol. 29, no. 11, pp. 1265–1275, 2008.

[43] I. Tosic and P. Frossard, “Dictionary learning: What is the right representation for my signal?,” IEEE Signal Process. Mag., vol. 28, pp. 27–38, 2011.

[44] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learning Res., vol. 11, pp. 19–60, Jan. 2010.

[45] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.

[46] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons, 2012.

[47] S. Li and Y. Fu, “Learning robust and discriminative subspace with low-rank constraints,” IEEE Trans. Neural Net. Learn. Syst., vol. 27, no. 11, pp. 2160–2173, 2015.

[48] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.

[49] V. P. Clark, S. Fannon, S. Lai, R. Benson, and L. Bauer, “Responses to rare visual target and distractor stimuli using event-related fMRI,” J. Neurophysiology, vol. 83, no. 5, pp. 3133–3139, 2000.

[50] R. L. Gollub, J. M. Shoemaker, M. D. King, T. White, S. Ehrlich, S. R. Sponheim, V. P. Clark, J. A. Turner, B. A. Mueller, V. Magnotta, et al., “The MCIC collection: A shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia,” Neuroinform., vol. 11, no. 3, pp. 367–388, 2013.

[51] A. M. Michael, S. A. Baum, J. F. Fries, B. Ho, R. K. Pierson, N. C. Andreasen, and V. D. Calhoun, “A method to fuse fMRI tasks through spatial correlations: Applied to schizophrenia,” Hum. Brain Mapp., vol. 30, no. 8, pp. 2512–2529, 2009.

[52] Y. Levin-Schwartz, V. D. Calhoun, and T. Adalı, “Quantifying the interaction and contribution of multiple datasets in fusion: Application to the detection of schizophrenia,” IEEE Trans. Med. Imaging, vol. 36, no. 7, pp. 1385–1395, 2017.

[53] W. Du, Y. Levin-Schwartz, G. Fu, S. Ma, V. D. Calhoun, and T. Adalı, “The role of diversity in complex ICA algorithms for fMRI analysis,” J. Neurosci. Methods, vol. 264, pp. 129–135, 2016.

[54] T. C. Koopmans and M. Beckmann, “Assignment problems and the location of economic activities,” Econometrica, pp. 53–76, 1957.

[55] H. W. Kuhn, “The Hungarian method for the assignment problem,” Nav. Res. Log. Qtly., vol. 2, no. 1-2, pp. 83–97, 1955.

[56] J. B. Kruskal, “On the shortest spanning subtree of a graph and the traveling salesman problem,” Proc. Amer. Math. Soc., vol. 7, no. 1, pp. 48–50, 1956.

[57] X. Li and T. Adali, “Independent component analysis by entropy bound minimization,” IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5151–5164, 2010.

[58] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Comput., vol. 7, no. 6, pp. 1129–1159, Nov. 1995.

[59] M. Maneshi, S. Vahdat, J. Gotman, and C. Grova, “Validation of shared and specific independent component analysis (SSICA) for between-group comparisons in fMRI,” Front. Neurosci., vol. 10, pp. 417–435, 2016.

[60] G. Fu, M. Anderson, and T. Adalı, “Likelihood estimators for dependent samples and their application to order detection,” IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4237–4244, 2014.

[61] J. Brooks, O. Faull, K. Pattinson, and M. Jenkinson, “Physiological noise in brainstem fMRI,” Front. Human Neurosci., vol. 7, pp. 1–13, 2013.

[62] Y. Kopsinis, H. Georgiou, and S. Theodoridis, “FMRI unmixing via properly adjusted dictionary learning,” in Proc. Eur. Signal Process. Conf., Lisbon, Portugal, Sep. 2014, pp. 2075–2079.

[63] J. Vannest, J. P. Szaflarski, M. D. Privitera, B. K. Schefft, and S. K. Holland, “Medial temporal fMRI activation reflects memory lateralization and memory performance in patients with epilepsy,” Epilepsy & Behavior, vol. 12, no. 3, pp. 410–418, 2008.

[64] S. J. Astley, E. H. Aylward, H. C. Olson, K. Kerns, A. Brooks, T. E. Coggins, J. Davies, S. Dorn, B. Gendler, T. Jirikowic, et al., “Functional magnetic resonance imaging outcomes from a comprehensive magnetic resonance study of children with fetal alcohol spectrum disorders,” J. Neurodevelopmental Disorders, vol. 1, no. 1, pp. 61–80, 2009.

[65] B. D. Bell, “WMS-III logical memory performance after a two-week delay in temporal lobe epilepsy and control groups,” J. Clinical Experimental Neuropsychology, vol. 28, no. 8, pp. 1435–1443, 2006.

[66] J. L. Cuzzocreo, M. A. Yassa, G. Verduzco, N. A. Honeycutt, D. J. Scott, and S. S. Bassett, “Effect of handedness on fMRI activation in the medial temporal lobe during an auditory verbal memory task,” Hum. Brain Mapp., vol. 30, no. 4, pp. 1271–1278, 2009.

[67] R. Massuda, J. Bucker, L. S. Czepielewski, J. C. Narvaez, M. Pedrini, B. T. Santos, A. S. Teixeira, A. L. Souza, M. P. Vasconcelos-Moreno, M. Vianna-Sulzbach, et al., “Verbal memory impairment in healthy siblings of patients with schizophrenia,” Schizophrenia Res., vol. 150, no. 2-3, pp. 580–582, 2013.