
A New Approach to Identify Functional Modules

Using Random Matrix Theory

Mengxia Zhu, Computer Science Dept, Southern Illinois University, Carbondale, IL 62901, mengxia@cs.siu.edu

Qishi Wu, Computer Science Dept, University of Memphis, Memphis, TN 38152, qishiwu@memphis.edu

Yunfeng Yang, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, yangy@ornl.gov

Jizhong Zhou, Botany & Microbiology Dept, University of Oklahoma, Norman, OK 73019, jzhou@ou.edu

Abstract- The advance in high-throughput genomic technologies including microarrays has generated a tremendous amount of gene expression data for the entire genome. Deciphering transcriptional networks that convey information on members of gene clusters and cluster interactions is a crucial analysis task in the post-sequence era. Most of the existing analysis methods for large-scale genome-wide gene expression profiles involve several steps that often require human intervention. We propose a random matrix theory-based approach to analyze the cross correlations of gene expression data in an entirely automatic and objective manner, eliminating the ambiguities and subjectivity inherent to human decisions. The correlations calculated from experimental measurements typically contain both "genuine" and "random" components. In the proposed approach, we remove the "random" component by testing the statistics of the eigenvalues of the correlation matrix against a "null hypothesis": a truly random correlation matrix obtained from mutually uncorrelated expression data series. Our investigation of the components of deviating eigenvectors using varimax orthogonal rotation reveals distinct functional modules. We apply the proposed approach to the publicly available yeast cell cycle expression data and produce a transcriptional network that consists of interacting functional modules. The experimental results conform well to those reported in previously published literature.

Keywords: Random Matrix, Microarray, Pearson correlation, Eigenvalue, Eigenvector, Varimax orthogonal rotation.

I. INTRODUCTION

The exponential growth of genomic sequence data starting in the early 1980s has spurred the development of computational tools for DNA sequence similarity searches, structural predictions, and functional predictions. The emergence of high-throughput genomic technologies in the late 1990s has enabled the analysis of higher-order cellular processes based on genome-wide expression profiles such as oligonucleotide or cDNA microarrays. Genes can now be affiliated by their co-regulated expression waveforms in addition to sequence similarity and proximity on the chromosome as in gene content analysis. Genes ascribed to the same cluster are usually responsible for a specific physiological process or belong to the same molecular complex. Such transcriptome (mRNA) datasets deliver new knowledge and insights beyond the existing genome (gene) datasets, and can be used to guide proteome (protein) and interactome research that aims to extract key biological features such as protein-protein interactions and subcellular localizations more accurately and efficiently.

However, organizing genome-wide gene expression data into meaningful functional modules remains a great challenge. Many computational techniques have been proposed to conjecture the cellular network based on microarray hybridization data. Examples include Boolean network methods, differential equation-based network methods, Bayesian network methods, hierarchical clustering, K-means clustering, self-organizing maps (SOM), and correlation-based association network methods.

The Boolean network method [7], [3] is a coarse simplification of the gene network that determines each gene state as either 0 or 1 from the inputs of many other genes. The differential equation-based approach [4] models gene networks as a set of non-linear differential equations that can describe the rate of change in gene expression without assuming discrete time steps. A Bayesian network gives a graphical display of the dependence structure based on conditional probabilities among genes. In hierarchical clustering, a dendrogram is constructed by iteratively grouping together genes with the highest correlation, which is essentially a greedy algorithm that achieves only local optimality and disregards negative association [10]. K-means clustering [8] serves as an improved alternative to hierarchical clustering but requires a subjective specification of the number of clusters. SOM [11] is a neural network-based iterative clustering method and also requires the user to estimate the initial cluster number. The correlation-based association network technique has been commonly adopted to identify cellular networks due to its computational simplicity and the nature of microarray data (typically noisy, highly dimensional, and significantly undersampled). However, the association network method relies on arbitrarily assigned thresholds for link cutoff, which inevitably introduces subjectivity in the network structure and topology. A technology that can determine the structure of transcriptional networks and uncover biological regularities in a computerized and unbiased way has therefore been under active study by biological scientists.

We propose and develop a system to construct and analyze various aspects of transcriptional networks based on random matrix theory (RMT). The correlation matrix for the yeast genome demands a significant amount of computing cycles to calculate all eigenvalues and eigenvectors. High-performance computing resources such as supercomputers and Linux clusters as well as


parallel programming techniques should be utilized to address this problem. We aim to tackle computing problems with tens or hundreds of thousands of genes within a short period of computing time.

The rest of the paper is organized as follows: the mathematical model and data preparation are discussed in Section II. In Section III, we present the statistics of the correlation matrix. A discussion of deviating eigenstates based on random matrix theory is given in Section IV. Genes are clustered and functional modules are identified in Section V. Experimental results on yeast cell cycle data are presented in Section VI to demonstrate the effectiveness of our method. We conclude our work in Section VII.

II. PROBLEM FORMULATION

We define the expression signal of gene i = 1, ..., N in various samples s = 1, ..., K as:

W_i(s) = \ln\left( \frac{E_i^s(s)}{E_i^c(s)} \right),    (1)

where E_i^s(s) denotes the expression signal of sample s for gene i, and E_i^c(s) is the corresponding control signal. Due to the various levels of expression signal shown by different genes, we normalize the data as:

w_i(s) = \frac{W_i(s) - \langle W_i \rangle}{\sigma_i},    (2)

where \sigma_i = \sqrt{\langle W_i^2 \rangle - \langle W_i \rangle^2} represents the standard deviation of W_i, and \langle W_i \rangle stands for the average over different samples for gene i. From this normalized N x K data matrix M, we calculate the cross-correlation matrix C according to

C = \frac{1}{K} M M^T.    (3)

The Pearson correlation coefficient C_{xy} between genes x and y, each with K data points, can also be calculated from Eq. 4:

C_{xy} = \frac{\sum_{i=1}^{K} (x_i - \bar{x})(y_i - \bar{y})}{(K-1) s_x s_y},    (4)

where s_x and s_y denote the standard deviations. The Pearson correlation ranges from 1 for perfect correlation to -1 for perfect anti-correlation. When C_{ij} = 0, no correlation exists between genes i and j.
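As a concrete illustration of Eqs. (1)-(3), the following sketch computes the normalized expression matrix and the gene-gene correlation matrix. It is a minimal NumPy version for clarity only; the array names expr and ctrl are illustrative, and the authors' own implementation is in C++ (Section VI).

import numpy as np

def correlation_matrix(expr, ctrl):
    """expr, ctrl: N x K arrays of sample and control signals for N genes in K samples."""
    W = np.log(expr / ctrl)                            # W_i(s), Eq. (1)
    mean = W.mean(axis=1, keepdims=True)               # <W_i> over samples
    std = W.std(axis=1, keepdims=True)                 # sigma_i = sqrt(<W_i^2> - <W_i>^2)
    M = (W - mean) / std                               # normalized N x K matrix, Eq. (2)
    C = (M @ M.T) / W.shape[1]                         # C = (1/K) M M^T, Eq. (3)
    return M, C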

However, conducting a direct study on these empirical cross-correlation coefficients is rather difficult due to the unique properties of microarray experiments. First, the cross-correlation between any pair of genes may not be constant: such co-regulation can fluctuate over time or under different sample conditions. Second, the limited number of samples on which a microarray experiment is typically conducted may introduce significant "measurement noise" that compromises the accuracy of the underlying correlations.

In order to filter out the randomness contained in the empirical cross-correlation matrix, we test the eigenstates of this correlation matrix against those of a controlled counterpart, a truly random correlation matrix generated by a computer random number generator. Statistical properties that conform to the truly random matrix are labeled as noise contributions; on the other hand, any deviating eigenstates are treated as genuine correlations, which will be amplified and analyzed for transcriptional network construction.

III. STATISTICS OF CORRELATION MATRIX

A. Distribution of correlation coefficients

We contrast the distribution P(C_{ij}) for the cross-correlation matrix C with P(R_{ij}), where R denotes a random correlation matrix constructed from a series of mutually uncorrelated data with zero mean and unit variance generated by a computer. Fig. 1(a) shows that P(R_{ij}) follows a Gaussian distribution with zero mean, which indicates complete randomness within the data. However, P(C_{ij}) as shown in Fig. 1(b) is asymmetric and centered around a positive value in contrast to P(R_{ij}), which implies that positive correlation is more pronounced than negative correlation among genes.
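A minimal sketch of how such a null-model matrix R can be generated, assuming the same N x K shape as the expression data; the function name and seed handling are illustrative only.

import numpy as np

def random_correlation_matrix(N, K, seed=0):
    rng = np.random.default_rng(seed)
    M_rand = rng.standard_normal((N, K))               # mutually uncorrelated, zero mean, unit variance
    M_rand = (M_rand - M_rand.mean(axis=1, keepdims=True)) / M_rand.std(axis=1, keepdims=True)
    return (M_rand @ M_rand.T) / K                     # R, built exactly as C in Eq. (3)

The off-diagonal entries of R then serve as the reference distribution of Fig. 1(a).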

B. Distribution of eigenvalues

We further compare the probability distributions P_C(\lambda) and P_R(\lambda) of the eigenvalues \lambda_i calculated from the cross-correlation matrix C and the random matrix R, respectively. Eigenvalues are arranged in decreasing order such that \lambda_i \geq \lambda_{i+1}. The probability distributions P_C(\lambda) and P_R(\lambda) are plotted in Fig. 2(a) and Fig. 2(b). It has been observed that a set of the eigenvalues of C fall within the well-defined range [\lambda_-, \lambda_+] calculated from R, with a few deviating from the upper and lower bounds and conveying the true correlation information. This observation enables us to separate the real correlation from the randomness. Such a denoising process is necessary since microarray data are extremely undersampled and may contain significant measurement noise. Interestingly, Kwapien et al. [6] found that increasing the length of the time series or the number of samples causes eigenvalues to deviate more from the random matrix eigenvalue bounds. They concluded that the bulk of the correlation matrix is not pure noise as conventionally thought. Based on their results, it is possible that more subtle and less prevalent co-regulated gene groups could be squeezed out of the noise segment if we were able to acquire a larger sample size K. However, experimental results are still needed to validate this assumption. In practice, a large sample size K, however desirable from a mathematical point of view, is not always feasible for most biological datasets due to the considerable time and material resources involved in bio-related experiments.
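The denoising step described above can be sketched as follows: the bounds \lambda_- and \lambda_+ are taken directly from the eigenvalue spectrum of the random matrix R, and eigenstates of C falling outside this range are retained as genuine. This is an illustrative sketch, not the authors' code.

import numpy as np

def deviating_eigenstates(C, R):
    evals_C, evecs_C = np.linalg.eigh(C)               # eigenvalues of C in ascending order
    evals_R = np.linalg.eigvalsh(R)
    lam_minus, lam_plus = evals_R.min(), evals_R.max() # [lambda_-, lambda_+] from the null model
    keep = (evals_C < lam_minus) | (evals_C > lam_plus)
    return evals_C[keep], evecs_C[:, keep], (lam_minus, lam_plus)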

C. Distribution of nearest-neighbor eigenvalues

The comparison made above between P_C(\lambda) and P_R(\lambda) alone is not sufficient to show that the majority of the eigenvalue spectrum of C is random. In general, matrices with the same eigenvalue distribution may have different eigenvalue correlations, and vice versa [9]. Hence, we also need to examine the correlations in the eigenvalues of C to determine whether they conform to those of a random matrix.


Fig. 1. Comparison of correlation coefficient distributions: (a) distribution of random correlation coefficients; (b) distribution of gene expression correlation coefficients.

Fig. 2. Comparison of eigenvalue distributions (x axis: eigenvalues; y axis: probability densities): (a) distribution of eigenvalues of the gene expression correlation matrix; (b) distribution of eigenvalues of the random system.

RMT makes two universal predictions for real symmetric matrices: the nearest-neighbor spacing distribution (NNSD) of eigenvalues follows Gaussian orthogonal ensemble (GOE) statistics as in Eq. 5 if there exist correlations between nearest-neighbor eigenvalues, and follows a Poisson distribution if there is no correlation. In order to ensure a uniform average value of the eigenvalue spacing throughout the spectrum, we map the eigenvalues to new unfolded eigenvalues whose distribution is uniform. The unfolding procedure guarantees that eigenvalue spacing is expressed in units of the local mean eigenvalue spacing. To realize this, one can replace \lambda_i by the unfolded spectrum \lambda_i' = N_{av}(\lambda_i), where N_{av}(\lambda_i) is the smoothed integrated density of eigenvalues obtained by fitting the original integrated density to a cubic spline or by local density averaging. We compute the nearest-neighbor spacing distribution P(n), where n = \lambda_{i+1}' - \lambda_i'. We know that P(n) for a random matrix follows the Wigner-Dyson distribution. Our experiments show that the NNSD P(n) for C conforms well with P_{GOE}(n):

P_{GOE}(n) = \frac{\pi n}{2} \exp\left( -\frac{\pi n^2}{4} \right).    (5)

These results support the assumption that the majority of the eigenvalues are random in nature, both from the perspective of the eigenvalue distribution and of the eigenvalue spacing distribution. Thus, random matrix theory serves as an ideal mathematical tool to investigate microarray datasets, which typically contain a significant amount of noise and error.
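The unfolding and NNSD computation can be sketched as below. The smoothed integrated density is approximated here by a low-order polynomial fit rather than the cubic spline mentioned in the text, purely to keep the example self-contained; the Wigner surmise of Eq. (5) is the reference curve.

import numpy as np

def unfolded_spacings(eigenvalues, degree=3):
    lam = np.sort(eigenvalues)
    integrated = np.arange(1, lam.size + 1)            # empirical integrated density of eigenvalues
    coeffs = np.polyfit(lam, integrated, degree)       # smoothed integrated density (polynomial stand-in)
    unfolded = np.polyval(coeffs, lam)                 # unfolded eigenvalues lambda'_i
    return np.diff(unfolded)                           # spacings n = lambda'_{i+1} - lambda'_i

def wigner_goe(n):
    return (np.pi * n / 2.0) * np.exp(-np.pi * n**2 / 4.0)   # P_GOE(n), Eq. (5)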

IV. DEVIATING EIGENVALUES AND EIGENVECTORS

A. Deviating eigenvalue

We consider the set of eigenvalues that deviate from the eigenvalue range of the random matrix as genuine correlation. The amount of variance contributed by each eigenvector (factor) is reflected by the proportion of its eigenvalue over the sum of all eigenvalues, based on principal component analysis (PCA). In other words, the principal factors are responsible for the majority of the variation within the data. Thus, only large eigenvalues, usually greater than 1, and their corresponding eigenvectors are retained for further treatment and gene group analysis. The remaining eigenstates contain either insignificant or noisy information. We can see from Fig. 3, a plot of variance versus eigenvalue, that a large proportion of the variation is picked up by the first several large eigenvalues.

Fig. 3. Variance explained by eigenvalues, sorted in descending order.
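A small sketch of this retention step, assuming the eigenvalues and eigenvectors of C have already been computed; the threshold of 1 follows the rule of thumb stated above.

import numpy as np

def retained_factors(evals, evecs, threshold=1.0):
    order = np.argsort(evals)[::-1]                    # sort eigenvalues in descending order
    evals, evecs = evals[order], evecs[:, order]
    explained = evals / evals.sum()                    # variance fraction lambda_i / sum_j lambda_j
    keep = evals > threshold                           # retain only large eigenvalues
    return evals[keep], evecs[:, keep], explained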

B. Deviating eigenvector components

The deviating eigenvalues naturally lead us to the investigation of their corresponding deviating eigenvectors. There are N eigenvectors u^i in total, i = 1, ..., N. Each eigenvector u^i has N components corresponding to the N gene variables. All eigenvectors are perpendicular (orthogonal) to each other and are normalized to length 1. The probability distributions of the eigenvector components for different eigenvalues are plotted and compared against that of a random matrix, which follows a Gaussian distribution with zero mean and unit variance:

\rho(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right).    (6)

The probability distribution of the components of an eigenvector whose eigenvalue \lambda_k lies in the bulk \lambda_- < \lambda_k < \lambda_+ shows good agreement with the Gaussian distribution, as indicated by the lower right graph in Fig. 4. The deviating eigenvector components show a significant departure from the Gaussian distribution, as shown by the upper and lower left graphs in Fig. 4. It has also been observed that the distribution curve gradually approaches the shape of a Gaussian distribution as the eigenstates approach the characteristic region represented by a random matrix.
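As an illustrative check, the component histogram of a single eigenvector can be compared with the density of Eq. (6). Since the eigenvectors here are normalized to length 1, their components are rescaled by \sqrt{N} so that a bulk eigenvector has approximately unit-variance components; this rescaling convention is an assumption of the sketch.

import numpy as np

def component_histogram(u, bins=50):
    u = u * np.sqrt(u.size)                            # rescale unit-norm eigenvector to unit-variance components (assumed convention)
    hist, edges = np.histogram(u, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    gauss = np.exp(-centers**2 / 2.0) / np.sqrt(2.0 * np.pi)   # rho(u), Eq. (6)
    return centers, hist, gauss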

V. FUNCTIONAL MODULES IDENTIFICATION

A. Collective behavior from the largest eigenvalue

The observation also shows that if the majority of gene expression correlations are co-regulated, the components of the eigenvector corresponding to the largest eigenvalue, with contributions from almost all genes, have the same sign, as shown in Fig. 5. Such an eigenvector component distribution is commonly found within a specific gene cluster, where most of the genes are co-regulated with a few being anti-co-regulated.

Fig. 5. Distribution of eigenvector component for u1.

It further supports the existence of a significant number of housekeeping genes, which are constitutively expressed to carry out basic cell functions needed for the sustenance of the cell. Similar phenomena are also observed in financial stock data, where they can be interpreted as a common influence on all stocks by certain stimuli such as news of an interest rate increase [9]. We quantify this collective behavior of the entire genome by an eigensignal computed as the scalar product of the sample series with the first eigenvector u^1:

Z^1(s) = \sum_{i=1}^{N} u_i^1 W_i(s).    (7)
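A one-line sketch of Eq. (7), assuming W is the N x K matrix of gene profiles and u1 the first eigenvector:

import numpy as np

def eigensignal(u1, W):
    """Return Z^1(s) for every sample s: the projection of each sample's profile onto u^1."""
    return u1 @ W                                      # Z^1(s) = sum_i u^1_i W_i(s)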

When this common influence effect is considered, we have:

W_i(s) = \alpha_i + \beta_i Z^1(s) + \epsilon_i(s),    (8)

where Z^1(s) is common to all genes, and \alpha_i and \beta_i are gene-specific constants that can be estimated by least squares regression. The largest eigenvalue is usually an order of magnitude larger than the rest of the eigenvalues. Such a strong eigensignal can significantly suppress the effect of the other eigenvalues because \sum_{i=1}^{N} \lambda_i = Tr(C) = N. We want to remove the effect of \lambda_1 in order to augment the impact of the remaining eigenvalues for easy and reliable study. From Eq. 8, we calculate the residuals \epsilon_i(s) and use them as the matrix elements to construct a new correlation matrix C'. The eigenstates of C' are then analyzed to build transcriptional networks capable of revealing subtle gene clusters that might have been masked by the largest eigenvalue \lambda_1.
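A sketch of this residual construction, assuming the normalization of Eq. (2) is re-applied to the residuals before forming C'; variable names follow Eq. (8), but the exact regression details used by the authors may differ.

import numpy as np

def residual_correlation(W, z1):
    """W: N x K matrix of gene profiles; z1: length-K eigensignal Z^1(s)."""
    K = W.shape[1]
    X = np.column_stack([np.ones(K), z1])              # design matrix for alpha_i + beta_i * Z^1(s)
    coef, *_ = np.linalg.lstsq(X, W.T, rcond=None)     # least-squares estimates of (alpha_i, beta_i)
    eps = W - (X @ coef).T                             # residuals epsilon_i(s), Eq. (8)
    eps = (eps - eps.mean(axis=1, keepdims=True)) / eps.std(axis=1, keepdims=True)
    return (eps @ eps.T) / K                           # new correlation matrix C'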

B. Loading factor and orthogonal rotation

After acquiring a set of normalized eigenvectors, we transform the eigenvector components into loading factors by multiplying the vector components by the square root of the corresponding eigenvalue. Each eigenvector represents one factor leading to one gene cluster. A larger loading factor indicates that the corresponding gene "loads" more on that eigenvector, i.e., that gene is more expression-dominating for that cluster.

Fig. 4. Distribution of eigenvector components for u^1, u^2, and u^80.

To simplify the eigenvector structure and make the interpretation of gene clusters easier and more reliable, we apply orthogonal rotation to the retained eigenvectors. Since the rotation is performed in the subspace of the entire set of eigenstates, the total variation explained by the newly rotated factors is always less than that of the original space, but the variation within the subspace remains the same before and after rotation; only the partition of the variation changes.

VARIMAX [5] is a simple and popular rotation method that transforms the principal data axes such that each eigenvector contains a small number of large loadings and a large number of zero or small loadings. Biologically speaking, each gene then tends to load heavily on only one or a few gene clusters. Thus, gene clusters consist of a reduced number of genes compared with the pre-rotation results. The rationale behind VARIMAX is that a rotation (linear combination) of the original factors is sought that maximizes the variance of the factor loadings. A rotation matrix R specifying such a rotation can be written as:

R = \begin{bmatrix} \cos\theta_{1,1} & \cos\theta_{1,2} \\ \cos\theta_{2,1} & \cos\theta_{2,2} \end{bmatrix},

where \theta_{i,j} is the rotation angle from old axis i to new axis j. The graphical representation of a 2D orthogonal rotation is illustrated in Fig. 6, with dotted lines representing the new axes.
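A hedged sketch of the loading-factor construction and a VARIMAX rotation. This uses the standard iterative SVD formulation of the varimax criterion, which is not necessarily the exact implementation the authors used; names and defaults are illustrative.

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """loadings: N x k matrix whose column j is sqrt(lambda_j) * u^j."""
    n, k = loadings.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # SVD of the gradient of the varimax criterion with respect to the rotation R
        u, s, vt = np.linalg.svd(loadings.T @ (L**3 - (gamma / n) * L @ np.diag((L**2).sum(axis=0))))
        R = u @ vt
        if s.sum() - var_old < tol:
            break
        var_old = s.sum()
    return loadings @ R, R                             # rotated loadings and the rotation matrix R

# loadings = evecs_kept * np.sqrt(evals_kept)          # loading factors before rotation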

C. Stability of gene clusters in samples

The stability of gene clustering based on our eigenstate analysis can be evaluated in terms of the variance of the total expression signal Z^i for eigenvector i among different samples and time series. The variances are directly associated with the corresponding eigenvalues, one of the most important properties of eigensignals [6], as shown in Eq. 10. A gene cluster derived from an eigenvector with a larger eigenvalue is more unstable than a gene cluster associated with a smaller eigenvalue. Note that the variance levels indicate the consistency of gene membership across different samples.

Z^i(s) = \sum_{k=1}^{N} u_k^i W_k(s),    (9)

Stab(u^i) = Var(Z^i) = (u^i)^T C u^i = \lambda_i, where i = 1, ..., N.    (10)
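The last equality in Eq. (10) follows directly from the eigenvalue equation, assuming the normalized profiles of Eq. (2) are used so that \langle W_k W_l \rangle = C_{kl}:

Var(Z^i) = \sum_{k,l} u_k^i u_l^i \langle W_k W_l \rangle = (u^i)^T C u^i = \lambda_i (u^i)^T u^i = \lambda_i,

since C u^i = \lambda_i u^i and (u^i)^T u^i = 1.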

VI. EXPERIMENTAL DATA ANALYSIS

The program in this work is implemented in C++ and currently runs on a single workstation1.

The components of a deviating eigenvector with large values are identified as gene members belonging to a specific functional module involved in a similar cellular pathway.

1We are now in the process of transitioning our system from a workstation to a supercomputer or Linux cluster running ScaLAPACK [1] for parallel eigenstate computation.


Fig. 9. Some functional modules and their gene members.

Fig. 6. A 2D orthogonal rotation.

Fig. 7. Yeast transcriptional network.

Fig. 8. Subgroups within one group.

Fig. 10. Sub-functional modules and their gene members.

The yeast cell cycle data used in this paper is one of the best-known microarray datasets and has been extensively evaluated; therefore, the structure of the network is quite well understood. By applying the RMT method to this dataset, we have demonstrated that our results are consistent with available biological knowledge, which allows us to conclude that we have identified functional modules. The entire yeast genome is partitioned into a large number of functional modules sharing similar expression patterns, based on publicly available microarray data downloaded from the yeast cell cycle project at [2].

Fig. 7 shows 17 distinct modules such as protein biogenesis, DNA replication and repair, energy metabolism, protein degradation, heat shock proteins, TCA cycle, protein folding, allantoin metabolism, and histones. Different edge colors represent different ranges of the correlation value between two genes (vertices). It can be visually observed that correlations within groups (red or orange links) are much higher than correlations between different groups (blue or green links), which strongly indicates the effectiveness of our clustering approach. For groups with a large number of genes, we recursively apply the same method to identify subgroups within the large groups. Fig. 8 shows the refined submodular


results for the first group in blue with 230 genes in Fig. 7. Twomajor submodules are identified as glycolysis and cell cycle.

VII. CONCLUSIONS

High-throughput genomic technologies such as microarrays have provided gene expression data at the transcription level. Their unprecedented power to study the expression of thousands of genes simultaneously can potentially be used to unveil the topology and functions of transcriptional networks. In this paper, we explored random matrix theory and orthogonal rotation techniques to dissect transcriptional networks and identify various functional modules.

Luo et al. [?] also proposed a random matrix theory-based

approach to infer transcriptional networks from microarray data. However, their analysis is mainly focused on eigenvalues. In addition, their method requires more computation cycles to calculate eigenvalues for many different correlation matrices; in our approach, we only need to compute the eigenstates of one correlation matrix.

Most previous clustering methods partition members into non-overlapping groups. In our method, however, one gene is allowed to belong to multiple groups, which is a legitimate assumption from the biological perspective since a single gene may be involved in different pathways. Transitively co-regulated genes, which are not directly correlated but which are both correlated with the same gene, can also be detected and grouped. Our method is computationally efficient, objective without human intervention, and robust to high levels of noise. Functions of unknown genes can be conjectured and explored through their associated functional modules.

Since our computational analysis is based solely on a single microarray dataset, we only obtain a rough structure of the functional modules. If genes in the same functional module do not show significant correlation in their expression patterns, we will not be able to identify them using the RMT method. It is likely that genes in the same functional module show significant correlation under one condition but not under another (for example, the module of heat shock proteins is rarely identified in other yeast microarray datasets). By consolidating results from multiple microarray datasets, we could improve the integrity of the functional modules. The authors will work in this direction in the future.

In general, we think that "all the subjective factors induced by humans built into the microarray data itself" can be divided into two categories: systematic subjective factors (e.g., overlooking low-density signal spots when the signal/background threshold is set high, which impacts every slide of the whole dataset) and random subjective factors. The RMT method, like any existing clustering method (e.g., hierarchical clustering, K-means, SOM, etc.), is unable to deal with systematic subjective factors. On the other hand, the RMT method is capable of removing random subjective noise, which normally leads to low correlation between genes.

It is of future interest to apply this method to human genome data with roughly 30k genes. A highly parallel implementation of our algorithm is needed to address such large-scale biological applications. Our code can easily migrate to supercomputers or cluster machines to utilize high-performance computing resources. Advanced visualization techniques will also be introduced at a later stage to aid data comprehension and inspire discoveries.

REFERENCES

[1] ScaLAPACK. http://www.netlib.org/scalapack/scalapack-home.html.
[2] Yeast cell cycle project. http://genome-www.stanford.edu/cellcycle.
[3] T. Akutsu, S. Miyano, and S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symp. Biocomp, 4:17-28, 1999.
[4] T. Chen, H. L. He, and G. M. Church. Modeling gene expression with differential equations. Pacific Symp. Biocomp, 4:29-40, 1999.
[5] H. F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23:187-200, 1958.
[6] J. Kwapien, P. Oswiecimka, and S. Drozdz. The bulk of the stock market correlation matrix is not pure noise. Physica A, 359:589-606, 2006.
[7] S. Liang, S. Fuhrman, and R. Somogyi. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symp. Biocomp, 3:18-29, 1998.
[8] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297, 1967.
[9] V. Plerou, P. Gopikrishnan, B. Rosenow, L. A. N. Amaral, and T. Guhr. Random matrix approach to cross correlations in financial data. Physical Review E, 65(066126):1-18, 2002.
[10] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull., 38:1409-1438, 1958.
[11] P. Toronen, M. Kolehmainen, G. Wong, and E. Castrén. Analysis of gene expression data using self-organizing maps. FEBS Letters, 451:142-146, 1999.