
Page 1: Probabilistic sparse matrix factorization with an application to …delbert/docs/Dueck,Morris,Frey... · 2004-10-08 · Probabilistic sparse matrix factorization with an application

Probabilistic sparse matrix factorization with an application to discovering gene functions in mouse mRNA expression data

Delbert Dueck, Quaid Morris, and Brendan Frey

Department of Electrical and Computer Engineering, University of Toronto

Toronto, Ontario, Canada M5S 3G4
{delbert,quaid,frey}@psi.toronto.edu

Abstract

We address the problem of sparse matrix factorization using a generative model. Our algorithm, probabilistic sparse matrix factorization (PSMF), is a probabilistic extension of a previous hard-decision algorithm for this problem (Srebro and Jaakkola 2001). PSMF allows for varying levels of sensor noise in the data, uncertainty in the hidden prototypes used to explain the data, and uncertainty as to which prototypes are selected to explain each data vector. We present experimental results demonstrating that our method can better recover functionally relevant clusterings in mRNA expression data than standard clustering techniques, and we show that by computing probabilities instead of point estimates, our method avoids converging to poor solutions.

1 Introduction

Many kinds of data can be viewed as consisting of a set of vectors, each of which is a noisy combination of a small number of noisy prototype vectors. Moreover, these prototype vectors may correspond to different hidden variables that play a meaningful role in determining the measured data. For example, a gene's expression is influenced by the presence of transcription factor proteins, and two genes may be activated by overlapping sets of transcription factors. Consequently, the activity of each gene can be explained by the activities of a small number of transcription factors. This task can be viewed as the problem of factorizing a data matrix, while taking into account constraints reflecting structural knowledge of the problem and probabilistic relationships between variables that are induced by known uncertainties in the problem. A simple example of a technique for finding such a factorization is principal components analysis (PCA).

In this paper, we study algorithms for finding matrix factorizations, but with a specific focus on sparse factorizations, and on properly accounting for uncertainties while computing the factorization.

Our interest in algorithms for "probabilistic sparse matrix factorization" (PSMF) is motivated by a problem identified during our collaborations with researchers working in the area of molecular biology. Advances in bio-molecular sensing technologies have enabled researchers to probe the transcriptional state of cells and quantify the abundance of specific nucleotide sequences, including mRNAs (which code for proteins). These technologies, called DNA microarrays, gene expression arrays, or simply expression arrays (Schena et al. 1995), can be applied to a variety of problems, ranging from genome sequencing to detecting the activation of a specific gene in a tissue sample (e.g., for the purpose of detecting disease or monitoring drug response). Although researchers now have digital representations of the actors in cellular processes and the ability to probe these actors, it is not yet known how the information is digitally encoded and how the digital components interact to regulate cellular activities.

An expression array is a 1″ × 2″ biology slide containing tens of thousands of probes, each of which detects a specific nucleic acid sequence. These arrays can be used to measure mRNA abundance levels by first extracting mRNA from a biological sample of interest, tagging the sequences with a fluorescent molecule, and placing the tagged sequences on the array. mRNA sequences from the tissue that match each probe will tend to hybridize (stick to the probe), so when the slide is scanned by a laser, the level of fluorescence at each probe indicates the amount of the corresponding mRNA sequence in the tissue sample. By varying the type of sample (e.g., heart versus brain) or the conditions under which the sample is taken (e.g., healthy versus diseased), multiple arrays can be used to construct a vector of expression levels for each gene, where


the elements of the vector correspond to the different experimental conditions. Such a vector is called an "expression profile".

By viewing expression profiles as points in a vector space, researchers have been able to use well-known vector-space data analysis techniques to identify new patterns of biological significance, and quickly make predictions based on previous studies that required a large amount of time and resources. In particular, the construction of these profiles has enabled the large-scale prediction of gene function for genes with unknown function, based on genes with known function (Hughes et al. 2000). Because many biological functions depend upon the coordinated expression of multiple genes, similarity of expression profile often implies similarity of function (Marcotte et al. 1999). This relationship has been exploited to predict the function of uncharacterized genes by various researchers (see, e.g., Brown et al. 2000). These schemes make use of annotation databases, which assign characterized genes to one or more predefined functional categories.

However, noise in the expression measurements has limited the predictive accuracy of these algorithms, especially in those categories containing a small number of genes. These "small" categories can be of greater interest because less is typically known about them; furthermore, they often make specific (and thus more easily confirmed) predictions about gene function. Here, we introduce an unsupervised technique that jointly denoises expression profile data and computes a sparse matrix factorization. Our technique models the underlying causes of the expression data in terms of a small number of hidden factors. The representation of the expression profile in terms of these hidden factors is less noisy; here we explore whether this representation is more amenable to functional prediction.

Our technique explicitly maximizes a lower bound on the log-likelihood of the data under a probability model. The sparse encoding found by our method can be used for a variety of tasks, including functional prediction and data visualization. We report p-values, which show that our technique predicts functional categories with greater statistical significance than a standard method, hierarchical agglomerative clustering. Also, we show that our algorithm, which computes probabilities rather than making hard decisions, obtains a higher data log-likelihood than the version of the algorithm that makes hard decisions.

2 Methods for matrix factorization

One approach to analyzing data vectors lying in a low-dimensional linear subspace is to stack them to form a data matrix, X, and then find a low-rank matrix factorization of the data matrix. Given X ∈ R^{G×T}, matrix factorization techniques find a Y ∈ R^{G×C} and a Z ∈ R^{C×T} such that X ≈ Y · Z.

Interpreting the rows of X as input vectors {x_g}_{g=1}^G, the rows of Z (i.e. {z_c}_{c=1}^C) can be viewed as vectors that span the C-dimensional linear subspace, in which case the gth row of Y contains the coefficients {y_gc}_{c=1}^C that combine these vectors to explain the gth row of X.
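As a purely illustrative aside (not part of the method described in this paper), the unconstrained version of this low-rank factorization can be computed with a truncated singular value decomposition; the function name and shapes below are our own:

```python
import numpy as np

def low_rank_factorization(X, C):
    """Return Y (G x C) and Z (C x T) minimizing ||X - Y @ Z||_F
    over all rank-C factorizations (Eckart-Young, via truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = U[:, :C] * s[:C]          # absorb singular values into Y
    Z = Vt[:C, :]
    return Y, Z

# Usage: a matrix that is exactly rank 2 is recovered exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
Y, Z = low_rank_factorization(X, C=2)
assert np.allclose(X, Y @ Z)
```

This recovers the PCA-style solution; the sparse, probabilistic factorizations discussed next constrain Y further.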

A variety of techniques have been proposed for finding matrix factorizations, including non-probabilistic techniques such as principal components analysis, independent components analysis (Bell and Sejnowski 1995), and network component analysis (Liao et al. 2003), and also probabilistic techniques that account for noise, such as factor analysis. In many situations, we anticipate that the data consists of additive combinations of positive sources. In these cases, non-probabilistic techniques have been proposed for finding factorizations under non-negativity constraints (Lee and Seung 1999, 2001).

Global gene expression is regulated by a small set of "transcription factors" (TFs) — proteins which bind to the promoter region associated with each gene. Typically, each individual gene is influenced by only a small subset of these TFs. Inspired by this, we model the gene expression vector as an additive, weighted combination of a small number of prototype vectors — each prototype representing the influence of a different factor (or factors). Although each prototype may represent the aggregate effect of a number of TFs that always act in concert, to simplify presentation, we describe each prototype as a transcription factor, and the prototype profile — which represents the effect of the TF on gene expression — as the TF profile.

This type of problem was called "sparse matrix factorization" in (Srebro and Jaakkola 2001), and is related to independent component analysis (ICA). In their model, Srebro and Jaakkola augment the X ≈ Y · Z matrix factorization setup with the sparseness structure constraint that each row of Y has at most N non-zero entries¹. They then describe an iterative algorithm for finding a sparse matrix factorization that makes hard decisions at each step.
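The row-wise sparseness constraint can be pictured as a projection that keeps at most N non-zero entries per row of Y. The sketch below is our own illustration of that constraint (function name and example values are ours), not Srebro and Jaakkola's algorithm:

```python
import numpy as np

def project_row_sparsity(Y, N):
    """Keep the N largest-magnitude entries in each row of Y and zero
    the rest -- the 'at most N non-zero entries per row' constraint."""
    Y = np.asarray(Y, dtype=float).copy()
    for row in Y:
        keep = np.argsort(np.abs(row))[-N:]   # indices of N largest |entries|
        mask = np.zeros_like(row, dtype=bool)
        mask[keep] = True
        row[~mask] = 0.0
    return Y

Y = np.array([[0.1, -3.0, 2.0, 0.5]])
print(project_row_sparsity(Y, N=2))   # -> [[ 0. -3.  2.  0.]]
```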

On the other hand, our method finds such a factorization while accounting for uncertainties due to (1) different levels of noise in the data, (2) different levels of noise in the factors used to explain the data, and (3) uncertainty as to which hidden prototypes are selected to explain each input vector.

¹ When N = 1, this scheme degenerates to clustering with arbitrary data vector scaling; N = C yields ordinary low-rank approximation.


3 Probabilistic sparse matrix factorization (PSMF)

Let X be the matrix of gene expression data such that rows correspond to each of G genes and columns to each of T tissues (i.e. entry x_gt represents the amount by which gene g is expressed in cells of tissue type t). We denote the collection of unobserved transcription factor profiles as a matrix, Z, with rows corresponding to each of C factors and T columns corresponding to tissues, as before.

We model each gene expression profile, x_g, as a linear combination of a small number (r_g) of these transcription factor profiles, z_c, plus noise:

x_g = Σ_{n=1}^{r_g} y_{g,s_gn} z_{s_gn} + noise    (1)

The transcription factor profiles contributing to the gth gene expression profile are indexed by {s_g1, s_g2, . . . , s_g r_g}, with corresponding weights {y_{g,s_g1}, y_{g,s_g2}, . . . , y_{g,s_g r_g}}. This is identical to the X ≈ Y · Z matrix factorization with {S, r} representing the sparseness structure constraint. We account for varying levels of noise in the observed data by assuming the presence of gene-specific isotropic Gaussian sensor noise with variance ψ²_g, so the likelihood of x_g is as follows:

P(x_g | y_g, Z, s_g, r_g, ψ²_g) = N( x_g; Σ_{n=1}^{r_g} y_{g,s_gn} z_{s_gn}, ψ²_g I )    (2)

We complete the model with prior assumptions that the factor profiles (z_c) are normally distributed and that the factor indices (s_gn) are uniformly distributed. The number of causes, r_g, contributing to each gene's profile is multinomially distributed such that P(r_g = n) = ν_n, where ν is a user-specified N-vector. We make no assumptions about Y beyond the sparseness constraint, so P(Y) ∝ 1.
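To make the generative process concrete, here is a minimal ancestral-sampling sketch of the model just described. All names are our own; since P(Y) ∝ 1 is an improper prior, we substitute uniform [0, 1) weights purely for illustration:

```python
import numpy as np

def sample_psmf(G=100, T=55, C=50, nu=(0.55, 0.27, 0.18), psi=0.1, seed=0):
    """Draw synthetic data from the PSMF generative model:
    z_c ~ N(0, I); r_g ~ Multinomial(nu); s_gn uniform over factors;
    x_g = sum_n y_{g,s_gn} z_{s_gn} + N(0, psi^2 I)  [equations (1)-(2)]."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((C, T))               # factor profiles, N(0, I)
    X = np.empty((G, T))
    for g in range(G):
        r = rng.choice(len(nu), p=nu) + 1         # number of active factors
        s = rng.choice(C, size=r, replace=False)  # selected factor indices
        y = rng.random(r)                         # weights (illustrative stand-in)
        X[g] = y @ Z[s] + psi * rng.standard_normal(T)
    return X, Z

X, Z = sample_psmf()
```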

Multiplying these priors by (2) forms the following joint distribution:

P(X, Y, Z, S, r | Ψ)
  = P(X | Y, Z, S, r, Ψ) · P(Y) · P(Z) · P(S | r) · P(r)
  ∝ Π_{g=1}^G N( x_g; Σ_{n=1}^{r_g} y_{g,s_gn} z_{s_gn}, ψ²_g I ) · Π_{c=1}^C N( z_c; 0, I )
    · Π_{g=1}^G Π_{c=1}^C Π_{n=1}^N (1/C)^{δ(s_gn − c)} · Π_{g=1}^G Π_{n=1}^N (ν_n)^{δ(r_g − n)}    (3)

3.1 Factorized Variational Inference

Exact inference with (3) is intractable, so we utilize a factorized variational method (Jordan et al. 1998) and approximate the posterior distribution with a mean-field decomposition:

P(Y, Z, S, r | X, Ψ) ≈ Π_{g=1}^G Q(y_g) · Π_{c=1}^C Q(z_c) · Π_{g=1}^G Π_{n=1}^N Q(s_gn) · Π_{g=1}^G Q(r_g)    (4)

We parameterize the Q-distribution as follows:

Q(y_g) = Π_{n=1}^{r_g} δ( y_{g,s_gn} − λ_{g,s_gn} ) · Π_{c ∉ {s_g1, s_g2, . . . , s_g r_g}} δ( y_gc )

where, in the first product, λ_gc is a point estimate of y_gc, and the second product forces y_gc = 0 for all factors not selected by {s_g, r_g};

Q(z_ct) = N( z_ct; ζ_ct, φ²_c )
Q(s_gn = c) = σ_gnc
Q(r_g = n) = ρ_gn

Using this approach, inference corresponds to bringing the Q-distribution as close as possible to the P-distribution by setting the variational parameters (λ, ζ, φ, σ, ρ) to minimize the relative entropy, D(Q‖P):

min_{λ,ζ,φ,σ,ρ} Σ_{Y,Z,S,r} Q(Y, Z, S, r) · log [ Q(Y, Z, S, r) / P(Y, Z, S, r | X, Ψ) ]    (5)

The constraints Σ_{c=1}^C σ_gnc = 1 and Σ_{n=1}^N ρ_gn = 1 become Lagrange multipliers in this optimization problem.

There is no closed-form expression for the posterior (the denominator in (5)), but we can subtract log P(X) inside the integral (it is independent of the variational parameters) to form the readily-minimized free energy, F:

F = D(Q‖P) − log P(X)
  = Σ_{Y,Z,S,r} Q(Y, Z, S, r) · log [ Q(Y, Z, S, r) / P(X, Y, Z, S, r | Ψ) ]
  ...
  = Σ_{g=1}^G Σ_{n=1}^N ρ_gn Σ_{n′=1}^n Σ_{c=1}^C σ_{gn′c} · log [ σ_{gn′c} / (1/C) ]
    + Σ_{g=1}^G Σ_{n=1}^N ρ_gn · log ( ρ_gn / ν_n )
    + (T/2) Σ_{g=1}^G log 2πψ²_g − (T/2) Σ_{c=1}^C ( 1 + log φ²_c ) + (1/2) Σ_{t=1}^T Σ_{c=1}^C ( ζ²_ct + φ²_c )
    + (1/2) Σ_{g=1}^G Σ_{t=1}^T Σ_{n=1}^N ( ρ_gn / ψ²_g ) Σ_{c₁=1}^C σ_{g1c₁} Σ_{c₂=1}^C σ_{g2c₂} · · · Σ_{c_n=1}^C σ_{gnc_n} [ ( x_gt − Σ_{n′=1}^n λ_{g,c_n′} ζ_{c_n′,t} )² + Σ_{n′=1}^n λ²_{g,c_n′} φ²_{c_n′} ]


The free energy can be minimized sequentially with respect to each variational parameter (λ, ζ, φ, σ, ρ) by analytically finding zeros of the partial derivatives with respect to them. This coordinate descent represents the E-step in variational EM (Jordan et al. 1998) that alternates with a brief M-step, where the global sensor noise is fit by similarly solving ∂F/∂Ψ = 0. For a more detailed treatment of the algorithm and parameter update equations, see (Dueck and Frey 2004).

3.2 Non-negativity Constraints

Recalling the biological intuition behind the model, gene expression profiles across tissues are thought to consist of a linear combination of transcription factor profiles. Since gene expression data represents gene abundances, the data matrix X should only have non-negative entries. To ensure this, it is convenient to constrain the transcription factor profile vectors, Z, to also be non-negative.

Similarly, we wish to apply a non-negativity constraint to the factor strengths (weights), Y, to be compatible with the notion of allowing only additive combinations of factors; side-effects of this are described in (Lee and Seung 1999, 2001). This constraint could be enforced by introducing additional Lagrange multipliers to constrain λ and ζ, or by modifying the priors for Y and Z to disallow negative values, but this adversely affects the differentiability of F. In practice, we simply zero-threshold the parameters after each update, sidestepping the issue of further complicating the update equations. This heuristic has no noticeable impact on free energy minimization.
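A minimal sketch of this zero-thresholding heuristic; the surrounding update calls shown in comments are hypothetical placeholders for the actual coordinate-descent updates:

```python
import numpy as np

def zero_threshold(A):
    """Project onto the non-negative orthant by clamping negatives to zero."""
    return np.maximum(A, 0.0)

# Hypothetical usage inside one coordinate-descent sweep:
#   lam  = zero_threshold(update_lambda(...))   # factor weights stay >= 0
#   zeta = zero_threshold(update_zeta(...))     # factor profile means stay >= 0

A = np.array([[-0.3, 1.2], [0.0, -2.0]])
print(zero_threshold(A))   # negatives clamped to 0
```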

4 Experimental Results

To experimentally analyze the performance of our algorithm, we use the expression data from Zhang et al. (Zhang, Morris et al. 2004). Each gene's profile is a measurement of its expression in a set of 55 mouse tissues. Expression levels are rectified at the median expression level, so that the data set is entirely non-negative, consistent with the notion of light intensities also being non-negative.

The functional category labels for the genes with known biological function were taken from (Zhang, Morris et al. 2004). An example of a category label is "cell wall biosynthesis", which indicates the gene expresses a protein that is involved in building the cell wall. These labels are derived from Gene Ontology Biological Process (GO-BP) category labels (Ashburner et al. 2000) assigned to genes by EBI and MGI.

Among the 22,709 genes in the Zhang et al. database, 9499 have annotations in one or more of 992 different GO categories. The category sizes range from 3 to 456 genes, with more than half the categories having fewer than 20 genes.

We present results for the 22,709 genes × 55 tissues data set shown in Figure 1. The data matrix, X, is shown alongside the model's approximation, X̂, also expressed as X̂ = Y · Z. A total of C = 50 factors were used, and the user-specified prior on the number of factors (r_g) explaining each expression profile was set to ν = [.55 .27 .18], making N = 3.²

Gene expression profiles (row vectors) are first sorted by 'primary' factor (s_g1); next, within each s_g1 grouping, genes are sorted by 'secondary' factor (s_g2), and so on. This organization is easily visualized in the hierarchical diagonal structure of Y.

4.1 Unsupervised characterization of mRNA data

The overall objective for this research is to develop a model that captures the functionally relevant hidden factors that explain gene expression data. As such, we can gauge success by measuring how similar the hidden factor assignments are for genes with similar functions.

We cluster genes on the basis of their TF assignments and use p-values calculated from the hypergeometric distribution to measure the enrichment of these clusters for genes with similar function (Tavazoie et al. 1999). For a given gene cluster of size M of which K elements have a given functional annotation, the hypergeometric p-value is the probability that a random set of genes of size M (selected without replacement from all genes in the dataset) would have K or more elements with the given functional annotation. For N = 3, we generate three different clusterings of the genes: by 'primary', 'secondary', and 'tertiary' factor assignments. So, for example, genes g and g′ are in the same 'primary' factor cluster if s_g1 = s_g′1. We label each cluster with the GO-BP category having the lowest hypergeometric p-value for the genes in that cluster, making this p-value the p-value for the entire cluster.
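The enrichment p-value described above is the upper tail of a hypergeometric distribution and can be computed directly; a stdlib-only sketch with our own function and variable names:

```python
from math import comb

def enrichment_pvalue(n_genes, n_annotated, M, K):
    """P(K or more annotated genes in a random size-M cluster drawn
    without replacement from n_genes genes, n_annotated of which carry
    the annotation) -- the hypergeometric upper tail."""
    total = comb(n_genes, M)
    tail = sum(comb(n_annotated, k) * comb(n_genes - n_annotated, M - k)
               for k in range(K, min(M, n_annotated) + 1))
    return tail / total

# e.g. a 10-gene cluster with 5 annotated hits, drawn from 1000 genes
# of which only 20 carry the annotation: far more hits than chance.
p = enrichment_pvalue(1000, 20, 10, 5)
assert p < 1e-5
```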

² A uniform prior ν (reflecting no knowledge about the distribution of r) would give equal preference to all values of a particular r_g. For any given r_g < N, a factor can almost always be found that, if present with infinitesimal weight (y_gc), will imperceptibly improve the cost function (F), with the end result that almost all r_g would then equal N. Weighting the prior towards lower values ensures that factors will only be included if they make a noteworthy difference. We choose ν_n ∝ 1/n, ∀n ≤ N, only for simplicity.


[Figure 1 graphic: (a) the complete data set, X (G = 22,709 genes × T = 55 tissues), alongside X̂ = Y • Z with Y (G × C) and Z (C × T), C = 50 factors; (b) the 839-gene subset in which factor 37 is primary, x_g ≈ x̂_g = y_g • Z, with the 37th-factor weights visible as a vertical line; (c) the 32-gene subset in which factor 37 is primary and factor 19 is secondary. X, X̂ scale: 0 to >10; Y, Z scale: 0 to >3.]

Figure 1: Data matrix X approximated by X̂, the product of sparse Y and low-rank Z. Gene expression profiles appear as row vectors in X and X̂, sorted by primary class (s_g1), secondary class (s_g2), etc. (a) shows the complete data set. (b) shows only those genes where factor 37 is primary (i.e. {∀g | s_g1 = 37} — note the vertical line in y_g). (c) shows those genes where factor 37 is primary and factor 19 is secondary (i.e. {∀g | s_g1 = 37 ∩ s_g2 = 19} — note the vertical lines in y_g).

For reference, we also compute the functional enrichment of gene clusters generated randomly and those generated using hierarchical agglomerative clustering (HAC), SMF, and ICA. To generate the HAC clusters, we used average linkage with Pearson's correlation coefficient on the expression profiles. We generated SMF clusters in the same way as we generated the PSMF clusters. Note that unlike HAC, PSMF and SMF allow us to generate 'primary', 'secondary', and 'tertiary' clusters. To generate the ICA clusters, we used the FastICA package (Hyvärinen 1999) to recover 55 source vectors (i.e. factors) for X, each of length 22,709. We ignore the mixing matrix recovered by FastICA, and cluster the genes by the index of the largest absolute value among the corresponding 55-dimensional slice in the matrix of source vectors.
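The ICA cluster assignment described above reduces to an argmax over absolute source values for each gene; a small sketch (the toy matrix `S` stands in for the source vectors recovered by FastICA, with genes as columns):

```python
def dominant_source_clusters(S):
    """Assign each gene (a column of the C x G source matrix S) to the
    source with the largest absolute value in that column."""
    C, G = len(S), len(S[0])
    return [max(range(C), key=lambda c: abs(S[c][g])) for g in range(G)]

# Toy example: 3 sources, 4 genes.
S = [[0.1, -2.0,  0.3, 0.0],
     [1.5,  0.2,  0.1, 0.4],
     [0.0,  0.5, -0.9, 0.1]]
print(dominant_source_clusters(S))   # -> [1, 0, 2, 1]
```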

Histograms of these p-values, as well as those for hierarchical agglomerative clustering, are shown in Figure 2. We summarize these histograms in two ways: the proportion of clusters that are significantly enriched for at least one functional category at α = 0.05 (after a Bonferroni correction) and the mean log10(p-value). Figure 3 shows these quantities for N = {1, 2, 3} over 0 < C ≤ 100.
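The two summary statistics can be sketched as follows; the example p-values are ours, used only to illustrate the Bonferroni cutoff α divided by the number of tested categories:

```python
from math import log10

def summarize_enrichment(pvalues, n_categories, alpha=0.05):
    """Return (fraction of clusters significant after a Bonferroni
    correction over n_categories tests, mean log10 p-value)."""
    cutoff = alpha / n_categories
    frac_sig = sum(p < cutoff for p in pvalues) / len(pvalues)
    mean_log_p = sum(log10(p) for p in pvalues) / len(pvalues)
    return frac_sig, mean_log_p

# Illustrative p-values for four clusters, 992 categories as in the text.
frac, mean_lp = summarize_enrichment([1e-8, 1e-3, 0.2, 1e-6], n_categories=992)
```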

[Figure 2 graphic: four histograms of log10(p-value) (range −20 to 0, frequency up to 0.3), one each for hierarchical agglomerative clustering and for PSMF 'primary', 'secondary', and 'tertiary' clusters.]

Figure 2: P-values for hierarchical agglomerative clustering and probabilistic sparse matrix factorization (C = 50, N = 3). Significantly enriched factors/clusters (at α = 0.05 after Bonferroni correction) are shown as solid bars.

The plot shows that PSMF with N = 1 outperforms hierarchical agglomerative clustering and clustering by dominant sources with ICA. For N > 1, the functional enrichment of 'primary' clusters for PSMF and SMF remains constant but the 'secondary' and 'tertiary'


[Figure 3 graphic: three columns of plots for N = 1 (ν = [1.0 0 ··· 0]), N = 2 (ν = [.67 .33 0 ··· 0]), and N = 3 (ν = [.55 .27 .18 0 ··· 0]); the top row plots the fraction of factors with significance (0%–100%) and the bottom row the mean log10(p-value) (−25 to 0), each versus C (# clusters, factors; 0–100), for the s_g1, s_g2, and s_g3 clusterings; ICA is marked at C = T = 55. Legend: PSMF (variational), SMF (ICM), hierarchical agglomerative clustering, random clustering.]

Figure 3: Fraction of clusters/factors with significance (top row) and mean log10(p-value) (bottom row). Probabilistic sparse matrix factorization (Dueck and Frey 2004) and sparse matrix factorization (Srebro and Jaakkola 2001) are shown for N = 1 (left), N = 2 (center), and N = 3 (right), along with hierarchical agglomerative clustering and random clustering over a range of C-values. Error bars represent the sample standard deviation over ten trials.

clusters are also more functionally enriched than random. Note that for all values of N, the PSMF clusters are more enriched than the corresponding SMF clusters.

Genes can perform multiple functional roles, and Figures 2 and 3 suggest that the 'secondary' and 'tertiary' factors may capture this additional level of functional organization. Figure 4 shows the probability distribution of 'secondary' factor assignment conditioned on various 'primary' factor assignments. Factors are labeled with their most significantly enriched GO-BP category, and only factors with statistically significant functional enrichment (26 primary and 17 secondary) are shown.

Two possible types of functional organization can be seen in this figure. Consider the primary factor labelled "response to pest/pathogen/parasite" (circled horizontally) in Figure 4. Genes with this as their primary factor are primarily assigned into groups enriched for either "mitotic cell cycle", "amino acid transport", or "humoral immune response" by their secondary factor, possibly representing different strategies for generating an immune response.

In contrast, consider the secondary factor enriched for "amino acid transport" (circled column in Figure 4). Genes having this secondary factor are primarily drawn from primary factors enriched for "response to pest/pathogen/parasite" (from before), "cytokinesis", "fatty acid metabolism", and "pattern specification", perhaps representing different processes which require the transportation of amino acids.

4.2 Avoiding local minima by accounting for uncertainty

The benefit of using a soft-decision factorized variational method (PSMF) — as opposed to hard-decision iterated conditional modes (SMF) (Srebro and Jaakkola 2001, Besag 1986) — is also apparent in Figure 3. This is the case because SMF is too inflexible to properly allow factor groupings to evolve, as shown in the likelihood plots of Figure 5.

In our simulations, we observe that SMF appears to get trapped in a local log-likelihood maximum immediately after the first several iterations. An additional advantage of using probabilistic inference is that the free energy provides a tighter bound on the marginal


Figure 4: Conditional probability distributions of secondary factors (columns) given the primary factor (rows); only significant factors are shown. Probabilistic sparse matrix factorization was performed with C = 50 factors and N = 2 (ν = [.67 .33]). The GO categories with gene memberships most similar to each factor are labelled along each axis. The circled row and column are discussed in §4.1.

[Figure 5 graphic: free energy minimization (⇒ log-likelihood maximization); −ℓ_C and F (×10^5, range 0–16) plotted against iteration (0–30) for SMF (iterated conditional modes) and PSMF (factorized variational method).]

Figure 5: Complete log-likelihood maximization of sparse matrix factorization (SMF) versus free energy minimization of probabilistic sparse matrix factorization (PSMF). The complete log-likelihood, ℓ_C (the log of (3)), is calculated after each iteration of both ICM (which directly maximizes ℓ_C) and variational EM, using maximum-likelihood latent variable estimates.

log-likelihood of the data (not shown), which is greater than the complete log-likelihood.

5 Summary

Many kinds of data vectors can most naturally be explained as an additive combination of a selection of prototype vectors, which can be viewed as the computational problem of finding a sparse matrix factorization. While most work on biological gene expression arrays has focused on clustering techniques and methods for dimensionality reduction, there is recent interest in performing these tasks jointly, which corresponds to sparse matrix factorization. Like (Srebro and Jaakkola 2001), our algorithm computes a sparse matrix factorization, but instead of making point estimates (hard decisions) for factor selections, our algorithm computes probability distributions. We find that this enables the algorithm to avoid local minima found by iterated conditional modes.

Compared to a standard technique used in the bioinformatics community for clustering gene expression profiles, hierarchical agglomerative clustering, our technique finds clusters that have higher statistical significance (lower p-values) in terms of enrichment of gene function categories. An additional advantage of our method over standard clustering techniques is that


the secondary and higher-order labels found for each expression vector can be used for more refined visualization and functional prediction. We are currently exploring this possibility.

Acknowledgements

The authors thank Tim Hughes from the Banting and Best Department of Medical Research at the University of Toronto for making the data available, and Sam Roweis, Tommi Jaakkola and Nati Srebro for helpful discussions. We also gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada.

References

Ashburner, M. et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 25: 25–29.

Bell, A. J. and Sejnowski, T. J. (1995) An information maximization approach to blind separation and blind deconvolution. Neural Computation 7: 1129–1159.

Besag, J. (1986) On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B 48: 259–302.

Brown, M.P. et al. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences 97: 262–267.

Dueck, D. and Frey, B. (2004) Probabilistic sparse matrix factorization. University of Toronto technical report PSI-2004-23.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95: 14863–14868.

Hyvärinen, A. (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3): 626–634.

Hughes, T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell 102: 109–126.

Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., and Saul, L.K. (1998) An introduction to variational methods for graphical models. In M.I. Jordan (ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.

Lee, D. and Seung, H. (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401: 788–791.

Lee, D. and Seung, H. (2001) Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pp. 556–562.

Liao, J.C., et al. (2003) Network component analysis: Reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences 100: 15522–15527.

Marcotte, E.M., et al. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751–753.

Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470.

Srebro, N. and Jaakkola, T. (2001) Sparse matrix factorization of gene expression data. Unpublished note, MIT Artificial Intelligence Laboratory. Available at www.ai.mit.edu/research/abstracts/abstracts2001/genomics/01srebro.pdf

Tavazoie, S. et al. (1999) Systematic determination of genetic network architecture. Nature Genetics 22(3): 213–215.

Zhang, W., Morris, Q., et al. (2004) The functional landscape of mouse gene expression. Submitted for publication.