do not reproduce without permission 1 1 - lectures.gersteinlab.org (c) '09 bioinformatics...

148
Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452 (last edit in spring '11)

Upload: ethan-daniels

Post on 29-Dec-2015

230 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 11 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

BIOINFORMATICSDatabases & Datamining II

Mark Gerstein, Yale University

gersteinlab.org/courses/452(last edit in spring '11)

Page 2: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 22 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Datamining

• Building the data matrix- Generic- Gene centric - Genomic region- Network centric

• Databases• Supervised Mining

• Unsupervised mining- Simple overlaps- Clustering rows &

columns (networks)- PCA- SVD (theory + appl.)- Weighted Gene Co-

Expression Network- Biplot- CCA

Page 3: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 33 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Spectral Methods Outline & Papers

• Simple background on PCA (emphasizing lingo)• More abstract run through on SVD• Application to

- O Alter et al. (2000). "Singular value decomposition for genome-wide expression data processing and modeling." PNAS 97: 10101

- J Qian et al. (2001). "Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions." J Mol Biol. 314: 1053.

- Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54

- Z Zhang et al. (2007) "Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions." Genome Res 17: 787

- TA Gianoulis et al. (2009) "Quantifying environmental adaptation of metabolic pathways in metagenomics." PNAS 106: 1374.

Page 4: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 44 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

Simple "Overlaps"

Page 5: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 55 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Structure of Genomic Features Matrix

Page 6: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 66 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Example Overlap

Analysis: Biochemically Active Regions Don't all Appear

to be Under Constraint

• Careful Randomization (GSC statistic)

[ENCODE Consortium, Nature 447, 2007]

Page 7: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 77 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Genomic Features Matrix: Deserts & Forests

Page 8: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 88 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Non-random distribution of TREs

• TREs are not evenly distributed throughout the encode regions (P < 2.2×10−16 ).

• The actual TRE distribution is power-law.

• The null distribution is ‘Poissonesque.’

• Many genomic subregions with extreme numbers of TREs.

Zhang et al. (2007) Gen. Res.

Page 9: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 99 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Aggregation & Saturation

[Nat. Rev. Genet. (2010) 11: 559]

Page 10: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1010 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Aggregation Analysis

[ENCODE Consortium, Nature 447, 2007]

Page 11: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1111 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

Clustering Columns & Rows of the Data

Matrix

Page 12: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1212 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Correlating Rows & Columns

[Nat. Rev. Genet. (2010) 11: 559]

Page 13: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1313 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

“cluster” predictors

Page 14: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1414 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Use clusters to predict Response

Page 15: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1515 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Agglomerative Clustering

• Bottom up v top down (K-means, know how many centers)

• Single or multi-link- threshold for

connection?

http://commons.wikimedia.org/wiki/File:Hierarchical_clustering_diagram.png

Page 16: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1616 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

K-means

1) Pick ten (i.e. k?) random points as putative cluster centers. 2) Group the points to be clustered by the center to which they areclosest. 3) Then take the mean of each group and repeat, with the means now atthe cluster center.4)Stop when the centers stop moving.

Page 17: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 1717 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Using Genes to Cluster Cancer Samples

Cheng et al., Genome Biology, 2009

(1) RE-score profile for diff. miRNA in 1 cancer sample. (2) Tabulate over many different breast cancer samples

(3) Clustering based on RE score divides samples into 2 main types of cancer

(4) Clustering better than based on indiv. gene expression levels

Page 18: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

18

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Clusteringthe

yeast cell cycle to uncover

interacting proteins

-2

-1

0

1

2

3

4

0 4 8 12 16

RPL19B

TFIIIC

Microarray timecourse of 1 ribosomal protein

mR

NA

exp

ress

ion

leve

l (ra

tio)

Time->

[Brown, Davis]

Page 19: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

19

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Clusteringthe

yeast cell cycle to uncover

interacting proteins

-2

-1

0

1

2

3

4

0 4 8 12 16

RPL19B

TFIIIC

Random relationship from ~18M

mR

NA

exp

ress

ion

leve

l (ra

tio)

Time->

Page 20: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

20

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Clusteringthe

yeast cell cycle to uncover

interacting proteins

-2

-1

0

1

2

3

4

0 4 8 12 16

RPL19B

RPS6B

Close relationship from 18M (2 Interacting Ribosomal Proteins)

mR

NA

exp

ress

ion

leve

l (ra

tio)

Time->

[Botstein; Church, Vidal]

Page 21: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

21

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Clusteringthe

yeast cell cycle to uncover

interacting proteins

-2

-1

0

1

2

3

4

0 4 8 12 16

RPL19B

RPS6B

RPP1A

RPL15A

?????

Predict Functional Interaction of Unknown Member of Cluster

mR

NA

exp

ress

ion

leve

l (ra

tio)

Time->

Page 22: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

22

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Global Network of Relationships

~470K significant

relationshipsfrom ~18M

possible

Page 23: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

23

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Network = Adjacency Matrix

• Adjacency matrix A=[aij] encodes

whether/how a pair of nodes is connected.

• For unweighted networks: entries are 1

(connected) or 0 (disconnected)

• For weighted networks: adjacency matrix

reports connection strength between gene

pairs

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 24: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

24

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Steps for constructing aco-expression network

A) Get microarray gene expression data

B) Do preliminary filteringC) Measure concordance of gene

expression profiles by Pearson correlation

C) The Pearson correlation matrix is either dichotomized to arrive at an adjacency matrix unweighted network

...Or transformed continuously with the power adjacency function weighted network A

dapt

ed fr

om :

http

://w

ww

.gen

etic

s.uc

la.e

du/la

bs/h

orva

th/C

oexp

ress

ionN

etw

ork

Page 25: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

25

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Power adjacency function to transform correlation into

adjacency

| ( , ) |ij i ja cor x x To determine β: in general use the “scale free topology criterion” described in Zhang and Horvath 2005

Typical value: β=6 Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 26: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

26

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Comparing adjacency functions

Power Adjancy (soft threshold) vs Step Function (hard threshold)

Ada

pted

fro

m :

htt

p://

ww

w.g

enet

ics.

ucla

.edu

/labs

/hor

vath

/Coe

xpre

ssio

nNet

wor

k

Page 27: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

27

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Why weighted?

• A continuous spectrum between perfect co-

expression and no co-expression at all

• Could threshold, but will lose information

• Instead, assign a weight to each link that

represents the extent of gene co-expression

• Natural range of weights: 0=no connection,

1=perfect agreement.

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 28: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

28

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Local Clustering algorithm identifies further

(reasonable) types of

expression relation-ships

Simultaneous

TraditionalGlobal

Correlation

Inverted

Time-Shifted

Page 29: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

29

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Local Alignment

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New,

Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Simultaneous

Page 30: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

30

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Local Alignment

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New,

Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Simultaneous

Page 31: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

31

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Local Alignment

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New,

Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Time-Shifted

Page 32: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

32

(c)

Mar

k G

erst

ein

, 19

99,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Local Alignment

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New,

Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Inverted

Page 33: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 3333 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

PCA

Page 34: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 3434 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Relation of Clustering to PCA

• Correlation Matrix C- Similarity of all pairs, either in terms of rows (genes)

or columns (conditions)

- C(genes) = AAT - C(conditions) = ATA

• Clustering - Operates on C, thresholding it to give a network

• PCA gives another view of C- diagonalizes C

• Two different correlation matrices (for rows & cols)- SVD operates on both

Page 35: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 3535 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

35

Covariance matrix

• N random variables• NxN symmetric matrix

• Diagonal elements are variances

Page 36: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 3636 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

PCA section will be a "mash up" up a number of PPTs on the web

• pca-1 - black ---> www.astro.princeton.edu/~gk/A542/PCA.ppt• by Professor Gillian R. Knapp [email protected]

• pca-2 - yellow ---> myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt• by Hal Whitehead.• This is the class main url

http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm

• pca.ppt - what is cov. matrix ----> hebb.mit.edu/courses/9.641/lectures/pca.ppt• by Sebastian Seung. Here is the main page of the course• http://hebb.mit.edu/courses/9.641/index.html

• from BIIS_05lecture7.ppt ----> www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt• by R.S.Gaborski Professor

Page 37: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

37

abstract

Principal component analysis (PCA) is a technique that is useful for thecompression and classification of data. The purpose is to reduce thedimensionality of a data set (sample) by finding a new set of variables,smaller than the original set of variables, that nonetheless retains mostof the sample's information.

By information we mean the variation present in the sample,given by the correlations between the original variables. The newvariables, called principal components (PCs), are uncorrelated, and areordered by the fraction of the total information each retains.

Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt

Page 38: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

38

Geometric picture of principal components (PCs)

A sample of n observations in the 2-D space

Goal: to account for the variation in a sample in as few variables as possible, to some accuracy

Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt

Page 39: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

39

Geometric picture of principal components (PCs)

• the 1st PC is a minimum distance fit to a line in space

PCs are a series of linear least squares fits to a sample,each orthogonal to all the previous.

• the 2nd PC is a minimum distance fit to a line in the plane perpendicular to the 1st PC

Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt

Page 40: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

40

PCA: General methodologyFrom k original variables: x1,x2,...,xk:

Produce k new variables: y1,y2,...,yk:y1 = a11x1 + a12x2 + ... + a1kxk

y2 = a21x1 + a22x2 + ... + a2kxk

...yk = ak1x1 + ak2x2 + ... + akkxksuch that:

yk's are uncorrelated (orthogonal)y1 explains as much as possible of original variance in data sety2 explains as much as possible of remaining varianceetc.

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 41: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

41

PCA: General methodologyFrom k original variables: x1,x2,...,xk:

Produce k new variables: y1,y2,...,yk:y1 = a11x1 + a12x2 + ... + a1kxk

y2 = a21x1 + a22x2 + ... + a2kxk

...

yk = ak1x1 + ak2x2 + ... + akkxk

such that:yk's are uncorrelated (orthogonal)y1 explains as much as possible of original variance in data sety2 explains as much as possible of remaining varianceetc.

yk's arePrincipal

Components

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 42: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

42

Principal Components Analysis

4.0 4.5 5.0 5.5 6.02

3

4

5

1st Principal Component, y1

2nd Principal Component, y2

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 43: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

43

Principal Components Analysis

• Rotates multivariate dataset into a new configuration which is easier to interpret

• Purposes– simplify data– look at relationships between variables– look at patterns of units

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 44: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

44

Principal Components Analysis{a11,a12,...,a1k} is 1st Eigenvector of correlation/covariance

matrix, and coefficients of first principal component

{a21,a22,...,a2k} is 2nd Eigenvector of correlation/covariance matrix, and coefficients of 2nd principal component

{ak1,ak2,...,akk} is kth Eigenvector of correlation/covariance matrix, and coefficients of kth principal component

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 45: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

45

eigenvalue problem

• The eigenvalue problem is any problem having the following form:

A . v = λ . vA: n x n matrixv: n x 1 non-zero vectorλ: scalarAny value of λ for which this equation has a solution is called the eigenvalue of A and vector v which corresponds to this value is called the eigenvector of A.

[from BIIS_05lecture7.ppt] Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt

Page 46: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

46

eigenvalue problem

2 3 3 12 3

2 1 2 8 2

A . v = λ . v

Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A

Given matrix A, how can we calculate the eigenvector and eigenvalues for A?

x = 4 x=

[from BIIS_05lecture7.ppt] Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt

Page 47: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

47

Principal Components Analysis

So, principal components are given by:

y1 = a11x1 + a12x2 + ... + a1kxk

y2 = a21x1 + a22x2 + ... + a2kxk

...

yk = ak1x1 + ak2x2 + ... + akkxk

xj’s are standardized if correlation matrix is used (mean 0.0, SD 1.0)

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 48: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

48

Principal Components Analysis

Score of ith unit on jth principal component

yi,j = aj1xi1 + aj2xi2 + ... + ajkxik

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 49: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

49

PCA Scores

4.0 4.5 5.0 5.5 6.02

3

4

5

xi2

xi1

yi,1 yi,2

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 50: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

50

Principal Components Analysis

Amount of variance accounted for by:1st principal component, λ1, 1st eigenvalue

2nd principal component, λ2, 2nd eigenvalue

...

λ1 > λ2 > λ3 > λ4 > ...

Average λj = 1 (correlation matrix)

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 51: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

51

Principal Components Analysis:Eigenvalues

4.0 4.5 5.0 5.5 6.02

3

4

5

λ1λ2

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 52: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

52

PCA: Terminology• jth principal component is jth eigenvector of

correlation/covariance matrix• coefficients, ajk, are elements of eigenvectors and relate original

variables (standardized if using correlation matrix) to components• scores are values of units on components (produced using

coefficients)• amount of variance accounted for by component is given by

eigenvalue, λj

• proportion of variance accounted for by component is given by λj / Σ λj

• loading of kth original variable on jth component is given by ajk√λj --correlation between variable and component

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 53: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

53

How many components to use?

• If λj < 1 then component explains less variance

than original variable (correlation matrix)• Use 2 components (or 3) for visual ease• Scree diagram:

0

0.5

1

1.5

2

2.5

1 2 3 4 5

Component number

Eig

enva

lue

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 54: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

54

Principal Components Analysis on:• Covariance Matrix:

– Variables must be in same units– Emphasizes variables with most variance– Mean eigenvalue ≠1.0– Useful in morphometrics, a few other cases

• Correlation Matrix:– Variables are standardized (mean 0.0, SD 1.0)– Variables can be in different units– All variables have same impact on analysis– Mean eigenvalue = 1.0

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 55: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

55

PCA: Potential Problems

• Lack of Independence– NO PROBLEM

• Lack of Normality– Normality desirable but not essential

• Lack of Precision– Precision desirable but not essential

• Many Zeroes in Data Matrix– Problem (use Correspondence Analysis)

Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt

Page 56: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

56

PCA applications -Eigenfaces

• the principal eigenface looks like a bland androgynous average human face

http://en.wikipedia.org/wiki/Image:Eigenfaces.png

Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt

Page 57: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

57

Eigenfaces – Face Recognition

• When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.

• Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces

• Hence eigenfaces provide a means of applying data compression to faces for identification purposes.

Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt

Page 58: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 5858 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

SVD

Puts together slides prepared by Brandon Xia with images from Alter et al. and Kluger et al. papers

Page 59: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

59

SVD for microarray data(Alter et al, PNAS 2000)

Page 60: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

60

A = USVT

• A is any rectangular matrix (m ≥ n)• Row space: vector subspace

generated by the row vectors of A• Column space: vector subspace

generated by the column vectors of A– The dimension of the row & column

space is the rank of the matrix A: r (≤ n)

• A is a linear transformation that maps vector x in row space into vector Ax in column space

Page 61: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

61

A = USVT

• U is an “orthogonal” matrix (m ≥ n)• Column vectors of U form an

orthonormal basis for the column space of A: UTU=I

• u1, …, un in U are eigenvectors of AAT

– AAT =USVT VSUT =US2 UT

– “Left singular vectors”

1 2

| | |

| | |nU

u u u

Page 62: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

62

A = USVT

• V is an orthogonal matrix (n by n)• Column vectors of V form an

orthonormal basis for the row space of A: VTV=VVT=I

• v1, …, vn in V are eigenvectors of ATA– ATA =VSUT USVT =VS2 VT

– “Right singular vectors”

1 2

| | |

| | |nV

v v v

Page 63: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

63

A = USVT

• S is a diagonal matrix (n by n) of non-negative singular values

• Typically sorted from largest to smallest

• Singular values are the non-negative square root of corresponding eigenvalues of ATA and AAT

Page 64: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

64

AV = US

• Means each Avi = siui

• Remember A is a linear map from row space to column space

• Here, A maps an orthonormal basis {vi} in row space into an orthonormal basis {ui} in column space

• Each component of ui is the projection of a row onto the vector vi

Page 65: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

65

SVD of A (m by n): recap

• A = USVT = (big-"orthogonal")(diagonal)(sq-orthogonal)

• u1, …, um in U are eigenvectors of AAT

• v1, …, vn in V are eigenvectors of ATA• s1, …, sn in S are nonnegative singular values of

A

• AV = US means each Avi = siui

• “Every A is diagonalized by 2 orthogonal matrices”

Page 66: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

66

SVD as sum of rank-1 matrices

• A = USVT

• A = s1u1v1T + s2u2v2

T +… + snunvnT

• s1 ≥ s2 ≥ … ≥ sn ≥ 0

• What is the rank-r matrix A that best approximates A ?– Minimize

• A = s1u1v1T + s2u2v2

T +… + srurvrT

• Very useful for matrix approximation

2

1 1

ˆm n

ij iji j

A A

Page 67: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

67

Examples of (almost) rank-1 matrices

• Steady states with fluctuations

• Array artifacts?

• Signals?

101 103 102

302 300 301

203 204 203

401 402 404

1 2 1

2 4 2

1 2 1

0 0 0

101 303 202

102 300 201

103 304 203

101 302 204

Page 68: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

68

Geometry of SVD in row space

• A as a collection of m row vectors (points) in the row space of A

• s1u1v1T is the best rank-1 matrix

approximation for A

• Geometrically: v1 is the direction of the best approximating rank-1 subspace that goes through origin

• s1u1 gives coordinates for row vectors in rank-1 subspace

• v1 Gives coordinates for row space basis vectors in rank-1 subspace

x

yv1

iA si iv u

I i iv v

Page 69: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

69

Geometry of SVD in row space

x

yv1

x

y

This line segment that goes through origin approximates the original data set

The projected data set approximates the original data set

x

y s1u1v1T

A

Page 70: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

70

Geometry of SVD in row space

• A as a collection of m row vectors (points) in the row space of A

• s1u1v1T + s2u2v2

T is the best rank-2 matrix approximation for A

• Geometrically: v1 and v2 are the directions of the best approximating rank-2 subspace that goes through origin

• s1u1 and s2u2 gives coordinates for row vectors in rank-2 subspace

• v1 and v2 gives coordinates for row space basis vectors in rank-2 subspace

x

yx’

y’

iA si iv u

I i iv v

Page 71: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

71

What about geometry of SVD in column space?

• A = USVT

• AT = VSUT

• The column space of A becomes the row space of AT

• The same as before, except that U and V are switched

Page 72: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

72

Geometry of SVD in row and column spaces

• Row space– siui gives coordinates for row vectors along unit

vector vi

– vi gives coordinates for row space basis vectors along unit vector vi

• Column space– sivi gives coordinates for column vectors along

unit vector ui

– ui gives coordinates for column space basis vectors along unit vector ui

• Along the directions vi and ui, these two spaces look pretty much the same!– Up to scale factors si

– Switch row/column vectors and row/column space basis vectors

– Biplot....

iA si iv u

I i iv v

TiA si iu v

I i iu u

Page 73: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

73

When is SVD = PCA?

• Centered data

x

y

x’y’

Page 74: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

74

When is SVD different from PCA?

x

yx’

y’

y’’x’’

Translation is not a linear operation, as it moves the origin !

PCA

SVD

Page 75: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Additional Points

• Time Complexity (Cubic)

• Application to text mining– Latent semantic indexing– sparse

75

A

Page 76: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

76

Conclusion

• SVD is the “absolute high point of linear algebra”• SVD is difficult to compute; but once we have it, we have

many things• SVD finds the best approximating subspace, using linear

transformation• Simple SVD cannot handle translation, non-linear

transformation, separation of labeled data, etc.• Good for exploratory analysis; but once we know what

we look for, use appropriate tools and model the structure of data explicitly!

Page 77: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 7777 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

Intuition on interpretation of SVD in terms of genes and

conditions77

Page 78: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

78

SVD for microarray data(Alter et al, PNAS 2000)

Page 79: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

79

Notation• m=1000 genes

– row-vectors

– 10 eigengene (vi) of dimension 10 conditions

• n=10 conditions (assays)– column vectors

– 10 eigenconditions (ui) of dimension 1000 genes

Page 80: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

80

Understanding Eigengenes (vi) in terms PCA on (large) gene-gene correlation matrix

Page 81: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

81

Understanding Eigenconditions (ui) in terms of PCA on (small) condition-condition correlation matrix

Bra - ket notation

Page 82: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

82

Close up on Eigengenes

Page 83: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

83Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Genes sorted by correlation with top 2 eigengenes

Page 84: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

84Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Normalized elutriation

expression in the subspace

associated with the cell cycle

Page 85: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

85

Plotting Experiments in Low Dimension Subspace

Page 86: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 8686 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

Weighted Gene Co-Expression Network

Page 87: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Weighted Gene Co-Expression Network Analysis

Bin Zhang and Steve Horvath (2005)

"A General Framework for Weighted Gene Co-Expression Network Analysis",

Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Art. 17. Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 88: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Central concept in network methodology:

• Modules: groups of densely interconnected genes

(not the same as closely related genes)

– a class of over-represented patterns

• Empirical fact: gene co-expression networks exhibit

modular structure

Network Modules

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 89: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Module Detection

• Numerous methods exist

• Many methods define a suitable gene-gene dissimilarity measure and use clustering.

• In our case: dissimilarity based on topological overlap

• Clustering method: Average linkage hierarchical clustering

– branches of the dendrogram are modulesAdapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 90: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Topological overlap measure,

TOM

• Pairwise measure by Ravasz et al, 2002

• TOM[i,j] measures the overlap of the set of nearest

neighbors of nodes i,j

• Closely related to twinness

• Easily generalized to weighted networks

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 91: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Calculating TOM

• Normalized to [0,1] with 0 = no overlap, 1 = perfect overlap• Generalized in Zhang and Horvath (2005) to the case of

weighted networks

1ij ijDistTOM TOM

TOMij=

∑ua

iua

uja

ij

min ki, k

j1−a

ij

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 92: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Example of module detection via hierarchical clustering

• Expression data from human brains, 18 samples.

Example of module detection via hierarchical clustering

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 93: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Module eigengenes

• Often: Would like to treat modules as single units

– Biologically motivated data reduction

• Construct a representative

• Our choice: module eigengene = 1st principal component of

the module expression matrix

• Intuitively: a kind of average expression profile

• Genes of each module must be highly correlated for a

representative to really represent

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54

Page 94: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

ExampleHuman brain expression data, 18

samples

Module consisting of 50 genes

Ada

pte

d fr

om :

htt

p://

ww

w.g

enet

ics.

ucla

.edu

/lab

s/ho

rvat

h/C

oexp

ress

ionN

etw

ork

Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54

Page 95: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Module eigengenes are very useful!

• Summarize each module in one synthetic expression profile

• Suitable representation in situations where modules are

considered the basic building blocks of a system

– Allow to relate modules to external information

(phenotypes, genotypes such as SNP, clinical traits) via

simple measures (correlation, mutual information etc)

– Can quantify co-expression relationships of various

modules by standard measures

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54

Page 96: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Human and chimpanzee brain

expression data

Construct gene expression networks in both sets, find

modules

Construct consensus modules

Characterize each module by brain region where it is most

differentially expressed

Represent each module by its eigengene

Characterize relationships among modules by correlation of

respective eigengenes (heatmap or dendrogram)

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 97: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Set modules

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 98: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Biological information?

Assign modules to brain regions with highest (positive) differential expression

Red means the module genes

are over-expressed in the

brain region;

green means under-

expression

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 99: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Visualizing consensus eigengene networks

Heatmap comparisons of module relationships

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 100: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

For more information

Weighted Gene Co-expression Networks website:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Page 101: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

A short methodological summary of the publications.• How to construct a gene co-expression network using the scale free topology criterion? Robustness of network results. Relating a gene

significance measure and the clustering coefficient to intramodular connectivity: – Zhang B, Horvath S (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in

Genetics and Molecular Biology: Vol. 4: No. 1, Article 17• Theory of module networks (both co-expression and protein-protein interaction modules):

– Dong J, Horvath S (2007) Understanding Network Concepts in Modules, BMC Systems Biology 2007, 1:24• What is the topological overlap measure? Empirical studies of the robustness of the topological overlap measure:

– Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 2007, 8:22

• Software for carrying out neighborhood analysis based on topological overlap. The paper shows that an initial seed neighborhood comprised of 2 or more highly interconnected genes (high TOM, high connectivity) yields superior results. It also shows that topological overlap is superior to correlation when dealing with expression data.

– Li A, Horvath S (2006) Network Neighborhood Analysis with the multi-node topological overlap measure. Bioinformatics. doi:10.1093/bioinformatics/btl581

• Gene screening based on intramodular connectivity identifies brain cancer genes that validate. This paper shows that WGCNA greatly alleviates the multiple comparison problem and leads to reproducible findings.

– Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu, Q, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006) "Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target", PNAS | November 14, 2006 | vol. 103 | no. 46 | 17402-17407

• The relationship between connectivity and knock-out essentiality is dependent on the module under consideration. Hub genes in some modules may be non-essential. This study shows that intramodular connectivity is much more meaningful than whole network connectivity:

– "Gene Connectivity, Function, and Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks" (2006) by Carlson MRJ, Zhang B, Fang Z, Mischel PS, Horvath S, and Nelson SF, BMC Genomics 2006, 7:40

• How to integrate SNP markers into weighted gene co-expression network analysis? The following 2 papers outline how SNP markers and co-expression networks can be used to screen for gene expressions underlying a complex trait. They also illustrate the use of the module eigengene based connectivity measure kME.

– Single network analysis: Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S (2006) "Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight". PLoS Genetics. Volume 2 | Issue 8 | AUGUST 2006

– Differential network analysis: Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) "Weighted Gene Co-expression Network Analysis Strategies Applied to Mouse Weight", Mammalian Genome. In Press

• The following application presents a `supervised’ gene co-expression network analysis. In general, we prefer to construct a co-expression network and associated modules without regard to an external microarray sample trait (unsupervised WGCNA). But if thousands of genes are differentially expressed, one can construct a network on the basis of differentially expressed genes (supervised WGCNA):

– Gargalovic PS, Imura M, Zhang B, Gharavi NM, Clark MJ, Pagnon J, Yang W, He A, Truong A, Patel S, Nelson SF, Horvath S, Berliner J, Kirchgessner T, Lusis AJ (2006) Identification of Inflammatory Gene Modules based on Variations of Human Endothelial Cell Responses to Oxidized Lipids. PNAS 22;103(34):12741-6

• The following paper presents a differential co-expression network analysis. It studies module preservation between two networks. By screening for genes with differential topological overlap, we identify biologically interesting genes. The paper also shows the value of summarizing a module by its module eigengene.

– Oldham M, Horvath S, Geschwind D (2006) Conservation and Evolution of Gene Co-expression Networks in Human and Chimpanzee Brains. 2006 Nov 21;103(47):17973-8

Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork

Page 102: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 102

102 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

Biplot

Page 103: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

103

Biplot• A biplot is a two-dimensional representation of a data matrix showing a point for each of the n

observation vectors (rows of the data matrix) along with a point for each of the p variables (columns of the data matrix).

– The prefix ‘bi’ refers to the two kinds of points; not to the dimensionality of the plot. The method presented here could, in fact, be generalized to a threedimensional (or higher-order) biplot. Biplots were introduced by Gabriel (1971) and have been discussed at length by Gower and Hand (1996). We applied the biplot procedure to the following toy data matrix to illustrate how a biplot can be generated and interpreted. See the figure on the next page.

• Here we have three variables (transcription factors) and ten observations (genomic bins). We can obtain a two-dimensional plot of the observations by plotting the first two principal components of the TF-TF correlation matrix R1.

– We can then add a representation of the three variables to the plot of principal components to obtain a biplot. This shows each of the genomic bins as points and the axes as linear combination of the factors.

• The great advantage of a biplot is that its components can be interpreted very easily. First, correlations among the variables are related to the angles between the lines, or more specifically, to the cosines of these angles. An acute angle between two lines (representing two TFs) indicates a positive correlation between the two corresponding variables, while obtuse angles indicate negative correlation.

– Angle of 0 or 180 degrees indicates perfect positive or negative correlation, respectively. A pair of orthogonal lines represents a correlation of zero. The distances between the points (representing genomic bins) correspond to the similarities between the observation profiles. Two observations that are relatively similar across all the variables will fall relatively close to each other within the two-dimensional space used for the biplot. The value or score for any observation on any variable is related to the perpendicular projection form the point to the line.

• Refs– Gabriel, K. R. (1971), “The Biplot Graphical Display of Matrices with Application to Principal Component Analysis,”

Biometrika, 58, 453–467.– Gower, J. C., and Hand, D. J. (1996), Biplots, London: Chapman & Hall.

Page 104: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Introduction• A biplot is a low-

dimensional (usually 2D) representation of a data matrix A.– A point for each of

the m observation vectors (rows of A)

– A line (or arrow) for each of the n variables (columns of A)

Page 105: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

PCA TFs: a, b, c...

Genomic Sites: 1,2,3...

AT

A

AAT (site-site correlation)

ATA (TF-TF corr.)

Page 106: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Biplot to Show Overall Relationship of TFs & Sites

TFs: a, b, c...

Genomic Sites: 1,2,3...

AT

A=USVT

AAT (site-site correlation)

ATA (TF-TF corr.)

Page 107: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Biplot Ex

A

A’A

Page 108: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Biplot Ex #2

AT A = V S2 VT

A AT = U S2 UT

A = (U Sr) (V S1−r) T

A vj = uj sj & AT uj = vj sj

Page 109: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

109

Biplot Ex #3iA si iv u

TiA si iu v

Assuming s=1,Av = uATu = v

Page 110: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Results of

BiplotZhang et al. (2007)

Gen. Res.

• Pilot ENCODE (1% genome): 5996 10 kb genomic bins (adding all hits) + 105 TF experiments biplot

• Angle between TF vectors shows relation b/w factors• Closeness of points gives clustering of "sites" • Projection of site onto vector gives degree to which

site is assoc. with a particular factor

Page 111: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Results of

BiplotZhang et al. (2007)

Gen. Res.

• Biplot groups TFs into sequence-specific and sequence-nonspecific clusters.

– c-Myc may behave more like a sequence-nonspecific TF.

– H3K27me3 functions in a transcriptional regulatory process in a rather sequence-specific manner.

• Genomic Bins are associated with different TFs and in this fashion each bin is "annotated" by closest TF cluster

Page 112: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 112

112 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Unsupervised Mining

CCA

Page 113: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

113

Sorcerer II Global Ocean Survey

Sorcerer II journey August 2003- January 2006

Sample approximately every 200 miles

Rusch, et al., PLOS Biology 2007

Page 114: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

114

Sorcerer II Global Ocean Survey

Metagenomic Sequence: 6.25 GB of data

0.1–0.8 μm size fraction (bacteria)

6.3 billion base pairs (7.7 million reads)

Reads were assembled and genes annotated

1 million CPU hours to process

Metadata

GPS coordinates, Sample Depth, Water Depth, Salinity, Temperature, Chlorophyll Content

Rusch, et al., PLOS Biology 2007

Metabolic

Pathways

Membrane Protein Families

Additional Metadata via GPS coordinates

Page 115: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

115

Nutrient Features Extracted:

Phosphate

Silicate

Nitrate

Apparent Oxygen Utilization

Dissolved Oxygen

Annual Phosphate [umol/l] at the surface

Sample Depth: 1 meter

Water Depth: 32 meters

Chlorophyll: 4.0 ug/kg  

Salinity: 31 psu

Temperature: 11 C

Location: 41°5'28"N, 71°36'8"W

Extracting Environmental Data from Other Sources

World Ocean Atlas 2005NOAA/NODC

Page 116: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

116

116 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Mapping Raw Metagenomic

Reads to a Matrix of Familes or

Pathways for each Site

Patel et. al., Genome Research 2010

Families Matrix

Page 117: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

117

117 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Expressing data as matrices indexed by site, env. var., and pathway usage

Pathway Sequences(Community Function) Environmental

Features

[Rusch et. al., (2007) PLOS Biology; Gianoulis et al., PNAS (in press, 2009]

Page 118: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

118

118 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Simple Relationships: Pairwise Simple Relationships: Pairwise CorrelationsCorrelations

[ G

iano

ulis

et

al.,

PN

AS

(in

pre

ss,

2009

) ]

Pathways

Page 119: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 119

119

- L

ectu

res.

Ger

stei

nL

ab.o

rg (c

) '0

9

Canonical Correlation Analysis: Simultaneous weighting

Lifestyle Index = a +b + c

Weight # km run/week Lifestyle Index Fit Index

Fit Index = a + b +c

Page 120: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 120

120

- L

ectu

res.

Ger

stei

nL

ab.o

rg (c

) '0

9

Canonical Correlation Analysis: Simultaneous weighting

Lifestyle Index = a +b + c

Weight # km run/week Lifestyle Index Fit Index

Fit Index = a + b +c

Metabolic Pathways/ Protein Families

EnvironmentalFeatures

Temp

Chlorophyll

etc Photosynthesis

Lipid Metabolism

etc

Page 121: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 121

121

- L

ectu

res.

Ger

stei

nL

ab.o

rg (c

) '0

9

CCA: Finding Variables CCA: Finding Variables with Large Projections in "Correlation Circle"with Large Projections in "Correlation Circle"

The goal of this technique is to interpret cross-variance matricesWe do this by defining a change of basis.

Gianoulis et al., PNAS 2009

Page 122: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Formal CCA def'n

• More intuitively, canonical correlation analysis can be defined as the problem of finding two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized.

[From M Borga, CCA Tutorial, http://www.imt.liu.se/~magnus/cca ]

Page 123: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

CCA resultsWe are defining a change of basis of the cross co-variance matrixWe want the correlations between the projections of the variables, X and Y, onto

the basis vectors to be mutually maximized.Eigenvalues squared canonical correlationsEigenvectors normalized canonical correlation basis vectors

Environment Family

Correlation = .3

Correlation= 1

This plot shows the correlations in the first and second dimensions

Correlation Circle: The closer the point is to the outer circle, the higher the correlation

Variables projected in the same direction are correlated

Page 124: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 124

124

- L

ectu

res.

Ger

stei

nL

ab.o

rg (c

) '0

9

Circuit MapCircuit Map

Environmentally invariant

Environmentally variant

Strength of Pathway co-variation with environment

Gianoulis et al., PNAS 2009

Page 125: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 125

125

- L

ectu

res.

Ger

stei

nL

ab.o

rg (c

) '0

9

Conclusion #1: energy conversion strategy, temp and depth

Gianoulis et al., PNAS 2009

Page 126: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 126

126

- L

ectu

res.

Ger

stei

nL

ab.o

rg (c

) '0

9

Conclusion #2: Outer Membrane components vary with the Conclusion #2: Outer Membrane components vary with the environmentenvironment

Gianoulis et al., PNAS 2009Patel et al. Genome Research 2010

Membrane proteins interact with the environment, transporting available nutrients, sensing environmental signals, and responding to changes

Page 127: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

127

Membrane Proteins: Sensing and Responding the Environment

Water depth

Acidity

App. O2 util.

Salinity

Pollution

Climate change Shipping

Phospahte

NitrateSilicate

Temperature

Chlorophyll

Dissolved O2

UV

Sample Depthvariant

invariant

Dimension 1

Dim

en

sio

n

2

Patel and Gianoulis et al., (in press) Genome Research

44 invariant membrane protein families

107 variant membrane protein families

• 2.3 million predicted membrane proteins

• 1.2 million unique• 850,000 mapped to 151

membrane protein COGs

20% have NO KNOWN FUNCTION

Page 128: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Membrane Proteins co-vary more than Metabolic Pathways

128

Median absolute structural Correlation Coefficient

Metabolic Pathways = 0.17

Dimension 1

Dim

en

sio

n

2

Acidity

App. O2 util.

Salinity

Pollution

Climate change Shipping

Phospahte

NitrateSilicate

Temperature

Chlorophyll

Dissolved O2

UV

Sample Depthvariant

invariant

Dimension 1

Dim

en

sio

n

2

Membrane Proteins = 0.3

Patel and Gianoulis et al., (in press) Genome Research

Page 129: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Acidity

App. O2 util.

Salinity

Pollution

Climate change Shipping

Phospahte

NitrateSilicate

Temperature

Chlorophyll

Dissolved O2

UV

Sample Depthvariant

invariant

Dimension 1

Dim

en

sio

n

2

129

CCA Limitations

Water depth

Both the strength and the

directionality of relationships between nodes is difficult to decipher in this format.

Acidity

App. O2 util.

Salinity

Pollution

Climate change Shipping

Phospahte

NitrateSilicate

Temperature

Chlorophyll

Dissolved O2

UV

Sample Depthvariant

invariant

Dimension 1

Dim

en

sio

n

2

Patel and Gianoulis et al., (in press) Genome Research

Page 130: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

130

Protein Families and Environmental Features Network (PEN)

cosbaba

Distance: Dot product between 1st and 2nd Dimension of CCA

Patel et. al., Genome Research 2010

Page 131: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

131

Protein Families and Environmental Features Network (PEN)

“Bi-modules”: groups of environmental features and membrane proteins families that are associated

UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network

COG0598, Magnesium Transporter

COG1176, Polyamine Transporter

Page 132: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

132

Bi-module 1: Phosphate/Phosphate Transporters

Low Phosphate, high affinity phosphate transporters which are induced during phosphate limitation

High Phosphate, low affinity inorganic phosphate ion transporter which are constitutively expressed

Patel et. al., Genome Research 2010

Page 133: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

133

Bi-module 2: Iron Transporters/Pollution/Shipping

Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron

Pollution and Shipping may be a proxy for iron concentrations

Patel et. al., Genome Research 2010

Page 134: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 134

134 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 135: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 135

135 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 136: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 136

136 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 137: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 137

137 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 138: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 138

138 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 139: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 139

139 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 140: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 140

140 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 141: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 141

141 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 142: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 142

142 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 143: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 143

143 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 144: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 144

144 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 145: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 145

145 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 146: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 146

146 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 147: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 147

147 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09

Page 148: Do not reproduce without permission 1 1 - Lectures.GersteinLab.org (c) '09 BIOINFORMATICS Databases & Datamining II Mark Gerstein, Yale University gersteinlab.org/courses/452

Do not reproduce without permission 148

148 -

Lec

ture

s.G

erst

ein

Lab

.org

(c)

'09