using bayesian networks to analyze expression data

Post on 04-Feb-2016


Using Bayesian Networks to Analyze Expression Data

N. Friedman, M. Linial, I. Nachman, D. Pe’er
Hebrew University, Jerusalem

Central Dogma

Gene -> (transcription) -> mRNA -> (translation) -> Protein

Cells express different subsets of their genes in different tissues and under different conditions

Microarrays (aka “DNA chips”)

New technological breakthrough:
  Measure RNA expression levels of thousands of genes in one experiment
  Measure expression on a genomic scale
  Opens up new experimental designs
Many major labs are using, or will use, this technology in the near future

The Problem

[Figure: expression matrix A, with experiments as rows (index i) and genes as columns (index j)]

A_ij - the mRNA level of gene j in experiment i

Goal:
  Learn regulatory/metabolic networks
  Identify causal sources of the biological phenomena of interest

Analysis Approaches

Clustering of expression data
  Groups together genes with similar expression patterns
  Does not reveal structural relations between genes
Boolean networks
  Deterministic models of the logical interactions between genes
  Deterministic, impractical for real data

Example: Cell-Cycle Data [Spellman et al.]

[Figure: clustered expression data, with labels "clusters" and "Cell cycle stages"]

Our Approach

Characterize statistical relationships between expression patterns of different genes
Beyond pair-wise interactions
  Many interactions are explained by intermediate factors
  Regulation involves combined effects of several gene-products
We build on the language of Bayesian networks

Network: Example

Noisy stochastic process
Example: Pedigree
  A node represents an individual's genotype

[Figure: pedigree network with nodes Homer, Marge, Bart, Lisa, Maggie]

Modeling assumption: Ancestors can affect descendants' genotypes only by passing genetic material through intermediate generations

Network Structure

Generalizing to DAGs:
  A child is conditionally independent of its non-descendants, given the value of its parents
  Often a natural assumption for causal processes, if we believe that we capture the relevant state of each intermediate stage

[Figure: DAG around node X, with nodes Y1 and Y2 and the labels Ancestor, Parent, Descendant, and Non-descendant (twice)]
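The independence statement can be read straight off a graph. The following is a minimal sketch that uses the pedigree nodes from the earlier example; the edges (Homer and Marge as parents of Bart, Lisa, and Maggie) and the helper names are assumptions made for illustration, not something the slides spell out.

```python
# Assumed pedigree edges, stored as child -> list of parents.
parents = {
    "Homer": [],
    "Marge": [],
    "Bart": ["Homer", "Marge"],
    "Lisa": ["Homer", "Marge"],
    "Maggie": ["Homer", "Marge"],
}

def descendants(node, parents):
    """Return every node reachable from `node` by following edges forward."""
    children = {v: [c for c, ps in parents.items() if v in ps] for v in parents}
    seen, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def independent_given_parents(node, parents):
    """Non-descendants (other than the parents) that `node` is independent of, given its parents."""
    excluded = descendants(node, parents) | set(parents[node]) | {node}
    return sorted(set(parents) - excluded)

bart_statement = independent_given_parents("Bart", parents)
homer_statement = independent_given_parents("Homer", parents)
```

Under these assumed edges, Bart's genotype is independent of his sisters' genotypes once both parents' genotypes are known, and Homer, who has no parents in the graph, is independent of Marge.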

Local Probabilities

Associated with each variable X_i is a conditional probability distribution P(X_i | Pa_i)

Discrete variables: multinomial distribution

  X    P(Y | X)
  0    0.9   0.1
  1    0.3   0.7

Continuous variables: a choice of model, for example linear Gaussian

[Figure: plot of P(Y | X) over X and Y]
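Both kinds of local model are easy to write down in code. This is a rough sketch only: the table values are the ones on the slide, but reading the rows as X = 0, 1 and the columns as Y = 0, 1 is an assumption, and the linear Gaussian coefficients are invented for illustration.

```python
import numpy as np

# Multinomial CPD for binary X -> Y, using the table values from the slide.
# Assumption: rows are X = 0, 1 and columns are Y = 0, 1.
cpt_y_given_x = np.array([[0.9, 0.1],
                          [0.3, 0.7]])

def sample_discrete(x, rng):
    """Draw Y from the row of the table selected by the parent value x."""
    return rng.choice(2, p=cpt_y_given_x[x])

def sample_linear_gaussian(x, rng, intercept=0.0, slope=1.5, sigma=0.5):
    """Draw Y ~ Normal(intercept + slope * x, sigma^2); the coefficients are illustrative."""
    return rng.normal(intercept + slope * x, sigma)

rng = np.random.default_rng(0)
discrete_draws = np.array([sample_discrete(1, rng) for _ in range(5000)])
gaussian_draws = np.array([sample_linear_gaussian(2.0, rng) for _ in range(5000)])
```

In the discrete case the parent value selects a row of the table; in the linear Gaussian case it shifts the mean of a normal distribution.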

Bayesian Network Semantics

Qualitative part (DAG specifies conditional independence statements)
+
Quantitative part (local probability models)
=
Unique joint distribution over domain

[Figure: DAG over the nodes E, R, B, A, C]

P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)

Compact & efficient representation:
  k parents: O(2^k n) vs. O(2^n) params
  parameters pertain to local interactions
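To make the saving concrete, here is a small sketch that builds the sparse factorization P(B) P(E) P(R|E) P(A|B,E) P(C|A) for binary variables, checks that it defines a proper joint distribution, and counts parameters. All probability values are hypothetical; only the parent sets come from the factorization above.

```python
from itertools import product

# Parent sets read off the sparse factorization P(B) P(E) P(R|E) P(A|B,E) P(C|A).
parents = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

# Hypothetical values of P(var = 1 | parent values); keys follow the parent order above.
p_one = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.01, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.8},
}

def joint(assignment):
    """Multiply the local conditional probabilities for one full assignment."""
    prob = 1.0
    for var, pa in parents.items():
        p1 = p_one[var][tuple(assignment[p] for p in pa)]
        prob *= p1 if assignment[var] == 1 else 1.0 - p1
    return prob

names = list(parents)
total = sum(joint(dict(zip(names, values)))
            for values in product([0, 1], repeat=len(names)))

# Independent parameters: one per parent configuration for each binary variable.
sparse_params = sum(2 ** len(pa) for pa in parents.values())  # 1 + 1 + 2 + 4 + 2
full_params = 2 ** len(names) - 1                             # unrestricted joint table
```

For five binary variables the sparse network needs 10 independent parameters, against 31 for the unrestricted joint; the gap grows exponentially with the number of variables.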

Why Bayesian Networks?

Bayesian Networks:
  Flexible representation of dependency structure of multivariate distributions
  Natural for modeling processes with local interactions
Learning of Bayesian Networks:
  Can learn dependencies from observations
  Handles stochastic processes:
    “true” stochastic behavior
    noise in measurements

Modeling Biological Regulation

Variables of interest:
  Expression levels of genes
  Concentration levels of proteins
  Exogenous variables: nutrient levels, metabolite levels, temperature, phenotype information, …
Bayesian Network Structure:
  Capture dependencies among these variables

Examples

Interactions are represented by a graph:
  Each gene is represented by a node in the graph
  Edges between the nodes represent direct dependency

Measured expression level of each gene: random variables
Gene interaction: probabilistic dependencies

[Figure: example graphs over the genes A, B, and X]

More Complex Examples

Dependencies can be mediated through other nodes
Common effects can imply conditional dependence

[Figure: three-node example graphs over A, B, and C, with the labels "Common cause" and "Intermediate gene"]
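The common-effect case is the least intuitive of these, so a quick simulation may help. The sketch below assumes a hypothetical structure A -> C <- B in which C is a noisy OR of two independent binary parents; the noise rate and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical common-effect structure A -> C <- B with binary variables.
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
noise = rng.random(n) < 0.1            # flip C in roughly 10% of the samples
c = (a | b) ^ noise                    # C is a noisy OR of its two parents

def corr(x, y):
    """Pearson correlation between two 0/1 arrays."""
    return float(np.corrcoef(x, y)[0, 1])

marginal = corr(a, b)                  # close to zero: A and B are independent
given_c = corr(a[c == 1], b[c == 1])   # clearly negative once C is known to be 1
```

A and B are uncorrelated overall, yet within the samples where C = 1 they become negatively correlated: knowing that one parent is active makes the other a less necessary explanation for C.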

Outline of Our Approach

Expression data -> Bayesian Network Learning Algorithm -> learned network

[Figure: learned network over the nodes E, R, B, A, C]

Use the learned network to make predictions about the structure of the interactions between genes

Learning With Many Variables

Sparse Candidate algorithm - efficient heuristic search that relies on sparseness
  Choose a candidate set of direct influences for each gene
  Find the optimal BN constrained on the candidates
  Iteratively improve the candidate set

[Figure: candidate parents versus parents in the BN]
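The first step, choosing candidates, can be sketched as follows. The slides do not say how candidates are scored; ranking by pairwise mutual information is an assumption here, and the function names and toy data are invented. The constrained search and the iterative refinement are not shown.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def candidate_parents(data, k):
    """For each column (gene), return the k other columns with the highest MI."""
    n_genes = data.shape[1]
    candidates = {}
    for j in range(n_genes):
        scores = [(mutual_information(data[:, i], data[:, j]), i)
                  for i in range(n_genes) if i != j]
        candidates[j] = [i for _, i in sorted(scores, reverse=True)[:k]]
    return candidates

# Toy data: gene 1 is a noisy copy of gene 0, gene 2 is unrelated.
rng = np.random.default_rng(0)
g0 = rng.integers(0, 3, size=500)
g1 = np.where(rng.random(500) < 0.9, g0, rng.integers(0, 3, size=500))
g2 = rng.integers(0, 3, size=500)
cands = candidate_parents(np.column_stack([g0, g1, g2]), k=1)
```

Restricting each gene to k candidates shrinks the space of parent sets the search has to consider, which is where the reliance on sparseness comes in.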

Experiment

Data from Spellman et al. (Mol. Biol. of the Cell, 1998)
Contains 76 samples covering the entire yeast genome:
  Different methods for synchronizing the cell cycle in yeast
  Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes

Methods

Experiment 1: discretized data into 3 levels
  Learn multinomial probabilities
Experiment 2: learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used

[Figure: log(ratio to control) axis split at -0.5 and 0.5 into the levels -, 0, +]
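The three-level discretization can be sketched in a few lines. The cut points -0.5 and 0.5 are the ones shown for log(ratio to control); the level codes -1, 0, +1 and the treatment of values that fall exactly on a cut point are assumptions of this sketch.

```python
import numpy as np

def discretize(log_ratios, low=-0.5, high=0.5):
    """Map log expression ratios to -1 (under-expressed), 0 (unchanged), +1 (over-expressed)."""
    # np.digitize returns 0, 1, or 2 for the three intervals; shift to -1, 0, +1.
    return np.digitize(log_ratios, [low, high]) - 1

levels = discretize(np.array([-1.2, -0.3, 0.0, 0.4, 0.9]))
```

With the default `right=False`, a value exactly equal to a cut point is assigned to the interval above it.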

Network Learned

Challenge: Statistical Significance

Sparse data
  Small number of samples
  “Flat posterior”: many networks fit the data
Solution: estimate confidence in network features
Two types of features
  Markov neighbors: X directly interacts with Y
  Order relations: X is an ancestor of Y

Confidence Estimates

Bootstrap approach [FGW, UAI99]

[Figure: the data set D is resampled into D1, D2, ..., Dm; a network G_i over the nodes E, R, B, A, C is learned from each D_i]

Estimate: $C(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$
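The bootstrap loop itself is short. In the sketch below the learner is a deliberately simple stand-in (it links strongly correlated columns) rather than a Bayesian network search, and all names and the toy data are assumptions; only the resample, learn, and average structure follows the slide.

```python
import numpy as np

def learn_edges(data, threshold=0.5):
    """Stand-in learner (NOT a Bayesian network search): link strongly correlated columns."""
    corr = np.corrcoef(data, rowvar=False)
    n = corr.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold}

def bootstrap_confidence(data, m, rng, learn=learn_edges):
    """Estimate C(f) = (1/m) * sum_i f(G_i), where f indicates that an edge appears in G_i."""
    counts = {}
    n_samples = data.shape[0]
    for _ in range(m):
        rows = rng.integers(0, n_samples, size=n_samples)  # resample rows with replacement
        for edge in learn(data[rows]):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / m for edge, c in counts.items()}

# Toy data: column 1 depends on column 0, column 2 is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=76)
data = np.column_stack([x, x + 0.3 * rng.normal(size=76), rng.normal(size=76)])
conf = bootstrap_confidence(data, m=200, rng=rng)
```

A feature that shows up in nearly every resampled network gets a confidence near 1, while one that depends on a few particular samples gets a low score. Any learner that returns a set of features can be passed in through the `learn` argument.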

Testing for Significance

Markov w/ Gaussian Models

We ran our procedure on randomized data in which we reshuffled the order of values for each gene

[Figure: two panels plotting the number of features with confidence above t against t (0.1 to 1), for random and real data]
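The randomization step can be sketched as an independent permutation of each gene's values across experiments. The layout (rows as experiments, columns as genes) and the function name are assumptions of this sketch.

```python
import numpy as np

def shuffle_each_gene(data, rng):
    """Permute every column (gene) independently across the rows (experiments)."""
    shuffled = data.copy()
    for j in range(shuffled.shape[1]):
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled

# Toy data: two strongly related genes.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
real = np.column_stack([x, x + 0.1 * rng.normal(size=500)])
random_data = shuffle_each_gene(real, rng)

real_corr = float(np.corrcoef(real, rowvar=False)[0, 1])
random_corr = float(np.corrcoef(random_data, rowvar=False)[0, 1])
```

Shuffling keeps each gene's own distribution of values but destroys the relationships between genes, so features found with high confidence in the shuffled data indicate how many to expect by chance.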

Testing for Significance

Markov w/ Multinomial Models

[Figure: two panels plotting the number of features with confidence above t against t (0.1 to 1), for random and real data]

Local Map

Finding Key Genes

Key gene: a gene that precedes many other genes

  YLR183C
  MCD1      Mitotic Chromosome Determinant
  RAD27     DNA repair protein
  CLN2      Role in cell cycle START
  SRO4      Involved in cellular polarization during budding
  YOX1      Homeodomain protein that binds leu-tRNA gene
  POL30     Required for DNA replication and repair
  YLR467W
  CDC5
  MSH6      Homolog of the human GTBP protein
  YML119W
  CLN1      Role in cell cycle START
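One simple way to turn "precedes many other genes" into a ranking is to count, for each gene, how many order relations it heads with high confidence. The sketch below is hypothetical: the threshold, the scoring rule, the placeholder gene names, and the confidence values are all made up for illustration and are not taken from the slides.

```python
import numpy as np

def dominance_counts(order_conf, threshold=0.8):
    """Count, for each gene, how many other genes it precedes with confidence above threshold."""
    above = order_conf > threshold
    np.fill_diagonal(above, False)  # a gene does not precede itself
    return above.sum(axis=1)

# Made-up confidences: entry [x, y] is the confidence that gene x is an ancestor of gene y.
genes = ["g0", "g1", "g2", "g3"]
order_conf = np.array([[0.0, 0.9, 0.85, 0.95],
                       [0.1, 0.0, 0.60, 0.90],
                       [0.0, 0.2, 0.00, 0.30],
                       [0.0, 0.0, 0.10, 0.00]])
counts = dominance_counts(order_conf)
ranked = [genes[i] for i in np.argsort(-counts, kind="stable")]
```

In this toy matrix the placeholder gene g0 precedes all three others with high confidence, so it ranks first.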

Strong Markov Relations

  YKL163W-PIR3    YKL164C-PIR1    Close location
  YKR013W-PRY2    YKR012C         Close location
  MCD1            MSH6            Bind to DNA during mitosis
  PHO11           PHO12           Acid phosphatases
  HHT1            HTB1            Histones
  FAR1            ASH1            Mating type switch, expression uncorrelated
  CLN2            SVS1            Unknown function - SVS1
  STE2            MFA2            Mating factor & receptor

Future Work

Finding suitable local distribution models
Temporal aspect - DBN
Correct handling of hidden variables
  Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
  Incorporate the large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
  Combine with cluster analysis
