
1

Gene Expression Analysis Using Bayesian Networks

Éric Paquet

LBIT Université de Montréal

2

Biological basis

DNA (storage of genetic information)

mRNA (storage and transport of genetic information)

Proteins (expression of genetic information)

RNA polymerase (copies DNA into RNA)

Ribosome (translates genetic information into proteins)

*-PDB file 1L3A, Transcriptional Regulator Pbf-2

3

Biological basis

How do proteins get regulated? Example: the lactose operon of E. coli.

Under normal conditions, E. coli uses glucose to get energy. But how does it react when there is no more glucose, only lactose?

4

Biological basis

[Figure: E. coli environment (glucose, lactose) and the lactose operon with RNA polymerase]

Polymerase action is blocked by a DNA lock: the protein associated with gene lacI.

Genes behind the lock: lactose decomposer (β-galactosidase) and lactose getter (permease).

5

Biological basis

[Figure: E. coli environment (glucose, lactose) and the lactose operon with RNA polymerase]

Lactose recruits the protein associated with gene lacI, unlocking the DNA, which is then accessible to the polymerase.

6

Biological basis

6

= inhibit



7

Biological basis

[Figure: E. coli environment (glucose, lactose) and the lactose operon with RNA polymerase, CAP and c-AMP]

In the absence of glucose, a polymerase magnet (CAP with c-AMP) binds to the DNA and speeds up production of the proteins that help decompose lactose.

8

Biological basis

8



= inhibit

= activate

Research goal: infer these links

9

Why?

• Get insights about cellular processes
• Help understand diseases
• Find drug targets

10

How?

Using gene expression data and tools for learning Bayesian networks

[Figure: gene expression data* (axes: Experiments, [mRNA]) + tools for learning Bayesian networks]

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

11

What is gene expression data?

Data showing the concentration of a specific mRNA at a given time in the life of the cell.

Each spot yields one real value that tells whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value.

[Figure: expression data* (axes: Experiments, [mRNA])] Each column is the result of one image.

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

12

What are Bayesian networks?

A graphical representation of a joint distribution over a set of random variables.

A B

C D

E

P(A,B,C,D,E) = P(A)*P(B) *P(C|A)*P(D|A,B) *P(E|D)

Nodes represent gene expression levels, while edges encode the interactions (e.g., inhibition, activation)
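As a minimal sketch of the factorization P(A,B,C,D,E) = P(A)*P(B)*P(C|A)*P(D|A,B)*P(E|D), the snippet below evaluates the joint for binary variables. All probability values are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables (0/1).
p_a = {0: 0.7, 1: 0.3}                      # P(A=a)
p_b = {0: 0.4, 1: 0.6}                      # P(B=b)
p_c1_given_a = {0: 0.2, 1: 0.9}             # P(C=1 | A=a)
p_d1_given_ab = {(0, 0): 0.1, (0, 1): 0.5,  # P(D=1 | A=a, B=b)
                 (1, 0): 0.4, (1, 1): 0.8}
p_e1_given_d = {0: 0.25, 1: 0.75}           # P(E=1 | D=d)


def bernoulli(p_one, value):
    """Probability of `value` when P(value = 1) is `p_one`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) as the product of the local terms of the network."""
    return (p_a[a] * p_b[b]
            * bernoulli(p_c1_given_a[a], c)
            * bernoulli(p_d1_given_ab[(a, b)], d)
            * bernoulli(p_e1_given_d[d], e))


# Summing the joint over all 2**5 assignments should give 1.
total = sum(joint(*assignment) for assignment in product((0, 1), repeat=5))
```

The network needs only 1 + 1 + 2 + 4 + 2 = 10 free parameters here, instead of the 31 a full joint table over five binary variables would require.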

13

A small problem with Bayesian networks

A Bayesian network must be a DAG (Directed Acyclic Graph), but many regulatory networks contain directed cycles.*

[Figure panels: hysteretic oscillator, switch, transcription factor dimer]

*-Husmeier D., Bioinformatics, Vol. 19 no. 17, 2003, pages 2271–2282

14

How can we deal with that?

Using DBNs (Dynamic Bayesian Networks*) and sequential gene expression data

We unfold the network in time:

[Figure: a network over A and B unfolded into two time slices, with A1 and B1 at time t and A2 and B2 at time t+1]

DBN = BN with constraints on parent and child nodes

*-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks
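The unfolding in time can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names `unroll` and `has_cycle` are hypothetical, and the two-node feedback loop between A and B is an assumed example.

```python
def unroll(edges, n_slices=2):
    """Rewrite each edge X -> Y as (X, t) -> (Y, t + 1) for consecutive slices."""
    return [((src, t), (dst, t + 1))
            for t in range(1, n_slices)
            for src, dst in edges]


def has_cycle(edges):
    """Detect a directed cycle with a depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {node: 0 for node in graph}  # 0 = unseen, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[node] == 0 and visit(node) for node in graph)


# Assumed example: A regulates B and B regulates A (a directed cycle).
cyclic = [("A", "B"), ("B", "A")]
unrolled = unroll(cyclic)  # edges (A,1) -> (B,2) and (B,1) -> (A,2)
```

Because every edge points from slice t to slice t+1, the unrolled graph cannot contain a directed cycle, even though the original network does.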

15

What are we searching for?

A Bayesian network that is most probable given the data D (gene expression)

We find this BN as follows: BN* = argmax_BN {P(BN|D)}

Where:

P(BN \mid D) = \frac{P(D \mid BN)\, P(BN)}{P(D)}

P(D|BN): marginal likelihood
P(BN): prior on network structure
P(D): data probability

Naïve approach to the problem: try all possible DAGs and keep the best one!

16

It is impossible to try all possible DAGs because

The number of DAGs increases super-exponentially with the number of nodes:

n = 3 → 25 DAGs
n = 4 → 543 DAGs
n = 5 → 29281 DAGs
n = 6 → 3781503 DAGs
n = 7 → 1138779265 DAGs
n = 8 → 783702329343 DAGs
…

We are interested in problems with around 60 nodes…
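These counts can be reproduced with Robinson's recurrence for the number of labelled DAGs, a(n) = Σ_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), with a(0) = 1. The recurrence is not named on the slide; the sketch below is only a way to check the figures, and `count_dags` is a hypothetical helper name.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


# Number of decimal digits of the count for a 60-node problem.
digits_for_60_nodes = len(str(count_dags(60)))
```

Python integers have arbitrary precision, so `count_dags(60)` is computed exactly; its size makes clear why exhaustive enumeration is not an option.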

17

Learning Bayesian Networks from data?

Choose a method for searching the space of networks and a representation for the conditional distributions

• Network space search methods (basically add, remove and reverse edges)
  • Greedy hill-climbing
  • Beam search
  • Stochastic hill-climbing
  • Simulated annealing
  • MCMC simulation

• Conditional distribution representation
  • Linear Gaussian
  • Multinomial, binomial

[Figure: network in which C has parents A and B, with unknown parameters P(a) = ?, P(b) = ?, P(c|a,b) = ?]


19

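We use three gene expression levels

1. Sort the values:

-1.06 -0.12 0.18 0.21 1.16 1.19

2. Split the sorted data into 3 equal buckets, labelled 0, 1 and 2:

-1.06 -0.12 | 0.18 0.21 | 1.16 1.19
     0      |     1     |     2

3. Replace each value by its bucket label, keeping the measurements in their original order:

0 0 2 2 1 1 (discretized data)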
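A minimal sketch of this equal-frequency discretization follows. The slide only shows the sorted values, so the original order of the measurements used below is an assumption, picked so that the output matches the discretized data shown; `discretize` is a hypothetical helper name.

```python
def discretize(values, n_levels=3):
    """Equal-frequency binning: rank the values, then cut the ranks into n_levels buckets."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, index in enumerate(order):
        # Integer arithmetic maps ranks 0..len-1 onto levels 0..n_levels-1.
        levels[index] = rank * n_levels // len(values)
    return levels


# Assumed original (unsorted) order of the six measurements.
measurements = [-1.06, -0.12, 1.16, 1.19, 0.18, 0.21]
discretized = discretize(measurements)
```

Ties and sample counts that are not a multiple of three would need an explicit policy; this sketch simply follows the rank order.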

20

Back to:

P(BN \mid D) = \frac{P(D \mid BN)\, P(BN)}{P(D)}

P(D|BN): marginal likelihood
P(BN): prior on network structure
P(D): data probability

21

Insight into each term

P(BN) → prior on the network

• In our research, we always use a prior equal to 1
• We could use it to incorporate knowledge. E.g., we know that an edge is present: if the edge is in the BN, P(BN) = 1, else P(BN) = 0

Efforts are made to reduce the search space by using knowledge, e.g., limiting the number of parents or children

22

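Insight into each term

P(D|BN) → marginal likelihood

Easy to calculate using a multinomial distribution with a Dirichlet prior:*

P(d \mid bn) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N_{ij})}{\Gamma(N_{ij} + M_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk} + s_{ijk})}{\Gamma(a_{ijk})}

Here n is the number of nodes, q_i the number of parent configurations of node i, r_i the number of values of node i, a_{ijk} the Dirichlet parameters, s_{ijk} the corresponding counts in the data, N_{ij} = \sum_k a_{ijk} and M_{ij} = \sum_k s_{ijk}.

*-Heckerman, A Tutorial on Learning With Bayesian Networks, and Neapolitan, Learning Bayesian Networks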
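The marginal likelihood of multinomial data under a Dirichlet prior is usually computed in log space to avoid overflow in the Gamma function. The sketch below assumes a single uniform Dirichlet parameter for every cell (the `prior` argument) and takes the counts as nested lists indexed by parent configuration j and node value k. Both choices, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
from math import lgamma


def log_marginal_likelihood_node(counts, prior=1.0):
    """Log of one node's factor: counts[j][k] is how often the node took
    value k while its parents were in configuration j."""
    total = 0.0
    for row in counts:
        prior_sum = prior * len(row)  # sum of the Dirichlet parameters for this j
        count_sum = sum(row)          # sum of the observed counts for this j
        total += lgamma(prior_sum) - lgamma(prior_sum + count_sum)
        for count in row:
            total += lgamma(prior + count) - lgamma(prior)
    return total


def log_marginal_likelihood(counts_per_node, prior=1.0):
    """Log P(D | BN): sum of the per-node log factors."""
    return sum(log_marginal_likelihood_node(counts, prior)
               for counts in counts_per_node)


# One binary node without parents, observed once as 0 and once as 1.
example = log_marginal_likelihood_node([[1, 1]])  # equals log(1/6)
```

Because the score decomposes over nodes, a move that changes the parents of one node only requires recomputing that node's factor.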

23

MCMC (Markov Chain Monte Carlo) simulation

Markov chain part: zoom on a node of the chain

[Figure: the current network over A, B and C, surrounded by six neighbouring networks. Five of them are proposed with P(BNnew) = 1/5 each, and one has P(BNnew) = 0.]

24

MCMC (Markov Chain Monte Carlo) simulation

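Monte Carlo part:

• Choose the next BN with probability P(BNnew)
• Accept the new BN with the following Metropolis–Hastings acceptance criterion:

P_{MH} = \min\left\{1,\ \frac{P(BN_{new} \mid D)}{P(BN_{old} \mid D)} \cdot R\right\}

       = \min\left\{1,\ \frac{P(D \mid BN_{new})\, P(BN_{new}) \,/\, P(D)}{P(D \mid BN_{old})\, P(BN_{old}) \,/\, P(D)} \cdot R\right\}

       = \min\left\{1,\ \frac{P(D \mid BN_{new})\, P(BN_{new})}{P(D \mid BN_{old})\, P(BN_{old})} \cdot R\right\}    (P(D) is gone!)

Inside the fractions, P(BN) is the prior on the network structure. R is the ratio of the proposal probabilities: that of the reverse move (BNnew to BNold) over that of the forward move (BNold to BNnew).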

25

Monte Carlo part example:

[Figure: the current network over A, B and C and its neighbouring networks. Five of them are proposed with P(BNnew) = 1/5 each, and one has P(BNnew) = 0.]

1. Choose a random path, each path having a P(BNnew) of 1/5
2. Choose another random number. If it is smaller than the Metropolis–Hastings criterion, accept BNnew, else return to BNold

26

MCMC (Markov Chain Monte Carlo) simulation recap:

• Choose a starting BN at random
• Burn-in phase (generate 5*N BNs from the MCMC without storing them)
• Storing phase (get 100*N BN structures from the MCMC)

[Figure: log(P(D|BN)P(BN)) versus iteration, with the burn-in phase and the storing phase marked]
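The recap can be sketched as a generic Metropolis–Hastings loop with a burn-in phase followed by a storing phase. This is a sketch under stated assumptions, not the authors' implementation: proposals are drawn uniformly among the neighbours of the current state, `log_score` stands in for log(P(D|BN)P(BN)), and the four-state toy problem at the end replaces a real space of networks.

```python
import math
import random


def mcmc_sample(start, neighbours, log_score, n_burn, n_keep, rng):
    """Metropolis-Hastings with uniform proposals among neighbouring states."""
    current = start
    samples = []
    for step in range(n_burn + n_keep):
        options = neighbours(current)
        candidate = rng.choice(options)
        # Uniform proposals: forward probability 1/len(options),
        # reverse probability 1/len(neighbours(candidate)).
        log_ratio = (log_score(candidate) - log_score(current)
                     + math.log(len(options))
                     - math.log(len(neighbours(candidate))))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = candidate      # accept the move
        if step >= n_burn:           # storing phase
            samples.append(current)
    return samples


# Toy stand-in for the space of networks: four states on a ring.
weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 10.0}


def ring_neighbours(state):
    return [(state - 1) % 4, (state + 1) % 4]


def toy_log_score(state):
    return math.log(weights[state])


N = 12
samples = mcmc_sample(0, ring_neighbours, toy_log_score,
                      n_burn=5 * N, n_keep=100 * N, rng=random.Random(0))
```

For real structure learning, the states would be DAGs, `neighbours` would list the networks reachable by adding, removing or reversing one edge without creating a cycle, and `log_score` would be the log marginal likelihood plus the log prior.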

27

Why 100*N BNs and not only 1?

• Because we don't have enough data, and there are a lot of high-scoring networks
• Instead, we associate a confidence with each edge. E.g., in how many of the sampled networks do we find an edge going from A to B?
• We can fix a threshold on the confidence and build a global network from the edges whose confidence is over the threshold
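Edge confidence and thresholding can be sketched in a few lines. The representation of a sampled network as a set of (parent, child) pairs, the four toy samples, and the function names are assumptions made for illustration.

```python
from collections import Counter


def edge_confidence(sampled_networks):
    """Fraction of sampled networks that contain each directed edge."""
    counts = Counter(edge for network in sampled_networks for edge in set(network))
    return {edge: count / len(sampled_networks) for edge, count in counts.items()}


def consensus_network(confidence, threshold):
    """Keep the edges whose confidence is over the threshold."""
    return {edge for edge, value in confidence.items() if value > threshold}


# Four hypothetical sampled networks over the genes A, B and C.
sampled = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("B", "C")},
]
confidence = edge_confidence(sampled)
consensus = consensus_network(confidence, threshold=0.4)
```

Here the edge from A to B appears in 3 of the 4 samples (confidence 0.75), so it survives the threshold, while the edge from C to B (confidence 0.25) does not.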

28

What we are working on:

Mixing sequential and non-sequential data to retrieve interesting relations between genes

How?

Using DBN and MCMC for sequential data + BN and MCMC for non-sequential data

[Figure: 100*N networks from DBN and 100*N networks from BN go into the Information tuner, followed by "Learn network"]

29

How to test the approach:

Problem: there is no way to test it on real data, because no network is completely known

Solution: work on realistic simulations where we know the network structure

Example:*

[Figure: known network with nodes 0 to 13, used to simulate data]

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

30

How to test the approach:

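[Figure: the known network* (nodes 0 to 13) is used to simulate sequential and non-sequential data. DBN MCMC runs on the sequential data and BN MCMC on the non-sequential data. The Infotuner combines both into a learned network over the same nodes, which is compared with the known network using ROC curves.]

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"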

31

Test description:

• Generate 60 sequential data points
• Generate 120 non-sequential data points (~ the proportion found in reality)
• Run DBN MCMC on the sequential data, keep 100*N sampled networks
• Run BN MCMC on the non-sequential data, keep 100*N sampled networks
• Test performance using weights on the samples:

0 BN, 1 DBN
0.05 BN, 0.95 DBN
…
0.95 BN, 0.05 DBN
1 BN, 0 DBN

The metric used is the area under the ROC curve. A perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
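The area under the ROC curve can be computed as the probability that a true edge receives a higher confidence than a false one, with ties counting one half. The sketch below uses that rank-based definition. The `mix` helper is purely an assumption about how the BN and DBN weights might be applied (a weighted average of edge confidences); the slides do not specify the mixing rule.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons of true and false edges."""
    positives = [s for s, is_true in zip(scores, labels) if is_true]
    negatives = [s for s, is_true in zip(scores, labels) if not is_true]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


def mix(conf_bn, conf_dbn, weight_bn):
    """Hypothetical mixing rule: weighted average of BN and DBN edge confidences."""
    return [weight_bn * b + (1.0 - weight_bn) * d
            for b, d in zip(conf_bn, conf_dbn)]


# Toy check of the three reference values: perfect, worst and random.
true_edges = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], true_edges)
worst = roc_auc([0.1, 0.2, 0.8, 0.9], true_edges)
uninformative = roc_auc([0.5, 0.5, 0.5, 0.5], true_edges)
```

The pairwise form is quadratic in the number of candidate edges, which is acceptable for networks of a few dozen nodes; a sort-based version would scale better.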

32

Results:

[Figure: area under the ROC curve (0 to 1) versus the weighting of BN and DBN samples (axis labelled 0 BN, 1 DBN)]

33

Perspective:

• Working on more sophisticated ways to mix sequential and non-sequential data
• Working on real cases:
  • Yeast cell cycle
  • Arabidopsis thaliana circadian rhythm
• Real data also means missing values
  • Evaluate solutions for missing values (EM, KNNImpute)

34

Acknowledgements:

François Major

http://www.precarn.ca/IRIS/PrecarnUnivLedPrgm/

35

Why are there missing data?

• Low correlation
• Experimental problems

36

ROC Curve

Receiver Operating Characteristic curve*

*-http://gim.unmc.edu/dxtests/roc2.htm

37

MCMC simulation and number of sampled networks

[Figure: ROC curve area as a function of the number of networks sampled from the MCMC simulation, for N = 12. x-axis: # of samples from MCMC (500 to 5000); y-axis: ROC area (0.86 to 0.895)]
