Searching, sampling and counting in RNA and gene networks


Post on 14-Dec-2014


Page 1: Searching, sampling and counting in RNA and gene networks

Enumerating, Sampling and Counting: some illustrative cases in biology

- Enumerating discrete structures

- Sampling and searching

- Nested sampling for counting

Olivier Martin

Laboratoire de Physique Théorique et Modèles Statistiques et UMR de Génétique Végétale

University of Paris-Sud

Page 2: Searching, sampling and counting in RNA and gene networks

[ I ]: Enumerating Discrete Structures

Illustrative case: trees describing pedigrees

O. Martin and F. Hospital, Genetics (2004)

In a breeding program, one wants to (optimally) cross a collection of “parents” to produce an ideal genome, but the mixing of the genes (Mendelian genetics) is probabilistic and depends on their mutual distances.

General framework: each individual in the parental population has one good gene (resistance to one disease) and the “ideotype” must accumulate all these into one genome. The crossing of 2 parents should pass on their good genes to at least one offspring.

Page 3: Searching, sampling and counting in RNA and gene networks

Transmission of genes

H(1)(2): s1 = 1, s2 = 2, s = 1,2

H(3)(4): s1 = 3, s2 = 4, s = 3,4

H(12)(34): s1 = 1,2, s2 = 3,4, s = 1,2,3,4

We impose that a gamete cumulates all the good genes of the 2 chromosomes of its parent.

Page 4: Searching, sampling and counting in RNA and gene networks

Example of a simple pedigree

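[Figure: pedigree leading from the founder parents P1, P2, P3 to the ideotype I*. Labels in the figure: Gene 1, Gene 2, Gene 3; 1 2; 1 2 3; Fixation Steps]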

P1, P2, P3: founder parents

I* : Ideotype

Page 5: Searching, sampling and counting in RNA and gene networks

Pedigrees differ by:

- A tree structure

- The choice of parents

Representation of a pedigree

[Figure: three example pedigrees on the parents P1, P2, P3, P4, with leaves in the orders P1 P2 P3 P4, P1 P2 P3 P4 and P1 P4 P2 P3]

Page 6: Searching, sampling and counting in RNA and gene networks

Particular cases of pedigrees

Regular pyramid: min height = log2(n) = 3 (here n = 8)

Cascade: max height = n - 1 = 7 (here n = 8)

Page 7: Searching, sampling and counting in RNA and gene networks

Pedigree = binary leaf-labeled tree

[Figure: tree whose leaves P1 P2 P3 P4 P5 P6 are at level 0; the nodes are H(1)(2), H(3)(4), H(5)(6), H(12)(34) and H(1234)(56), spread over levels 1 to 3]

Page 8: Searching, sampling and counting in RNA and gene networks

Questions

How to count the number of distinct pedigrees?

How to enumerate them by computer for further use?

How to sample them uniformly?

How to find the « optimal » pedigree given that each pedigree has a cost?

Page 9: Searching, sampling and counting in RNA and gene networks

Counting the number of pedigrees

n      3    4    5     8        10           20
A(n)   3    15   105   135135   3.4 x 10^7   8.2 x 10^21

For n genes, one has A(n) = (2n - 3)!! pedigrees (by recurrence equations)
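The table can be checked with a few lines of Python. This is a minimal sketch, assuming the recurrence A(n) = (2n - 3) A(n - 1) with A(1) = A(2) = 1, which is one standard way to obtain the double factorial (2n - 3)!! for rooted binary leaf-labeled trees; the slides do not say which recurrence was used. The function name `count_pedigrees` is introduced here for illustration.

```python
def count_pedigrees(n):
    """Number of pedigrees for n genes, A(n) = (2n - 3)!!.

    Assumed recurrence: A(1) = 1 and A(k) = (2k - 3) * A(k - 1) for k >= 2.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    for k in range(2, n + 1):
        count *= 2 * k - 3
    return count


if __name__ == "__main__":
    # Same values of n as in the table above.
    for n in (3, 4, 5, 8, 10, 20):
        print(n, count_pedigrees(n))
```

Python integers have arbitrary precision, so the n = 20 value is computed exactly rather than as a floating-point approximation.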

Page 10: Searching, sampling and counting in RNA and gene networks

Enumeration of all pedigrees

[Figure: a pedigree split into two sub-pedigrees, labeled "p genes" and "n-p genes"]

A pedigree cumulating n genes

One fuses two sub-pedigrees:

- one cumulating p genes

- one cumulating (n-p) genes

Page 11: Searching, sampling and counting in RNA and gene networks

An algorithm for constructing all pedigrees

Suppose all sub-pedigrees of height at most h are known; one can generate all those of height h+1:

- Examine all pairs of sub-pedigrees {P1,P2} of heights h1 = h and h2 ≤ h

- If P1 and P2 have no good gene in common, fuse them to form a sub-pedigree P of height (h+1)

- If P cumulates all good genes, keep it as a complete pedigree; otherwise add it to the list of sub-pedigrees of height h+1

Repeat for the next height until h+1 = n-1
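The height-by-height construction can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the representation of a sub-pedigree as a pair (gene set, tree), the use of nested `frozenset` objects for unordered trees, and the name `enumerate_pedigrees` are all assumptions made here.

```python
def enumerate_pedigrees(n):
    """Return all complete pedigrees on genes 1..n as nested frozensets.

    A leaf is an int (founder parent); an internal node is a frozenset of
    its two children, so the two sides of a cross are unordered.
    """
    if n == 1:
        return [1]
    full = frozenset(range(1, n + 1))
    # by_height[h] holds the incomplete sub-pedigrees of height h.
    by_height = [[(frozenset([i]), i) for i in range(1, n + 1)]]
    complete = []
    for h in range(n - 1):                      # builds heights 1 .. n-1
        top = by_height[h]                      # sub-pedigrees with h1 = h
        lower = [p for level in by_height for p in level]  # h2 <= h
        new_level, seen = [], set()
        for genes1, tree1 in top:
            for genes2, tree2 in lower:
                if genes1 & genes2:             # a good gene in common
                    continue
                tree = frozenset((tree1, tree2))
                if tree in seen:                # same pair met in the other order
                    continue
                seen.add(tree)
                genes = genes1 | genes2
                if genes == full:
                    complete.append(tree)       # cumulates all good genes
                else:
                    new_level.append((genes, tree))
        by_height.append(new_level)
    return complete


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, len(enumerate_pedigrees(n)))
```

For small n the number of pedigrees returned can be compared with A(n) = (2n - 3)!!, which gives a simple consistency check of the enumeration.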

Page 12: Searching, sampling and counting in RNA and gene networks

Working of the algorithm

[Figure sequence, pages 12 to 18: the sub-pedigrees are built height by height, starting from h=0, then h=1, h=2, etc., up to h=3]

Page 19: Searching, sampling and counting in RNA and gene networks

Example: cascade with 4 genes

Page 20: Searching, sampling and counting in RNA and gene networks

Optimal pedigrees: search by pruning the enumeration (branch and bound)

Of all the ways to produce a given combination of good genes, keep only the best sub-pedigree.

- Enumeration: one can treat up to 14 genes

- Branch and bound: up to 22 genes

Case of « adjacent » cascades: dynamic programming determines the optimal pedigree in O(n^2) operations
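The rule "for each combination of good genes, keep only the best sub-pedigree" can be illustrated with a small dynamic program over gene subsets. This is only a sketch under strong assumptions: the cost of a pedigree is taken to be additive over its crosses, and `cross_cost` is a made-up toy cost (the number of genes cumulated by the offspring), not the cost used in the Genetics (2004) paper. It is neither the branch and bound nor the O(n^2) method for adjacent cascades mentioned above.

```python
def cross_cost(genes_a, genes_b):
    """Hypothetical cost of one cross: number of genes in the offspring."""
    return bin(genes_a | genes_b).count("1")


def best_pedigree_cost(n, cost=cross_cost):
    """Lowest total cost of a pedigree cumulating genes 0..n-1.

    Gene sets are bitmasks. best[s] is the cost of the best sub-pedigree
    producing exactly the gene set s; all worse ones are discarded.
    """
    best = {1 << i: 0 for i in range(n)}        # founder parents cost nothing
    for genes in range(1, 1 << n):
        if genes in best:
            continue
        lowest = genes & -genes                 # keep this gene on the left side
        rest = genes ^ lowest                   # so each split is seen once
        candidates = []
        sub = rest
        while sub:
            sub = (sub - 1) & rest              # proper subsets of rest, down to 0
            left = lowest | sub
            right = genes ^ left
            candidates.append(best[left] + best[right] + cost(left, right))
        best[genes] = min(candidates)
    return best[(1 << n) - 1]


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, best_pedigree_cost(n))
```

The table has 2^n entries and the loop visits on the order of 3^n splits, far fewer than the (2n - 3)!! pedigrees, which is the gain obtained by discarding dominated sub-pedigrees.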

Page 21: Searching, sampling and counting in RNA and gene networks

[ II ]: Sampling and searching

This problem is ubiquitous:

- Physics: equilibrium configurations

- Operations research: feasible solutions of CSP

- Statistics: estimating p-values

Page 22: Searching, sampling and counting in RNA and gene networks

The royal road: Markov Chain Monte Carlo

To obtain samples with a given probability distribution or measure, use the Metropolis algorithm (1953)

Simple, very effective if no bottlenecks

If the measure is fragmented, one needs large « moves », but these almost always fail
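As a reminder of how the Metropolis rule works, here is a minimal sketch for a discrete state space with a symmetric proposal. The target (weights proportional to x + 1 on a ring of 5 states) and the names `metropolis` and `ring_step` are toy assumptions chosen for illustration; they do not come from the talk.

```python
import random


def metropolis(weight, propose, x0, n_steps, rng):
    """Sample states with probability proportional to weight(x).

    `propose` must be symmetric: proposing y from x is as likely as
    proposing x from y. `weight` must be strictly positive.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        y = propose(x, rng)
        # Accept with probability min(1, weight(y) / weight(x)).
        if rng.random() < min(1.0, weight(y) / weight(x)):
            x = y
        samples.append(x)                       # a rejected move repeats x
    return samples


def ring_step(x, rng):
    """Symmetric proposal: one step left or right on a ring of 5 states."""
    return (x + rng.choice((-1, 1))) % 5


if __name__ == "__main__":
    chain = metropolis(lambda x: x + 1.0, ring_step, 0, 200_000, random.Random(1))
    for state in range(5):
        print(state, chain.count(state) / len(chain), (state + 1) / 15)
```

With single-step moves the chain can only cross regions of non-zero weight, which is why a fragmented measure defeats this simple scheme.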

Page 23: Searching, sampling and counting in RNA and gene networks

The case of biological networks: some computational challenges

(1) Generate a genotype of given phenotype (oriented search)

(2) Sample uniformly genotypes of a given phenotype: use symmetries to reduce exponentially the space size

(3) Determine the connectivity of the neutral network: do guided search to go from one random genotype to another

(4) Sample uniformly a connected component of the neutral network: use random walks

(5) Sample uniformly the surface of a “ball” around a point: use Metropolis with asymmetric rates

(6) Get the infinite population limit of a population under Darwinian selection: use variance reduction and 1/N extrapolation

Page 24: Searching, sampling and counting in RNA and gene networks

Viable genotypes are rare

S. Ciliberti, O. Martin and A. Wagner, PLoS Comp. Bio. (2007)

If one allows for M interactions (M non-zero entries of W) between N genes, what fraction of the genotypes (regulatory networks) are viable? By smart sampling:

[Figure: fraction of viable networks (logarithmic scale, from 5E-15 to 0.5) as a function of N (from 4 to 20). Illustration when M = 0.25 N^2]

Page 25: Searching, sampling and counting in RNA and gene networks

Showing connectivity properties of biological networks

We want to check with a high level of confidence that a certain space S is connected. We do this in three steps:

- Use the Metropolis MC algorithm to produce random pairs of points (P1,P2) in the space S

- Generate an “equilibrium” cloud of points in S around P1 by a biased Monte Carlo and store these

- Produce a MC chain of points in S, starting from P2, using for instance the same Monte Carlo rule as above; check for collisions with the stored set. If a collision arises, P1 is connected to P2
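The last two steps can be sketched as follows for a toy space of bit strings. Everything specific here is an assumption made for illustration: points are tuples of 0/1, moves are single-bit flips, the bias is a Metropolis weight exp(-beta × Hamming distance to P1), and the viability tests in the usage example are invented. The first step (drawing the pair P1, P2) is left out.

```python
import math
import random


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def biased_walk(start, target, viable, n_steps, beta, rng):
    """Metropolis walk inside S by single-bit flips, biased toward `target`.

    Moves leaving S (viable(y) is False) are always rejected.
    """
    x = start
    yield x
    for _ in range(n_steps):
        i = rng.randrange(len(x))
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if viable(y):
            delta = hamming(y, target) - hamming(x, target)
            if delta <= 0 or rng.random() < math.exp(-beta * delta):
                x = y
        yield x


def collision_found(p1, p2, viable, n_cloud=2000, n_chain=20000, beta=1.0, seed=0):
    """True if a chain started at p2 hits the stored cloud around p1."""
    rng = random.Random(seed)
    cloud = set(biased_walk(p1, p1, viable, n_cloud, beta, rng))
    return any(x in cloud for x in biased_walk(p2, p1, viable, n_chain, beta, rng))


if __name__ == "__main__":
    # Toy connected space: at least two bits set.
    p1 = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
    p2 = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
    print(collision_found(p1, p2, lambda g: sum(g) >= 2))
    # Toy fragmented space: two components that single flips cannot bridge.
    print(collision_found((0,) * 6, (1,) * 6, lambda g: sum(g) <= 1 or sum(g) >= 5))
```

A collision proves that P1 and P2 are connected; the absence of a collision within the allotted steps is only statistical evidence, which is why the check is repeated over many random pairs.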

Page 26: Searching, sampling and counting in RNA and gene networks

The viable genotypes form a connected network

Very few viable networks are not in the giant connected component, and the few such networks are usually isolated.

Example: for M = 0.25 N^2, the fraction of viable networks not belonging to the giant component is:

- 2.3×10^-3 at N=8

- 1.7×10^-3 at N=12

- 1.4×10^-3 at N=20

Page 27: Searching, sampling and counting in RNA and gene networks

Structure in the neighborhood of a viable genotype

Page 28: Searching, sampling and counting in RNA and gene networks

Neutral network topology

S. Ciliberti, O. Martin and A. Wagner, PNAS (2007)

Page 29: Searching, sampling and counting in RNA and gene networks

Constructive samplers

When the measure is fragmented, resort to creating samples ab-initio and use weights

- Need to « guide » the construction, otherwise weights have huge variance

- Some cases are « easy » (Sinclair et al.): Polynomial Randomized Approximation Scheme

- Some difficult cases have been treated (PERM of Grassberger) but it is an art
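A classic example of a constructive sampler with weights is Rosenbluth sampling of self-avoiding walks, the scheme that PERM (Pruned-Enriched Rosenbluth Method) refines with pruning and enrichment. The sketch below is a toy illustration chosen here, not an example from the talk: each walk is grown step by step on the square lattice, and its weight is the product of the numbers of free neighbors met along the way, so that the mean weight estimates the number of self-avoiding walks.

```python
import random


def rosenbluth_weight(n_steps, rng):
    """Grow one self-avoiding walk from scratch and return its weight.

    The weight is the product of the numbers of free neighbor sites at each
    step; a walk that gets trapped has weight 0.
    """
    pos = (0, 0)
    visited = {pos}
    weight = 1.0
    for _ in range(n_steps):
        x, y = pos
        free = [p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                if p not in visited]
        if not free:
            return 0.0
        weight *= len(free)
        pos = rng.choice(free)
        visited.add(pos)
    return weight


def estimate_walk_count(n_steps, n_samples, seed=0):
    """Average weight, an unbiased estimate of the number of walks."""
    rng = random.Random(seed)
    total = sum(rosenbluth_weight(n_steps, rng) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    # The exact number of 4-step self-avoiding walks on the square lattice is 100.
    print(estimate_walk_count(4, 20_000))
```

For long walks the weights spread over many orders of magnitude, which is the variance problem that guiding the construction is meant to control.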

Page 30: Searching, sampling and counting in RNA and gene networks

Other samplers

Choose at random sufficiently small sub-regions and apply branch and bound in each to get configurations (very slow)

Perform nested sampling (multiple measures interpolating to the desired one)

Accept incorrect distribution and just get « some » configurations by guided stochastic search; this is OK in the context of search or “design”

Page 31: Searching, sampling and counting in RNA and gene networks

[Figure: plot of Q against R μ, both axes running from 0 to 1. Spearman's s = 0.65, P < 10^-17; n = 104]

The mutational robustness and our measure Q have a strong association

What makes a regulatory network robust and how can one « design » functional networks ?

Q is a « quality » factor which measures the synergy of the Wij Sj

Page 32: Searching, sampling and counting in RNA and gene networks

[ III ]: Nested sampling for counting

Sometimes it is not enough to sample feasible solutions, one may want to know their number or frequency…

- Physics: entropy

- Statistics: small p-values

- Operations research: size of set of feasible solutions of CSP

- Biology: computing neutral network sizes

Page 33: Searching, sampling and counting in RNA and gene networks

Nested sampling

In a discrete space, we want to sample configurations having an unusual property, forming a fraction of say 1 in a trillion…

Randomly sampling the full space won't do

Often Monte Carlo won't work because the desired sub-space is fragmented

Introduce a family of measures interpolating between the full space and the desired sub-space and use exchange Monte Carlo on the replicas

Page 34: Searching, sampling and counting in RNA and gene networks

Example: cardinality of ‘neutral’ network in RNA modeling

T. Jorg, O. Martin and A. Wagner, submitted to BMC Bioinformatics

Discrete space of sequences, only a tiny fraction have the correct folding…

Changing the sequence just a bit sometimes changes the folding a lot, so the space is fragmented.

A simple choice for the measures: increasing distances to the target fold. At very short distances the measure is fragmented, but use of larger distances restores connectivity, thereby allowing the use of the Metropolis approach.

Even with this simple choice, one can efficiently sample the space of interest uniformly in spite of its rarity.

Extra bonus: one can both sample and count stochastically, in contrast to standard Monte Carlo.
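The counting side can be illustrated on a toy problem. The sketch below uses a simplified, sequential version of the idea (no replica exchange): nested sets S_k = {x : f(x) ≤ k} are sampled uniformly one after the other by Metropolis, and the size of the smallest set is estimated as the size of the full space times the product of the measured ratios |S_inner| / |S_outer|. The bit-string space, the function f (here the number of bits set, standing in for a distance to a target) and the name `nested_count` are assumptions made for illustration, not the RNA setup of the paper.

```python
import random


def nested_count(n_bits, f, levels, n_steps, seed=0):
    """Estimate |{x : f(x) <= levels[-1]}| among bit strings of length n_bits.

    `levels` is a decreasing list of thresholds; the first one must be large
    enough that every bit string satisfies f(x) <= levels[0].
    """
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n_bits)]   # uniform in the full space
    estimate = float(2 ** n_bits)
    for k_outer, k_inner in zip(levels, levels[1:]):
        hits = 0
        inside = None
        for _ in range(n_steps):
            i = rng.randrange(n_bits)
            x[i] ^= 1                               # propose a single-bit flip
            if f(x) > k_outer:
                x[i] ^= 1                           # reject: stay inside S_outer
            if f(x) <= k_inner:
                hits += 1
                inside = list(x)                    # remember a point of S_inner
        if inside is None:
            return 0.0                              # inner set never reached
        estimate *= hits / n_steps                  # ratio |S_inner| / |S_outer|
        x = inside                                  # start the next level in S_inner
    return estimate


if __name__ == "__main__":
    # Strings of 12 bits with at most one bit set: exactly 13 out of 4096.
    print(nested_count(12, sum, list(range(12, 0, -1)), 20_000))
```

Each ratio is of order one even when the final set is a tiny fraction of the full space, which is what makes the product estimate usable where direct random sampling is not.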

Page 35: Searching, sampling and counting in RNA and gene networks

Some conclusions

In the most favourable cases, one can enumerate, sample, search (design/optimize) and count.

Sophisticated algorithmic approaches based on Markov Chains allow one to sample even in intricate spaces, though at a significant computational cost.

The use of nested sampling allows for approximate counting in many realistic cases.

Except for enumeration, these techniques are perfectly applicable to continuous spaces.