Searching, sampling and counting in RNA and gene networks

Post on 14-Dec-2014


Enumerating, Sampling and Counting: some illustrative cases in biology

- Enumerating discrete structures
- Sampling and searching
- Nested sampling for counting

Olivier Martin

Laboratoire de Physique Théorique et Modèles Statistiques and UMR de Génétique Végétale

University of Paris-Sud

[ I ]: Enumerating Discrete Structures

Illustrative case: trees describing pedigrees

O. Martin and F. Hospital, Genetics (2004)

In a breeding program, one wants to (optimally) cross a collection of “parents” to produce an ideal genome, but the mixing of the genes (Mendelian genetics) is probabilistic and depends on their mutual distances.

General framework: each individual in the parental population has one good gene (resistance to one disease) and the “ideotype” must accumulate all these into one genome. The crossing of 2 parents should pass on their good genes to at least one offspring.

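Transmission of genes

[Figure: for each individual H, the gene sets s1 and s2 of its two chromosomes and the gene set s of the gamete it transmits.
H(1)(2): s1 = 1, s2 = 2, s = 1,2
H(3)(4): s1 = 3, s2 = 4, s = 3,4
H(12)(34): s1 = 1,2, s2 = 3,4, s = 1,2,3,4]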

We impose that a gamete cumulates all the good genes of the 2 chromosomes of its parent.

Example of a simple pedigree

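[Figure: a simple pedigree for Gene 1, Gene 2 and Gene 3. Labels: fixation steps; intermediate genotypes "1 2" and "1 2 3"; founders P1, P2, P3; ideotype I*.]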

P1, P2, P3: founder parents

I* : Ideotype

Pedigrees differ by:

- A tree structure

- The choice of parents

Representation of a pedigree

[Figure: three pedigree diagrams over the founder parents P1, P2, P3, P4; the third lists them in the order P1, P4, P2, P3.]

Particular cases of pedigrees

Regular pyramid: min height = log2(n) = 3

Cascade: max height = n - 1 = 7

(values shown for n = 8)

Pedigree = binary leaf-labeled tree

[Figure: a binary tree drawn level by level.
Level 0 (leaves): P1 P2 P3 P4 P5 P6
Level 1 (nodes): H(1)(2), H(3)(4), H(5)(6)
Level 2: H(12)(34)
Level 3: H(1234)(56)]

Questions

How to count the number of distinct pedigrees?

How to enumerate them by computer for further use?

How to sample them uniformly?

How to find the « optimal » pedigree given that each pedigree has a cost?

Counting the number of pedigrees

n      3    4    5     8        10           20
A(n)   3    15   105   135135   3.4 × 10^7   8.2 × 10^21

For n genes, one has A(n) = (2n - 3)!! pedigrees (by recurrence equations).
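The counts in the table can be checked with a few lines of Python. This is a minimal sketch: the function name `count_pedigrees` is made up for illustration, and the recurrence A(n) = (2n - 3) A(n - 1) with A(2) = 1 is the standard one for rooted binary leaf-labeled trees (a new leaf can be grafted onto any of the 2n - 4 edges or above the root), assumed here to be the kind of recurrence the slide refers to.

```python
def count_pedigrees(n):
    """Number of pedigrees for n >= 2 genes: A(n) = (2n - 3)!!.

    Uses the recurrence A(n) = (2n - 3) * A(n - 1) with A(2) = 1.
    """
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 3
    return count


for n in (3, 4, 5, 8, 10, 20):
    print(n, count_pedigrees(n))
```

Python integers have arbitrary precision, so the value for n = 20 is exact rather than rounded.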

Enumeration of all pedigrees

[Figure: a pedigree cumulating n genes, split into two sub-pedigrees carrying p genes and n-p genes.]

One fuses two sub-pedigrees:

- cumulating p genes
- cumulating (n-p) genes

An algorithm for constructing all pedigrees

Suppose all sub-pedigrees of height at most h are known; one can generate all those of height h+1:

- Examine all pairs of sub-pedigrees {P1,P2} of heights h1 = h and h2 ≤ h
- If P1 and P2 have no good gene in common, fuse them to form a sub-pedigree P of height (h+1)
- If P cumulates all good genes, keep it as a complete pedigree; otherwise add it to the list of sub-pedigrees of height h+1

Repeat for the next height until h+1 = n-1
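The height-by-height construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the name `enumerate_pedigrees`, the nested-tuple representation of trees and the integer gene labels 1..n are all choices made here. Each unordered pair of sub-pedigrees is examined once, so every pedigree appears exactly once in the result.

```python
from itertools import combinations


def enumerate_pedigrees(n):
    """List all pedigrees (nested tuples of gene labels 1..n) for n >= 2 genes."""
    full = frozenset(range(1, n + 1))
    # by_height[h]: incomplete sub-pedigrees of height h, stored as (gene_set, tree)
    by_height = [[(frozenset([g]), g) for g in range(1, n + 1)]]
    complete = []

    def fuse(a, b, out):
        (genes_a, tree_a), (genes_b, tree_b) = a, b
        if genes_a.isdisjoint(genes_b):            # no good gene in common
            genes = genes_a | genes_b
            if genes == full:                      # cumulates all good genes
                complete.append((tree_a, tree_b))
            else:
                out.append((genes, (tree_a, tree_b)))

    for h in range(n - 1):                          # builds heights 1 .. n-1
        new = []
        lower = [p for level in by_height[:h] for p in level]
        for a in by_height[h]:
            for b in lower:                         # h1 = h, h2 < h
                fuse(a, b, new)
        for a, b in combinations(by_height[h], 2):  # h1 = h2 = h, each pair once
            fuse(a, b, new)
        by_height.append(new)
    return complete


print(len(enumerate_pedigrees(4)))
```

The length of the returned list can be compared with A(n) = (2n - 3)!! as a consistency check for small n.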

Working of the algorithm

[Figure sequence: sub-pedigrees are built height by height, h = 0, then h = 1, h = 2 and h = 3.]

Example: cascade with 4 genes

Optimal pedigrees: search by pruning the enumeration

(branch and bound)

Of all the ways to produce a given combination of good genes, keep only the best sub-pedigree.

- Enumeration: one can treat up to 14 genes
- Branch and bound: up to 22 genes

Case of « adjacent » cascades: dynamic programming determines the optimal pedigree in O(n^2) operations
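The pruning rule (keep only the best sub-pedigree for each combination of good genes) can be illustrated with a memoised recursion over gene subsets. This is a simplified stand-in, not the branch-and-bound procedure of the talk, and it rests on an assumption that the slides do not state: that the cost is additive, i.e. the cost of a pedigree is the cost of its two sub-pedigrees plus a term for the final cross. The names `best_pedigree` and `cross_cost` are hypothetical.

```python
from functools import lru_cache


def best_pedigree(n, cross_cost):
    """Return (cost, tree) of the cheapest pedigree cumulating genes 1..n.

    Gene sets are bit masks. cross_cost(left, right) is a hypothetical
    additive cost for fusing two sub-pedigrees with those gene sets.
    """
    @lru_cache(maxsize=None)
    def best(mask):
        if mask & (mask - 1) == 0:          # one gene: a founder parent
            return 0.0, mask.bit_length()   # gene label in 1..n
        low = mask & -mask                  # lowest gene stays on the left side,
        rest = mask ^ low                   # so each split is examined once
        best_cost, best_tree = float("inf"), None
        sub = rest
        while sub:
            sub = (sub - 1) & rest          # next proper subset of rest (down to 0)
            left, right = low | sub, rest ^ sub
            (cl, tl), (cr, tr) = best(left), best(right)
            cost = cl + cr + cross_cost(left, right)
            if cost < best_cost:
                best_cost, best_tree = cost, (tl, tr)
        return best_cost, best_tree

    return best((1 << n) - 1)


# Toy cost (an assumption): each cross costs the number of genes it cumulates.
size_cost = lambda a, b: bin(a | b).count("1")
print(best_pedigree(4, size_cost))
```

With this toy cost the regular pyramid beats the cascade for 4 genes. If the real cost is not additive, the best sub-pedigree for a gene set need not be part of the best full pedigree, and the pruning rule would have to be justified separately.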

[ II ]: Sampling and searching

This problem is ubiquitous:

- Physics: equilibrium configurations
- Operations research: feasible solutions of CSP
- Statistics: estimating p-values

The royal road: Monte Carlo Markov Chains

To obtain samples with a given probability distribution or measure, use the Metropolis algorithm (1953)

Simple, very effective if no bottlenecks

If the measure is fragmented, one needs large « moves », but these almost always fail
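As a reminder of how the Metropolis rule works, here is a minimal sketch on a toy discrete space. Everything specific is an assumption made for illustration: the states are the integers 0..9 arranged on a ring, the proposal is a symmetric step of ±1, and the target weight is w(x) = x + 1. A proposed move from x to y is accepted with probability min(1, w(y)/w(x)).

```python
import random


def metropolis(weight, n_states, steps, rng):
    """Sample states 0..n_states-1 with probability proportional to weight(x)."""
    x = 0
    samples = []
    for _ in range(steps):
        y = (x + rng.choice((-1, 1))) % n_states   # symmetric proposal on a ring
        # accept with probability min(1, weight(y) / weight(x))
        if rng.random() < weight(y) / weight(x):
            x = y
        samples.append(x)                          # a rejected move repeats x
    return samples


rng = random.Random(0)
samples = metropolis(lambda x: x + 1, 10, 100_000, rng)
print(sum(samples) / len(samples))   # the exact mean under this target is 6
```

Here every state can be reached through small moves, which is the "no bottlenecks" situation; in a fragmented space the same rule would stay trapped in one fragment.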

The case of biological networks: some computational challenges

(1) Generate a genotype of given phenotype (oriented search)
(2) Sample uniformly genotypes of a given phenotype: use symmetries to reduce exponentially the space size
(3) Determine the connectivity of the neutral network: do guided search to go from one random genotype to another
(4) Sample uniformly a connected component of the neutral network: use random walks
(5) Sample uniformly the surface of a “ball” around a point: use Metropolis with asymmetric rates
(6) Get the infinite population limit of a population under Darwinian selection: use variance reduction and 1/N extrapolation

Viable genotypes are rare

S. Ciliberti, O. Martin and A. Wagner, PLoS Comp. Bio. (2007)

If one allows for M interactions (M non-zero entries of W) between N genes, what fraction of the genotypes (regulatory networks) are viable? By smart sampling:

[Figure: fraction of viable networks (logarithmic axis, 5E-15 to 0.5) versus N (4 to 20). Illustration when M = 0.25 N^2.]

Showing connectivity properties of biological networks

We want to check with a high level of confidence that a certain space S is connected. We do this in three steps:

(1) Use the Metropolis MC algorithm to produce random pairs of points (P1,P2) in the space S
(2) Generate an “equilibrium” cloud of points in S around P1 by a biased Monte Carlo and store these
(3) Produce a MC chain of points in S, starting from P2, using for instance the same Monte Carlo rule as above; check for collisions with the stored set. If a collision arises, P1 is connected to P2
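The collision test can be sketched on a toy space. The assumptions here are loud ones: genotypes are bit strings stored as integers, a move flips one bit, the predicate `viable` is a made-up membership test for S, and the cloud is produced by a plain (unbiased) random walk rather than the biased Monte Carlo mentioned above. The function names are hypothetical.

```python
import random


def walk(start, viable, n_bits, steps, rng):
    """Random walk inside S: flip one random bit, stay put if the result is not viable."""
    x = start
    for _ in range(steps):
        y = x ^ (1 << rng.randrange(n_bits))
        if viable(y):
            x = y
        yield x


def appear_connected(p1, p2, viable, n_bits, cloud_steps, chain_steps, rng):
    """True if a chain started at p2 collides with the stored cloud around p1."""
    cloud = {p1}
    cloud.update(walk(p1, viable, n_bits, cloud_steps, rng))
    if p2 in cloud:
        return True
    return any(x in cloud for x in walk(p2, viable, n_bits, chain_steps, rng))


rng = random.Random(1)
few_ones = lambda x: bin(x).count("1") <= 3                # connected toy space
two_islands = lambda x: bin(x).count("1") in (0, 1, 9, 10)  # two components
print(appear_connected(0b0000000111, 0b1110000000, few_ones, 10, 5000, 5000, rng))
print(appear_connected(0, 1023, two_islands, 10, 2000, 2000, rng))
```

A collision proves that the two points are connected; the absence of a collision within the step budget is only evidence of disconnection, not proof.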

The viable genotypes form a connected network

Very few viable networks are not in the giant connected component, and the few such networks are usually isolated.

Example: For M = 0.25 N^2, the fraction of viable networks not belonging to the giant component is:

- 2.3 × 10^-3 at N = 8
- 1.7 × 10^-3 at N = 12
- 1.4 × 10^-3 at N = 20

Structure in the neighborhood of a viable genotype

Neutral network topology

S. Ciliberti, O. Martin and A. Wagner, PNAS (2007)

Constructive samplers

When the measure is fragmented, resort to creating samples ab initio and use weights

Need to « guide » the construction, otherwise weights have huge variance

- Some cases are « easy » (Sinclair et al.): Polynomial Randomized Approximation Scheme
- Some difficult cases have been treated (PERM of Grassberger) but it is an art

Other samplers

Choose at random sufficiently small sub-regions and apply branch and bound in each to get configurations (very slow)

Perform nested sampling (multiple measures interpolating to the desired one)

Accept incorrect distribution and just get « some » configurations by guided stochastic search; this is OK in the context of search or “design”

[Figure: scatter plot of Q (vertical axis, 0 to 1) against R_μ (horizontal axis, 0 to 1). Spearman's s = 0.65, P < 10^-17; n = 104.]

The mutational robustness and our measure Q have a strong association

What makes a regulatory network robust and how can one « design » functional networks?

Q is a « quality » factor which measures the synergy of the W_ij S_j

[ III ]: Nested sampling for counting

Sometimes it is not enough to sample feasible solutions; one may want to know their number or frequency:

- Physics: entropy
- Statistics: small p-values
- Operations research: size of set of feasible solutions of CSP
- Biology: computing neutral network sizes

Nested sampling

In a discrete space, we want to sample configurations having an unusual property, forming a fraction of say 1 in a trillion. Randomly sampling the full space won't do.

Often Monte Carlo won't work because the desired sub-space is fragmented.

Introduce a family of measures interpolating between the full space and the desired sub-space and use exchange Monte Carlo on the replicas.

Example: cardinality of ‘neutral’ network in RNA modeling

T. Jorg, O. Martin and A. Wagner, submitted to BMC Bioinformatics

Discrete space of sequences, only a tiny fraction have the correct folding.

Changing the sequence just a bit sometimes changes the folding a lot, so the space is fragmented.

A simple choice for the measures: increasing distances to the target fold. At very short distances the measure is fragmented, but use of larger distances restores connectivity, thereby allowing the use of the Metropolis approach.

Even with this simple choice, one can efficiently sample the space of interest uniformly in spite of its rarity.

Extra bonus: one can both sample and count stochastically, in contrast to standard Monte Carlo.
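The counting idea can be illustrated on a toy problem where the exact answer is known. The setup is an assumption chosen for simplicity and is not the RNA model: configurations are bit strings of length 20, the target is the all-zero string, and the nested sets are Hamming balls of decreasing radius around it. The size of the smallest ball is estimated as the size of the full space times a product of fractions, each fraction measured by a Metropolis walk confined to the larger ball. The sketch uses one independent walk per level instead of exchange Monte Carlo on replicas, and the function names are hypothetical.

```python
import random
from math import comb


def fraction_inside(n_bits, outer, inner, steps, rng):
    """Estimate |ball(inner)| / |ball(outer)| by a Metropolis walk in ball(outer)."""
    x, dist = 0, 0                     # start at the target, which lies in every ball
    hits = 0
    burn_in = steps // 10
    for t in range(steps + burn_in):
        bit = 1 << rng.randrange(n_bits)
        new_dist = dist - 1 if x & bit else dist + 1
        if new_dist <= outer:          # reject moves that leave the outer ball
            x ^= bit
            dist = new_dist
        if t >= burn_in and dist <= inner:
            hits += 1
    return hits / steps


def estimate_count(n_bits, radii, steps, rng):
    """Estimate the size of the ball of radius radii[-1] via nested balls."""
    size = 2.0 ** n_bits               # the full space is the ball of radius n_bits
    outer = n_bits
    for inner in radii:
        size *= fraction_inside(n_bits, outer, inner, steps, rng)
        outer = inner
    return size


rng = random.Random(0)
estimate = estimate_count(20, [8, 6, 4, 3, 2], 50_000, rng)
exact = sum(comb(20, k) for k in range(3))   # 1 + 20 + 190 = 211
print(estimate, exact)
```

Each level only has to resolve a fraction of order 0.1 to 0.3, whereas direct sampling of the full space would have to resolve a fraction of about 2 in 10,000; the gain grows quickly as the target set becomes rarer.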

Some conclusions

In the most favourable cases, one can enumerate, sample, search (design/optimize) and count.

Sophisticated algorithmic approaches based on Markov Chains allow one to sample even in intricate spaces, though at a significant computational cost.

The use of nested sampling allows for approximate counting in many realistic cases.

Except for enumeration, these techniques are perfectly applicable to continuous spaces.
