Yann Ponty, Andy Lorenz, Peter Clote

Biology Department, Boston College

Asymptotics of RNA shapes

Talk outline

1. Biological motivation
2. General approach used
3. Results for RNA shapes

Biological motivation

Primary structure
Sequence of A’s, C’s, G’s and U’s

Secondary structure
• By definition it is (half-)planar in nature
• Commonly computationally tractable

Tertiary structure
Ultimate goal, but difficult to predict.

Biological motivation – RNA secondary structure

…(((…((((((((…....)))))…))).(((….)))..)))…

Picture representation (planar graph representation)

Balanced parenthesis sequence representation

Non-negative excursion (path that starts and ends at 0 but is never negative)

Terminal loop

Terminal loop

n=53

Biological motivation – RNA secondary structure

This is a pseudo-knot, not allowed in secondary structure

pseudo-knot

No crossing is allowed in secondary structure. Such crosses are called pseudoknots.

Biological motivation – RNA secondary structure

In many algorithms, the set of all possible secondary structures is the search space. The size of this search space can limit how large an RNA a given algorithm can be applied to.

Let S(n) denote the number of secondary structures on a sequence of length n. As described, S(n) are known as the Motzkin numbers, and the asymptotics of S(n) for large n are

$$S(n) \sim \frac{3\sqrt{3}}{2\sqrt{\pi}} \cdot \frac{3^n}{n^{3/2}} \approx 1.46581 \cdot \frac{3^n}{n^{3/2}}$$

Similarly, if we force terminal loops to be of length 1 or more, we get a different asymptotic growth (Stein, Waterman 1978).

$$S(n) \sim \sqrt{\frac{15+7\sqrt{5}}{8\pi}} \; n^{-3/2} \left(\frac{3+\sqrt{5}}{2}\right)^n \approx 1.104366 \cdot \frac{2.618034^n}{n^{3/2}}$$

In any case, these numbers grow fast. Can we find equivalence classes of these shapes that do not grow so fast?
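As a numerical sanity check of the Motzkin growth rate quoted above, the sketch below computes Motzkin numbers exactly and rescales them by n^{3/2}/3^n. The function names and the choice n = 600 are illustrative assumptions, and the convolution recurrence used here is the standard one for Motzkin numbers rather than something taken from the slides.

```python
import math

def motzkin_numbers(n_max):
    """Motzkin numbers M_0..M_n_max via M_n = M_{n-1} + sum_k M_k * M_{n-2-k}."""
    m = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        m[n] = m[n - 1] + sum(m[k] * m[n - 2 - k] for k in range(n - 1))
    return m

def scaled(value, n, growth):
    """value * n^{3/2} / growth^n, computed with logarithms to avoid overflow."""
    return math.exp(math.log(value) + 1.5 * math.log(n) - n * math.log(growth))

M = motzkin_numbers(600)
motzkin_constant = 3 * math.sqrt(3) / (2 * math.sqrt(math.pi))  # about 1.46581
estimate = scaled(M[600], 600, 3.0)

print(M[:8])
print(motzkin_constant, estimate)
```

The rescaled value approaches the constant only at rate O(1/n), so a small gap at moderate n is expected.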

Biological motivation – RNA shapes – π shapes

[Figure: two secondary structures, with bulge, internal loop, multi-loop, terminal loops and helix regions labeled]

π-shapes try to capture the basic shape of a secondary structure. Bulges and internal loops are ignored. The unpaired bases in multi-loops and terminal loops are ignored. Helix regions are collapsed into 1 bracket. Both of the above secondary structures have the same π-shape, namely

[ [ ] [ ] ]

Unpaired base

Paired base

Biological motivation – RNA shapes – π shapes

…(((…((((((((…….)))))…))).(((.(((…)))..))).)))…

[ [ ] [ ] ]

Biological motivation – RNA shapes – π’ shapes

π’-shapes try to capture more aspects of secondary structure. Bulges and internal loops are not ignored. Any group of unpaired bases is reduced to one dot. Helix regions are collapsed into 1 bracket. The π’-shapes for the above secondary structures are

.[.[[.].].[.[.].].]. and .[.[.][.].].

[Figure: the same two secondary structures, with multi-loop, helix regions, bulge, internal loop, and paired and unpaired bases labeled]

Biological motivation – RNA shapes – π’ shapes

…(((…((((((((…….)))))…))).(((.(((…)))..))).)))…

.[.[[.].].[.[.].].].

Talk outline

1. Biological motivation
2. General approach used
3. Results for RNA shapes

General approach – overview

Standard method: Structure → Recursion relations → Functional relation for generating function → Asymptotics

Better method: Structure → Grammar → Generating function → Asymptotics

(comparison 1) Recursion relations vs. grammar
(comparison 2) Getting the asymptotics

General approach – example to be worked through

For our example, we will use secondary structures with minimum terminal loop length of 1.

…(((…((((((((…....)))))…))).(((.)))..)))…   OK (terminal loops of length 7 and length 1)

..(((..().((..((..))))   not OK (terminal loops of length 0 and length 2)

We know the asymptotics of this to be

$$S(n) \sim \sqrt{\frac{15+7\sqrt{5}}{8\pi}} \; n^{-3/2} \left(\frac{3+\sqrt{5}}{2}\right)^n \approx 1.104366 \cdot \frac{2.618034^n}{n^{3/2}}$$

Standard method – recursion relations

Let $S(z) = \sum_{n=1}^{\infty} S_n z^n$, where $S_n$ is the number of secondary structures on a sequence of length n.

2 cases for the last base n:
• n is not base paired: $S_{n-1}$ structures
• n is base paired with k: $S_{k-1} \cdot S_{n-k-1}$ structures

From this we see that

$$S_n = S_{n-1} + \sum_{k=1}^{n-2} S_{k-1} \cdot S_{n-k-1}$$

Now, being careful with initial conditions, and with a bit of algebraic manipulation, we can eventually get to the relation

$$S = z + Sz + Sz^2 + S^2 z^2$$

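Standard method – recursion relations

First note

$$S^2 = \left(\sum_{n=1}^{\infty} S_n z^n\right)^2 = \sum_{n=1}^{\infty} \left(\sum_{k=1}^{n-1} S_k S_{n-k}\right) z^n. \qquad (1)$$

By induction we first get $S_n = S_{n-1} + \sum_{k=1}^{n-2} S_{k-1} \cdot S_{n-k-1}$. Replacing n by n+2, we have $S_{n+2} = S_{n+1} + \sum_{k=1}^{n} S_{k-1} \cdot S_{n-(k-1)}$. Substituting r for k−1, we have $S_{n+2} = S_{n+1} + \sum_{r=0}^{n-1} S_r \cdot S_{n-r}$. Since $S_0 = 1$, we have $\sum_{r=0}^{n-1} S_r \cdot S_{n-r} = S_n + \sum_{r=1}^{n-1} S_r \cdot S_{n-r}$, so

$$S_{n+2} - S_{n+1} - S_n = \sum_{r=1}^{n-1} S_r \cdot S_{n-r}.$$

By (1), the right-hand side is the coefficient of $z^n$ in $S^2$, so

$$S^2 = \sum_{n=1}^{\infty} (S_{n+2} - S_{n+1} - S_n) z^n = \sum_{n=1}^{\infty} S_{n+2} z^n - \sum_{n=1}^{\infty} S_{n+1} z^n - \sum_{n=1}^{\infty} S_n z^n.$$

Note that

$$\frac{S - S_1 z - S_2 z^2}{z^2} = \sum_{n=1}^{\infty} S_{n+2} z^n \quad \text{and} \quad \frac{S - S_1 z}{z} = \sum_{n=1}^{\infty} S_{n+1} z^n.$$

Thus, with $S_1 = S_2 = 1$,

$$S^2 = \frac{S - z - z^2}{z^2} - \frac{S - z}{z} - S$$

Multiply by $z^2$ to get

$$z^2 S^2 = S - z - z^2 - zS + z^2 - S z^2$$

so

$$z^2 S^2 - S(1 - z - z^2) + z = 0$$

and thus

$$S = z^2 S^2 + S z^2 + S z + z$$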
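The recursion and the functional relation it leads to can be checked against each other numerically. The sketch below (function names are illustrative assumptions) computes $S_n$ from the recursion with $S_0 = 1$, then compares the coefficients of $S(z) = \sum_{n \ge 1} S_n z^n$ with those of $z + zS + z^2 S + z^2 S^2$, truncated at a fixed degree.

```python
def count_structures(n_max):
    """S_0..S_n_max from S_n = S_{n-1} + sum_{k=1}^{n-2} S_{k-1} * S_{n-k-1}, S_0 = 1."""
    s = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        s[n] = s[n - 1] + sum(s[k - 1] * s[n - k - 1] for k in range(1, n - 1))
    return s

def rhs_coefficients(a):
    """Coefficients of z + z*S + z^2*S + z^2*S^2, where a[n] is the z^n coefficient of S."""
    size = len(a)
    square = [sum(a[i] * a[n - i] for i in range(n + 1)) for n in range(size)]
    rhs = [0] * size
    rhs[1] += 1                                  # the term z
    for n in range(1, size):
        rhs[n] += a[n - 1]                       # the term z*S
        if n >= 2:
            rhs[n] += a[n - 2] + square[n - 2]   # the terms z^2*S and z^2*S^2
    return rhs

counts = count_structures(30)
series = [0] + counts[1:]        # the generating function has no constant term
print(counts[:10])
print(series == rhs_coefficients(series))
```

Only coefficients up to the truncation degree are compared, so this is a consistency check rather than a proof.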

Better method – grammar

Grammars are a way of generating a language. We will restrict to the simple case of context-free grammars. The context-free grammar for secondary structures (with minimal terminal loop length 1) is given by

S → • | S • | ( S ) | S ( S )

Here our terminal symbols (letters in the language this grammar describes) are (, ) and •. The language generated is exactly secondary structures (with minimum terminal loop length 1).

To generate a word, we substitute a right-hand side for a symbol on the left. So, as an example for the above language, we could generate the word (always substituting the left-most non-terminal)

S → S ( S ) → S • ( S ) → • • ( S ) → • • ( S • ) → • • ( S ( S ) • ) → • • ( • ( S ) • ) → • • ( • ( • ) • )

In addition, because this grammar is unambiguous, there is no different path to this same word.

Better method – grammar

S → • | S • | ( S ) | S ( S )

[Figure: parse tree for the word • • ( • ( • ) • )]

Here is the parse tree for the same word.

Reading the leaves of the tree gives the same word, • • ( • ( • ) • ). This is the only tree giving rise to this word, because the grammar is unambiguous (which can be shown by induction).

Better method – grammar

Type of nonterminal     Equation for the l.g.f.
S → T | U               S(z) = T(z) + U(z)
S → T U                 S(z) = T(z) U(z)
S → t                   S(z) = z
S → ε                   S(z) = 1

Given an unambiguous context-free grammar, we can find the corresponding generating function relations with the above properties. This is very fast!

S → • | S • | ( S ) | S ( S )   gives   $S = z + zS + z^2 S + z^2 S^2$

Significantly faster than the other method at getting here! This method for getting the relations on the generating function is sometimes called the DSV method.
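To illustrate that the DSV translation really counts the words of the grammar, the sketch below compares two independent computations for small n. One enumerates all words over (, ) and . by brute force, keeping the balanced ones with no empty terminal loop "()". The other extracts coefficients from $S = z + zS + z^2 S + z^2 S^2$ by fixed-point iteration on truncated series. The helper names are illustrative assumptions, and "." stands in for the unpaired-base symbol.

```python
from itertools import product

def is_structure(word):
    """True if word is balanced over '(' and ')' and contains no empty loop '()'."""
    depth = 0
    for ch in word:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and "()" not in word

def brute_force_counts(n_max):
    """Number of valid words of each length 1..n_max, by exhaustive enumeration."""
    return [sum(is_structure("".join(w)) for w in product("().", repeat=n))
            for n in range(1, n_max + 1)]

def dsv_counts(n_max):
    """Coefficients of z^1..z^n_max of the solution of S = z + zS + z^2 S + z^2 S^2."""
    s = [0] * (n_max + 1)
    for _ in range(n_max):       # each pass fixes at least one more coefficient
        sq = [sum(s[i] * s[n - i] for i in range(n + 1)) for n in range(n_max + 1)]
        new = [0] * (n_max + 1)
        new[1] = 1
        for n in range(1, n_max + 1):
            new[n] += s[n - 1]
            if n >= 2:
                new[n] += s[n - 2] + sq[n - 2]
        s = new
    return s[1:]

print(brute_force_counts(8))
print(dsv_counts(8))
```

The brute-force side grows as 3^n, so it is only meant for very short words.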

General approach – overview

Standard method: Structure → Recursion relations → Functional relation for generating function → Asymptotics

Better method: Structure → Grammar → Generating function → Asymptotics

(comparison 1) Recursion relations vs. grammar
(comparison 2) Getting the asymptotics

Standard method – getting asymptotics

The following theorem, due to Meir and Moon and modified by Odlyzko, can be used to get the asymptotics.

Suppose that $f(z) = \sum_{n=1}^{\infty} f_n z^n$ is analytic at z = 0, that $f_n \ge 0$ for all n, and that $f(z) = G(z, f(z))$, where $G(z,w) = \sum_{m,n \ge 0} g_{m,n} z^m w^n$. Suppose that there exist real numbers δ, r, s > 0 such that

• $G(z,w)$ is analytic in $|z| < r + \delta$ and $|w| < s + \delta$.
• $G(r,s) = s$, $G_w(r,s) = 1$,
• $G_z(r,s) \ne 0$ and $G_{ww}(r,s) \ne 0$.

Suppose that $g_{m,n}$ is real and non-negative for all m, n, that $g_{0,0} = 0$, $g_{0,1} \ne 1$, and $g_{m,n} > 0$ for some m and some n ≥ 2. Assume further that there exist h > j > i ≥ 1 such that $f_h f_i f_j \ne 0$ while the greatest common divisor of j − i and h − i is 1. Then f(z) converges at z = r, f(r) = s, and

$$f_n = [z^n] f(z) \sim \sqrt{\frac{r\, G_z(r,s)}{2\pi\, G_{ww}(r,s)}} \; r^{-n} n^{-3/2}.$$

For us, we identify S with w and get $G(z,w) = z + zw + z^2 w + z^2 w^2$
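Plugging this G into the theorem can be sketched numerically. The code below takes r = (3 − √5)/2 as a candidate (the value identified later in the talk as the dominant singularity), solves $G_w(r,s) = 1$ for s, checks that $G(r,s) = s$ also holds, and evaluates the constant in front of $r^{-n} n^{-3/2}$. The partial derivatives are computed by hand from G, and all names are illustrative assumptions.

```python
import math

def G(z, w):
    return z + z * w + z**2 * w + z**2 * w**2

def G_z(z, w):      # partial derivative of G with respect to z
    return 1 + w + 2 * z * w + 2 * z * w**2

def G_w(z, w):      # partial derivative of G with respect to w
    return z + z**2 + 2 * z**2 * w

def G_ww(z, w):     # second partial derivative of G with respect to w
    return 2 * z**2

r = (3 - math.sqrt(5)) / 2
s = (1 - r - r**2) / (2 * r**2)     # solves G_w(r, s) = 1 for s

constant = math.sqrt(r * G_z(r, s) / (2 * math.pi * G_ww(r, s)))
expected = math.sqrt((15 + 7 * math.sqrt(5)) / (8 * math.pi))

print(G(r, s) - s, G_w(r, s) - 1)   # both should vanish up to rounding
print(1 / r, constant, expected)
```

This only evaluates the formula; it does not verify the remaining hypotheses of the theorem.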

Standard method – getting asymptotics

What is good and bad about the Bender-Meir-Moon theorem?

The good
• If all of the conditions are satisfied, one merely has to plug in the answer.
• This approach, unlike the one we are going to describe, does not need an explicit formula for the generating function.

The bad
• If the conditions are not all satisfied (as happened to us), you cannot use the theorem. There are many conditions, and they are very constraining.
• This approach only tackles one (common) kind of singularity in the generating function. The approach we show next is much more general in this respect.

The ugly…

Better method – getting asymptotics

We now describe a more general approach, as described by Flajolet and Odlyzko (1990).

Start with the relation for the generating function given by the grammar.

$$S = z + zS + z^2 S + z^2 S^2$$

Solve for the generating function.

$$S_{\pm} = \frac{1 - z - z^2 \pm \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$

Choose the solution that is analytic at 0 (a necessary condition for generating functions).

$$S = \frac{1 - z - z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$

This function will be analytic except possibly where the denominator is zero (in this case it is analytic at z = 0), and where the square root is zero (since $z^{1/2}$ is not analytic at 0).

Better method – getting asymptotics

$$S = \frac{1 - z - z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$

The above function is non-analytic at the roots of the polynomial

$$1 - 2z - z^2 - 2z^3 + z^4$$

These are found to be roots of unity (modulus 1) and

$$z = \frac{3+\sqrt{5}}{2}, \quad z = \frac{3-\sqrt{5}}{2}$$

The dominant singularity, ρ, is the one with smallest modulus. Thus

$$\rho = \frac{3-\sqrt{5}}{2}$$

This means the series $S = \sum_{n=1}^{\infty} S_n z^n$ converges for $|z| < \rho$ and diverges for $|z| > \rho$. We immediately get that the terms $S_n$ grow exponentially at a rate of

$$1/\rho = \frac{2}{3-\sqrt{5}} = \frac{3+\sqrt{5}}{2}$$

Thus we know $S_n \approx \left(\frac{3+\sqrt{5}}{2}\right)^n$
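The location of the singularities can be checked directly. The sketch below evaluates the polynomial under the square root at the four claimed roots (the two real ones and the two primitive cube roots of unity), picks the one of smallest modulus, and compares 1/ρ with the ratio of consecutive exact counts. Function names and the choice n = 300 are illustrative assumptions.

```python
import cmath
import math

def radicand(z):
    """The polynomial under the square root in the closed form of S(z)."""
    return 1 - 2 * z - z**2 - 2 * z**3 + z**4

def count_structures(n_max):
    """S_0..S_n_max from S_n = S_{n-1} + sum_{k=1}^{n-2} S_{k-1} * S_{n-k-1}, S_0 = 1."""
    s = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        s[n] = s[n - 1] + sum(s[k - 1] * s[n - k - 1] for k in range(1, n - 1))
    return s

candidate_roots = [
    (3 + math.sqrt(5)) / 2,
    (3 - math.sqrt(5)) / 2,
    cmath.exp(2j * cmath.pi / 3),    # primitive cube roots of unity, modulus 1
    cmath.exp(-2j * cmath.pi / 3),
]
rho = min(candidate_roots, key=abs)  # dominant singularity: smallest modulus

S = count_structures(300)
ratio = S[300] / S[299]              # should be close to 1/rho for large n

print([abs(radicand(z)) for z in candidate_roots])
print(rho, 1 / rho, ratio)
```

The ratio of consecutive terms only captures the exponential growth rate; the polynomial factor $n^{-3/2}$ makes it converge slowly toward 1/ρ.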

Better method – finer asymptotics

Now we use the theorem by Flajolet and Odlyzko.

Assume that f(z) is analytic in $\Delta \setminus \{1\}$, and that as $z \to 1$ in $\Delta$,

$$f(z) \sim K (1 - z)^{\alpha}$$

Then, as $n \to \infty$, if $\alpha \notin \{0, 1, 2, \ldots\}$,

$$f_n \sim \frac{K}{\Gamma(-\alpha)} n^{-\alpha-1}.$$

If the singularities are isolated, and if there is a unique dominant singularity, the rescaled function will always be analytic in the required region, $\Delta \setminus \{1\}$.

The theorem basically states that the growth of terms is determined completely by the singularity. The rescaling of the singularity to 1 merely gets rid of the exponential portion in the answer, which we already know.

[Figure: the region Δ, showing the dominant singularity at 1, the external singularities, and the parameters φ and ε]

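Better method – finer asymptotics

$$S = \frac{1 - z - z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2} = \frac{1 - z - z^2}{2z^2} - \frac{\sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$

The first term is slower growing, as it does not contain the singularity. The second term is the only part that matters for the asymptotics:

$$S' = -\frac{\sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$

$$S' = -\frac{\left(P_2(z)(1 - z/\rho)\right)^{1/2}}{2z^2} = -\frac{P_2(z)^{1/2}}{2z^2} (1 - z/\rho)^{1/2}$$

where $P_2(z)$ is determined by dividing out the dominant singularity. The factor $-P_2(z)^{1/2}/(2z^2)$ is the portion in front of the dominant singularity, and the exponent of $(1 - z/\rho)$ gives α = 1/2 in the theorem.

Evaluate that portion at the dominant singularity ρ:

$$K = -\frac{P_2(\rho)^{1/2}}{2\rho^2}$$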

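Better method – finer asymptotics

So from the theorem we get

$$S_n \sim \frac{K}{\Gamma(-\alpha)} n^{-\alpha-1} \left(\frac{1}{\rho}\right)^n$$

with

$$K = -\frac{P_2(\rho)^{1/2}}{2\rho^2}, \quad \alpha = 1/2, \quad \text{and (from rescaling)} \quad 1/\rho = \frac{3+\sqrt{5}}{2}$$

Plugging in gives

$$S(n) \sim \sqrt{\frac{15+7\sqrt{5}}{8\pi}} \; n^{-3/2} \left(\frac{3+\sqrt{5}}{2}\right)^n \approx 1.104366 \cdot \frac{2.618034^n}{n^{3/2}}$$

as desired.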
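The "plugging in" step can be reproduced numerically. In the sketch below, the explicit form $P_2(z) = (1 - \rho z)(1 + z + z^2)$ is our own factorization of the polynomial under the square root (it is not written out on the slides), and the function names and the choice n = 800 are illustrative assumptions. The code computes $K/\Gamma(-\alpha)$ and compares it with the closed-form constant and with exact counts rescaled by $n^{3/2} \rho^{n}$.

```python
import math

rho = (3 - math.sqrt(5)) / 2
alpha = 0.5

def P2(z):
    """Radicand 1 - 2z - z^2 - 2z^3 + z^4 with the factor (1 - z/rho) divided out."""
    return (1 - rho * z) * (1 + z + z**2)

def count_structures(n_max):
    """S_0..S_n_max from S_n = S_{n-1} + sum_{k=1}^{n-2} S_{k-1} * S_{n-k-1}, S_0 = 1."""
    s = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        s[n] = s[n - 1] + sum(s[k - 1] * s[n - k - 1] for k in range(1, n - 1))
    return s

K = -math.sqrt(P2(rho)) / (2 * rho**2)
constant = K / math.gamma(-alpha)            # Gamma(-1/2) = -2*sqrt(pi)
closed_form = math.sqrt((15 + 7 * math.sqrt(5)) / (8 * math.pi))

n = 800
S = count_structures(n)
# S_n * n^{3/2} * rho^n, computed with logarithms to avoid overflow
estimate = math.exp(math.log(S[n]) + 1.5 * math.log(n) + n * math.log(rho))

print(K, constant, closed_form, estimate)
```

The exact rescaled count differs from the constant by a relative error of order 1/n, which is why only rough agreement is expected at n = 800.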

Method

Combinatorial class → Language (codes, bijections) → Grammar → Generating function (DSV) → Asymptotics (singularity analysis)

Combinatorial class → Asymptotics directly: ?

Shapes

Shapes ≈ Trees

Well-parenthesized language (Dyck words)

[ ][ [ [ [ ] [ ] ] [ ] ] ] ] ][ [

Grammar: S → [ S ] S | ε

Shapes

But nested motifs [ [ ] ] must be prohibited !

Otherwise many different shapes may correspond to a single structure !!!

[ ]

[ [ ] ]

[ [ [ ] ] ]

…

Shapes

Shapes ⇔ Dyck words w/o [ [ ] ] motifs

[ ][ [ [ [ ] [ ] ] [ ] ] ] ] ][ [

Grammar:
S → [ T ] S | [ T ]
T → [ T ] S | ε

[Figure: parse tree of the example word, with nodes labeled S and T]

Generating function:

Asymptotics:
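One way to see what this grammar counts is to apply the DSV rules to it. Reading the grammar as S → [ T ] S | [ T ] and T → [ T ] S | ε, and writing u = z² for one bracket pair, the rules give the system S = uT(1 + S), T = 1 + uTS. This system is our own translation, not a formula taken from the slides. The sketch below solves it by fixed-point iteration on truncated series and compares the result with a brute-force count of balanced bracket words in which no bracket encloses exactly one other bracket pair. All names are illustrative assumptions.

```python
from itertools import product

def is_pi_shape(word):
    """True if word is a non-empty balanced bracket word with no [ [ ... ] ] nesting."""
    stack, partner = [], {}
    for pos, ch in enumerate(word):
        if ch == "[":
            stack.append(pos)
        elif not stack:
            return False
        else:
            partner[stack.pop()] = pos
    if stack or not word:
        return False
    # forbidden: a pair (i, j) whose content is exactly one pair (i+1, j-1)
    return all(partner.get(i + 1) != j - 1 for i, j in partner.items())

def brute_force_counts(pairs_max):
    """Number of pi-shapes with k bracket pairs, for k = 1..pairs_max."""
    return [sum(is_pi_shape("".join(w)) for w in product("[]", repeat=2 * k))
            for k in range(1, pairs_max + 1)]

def dsv_counts(pairs_max):
    """Coefficients of u^1..u^pairs_max of S, from S = uT(1+S), T = 1 + uTS."""
    size = pairs_max + 1
    S, T = [0] * size, [1] + [0] * (size - 1)
    for _ in range(size):        # each pass fixes at least one more coefficient
        ts = [sum(T[i] * S[n - i] for i in range(n + 1)) for n in range(size)]
        S, T = ([0] + [T[n - 1] + ts[n - 1] for n in range(1, size)],
                [1] + [ts[n - 1] for n in range(1, size)])
    return S[1:]

print(brute_force_counts(6))
print(dsv_counts(6))
```

The brute-force enumeration is exponential in the word length and is only there to cross-check the first few coefficients.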

Shapes

Goal: Number of shapes (S) compatible with a given RNA sequence S.

First model: Every base can pair with every other base

⇒ (S) is the number of π-Shapes having length ≤ n

Problem: We only know the number of π-Shapes having length = n

⇒ Build a language whose words are π-shapes, prefixed by a sequence of a new dummy symbol ■

S → [ T ] S | [ T ]

T → [ T ] S | ε

R → ■ R | S

By virtually removing the dummy characters in words of size n generated from R, we get all π-Shapes of size ≤ n.


Second model: Terminal loops require a minimum number of bases (steric constraint)

⇒ Introduce a new dummy symbol and get rid of it later!

The derivation from T that ends a branch actually creates a terminal loop, so make it cost that minimum number of bases in the matching sequence.

[ ][ [ [ [ ] [ ] ] [ ] ] ] ] ][ [

[Figure: parse tree of the example word, with nodes labeled S and T]

Equivalent to counting the number of π-shapes of size n-t, where t is the number of terminal loops in

S → [ T ] S | [ T ]

T → [ T ] S |

R → ■ R | S

Shapes

Second model: Terminal loops require a minimum number of bases (steric constraint)

Grammar:

S → [ T ] S | [ T ]

T → [ T ] S |

R → ■ R | S

Generating function:

π’-Shapes

Model: Everything can base-pair + terminal loops at least base-long

Introduce unpaired bases between helices while avoiding and [ [ ] ] patterns

Grammar:

S → U [ T ] S | U

T → U [ T ] U [ T ] S | [ T ] | [ T ] | [ T ] | 3

R → ■ R | S

U → |

Generating function:


A bijection between π-Shapes and Motzkin words

Def. Motzkin words: Words in {(, ), •}* whose restriction to {(, )} is well-parenthesized

Naturally encodes:

• Positive paths from (0,0) to (n,0) using steps (+1,+1), (+1,-1) and (+1,0)
• Unary/binary trees
• Non-intersecting drawings of any number of chords among n ordered points on a circle
• RNA secondary structures (( ) pattern forbidden)

Example: ( ( ) ( ( ) ) ( ) ( ( ( ) ) ( ) ) ( ) )

A bijection between π-Shapes and Motzkin words

Generating function for Motzkin words:

Generating function for π-Shapes:

There exists a bijection between Motzkin words of size n and π-Shapes of size 2n+2!
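The size correspondence claimed here can be checked on small cases by counting both sides. The sketch below generates all words of the shape grammar, read as S → [ T ] S | [ T ] and T → [ T ] S | ε, with a given number of bracket pairs, and compares their number with the Motzkin numbers. It only compares counts and does not construct the bijection itself. The function names are illustrative assumptions.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def words_S(pairs):
    """All words derivable from S that use exactly `pairs` bracket pairs."""
    out = []
    for inner in range(pairs):                  # pairs used inside the first bracket
        rest = pairs - 1 - inner
        for t in words_T(inner):
            if rest == 0:
                out.append("[" + t + "]")       # S -> [ T ]
            else:                               # S -> [ T ] S
                out.extend("[" + t + "]" + s for s in words_S(rest))
    return tuple(out)

@lru_cache(maxsize=None)
def words_T(pairs):
    """All words derivable from T that use exactly `pairs` bracket pairs."""
    if pairs == 0:
        return ("",)                            # T -> epsilon
    out = []
    for inner in range(pairs - 1):              # T -> [ T ] S, S needs >= 1 pair
        for t in words_T(inner):
            out.extend("[" + t + "]" + s for s in words_S(pairs - 1 - inner))
    return tuple(out)

def motzkin_numbers(n_max):
    """Motzkin numbers M_0..M_n_max via M_n = M_{n-1} + sum_k M_k * M_{n-2-k}."""
    m = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        m[n] = m[n - 1] + sum(m[k] * m[n - 2 - k] for k in range(n - 1))
    return m

# shapes of size 2n+2 have n+1 bracket pairs
shape_counts = [len(words_S(n + 1)) for n in range(8)]
print(shape_counts)
print(motzkin_numbers(7))
print(words_S(3))
```

Because the grammar is unambiguous, each shape is generated exactly once, so the length of each word list is the number of shapes of that size.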

π-Shapes ⇔ Motzkin words

Illustration on trees:

[ [ [ ] [ ] ] [ ] [ ] ] [ ]

[Figure: illustration on trees. Panels: shape (length 2n+2); combinatorial binary tree (2n+2 edges); binary tree (n edges); Motzkin words (n edges). Steps: Dyck’s bijection, tree pruning.]

π-Shapes ⇔ Motzkin words

Is there a way to transpose the constraint for π-Shapes into a ’ constraint over Motzkin words?

+ Terminal loops appear in both structures

(If so, π-Shapes ⇔ ’ RNA secondary structures)

Short answer: No! Why?

Because !

π-Shapes ⇔ Motzkin words

(Not so) long answer: Because there is a bigger (though comparable) number of terminal loops in π-Shapes of size 2n+2 than in Motzkin words of size n.

⇒ A constraint acting on terminal loops has more impact on π-shapes than on Motzkin words.

⇒ The proof requires multivariate generating function analysis.

Multivariate analysis

Idea: Introduce a new character t, whose length is 0, which marks each occurrence of a terminal loop.

S → ( T ) S | S |

T → ( T ) S | S | t

⇒ Transpose into a system of functional equations:

Let be the number of Motzkin words of size n having k terminal loops

And the number terminal loops in a Motzkin word

Multivariate analysis

From , the average number of terminal loops is just a derivative away:

(Also holds for the average number of occurrences of any given character!)

Taking the ratio of these quantities and doing the same analysis for π-shapes yields the claimed results.

Conclusion and perspectives

DSV method + Singularity analysis
⇒ Easy asymptotics for discrete structures
⇒ Perfect for computational biology

Derived asymptotic growth for the numbers of π-shapes under different models

Yet another bijection between Motzkin words and a subclass of Dyck words

How to add probabilities for base-pairing to these models?

Why do the equations simplify and yield beautiful results when is even?

References

[E.A. Bender] Asymptotic methods in enumeration. SIAM Rev., 16(4):485-515, 1974.

[P. Flajolet] Singular combinatorics. In Proceedings of the International Congress of Mathematicians, Vol. 3, World Scientific, pp. 561-57, 2002.

[P. Flajolet and A. M. Odlyzko] Singularity analysis of generating functions. SIAM Journal on Discrete Mathematics, 3:216-240, 1990.

[R. Giegerich, B. Voss, and M. Rehmsmeier] Abstract shapes of RNA. Nucleic Acids Res., 32(16):4843-4851, 2004.

[A.M. Odlyzko] Asymptotic enumeration methods. In Handbook of Combinatorics, Volume II, pages 1063-1230. Elsevier Science B. V. and MIT Press, Amsterdam and Cambridge, 1995.

[A. Meir and J.W. Moon] On an asymptotic method in enumeration. Journal of Combinatorial Theory, Series A, 51:77-89, 1989.

[P.R. Stein and M.S. Waterman] On some new sequences generalizing the Catalan and Motzkin numbers. Discrete Math., 26:261-272, 1978.
