yann ponty andy lorenz peter clote biology department boston college asymptotics of rna shapes
Post on 21-Dec-2015
Yann Ponty, Andy Lorenz, Peter Clote
Biology Department, Boston College
Asymptotics of RNA shapes
Talk outline
1. Biological motivation
2. General approach used
3. Results for RNA shapes
Biological motivation
Primary structure
• Sequence of A's, C's, G's and U's
Secondary structure
• By definition it is (half-)planar in nature
• Commonly computationally tractable
Tertiary structure
• Ultimate goal, but difficult to predict
Biological motivation – RNA secondary structure
…(((…((((((((…....)))))…))).(((….)))..)))…
Three equivalent representations:
• Picture representation (planar graph representation)
• Balanced parenthesis sequence representation
• Non-negative excursion (path that starts and ends at 0 but is never negative)
[Figure: the example structure, n=53, with its two terminal loops labeled]
Biological motivation – RNA secondary structure
[Figure: a structure containing a pseudoknot, which is not allowed in secondary structure]
No crossing is allowed in secondary structure. Such crossings are called pseudoknots.
Biological motivation – RNA secondary structure
In many algorithms, the set of all possible secondary structures is the search space. The size of this search space can limit how large an RNA a given algorithm can be applied to.
Let S(n) denote the number of secondary structures on a sequence of length n. As described, the S(n) are known as the Motzkin numbers, and their asymptotics for large n are
$$S(n) \sim \frac{3\sqrt{3}}{2\sqrt{\pi}} \cdot \frac{3^n}{n^{3/2}} \approx 1.46581 \cdot \frac{3^n}{n^{3/2}}$$
Similarly, if we force terminal loops to be of length 1 or more, we get a different asymptotic growth (Stein, Waterman 1978):
$$S(n) \sim \sqrt{\frac{15+7\sqrt{5}}{8\pi}} \; n^{-3/2} \left(\frac{3+\sqrt{5}}{2}\right)^n \approx 1.104366 \cdot \frac{2.618034^n}{n^{3/2}}$$
In any case, these numbers grow fast. Can we find equivalence classes of these structures that do not grow so fast?
Biological motivation – RNA shapes – π shapes
[Figure: two secondary structures drawn as planar graphs; labels mark a bulge, an internal loop, a multi-loop, the terminal loops and the helix regions. Legend: unpaired base, paired base]
π-shapes try to capture the basic shape of a secondary structure. Bulges and internal loops are ignored. The unpaired bases in multi-loops and terminal loops are ignored. Helix regions are collapsed into 1 bracket. The π-shape of both of the above secondary structures is the same, namely
[ [ ] [ ] ]
Biological motivation – RNA shapes – π shapes
…(((…((((((((…….)))))…))).(((.(((…)))..))).)))…
[ [ ] [ ] ]
[Figure: the brackets of the π-shape placed on the helix regions of the drawn structure]
Biological motivation – RNA shapes – π’ shapes
π’-shapes try to capture more aspects of secondary structure. Bulges and internal loops are not ignored. Any group of unpaired bases is reduced to one dot. Helix regions are collapsed into 1 bracket. The π’-shapes for the above secondary structures are
.[.[[.].].[.[.].].]. and .[.[.][.].].
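The two abstractions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is assumed to contain only `(`, `)` and `.`, and each `…` in the slide's example is assumed to stand for three unpaired bases. A pair is dropped when it is stacked directly on an inner pair, which collapses each helix region to one bracket; the π-shape is obtained by first deleting all unpaired bases, so that bulges and internal loops also collapse.

```python
def pair_table(structure):
    """Map every paired position to its partner; reject unbalanced input."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def pi_prime_shape(structure):
    """One dot per run of unpaired bases, one bracket pair per helix region."""
    partner = pair_table(structure)
    out = []
    for i, ch in enumerate(structure):
        if ch == ".":
            if not out or out[-1] != ".":
                out.append(".")
            continue
        lo, hi = sorted((i, partner[i]))
        # The pair (lo, hi) is stacked if (lo+1, hi-1) is also a pair.
        stacked = lo + 1 < hi - 1 and partner.get(lo + 1) == hi - 1
        if not stacked:
            out.append("[" if ch == "(" else "]")
    return "".join(out)


def pi_shape(structure):
    """Ignore all unpaired bases, then collapse the resulting stacks."""
    return pi_prime_shape(structure.replace(".", ""))


example = "...(((...((((((((.......)))))...))).(((.(((...)))..))).)))..."
print(pi_shape(example))        # [[][]]
print(pi_prime_shape(example))  # .[.[[.].].[.[.].].].
```

Both outputs agree with the shapes quoted on the slides for this structure.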
[Figure: the two secondary structures again, with the multi-loop, helix regions, bulge and internal loop labeled. Legend: unpaired base, paired base]
Biological motivation – RNA shapes – π’ shapes
…(((…((((((((…….)))))…))).(((.(((…)))..))).)))…
.[.[[.].].[.[.].].].
[Figure: the brackets and dots of the π’-shape placed on the drawn structure]
Talk outline
1. Biological motivation
2. General approach used
3. Results for RNA shapes
General approach – overview
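Standard method: Structure → Recursion relations → Functional relation for generating function → Asymptotics
Better method: Structure → Grammar → Generating Function → Asymptotics
(comparison 1): recursion relations versus grammar
(comparison 2): the two ways of getting from the generating function to the asymptotics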
General approach – Example to be worked through
For our example, we will use secondary structures with minimum terminal loop length of 1.
…(((…((((((((…....)))))…))).(((.)))..)))… OK (terminal loops of length 7 and length 1)
..(((..().((..((..)))) not OK (terminal loops of length 0 and length 2)
We know the asymptotics of this to be
$$S(n) \sim \sqrt{\frac{15+7\sqrt{5}}{8\pi}} \; n^{-3/2} \left(\frac{3+\sqrt{5}}{2}\right)^n \approx 1.104366 \cdot \frac{2.618034^n}{n^{3/2}}$$
Standard method – recursion relations
Let $S(z) = \sum_{n=1}^{\infty} S_n z^n$, where $S_n$ is the number of secondary structures on a sequence of length n.
2 cases for the last position n:
• n is not base paired: $S_{n-1}$ structures
• n is base paired with position k: $S_{k-1} \cdot S_{n-k-1}$ structures
From this we see that
$$S_n = S_{n-1} + \sum_{k=1}^{n-2} S_{k-1} \cdot S_{n-k-1}$$
Now, being careful with initial conditions, and with a bit of algebraic manipulation, we eventually get to the relation
$$S = z + Sz + Sz^2 + S^2 z^2$$
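The recursion can be evaluated directly. The sketch below assumes the initial condition S_0 = 1 (the empty structure), as used in the derivation on the next slide; the function name is illustrative.

```python
def count_structures(n_max):
    """S_n = S_{n-1} + sum_{k=1}^{n-2} S_{k-1} * S_{n-k-1}, with S_0 = 1."""
    s = [1]
    for n in range(1, n_max + 1):
        total = s[n - 1]                      # position n is unpaired
        for k in range(1, n - 1):             # position n pairs with k = 1..n-2
            total += s[k - 1] * s[n - k - 1]
        s.append(total)
    return s


print(count_structures(8))  # [1, 1, 1, 2, 4, 8, 17, 37, 82]
```

The upper bound k = n-2 is what enforces a terminal loop of length at least 1: at least one position must lie strictly between k and n.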
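Standard method – recursion relations
First note
$$S^2 = \left(\sum_{n=1}^{\infty} S_n z^n\right)^2 = \sum_{n=1}^{\infty} \left(\sum_{k=1}^{n-1} S_k S_{n-k}\right) z^n. \quad (1)$$
By induction we first get $S_n = S_{n-1} + \sum_{k=1}^{n-2} S_{k-1} \cdot S_{n-k-1}$. Replacing n by n+2, we have $S_{n+2} = S_{n+1} + \sum_{k=1}^{n} S_{k-1} \cdot S_{n-(k-1)}$. Substituting r for k-1, we have $S_{n+2} = S_{n+1} + \sum_{r=0}^{n-1} S_r \cdot S_{n-r}$. Since $S_0 = 1$, we have $\sum_{r=0}^{n-1} S_r \cdot S_{n-r} = S_n + \sum_{r=1}^{n-1} S_r \cdot S_{n-r}$, so
$$S_{n+2} - S_{n+1} - S_n = \sum_{r=1}^{n-1} S_r \cdot S_{n-r}.$$
Now, by (1),
$$S^2 = \sum_{n=1}^{\infty} (S_{n+2} - S_{n+1} - S_n) z^n = \sum_{n=1}^{\infty} S_{n+2} z^n - \sum_{n=1}^{\infty} S_{n+1} z^n - \sum_{n=1}^{\infty} S_n z^n.$$
Note that
$$\frac{S - S_1 z - S_2 z^2}{z^2} = \sum_{n=1}^{\infty} S_{n+2} z^n \quad \text{and} \quad \frac{S - S_1 z}{z} = \sum_{n=1}^{\infty} S_{n+1} z^n.$$
Thus, using $S_1 = S_2 = 1$,
$$S^2 = \frac{S - z - z^2}{z^2} - \frac{S - z}{z} - S.$$
Multiply by $z^2$ to get
$$z^2 S^2 = S - z - z^2 - zS + z^2 - S z^2$$
so
$$z^2 S^2 - S(1 - z - z^2) + z = 0$$
and thus
$$S = z^2 S^2 + S z^2 + S z + z.$$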
Better method – grammar
Grammars are a way of generating a language. We will restrict to the simple case of context-free grammars. The context-free grammar for secondary structures (with minimum terminal loop length 1) is given by
S → ● | S ● | ( S ) | S ( S )
Here our terminal symbols (the letters of the language this grammar describes) are (, ) and ●. The language generated is exactly the set of secondary structures (with minimum terminal loop length 1).
To generate a word, we replace a non-terminal symbol by one of the right-hand sides of its rule. For example, in the above language we could generate a word as follows (always substituting the left-most non-terminal):
S → S ( S ) → S ● ( S ) → ● ● ( S ) → ● ● ( S ● ) → ● ● ( S ( S ) ● ) → ● ● ( ● ( S ) ● ) → ● ● ( ● ( ● ) ● )
In addition, because this grammar is unambiguous, there is no different derivation of this same word.
Better method – grammar
S → ● | S ● | ( S ) | S ( S )
[Figure: parse tree of the word ● ● ( ● ( ● ) ● ), with internal nodes S and leaves (, ) and ●]
Here is the parse tree for the same word.
Reading the leaves of the tree from left to right gives the same word, ● ● ( ● ( ● ) ● ). This is the only tree giving rise to this word, because the grammar is unambiguous (which can be shown by induction).
Better method – grammar
Type of nonterminal    Equation for the l.g.f.
S → T | U              S(z) = T(z) + U(z)
S → T U                S(z) = T(z) U(z)
S → t                  S(z) = z
S → ε                  S(z) = 1
Given an unambiguous context-free grammar, we can find the corresponding generating function relations using the above rules. This is very fast!
S → ● | S ● | ( S ) | S ( S )
$S = z + zS + z^2 S + z^2 S^2$
This is significantly faster than the other method at getting here! This method for getting the relations on the generating function is sometimes called the DSV method.
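As a sanity check, the coefficients of S(z) can be read off the functional relation S = z + zS + z²S + z²S² term by term, since the coefficient of z^n on the right-hand side only involves lower-order coefficients of S. The helper name below is illustrative.

```python
def coefficients_from_relation(n_max):
    """Coefficients s[0..n_max] of S(z) defined by S = z + zS + z^2 S + z^2 S^2."""
    s = [0] * (n_max + 1)  # s[0] = 0: the grammar generates no empty word
    for n in range(1, n_max + 1):
        value = 1 if n == 1 else 0                                # from z
        value += s[n - 1]                                         # from z * S
        if n >= 2:
            value += s[n - 2]                                     # from z^2 * S
            value += sum(s[i] * s[n - 2 - i] for i in range(n - 1))  # from z^2 * S^2
        s[n] = value
    return s


print(coefficients_from_relation(8)[1:])  # [1, 1, 2, 4, 8, 17, 37, 82]
```

These are the same numbers that the recursion relation of the standard method produces for n ≥ 1.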
General approach – overview
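Standard method: Structure → Recursion relations → Functional relation for generating function → Asymptotics
Better method: Structure → Grammar → Generating Function → Asymptotics
(comparison 1): recursion relations versus grammar
(comparison 2): the two ways of getting from the generating function to the asymptotics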
Standard method – getting asymptotics
The following theorem, due to Meir and Moon and modified by Odlyzko, can be used to get the asymptotics.

Suppose that $f(z) = \sum_{n=1}^{\infty} f_n z^n$ is analytic at $z = 0$, that $f_n \ge 0$ for all n, and that $f(z) = G(z, f(z))$, where $G(z,w) = \sum_{m,n \ge 0} g_{m,n} z^m w^n$. Suppose that there exist real numbers $\delta, r, s > 0$ such that
• $G(z,w)$ is analytic in $|z| < r + \delta$ and $|w| < s + \delta$,
• $G(r,s) = s$, $G_w(r,s) = 1$,
• $G_z(r,s) \ne 0$ and $G_{ww}(r,s) \ne 0$.
Suppose that $g_{m,n}$ is real and non-negative for all m, n, that $g_{0,0} = 0$, $g_{0,1} \ne 1$, and $g_{m,n} > 0$ for some m and some $n \ge 2$. Assume further that there exist $h > j > i \ge 1$ such that $f_h f_i f_j \ne 0$ while the greatest common divisor of $j - i$ and $h - i$ is 1. Then $f(z)$ converges at $z = r$, $f(r) = s$, and
$$f_n = [z^n] f(z) \sim \sqrt{\frac{r \, G_z(r,s)}{2\pi \, G_{ww}(r,s)}} \; r^{-n} n^{-3/2}.$$

For us, we identify S with w and get $G(z,w) = z + zw + z^2 w + z^2 w^2$.
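A numerical plug-in of the theorem for this G can be sketched as follows. The value r = (3 - √5)/2 is the dominant singularity found later in the talk, and s is obtained here by solving G_w(r, s) = 1; the partial derivatives are written out by hand. This only checks the constant and does not verify the theorem's hypotheses.

```python
import math


def G(z, w):
    return z + z * w + z * z * w + z * z * w * w


r = (3 - math.sqrt(5)) / 2
# G_w(z, w) = z + z^2 + 2 z^2 w; solve G_w(r, s) = 1 for s.
s = (1 - r - r * r) / (2 * r * r)

G_z = 1 + s + 2 * r * s + 2 * r * s * s   # partial derivative in z at (r, s)
G_ww = 2 * r * r                          # second partial derivative in w

constant = math.sqrt(r * G_z / (2 * math.pi * G_ww))
print(abs(G(r, s) - s) < 1e-12)   # fixed-point condition G(r, s) = s holds
print(round(constant, 6), round(1 / r, 6))
```

The constant and the growth rate 1/r reproduce the 1.104366 and 2.618034 of the known asymptotics.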
Standard method – getting asymptotics
What is good and bad about the Bender-Meir-Moon theorem?
The good
• If all of the conditions are satisfied, one merely has to plug into the formula to get the answer.
• This approach, unlike the one we are going to describe, does not need an explicit formula for the generating function.
The bad
• If not all of the conditions are satisfied (as happened to us), you cannot use the theorem. There are many conditions, and they are very constraining.
• This approach only tackles one (common) kind of singularity in the generating function. The approach we show next is much more general in this respect.
The ugly…
Better method – getting asymptotics
We now describe a more general approach, as described by Flajolet and Odlyzko (1990).
Start with the relation for the generating function given by the grammar.
$$S = z + zS + z^2 S + z^2 S^2$$
Solve for the generating function.
$$S_{\pm} = \frac{1 - z - z^2 \pm \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$
Choose the solution that is analytic at 0 (a necessary condition for generating functions).
$$S = \frac{1 - z - z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$
This function will be analytic except possibly where the denominator is zero (in this case it is analytic at z=0), and where the expression under the square root is zero (since $z^{1/2}$ is not analytic at 0).
Better method – getting asymptotics
$$S = \frac{1 - z - z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$
The above function is non-analytic at the roots of the polynomial
$$1 - 2z - z^2 - 2z^3 + z^4$$
These are found to be roots of unity (modulus 1) and
$$z = \frac{3+\sqrt{5}}{2}, \quad z = \frac{3-\sqrt{5}}{2}$$
The dominant singularity, ρ, is the one with the smallest modulus. Thus
$$\rho = \frac{3-\sqrt{5}}{2}$$
This means the series $S = \sum_{n=0}^{\infty} S_n z^n$ converges for $|z| < \rho$ and diverges for $|z| > \rho$. We immediately get that the terms $S_n$ grow exponentially at a rate of
$$1/\rho = \frac{2}{3-\sqrt{5}} = \frac{3+\sqrt{5}}{2}$$
Thus we know $S_n \approx \left(\frac{3+\sqrt{5}}{2}\right)^n$
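The root-finding step can be checked numerically. The sketch below evaluates the quartic at the four claimed roots (the two real roots and the two primitive cube roots of unity, which is what "roots of unity" amounts to here) and picks the one of smallest modulus; variable names are illustrative.

```python
import cmath
import math


def p(z):
    """The polynomial under the square root."""
    return 1 - 2 * z - z**2 - 2 * z**3 + z**4


candidates = [
    (3 + math.sqrt(5)) / 2,
    (3 - math.sqrt(5)) / 2,
    cmath.exp(2j * math.pi / 3),    # primitive cube roots of unity, modulus 1
    cmath.exp(-2j * math.pi / 3),
]
residuals = [abs(p(z)) for z in candidates]
rho = min(candidates, key=abs)      # dominant singularity
growth = 1 / rho

print(max(residuals) < 1e-9)
print(round(rho, 6), round(growth, 6))
```

Equivalently, the quartic factors as (1 - 3z + z²)(1 + z + z²), which makes the four roots explicit.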
Better method – finer asymptotics
Now we use the theorem by Flajolet and Odlyzko.
Assume that f(z) is analytic in $\Delta \setminus \{1\}$, and that as $z \to 1$ in $\Delta$,
$$f(z) \sim K (1 - z)^{\alpha}$$
Then, as $n \to \infty$, if $\alpha \notin \{0, 1, 2, \ldots\}$,
$$f_n \sim \frac{K}{\Gamma(-\alpha)} n^{-\alpha-1}.$$
If the singularities are isolated, and if there is a unique dominant singularity, the rescaled function will always be analytic in the required region, $\Delta \setminus \{1\}$.
The theorem basically states that the growth of the terms is determined completely by the singularity. The rescaling of the singularity to 1 merely gets rid of the exponential portion in the answer, which we already know.
[Figure: the region Δ in the complex plane, with the dominant singularity at 1, the external singularities, and the parameters φ and ε marked]
Better method – finer asymptotics
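$$S = \frac{1 - z - z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2} = \frac{1 - z - z^2}{2z^2} - \frac{\sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$
The first term is slower growing, as it does not contain the singularity. The second term is the only part that matters for the asymptotics:
$$S_0 = -\frac{\sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}$$
$$S_0 = -\frac{\left(P_2(z)(1 - z/\rho)\right)^{1/2}}{2z^2} = -\frac{P_2(z)^{1/2}}{2z^2} (1 - z/\rho)^{1/2}$$
where $P_2(z)$ is determined by dividing out the dominant singularity.
The factor $-\frac{P_2(z)^{1/2}}{2z^2}$ is the portion in front of the dominant singularity, and the exponent of $(1 - z/\rho)$ gives α = 1/2 in the theorem.
Evaluate that portion at the dominant singularity ρ:
$$K = -\frac{P_2(\rho)^{1/2}}{2\rho^2}$$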
Better method – finer asymptotics
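$$K = -\frac{P_2(\rho)^{1/2}}{2\rho^2}, \quad \alpha = 1/2, \quad \text{and (from rescaling)} \quad 1/\rho = \frac{3+\sqrt{5}}{2}$$
So from the theorem we get
$$S_n \sim \frac{K}{\Gamma(-\alpha)} \, n^{-\alpha-1} \left(\frac{1}{\rho}\right)^n$$
Plugging in gives
$$S(n) \sim \sqrt{\frac{15+7\sqrt{5}}{8\pi}} \; n^{-3/2} \left(\frac{3+\sqrt{5}}{2}\right)^n \approx 1.104366 \cdot \frac{2.618034^n}{n^{3/2}}$$
as desired.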
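The plug-in step can be reproduced numerically and compared against exact counts. In this sketch, P_2(z) = (1 - ρz)(1 + z + z²) is an assumption obtained by dividing the quartic by (1 - z/ρ), the exact counts come from the recursion relation with S_0 = 1, and the comparison is done on logarithms to avoid floating-point overflow.

```python
import math


def count_structures(n_max):
    """Exact counts from S_n = S_{n-1} + sum_{k=1}^{n-2} S_{k-1} S_{n-k-1}."""
    s = [1]
    for n in range(1, n_max + 1):
        s.append(s[n - 1] + sum(s[k - 1] * s[n - k - 1] for k in range(1, n - 1)))
    return s


rho = (3 - math.sqrt(5)) / 2
alpha = 0.5
p2_at_rho = (1 - rho * rho) * (1 + rho + rho * rho)   # P_2 evaluated at rho
K = -math.sqrt(p2_at_rho) / (2 * rho**2)
constant = K / math.gamma(-alpha)                     # Gamma(-1/2) = -2*sqrt(pi)

n = 1000
exact = count_structures(n)[n]
log_asymptotic = math.log(constant) - (alpha + 1) * math.log(n) + n * math.log(1 / rho)
log_gap = math.log(exact) - log_asymptotic

print(round(constant, 6))   # 1.104366
print(abs(log_gap) < 0.02)  # exact and asymptotic values agree to within about 2%
```

The remaining gap shrinks roughly like 1/n, as expected for a first-order asymptotic estimate.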
Method
Combinatorial class → Language (via codes, bijections) → Grammar (via ?) → Generating function (via DSV) → Asymptotics (via singularity analysis)
π-Shapes
Shapes ≈ Trees ≈ Well-parenthesized language (Dyck words)
[Figure: an example Dyck word and the corresponding tree]
Grammar: S → [ S ] S | ε
π-Shapes
But nested motifs [ [ ] ] must be prohibited!
Otherwise many different shapes may correspond to a single structure!
[ ]
[ [ ] ]
[ [ [ ] ] ]
…
π-Shapes
π-Shapes ⇔ Dyck words w/o [ [ ] ] motifs
[Figure: an example π-shape and its derivation tree, with nodes labeled S and T]
Grammar:
S → [ T ] S | [ T ]
T → [ T ] S | ε
Generating function: [equation not preserved in the transcript]
Asymptotics: [equation not preserved in the transcript]
π-Shapes
Goal: Number of π-shapes compatible with a given RNA sequence S of length n.
First model: Every base can pair with every other base
⇒ This is the number of π-shapes having length at most n
Problem: We only know the number of π-shapes having length = n
⇒ Build a language whose words are π-shapes, prefixed by a sequence of a new dummy symbol ■
S → [ T ] S | [ T ]
T → [ T ] S | ε
R → ■ R | S
By virtually removing the dummy characters in words of size n generated from R, we get all π-shapes of size at most n.
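The effect of the dummy-symbol rule can be sketched by translating the grammar with the DSV rules, assuming each bracket and each ■ has size 1 and the empty word has size 0. This gives S = z²TS + z²T, T = z²TS + 1 and R = zR + S, so the coefficients of R are running sums of the coefficients of S. The function names are illustrative.

```python
def shape_counts(n_max):
    """s[n] = number of pi-shapes of length n, from S = z^2 T S + z^2 T, T = z^2 T S + 1."""
    s = [0] * (n_max + 1)
    t = [0] * (n_max + 1)
    t[0] = 1                                   # T -> empty word
    for n in range(2, n_max + 1):
        conv = sum(t[i] * s[n - 2 - i] for i in range(n - 1))  # [z^n] z^2 T S
        t[n] = conv
        s[n] = conv + t[n - 2]                                 # plus [z^n] z^2 T
    return s


def shapes_up_to(n_max):
    """r[n] = number of words of size n generated from R -> dummy R | S."""
    s = shape_counts(n_max)
    r = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        r[n] = r[n - 1] + s[n]                 # from R = z R + S
    return r


print(shape_counts(10))   # [0, 0, 1, 0, 1, 0, 2, 0, 4, 0, 9]
print(shapes_up_to(10))   # [0, 0, 1, 1, 2, 2, 4, 4, 8, 8, 17]
```

Each entry of the second list is the number of π-shapes of length at most n, which is the quantity the first model asks for.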
π-Shapes
Goal: Number of π-shapes compatible with a given RNA sequence S.
Second model: Terminal loops require at least a minimum number of bases (steric constraint)
Equivalent to counting the number of π-shapes of size n-t, where t is the number of terminal loops in the shape
⇒ Introduce a new dummy symbol and get rid of it later!
The derivation T → ε actually creates a terminal loop, so make it cost that many bases in the matching sequence.
[Figure: a π-shape with its terminal loops highlighted, and its derivation tree with nodes labeled S and T]
S → [ T ] S | [ T ]
T → [ T ] S | ε
R → ■ R | S
π-Shapes
Second model: Terminal loops require at least a minimum number of bases (steric constraint)
Grammar:
S → [ T ] S | [ T ]
T → [ T ] S |
R → ■ R | S
Generating function: [equation not preserved in the transcript]
Asymptotics: [equation not preserved in the transcript]
π’-Shapes
Model: Everything can base-pair + terminal loops are at least a minimum number of bases long
Introduce unpaired bases between helices while avoiding [ [ ] ] patterns
S U [ T ] S | U
T U [ T ] U [ T ] S | [ T ] | [ T ] | [ T ] | 3
R ■ R | S
U |
Grammar: (above)
Generating function: [equation not preserved in the transcript]
Asymptotics: [equation not preserved in the transcript]
A bijection between π-Shapes and Motzkin words
Def. Motzkin words: Words in {(, ), ●}* whose restriction to {(, )} is well-parenthesized
Naturally encodes:
• Positive paths from (0,0) to (n,0) using steps (+1,+1), (+1,-1) and (+1,0)
• Unary/binary trees
• Non-intersecting drawings of any number of chords among n ordered points on a circle
• RNA secondary structures (( ) pattern forbidden)
Example:Example: ( ( ) ( ( ) ) ( ) ( ( ( ) ) ( ) ) ( ) )
A bijection between π-Shapes and Motzkin words
Generating function for Motzkin words: [equation not preserved in the transcript]
Generating function for π-Shapes: [equation not preserved in the transcript]
There exists a bijection between Motzkin words of size n and π-Shapes of size 2n+2!
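The counting consequence of this bijection is easy to check for small sizes. The sketch below counts π-shapes from the grammar S → [ T ] S | [ T ], T → [ T ] S | ε (translated with the DSV rules, each bracket having size 1) and compares them with Motzkin numbers computed from their standard recurrence. It checks equality of counts only; it does not construct the bijection itself.

```python
def shape_counts(n_max):
    """s[n] = number of pi-shapes of length n."""
    s = [0] * (n_max + 1)
    t = [0] * (n_max + 1)
    t[0] = 1
    for n in range(2, n_max + 1):
        conv = sum(t[i] * s[n - 2 - i] for i in range(n - 1))
        t[n] = conv               # T = z^2 T S + 1
        s[n] = conv + t[n - 2]    # S = z^2 T S + z^2 T
    return s


def motzkin_numbers(n_max):
    """M_n = M_{n-1} + sum_{k=0}^{n-2} M_k M_{n-2-k}, with M_0 = 1."""
    m = [1]
    for n in range(1, n_max + 1):
        m.append(m[n - 1] + sum(m[k] * m[n - 2 - k] for k in range(n - 1)))
    return m


N = 20
shapes = shape_counts(2 * N + 2)
motzkin = motzkin_numbers(N)
print(motzkin[:8])                                            # [1, 1, 2, 4, 9, 21, 51, 127]
print(all(shapes[2 * n + 2] == motzkin[n] for n in range(N + 1)))  # True
```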
π-Shapes ⇔ Motzkin words
Illustration on trees:
[Figure: the π-shape [ [ [ ] [ ] ] [ ] [ ] ] [ ] drawn as trees. Labels: shape (length 2n+2), Dyck's bijection, combinatorial binary tree (2n+2 edges), tree pruning, binary tree (n edges), Motzkin words (n edges)]
π-Shapes ⇔ Motzkin words
Is there a way to transpose the constraint for π-Shapes into a constraint over Motzkin words?
+ Terminal loops appear in both structures
(If so, π-Shapes ⇔ RNA secondary structures)
Short answer: No! Why?
Because [expression not preserved in the transcript]!
(Not so) long answer: Because there is a bigger (though comparable) number of terminal loops in π-Shapes of size 2n+2 than in Motzkin words of size n.
⇒ Proof requires multivariate generating function analysis.
⇒ A constraint acting on terminal loops has more impact on π-shapes than on Motzkin words.
Multivariate analysis
Idea: Introduce a new character t, whose length is 0, which marks each occurrence of a terminal loop.
S → ( T ) S | ● S | ε
T → ( T ) S | ● S | t
⇒ Transpose into a system of functional equations: [equations not preserved in the transcript]
Consider the number of Motzkin words of size n having k terminal loops, and the number of terminal loops in a Motzkin word.
Multivariate analysis
From the generating function, the average number of terminal loops is just a derivative away: [equation not preserved in the transcript]
(This also holds for the average number of occurrences of any given character!)
Taking the ratio of these quantities, and doing the same analysis for π-shapes, yields the claimed results.
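For small sizes the comparison of terminal-loop counts can be done by brute force instead of by generating functions. The sketch below is an illustration under stated assumptions, not the multivariate analysis of the talk: a terminal loop in a Motzkin word is taken to be an opening parenthesis followed only by dots and then its closing parenthesis, a terminal loop in a π-shape is an occurrence of `[]`, and a π-shape is a Dyck word in which no bracket encloses exactly one bracket. All function names are hypothetical.

```python
import re
from itertools import product


def motzkin_words(n):
    """All words over '(', '.', ')' of length n that are well-parenthesized."""
    for w in product("(.)", repeat=n):
        depth = 0
        for ch in w:
            depth += (ch == "(") - (ch == ")")
            if depth < 0:
                break
        else:
            if depth == 0:
                yield "".join(w)


def is_pi_shape(w):
    """Dyck word in which no bracket has exactly one child bracket."""
    children = [0]                       # children[0] counts top-level brackets
    for ch in w:
        if ch == "[":
            children[-1] += 1
            children.append(0)
        elif len(children) == 1 or children.pop() == 1:
            return False                 # unbalanced, or a [ [ ] ] motif
    return len(children) == 1 and children[0] >= 1


def average_loops(n):
    words = list(motzkin_words(n))
    shapes = ["".join(w) for w in product("[]", repeat=2 * n + 2) if is_pi_shape(w)]
    loops_words = sum(len(re.findall(r"\(\.*\)", w)) for w in words)
    loops_shapes = sum(s.count("[]") for s in shapes)
    return len(words), len(shapes), loops_words / len(words), loops_shapes / len(shapes)


for n in range(1, 6):
    print(n, average_loops(n))
```

For every size tried, the two classes have the same cardinality, while the average number of terminal loops is larger in the π-shapes, in line with the "long answer" above.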
Conclusion and perspectives
DSV method + singularity analysis
⇒ Easy asymptotics for discrete structures
⇒ Perfect for computational biology
Derived the asymptotic growth of the number of π-shapes under different models
Yet another bijection between Motzkin words and a subclass of Dyck words
How to add probabilities for base-pairing to these models?
Why do the equations simplify and yield beautiful results when [symbol not preserved in the transcript] is even?
References
[E.A. Bender] Asymptotic methods in enumeration. SIAM Rev., 16(4):485-515, 1974.
[P. Flajolet] Singular Combinatorics. In Proceedings of the International Congress of Mathematicians, Vol 3, World Scientific, pp. 561-57, 2002.
[P. Flajolet and A. M. Odlyzko] Singularity analysis of generating functions. SIAM Journal on Discrete Mathematics, 3:216-240, 1990.
[R. Giegerich, B. Voss, and M. Rehmsmeier] Abstract shapes of RNA. Nucleic Acids Res., 32(16):4843-4851, 2004.
[A.M. Odlyzko] Asymptotic enumeration methods. In Handbook of Combinatorics, Volume II, pages 1063-1230. Elsevier Science B. V. and MIT Press, Amsterdam and Cambridge, 1995.
[A. Meir and J.W. Moon] On an asymptotic method in enumeration. Journal of Combinatorial Theory, Series A, 51:77-89, 1989.
[P.R. Stein and M.S. Waterman] On some new sequences generalizing the Catalan and Motzkin numbers. Discrete Math., 26:261-272, 1978.