Rapid ab initio RNA Folding Including Pseudoknots via Graph Tree Decomposition
Jizhen Zhao, Liming Cai, Russell Malmberg
Computer Science, Plant Biology, University of Georgia
Why another RNA folding algorithm?
• The need for RNA analysis tools has increased because of the number of recently found functional RNAs (i.e., ncRNAs).
• RNA folding algorithms are not completely satisfactory in spite of having been intensively studied for more than 25 years.
Increased number of ncRNAs
• ncRNAs have functions other than coding for proteins, e.g., as structural, catalytic, and regulatory factors
• ncRNA genes lack strong statistical signals such as ORFs or polyadenylation, except that
• transcribed ncRNA molecules can fold into stable (and unique) secondary or tertiary structures
Increased number of ncRNAs
• rRNAs and tRNAs
• RNA maturation: snRNA in recognizing splicing sites
• RNA modification: snoRNA converting uridine to pseudo-uridine
• Regulation of gene expression and translation: e.g., miRNAs
• DNA replication: e.g., telomerase RNAs as template for addition of telomeric repeats
• Etc.
In introns, intergenic regions, or 5’ and 3’ UTRs
Increased number of ncRNAs (Bompfunewerer et al., 2005)

Class                 Size (nt)            Function                                              Phylogenetic distribution
tRNA                  70-80                Translation                                           Ubiquitous
rRNA 16S/18S          1.5K                 Translation                                           Ubiquitous
rRNA 28S+5.8S/23S     3K                   Translation                                           Ubiquitous
rRNA 5S               130                  Translation                                           Ubiquitous
RNase P / MRP         220-440 / 250-350    tRNA maturation                                       Ubiquitous / Eukarya
snoRNA / telomerase   130 / 400-550        Pseudouridylation / addition of repeats
snRNA (U1 ~ U6)       100-600 / 130-140    Spliceosome, mRNA maturation                          Eukarya / Eukarya, archaea
U7 / 7SK              ~65 / ~300           Histone mRNA maturation / translational regulation    Eukaryotes / Vertebrata
tmRNA                 300-400              Tags protein for proteolysis                          Bacteria
miRNA                 ~22                  Post-transcriptional regulation                       Multicellular organisms
Long history of RNA folding
• First simple RNA folding algorithm (Nussinov 1978)
• Thermodynamics-based folding (Zuker & Stiegler 1981)
• Zuker’s (1989)
• mFOLD 3.2
• RNAfold (part of Vienna Package 1.6.1)
• Not all that accurate on a single sequence
• Inherent computational complexity from DP
• Unable to predict pseudoknots
Background
• Base pairings allow RNA to fold
Watson-Crick base pairs: A-U, C-G
Wobble pair: G-U
non-canonical pairs are also possible
[Figure: chemical structures of cytosine, guanine, uracil, and adenine, with a backbone fragment running 5’ to 3’]
Example sequence: 5’-u-u-c-c-g-a-a-g-c-u-c-a-a-c-g-g-g-a-a-a-u-g-a-g-c-u-3’
Secondary structure is important to tertiary structure
[Figure: secondary structure elements: hairpin loop, junction (multiloop), bulge loop, single-stranded region, interior loop, stem. Image: Wuchty]
Pseudoknot
[Figure: example structures illustrating stems, loops, and a pseudoknot]
stem (double helix): stacked base pairs
loop: strand of unpaired bases
Pseudoknots: crossing patterns of stems
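
The "crossing pattern" can be stated on base-pair indices: pairs (i, j) and (k, l) cross when i < k < j < l. The sketch below is only an illustration of that definition; the function names `pairs_cross` and `has_pseudoknot` are hypothetical and do not come from the slides.

```python
from itertools import combinations

def pairs_cross(p, q):
    """Return True if base pairs p and q interleave (i < k < j < l)."""
    # Normalise so that each pair is (low, high) and p starts first.
    (i, j), (k, l) = sorted([tuple(sorted(p)), tuple(sorted(q))])
    return i < k < j < l

def has_pseudoknot(pairs):
    """Return True if any two base pairs in the structure cross."""
    return any(pairs_cross(p, q) for p, q in combinations(pairs, 2))

nested = [(0, 10), (1, 9), (2, 8)]      # one ordinary stem
crossing = [(0, 10), (1, 9), (5, 15)]   # a second stem interleaves with the first
print(has_pseudoknot(nested), has_pseudoknot(crossing))
```

Nested or side-by-side pairs never satisfy the inequality, which is why algorithms restricted to nested structures cannot represent pseudoknots.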
Bacterial tmRNA consensus structure (Felden et al. 2001. NAR 29)
• terminates translation errors
Pseudoknots in TMV 3’ UTR
• Promotes efficient translation
• Binds EF1A, cooperates with 5’ UTR
(Leathers et al. 1993 MCB 13; Zeenko et al. 2002 JVI 76)
Previous work (Nussinov’s)
• maximizing the number of base pairs (Nussinov et al, 1978)
simple case(i, j) = 1
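
A minimal sketch of base-pair maximization by dynamic programming, assuming that the "simple case" means every allowed pair scores 1. The function name `nussinov_max_pairs` and the `min_loop` parameter are illustrative additions, not taken from the slide; the allowed pairs are the Watson-Crick and wobble pairs listed in the Background slide.

```python
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def nussinov_max_pairs(seq, min_loop=0):
    """Maximum number of nested base pairs in seq (each pair scores 1).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best score for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case: j is unpaired
            for k in range(i, j - min_loop):         # case: j pairs with k
                if (seq[k], seq[j]) in ALLOWED:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov_max_pairs("GGGAAACCC", min_loop=3))
```

The three nested loops give cubic time in the sequence length, and because the recurrence only combines nested intervals it cannot produce crossing pairs.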
Previous work (Zuker’s)
• Thermodynamic energy based method (Zuker and Stiegler 1981)
• Energy minimization algorithm: find the secondary structure that minimizes the free energy (ΔG)
ΔG calculated as the sum of individual contributions of:
– loops
– base pairs
– secondary structure elements
Previous work (Zuker’s)
• Free-energy values (kcal/mol at 37 °C)
• Energies of stems calculated as stacking contributions between neighboring base pairs
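
To make the stacking idea concrete, the sketch below sums one term per pair of neighboring base pairs in a stem. The function `stem_energy` is hypothetical, and the energy table used in the example is a uniform placeholder of -1.0 per stack; it is not a published parameter set and only illustrates the bookkeeping.

```python
from collections import defaultdict

def stem_energy(five_prime_arm, three_prime_arm, stacking):
    """Sum stacking terms over neighboring base pairs of one stem.

    five_prime_arm is read 5'->3' and three_prime_arm is read 3'->5',
    so position k of one arm pairs with position k of the other.
    stacking maps (pair, next_pair) to an energy in kcal/mol.
    """
    pairs = list(zip(five_prime_arm, three_prime_arm))
    return sum(stacking[(pairs[k], pairs[k + 1])] for k in range(len(pairs) - 1))

# Placeholder table: every stack contributes -1.0 (NOT real thermodynamic data).
placeholder = defaultdict(lambda: -1.0)

# A stem of 4 base pairs has 3 stacks between neighboring pairs.
print(stem_energy("GGGG", "CCCC", placeholder))
```

A stem of k base pairs contributes k - 1 stacking terms, so longer stems accumulate more stabilizing energy under this model.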
Previous work (Zuker’s)
MFOLD: computing loop dependent energies
Previous work (Zuker’s)
Difficult issues
• Energy associated with any position is only influenced by local sequence and structure
• mFOLD does not predict pseudoknots
• PKnots (Eddy and Rivas 1999) predicts restricted cases of pseudoknots in O(n^6) time and O(n^4) space
• Minimum-energy pseudoknot prediction is NP-hard (Lyngsø and Pedersen 2000)
Pseudoknots drastically increase the complexity
Heuristic RNA folding algorithms
ILM (Ruan et al 2004)
HotKnots (Ren et al 2005)
• Usually fast, sometimes slow
• Unlimited class of pseudoknots
• Do not guarantee the optimality of the predicted structure
This work
• Graph-theoretic, avoids nucleotide-level DP
• Unlimited pseudoknot structures
• Optimal solutions
• Fast
• Comparable performance in accuracy
This work (summary)
1. Model: similar to ILM, without loop energy
2. Approach:
   • Find all stable stems and construct a stem graph
   • Reduce folding to the independent set problem
3. Techniques:
   • Tree-decompose the stem graph
   • DP to obtain an optimal solution
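
The first step, finding candidate stems, can be sketched as a scan for maximal runs of stacked complementary pairs. This is an illustration under stated assumptions: stability is approximated here by a minimum stem length rather than an energy threshold, and the names `find_stems`, `min_len`, and `min_loop` are hypothetical, not taken from the slides.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def find_stems(seq, min_len=3, min_loop=3):
    """List stems as (i, j, length): i+k pairs with j-k for k < length.

    A stem is reported only from its outermost possible pair, and only if
    it has at least min_len stacked pairs and encloses >= min_loop bases.
    """
    seq = seq.upper()
    n = len(seq)
    stems = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip starts that would merely be the inside of a longer helix.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            k = 0
            while (j - k) - (i + k) > min_loop and (seq[i + k], seq[j - k]) in PAIRS:
                k += 1
            if k >= min_len:
                stems.append((i, j, k))
    return stems

print(find_stems("GGGGAAAACCCC"))
```

Each reported stem becomes one candidate vertex; alternative, mutually incompatible stems are all kept at this stage and resolved later by the optimization.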
This work (approach)
A set of non-overlapping stems corresponds to an independent set of the stem graph.
The weight of each vertex is related to the energy of the corresponding stem.
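
A minimal sketch of the stem graph, assuming that two stems conflict when they share at least one nucleotide position. Stems are written as hypothetical (i, j, length) triples, meaning position i+k pairs with position j-k; the function names are illustrative only.

```python
from itertools import combinations

def stem_positions(stem):
    """Set of sequence positions used by a stem (i, j, length)."""
    i, j, length = stem
    return set(range(i, i + length)) | set(range(j - length + 1, j + 1))

def build_stem_graph(stems):
    """Adjacency sets: an edge joins two stems that share a position."""
    adj = {s: set() for s in stems}
    for a, b in combinations(stems, 2):
        if stem_positions(a) & stem_positions(b):
            adj[a].add(b)
            adj[b].add(a)
    return adj

stems = [(0, 20, 4), (2, 30, 3), (5, 15, 3)]
graph = build_stem_graph(stems)
print(graph[(0, 20, 4)])   # only the stem that reuses positions 2 and 3
```

Stems that cross without sharing positions get no edge under this assumption, so an independent set of the graph may contain pseudoknots.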
This work (techniques)
A tree decomposition of the stem graph
Tree width t = 4
Find an approximate tree decomposition of width t
MWIS can be found in time O(2^t N), where N = O(n^2), by DP over the tree
Time can be improved to O(e^(t/e)) = O(1.44^t)
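
The DP over the tree can be sketched as follows. This is a generic textbook formulation of maximum weight independent set (MWIS) on a tree decomposition, not necessarily the authors' implementation: the decomposition is supplied by hand rather than computed, and all names (`mwis_tree_decomposition`, `bags`, `tree`) are hypothetical.

```python
from itertools import combinations

def independent_subsets(bag, adj):
    """Yield every subset of bag that contains no edge of the graph."""
    bag = sorted(bag)
    for r in range(len(bag) + 1):
        for combo in combinations(bag, r):
            if all(v not in adj[u] for u, v in combinations(combo, 2)):
                yield frozenset(combo)

def mwis_tree_decomposition(weights, adj, bags, tree, root):
    """Weight of a maximum weight independent set.

    weights: vertex -> weight; adj: vertex -> set of neighbors;
    bags: bag id -> set of vertices; tree: bag id -> neighboring bag ids.
    """
    def solve(node, parent):
        bag = bags[node]
        # table[s] = best weight in this subtree when the bag contributes s
        table = {s: sum(weights[v] for v in s)
                 for s in independent_subsets(bag, adj)}
        for child in tree[node]:
            if child == parent:
                continue
            child_table = solve(child, node)
            shared = bag & bags[child]
            best = {}
            for s_c, val in child_table.items():
                key = s_c & shared
                # Subtract shared vertices so they are not counted twice.
                gain = val - sum(weights[v] for v in key)
                if gain > best.get(key, float("-inf")):
                    best[key] = gain
            for s in table:
                table[s] += best[s & shared]
        return table

    return max(solve(root, None).values())

# Path a-b-c-d with bags {a,b}, {b,c}, {c,d} arranged in a line.
weights = {"a": 3, "b": 4, "c": 5, "d": 1}
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
bags = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
tree = {0: [1], 1: [0, 2], 2: [1]}
print(mwis_tree_decomposition(weights, adj, bags, tree, root=0))
```

Each bag of width t has at most 2^(t+1) subsets, which is where the exponential dependence on t and the linear dependence on the number of bags come from.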
This work (experimental results)
Data sets:
• 50 tRNAs (length 71 - 79)
• 50 pseudoknots (23 - 113)
• 11 large RNAs (210 - 412)
Compared with:
• PKnots (DP, optimal, restricted pks)
• ILM (heuristic, unrestricted)
• HotKnots (heuristic, unrestricted)
Measures:
• sensitivity = TP / Real total
• specificity = TP / (TP + FP)
• Time
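
The two accuracy measures can be computed from sets of base pairs. The sketch assumes a predicted pair counts as a true positive only if the identical pair appears in the real structure; the function name `accuracy` is hypothetical.

```python
def accuracy(predicted, real):
    """Return (sensitivity, specificity) as defined on the slide.

    predicted, real: iterables of base pairs (i, j) with i < j.
    sensitivity = TP / number of real pairs
    specificity = TP / (TP + FP) = TP / number of predicted pairs
    """
    predicted, real = set(predicted), set(real)
    tp = len(predicted & real)
    sensitivity = tp / len(real) if real else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity

real = {(0, 20), (1, 19), (2, 18), (3, 17)}
predicted = {(0, 20), (1, 19), (5, 15)}
print(accuracy(predicted, real))   # 2 of 4 real pairs found, 2 of 3 predictions correct
```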
This work (experimental results)
Conclusion
• A new graph-theoretic algorithm for RNA folding
• Performance comparable with the best in both accuracy and speed
• Much room for improvement
• Applications in multiple structure alignment as well as in folding a single sequence
• Part of an NIH project for ncRNA gene search