rna secondary structures · hierarchical folding: secondary structure forms rst then helices...

68
RNA Secondary Structures Ivo Hofacker Institute for Theoretical Chemistry, University of Vienna http://www.tbi.univie.ac.at/˜ivo/ AlgoSB 2019 Marseille, Janvier 2019

Upload: others

Post on 18-Jul-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

RNA Secondary Structures

Ivo Hofacker

Institute for Theoretical Chemistry, University of Vienna

http://www.tbi.univie.ac.at/˜ivo/

AlgoSB 2019Marseille, Janvier 2019

Page 2: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

What is RNA Bioinformatics?

Algorithms on sequences don’t care whether it’s RNA or protein,so what’s special?

• In contrast to plain sequence analysis we’re interested instructure

• When taking about structural bioinformatics most peoplethink protein structure – RNA is different.

• Structure can be described at many levelsMost useful for RNA work is usually the secondary structure

Page 3: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Why look at Structure?

• Basic paradigm of structural biology:Sequence → Structure → Function

• Understanding structure is a first step toward function

• Function is what we’re ultimately interested inSequences is what we have in plenty

• Structure should be conserved more strongly than sequence

Structure based methods can succeed where sequence analysis fails

f

p

. . .

Physical Properties

Free Energy

Kinetic Constants

FUNCTION

SEQUENCE STRUCTURE Properties, FUNCTION

Page 4: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Structural Bioinformatics of RNA vs. Proteins

• Amino acid sequences are often better for homology searchLarger alphabet and redundancy of the genetic code makes proteinalignment easier than nucleic acid alignment.

• Structural work on proteins often focuses on tertiary structures

• Many more protein than RNA tertiary structures are known

• Protein secondary structures are boring compared to RNAthey only tell which parts of a molecule are in helices/beta-sheetsRNA secondary structure contains interactions (base pairs)

• RNA secondary structures are computationally ease to handle

• Many biological functions can be understood purely on the 2D level

Page 5: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Many non-coding RNAs are Structural RNAs

(a) Group I intron(b) Hammerhead ribozyme(c) HDV ribozyme(d) Yeast tRNAphe

(e) L1 domain of 23S rRNA

Hermann & Patel, JMB 294, 1999

All the “classical” ncRNAs depend on well-defined andevolutionary conserved structure

Page 6: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Example 1: snoRNAs

Small nucleolar RNAs are trans-acting ncRNAs

• SnoRNAs guide chemical modifications of rRNAs, snRNAs,and some mRNAs.

• Characteristic secondary structure and sequence motifs

• Two tasks: How to recognize a snoRNA from its sequence?How to predict the targets of a given snoRNA?

M

M

ACA

NN

3’

ΨΨ

5’ Box C’

3’

5’

3’

5’

rRNA

rRNA

5’3’

14−16nt14−16nt

Box H

rRNA

5nt

5nt

Box CBox D

3−10nt3’

5’

Box D’

Page 7: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Example 2: The Flavin Mononucleotide Riboswitch

Example for a cis-acting RNA elementThe FMN-binding riboswitch in the free and bound state

• How can we find novel riboswitches?

• Can we predict how a novel riboswitch works?Up-, or down-regulation; transcriptional or translational control?

Page 8: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Basic Tasks

Mostly the same questions are the same as in sequence analysisHere we want to base them on structure

• Structure prediction (sequence → structure)(single or multiple sequences, w/o pseudo-knots, tertiary structure)

• Classification:• Recognizing new members of a known RNA class (tRNAs,

snoRNAs, . . .)• Homology searches — Given a novel RNA find all the homologs

where in Evolution did it first occur?

• De novo predictionAnnotate all functional RNAs in genomes or transcriptomes

• Many specialized problems for particular RNA classese.g. micro RNA target prediction

Page 9: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

The RNA Molecule

Double helices formed by Watson-Crick AU, UA, GC, CG, GU, UG pairs.

Page 10: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

The RNA Molecule

Double helices formed by Watson-Crick AU, UA, GC, CG, GU, UG pairs.

Page 11: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

The RNA Molecule

Double helices formed by Watson-Crick AU, UA, GC, CG, GU, UG pairs.

Page 12: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Isostericity of Watson-Crick Pairs

Page 13: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Double Helices

A-form B-form Z-form

Page 14: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

The RNA Folding Problem

• Hierarchical folding: Secondary structure forms first then helicesarrange to form tertiary structure

• Secondary structures cover most most of the folding energy

• Convenient and biologically useful description

• Computationally easy to handle

• Tertiary structure prediction needs knowledge of secondary structure

Page 15: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

RNA Secondary Structures

GCGGAUUU

AGCUC

AGDDG

G G AG A G C

GCCAGAC

UG

AAY

AUCUGGAG

GUCC U G U G

T PCG

AUC

CACAGAAUUCGC

A C C A

D-Loop

T-Loop

Acceptor StemGCGGAUU

U

A

G

C

U

CA

G

DDG

G

G

AG

A

G

C

G

C

C

A

G

AC

U

GA

A

Y

AU

C

U

G

G

A G

G

U

C

CUGUGTP

C

G

AU C

C A C A G A A U U C G C A C C A

A secondary structure is a list of base pairs (i , j) on a sequence x , with

• Any nucleotide (sequence position) can form at most one pair

• No pseudo-knots: No pairs (i , j) and (k, l) with i < k < j < l

• If (i , j) is a pair then xixj ∈ GC,CG,AU,UA,GU,UG• If (i , j) is a base pair, then j − i > 3

Page 16: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Pseudo knots

Excluding pseudo knots makes life easy, because

• It greatly simplifies the mathematical model=⇒ Simpler algorithms without pseudo knots

• Many pk-structures are sterically not feasible

• Energetics unknown, except for a few data onH-type pseudo-knots

On the other hand

• Pseudo knots can have important function

• First step toward tertiary structure

• H-type knots are tractable with extracomputational effort

C

G

G

C

C

U

G

C G

G

G

A

AG

G

C

G

C

C

A

C

A

C

A

A

CL1

S1

S2

L2

L3

G5’

3’

Page 17: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Representation of Secondary Structures

ACACG

AC

CU

CA

UA

UA

A

U

C

U

U

G

G

G

A

A

U

AUG

GC

CC

AU

AA G U U U C U A C C C

GG

CA

AC

CG

UA

A

A

U

U

G

C

C

G

G

A

C

UA

UG

CA

GG

GA

AGUGA A

C

A

CG

AC

C

U

C

A

U

AUA

AUCUU

GG

GAA

U

AU G

G

CC

CAUAAGUUUCUAC

CC

GG

CA

A

CC G

U

AAA

UU

GC

CG

G

A

CU

A

U

GCA

G

GGA

AG

U

G

AACACGACCUCAUAUAAUCUUGGGAAUAUGGCCCAUAAGUUUCUACCCGGCAACCGUAAAUUGCCGGACUAUGCAGGGAAGUGA

0 10

20

30

40

50

60

70

80

ACACGACCUCAUAUAAUCUUGGGAAUAUGGCCCAUAAGUUUCUACCCGGCAACCGUAAAUUGCCGGACUAUGCAGGGAAGUGA

.(((..(((((((......((((.......))))...........(((((((.......)))))))..)))).)))...))).

Page 18: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

RNA Secondary Structure as Trees

A

AGCGGA

AC

GAA A

CG

U U G CU U

UUGCG

CCCU

R

S

B

S

M

S

H H

S

Eh

m

h h

.((.((.(((...)))(((....)))))))

AAGCGGAACGAAACGUUGCUUUUGCGCCCU

p

p

p

p

u

u

C

G

G

A

U

p p

U

C

C

C

p p

p p

C

G

A

A

U

C

G

U

C

G

G C

G

u u u u uu u

A A A U U U U

u

A

Many RNA algorithms are generalizations of well knownstring algorithms to trees

Page 19: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Representing Ensembles of Structures

RF00167

A C A C G A C C U C A U A U A A U C U U G G G A A U A U G G C C C A U A A G U U U C U A C C C G G C A A C C G U A A A U U G C C G G A C U A U G C A G G G A A G U G A

A C A C G A C C U C A U A U A A U C U U G G G A A U A U G G C C C A U A A G U U U C U A C C C G G C A A C C G U A A A U U G C C G G A C U A U G C A G G G A A G U G A

AC

AC

GA

CC

UC

AU

AU

AA

UC

UU

GG

GA

AU

AU

GG

CC

CA

UA

AG

UU

UC

UA

CC

CG

GC

AA

CC

GU

AA

AU

UG

CC

GG

AC

UA

UG

CA

GG

GA

AG

UG

A

AC

AC

GA

CC

UC

AU

AU

AA

UC

UU

GG

GA

AU

AU

GG

CC

CA

UA

AG

UU

UC

UA

CC

CG

GC

AA

CC

GU

AA

AU

UG

CC

GG

AC

UA

UG

CA

GG

GA

AG

UG

A

Ensembles of structures (ther-modynamic equilibrium) arebest represented by base pairprobabilities.A pair (i , j) with probability p isrepresented by a square in rowi and column j with area p.

A CA

C

GA C

CU

CA

UA

UAAUC

UU

G

G

GA

A

UA U

G

GC

C

C

AUAAGUUUCUACC

CG

GC

AA

C CG

UAA

A

UU

GC

CG

G

ACU

AU

GCA

GG

GAAG

UG

A

Page 20: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Loop Decomposition

Secondary structures can be uniquely decomposed into loops.Loops are the faces of the secondary structure graph.

5

5

5

closing base pair

5

3

3

5

closing base pair

closing base pair

stacking pair

interior loop

G A

UC

G

G

G

C

CA

A

3

G

C A

U G

C

A

U

C

3

closing base pairinterior base pair

interior base pair

hairpin loop

bulge

UA

interior base pair

interior base pairs

G

C

A

U

C G

A A

5

3

closing base pair

multi loop

exterior loop

U G CA CU AA

3

Page 21: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Nearest Neighbor Energies

Free energy of a structure approximated as the sum of loopenergies

• Good approximation for mostoligo-nucleotides

• Loop energies depend on loop type and size,with some sequence dependence

• Most relevant parameters measuredexperimentally, some still guesswork

• Free energies are dependent on temperatureand ionic conditions

• Training parameters is becoming analternative to experiment

G

CUU

C

C

G

A

A U

UC

G

G

U

G

C−3.4

−3.3

+3.5

+1.2

−2.4

Total −4.4 kcal/mol

Page 22: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Stacked Pairs

• Major source of stabilizing energy

• all 21 combinations measured, accuracy at least0.1 kcal/mol

• include the hydrogen bonding energy of pair formation

• energies of tandem G·U pairs depend on context, i.e.violate the nearest-neighbor model

G

A

5’

5’

3’

3’

U

C

−2.4kcal/mol

(i + 1, j − 1)CG GC GU UG AU UA

(i , j) CG -2.4 -3.3 -2.1 -1.4 -2.1 -2.1GC -3.3 -3.4 -2.5 -1.5 -2.2 -2.4GU -2.1 -2.5 1.3 -0.5 -1.4 -1.3UG -1.4 -1.5 -0.5 0.3 -0.6 -1.0AU -2.1 -2.2 -1.4 -0.6 -1.1 -0.9UA -2.1 -2.4 -1.3 -1.0 -0.9 -1.3

For comparison: Thermal energy RT ≈ 0.6kcal/mol at 37C.

Page 23: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Determining Nearest Neighbor Parameters

UV absorption can distinguish between double and single strandednucleic Acids — Hyperchromicity

Page 24: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Determining Nearest Neighbor Parameters

UV absorption (at 260nm) can be used to follow the unfoldingtransition as a function of temperature.

Page 25: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Analyzing UV Melting Curves

Assume a 2-state model

funfolded(T ) =A(T )− A(Tmin)

A(Tmax)− A(Tmin)

Keq =funfoldedffolded

= e−∆G/RT

Page 26: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Van’t Hoff Analysis

lnKeq = −∆G

RT= −∆H

RT+

∆S

R

Page 27: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Algorithms for Secondary Structure Prediction

RNA structures can be predicted by Dynamic programmingalgorithms in many variants.

• Minimum free energy structure (Zuker & Stiegler ’81)

• Optimal and certain suboptimal structures (Zuker ’89)

• All structures within an energy range (Wuchty et al. ’99)

• Partition function and base pair probabilities (McCaskill ’90 )

• Stochastic suboptimals (Ding & Lawrence ’01)

• Maximum expected accuracy structures (Do et al ’06)

• Consensus structure prediction from alignment(Knudsen & Hein ’99, Hofacker et al. ’02)

• Minimum free energy with pseudo-knots (Rivas & Eddy ’99)

• Extended secondary structures with non-canonical pairs(Parisien & Major ’08, Honer et al ’11)

Page 28: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

The RNA Conformation Space

The number of secondary structures for a sequence x = x1 . . . xncan be enumerated recursively:

i jj i i+1 j i i+1 k−1 k k+1|=

Sij = Si+1,j +

j∑k=i+m

Si+1,k−1Sk+1,j pair(xk , xj)

pair(xk , xj) = 1 if xkxj is a canonical pair(GC, CG, AU, UA, GU, UG)otherwise pair(xk , xj) = 0.For typical sequences the number of conformations grows as

S1n ∼ n−32 1.85n

Page 29: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Solving the Folding Problem

Toy model for RNA folding: assign energies to base pairs ε(x , y).Easily solved by Dynamic Programming, i.e.:Recursive computation with tabulation intermediate results.

i jj i i+1 j i i+1 k−1 k k+1|=

Eij = mini<k≤j

Ei+1,j ;

(Ei+1,k−1 + Ek+1,j + ε(xi , xk)

)• E1n is the best possible energy for our sequence x .

• Backtracing through the E table yields the corresponding structure.

• The Algorithm requires O(n2) memory and O(n3) CPU time.

In practice this toy model is not good enough!For serious predictions, we need to use loop dependent energies.

Page 30: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0

0 0 -3 -3 -3 -6 -9

A

0 0 0 0

0 -3 -3 -3 -6 -9

G

0 0 0 0

0 0 -3 -6 -6

C

0 0 0 0

0 -3 -3 -3

A

0 0 0 0

-3 -3 -3

C

0 0 0 0

0 0

A

0 0 0 0

0

C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 31: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0

0 -3 -3 -3 -6 -9

A

0 0 0 0 0

-3 -3 -3 -6 -9

G

0 0 0 0 0

0 -3 -6 -6

C

0 0 0 0 0

-3 -3 -3

A

0 0 0 0 -3

-3 -3

C

0 0 0 0 0

0

A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 32: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0

-3 -3 -3 -6 -9

A

0 0 0 0 0 -3

-3 -3 -6 -9

G

0 0 0 0 0 0

-3 -6 -6

C

0 0 0 0 0 -3

-3 -3

A

0 0 0 0 -3 -3

-3

C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 33: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3

-3 -3 -6 -9

A

0 0 0 0 0 -3 -3

-3 -6 -9

G

0 0 0 0 0 0 -3

-6 -6

C

0 0 0 0 0 -3 -3

-3

A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 34: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3

-3 -6 -9

A

0 0 0 0 0 -3 -3 -3

-6 -9

G

0 0 0 0 0 0 -3 -6

-6

C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 35: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3

-6 -9

A

0 0 0 0 0 -3 -3 -3 -6

-9

G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 36: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6

-9

A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 37: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6 -9 A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 38: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6 -9 A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 39: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

.

( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6 -9 A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 40: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. (

( . ( . . . )

) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6 -9 A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 41: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( (

. ( . . .

) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6 -9 A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 42: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding using Nussinov’s Algorithmε(C ,G ) = ε(G ,C ) = −3 ε(A,U) = ε(U,A) = −2ε(G ,U) = ε(U,G ) = −1

Eij = Ei+1,j | Ei+1,k−1 + Ek+1,j + ε(xi , xk)

. ( ( . ( . . . ) ) )

A G C A C A C A G G C

0 0 0 0 0 0 -3 -3 -3 -6 -9 A

0 0 0 0 0 -3 -3 -3 -6 -9 G

0 0 0 0 0 0 -3 -6 -6 C

0 0 0 0 0 -3 -3 -3 A

0 0 0 0 -3 -3 -3 C

0 0 0 0 0 0 A

0 0 0 0 0 C

0 0 0 0 A

0 0 0 G

0 0 G

0 C

Page 43: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Folding with Loop based Energies

M1 M1

M1

i u u+1|

M

C

= |

=| |

FC

i j i+1 j i

hairpinC

interior

i j i i k l j

k k+1 j

=

C

F F

i j

M

=

i ij j−1|

C

i j

M

ui+1 u+1i j−1 j

j i

M

j−1 j|

C

i u u+1

j

j

j

Fij free energy of the optimal substructure on the subsequence x [i ..j ].

Cij optimal free energy on x [i ..j ], where (i , j) pair.

Mij x [i ..j ] is part of a multiloop and contains at least one pair.

M1ij same as Mij but contains exactly one component closed by (i , h).

Page 44: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Minimum Free Energy Folding and Accuracy

MFE folding predicts, the optimal, most probable, structure.

+ easy to program, calculate, and interpret

− one structure to represent the equilibrium ensemble

− no indication of reliability

Accuracy measured by comparison with (phylogenetic) modelstructures.

• about 40-70% accuracy with current parameters

• accuracy decreases with sequence length

• accurate structures can usually be found within small energyincrement of the MFE

Both inaccurate parameters and approximations of the energymodel to blame for poor predictions.Whenever possible don’t rely on a single structure!

Page 45: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Limits to prediction accuracy

A number of different factors contribute to the inaccuracy of ourpredictions:

• Beyond secondary structure:Pseudo-knots, non-canonical pairs, 3D structure

• Deviation from “standard” conditionsIon concentrations (esp. Mg2+), temperature

• Interaction with other molecules:RNA interact with proteins other RNAs and metabolites

• Folding kinetics:native structure 6= ground state structure

Page 46: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Representing Ensembles of Structures

RF00167

A C A C G A C C U C A U A U A A U C U U G G G A A U A U G G C C C A U A A G U U U C U A C C C G G C A A C C G U A A A U U G C C G G A C U A U G C A G G G A A G U G A

A C A C G A C C U C A U A U A A U C U U G G G A A U A U G G C C C A U A A G U U U C U A C C C G G C A A C C G U A A A U U G C C G G A C U A U G C A G G G A A G U G A

AC

AC

GA

CC

UC

AU

AU

AA

UC

UU

GG

GA

AU

AU

GG

CC

CA

UA

AG

UU

UC

UA

CC

CG

GC

AA

CC

GU

AA

AU

UG

CC

GG

AC

UA

UG

CA

GG

GA

AG

UG

A

AC

AC

GA

CC

UC

AU

AU

AA

UC

UU

GG

GA

AU

AU

GG

CC

CA

UA

AG

UU

UC

UA

CC

CG

GC

AA

CC

GU

AA

AU

UG

CC

GG

AC

UA

UG

CA

GG

GA

AG

UG

A

Ensembles of structures (ther-modynamic equilibrium) arebest represented by base pairprobabilities.A pair (i , j) with probability p isrepresented by a square in rowi and column j with area p.

A CA

C

GA C

CU

CA

UA

UAAUC

UU

G

G

GA

A

UA U

G

GC

C

C

AUAAGUUUCUACC

CG

GC

AA

C CG

UAA

A

UU

GC

CG

G

ACU

AU

GCA

GG

GAAG

UG

A

Page 47: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Partition Function, Botzmann probabilities and PairProbabilities

The partition function Z is the fundamental quantity of statisticalmechanics. All thermodynamic properties can be derived from it.

• Partition function Z =∑

s exp(−E(s)RT

)• Boltzmann pobability of a structure s: p(s) = 1

Z exp(−E(s)RT

)• Expected value of any quantity A: 〈A〉 =

∑s A(s)p(s)

• Free energy F = −RT lnZ

• Entropy S = − ∂F∂T

Page 48: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Partition Function, Botzmann probabilities and PairProbabilities

The partition function Z is the fundamental quantity of statisticalmechanics. All thermodynamic properties can be derived from it.

• Partition function Z =∑

s exp(−E(s)RT

)• Boltzmann pobability of a structure s: p(s) = 1

Z exp(−E(s)RT

)• Expected value of any quantity A: 〈A〉 =

∑s A(s)p(s)

• Free energy F = −RT lnZ

• Entropy S = − ∂F∂T

Page 49: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Partition Function, Botzmann probabilities and PairProbabilities

The partition function Z is the fundamental quantity of statisticalmechanics. All thermodynamic properties can be derived from it.

• Partition function Z =∑

s exp(−E(s)RT

)• Boltzmann pobability of a structure s: p(s) = 1

Z exp(−E(s)RT

)• Expected value of any quantity A: 〈A〉 =

∑s A(s)p(s)

• Free energy F = −RT lnZ

• Entropy S = − ∂F∂T

Page 50: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Constrained Parition Functions

Probability of some feature A. Compute the constraint partitionfunction

ZA =∑

s has feature A

exp

(−E (s)

RT

), p(A) =

ZA

Z

• Probability of forming the pair (i , j)

• probability of forming an aptamer structure

• probability of presenting a binding site

• . . .

Page 51: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Computing Partition Function and Pair Probabilities

For simplicity back to the Nussinov model:Computing the partition function Z =

∑Ψ exp(−E (Ψ)/RT ) is

simple:

Zij = Zi+1,j +∑

k, (i ,k) pairs

Zi+1,k−1Zk+1,j exp(−εik/RT ) .

In conjunction with the partition function Zij for structures outsidethe subsequence x [i ..j ] we can compute pair probabilities:

pij = ZijZi+1,j−1 exp(−εij/RT )/Z .

Page 52: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Pair Probabilities

Equilibrium probabilities for all possible pairs can be calculated viaMcCaskill’s partition function algorithm:• provides a rigorous description of

structures in thermodynamicequilibrium

• probability dot plots contain entropicterms not included in energy dot plots

• starting point for measures ofreliability and well-definedness

− not as easy to interpret as a small setof structures

G C G G A U U U A G C U C A G U U G G G A G A G C G C C A G A C U G A A G A U C U G G A G G U C C U G U G U U C G A U C C A C A G A A U U C G C A C C A

G C G G A U U U A G C U C A G U U G G G A G A G C G C C A G A C U G A A G A U C U G G A G G U C C U G U G U U C G A U C C A C A G A A U U C G C A C C A

AC

CA

CG

CU

UA

AG

AC

AC

CU

AG

CU

UG

UG

UC

CU

GG

AG

GU

CU

AG

AA

GU

CA

GA

CC

GC

GA

GA

GG

GU

UG

AC

UC

GA

UU

UA

GG

CG

GC

GG

AU

UU

AG

CU

CA

GU

UG

GG

AG

AG

CG

CC

AG

AC

UG

AA

GA

UC

UG

GA

GG

UC

CU

GU

GU

UC

GA

UC

CA

CA

GA

AU

UC

GC

AC

CA

Page 53: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Well-defined Regions

Pair probailities can help judge the reliability of a prediction.

Well-definedness: Are there many structural alternaives?

e.g. “ensemble diversity” (returned by RNAfold) measuresdissimilarity of structural alternativesComputed directly from pair probabilities 〈d〉 =

∑i,j pij · (1− pij)

Local reliability: Which which parts the prediction we can trust?

High probability pairs are almost always correct.Secondary structure plots can be colored by pair probability or“positional entropy” to highlight reliable and unreliable regions.positional entropy is computed from pair probabilities as S(k) = −

∑i pik ln pik

Page 54: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Structures with reliability annotation

0 20 40 60 80 100 120

position k

0

5

10

15

20

25

m(k

)

pair probabilities

mfe structurephylogenetic structure

positional entropy

5sRNA

U C A A U A G C G G C C A C A G C A G G U G U G U C A C A C C C G U U C C C A U U C C G A A C A C G G A A G U U A A G A C A C C U C A C G U G G A U G A C G G U A C U G A G G U A C G C G A G U C C U C G G G A A A U C A U C C U C G C U G C U A U U G U U

U C A A U A G C G G C C A C A G C A G G U G U G U C A C A C C C G U U C C C A U U C C G A A C A C G G A A G U U A A G A C A C C U C A C G U G G A U G A C G G U A C U G A G G U A C G C G A G U C C U C G G G A A A U C A U C C U C G C U G C U A U U G U U

UC

AA

UA

GC

GG

CC

AC

AG

CA

GG

UG

UG

UC

AC

AC

CC

GU

UC

CC

AU

UC

CG

AA

CA

CG

GA

AG

UU

AA

GA

CA

CC

UC

AC

GU

GG

AU

GA

CG

GU

AC

UG

AG

GU

AC

GC

GA

GU

CC

UC

GG

GA

AA

UC

AU

CC

UC

GC

UG

CU

AU

UG

UU

UC

AA

UA

GC

GG

CC

AC

AG

CA

GG

UG

UG

UC

AC

AC

CC

GU

UC

CC

AU

UC

CG

AA

CA

CG

GA

AG

UU

AA

GA

CA

CC

UC

AC

GU

GG

AU

GA

CG

GU

AC

UG

AG

GU

AC

GC

GA

GU

CC

UC

GG

GA

AA

UC

AU

CC

UC

GC

UG

CU

AU

UG

UU

0 2.4 U

C

A

A

U

A

G

C

G

G

C

CAC

A

GC

A GGU

GU

GU

CACA

CC

C

GU

UC

CCA

UU C

C

GA

AC

AC

G

GA

A

GU U

A

A

GA

CA

C C

UCAC

GU

G G A U G A

CG G U

AC U G A G GUA C

GC

GA

GUCCUCGG

GAA

AUCAUCC

UC

G

C

U

G

C

U

A

U

U

G

U

U

0 1U

C

A

A

U

A

G

C

G

G

CCACA

GCA

GG

UG

UGU

CACA

C

CC

GU

UC

CCA

UU C

C

GAAC

AC

GG

A A GU

U

A

A G

AC

AC

CU

C AC

GU

G G A U G A

CG G U

AC U G A G GUA C

GC

GA

GUCCUCGG

GAA

AUCAUCC

UC

G

C

U

G

C

U

A

U

U

G

U

U

Page 55: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Alternatives to the MFE as “best” Structure

Maximum Expected Accuracy (MEA)

• Select the structure expected to have the most base pairs correct

• → Maximize the sum of pair probabilities < A >=∑

(i,j)∈S pij• Computed by “Nussinov”-like dynamic programming

Centroid Structure• Choose the structure that is “nearest” to all other struture in the

the Boltzman ensemble

• Minimize < d(S) >=∑

(i,j)∈S pij +∑

(i,j)/∈S(1− pij)

• Trivial solution: just pick all pairs with probability > 0.5

No free lunch: Centroid and MEA structure may correspond to avery unlikely structure!

Page 56: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Suboptimal Structures

Zuker’s p-suboptimal structures (mfold)

For each pair (i , j) generate the best structure containing that pair.

+ easy to compute by multiple backtracking

+ present a user user with few, representative alternatives

− important alternatives can be missed

− somewhat arbitrary selection of representatives

Complete suboptimal folding

generates all structures within given energy range from the mfe.

+ Calculate any thermodynamic average

+ Investigate the energy landscapes, e.g. find low-lying local minima

− huge amount of data, only for moderately short sequences

Stochastic suboptimal structure

generate Boltzmann weighted sample through stochastic backtracking

Page 57: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Dot Plots and Suboptimal Structures

Zuker suboptimal structures may miss important alternatives

GUCCGCAGUUGCGGACUGCCCGUAAUUGCGGGCGUUCGUGUAUGCGAACGGCUCGUGUAUGCGAGC

((((((....)))))).((((((....))))))((((((....)))))).((((((....))))))

((((((((((((((.....))))))))))))))((((((....)))))).((((((....))))))

((((((((((((((.....)))))))))))))).(((((((((((((.....))))))))))))).

((((((....)))))).((((((....)))))).(((((((((((((.....))))))))))))).

Only the first 3 structures are found by mfold

G U C C G C A G U U G C G G A C U G C C C G U A A U U G C G G G C G U U C G U G U A U G C G A A C G G C U C G U G U A U G C G A G C

GU

CC

GC

AG

UU

GC

GG

AC

UG

CC

CG

UA

AU

UG

CG

GG

CG

UU

CG

UG

UA

UG

CG

AA

CG

GC

UC

GU

GU

AU

GC

GA

GC

Page 58: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Variations on Backtracing

What we need:• Let π be a partial structure consisting of

• Ωπ a set of already known base pairs• Υπ a list of sequence intervals with unknown structure

• E (π) best energy that can be obtained by completing π

• S a stack of partial structures to be processed

Page 59: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Variations on Backtracing

MFE backtracing:

∅→ S.while S 6= ∅ doπ ← S;if π is complete then output π[i , j ] = I ∈ Υπ.π′ = πJ(i)if E (π′) = Eopt

then π′ → S; next;for all k ∈ [i , j ] doπ′ = πJ(i , k)if E (π′) = Eopt

then π′ → S; next;

Page 60: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Variations on Backtracing

Wuchty suboptimals

∅→ S.while S 6= ∅ doπ ← S;if π is complete then output π[i , j ] = I ∈ Υπ

π′ = πJ(i)if E (π′) ≤ Eopt + ∆E

then π′ → S;for all k ∈ [i , j ] doπ′ = πJ(i , k)if E (π′) ≤ Eopt + ∆E

then π′ → S;

Page 61: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Variations on Backtracing

Stochastic backtracing

∅→ S.while S 6= ∅ doπ ← S;if π is complete then output π[i , j ] = I ∈ Υπ

r = Z (π) · rand();π′ = πJ(i); r = r − Z (π′)if r ≤ 0 then π′ → S; next;for all k ∈ [i , j ] doπ′ = πJ(i , k); r = r − Z (π′)if r ≤ 0 then π′ → S; next;

Page 62: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

SCFG based Methods

Stochastic Context Free Grammars

• An extension to HMMs that can handle long rangecorrelations

• Formally, a CFG is a tuple G = (V , α, S ,R) ofAlphabet α, nonterminals V , Start symbol S , and rewriterules R

• Stochastic CFG adds probabilities to each rule

• Generates sequences using a series of rewriting rules

• Well known repertoire of algorithms that work on any SCFG

Page 63: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Nussinov Algorithm as SCFG

i jj i i+1 j i i+1 k−1 k k+1|=

S → aS |cS |gS |uSS → PS

S → ∅P → aSu|uSa|cSg |gSc |gSu|uSg

• Applying grammar rules produces an RNA sequence

• The parse tree of productions used corresponds to the structure

• A stochastic CFG assigns probabilities to each production rule

• CYK algorithm finds the parse tree (structure) most likely toproduce the given sequence

Page 64: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

SCFGs vs Energy directed folding

Many RNA related algorithms come in two flavors, based on either

energy directed folding or stochastic context free grammars

physics based modelparameters from experiment

probabilistic inferenceparameters learned from data

Algorithms are mostly analogous, but terminology is different!

MFE foldingpair probabilities

CYK algorithmInside/Outside algorithm

Page 65: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Beyond classical secondary structure

State of the art in RNA secondary structure prediction:

• Classical secondary structure considers only 6 types of pairsCG,GC,AU,UA,GU,UG

• All base pairs are assumed to be Watson-Crick type

• Loops are drawn as unstructured bubbles

• Energy model assigns free energies to each loop

Can we improve the level of detail?

Page 66: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

Leontis-Westhof classification

Watson-Crick base pairing is only part of the story!

Overall 12 different pairing types for any two basesStarting point for defining recurring tertiary structure motifs

Page 67: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

2D and 3D Structures

G

G

UC

A

G

GU

C

C

G

A A

A

G

G

AA

G

C

AG

C

C

Looking at known tertiary structureswe find that:

• “Loops” are not unstructured

• full of non-canonical base pairs

• all 16 [ACGU]×[ACGU]combinations form pairs

• the same two nucleotides canpair in many different ways

Page 68: RNA Secondary Structures · Hierarchical folding: Secondary structure forms rst then helices arrange to form tertiary structure Secondary structures cover most most of the folding

2D and 3D Structures

G

G

UC

A

G

GU

C

C

G

A A

A

G

G

AA

G

C

AG

C

C

Looking at known tertiary structureswe find that:

• “Loops” are not unstructured

• full of non-canonical base pairs

• all 16 [ACGU]×[ACGU]combinations form pairs

• the same two nucleotides canpair in many different ways