Applied Bioinformatics
Week 11
Topics
• Protein Secondary Structure
• RNA Secondary Structure
Theory I
Recall Domains
• Functional region of a protein sequence
• Proteins may have several domains
• Generally identified by MSA
Domains
• Convey function
• Function derives from 3D structure
• How to determine 3D structure of proteins?
• First step secondary structure
Four levels of protein structure
Structure
Secondary Structure
• Local three dimensional structure
• Elements
– Helix
– Sheet
– Coil
Secondary Structure: 8 different categories (DSSP)
• H: α-helix (4-turn helix). Min length 4 residues.
• G: 3₁₀-helix (3-turn helix). Min length 3 residues.
• I: π-helix (5-turn helix, extremely rare). Min length 5 residues.
• E: extended strand in parallel and/or anti-parallel β-sheet conformation. Min length 2 residues.
• B: residue in isolated β-bridge (single pair β-sheet hydrogen bond formation)
• T: hydrogen-bonded turn (3, 4 or 5 turn)
• S: bend (the only non-hydrogen-bond based assignment)
• L: the rest
Protein Secondary Structure [3]
Alpha Helix
• Structure repeats itself every 5.4 Angstroms along the helix axis
• Every main chain CO and NH group is hydrogen bonded to a peptide bond 4 residues away
Beta Sheet
• Two or more polypeptide strands run alongside each other and are linked by hydrogen bonds
Yuchun Tang, Preeti Singh, Yanqing Zhang, Chung-Dar Lu and Irene Weber, Georgia State University
Simplification
• 20 amino acids
• 5 - 11 groups of amino acids
– Amino acids with similar chemical properties
– Depends on the study
• 3 secondary structures
Secondary Structure Prediction
• Sheet/helix forming tendency of amino acids
– Up to 60% accurate
• MSA -> neighborhood exploitation
– Words of several aa are formed
– Hydrophobicity is included
– Up to 80% accurate
Propensities
Generation of Prediction Methods
• 1st generation: single residue statistics
– Based on single amino acid propensity
• 2nd generation: segment statistics
– Propensity for segments of 3-51 adjacent residues
• 3rd generation: evolution to better predictions
– The use of evolutionary information (evolutionary profile)
Assignment to Structure
• Sliding window of 7 amino acids
– Why 7? About 2 turns of an α-helix (3.6 residues per turn)
• Middle amino acid is assigned average propensity
– Helix, Sheet
• Long stretches of similar assignments
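The window averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind any particular tool: the function name `window_average` and the `TOY_HELIX` table are hypothetical, and the numbers in that table are placeholders rather than a published propensity scale.

```python
def window_average(seq, scale, width=7, default=1.0):
    """Average a per-residue scale over a sliding window.

    Returns one value per residue; positions too close to either end for a
    full window are left as None.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a middle residue")
    half = width // 2
    values = [scale.get(aa, default) for aa in seq]
    profile = [None] * len(seq)
    for centre in range(half, len(seq) - half):
        window = values[centre - half:centre + half + 1]
        # The middle residue receives the average over the whole window.
        profile[centre] = sum(window) / width
    return profile


# Placeholder helix propensities for illustration only (NOT a published scale).
TOY_HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "G": 0.6, "P": 0.6, "S": 0.8}

profile = window_average("AEELLAGPGSS", TOY_HELIX)
```

Runs of neighbouring positions whose average stays above a chosen cutoff would then be read as one long stretch of the same assignment.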
Example: Window
• Consider a secondary structure (x, e) and a window of length 5 with the special position in the middle (the label is taken from the middle residue)
• First position of the window:
x = A R N S T V V S T A A . . .
e = ? ? H H C C C E E E . . .
• Window returns instance: A R N S T → H
• Second position of the window returns instance: R N S T V → H
• Next instances are:
N S T V V → C
S T V V S → C
T V V S T → C
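The window example can be reproduced with a short Python sketch. The function name `window_instances` is hypothetical, and skipping positions labelled `?` (or lying beyond the end of the label string) is an assumption about how unlabelled residues should be treated.

```python
def window_instances(seq, labels, width=5):
    """Yield (window, label) training instances.

    The label is the secondary structure state of the middle residue.
    """
    half = width // 2
    instances = []
    for centre in range(half, len(seq) - half):
        # Skip residues with no known label.
        if centre >= len(labels) or labels[centre] == "?":
            continue
        window = seq[centre - half:centre + half + 1]
        instances.append((window, labels[centre]))
    return instances


x = "ARNSTVVSTAA"
e = "??HHCCCEEE"
for window, label in window_instances(x, e):
    print(" ".join(window), label)
```

The first five lines printed correspond to the instances listed on the slides.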
Practical Secondary Structure Prediction
• Can aid in MSA
– If structures are not more similar than the aligned sequences, there is a problem
• Step towards three dimensional structure
• Clue about architecture
– 28 regular protein architectures
PSIPRED Example
Secondary structure prediction methods
• PSIPRED: PSI-BLAST profiles used for prediction (David Jones, Warwick)
• JPRED: consensus prediction, includes many of the methods given below (Cuff & Barton, EBI)
• DSC: King & Sternberg
• PREDATOR: Frishman & Argos (EMBL)
• PHD home page: Rost & Sander, EMBL, Germany
• ZPRED server: Zvelebil et al., Ludwig, U.K.
• nnPredict: Cohen et al., UCSF, USA
• BMERC PSA Server: Boston University, USA
• SSP (Nearest-neighbor): Solovyev and Salamov, Baylor College, USA
http://speedy.embl-heidelberg.de/gtsp/secstrucpred.html
Andrew CR Martin, UCL
Consensus prediction method
[Figure: alignment annotated with hydrophobic and highly conserved positions; b = buried, e = exposed]
Consensus prediction method - JPRED
[Figure: same annotation, with amphipathic and hydrophobic segments labelled; b = buried, e = exposed]
Neural network prediction - PHD
• Multiple alignment of protein family
• SS profile for window of adjacent residues
Hidden Markov Models-HMMSTR
[Figure labels: Markov state, with amino acid, secondary structure element and structural context]
• Recurrent local features of protein sequences
• Accuracy of 74%
Bystroff et al., 2000
Consensus/ Meta Prediction Method
• Uses more than one existing method
• Learns how to combine the results
• Produces a result which is on average better than the single methods
• E.g.: http://gor.bb.iastate.edu/cdm/
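As a rough illustration of combining several predictors, the sketch below takes a per-residue majority vote. This is a deliberately simpler stand-in: the slide describes methods that learn how to combine results, whereas a plain vote involves no learning. The function name `consensus` and the rule that ties fall back to coil (`C`) are assumptions made for the example.

```python
from collections import Counter


def consensus(predictions, tie_state="C"):
    """Combine equal-length 3-state predictions by per-residue majority vote."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        winners = [state for state, n in counts.items() if n == best]
        # A unique winner is used directly; ties fall back to tie_state.
        result.append(winners[0] if len(winners) == 1 else tie_state)
    return "".join(result)


print(consensus(["HHHCC", "HHCCC", "HSCCS"]))  # prints HHCCC
```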
Prediction Accuracy Assessment
• Protein Structure Prediction Center
– http://predictioncenter.org/
• CASP
– Critical Assessment of protein Structure Prediction
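Accuracy percentages for secondary structure predictors are typically reported as the fraction of residues whose predicted state matches the observed one, often called Q3 for three-state predictions. A minimal sketch, assuming both strings use the same three-letter alphabet (the function name `q3` is hypothetical):

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


print(q3("HHHCC", "HHCCC"))  # prints 0.8
```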
Hydrophobicity
Assignment to Structure
• Sliding window of 5-7 or 19-21 amino acids
– Why?
• Otherwise same idea as for secondary structure forming propensities
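A hydropathy profile can be sketched with the Kyte-Doolittle scale. The choice of that scale is an assumption (the slides do not name one), and the function name `hydropathy_profile` is hypothetical. A window of about 19 residues is roughly the length needed to span a membrane as a helix, which is one plausible answer to the "Why?" above; short windows of 5-7 are more suited to finding surface-exposed regions.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, width=19):
    """Return (centre index, mean hydropathy) for every full window.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    half = width // 2
    profile = []
    for centre in range(half, len(seq) - half):
        window = seq[centre - half:centre + half + 1]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in window) / width
        profile.append((centre, mean))
    return profile


# Artificial sequence: a 19-residue hydrophobic stretch flanked by lysines.
seq = "K" * 10 + "L" * 19 + "K" * 10
peak = max(hydropathy_profile(seq), key=lambda item: item[1])
```

Here `peak` falls at the centre of the leucine stretch. On a real protein, windows whose mean stays well above zero would be candidate transmembrane segments.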
End Theory I
Mindmapping
10 min break
Practice I
Sec Struct Prediction
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
http://compbio.soe.ucsc.edu/HMM-apps/T02-query.html
http://distill.ucd.ie/porter/
http://sable.cchmc.org/
http://www.compbio.dundee.ac.uk/www-jpred/advanced.html
http://genamics.com/expression/strucpred.htm
http://www.predictprotein.org/
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_phd.html
http://www.chemie.uni-erlangen.de/lanig/PMII/sek_str.html
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html
http://molbiol-tools.ca/Protein_secondary_structure.htm
http://mobyle.pasteur.fr/cgi-bin/portal.py?form=predator
http://www.aber.ac.uk/~phiwww/prof/
http://www.expasy.ch/tools/
http://gor.bb.iastate.edu/
In class assignment
• Choose a protein sequence
– Not too short!
• Perform secondary structure predictions with as many tools as possible
– Google at least one more than given in the slides
• Retrieve and rewrite the predictions such that they use the same three-letter alphabet (H, C, S: Helix, Coil, Sheet)
– Use search and replace functionality of your word processor
• Make an MSA with the predicted secondary structures to compare the results
– Are there gaps?
– Are they within the transition from one secondary structure to the next?
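For those who prefer a script to search and replace, the rewrite step can be sketched in Python. The mapping below is one common reduction of the eight DSSP letters to three states and is an assumption, since tools differ in their output alphabets and in how they group the rarer states. Translating every character in a single pass matters here: with sequential replacements, S (bend) in the input could be confused with S (Sheet) in the output.

```python
# Assumed reduction: helices -> H, strands/bridges -> S (Sheet), the rest -> C (Coil).
EIGHT_TO_THREE = str.maketrans({
    "H": "H", "G": "H", "I": "H",
    "E": "S", "B": "S",
    "T": "C", "S": "C", "L": "C", "-": "C",
})


def to_three_state(prediction):
    """Rewrite an 8-state string into the H/C/S alphabet in one pass.

    Characters not in the table are left unchanged.
    """
    return prediction.translate(EIGHT_TO_THREE)


print(to_three_state("HGIEBTSL"))  # prints HHHSSCCC
```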
Try to predict TMDs
• Find a protein with TMDs
• Expasy will provide you with prediction methods
– DAS - Prediction of transmembrane regions in prokaryotes using the Dense Alignment Surface method (Stockholm University)
– HMMTOP - Prediction of transmembrane helices and topology of proteins (Hungarian Academy of Sciences)
– PredictProtein - Prediction of transmembrane helix location and topology (Columbia University)
– SOSUI - Prediction of transmembrane regions (Nagoya University, Japan)
– TMHMM - Prediction of transmembrane helices in proteins (CBS; Denmark)
– TMpred - Prediction of transmembrane regions and protein orientation (EMBnet-CH)
– TopPred - Topology prediction of membrane proteins (France)
End Practice I
Theory II
RNA
• Coding RNA
– Results in protein
• Non Coding RNA
– Structural
– Regulatory
– Catalytic
– …
RNA Basics
transfer RNA (tRNA)
messenger RNA (mRNA)
ribosomal RNA (rRNA)
small interfering RNA (siRNA)
micro RNA (miRNA)
small nucleolar RNA (snoRNA)
http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
RNA Secondary Structure
• Just like amino acids interact to form a secondary structure, nucleotides do the same
• Here base pairing is the driving force
• Generally the structure of RNA molecules is projected onto 2 dimensions
Chemical Structure of RNA
Four base types.
Distinguishable ends.
Partial Tertiary Structure
One illustration
Yet Another Tertiary Structure
Found via google
Our Final Tertiary Picture
Very complex
A Partial RNA Secondary Structure
Pure Secondary Structure
RNA Folding
• Single stranded RNA
– Unstable
– Base pairs with complementary sequences
– Base pair stacking
– Favorable loop sizes
• Highest Stability
– Lowest energy model
• Folding process
– Not known in detail
– Extremely fast
RNA Secondary Structure Prediction
Dynamic Programming Approaches
Sarah Aerni
http://www.tbi.univie.ac.at/
Outline
RNA folding
Dynamic programming for RNA secondary structure prediction
Covariance model for RNA structure prediction
RNA Secondary Structure
Hairpin loop
Junction (Multiloop)
Bulge Loop
Single-Stranded
Interior Loop
Stem
Image– Wuchty
Pseudoknot
Sequence Alignment as a method to determine structure
Bases pair in order to form backbones and determine the secondary structure
Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure
Base Pair Maximization – Dynamic Programming Algorithm
Simple Example: Maximizing Base Pairing
• Base pair at i and j
• Unmatched at i
• Unmatched at j
• Bifurcation
Images – Sean Eddy
S(i,j) is the folding of the subsequence of the RNA strand from index i to index j which results in the highest number of base pairs
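The four cases on this slide combine into a single recurrence. Written out, under the assumption that every feasible base pair scores 1 and nothing else contributes to the score:

```latex
S(i,j) = \max
\begin{cases}
S(i+1,\,j-1) + 1 & \text{if bases } i \text{ and } j \text{ can pair} \\
S(i+1,\,j)       & \text{(unmatched at } i\text{)} \\
S(i,\,j-1)       & \text{(unmatched at } j\text{)} \\
\max\limits_{i<k<j} \bigl[\, S(i,k) + S(k+1,\,j) \,\bigr] & \text{(bifurcation)}
\end{cases}
```

The answer for the whole strand of length n is S(1, n).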
Base Pair Maximization – Dynamic Programming Algorithm
Alignment Method
• Align RNA strand to itself
• Score increases for feasible base pairs
Each score independent of overall structure
Bifurcation adds extra dimension
Initialize first two diagonal arrays to 0
Fill in squares sweeping diagonally
Images – Sean Eddy
Dynamic Programming – possible paths
• Bases cannot pair, similar to unmatched alignment: S(i + 1, j) or S(i, j – 1)
• Bases can pair, similar to matched alignment: S(i + 1, j – 1) + 1
Dynamic Programming – possible paths, including bifurcation
• Bases cannot pair, similar to unmatched alignment
• Bases can pair, similar to matched alignment
• Bifurcation – add values for all k: S(i,k) + S(k + 1, j)
• k = 0: bifurcation max in this case
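The fill procedure from these slides (zero-initialised diagonals, then a diagonal sweep over the four cases) can be sketched as follows. This is a minimal illustration under stated assumptions: only Watson-Crick pairs count as feasible, every pair scores 1, and the optional `min_loop` parameter is a hypothetical addition that forbids pairs enclosing too few bases. It returns the score only, without a traceback.

```python
# Assumption: only Watson-Crick pairs are treated as feasible.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna, min_loop=0):
    """Maximum number of nested base pairs in an RNA string."""
    n = len(rna)
    if n == 0:
        return 0
    # S[i][j] = best score for subsequence i..j; the matrix starts at 0,
    # which covers the main diagonal and the diagonal below it.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):              # sweep diagonal by diagonal
        for i in range(n - span):
            j = i + span
            # Unmatched at i, or unmatched at j.
            best = max(S[i + 1][j], S[i][j - 1])
            # Base pair at i and j (needs more than min_loop bases between).
            if (rna[i], rna[j]) in WATSON_CRICK and span > min_loop:
                best = max(best, S[i + 1][j - 1] + 1)
            # Bifurcation: best split into two independent substructures.
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # prints 3
```

The inner loop over k is what makes the bifurcation case add an extra dimension: the table has n² cells, but each cell costs up to n steps to fill.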
Base Pair Maximization - Drawbacks
Base pair maximization will not necessarily lead to the most stable structure
• May create structure with many interior loops or hairpins which are energetically unfavorable
Comparable to aligning sequences with scattered matches – not biologically reasonable
Energy Minimization
Thermodynamic Stability
• Estimated using experimental techniques
• Theory: most stable is the most likely
No pseudoknots due to algorithm limitations
Uses Dynamic Programming alignment technique
Attempts to maximize the score taking into account thermodynamics
MFOLD and ViennaRNA
Energy Minimization Results
Linear RNA strand folded back on itself to create secondary structure
Circularized representation uses this requirement
• Arcs represent base pairing
Images – David Mount
All loops must have at least 3 bases in them
• Equivalent to having at least 3 bases between the two ends of every arc
Exception: Location where the beginning and end of RNA come together in circularized representation
Trouble with Pseudoknots
Pseudoknots cause a breakdown in the Dynamic Programming Algorithm.
In order to form a pseudoknot, checks must be made to ensure base is not already paired – this breaks down the recurrence relations
Images – David Mount
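In the arc representation, a pseudoknot shows up as two arcs that cross. A small sketch of that check, assuming a structure is given as a list of (i, j) index pairs (the function name `crossing_pairs` is hypothetical):

```python
def crossing_pairs(pairs):
    """Return all pairs of arcs (i, j), (k, l) with i < k < j < l.

    Nested arcs (i < k < l < j) and side-by-side arcs (i < j < k < l)
    are fine for the dynamic programming recurrences; crossing arcs are not.
    """
    arcs = sorted((min(i, j), max(i, j)) for i, j in pairs)
    crossings = []
    for a, (i, j) in enumerate(arcs):
        for k, l in arcs[a + 1:]:
            if i < k < j < l:
                crossings.append(((i, j), (k, l)))
    return crossings


print(crossing_pairs([(0, 9), (1, 8)]))  # nested: prints []
print(crossing_pairs([(0, 5), (2, 8)]))  # crossing: prints [((0, 5), (2, 8))]
```

A structure with no crossings can always be split into independent subproblems, which is the property the recurrences rely on.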
Energy Minimization Drawbacks
Compute only one optimal structure
Usual drawbacks of purely mathematical approaches
Similar difficulties in other algorithms
Protein structure
Exon finding
Alternative Algorithms - Covariation
Incorporates similarity-based method
• Evolution maintains sequences that are important
• Changes in sequence coincide to maintain structure through base pairs (covariance)
• Cross-species structure conservation example – tRNA
Manual and automated approaches have been used to identify covarying base pairs
Models for structure based on results
• Ordered Tree Model
• Stochastic Context Free Grammar
Expect areas of base pairing in tRNA to be covarying between various species
• Base pairing creates same stable tRNA structure in organisms
• Mutation in one base makes pairing impossible and breaks down structure
• Covariation ensures ability to base pair is maintained and RNA structure is conserved
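One automated way to spot covarying columns in an alignment is mutual information. The slides do not name a specific measure, so this is a swapped-in illustration, and the function name `mutual_information` and the toy alignment are hypothetical. Two columns that always change together score high; a conserved column scores zero against everything.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    freq_i = Counter(seq[i] for seq in alignment)
    freq_j = Counter(seq[j] for seq in alignment)
    freq_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 4 always form a Watson-Crick pair.
toy = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(toy, 0, 4))  # prints 2.0
print(mutual_information(toy, 0, 1))  # prints 0.0
```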
Binary Tree Representation of RNA Secondary Structure
Representation of RNA structure using Binary tree
Nodes represent
• Base pair if two bases are shown
• Loop if base and “gap” (dash) are shown
Pseudoknots still not represented
Tree does not permit varying sequences
• Mismatches
• Insertions & Deletions
Images – Eddy et al.
Covariance Model
HMM which permits flexible alignment to an RNA structure – emission and transition probabilities
Model trees based on finite number of states
• Match states – sequence conforms to the model:
– MATP – state in which bases are paired in the model and sequence
– MATL & MATR – states in which a single unpaired base sits on the left or the right (a bulge) in both the sequence and the model
• Deletion – state in which there is a deletion in the sequence when compared to the model
• Insertion – state in which there is an insertion relative to the model
Transitions have probabilities
• Varying probability – enter insertion, remain in current state, etc.
• Bifurcation – no probability, describes path
Covariance Model (CM) Training Algorithm
S(i,j) = Score at indices i and j in RNA when aligned to the Covariance Model
• Independent frequency of seeing the symbols (A, C, G, U) in location i or j, depending on symbol
• Frequency of seeing the symbols (A, C, G, U) together in locations i and j, depending on symbol
Frequencies obtained by aligning model to “training data”, which consists of sample sequences
• Reflect values which optimize alignment of sequences to model
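The formula these labels annotate did not survive in the transcript. One form consistent with the labels, offered here as an assumption rather than a reconstruction of the slide, is a log-odds score comparing the joint frequency of a pair with what independence would predict. The symbols f_{ij}, f_i and f_j are notation introduced for this sketch.

```latex
S(i,j) = \log_2 \frac{f_{ij}(x_i, x_j)}{f_i(x_i)\, f_j(x_j)}
```

Here x_i and x_j are the symbols observed at positions i and j, f_{ij} is the frequency of seeing them together, and f_i, f_j are the independent frequencies, all estimated from the training alignment. The score is positive when a pair occurs more often than chance would suggest.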
Alignment to CM Algorithm
Calculate the probability score of aligning RNA to CM
Three dimensional matrix – O(n³)
Align sequence to given subtrees in CM
For each subsequence calculate all possible states
Subtrees evolve from Bifurcations
For simplicity Left singlet is default
Images – Eddy et al.
• For each calculation take into account the
– Transition (T) to next state
– Emission probability (P) in the state as determined by training data
• Bifurcation – does not have a probability associated with the state
• Deletion – does not have an emission probability (P) associated with it
Images – Eddy et al.
Covariance Model Drawbacks
Needs to be well trained
Not suitable for searches of large RNA
• Structural complexity of large RNA cannot be modeled
Runtime
Memory requirements
End Theory II
Mindmapping
10 min break
Practice II
RNA Secondary Structure
• Online
– http://compbio.cs.sfu.ca/taverna/alterna/
– http://www.bioinfo.rpi.edu/applications/mfold/
• Download
– RNAshapes
– RNAfold
• Get RNAs
– http://www.ncrna.org/frnadb/search.html