Finding, Aligning and Analyzing Non Coding RNAs Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

Post on 04-Feb-2016

Page 1: Finding, Aligning and Analyzing Non Coding RNAs

Finding, Aligning and Analyzing Non Coding RNAs

Cédric Notredame, Comparative Bioinformatics Group, Bioinformatics and Genomics Program

Page 2: Finding, Aligning and Analyzing Non Coding RNAs

They are Everywhere…

And ENCODE said…

“nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

Who Are They?

– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)

How Many of Them?

– Open question
– 30,000 is a common guess
– Harder to detect than proteins

Page 3: Finding, Aligning and Analyzing Non Coding RNAs

Searching

“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”

Page 4: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs can have different sequences and Similar Structures

Page 5: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs Can Evolve Rapidly

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

[Figure: stem-loop drawings of the two sequences]
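The comparison on this slide can be reproduced with a short sketch. It assumes the long line on the slide holds two 29-nucleotide sequences followed by a conservation line (`*` for identical columns, `-` otherwise). The function names are illustrative and do not come from the talk.

```python
def conservation_line(seq_a: str, seq_b: str) -> str:
    """Return '*' for identical columns and '-' for mismatches (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return "".join("*" if a == b else "-" for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of columns that carry the same nucleotide."""
    line = conservation_line(seq_a, seq_b)
    return 100.0 * line.count("*") / len(line)


# The two sequences from the slide, split at the midpoint (an assumption).
SEQ_A = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ_B = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

if __name__ == "__main__":
    print(SEQ_A)
    print(SEQ_B)
    print(conservation_line(SEQ_A, SEQ_B))
    print(f"{percent_identity(SEQ_A, SEQ_B):.1f}% identity")
```

Only 10 of the 29 columns are identical, which is the point of the slide: the sequences have diverged a great deal even though both are drawn with the same kind of structure.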

Page 6: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs are Difficult to Align

--CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG--
   * *   *** * *   ***   *

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

Regular Alignment

Page 7: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs are Difficult to Align

Same Structure, Low Sequence Identity

Small Alphabet, Short Sequences: Alignments are often Non-Significant

Page 8: Finding, Aligning and Analyzing Non Coding RNAs

Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment

Page 9: Finding, Aligning and Analyzing Non Coding RNAs

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

Page 10: Finding, Aligning and Analyzing Non Coding RNAs

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

Simultaneous Folding and Alignment

– Time Complexity: O(L^(3n))
– Space Complexity: O(L^(2n))

(for n sequences of length L)

In Practice, for Two Sequences:

– 50 nucleotides: 1 min, 6 M
– 100 nucleotides: 16 min, 256 M
– 200 nucleotides: 4 hours, 4 G
– 400 nucleotides: 3 days, 3 T

Forget about

– Multiple sequence alignments
– Database searches
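To get a feel for these growth rates, the sketch below counts dynamic-programming steps and table cells under the usual statement of Sankoff's bounds: O(L^(3n)) time and O(L^(2n)) space for n sequences of length L. Constant factors are ignored, so the output shows relative growth only and is not expected to reproduce the timings quoted on the slide. The function name is an assumption.

```python
def sankoff_cost(length: int, n_sequences: int = 2) -> tuple[int, int]:
    """Return (time_steps, table_cells), both up to constant factors."""
    time_steps = length ** (3 * n_sequences)
    table_cells = length ** (2 * n_sequences)
    return time_steps, table_cells


if __name__ == "__main__":
    base_time, base_cells = sankoff_cost(50)
    for length in (50, 100, 200, 400):
        time_steps, table_cells = sankoff_cost(length)
        print(
            f"L={length:>3}: time x{time_steps // base_time:>7}, "
            f"space x{table_cells // base_cells:>5}"
        )
```

For two sequences, each doubling of L multiplies the step count by 64 and the table size by 16, which is why multiple alignments and database searches are out of reach.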

Page 11: Finding, Aligning and Analyzing Non Coding RNAs

The next best Thing: Consan

Consan = Sankoff + a few constraints

Use of Stochastic Context Free Grammars

– Tree-shaped HMMs
– Made sparse with constraints

The constraints are derived from the most confident positions of the alignment

Equivalent of Banded DP
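The banded-DP analogy can be illustrated on a much simpler problem. The sketch below is not Consan: it computes a plain edit distance instead of scoring a stochastic context-free grammar, and it uses a fixed-width band around the diagonal, whereas the slide says Consan derives its constraints from confident alignment positions. It only shows how restricting the table to a band leaves most cells uncomputed. The function name and the `band` parameter are assumptions.

```python
def banded_edit_distance(a: str, b: str, band: int) -> float:
    """Edit distance using only cells with |i - j| <= band.

    Returns float('inf') when no alignment path fits inside the band.
    Rows are allocated in full for clarity; only banded cells are filled.
    """
    inf = float("inf")
    rows, cols = len(a) + 1, len(b) + 1
    prev = [j if j <= band else inf for j in range(cols)]
    for i in range(1, rows):
        curr = [inf] * cols
        if i <= band:
            curr[0] = i
        lo, hi = max(1, i - band), min(cols - 1, i + band)
        for j in range(lo, hi + 1):
            substitution = prev[j - 1] + (a[i - 1] != b[j - 1])
            deletion = prev[j] + 1       # consume a[i - 1] only
            insertion = curr[j - 1] + 1  # consume b[j - 1] only
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[cols - 1]


if __name__ == "__main__":
    print(banded_edit_distance("kitten", "sitting", band=2))
    print(banded_edit_distance("kitten", "sitting", band=0))
```

With a band that contains the optimal path the result equals the unrestricted edit distance; with a band that is too narrow the function reports that no path exists, which is the accuracy risk any constraint-based speed-up carries.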

Page 12: Finding, Aligning and Analyzing Non Coding RNAs

Consan for Databases: Infernal

Infernal is a Faster version of Consan

For Database Search

Still Very Slow

Receiver Operating Characteristic (ROC) Comparison of Infernal with BLAST

Page 13: Finding, Aligning and Analyzing Non Coding RNAs

Consan for Databases: Infernal

BLAST: 360 s.

Fast Infernal: 182 000 s.

Slow Infernal: 5 320 000 s.

Page 14: Finding, Aligning and Analyzing Non Coding RNAs

Searching Databases for New RNAs

Page 15: Finding, Aligning and Analyzing Non Coding RNAs

Rfam: In practice

Rfam contains RNA families

– Families → Multiple Sequence Alignment → Models

– Models are like Pfam Profiles: they use Consan or Cmsearch rather than HMMer, and are Much Slower

– Searching the models is too expensive: models are used to build Rfam, and people usually BLAST Rfam

Page 16: Finding, Aligning and Analyzing Non Coding RNAs

Where do Rfam Families Come From?

Infernal Requires a Model

Models require an MSA

The MSA requires a Family

It all starts with a BlastN

Rfam, Gardner et al. NAR 2008

Page 17: Finding, Aligning and Analyzing Non Coding RNAs

Can we make BlastN more accurate?

BlastN is not very accurate because:

– Poor substitution models for Nucleic Acids
– Low information density (4 symbols)

BlastN assumes

– Equal evolution rates for all nucleotides
– Independence from Neighbors

Page 18: Finding, Aligning and Analyzing Non Coding RNAs

Love Thy Neighbor

Measured Nearest Neighbor Dependencies on Rfam sequences
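The transcript does not say how the dependencies were measured. One common statistic, used here as an assumption, is the observed/expected ratio of each dinucleotide: its observed frequency divided by the product of the frequencies of its two nucleotides. A ratio near 1 means the neighbours look independent. The function name is illustrative.

```python
from collections import Counter


def dinucleotide_odds(seq: str) -> dict[str, float]:
    """Observed/expected ratio for every dinucleotide present in seq."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two nucleotides")
    mono = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_pairs = len(seq), len(seq) - 1
    odds = {}
    for pair, count in pairs.items():
        # Expected frequency if the two positions were independent.
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        odds[pair] = (count / n_pairs) / expected
    return odds


if __name__ == "__main__":
    for pair, ratio in sorted(dinucleotide_odds("ACGCGTTACG").items()):
        print(pair, round(ratio, 2))
```

On real data the counts would be pooled over many sequences rather than taken from a single short string.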

Page 19: Finding, Aligning and Analyzing Non Coding RNAs

High Rate of CpG mutations

Page 20: Finding, Aligning and Analyzing Non Coding RNAs

Measuring Di-Nucleotide Evolution

Each Nucleotide can be made more informative

It can incorporate the “name” of its Neighbor

– AA => a
– AG => b
– AC => c
– AT => d
– …

A 16 Letter alphabet can be used to recode all nucleotide sequences

We name these extended Nucleotides
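A minimal sketch of the recoding, under stated assumptions: only the first four codes (AA, AG, AC, AT) are given on the slide, so the remaining twelve letters are hypothetical, as is the choice of the 3' neighbour and the decision to drop the last nucleotide. `EXTENDED_CODE` and `to_extended` are illustrative names.

```python
from itertools import product
from string import ascii_lowercase

# Order chosen so the first four codes match the slide
# (AA=>a, AG=>b, AC=>c, AT=>d); every later code is an assumption.
NUCLEOTIDES = "AGCT"
EXTENDED_CODE = {
    first + second: letter
    for (first, second), letter in zip(
        product(NUCLEOTIDES, repeat=2), ascii_lowercase
    )
}


def to_extended(seq: str) -> str:
    """Recode each nucleotide together with its 3' neighbour.

    The last nucleotide has no 3' neighbour and is dropped (an assumption).
    """
    seq = seq.upper().replace("U", "T")
    return "".join(EXTENDED_CODE[seq[i:i + 2]] for i in range(len(seq) - 1))


if __name__ == "__main__":
    print(to_extended("AAGC"))
```

The recoded string uses 16 symbols instead of 4, so it can be handled by tools that expect a protein-sized alphabet.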

Page 21: Finding, Aligning and Analyzing Non Coding RNAs
Page 22: Finding, Aligning and Analyzing Non Coding RNAs

Blosum-R and eRNA

Page 23: Finding, Aligning and Analyzing Non Coding RNAs

Substitutions ??

How much does it cost to turn one nucleotide into another?

Blosum/Pam style matrix

Matrices estimated on Rfam families
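The core of a Blosum-style estimate is a log-odds score: how much more often two symbols are aligned than expected by chance. The sketch below is a simplified version that assumes pairwise alignments as input and omits the sequence clustering and score rounding of the real BLOSUM procedure; how the talk's matrices were built from Rfam is not described in the transcript. The same code works for any alphabet, including a 16-letter one.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Estimate log2-odds substitution scores from pairwise alignments.

    aligned_pairs: iterable of (row_a, row_b) strings of equal length.
    Columns containing a gap ('-') are skipped.
    """
    pair_counts = Counter()
    for row_a, row_b in aligned_pairs:
        if len(row_a) != len(row_b):
            raise ValueError("aligned rows must have the same length")
        for a, b in zip(row_a, row_b):
            if a == "-" or b == "-":
                continue
            # Count both orders so the matrix comes out symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no ungapped columns to count")
    background = Counter()
    for (a, _), count in pair_counts.items():
        background[a] += count / total
    return {
        (a, b): math.log2((count / total) / (background[a] * background[b]))
        for (a, b), count in pair_counts.items()
    }


if __name__ == "__main__":
    scores = log_odds_matrix([("ACGT-A", "ACGTTA"), ("AAGT", "AGGT")])
    for pair, score in sorted(scores.items()):
        print(pair, round(score, 2))
```

Pairs that never co-occur are simply absent from the result; a production matrix would need a pseudocount or a floor score for them.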

Page 24: Finding, Aligning and Analyzing Non Coding RNAs

Blosum-R and eRNA

Page 25: Finding, Aligning and Analyzing Non Coding RNAs

Using BlastR

When Nucleic Acids look like Proteins, They can be aligned with Protein Methods

– BlastN → BlastP

– BlastP with eRNA is BlastR

Page 26: Finding, Aligning and Analyzing Non Coding RNAs

Validating Blast-R

Page 27: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: benchmark diagram with labels Query, Blast, Rfam, E-values, and hits marked P P P N]

Page 28: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: benchmark diagram with labels Rfam 001, Rfam 002, Rfam … (shown twice), Blast, and ROC]

Page 29: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: ROC plot with axes False Positives and True Positives; regions marked Good and Bad]

Page 30: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: ROC plot with axes False Positives and True Positives; regions marked Good and Bad]

Area Under Curve

Small AUC Better
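As a hedged illustration of how an area under a ROC curve is obtained from ranked hits, the sketch below sorts hits by E-value and sweeps down the list. It uses the common orientation, true-positive rate plotted against false-positive rate, for which 1.0 is a perfect ranking; whether the talk's plot uses the same orientation cannot be told from the transcript. Ties between E-values are not treated specially, and the function name is an assumption.

```python
def roc_auc(hits):
    """Area under the ROC curve for hits given as (e_value, is_true_member).

    Hits are ranked by increasing E-value (best first).
    """
    ranked = sorted(hits, key=lambda hit: hit[0])
    positives = sum(1 for _, is_true in ranked if is_true)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false hit")
    true_seen = 0
    area = 0
    for _, is_true in ranked:
        if is_true:
            true_seen += 1
        else:
            # Each false hit adds a column whose height is the number
            # of true hits ranked above it.
            area += true_seen
    return area / (positives * negatives)


if __name__ == "__main__":
    example = [(1e-30, True), (1e-10, False), (1e-5, True), (1.0, False)]
    print(roc_auc(example))
```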

Page 31: Finding, Aligning and Analyzing Non Coding RNAs

BlastR vs The World

Page 32: Finding, Aligning and Analyzing Non Coding RNAs

The 3 Components of BlastR

BlastP is better than BlastN

BlosumR makes BlastP a little bit better

Blast: wuBlast

Page 33: Finding, Aligning and Analyzing Non Coding RNAs

The 3 Components of BlastR

BlastP is better than BlastN

BlosumR makes BlastP a little bit better

And Faster

Page 34: Finding, Aligning and Analyzing Non Coding RNAs

BlastR and Clustering

Given all Rfam in Bulk

How good is BlastR at reconstituting all the families

[Figure: plot of Sensitivity against 1-Specificity]

Page 35: Finding, Aligning and Analyzing Non Coding RNAs

BlastR and Clustering

Given all Rfam in Bulk

How good is BlastR at reconstituting all the families

[Figure: plot of Sensitivity against 1-Specificity]
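The transcript does not say how the clustering was scored. One plausible reading, taken here as an assumption, is to count sequence pairs: a pair from the same Rfam family that lands in the same cluster is a true positive, a pair from different families in the same cluster is a false positive, and so on. The function name is illustrative.

```python
from itertools import combinations


def pairwise_rates(true_families, predicted_clusters):
    """Return (sensitivity, one_minus_specificity) over all sequence pairs."""
    if len(true_families) != len(predicted_clusters):
        raise ValueError("one label per sequence is required in both lists")
    tp = fp = tn = fn = 0
    for i, j in combinations(range(len(true_families)), 2):
        same_family = true_families[i] == true_families[j]
        same_cluster = predicted_clusters[i] == predicted_clusters[j]
        if same_family and same_cluster:
            tp += 1
        elif same_family:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need both same-family and different-family pairs")
    return tp / (tp + fn), fp / (fp + tn)


if __name__ == "__main__":
    families = ["RF1", "RF1", "RF2", "RF2"]
    clusters = [0, 0, 0, 1]
    print(pairwise_rates(families, clusters))
```

Repeating the calculation for a range of E-value cut-offs would give one point per cut-off on a Sensitivity versus 1-Specificity plot.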

Page 36: Finding, Aligning and Analyzing Non Coding RNAs

BlastR: In Practice

Page 37: Finding, Aligning and Analyzing Non Coding RNAs

BlastR: In Practice

E-Value Threshold: 10^-20

BlastN

BlastR

Page 38: Finding, Aligning and Analyzing Non Coding RNAs

Take Home

Searching Nucleotides is Difficult

BlastN is not a very good algorithm

Simple Adaptations can improve the situation

– Changing the algorithm (BlastP)
– Changing the Scoring Scheme (BlastP-Nuc)
– Changing the alphabet (BlastR)