The modern RNA world: computational screens for noncoding RNA genes Eddy lab HHMI/Washington University, Saint Louis


Page 1

The modern RNA world: computational screens for noncoding RNA genes

Eddy lab
HHMI/Washington University, Saint Louis

Page 2

The human genome sequence is (almost) done

Page 3

The genome, famously, is digital

1892: Miescher postulates that genetic information may be encoded in a linear form using a few different chemical units:

“...just as all the words and concepts in all languages can find expression in twenty-four to thirty letters of the alphabet.”

Page 4

Symbolic texts can be cracked

“Cryptography has contributed a new weapon to the student of unknown scripts.... the basic principle is the analysis and indexing of coded texts, so that underlying patterns and regularities can be discovered. If a number of instances can be collected, it may appear that a certain group of signs in the coded text has a particular function....” - John Chadwick, The Decipherment of Linear B, Cambridge Univ. Press, 1958

Michael Ventris and John Chadwick, 1953

Page 5

The phylogenetic history of life

Page 6

Comparative genome analysis
VISTA plot; I. Dubchak, E. Rubin, et al.

human, mouse, dog genomes

Page 7

Estimates of human gene number
www.ensembl.org/Genesweep/

mean: 61,710

high: 153,478
low: 27,462

Want to place a bet? The book is held by the bartender at Cold Spring Harbor Laboratory.

Page 8

Life with 6000 Genes

A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver

Science 274:546, 1996

but besides the ~6000 large protein-coding genes, there’s also:
140 ribosomal RNA genes,
275 transfer RNA genes,
~40 small nuclear RNA genes,
~100 small nucleolar RNA genes,
... and ... ?

The yeast genome completed

where “gene” = ORF of 100 amino acids or more.

Page 9

Structure of the large ribosomal subunit
Haloarcula marismortui

Ban et al., Science 289:905, 2000

Page 10

inside-out genes

Human UHG (U22 host gene)
no significant ORFs; not conserved with mouse; rapidly degraded

Eight intron-encoded snoRNAs
conserved with mouse; stable

Tycowski, Shu, and Steitz
Nature 379:464, 1996

Page 11

An RNA motor
Simpson et al., Nature 408:745, 2000

“Structure of the bacteriophage φ29 DNA packaging motor”

Page 12

Cartilage-hair hypoplasia mapped to an RNA
M. Ridanpaa et al., Cell 104:195, 2001

RMRP: Human RNase MRP, 267 nt

Page 13

microRNAs (miRNAs) in metazoa

~22-mer processed from ~70-mer precursor by RNAi pathway

lin-4 acts as translational repressor by binding 3’ UTR

T. Tuschl; D. Bartel; V. Ambros

Page 14

RNA genes can be hard to detect

UGAGGUAGUAGGUUGUAUAGU

C. elegans Let-7; 21 nt
Pasquinelli et al., Nature 408:86, 2000

• often small
• sometimes multicopy and redundant
• often not polyadenylated (and remember EST libraries are poly-A selected)
• immune to frameshift and nonsense mutation
• no open reading frame or codon bias
• relatively little information in primary sequence consensus

Page 15

Two computational analysis problems

1. Similarity search (e.g. BLAST): I give you a query; you find sequences in a database that look like the query.

For RNA, you want to take the secondary structure of the query into account.

2. Genefinding (e.g. GENSCAN): Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence.

For RNA – with no open reading frame and no codon bias – what do you look for?

Page 16

RNA structure: nested pairwise correlations

Page 17

Context-free grammars
Noam Chomsky, 1956

a CFG “derivation”
Basic CFG “production rules”
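
A minimal Python sketch of what production rules and a derivation look like for an RNA-style grammar. The rule set and the rule names are illustrative assumptions, not the grammar drawn on the slide. S is the only nonterminal; a pairing rule such as S -> cSg emits both halves of a base pair at once, which is how a CFG can express nested pairwise correlations.

```python
# Toy context-free grammar for RNA-like strings (hypothetical rule set).
# "S" is the only nonterminal; lowercase letters are terminals.
RULES = {
    "pair_au": "aSu",   # S -> a S u  (emits a base pair)
    "pair_cg": "cSg",   # S -> c S g
    "pair_gc": "gSc",   # S -> g S c
    "left_g": "gS",     # S -> g S    (emits one unpaired base)
    "left_a": "aS",     # S -> a S
    "end": "",          # S -> epsilon
}


def derive(rule_names):
    """Apply the named rules in order to the leftmost S.

    Returns the list of intermediate strings, starting from "S".
    """
    current = "S"
    steps = [current]
    for name in rule_names:
        if "S" not in current:
            raise ValueError("no nonterminal left to rewrite")
        current = current.replace("S", RULES[name], 1)
        steps.append(current)
    return steps


# Three nested pairs closing a four-base loop.
steps = derive(["pair_cg", "pair_au", "pair_gc",
                "left_g", "left_a", "left_a", "left_a", "end"])
for s in steps:
    print(s)
```

The last string printed is the derived sequence; the first three rules fix the stem, so positions 1-3 are complementary to positions 10-8.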

Page 18

Sequence vs. secondary structure alignment
R Durbin, SR Eddy, GJ Mitchison, A Krogh
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Cambridge Univ. Press, 1998

Goal                         HMM algorithm (sequence)   SCFG algorithm (structure)
optimal alignment            Viterbi                    CYK
P(sequence | model)          Forward                    Inside
EM parameter estimation      Forward-Backward           Inside-Outside
memory complexity:           O(MN)                      O(MN^2)
time complexity (general):   O(M^2 N)                   O(M^3 N^3)
time complexity (as used):   O(MN)                      O(MN^3)

• we can analyze target sequences with secondary structure models;
• but the algorithms are computationally expensive.
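
To give a feel for where the N^3 time and N^2 memory terms come from, here is a small base-pair maximization recursion (Nussinov-style). This is a deliberately simpler stand-in, not CYK over a full SCFG and not the lab's code. The allowed pair set and the min_loop parameter are assumptions made for the sketch.

```python
# Nussinov-style base-pair maximization: a simplified stand-in for
# structure-aware dynamic programming. Fills an N x N table (N^2 memory)
# and tries every split point for every cell (N^3 time).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed constraint, chosen to forbid impossibly tight hairpins).
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: j is unpaired
            for k in range(i, j - min_loop):    # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_pairs("GGGAAAUCCC"))
```

The three nested loops over span, i, and k are the cubic term; an SCFG algorithm adds a factor for the number of model states M on top of this.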

Page 19

SCFG-based RNA similarity search
C/D methylation guide snoRNA consensus:

Graphical model, prior to conversion to probabilistic model:

the program snoscan was used to detect C/D snoRNA homologues in Archaea;
Omer et al., Science 288:517-522, 2000

Page 20

SCFGs for RNA folding

Full SCFG analogue of Michael Zuker’s minimum energy RNA folding –

means we can apply statistical models to any RNA structure (e.g., what’s the probability that this is a plausible RNA structure?)

Elena Rivas and S.R. Eddy, Bioinformatics 16:573, 2000

Page 21

Genefinding by comparative analysis
Jonathan Badger, Gary Olsen: CRITICA, Mol. Biol. Evol. 16:512, 1999

The OTHER model:
score with terms P(a,b | OTH)
models divergence only

the CODING model:
score with terms P(aaa,bbb | COD)
models divergence, constrained by amino acid substitution matrix and codon bias

Most comparative analysis relies just on differential rates of evolution.
However, the pattern of mutation is also informative.
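
As a rough illustration of "pattern of mutation", the sketch below tallies mismatches by codon position in an ungapped pairwise alignment. It is not the CRITICA or QRNA scoring model, which are probabilistic; the function name, the frame argument, and the example sequences are hypothetical. In a conserved coding region, differences tend to pile up at third codon positions.

```python
def mismatches_by_codon_position(a, b, frame=0):
    """Count mismatches at codon positions 1, 2, 3 of an ungapped alignment.

    a, b  : aligned sequences of equal length (no gap characters assumed)
    frame : offset of the first codon's first base (assumed known here)
    """
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    counts = [0, 0, 0]
    for idx in range(frame, len(a)):
        if a[idx] != b[idx]:
            counts[(idx - frame) % 3] += 1
    return counts


# Hypothetical pair: every difference is a synonymous third-position change
# (GCT/GCC = Ala, AAA/AAG = Lys, GGT/GGA = Gly).
seq_a = "ATGGCTAAAGGT"
seq_b = "ATGGCCAAGGGA"
print(mismatches_by_codon_position(seq_a, seq_b))
```

A region evolving without coding constraint would be expected to spread its mismatches roughly evenly over the three positions instead.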

Page 22

add: a comparative model of structural RNAs

The RNA model:
terms: P(a-a’, b-b’ | RNA)
models DNA divergence constrained by a secondary structure

Elena Rivas, S.R. Eddy: QRNA, BMC Bioinformatics 2:8, 2001
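
The signal the RNA model rewards is compensatory change: both sides of a base pair mutate, yet pairing is preserved. The sketch below counts such columns for a gap-free alignment with a known structure in dot-bracket notation. These are simplifying assumptions; QRNA itself does not require a known structure. Function names and sequences are hypothetical.

```python
# Pairs treated as compatible with a helix (G-U wobble included by assumption).
WC = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(structure):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def count_compensatory(a, b, structure):
    """Count paired columns where both bases differ between a and b
    but each sequence still forms an allowed pair."""
    if not (len(a) == len(b) == len(structure)):
        raise ValueError("sequences and structure must have equal length")
    total = 0
    for i, j in paired_columns(structure):
        both_changed = a[i] != b[i] and a[j] != b[j]
        if both_changed and (a[i], a[j]) in WC and (b[i], b[j]) in WC:
            total += 1
    return total


structure = "(((....)))"
rna_a = "GGCAAAAGCC"
rna_b = "GACAAAAGUC"   # G-C in rna_a becomes A-U at the middle pair
print(count_compensatory(rna_a, rna_b, structure))
```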

Page 23

Some technical issues

- The structure is unknown; must do ensemble averaging.

- model must deal with gapped alignments.

- bounds of conservation or alignment don’t correspond to bounds of RNA.

- evolutionary divergence times of the three models must be the same.

We use a form of probabilistic model called “pair-SCFGs”.

Page 24

Three models – examples of their scores

Page 25

A screen for novel ncRNAs in E. coli
Elena Rivas et al., Curr Biol 11:1369, 2001

2367 E. coli intergenic sequences >50 nt in length

WUBLASTN vs. S. typhi, S. paratyphi, S. enteritidis, K. pneumoniae
gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity

QRNA classified:
556 candidate RNA loci
160 candidate small ORFs (not examined further)

281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements

leaves 275 candidate ncRNA gene loci

Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media
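
A minimal sketch of the alignment filter stated above (E<0.01, length >50 nt, >65% identity), followed by the candidate arithmetic. The Alignment record, its field names, and the example hits are hypothetical; this is not the pipeline's actual code or WUBLASTN's output format.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One pairwise alignment hit (hypothetical record layout)."""
    query: str               # E. coli intergenic region identifier
    evalue: float
    length: int              # aligned length in nt
    percent_identity: float


def passes_screen(aln):
    """Thresholds as stated on the slide."""
    return aln.evalue < 0.01 and aln.length > 50 and aln.percent_identity > 65.0


hits = [
    Alignment("igr1", 1e-5, 120, 82.0),   # passes all three cuts
    Alignment("igr2", 0.5, 200, 90.0),    # E-value too high
    Alignment("igr3", 1e-8, 40, 95.0),    # too short
    Alignment("igr4", 1e-3, 75, 60.0),    # identity too low
]
kept = [h.query for h in hits if passes_screen(h)]
print(kept)

# Candidate bookkeeping, using the counts reported on the slide.
remaining = 556 - 281           # candidate RNA loci minus explainable loci
confirmed_fraction = 11 / 49    # Northern-confirmed fraction of tested loci
print(remaining, round(confirmed_fraction, 2))
```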

Page 26

Northern blots confirming E. coli RNAs

Page 27

The Altuvia screen

“Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....”

Argaman et al., Current Biology 11:941, 2001
“Novel small RNA-encoding genes in the intergenic regions of E. coli”

sraA 120 nt
sraB 149-168 nt
rprA 105 nt
sraC 234-249 nt
sraD 70 nt
gcvB 205 nt
sraE 88 nt
sraF 189 nt
sraG 146-174 nt
sraH 88-108 nt
sraI 91-94 nt
sraJ 172 nt
sraK 245 nt
sraL 140 nt

• start w/ “intergenic” regions

• computational identification of putative promoter and terminator, 50-400 nt apart

• select regions conserved with other bacteria by BLAST

Page 28

The Gottesman screen
Wassarman et al., Genes Dev. 15:1637, 2001

“Identification of novel small RNAs using comparative genomics and microarrays”

rydB 60 nt
ryeE 86 nt
ryfA 320 nt
ryhA 45 nt (sraH)
ryhB 90 nt (sraI)
ryiA 210 nt
ryjA 92 nt
rybB 80 nt
ryiB 270 nt (sraK, csrC)
rybA 205 nt
rygA 89 nt (sraE)
rygB 83 nt
ryeA 275 nt
ryeB 100 nt
ryeC 107,143 nt
ryeD 102,137 nt
rygC 107,139 nt

• intergenic regions >= 180 nt

• conserved w/ other bacteria by BLAST

• manual inspection of location & sequence

• expression detected on high-density oligo probe array

“... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....”

Page 29

Summary of three E. coli screens

31 different new RNAs found and confirmed by the three screens:
Altuvia: 14
Gottesman: 19 (1 showed no expression; 1 untested)
Rivas: 22 (1 showed no expression; 10 untested)

Conclusions: Sensitivity of QRNA is respectable; most E. coli ncRNAs conserve secondary structure

Only 4/11 of our confirmed ncRNAs are in the Altuvia or Gottesman genes

Conclusions:
• These screens have not saturated E. coli for new ncRNAs.
• We have >200 other candidates in testing.
• We have confirmed transcripts as short as 40 nt.
• The functions of these RNAs are unknown.

Page 30

Pyrococcus: three hyperthermophile genomes

A “black smoker” – deep sea hydrothermal vent
photo: American Natural History Museum

• P. horikoshii
  1.8 Mb, complete
  isolated off Okinawa, 1400m depth
  Kawarabayasi et al. (NITE, Tokyo)

• P. furiosus
  1.9 Mb, complete
  from Vulcano Island, Italy
  Robb et al. (Utah Genome Center)

• P. abyssi
  1.8 Mb, complete
  from South Pacific vent, 3500m depth
  Genoscope (France)

Page 31

G/C composition detects RNAs in Pyrococcus

Page 32

RNAs stand out in AT-rich hyperthermophiles

Organism        growth temp (C)   % GC (genome)   % GC (RNA)   %RNA-%genome   % known RNAs detected
Methanococcus   85                31%             67%          36%            97%
Pyrococcus      98                42%             71%          29%            52%
Borrelia        37                29%             54%          25%            29%
Aquifex         90                44%             68%          24%            14%
Archaeoglobus   83                48%             68%          20%            2%
S. cerevisiae   30                38%             54%          16%            0
E. coli         37                51%             59%          8%             0

Page 33

The G/C computational screen

Implemented as a 2-state hidden Markov model, using Viterbi or posterior decoding algorithms.

Methanococcus jannaschii: (Viterbi parse alone)
43 regions detected (some span multiple RNAs)
includes 36/37 tRNAs; SSU and LSU rRNA; 5S, 7S, RNase P.
9 unassigned candidates.
4/9 express small RNAs detectable on Northern.

Pyrococcus furiosus: (posterior decoding, plus conservation w. P.a., P.h.)
51 regions detected (some span multiple RNAs)
includes 46/46 tRNAs, SSU and LSU rRNA; 2 5S, 7S, and RNase P.
8 unassigned candidates.
4/8 express small RNAs detectable on Northern.

Robbie Klein et al., manuscript submitted
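
A minimal sketch of a 2-state HMM with Viterbi decoding for G/C content, in the spirit of the screen described above. The state names, emission values (30% GC background, 70% GC RNA state), and the self-transition probability are assumptions chosen for illustration; they are not the parameters used in the actual screen, and posterior decoding is not shown.

```python
import math


def viterbi_gc(seq, gc_bg=0.3, gc_rna=0.7, stay=0.99):
    """Label each base "bg" (AT-rich background) or "rna" (GC-rich).

    gc_bg, gc_rna : assumed GC fraction emitted by each state
    stay          : assumed probability of remaining in the same state
    """
    if not seq:
        return []
    states = ("bg", "rna")
    gc = {"bg": gc_bg, "rna": gc_rna}

    def emit(state, base):
        # G and C share the state's GC mass equally; A and T share the rest.
        p = gc[state] / 2 if base in "GC" else (1 - gc[state]) / 2
        return math.log(p)

    def trans(prev, cur):
        return math.log(stay) if prev == cur else math.log(1 - stay)

    score = {s: math.log(0.5) + emit(s, seq[0]) for s in states}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + trans(p, s))
            pointers[s] = prev
            new_score[s] = score[prev] + trans(prev, s) + emit(s, base)
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical sequence: a GC-rich island inside AT-rich flanks.
genome = "AT" * 20 + "GC" * 20 + "AT" * 20
path = viterbi_gc(genome)
print("".join("R" if s == "rna" else "." for s in path))
```

With a high self-transition probability the model only switches states when a long enough run of G/C bases pays for the two transitions, which is why short GC-rich stretches would be ignored.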

Page 34

Pyrococcus genome comparisons

Page 35

Comparison of G/C to QRNA screen
Robbie Klein et al., PNAS, in press

                                G/C screen   QRNA screen   Both
Candidate loci:                 51           73            n.d.
known tRNAs detected (of 46):   46           45            45
novel loci:                     8            17            4
Confirmed by Northern:          4            4             3

• Like the E. coli screen, about 25% of QRNA candidates were confirmed by Northern (again in a single growth condition only).

• QRNA is detecting most novel structural RNA genes.

P. furiosus – screened by QRNA by comparison to P. horikoshii, P. abyssi

Page 36

Archaeal RNA Northerns

Page 37

human/mouse ncRNA detection

the cartilage-hair hypoplasia region:

QRNA is a general genefinder for structural ncRNA genes.

Page 38

The ancient RNA World
Gesteland, Cech, Atkins: The RNA World, CSHL Press, 1999

Page 39

RNA is very good at recognizing RNA
Ha, Wightman, Ruvkun; Genes Dev. 10:3041, 1996

Page 40

A closing idea: The modern RNA world

Hypothesis:
When a cell needs a molecule that specifically recognizes a target RNA molecule, and the function is either:

- catalytically unsophisticated
- something that can be abstracted onto a shared protein (e.g. many guide snoRNAs, one methylase)

then RNA may be the material of choice. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.

Page 41

In fact, an old idea...
Jacob and Monod, JMB 3:318, 1961

Page 42

Summary

• There appear to be many noncoding RNA genes.

• Methods to find homologous RNAs by structural similarity have been greatly improved, using stochastic context free grammar algorithms.

• Methods to find novel RNAs by de novo genefinding have finally become possible, for instance by using comparative genome analysis.


[SR Eddy, Nature Reviews Genetics, 2:919, 2001]

[R Durbin et al., Biological Sequence Analysis, Cambridge U. Press 1998]

[E Rivas, RJ Klein, TA Jones, SR Eddy, Curr Biol 11:1369, 2001; E Rivas, SR Eddy, BMC Bioinformatics, 2:8, 2001]

Page 43

Acknowledgements
the Eddy lab: http://www.genetics.wustl.edu/eddy/

senior scientist: Elena Rivas

students: Zhirong Bao, Christian Zmasek, Robin Dowell, Robbie Klein, Steve Johnson, Shawn Stricklin, John McCutcheon

systems: Goran Ceric

webmaster: Ajay Khanna

wet lab: Ziva Misulovin

secret agent man: Tom Jones

funding: HHMI, NIH NHGRI, NSF, Monsanto