searching genomes for noncoding rna cs374 leticia britos 10/03/06

Searching genomes for Searching genomes for noncoding RNA noncoding RNA

CS374

Leticia Britos

10/03/06

DNA to RNA, and genes

G

A

G

U

C

A

G

C

DNA, ~3x109 long in humansContains ~ 22,000 genes

RNA: carries the “message” for “translating”, or “expressing” one gene

transcription translation

folding

1

2easy

3

http://www.biochemistry.bham.ac.uk/sjwb/alpharot.gif

“Structural genes encode proteins and regulatory genes produce non-coding RNA”

F. Jacob and J. Monod (1961)

Where are the genes?Where are the genes?

Gene Finding

atg

tga

ggtgag

ggtgag

ggtgag

cagatg

cagttgcaggcc

ggtgag

Where are the genes?Where are the genes?

Gene Finding

In humans:

~22,000 genes~1.5% of human DNA

An expanding universe of noncoding RNA

• rRNArRNA (structure/function of ribosomes)• tRNAtRNA (translation)• snRNAsnRNA (RNA splicing, telomere maintenance)• snoRNAsnoRNA (chemical modification of rRNA)• miRNAmiRNA (translational regulation)• gRNAgRNA (mRNA editing)• tmRNAtmRNA (degradation of defective proteins)• riboswitches riboswitches (translational and transcriptional regulation)• ribozymesribozymes (autocatalytic RNA)• RNAi RNAi (gene regulation by dsRNA)

Exciting times for the RNA world (and for Stanford)

atg

tga

ggtgag

ggtgag

ggtgag

caggtg

cagatg

cagttgcaggcc

ggtgag

How to find ncRNAs?

How to find ncRNAs?

Riboswitches

5’ 3’

promoter

5’ UTR exons 3’ UTR

introns

coding

5’ 3’

promoter


introns

coding

noncoding

5’ 3’

promoter


introns

coding

Sequence conservation is not enough

Secondary structure is not enough

Noncoding RNA signals in the genome are not as strong as the signals for protein

coding genes

Look for structure in evolutionary conserved sequences

Identify new instances of a given ncRNA family in a genome

Existing algorithms

• CMSearch

• RSEARCH

• ERPIN

Example: finding 5S RNAs in a 1.6Mb genome

• RSEARCH: 6.5 h

• FastR: 103 s

What is a Database filter?

A computational procedure that takes a DB as input and outputs a subset of it.

filter

The object being searched for remains in the DB after filtering (sensitivity)

The filtered DB is significantly smaller

The filtering operation is fast (efficiency)

Problem

• Given an RNA sequence with known structure, find homologous sequences in a RNA DB

AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA

UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC

AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA

AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG

AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU



QUERY DB


• Stage 2: alignalign the selected sequences in the DB with the query and determine the best alignments

Solution

filterfilter

• Stage 1: filterfilter the DB

Filtering

• Sequence alone is not sufficient

• Structure alone is not sufficient

Filter using both sequence and structural features

Structural features: (k,w)-stacks(k,w)-stacks


3 6 25 28

a a’

a a’

a a’

a a’

AU

UC

CG

GG

AA

CA

UA

GU

AU

AG

GC

GA

CG

GA

UU

AG

CU

AG

CC

AA

Definition of a (k,w)-stack(k,w)-stack


d = 18

A pair of substrings of at least length k, that are at most w bases apart

a

a’

Is a,a’ :(4,18)-stack?(4,20)-stack?(4,9)-stack?(3, 20)-stack?

Use of (k,w)-stacks as filters in the search for ncRNAs

If we use a (7,70)-stack filter, we eliminate 90% of the DB from consideration

Structural features: nestednested (k,w,l)-stacks


3 6 25 2812 1614 18

Structural features: parallelparallel (k,w,l)-stacks


3 6 25 2812 1614 18 32 3534 36

Structural features: multiloopmultiloop (k,w,l)-stacks

Filtering criteria

nested stacks

Parallel stacks

Multiloop stacks

Filtering algorithm

1. Build a hash table hash table of kmers in the DB

AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA

kmer indexAUUC

k=4

1

Filtering algorithm



kmer indexAUUC

k=4

1UUCC 2

Filtering algorithm



kmer indexAUUC

k=4

UUCG

1UUCC 2

3

Filtering algorithm



kmer indexAUUC

k=4

UUCGCCGG

1UUCC 2

34

Filtering algorithm



kmer indexAUUC

k=4

UUCGCCGG

1UUCC 2

34

14

Filtering algorithm

2. Identify (k,w)-stacks


kmer indexAUUC

k=4

UUCGCCGG

1UUCC 2

34

14

reverse complement

GAAU n

Filtering algorithm

2. Identify (k,w)-stacks


kmer indexAUUC

k=4

UUCGCCGG

1UUCC 2

34

14

reverse complement

GAAU n

d w?

Filtering algorithm

3. Compute complex stacks using DP

nested

parallel

multiloop

Result of stage 1 (filtering)AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA


AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCA





• Stage 2: alignalign the selected sequences in the DB with the query and determine the best alignments

Solution

filterfilter

• Stage 1: filterfilter the DB

Possible ways to align RNAs

1. sequence to sequence

2. structure to structure

3. sequence to structure

RNA sequence structure alignment

AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA

QueryQuery

s [1,……………………………………………………..m]


t [1,……………………………………………………..n]

DB (filtered)DB (filtered)

S (set of base pairings)

The secondary structure of the query is represented by a binarized tree

ji j -1i +1

i - jRule 1: when i and j are

paired


ji j -1

Rule 2: when j is unpaired

Rule 3: when j is paired

but not to the left-most base


jk -1i k


ji


Final binary tree


Binary treeBinary tree

Alignment algorithm

Alignment algorithm

Optimal alignment between the query (with structure v) and substring (i-j) of the DB

black node

white node / one child

white node / two children



Optimal alignment

i

jv

A [i,j,v]

Alignment algorithm



black node

Alignment algorithm



black node

ji j -1i +1

i - j

j -1i +1

i - j

alignment score for pairing and structure

Alignment algorithm

black node



ji j -1

i - j

ji

j -1

i - j

j-1

alignment score for pairing

Alignment algorithm

black node



jk -1i k

sliding k

Validation

Known instances of ncRNAs (tRNA, 5S rRNA, ribozymes, riboswitches) are

inserted in a random sequence (1Mb)

Filtering and alignment algorithms are applied (with k, l, w, %GC)

Validation: filtering

Validation: filtering and alignment

Results: riboswitches

5’ 3’

promoter


introns

coding

non-coding

Riboswitches

Riboswitch families

Results: new riboswitches

RealReal motivation of bioinformaticists:

“We design novel filters and show that they dominatedominate the HMM filters of Weiberg and Ruzzo…”

world domination

searching genomes for noncoding rna cs374 leticia britos 10/03/06

Documents

db slide

genome slide

fastr slide

s slide

sufficient slide

dsrna slide

stanford slide

human dna slide