
Gene discovery using combined signals from genome sequence and natural selection

Michael Brent
Washington University

The mouse genome analysis group

GENSIPS 10/7/2002 2

Genes are read out via mRNA

[Figure; surviving label: "& processing"]

RNA Processing

A typical human gene structure

In a mammalian genome

Finding all the genes is hard
• Mammalian genomes are large
– 5,051 miles of 10pt type
– Raleigh to Tripoli, Libya
• Only about 1.5% protein coding
– Raleigh to Winston-Salem

Genes are fairly unconstrained

Intron length is highly variable
• ~5% are 40-100 nt long
• ~3% are longer than 30,000 nt
Distance between genes is highly variable
• From 10^3 to 10^6 nt or more (probably)

Exons per gene (RefSeq)

[Figure: histogram of percent of genes (y-axis, 0% to 12%) by number of exons (x-axis, 1 to 30+); series: RefSeq.]

Background is not random

Segmental duplications
• Entire regions duplicate, then diverge slowly
Processed pseudogenes
• Spliced transcripts integrate back into the genome
– Sequence is similar to source genes
– Generally not functional

Gene prediction: two approaches

1. Transcript-based (e.g., GeneWise)
A. Map experimentally determined sequences of spliced transcripts to their genomic source
B. Map transcript sequences to genomic regions that could produce similar transcripts
2. De novo (genome only)
• Model DNA patterns characteristic of gene components
– Splice donor and acceptor
– Protein coding sequence
– Translation start and stop

Advantages and disadvantages

Transcript-based
• Advantage: conservative
– Evidence of transcription for every exon
• Disadvantage: conservative
– Can’t find “truly novel” genes
• Still subject to error

Advantages and disadvantages

De novo
• Advantage 1: Less biased toward
– Known transcripts
– Transcripts that can be sequenced easily
• Advantage 2: Genome sequencing is easy
• Disadvantages
– No direct evidence of transcription
– Presumably, more false positives

Single-genome de novo: Genscan

Strengths
• For mammalian sequence, one of the best single-genome, de novo gene predictors
• Widely used to great practical advantage
• De facto standard for mammalian sequence
Limitations
• Predicts >45K genes (best est.: 25-30K)
• Predicts >315K exons (best est.: 200K-250K)
• Gets only 9% of known genes exactly right*

Dual genome de novo

We developed algorithms that use two genomes to
• Reduce the number of false positives
• Refine the details of the structures

Single-genome de novo method

Probability model
• Assigns probability to annotated DNA sequences:

5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
(Annotation labels on the figure: 5’ UTR, Exon, Intron)

Optimization algorithm
• Given a DNA sequence, find the most probable annotation, according to the model

Genscan’s generative model

CCATGGCGTCTTCAGGCAGTGACTC
(Annotation labels on the figure: Intron, Exon, Intron)

Genscan’s generative model

Generalized HMM
• States correspond to gene features
• Model generates DNA sequence by passing through states
• The probability of annotated DNA sequence is the probability of
– generating the DNA sequence
– by passing through states corresponding to the annotation.
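
The "find the most probable annotation" step can be illustrated with Viterbi decoding. What follows is only a minimal sketch, not Genscan itself: it uses an ordinary HMM that emits one base per step (Genscan's generalized HMM emits whole features with explicit length distributions), and every state name and probability below is a made-up value chosen for illustration.

```python
import math

STATES = ("intergenic", "exon", "intron")

# Hypothetical parameters, chosen only for illustration.
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon": {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron": {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.30, "C": 0.15, "G": 0.15, "T": 0.40},
}


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return (state path, log probability) of the most probable annotation."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    # best[s]: log probability of the best path that ends in state s.
    best = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            score, arg = max((prev[r] + log(TRANS[r][s]), r) for r in STATES)
            best[s] = score + log(EMIT[s][base])
            pointers[s] = arg
        back.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

Calling `viterbi("CCATGGCGTCTTCAGGCAGTGACTC")` returns one state label per base together with the log probability of that labeling under the toy parameters.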

Dual genome prediction

Input
• Target and informant genomes
Idea
• Patterns of evolution since the last common ancestor may reveal gene structure

Two conservation signals

1. Local alignment signal
• Selective pressures differ by feature
• This leaves a characteristic signature
2. Structural signal
• Locations of introns tend to be conserved

Characteristic local alignments

Coding exon
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
       |||||||||||||||||||| || ||||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC

Intron (non-coding)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||     ||  | |||||||||  || ||   ||
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT

Conservation of intron location

Align → predict → filter → test

• Align: WU-BLAST
• Predict: TWINSCAN
• Filter: Aligned Intron Filter
• Test: Validation (RT-PCR)

[Figure: the coding-exon alignment shown earlier, with the excerpt TCTGCCACC / TCAGCTACT passed to TWINSCAN]

TWINSCAN

Representation change
TCTGCCACC
|| || ||
TCAGCTACT
becomes the conservation sequence
TCTGCCACC
||:||:||

gHMM decoding

BLAST Alignments

[Figure: BLAST alignments between Target and Informant]

Projecting BLAST Alignments

[Figure, repeated over four slides: BLAST alignments projected between Target and Informant]

Conservation sequence

Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||.........|:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned

The slides build the track in that order: identical positions first, then mismatches and gaps, then unaligned positions. The informant row and the gap columns in the target are then dropped.

Conservation sequence
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
       ||||||.........|:|||||||||::||:||||:
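
The three pairing rules translate directly into code. This is a minimal sketch under an assumed input encoding that is not from the slides: the projected alignment is given as two equal-length rows, with "-" marking a gap and a space in the informant row marking a target position that no alignment covers. The function name `conservation_sequence` is hypothetical.

```python
def conservation_sequence(target_row, informant_row):
    """Build the conservation sequence for the target.

    Emits one symbol per target nucleotide:
    "|" aligned and identical, ":" aligned to a mismatch or gap,
    "." unaligned. Columns where the target has a gap are skipped,
    because they contain no target nucleotide to pair with.
    """
    if len(target_row) != len(informant_row):
        raise ValueError("alignment rows must have equal length")
    symbols = []
    for t, i in zip(target_row, informant_row):
        if t == "-":
            continue  # no target nucleotide in this column
        if i == " ":
            symbols.append(".")  # unaligned
        elif i.upper() == t.upper():
            symbols.append("|")  # aligned and identical
        else:
            symbols.append(":")  # aligned to a mismatch or a gap
    return "".join(symbols)


# The projected human/mouse example from the slides.
TARGET_ROW = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
INFORMANT_ROW = "CTAGAG" + " " * 9 + "AGACAGGTACCATAGGGCTCTCCT"

if __name__ == "__main__":
    print(TARGET_ROW.replace("-", ""))
    print(conservation_sequence(TARGET_ROW, INFORMANT_ROW))
```

The result has exactly one symbol per target nucleotide, so it can be fed to the model alongside the ungapped target sequence.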

Twinscan: Extending the model

Probability model
• Assigns probability to annotated DNA:

5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
  |||........|:||||:|||||||||:||::||
(Annotation labels on the figure: 5’ UTR, Exon, Intron)

Optimization
• Given DNA and conservation sequence, find the most probable annotation, according to the model

Twinscan

• Each state “generates” DNA and conservation sequence independently
• Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
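
One way to write this factorization down is sketched below. The notation is an assumption introduced here and does not come from the slides: the annotation is a sequence of states q_1, ..., q_n; D_k and C_k are the stretches of DNA and conservation sequence generated by state q_k; a is the state transition probability (with a_{q_0 q_1} read as the initial probability); and f_q is the length distribution of state q.

```latex
P(D, C, q_1 \dots q_n)
  = \prod_{k=1}^{n}
      a_{q_{k-1} q_k}\,
      f_{q_k}\!\left(|D_k|\right)\,
      P(D_k \mid q_k)\,
      P(C_k \mid q_k)
```

Dropping the last factor gives the single-genome case, so the conservation sequence enters only as one extra per-state term.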

Performance Evaluation

RefSeq
• A set of ~13,000 “known” mRNAs
• Represents ~40-50% of human genes
– Usually, only one of several splices
• Mapping to genome is imperfect
• Best available gold standard

[Figure: histogram of percent of genes (y-axis, 0% to 16%) by number of exons (x-axis, 1 to 30+); series: RefSeq and Twinscan.]

Short term goal

All multi-exon human genes
• Predict accurately
– Integrate information from more genomes
• Verify at least one intron experimentally
• Follow up with full-length verification

Acknowledgments

Funding agencies
• National Institutes of Health (NHGRI)
• National Science Foundation (DBI)
Sequencing centers
• Sanger, Whitehead, Wash. U.
My group
• Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
Collaborators
• Roderic Guigo, Josep Abril, Genis Parra
– Pankaj Agarwal
• Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

Other clades

Plants
• Arabidopsis thaliana, cabbage, rice
Nematodes
• C. elegans, C. briggsae
Fungi
• Cryptococcus neoformans (JEC21, H99)

Pair HMM algorithms (SLAM,…)

• Input is orthologous sequences.
• Aligns and predicts simultaneously, using a joint probability model
• Predicts orthologous genes in 2 sequences
• All predicted CDS is aligned
• Some aligned regions are not predicted CDS
– Labeled conserved non-coding sequence

The algorithms (SLAM,…)

sgp2
• Alignment before prediction (tblastx)
• Predicts genes in target sequence only
• Doesn’t need orthologous input sequences
– Paralogs & low-coverage shotgun can help
• Modifies scores of all potential exons, by
– At each base, adding the tblastx score of the best overlapping local alignment (roughly)
– To the geneid scores of that potential exon
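
The exon rescoring described above can be sketched roughly as follows. This is not sgp2's actual code. It assumes, for illustration only, that each HSP is a half-open interval on the target with a per-base score, that uncovered bases contribute zero, and that the alignment contribution is scaled by a hypothetical `weight` parameter.

```python
def best_hsp_score_per_base(start, end, hsps):
    """For each base in [start, end), the highest score among the HSPs
    covering that base, or 0.0 if no HSP covers it."""
    scores = []
    for pos in range(start, end):
        covering = [s for (h_start, h_end, s) in hsps if h_start <= pos < h_end]
        scores.append(max(covering, default=0.0))
    return scores


def rescored_exon(exon, hsps, weight=1.0):
    """Add the summed per-base alignment scores to the exon's own score.

    exon is (start, end, geneid_score); hsps is a list of
    (start, end, per_base_score) tuples in target coordinates.
    """
    start, end, geneid_score = exon
    return geneid_score + weight * sum(best_hsp_score_per_base(start, end, hsps))
```

For example, `rescored_exon((10, 20, 5.0), [(0, 15, 0.5), (12, 30, 1.0)])` adds 0.5 for each of the two bases covered only by the first HSP and 1.0 for each of the eight bases covered by the second.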

The algorithms

TWINSCAN
• Alignment before prediction (blastn)
• Predicts in target sequence only
• Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
– At each base, applying a feature-specific scoring model (estimated for this purpose)
– To the best overlapping local alignment, and adding the result
– To Genscan scores for that feature
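
A feature-specific conservation score added to a DNA-based score might look like the sketch below. The slides do not give Twinscan's scoring model, so everything here is assumed for illustration: the feature names, the symbol probabilities, the background distribution, and the choice of a position-independent log-odds score.

```python
import math

# Hypothetical probability of each conservation symbol within a feature.
CONS_MODEL = {
    "coding": {"|": 0.80, ":": 0.15, ".": 0.05},
    "intron": {"|": 0.35, ":": 0.25, ".": 0.40},
}

# Hypothetical background distribution used as the null model.
NULL_MODEL = {"|": 0.40, ":": 0.20, ".": 0.40}


def conservation_score(feature, cons):
    """Log-odds of the conservation symbols under the feature's model
    versus the background model, summed over positions."""
    model = CONS_MODEL[feature]
    return sum(math.log(model[c] / NULL_MODEL[c]) for c in cons)


def combined_score(dna_score, feature, cons):
    """Add the conservation log-odds to a DNA-based score for the same feature."""
    return dna_score + conservation_score(feature, cons)
```

With these toy numbers, a fully conserved stretch such as "||||" raises the score of a candidate coding exon and lowers that of a candidate intron, which is the qualitative effect the slide describes.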

% Aligned, CDS vs. other

[Figure: bar chart of fraction aligned (y-axis, 0 to 1) for Coding and Non-coding sequence; series: Genscan, sgp2, Twinscan, Sanger.]

Syntenic Gene Prediction (sgp2)

[Figure: tracks along the query sequence: tblastx HSPs, geneid exons, HSP projections, SGP exons.]

Why work on gene finding?

Genes are
• Components responsible for biological function
• Variations cause human disease / susceptibility
• Controls for modifying biological function
– Human gene therapy
– Agriculture
– Nanotechnology, etc.
