![Page 1: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/1.jpg)
Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov
![Page 2: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/2.jpg)
The Central Dogma
DNA → (transcription) → RNA → (translation) → Protein
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
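The example on this slide can be reproduced in a few lines. The sketch below is illustrative only: the function names `transcribe` and `translate` are made up for this example, and the codon table is the standard genetic code written in the usual TCAG order.

```python
# Standard genetic code, laid out in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Return the mRNA that reads the same as the coding DNA strand (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```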
![Page 3: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/3.jpg)
Gene structure
exon1 | intron1 | exon2 | intron2 | exon3
transcription → splicing → translation
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid
![Page 4: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/4.jpg)
Locating Genes
• We have a genome sequence, maybe with related genomes aligned to it…where are the genes?
• Yeast genome is about 70% protein coding
  • About 6000 genes
• Human genome is about 1.5% protein coding
  • About 22,000 genes
![Page 5: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/5.jpg)
Finding Genes in Yeast
[Diagram: 5’ → 3’ sequence with regions Intergenic, Coding, Intergenic, and the transcript marked]
Start codon: ATG
Stop codon: TAG/TGA/TAA
Mean coding length about 1500 bp (500 codons)
![Page 6: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/6.jpg)
Finding Genes in Yeast
• ORF scanning
  • Look for long open reading frames (ORFs)
  • ORFs start with ATG and contain no in-frame stop codons
  • Long ORFs are unlikely to occur by chance (i.e., they are probably genes)
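A minimal sketch of ORF scanning as described above. It scans only the forward strand (a real scanner would also scan the reverse complement), and the names `find_orfs`, `prob_no_stop`, and `min_codons` are made up for this example. The chance calculation assumes uniformly random bases, under which 61 of the 64 codons are not stops.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Yield (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG; end is one past the stop codon.
    The codon count includes ATG and excludes the stop codon.
    """
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None


def prob_no_stop(n_codons):
    """Chance that n random codons contain no stop (uniform bases assumed)."""
    return (61 / 64) ** n_codons


seq = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(list(find_orfs(seq, min_codons=3)))  # [(2, 23, 2)]
print(prob_no_stop(500))                   # vanishingly small for a 500-codon ORF
```

Under that uniform-base assumption, a stop-free run of 500 codons is very unlikely to arise by chance, which is why long ORFs are good gene candidates in yeast.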
![Page 7: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/7.jpg)
Finding Genes in Yeast
Yeast ORF distribution
![Page 8: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/8.jpg)
Introns: The Bane of ORF Scanning
[Diagram: 5’ → 3’ sequence with regions Intergenic, Exon, Intron, Exon, Intron, Exon, Intergenic, with the splice sites and the transcript marked]
Start codon: ATG
Stop codon: TAG/TGA/TAA
![Page 9: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/9.jpg)
Introns: The Bane of ORF Scanning
• Drosophila:
  • 3.4 introns per gene on average
  • mean intron length 475 bp, mean exon length 397 bp
• Human:
  • 8.8 introns per gene on average
  • mean intron length 4400 bp, mean exon length 165 bp
• ORF scanning is defeated: exons are too short, and too widely separated by introns, to stand out as long ORFs
![Page 10: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/10.jpg)
Where are the genes?
![Page 11: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/11.jpg)
![Page 12: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/12.jpg)
Needles in a Haystack
![Page 13: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/13.jpg)
Now What?
• We need to use more information to help recognize genes
  • Regular structure
  • Exon/intron lengths
  • Nucleotide composition
  • Biological signals
    • Start codon, stop codon, splice sites
  • Patterns of conservation
![Page 14: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/14.jpg)
Regular Gene Structure
• Protein coding region starts with ATG, ends with TAA/TAG/TGA
• Exons alternate with introns
• Introns start with GT/GC, end with AG
• Each exon has a reading frame determined by the codon position at the end of the previous exon
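The frame rule in the last bullet can be sketched as follows. The convention used here is an assumption (gene finders number frames in different ways): the frame of an exon is the codon position, 0, 1, or 2, of its first base, and the function name `exon_start_frames` is made up for this example.

```python
def exon_start_frames(exon_lengths):
    """Return, for each coding exon, the codon position (0, 1, 2) of its first base."""
    frames = []
    consumed = 0  # coding bases seen in earlier exons
    for length in exon_lengths:
        frames.append(consumed % 3)
        consumed += length
    return frames


# Three coding exons whose lengths sum to a multiple of 3 (213 bases = 71 codons).
print(exon_start_frames([100, 50, 63]))  # [0, 1, 0]
```

The first exon ends one base into a codon, so the second exon starts at codon position 1. Introns do not shift the frame because they are removed before translation.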
![Page 15: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/15.jpg)
Next Exon: Frame 0
Next Exon: Frame 1
![Page 16: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/16.jpg)
Exon/Intron Lengths
![Page 17: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/17.jpg)
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code
| Amino Acid | SLC | DNA Codons |
|---|---|---|
| Isoleucine | I | ATT, ATC, ATA |
| Leucine | L | CTT, CTC, CTA, CTG, TTA, TTG |
| Valine | V | GTT, GTC, GTA, GTG |
| Phenylalanine | F | TTT, TTC |
| Methionine | M | ATG |
| Cysteine | C | TGT, TGC |
| Alanine | A | GCT, GCC, GCA, GCG |
| Glycine | G | GGT, GGC, GGA, GGG |
| Proline | P | CCT, CCC, CCA, CCG |
| Threonine | T | ACT, ACC, ACA, ACG |
| Serine | S | TCT, TCC, TCA, TCG, AGT, AGC |
| Tyrosine | Y | TAT, TAC |
| Tryptophan | W | TGG |
| Glutamine | Q | CAA, CAG |
| Asparagine | N | AAT, AAC |
| Histidine | H | CAT, CAC |
| Glutamic acid | E | GAA, GAG |
| Aspartic acid | D | GAT, GAC |
| Lysine | K | AAA, AAG |
| Arginine | R | CGT, CGC, CGA, CGG, AGA, AGG |
![Page 18: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/18.jpg)
Biological Signals
• How does the cell recognize start/stop codons and splice sites? In part, from characteristic base composition
• Donor site (start of intron) is recognized by a section of U1 snRNA
U1 snRNA: GUCCAUUCA
Donor site consensus: MAGGTRAGT
M means “A or C”, R means “A or G”
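A small sketch of how the donor consensus can be used as a pattern. The helper names are made up for this example, and only the two ambiguity codes defined on the slide (M and R) are handled. Real donor sites often deviate from the consensus, so exact matching like this is only a starting point.

```python
import re

# Ambiguity codes used on the slide: M = A or C, R = A or G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]"}
DONOR_PATTERN = "".join(IUPAC[ch] for ch in "MAGGTRAGT")


def find_donor_candidates(dna):
    """Return the index of the intron's first base (the G of GT) for each exact match."""
    # The lookahead allows overlapping matches to be reported.
    return [m.start() + 3 for m in re.finditer(f"(?=({DONOR_PATTERN}))", dna.upper())]


def rna_complement(dna):
    """Base-by-base RNA complement, written in the same left-to-right order."""
    return dna.upper().translate(str.maketrans("ACGT", "UGCA"))


print(find_donor_candidates("TTCAGGTAAGTCC"))  # [5]
print(rna_complement("CAGGTAAGT"))             # GUCCAUUCA
```

The second call shows why the two sequences on the slide line up: one instance of the consensus, CAGGTAAGT, is the base-by-base complement of the U1 snRNA segment shown.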
![Page 19: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/19.jpg)
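[Figure: example gene sequence with signal motifs highlighted: atg (start codon), tga (stop codon), ggtgag (three occurrences), caggtg, cagatg, cagttg, caggccggtgag]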
![Page 20: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/20.jpg)
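Splice Sites
Donor site (5’ → 3’): base frequency (%) by position

| Base | -8 | … | -2 | -1 | 0 | 1 | 2 | … | 17 |
|---|---|---|---|---|---|---|---|---|---|
| A | 26 | … | 60 | 9 | 0 | 0 | 54 | … | 21 |
| C | 26 | … | 15 | 5 | 0 | 1 | 2 | … | 27 |
| G | 25 | … | 12 | 78 | 100 | 0 | 41 | … | 27 |
| T | 23 | … | 13 | 8 | 0 | 99 | 3 | … | 25 |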
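The donor-site frequencies above can be turned into a simple weight matrix score. This sketch uses only the five positions whose values appear in the table (-2 through 2). The uniform background, the pseudocount, and the name `wmm_score` are assumptions made for illustration.

```python
import math

# Percent frequencies from the donor-site table, positions -2, -1, 0, 1, 2.
DONOR_FREQ = {
    "A": [60, 9, 0, 0, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 100, 0, 41],
    "T": [13, 8, 0, 99, 3],
}
BACKGROUND = 0.25  # assumed uniform background frequency
PSEUDO = 0.5       # assumed pseudocount (in percent) so that log(0) never occurs


def wmm_score(window):
    """Log2-odds score of a 5-base window under the weight matrix vs. background."""
    if len(window) != 5:
        raise ValueError("window must cover positions -2..2 (5 bases)")
    score = 0.0
    for pos, base in enumerate(window.upper()):
        p = (DONOR_FREQ[base][pos] + PSEUDO) / (100 + 4 * PSEUDO)
        score += math.log2(p / BACKGROUND)
    return score


print(wmm_score("AGGTA") > 0)  # True: every base is enriched at its position
print(wmm_score("CCCCC") < 0)  # True: every base is depleted at its position
```

A weight matrix treats positions as independent, which is the limitation that the WAM and MDD models are designed to relax.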
![Page 21: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/21.jpg)
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
![Page 22: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/22.jpg)
Splice Sites
• WMM: weight matrix model = PSSM (Staden 1984)
• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
• MDD: maximal dependence decomposition (Burge & Karlin 1997)
  • Decision-tree algorithm to take pairwise dependencies into account
  • For each position $i$, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
  • Choose $i^*$ such that $S_{i^*}$ is maximal and partition into two subsets, until
    • No significant dependencies left, or
    • Not enough sequences in subset
  • Train separate WMM models for each subset

MDD decision tree:
• All donor splice sites
  • not G5
  • G5
    • G5, not G-1
    • G5 G-1
      • G5 G-1, not A2
      • G5 G-1 A2
        • G5 G-1 A2, not U6
        • G5 G-1 A2 U6
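A sketch of the MDD dependence statistic, assuming the usual reading of the formula: C_i indicates whether a site matches the consensus at position i, X_j is the base at position j, and S_i sums the chi-square statistic of C_i against X_j over all other positions j. The function names and the toy input are made up for illustration. The significance test and the recursive partitioning are not implemented.

```python
from collections import Counter

BASES = "ACGT"


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected
    return stat


def mdd_scores(sites, consensus):
    """Return [S_0, S_1, ...] for aligned, equal-length sites."""
    n = len(consensus)
    scores = []
    for i in range(n):
        s_i = 0.0
        for j in range(n):
            if j == i:
                continue
            counts = Counter((site[i] == consensus[i], site[j]) for site in sites)
            table = [[counts[(match, b)] for b in BASES] for match in (True, False)]
            s_i += chi_square(table)
        scores.append(s_i)
    return scores


dependent = ["AA", "AA", "CC", "CC"]    # position 1 always copies position 0
independent = ["AA", "AC", "CA", "CC"]  # every combination occurs once
print(mdd_scores(dependent, "AA"))      # [4.0, 4.0]
print(mdd_scores(independent, "AA"))    # [0.0, 0.0]
```

In the full procedure, the position with the largest score would be used to split the sites into "matches consensus" and "does not match" subsets, and the calculation would be repeated on each subset.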
![Page 23: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/23.jpg)
Patterns of Conservation
• Functional sequences are much more conserved than nonfunctional sequences
• Signal sequences show compensatory mutations
  • If one position mutates away from consensus, often a different one will mutate to consensus
• Coding sequence shows three-periodic pattern of conservation
![Page 24: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/24.jpg)
Three Periodicity
• Most amino acids can be coded for by more than one DNA triplet
• Usually, the degeneracy is in the last position
Human   CCTGTT (Proline, Valine)
Mouse   CCAGTC (Proline, Valine)
Rat     CCAGTC (Proline, Valine)
Dog     CCGGTA (Proline, Valine)
Chicken CCCGTG (Proline, Valine)
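The three-periodic pattern in this alignment can be checked directly. The sketch below marks each alignment column as conserved or not; the name `conserved_columns` is made up for this example.

```python
ALIGNED_CODONS = {
    "Human": "CCTGTT",
    "Mouse": "CCAGTC",
    "Rat": "CCAGTC",
    "Dog": "CCGGTA",
    "Chicken": "CCCGTG",
}


def conserved_columns(alignment):
    """Return one boolean per column: True if every species has the same base."""
    return [len(set(column)) == 1 for column in zip(*alignment.values())]


print(conserved_columns(ALIGNED_CODONS))
# [True, True, False, True, True, False]
```

Every third column varies while the other two are fixed, which is the signature a conservation-aware gene finder can look for in coding sequence.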
![Page 25: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/25.jpg)
Hidden Markov Models for Gene Finding
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Labels along the sequence: Intergenic, Exon, Intron, Exon, Intron, Exon, Intergenic
States: Intergene State, First Exon State, Intron State
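A toy illustration of how an HMM labels a sequence, using the Viterbi algorithm over three states. Every probability below is a made-up value chosen for illustration, not a trained parameter, and real gene finders use many more states (separate exon frames, splice signals, and so on).

```python
import math

STATES = ["intergenic", "exon", "intron"]
NEG_INF = float("-inf")


def log(p):
    return math.log(p) if p > 0 else NEG_INF


# All probabilities are illustrative assumptions, not trained values.
START = {"intergenic": 1.0, "exon": 0.0, "intron": 0.0}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon":       {"intergenic": 0.1, "exon": 0.8, "intron": 0.1},
    "intron":     {"intergenic": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon":       {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron":     {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    scores = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + log(TRANS[p][s]))
            new_scores[s] = scores[best_prev] + log(TRANS[best_prev][s]) + log(EMIT[s][base])
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("AAAA"))                                  # all 'intergenic'
print("exon" in viterbi("A" * 5 + "G" * 20 + "A" * 5))  # True
```

With these toy parameters, a long G-rich stretch is better explained by the exon state than by staying intergenic, so the decoded path switches state there.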
![Page 26: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/26.jpg)
![Page 27: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/27.jpg)
GENSCAN
![Page 28: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/28.jpg)
GENSCAN
• Burge and Karlin, Stanford, 1997
• Before the Human Genome Project was completed
  • No alignments available
  • Estimated human gene count was 100,000
• Explicit state duration HMM (with tricks)
  • Intergenic and intronic regions have geometric length distribution
  • Exons are only possible when correct flanking sequences are present
![Page 29: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/29.jpg)
GENSCAN
• Output probabilities for NC and CDS depend on previous 5 bases (5th-order)
  • $P(X_i \mid X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})$
• Each CDS frame has its own model
• WAM models for start/stop codons and acceptor sites
• MDD model for donor sites
• Separate parameters for regions of different GC content
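A sketch of a 5th-order Markov output model of the kind described in the first bullet: the probability of each base is conditioned on the five bases before it. The training data, the pseudocount, and the function names are assumptions for illustration. GENSCAN's actual models are frame-specific and GC-dependent, which this sketch does not attempt.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=5, pseudocount=1.0):
    """Estimate P(X_i | previous `order` bases) and return it as a function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=5):
    """Sum of log P(X_i | context) over all positions that have a full context."""
    return sum(math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq)))


prob = train_markov(["ACGTACGTACGTACGT"], order=5)
print(prob("ACGTA", "C") > prob("ACGTA", "G"))  # True: C always followed ACGTA in training
print(log_likelihood("ACGTACGT", prob))
```

A gene finder would train one such model on coding sequence and another on noncoding sequence, then compare the two log-likelihoods for a candidate region.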
![Page 30: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/30.jpg)
GENSCAN Performance
• First program to do well on realistic sequences
  • Long, multiple genes in both orientations
• Pretty good sensitivity, poor specificity
  • 70% exon Sn, 40% exon Sp
• Not enough exons per gene
• Was the best gene predictor for about 4 years
![Page 31: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/31.jpg)
TWINSCAN
• Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
• Uses an informant sequence to help predict genes
  • For human, informant is normally mouse
• Informant sequence consists of three characters
  • Match: |
  • Mismatch: :
  • Unaligned: .
• Informant sequence assumed independent of target sequence
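A sketch of how a three-character conservation sequence could be derived from a pairwise alignment. The input conventions are assumptions made for illustration: the informant row uses `.` where nothing aligns and `_` for a gap, and gaps are counted as mismatches. The function name `conservation_sequence` is made up.

```python
def conservation_sequence(target, informant):
    """Encode an aligned informant as '|' (match), ':' (mismatch), '.' (unaligned)."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target.upper(), informant.upper()):
        if q == ".":
            symbols.append(".")  # no alignment at this position
        elif q == t:
            symbols.append("|")  # identical base
        else:
            symbols.append(":")  # mismatch; gaps ('_') are treated as mismatches here
    return "".join(symbols)


print(conservation_sequence("GGTGAGGTGA", "GGTCAG_TGA"))  # |||:||:|||
print(conservation_sequence("ACGT", "AC.."))              # ||..
```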
![Page 32: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/32.jpg)
The TWINSCAN Model
• Just like GENSCAN, except that it adds models for the conservation sequence
• 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites
  • One CDS model for all frames
• Many informants tried, but mouse seems to be at the “sweet spot”
![Page 33: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/33.jpg)
TWINSCAN Performance
• Slightly more sensitive than GENSCAN, much more specific
  • Exon sensitivity/specificity about 75%
• Much better at the gene level
  • Most genes are mostly right, about 25% exactly right
• Was the best gene predictor for about 4 years
![Page 34: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/34.jpg)
N-SCAN
• Gross and Brent, Washington University in St. Louis, 2005
• If one informant sequence is good, let’s try more!
• Also several other improvements on TWINSCAN
![Page 35: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/35.jpg)
N-SCAN Improvements
• Multiple informants
• Richer models of sequence evolution
• Frame-specific CDS conservation model
• Conserved noncoding sequence model
• 5’ UTR structure model
![Page 36: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/36.jpg)
HMM Outputs
• GENSCAN
  Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
• TWINSCAN
  Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Conservation |||:||:||:|||||:||||||||......
  sequence
• N-SCAN
  Target       GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Informant1   GGTCAGC___CCAAGAACGTGTAG......
  Informant2   GATCAGC___CCAAGAACGTGTAG......
  Informant3   GGTGAGCTGACCAAGATCGTGTTGACACAA
  ...
![Page 37: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/37.jpg)
Phylogenetic Bayesian Network Models
$$P(H, C, M, R, A_1, A_2, A_3) = P(A_1)\,P(C \mid A_1)\,P(A_2 \mid A_1)\,P(H \mid A_2)\,P(A_3 \mid A_2)\,P(M \mid A_3)\,P(R \mid A_3)$$
![Page 38: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/38.jpg)
Homology-Based Gene Prediction
• Idea: Try to predict a gene in one organism using a known orthologous gene or protein from another organism
• Genewise
  • Protein homology
• Projector
  • Gene structure homology
• Very accurate if (and only if??) homology is high
![Page 39: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/39.jpg)
Evaluating Performance
• Three main levels of performance: gene, exon, nucleotide
• Two measures of performance:
  • Sensitivity: what fraction of the true features did we predict correctly?
  • Specificity: what fraction of our predicted features were correct?
• Testing standard is whole-genome prediction
  • Predicting on single-gene sequences is easier and less interesting
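The two measures can be computed at the exon level as follows. This is a minimal sketch with made-up coordinates: an exon is an assumed `(start, end)` tuple, and a prediction counts as correct only if both boundaries match exactly.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact-match exon prediction."""
    true_set, pred_set = set(true_exons), set(predicted_exons)
    correct = len(true_set & pred_set)
    sensitivity = correct / len(true_set) if true_set else 0.0
    specificity = correct / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity


truth = [(100, 200), (300, 450), (600, 700), (900, 1000)]
predicted = [(100, 200), (300, 440), (600, 700)]  # second exon has a wrong boundary
sn, sp = exon_accuracy(truth, predicted)
print(sn)  # 0.5   (2 of 4 true exons found)
print(sp)  # 2 of 3 predicted exons are correct
```

Specificity as defined on this slide is the quantity that other fields usually call precision.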
![Page 40: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/40.jpg)
Exact Exon Accuracy
[Figure: bar chart of exon sensitivity (Exon Sn) and exon specificity (Exon Sp), axis from 0.2 to 0.9, for GENSCAN, EXONIPHY, SGP2, TWINSCAN 2.0, and N-SCAN]
![Page 41: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/41.jpg)
Exact Gene Accuracy
[Figure: bar chart of gene sensitivity (Gene Sn) and gene specificity (Gene Sp), axis from 0 to 0.5, for GENSCAN, SGP2, TWINSCAN 2.0, and N-SCAN]
![Page 42: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/42.jpg)
Intron Sensitivity By Length
[Figure: intron sensitivity (0 to 1) versus intron length in Kb, in bins from 0-10 to 90-100, for N-SCAN, SGP2, GENSCAN, and TWINSCAN]
![Page 43: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/43.jpg)
Human Informant Effectiveness
[Figure: bar chart of Gene Sn, Gene Sp, Exon Sn, and Exon Sp, axis from 0 to 0.9, for informants Chicken, Rat, Mouse, and All]
![Page 44: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/44.jpg)
Drosophila Informant Effectiveness
[Figure: bar chart of Gene Sn, Gene Sp, Exon Sn, and Exon Sp, axis from 0.3 to 0.9, for informants A. gambiae, D. yakuba, D. pseudoobscura, and All]
![Page 45: Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov](https://reader035.vdocuments.us/reader035/viewer/2022070412/56649d565503460f94a33c88/html5/thumbnails/45.jpg)
The Future
• Many new genomes are being sequenced, and they will need annotations!
  • Current experimental “shotgun” methods are not enough
  • However, cheap targeted experiments are available to verify predicted genes
• Promising directions in gene prediction:
  • Conditional random fields
  • Multiple informants: can we actually get them to work???