NUS-KI Course on Bioinformatics, Nov 2005
For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician
Gene Feature Recognition
Limsoon Wong
Recognition of Splice Sites
A simple example to start the day
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
Splice Sites
[Figure: donor and acceptor splice sites. Image credit: Xu]
Acceptor Site (Human Genome)
• If we align all known acceptor sites (with their splice junctions aligned), we obtain the following nucleotide distribution
[Figure: nucleotide distribution around acceptor sites. Image credit: Xu]
• Acceptor site: CAG | TAG | coding region
Donor Site (Human Genome)
• If we align all known donor sites (with their splice junctions aligned), we obtain the following nucleotide distribution
[Figure: nucleotide distribution around donor sites. Image credit: Xu]
• Donor site: coding region | GT
What Positions Have “High” Information Content?
• For a weight matrix, the information content of each column is calculated as
– $\sum_{X \in \{A,C,G,T\}} \mathrm{Prob}(X) \cdot \log_{10}(\mathrm{Prob}(X)/0.25)$
• When a column has evenly distributed nucleotides, its information content is lowest (zero, since every Prob(X)/0.25 = 1)
• Only positions with high information content need to be examined
Information Content Around Donor Sites in Human Genome
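• Information content of column –3 = .34*log(.34/.25) + .363*log(.363/.25) + .183*log(.183/.25) + .114*log(.114/.25) = 0.04
• Information content of column –1 = .092*log(.092/.25) + .033*log(.033/.25) + .803*log(.803/.25) + .073*log(.073/.25) = 0.30
• The logarithms are base 10; the terms are summed, so nucleotides rarer than 0.25 contribute negatively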
Image credit: Xu
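The two worked values can be checked with a few lines of Python. This is a minimal sketch: the function name `information_content` is illustrative, the four probabilities per column are taken from the slide in the order given there, and a base-10 logarithm is assumed because it reproduces 0.04 and 0.30.

```python
import math

def information_content(probs, background=0.25):
    """Sum of p * log10(p / background) over the nucleotides of one column."""
    # Terms with p == 0 are skipped (their limit is 0).
    return sum(p * math.log10(p / background) for p in probs if p > 0)

col_minus_3 = [0.34, 0.363, 0.183, 0.114]   # column -3 from the slide
col_minus_1 = [0.092, 0.033, 0.803, 0.073]  # column -1 from the slide
uniform = [0.25, 0.25, 0.25, 0.25]          # evenly distributed column

print(round(information_content(col_minus_3), 2))  # 0.04
print(round(information_content(col_minus_1), 2))  # 0.3
print(information_content(uniform))                # 0.0
```

The uniform column scores exactly zero, which is why such positions carry no signal for site recognition.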
Weight Matrix Model for Splice Sites
• Weight matrix model
– Build a weight matrix for donor, acceptor, and translation start sites, respectively
– Use positions of high information content
Image credit: Xu
Splice Site Prediction: A Procedure
• Add up the frequencies of the corresponding letters at the corresponding positions:
– AAGGTAAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24
– TGTGTCTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56
• Predict whether the sequence is a splice site by comparing this score with some threshold
Image credit: Xu
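The scoring procedure can be sketched as follows. The matrix below is only partial: it holds just the frequencies quoted on the slide for the two example sequences, so any other letter raises a `KeyError`. The names `DONOR_WEIGHTS`, `score`, `is_donor` and the threshold value 4.0 are assumptions made for illustration; the lecture does not state a threshold.

```python
# Partial donor-site weight matrix: one dict per position, containing only
# the entries that appear in the two worked examples on the slide.
DONOR_WEIGHTS = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

def score(site, weights=DONOR_WEIGHTS):
    """Add up the frequency of each letter at its position."""
    if len(site) != len(weights):
        raise ValueError("site length must match the weight matrix")
    return sum(column[base] for column, base in zip(weights, site))

def is_donor(site, threshold=4.0):
    """Hypothetical decision rule: the threshold 4.0 is illustrative only."""
    return score(site) >= threshold

print(round(score("AAGGTAAGT"), 2), is_donor("AAGGTAAGT"))  # 6.24 True
print(round(score("TGTGTCTCA"), 2), is_donor("TGTGTCTCA"))  # 2.56 False
```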
Recognition of Translation Initiation Sites
An introduction to the World’s simplest TIS recognition system
A simple approach to accuracy and understandability
Translation Initiation Site
A Sample cDNA
• What makes the second ATG the TIS?
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Approach
• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments
Signal Generation
• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame
[Figure: bar chart of values for A, C, G, T in seq1, seq2, seq3]
Signal Generation: An Example
• Window = 100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1, …
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2, …
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
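The counts on this slide can be reproduced with a short sketch applied to the sample cDNA, taking the second ATG as the TIS. The function `kgram_counts` and its parameters are illustrative names, not part of the lecture. It assumes that the downstream window starts right after the ATG, that the upstream window ends right before it, and that "in-frame" means k-grams starting at a multiple of 3 bases from the TIS.

```python
from collections import Counter

# Sample cDNA from the slide, as shown there.
SEQ = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)

def kgram_counts(seq, tis, k=3, window=100, side="downstream", in_frame=True):
    """Count k-grams in a window on one side of the ATG at 0-based index tis."""
    if side == "downstream":
        lo, hi = tis + 3, min(len(seq), tis + 3 + window)
    else:  # upstream
        lo, hi = max(0, tis - window), tis
    counts = Counter()
    for start in range(lo, hi - k + 1):
        # In-frame k-grams start a multiple of 3 bases away from the TIS.
        if in_frame and (start - tis) % 3 != 0:
            continue
        counts[seq[start:start + k]] += 1
    return counts

tis = SEQ.find("ATG", SEQ.find("ATG") + 1)  # index of the second ATG

for side, in_frame in [("downstream", True), ("downstream", False), ("upstream", True)]:
    c = kgram_counts(SEQ, tis, side=side, in_frame=in_frame)
    print(side, "in-frame" if in_frame else "any-frame", c["GCT"], c["TTT"], c["ATG"])
```

Each (k-gram, side, frame) combination becomes one numeric feature of the candidate ATG.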
An Example File Resulting From Feature Generation
Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-gram features
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
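The arithmetic is easy to confirm. In this sketch the factor 3 is read as the three window choices (upstream, downstream, anywhere) and the factor 2 as the two framings (in-frame, any frame); that reading comes from the Signal Generation slide and is an assumption, as are the parameter names.

```python
def n_features(k, regions=3, framings=2):
    """Number of k-gram features: 4^k possible k-grams per region and framing."""
    return 4 ** k * regions * framings

counts = [n_features(k) for k in range(1, 6)]
print(counts)       # [24, 96, 384, 1536, 6144]
print(sum(counts))  # 8184
```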
Signal Selection (Basic Idea)
• Choose a signal with low intra-class distance
• Choose a signal with high inter-class distance
Signal Selection (e.g., t-statistics)
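The formula on this slide is not preserved in the transcript. As a hedged illustration, the sketch below uses one common form, the Welch-style t-statistic, which may differ in detail from the one shown in the lecture. It rewards a large gap between class means (inter-class distance) and penalises large within-class variance (intra-class distance). The function name and the toy values are made up.

```python
import math
from statistics import mean, variance

def t_statistic(pos, neg):
    """Welch-style t-statistic of one signal over two classes of samples."""
    se = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    gap = abs(mean(pos) - mean(neg))
    if se == 0:
        return math.inf if gap > 0 else 0.0
    return gap / se

# Toy signal values (hypothetical): a well-separated and a poorly separated signal.
good = t_statistic([3, 4, 5], [0, 1, 2])
poor = t_statistic([0, 4, 8], [1, 3, 5])
print(round(good, 3), round(poor, 3))
```

Signals are then ranked by this score and the top-ranked ones are kept.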
Signal Selection (e.g., χ²)
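The content of this slide is not preserved in the transcript. The sketch below shows the standard Pearson χ² statistic on a contingency table of signal value versus class, which is the usual way a χ² score is computed for feature selection; whether the lecture used exactly this discretisation is an assumption. The tables are toy data.

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: signal present / absent. Columns: TIS / non-TIS (hypothetical counts).
print(chi_square([[30, 10], [10, 30]]))  # 20.0
print(chi_square([[20, 20], [20, 20]]))  # 0.0
```

A signal whose presence is independent of the class scores zero; the more its distribution differs between classes, the higher the score.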
Signal Selection (e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
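The group score behind CFS is commonly written as merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the group size, r_cf the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch below evaluates that merit on toy data. It substitutes plain Pearson correlation for the correlation measure used by real CFS implementations, so it illustrates the idea rather than reproducing the tool used in the lecture.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def cfs_merit(features, labels):
    """Merit of a group of features: high class correlation, low redundancy."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

labels = [0, 0, 0, 0, 1, 1, 1, 1]
g1 = [0, 0, 0, 1, 0, 1, 1, 1]  # toy signal, correlation 0.5 with the class
g2 = [0, 0, 1, 0, 1, 0, 1, 1]  # equally informative, uncorrelated with g1

print(round(cfs_merit([g1, g2], labels), 4))  # complementary pair
print(round(cfs_merit([g1, g1], labels), 4))  # redundant pair scores lower
```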
Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias?)
Signal Integration
• kNN
– Given a test sample, find the k training samples that are most similar to it. Let the majority class win
• SVM
– Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes
• Naïve Bayes, ANN, C4.5, ...
Illustration of kNN (k=8)
[Figure: the neighborhood of a test sample contains 5 samples of one class and 3 of the other. Image credit: Zaki]
Typical “distance” measure: [formula not preserved]
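A minimal kNN sketch mirroring the k = 8 illustration follows. The distance formula on the slide is not preserved, so Euclidean distance is assumed here; the function name, the 2-D points, and the class labels are all made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label); the majority class among the k nearest wins."""
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: 5 "TIS" and 3 "non-TIS" samples close to the query,
# plus 3 distant "non-TIS" samples.
train = [
    ((0, 1), "TIS"), ((1, 0), "TIS"), ((0, -1), "TIS"), ((-1, 0), "TIS"), ((1, 1), "TIS"),
    ((1, -1), "non-TIS"), ((-1, 1), "non-TIS"), ((-1, -1), "non-TIS"),
    ((10, 10), "non-TIS"), ((11, 10), "non-TIS"), ((10, 11), "non-TIS"),
]

print(knn_predict(train, (0, 0), k=8))   # TIS (5 votes against 3)
print(knn_predict(train, (0, 0), k=11))  # non-TIS (6 votes against 5)
```

The second call shows how the prediction depends on k: enlarging the neighborhood pulls in the distant samples and flips the vote.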
Using WEKA for TIS Prediction
Results (3-fold x-validation)
Classifier      Sensitivity    Specificity    Precision      Accuracy
                TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)
Naïve Bayes     84.3%          86.1%          66.3%          85.7%
SVM             73.9%          93.2%          77.9%          88.5%
Neural Network  77.6%          93.2%          78.8%          89.4%
Decision Tree   74.0%          94.4%          81.1%          89.4%
3-NN*           73.2%          92.9%          77.2%          88.0%
* Using top 20 χ²-selected features from amino-acid features
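The four columns are linked through the class sizes of the dataset (3312 TIS and 10191 non-TIS ATGs). The sketch below, with an illustrative function name, derives precision and accuracy from the reported sensitivity and specificity of the Naïve Bayes row. It works with expected counts rather than the actual confusion matrix, which is not given in the lecture.

```python
def precision_and_accuracy(sensitivity, specificity, n_pos, n_neg):
    """Derive precision and accuracy from the two rates and the class sizes."""
    tp = sensitivity * n_pos  # expected true positives
    tn = specificity * n_neg  # expected true negatives
    fp = n_neg - tn           # expected false positives
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (n_pos + n_neg)
    return precision, accuracy

# Naive Bayes row: sensitivity 84.3%, specificity 86.1%.
precision, accuracy = precision_and_accuracy(0.843, 0.861, 3312, 10191)
print(f"{precision:.1%} {accuracy:.1%}")  # 66.3% 85.7%
```

Because only about a quarter of the ATGs are true TIS, a specificity of 86.1% still yields many false positives, which explains the comparatively low precision.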
Validation Results (on Chr X and Chr 21)
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset
[Figure: validation results comparing ATGpr and our method]
Technique Comparisons
• Pedersen & Nielsen [ISMB’97]
– 85% accuracy
– Neural network
– No explicit features
• Zien [Bioinformatics’00]
– 88% accuracy
– SVM + kernel engineering
– No explicit features
• Hatzigeorgiou [Bioinformatics’02]
– 94% accuracy (with scanning rule)
– Multiple neural networks
– No explicit features
• Our approach
– 89% accuracy (94% with scanning rule)
– Explicit feature generation
– Explicit feature selection
– Use any machine learning method without any form of complicated tuning
Recognition of Transcription Start Sites
An introduction to the World’s best TSS recognition system
A heavy tuning approach
Transcription Start Site
Structure of Dragon Promoter Finder
Window size: -200 to +50
Model selected based on desired sensitivity
Each model has two submodels based on GC content
• GC-rich submodel
• GC-poor submodel
$(C+G) = \frac{\#C + \#G}{\text{Window Size}}$
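The (C+G) ratio is straightforward to compute. In the sketch below, `gc_content` follows the formula on the slide, while `choose_submodel` and its cut-off of 0.5 are hypothetical: the lecture does not state the value that separates the GC-rich from the GC-poor submodel.

```python
def gc_content(window):
    """(C+G) = (#C + #G) / window size."""
    window = window.upper()
    return (window.count("C") + window.count("G")) / len(window)

def choose_submodel(window, cutoff=0.5):
    """Hypothetical rule: the cutoff value is an assumption, not from the lecture."""
    return "GC-rich" if gc_content(window) >= cutoff else "GC-poor"

print(gc_content("GGCCAATT"))       # 0.5
print(choose_submodel("GCGCGCAT"))  # GC-rich
print(choose_submodel("ATATATGC"))  # GC-poor
```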
Data Analysis Within Submodel
K-gram (k = 5) positional weight matrix
[Figure: data analysis within a submodel, with labels p, e, i]
Promoter, Exon, Intron Sensors
• These sensors are positional weight matrices of k-grams, k = 5 (a.k.a. pentamers)
• They are calculated as s below, using promoter, exon, and intron data respectively
[Equation for s; its annotated terms are:]
– Pentamer at ith position in input
– jth pentamer at ith position in training window
– Frequency of jth pentamer at ith position in training window
– Window size
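Data Preprocessing & ANN
• Tuning parameters
• $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• Inputs s_E, s_I, s_IE with weights w_i
• $\mathrm{net} = \sum_i s_i \cdot w_i$, output tanh(net)
• Simple feedforward ANN trained by the Bayesian regularisation method
• Tuned threshold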
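The forward pass of the output unit can be sketched in a few lines: a weighted sum of the input signals passed through tanh and compared with a threshold. The signal values, weights, and threshold below are hypothetical; the real ones come from training and tuning and are not given in the lecture.

```python
import math

def ann_output(signals, weights):
    """net = sum of s_i * w_i, squashed by tanh into (-1, 1)."""
    net = sum(s * w for s, w in zip(signals, weights))
    return math.tanh(net)

def predict_tss(signals, weights, threshold):
    """Report a TSS when the output exceeds the (tuned) threshold."""
    return ann_output(signals, weights) > threshold

signals = [0.8, 0.6, -0.2]  # hypothetical values for s_E, s_I, s_IE
weights = [1.0, 0.5, 0.5]   # hypothetical weights w_i
print(round(ann_output(signals, weights), 4))
print(predict_tss(signals, weights, threshold=0.5))

# tanh written out as on the slide agrees with math.tanh.
x = 1.0
explicit = (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
print(abs(explicit - math.tanh(x)) < 1e-12)
```

Raising the threshold trades sensitivity for fewer false positives, which is how a model can be selected for a desired sensitivity.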
Accuracy Comparisons
[Figure: accuracy comparisons without and with C+G submodels]
Notes
References (TIS Recognition)
• A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226-233, 1997
• H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139-168, 2003
• A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799-807, 2000
• A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343-350, 2002
References (TSS Recognition)
• V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323-332, 2003
• J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861-878, 1997
• A. G. Pedersen et al., “The biology of eukaryotic promoter prediction: a review”, Computers & Chemistry 23:191-207, 1999
• M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599-606, 2000