gene feature recognition

40
NUS-KI Course on Bioinformatics, Nov 2005 ten notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician Gene Feature Recognition Limsoon Wong

Upload: khuong

Post on 07-Feb-2016

24 views

Category:

Documents


0 download

DESCRIPTION

For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician. Gene Feature Recognition. Limsoon Wong. Recognition of Splice Sites. A simple example to start the day . Splice Sites. Donor. Acceptor. Image credit: Xu. Acceptor Site (Human Genome). - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005

For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician

Gene Feature Recognition

Limsoon Wong

Page 2: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005

Recognition of Splice Sites

A simple example to start the day

Page 3: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Splice Sites

AcceptorDonor

Page 4: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Acceptor Site (Human Genome)

• If we align all known acceptor sites (with their splice junction site aligned), we have the following nucleotide distribution

• Acceptor site: CAG | TAG | coding regionImage credit: Xu

Page 5: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Donor Site (Human Genome)

• If we align all known donor sites (with their splice junction site aligned), we have the following nucleotide distribution

• Donor site: coding region | GTImage credit: Xu

Page 6: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

What Positions Have “High” Information Content?

• For a weight matrix, information content of each column is calculated as

– X{A,C,G,T} Prob(X)*log (Prob(X)/0.25)

• When a column has evenly distributed nucleotides, its information content is lowest

• Only need to look at positions having high information content

Page 7: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Information Content Around Donor Sites in Human Genome

• Information content column –3 = – .34*log (.34/.25) – .363*log

(.363/.25) – .183* log (.183/.25) – .114* log (.114/.25) = 0.04

column –1 = – .092*log (.92/.25) – .03*log (.033/.25) – .803* log (.803/.25) – .073* log (.73/.25) = 0.30

Image credit: Xu

Page 8: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Weight Matrix Model for Splice Sites

• Weight matrix model – build a weight matrix for donor, acceptor,

translation start site, respectively– use positions of high information content

Image credit: Xu

Page 9: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Splice Site Prediction: A Procedure

• Add up freq of corr letter in corr positions:

• Make prediction on splice site based on some threshold

AAGGTAAGT: .34 + .60 + .80 +1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24

TGTGTCTCA: .11 + .12 + .03 +1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56

Image credit: Xu

Page 10: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005

Recognition of Translation Initiation Sites

An introduction to the World’s simplest TIS recognition system

A simple approach to accuracy and understandability

Page 11: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Translation Initiation Site

Page 12: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

A Sample cDNA

• What makes the second ATG the TIS?

299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

Page 13: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Approach

• Training data gathering• Signal generation

– k-grams, distance, domain know-how, ...• Signal selection

– Entropy, 2, CFS, t-test, domain know-how...• Signal integration

– SVM, ANN, PCL, CART, C4.5, kNN, ...

Page 14: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Training & Testing Data

• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]

• 3312 sequences• 13503 ATG sites• 3312 (24.5%) are TIS• 10191 (75.5%) are non-TIS• Use for 3-fold x-validation expts

Page 15: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Generation

• K-grams (ie., k consecutive letters)– K = 1, 2, 3, 4, 5, …– Window size vs. fixed position– Up-stream, downstream vs. any where in window– In-frame vs. any frame

0

0.5

1

1.5

2

2.5

3

A C G T

seq1

seq2

seq3

Page 16: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Generation: An Example

• Window = 100 bases• In-frame, downstream

– GCT = 1, TTT = 1, ATG = 1…• Any-frame, downstream

– GCT = 3, TTT = 2, ATG = 2…• In-frame, upstream

– GCT = 2, TTT = 0, ATG = 0, ...

299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

Page 17: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

An Example File Resulting From Feature

Generation

Page 18: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Too Many Signals

• For each value of k, there are 4k * 3 * 2 k-grams

• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

• This is too many for most machine learning algorithms

Page 19: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Selection (Basic Idea)

• Choose a signal w/ low intra-class distance• Choose a signal w/ high inter-class distance

Page 20: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Selection (eg., t-statistics)

Page 21: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Selection (eg., 2)

Page 22: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Selection (eg., CFS)

• Instead of scoring individual signals, how about scoring a group of signals as a whole?

• CFS– Correlation-based Feature Selection– A good group contains signals that are highly

correlated with the class, and yet uncorrelated with each other

Page 23: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Sample k-grams Selected by CFS

• Position –3• in-frame upstream ATG• in-frame downstream

– TAA, TAG, TGA, – CTG, GAC, GAG, and GCC

Kozak consensusLeaky scanning

Stop codon

Codon bias?

Page 24: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Signal Integration

• kNN– Given a test sample, find the k training samples

that are most similar to it. Let the majority class win

• SVM– Given a group of training samples from two

classes, determine a separating plane that maximises the margin of error

• Naïve Bayes, ANN, C4.5, ...

Page 25: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Neighborhood

5 of class

3 of class

=

Illustration of kNN (k=8)

Image credit: ZakiTypical “distance” measure =

Page 26: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Using WEKA for TIS Prediction

Page 27: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Results (3-fold x-validation)

TP/(TP + FN) TN/(TN + FP) TP/(TP + FP) Accuracy

Naïve Bayes 84.3% 86.1% 66.3% 85.7%

SVM 73.9% 93.2% 77.9% 88.5%

Neural Network 77.6% 93.2% 78.8% 89.4%

Decision Tree 74.0% 94.4% 81.1% 89.4%

3-NN* 73.2% 92.9% 77.2% 88.0%

* Using top 20 2-selected features from amino-acid features

Page 28: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Validation Results (on Chr X and Chr 21)

• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s

ATGpr

Ourmethod

Page 29: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Technique Comparisons

• Pedersen&Nielsen [ISMB’97]

– 85% accuracy– Neural network– No explicit features

• Zien [Bioinformatics’00]

– 88% accuracy– SVM+kernel engineering– No explicit features

• Hatzigeorgiou [Bioinformatics’02]

– 94% accuracy (with scanning rule)

– Multiple neural networks– No explicit features

• Our approach– 89% accuracy (94% with

scanning rule)– Explicit feature

generation– Explicit feature selection– Use any machine

learning method w/o any form of complicated tuning

Page 30: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005

Recognition of Transcription Start Sites

An introduction to the World’s best TSS recognition system

A heavy tuning approach

Page 31: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Transcription Start Site

Page 32: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Structure of Dragon Promoter Finder

-200 to +50window size

Model selected based on desired sensitivity

Page 33: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Each model has two submodels based on GC content

GC-rich submodel

GC-poor submodel

(C+G) =#C + #GWindow Size

Page 34: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Data Analysis Within Submodel

K-gram (k = 5) positional weight matrix

p

e

i

Page 35: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Promoter, Exon, Intron Sensors

• These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers)

• They are calculated as s below using promoter, exon, intron data respectively Pentamer at ith

position in input

jth pentamer atith position in training window

Frequency of jthpentamer at ith positionin training window

Window size

Page 36: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Data Preprocessing & ANNTuning parameters

tanh(x) =ex e-x

ex e-x

sIE

sI

sEtanh(net)

Simple feedforward ANN trained by the Bayesian regularisation method

wi

net = si * wi

Tunedthreshold

Page 37: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

Accuracy Comparisons

without C+G submodels

with C+G submodels

Page 38: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005

Notes

Page 39: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

References (TIS Recognition)

• A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226--233, 1997

• H.Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139--168, 2003

• A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799--807, 2000

• A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343--350, 2002

Page 40: Gene Feature Recognition

NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong

References (TSS Recognition)

• V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323--332, 2003

• J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861--878, 1997

• A. G. Pedersen et al., “The biology of eukaryotic promoter prediction---a review”, Computer & Chemistry 23:191--207, 1999

• M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599--606, 2000