TIS prediction in human cDNAs with high accuracy
DESCRIPTION
Correct identification of the Translation Initiation Start (TIS) in cDNA is an important issue for genome annotation. The aim of this work is to improve upon current methods and provide a performance-guaranteed prediction.
TRANSCRIPT
Translation initiation start prediction in human cDNAs with high accuracy
A. G. Hatzigeorgiou
Paper Presentation
Introduction to Bioinformatics
Anaxagoras Fotopoulos | Marina Adamou - Tzani
21/01/2014
2
Introduction
• The primary objective of this research is to contribute to defining the coding part of a gene.
• The search is performed on cDNA sequences.
• Coding regions are surrounded by untranslated regions (UTRs).
• The focus is on finding the Translation Initiation Start (TIS), which defines the start of the coding region.
cDNA
Complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) template in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.
3
Previous Research
• Positional conditional probability matrix (Salzberg, 1997).
• Generalized second-order profiles, with an implementation of the ribosome scanning model (Agarwal and Bafna, 1998a).
• Ribosome scanning model (Kozak, 1996): the ribosome first attaches to a specific region at the 5’ end of the mRNA and then scans the sequence for the first ATG.
• No significant differences were observed between the above methods and a weight matrix.
• The above methods are considered together because of their high rate of false positives.
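The scanning rule described above can be illustrated with a minimal sketch. The function name `first_atg` is hypothetical, and the code only picks the first ATG from the 5’ end; it is a baseline illustration of the scanning idea, not any of the cited programs.

```python
def first_atg(mrna):
    """Return the 0-based position of the first ATG from the 5' end, or None.

    Accepts mRNA (with U) or cDNA (with T) spelling; this normalization
    is a convenience assumption of the sketch.
    """
    seq = mrna.upper().replace("U", "T")
    pos = seq.find("ATG")
    return pos if pos != -1 else None


# The first ATG is reported even if a later ATG is the true TIS,
# which is one way such a rule can produce false positives.
print(first_atg("GGCACCAUGGCGAUGA"))
```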
4
Previous Research
• Pedersen and Nielsen, 1997: ANNs are used to recognize local context and statistical properties around the TIS. A large region is analyzed: 100 bases before and 100 bases after the start codon.
• Salamov et al., 1998: Six characteristics are applied to the analysis of the region around the TIS, including a weight matrix and hexanucleotide difference.
• Zien et al., 2000: Support Vector Machines (SVMs) are used for TIS prediction.
All of the above methods give up to 85% correct predictions.
5
Methods – Suggested Model
[Diagram: 475 cDNAs from Swissprot (verified + checked) are split into a training gene pool (training set + evaluation set, used for parameter estimation) and a test gene pool (test set, used for TIS prediction). A consensus NN scores the conserved motif and a coding NN scores the coding/non-coding potential. The two scores are combined by score multiplication.]
6
Consensus Neural Network
12-nucleotide-long window
325 positive + 325 negative examples
Binarization of the input
Selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm
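The binarization step can be sketched as a one-hot encoding of the 12-nucleotide window, giving 4 binary inputs per base. This is a minimal illustration under assumptions: the slide does not state how the window is placed relative to the ATG, so the `upstream=7` default and the function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def binarize(window):
    """One-hot encode a nucleotide window: 4 binary inputs per base.

    Bases outside A/C/G/T (for example N) become four zeros.
    """
    bits = []
    for base in window.upper():
        bits.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return bits


def consensus_window(seq, atg_pos, upstream=7, length=12):
    """Cut a fixed-length window around a candidate ATG.

    `upstream` (bases taken before the A of ATG) is an assumed placement.
    Returns None if the window would run off either end of the sequence.
    """
    start = atg_pos - upstream
    if start < 0 or start + length > len(seq):
        return None
    return seq[start:start + length]


window = consensus_window("GCCGCCACCATGGCGAAA", atg_pos=9)
inputs = binarize(window)  # 12 bases x 4 bits = 48 network inputs
```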
7
Coding Neural Network
12-nucleotide-long window
54-nucleotide-long window
The Smith–Waterman algorithm is used to eliminate homologies between training and test data
282 genes with less than 70% homology were used for training
700 positive + 700 negative sequence regions extracted for training
250 positive + 250 negative sequence regions extracted for testing
A codon usage statistic is applied (for every window, all non-overlapping codons are counted)
The sequence window is rescaled to 64 units
Every unit gives the normalized frequency of the codon in the window
The resilient back-propagation algorithm is applied to a feed-forward NN.
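The rescaling of a 54-nucleotide window to 64 units can be sketched as follows. This is an illustration only: the slide says "normalized frequency" without giving the normalization, so dividing by the number of codons in the window (18 for 54 nucleotides) is an assumption, as are the names `CODONS` and `codon_frequencies`.

```python
from itertools import product

# All 64 codons in a fixed order; each one corresponds to one input unit.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(window):
    """Count non-overlapping codons in the window and normalize to frequencies.

    Assumed normalization: count / number of codons in the window.
    Codons containing bases outside A/C/G/T are ignored in the counts.
    """
    usable = len(window) - len(window) % 3
    codons = [window[i:i + 3].upper() for i in range(0, usable, 3)]
    counts = dict.fromkeys(CODONS, 0)
    for codon in codons:
        if codon in counts:
            counts[codon] += 1
    total = len(codons)
    return [counts[c] / total if total else 0.0 for c in CODONS]


# A 54-nucleotide window holds 18 non-overlapping codons.
vector = codon_frequencies("ATG" * 9 + "GCC" * 9)
```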
8
Integrated method: Analysis of full-length mRNA sequences
1st stage
• Calculation of coding score for every nucleotide of the mRNA sequence
2nd stage
• Calculation of coding evidence of the coding region included in the longest ORF of the sequence
3rd stage
• For every in-frame ATG a consensus score is calculated
4th stage
• For the same in-frame ATG, a coding difference score is calculated
The final score is obtained by combining the output of the consensus ANN and the coding difference.
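The four stages can be sketched end to end under stated assumptions. The slides do not define the coding difference, so it is assumed here to be the mean coding score downstream of the ATG minus the mean upstream, over a hypothetical `flank`. The per-nucleotide `coding_scores` and the `consensus_score` callable stand in for the two trained networks, and the product is used as the combination. All function names are hypothetical.

```python
from statistics import mean

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF in the three forward frames.

    Returns (0, 0) if no complete ORF is found (a simplification of this sketch).
    """
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


def in_frame_atgs(seq, start, end):
    """Positions of all ATGs in the reading frame of the ORF."""
    return [i for i in range(start, end - 2, 3) if seq[i:i + 3] == "ATG"]


def coding_difference(coding_scores, pos, flank=54):
    """Assumed definition: mean downstream coding score minus mean upstream."""
    upstream = coding_scores[max(0, pos - flank):pos]
    downstream = coding_scores[pos:pos + flank]
    up = mean(upstream) if upstream else 0.0
    down = mean(downstream) if downstream else 0.0
    return down - up


def rank_candidates(seq, coding_scores, consensus_score, flank=54):
    """Score every in-frame ATG of the longest ORF; highest final score first."""
    start, end = longest_orf(seq)
    ranked = []
    for pos in in_frame_atgs(seq, start, end):
        final = consensus_score(seq, pos) * coding_difference(coding_scores, pos, flank)
        ranked.append((final, pos))
    return sorted(ranked, reverse=True)


# Toy input: the coding score jumps from 0 to 1 at position 9.
toy_seq = "CCCATGAAAATGCCCTAA"
toy_scores = [0.0] * 9 + [1.0] * 9
ranking = rank_candidates(toy_seq, toy_scores, lambda s, p: 0.5, flank=3)
```

In this toy input both in-frame ATGs get the same consensus score, so the ATG at position 9 ranks first only because the coding score rises there.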
9
Integrated method: Analysis of full-length mRNA sequences
The use of the Las Vegas algorithm gives a confident decision. Incorporating this algorithm leads to highly accurate recognition of the TIS in human cDNAs for 60% of the cases.
A Las Vegas algorithm provides a correct prediction in some cases and has a “no answer” option in the remaining cases. That is, it either produces the correct result or reports failure.
• This method provides only one prediction for every ORF.
• According to the results on the test group:
  • 94% of the TIS were correctly predicted
  • 6% of the predictions were false positives
Las Vegas
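The "answer or no answer" behaviour can be sketched as a decision rule that returns a position only when the top candidate is clearly ahead. The slides do not give the actual criterion, so the thresholds `min_score` and `min_margin` and the function name `confident_tis` are purely hypothetical, and this sketch does not by itself guarantee that an answer is correct.

```python
def confident_tis(ranked, min_score=0.5, min_margin=0.2):
    """Return the best TIS position, or None for the "no answer" case.

    `ranked` is a list of (final_score, position) pairs, highest score first.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not ranked:
        return None
    best_score, best_pos = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_pos
    return None


print(confident_tis([(0.76, 148), (0.2, 255)]))  # clear winner: a position is returned
print(confident_tis([(0.3, 10), (0.25, 40)]))    # ambiguous: no answer
```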
10
Results – Score Combination 1/3
Cod line: score of the coding ANN
Local line: score and position of the consensus ANN for all ATGs in the coding frame
Nucleotide 255: cod 0.98 – local 0.2
Combining the scores of the coding ANN and the consensus ANN gives a low final score.
11
Results – Score Combination 2/3
Cod line: score of the coding ANN
Local line: score and position of the consensus ANN for all ATGs in the coding frame
Nucleotide 270: cod 0.44 – local 0.4
Combining the scores of the coding ANN and the consensus ANN gives a low final score.
12
Results – Score Combination 3/3
Cod line: score of the coding ANN
Local line: score and position of the consensus ANN for all ATGs in the coding frame
Nucleotide 148: cod 0.95 – local 0.8
Combining the scores of the coding ANN and the consensus ANN gives a high final score.
Correct TIS
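The three examples can be checked with a short calculation, assuming the combination is the plain product of the two scores, as the "Score Multiplication" label in the model diagram suggests. The slides do not show the exact formula, so the product is an assumption, and `final_score` is a hypothetical name.

```python
# (cod, local) scores read from the three example slides, keyed by nucleotide position.
examples = {255: (0.98, 0.2), 270: (0.44, 0.4), 148: (0.95, 0.8)}


def final_score(cod, local):
    """Assumed combination: product of coding and consensus scores."""
    return cod * local


final = {pos: round(final_score(cod, local), 3) for pos, (cod, local) in examples.items()}
for pos, score in final.items():
    print(f"nucleotide {pos}: final score {score}")
```

Under this assumption a high coding score alone (0.98 at nucleotide 255) is not enough: the product stays low unless the consensus score is also high, as at nucleotide 148.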
13
Results – Methods Comparison
Correct TIS positions
14
Results – Methods Comparison
Prediction for the 3 TIS positions with the highest scores
15
Results – Methods Comparison
Consensus motif scores (only for DIANA-TIS)
16
Results – Methods Comparison
Final scores
17
Results – Methods Comparison
Correct predictions
18
Results – Methods Comparison
High prediction score difference
Found TIS but other higher score exists
TIS correct position: 471
Did not find TIS
Prediction Analysis
19
Results – Methods Comparison
Correct TIS positions
Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences
20
Results – Methods Comparison
Length of signal peptide
21
Results – Methods Comparison
Prediction for the 2 TIS positions with the highest scores
22
Results – Methods Comparison
Consensus motif scores (only for DIANA-TIS)
23
Results – Methods Comparison
Final scores
24
Results – Methods Comparison
Prediction example #1: DIANA-TIS is able to distinguish between the TIS and other ATGs better than other ANN-based programs such as NetStart:
• 2 suitable ATGs are 12 nucleotides apart
• Coding/non-coding information is similar
• Consensus motif is completely different
25
Results – Methods Comparison
Prediction example #2: A favorable prediction does not work for all examples:
• Consensus motif is completely different
• Combined score is much lower
• In some signal peptide sequences the coding potential score is relatively low and can thus affect the combined score.
26
Results – Methods Comparison
TIS prediction program                  TIS prediction rate
DIANA-TIS (2001)                        94%
Agarwal & Bafna (1998)                  85%
ATGPred (Salamov et al., 1998)          79%
NetStart (Pedersen & Nielsen, 1997)     78%
These methods allow more than one prediction per gene.
Note: The results come from different datasets, so these numbers should not be directly compared.
27
Thank you!
National & Kapodistrian University of Athens, Department of Informatics
Technological Education Institute of Athens, Department of Biomedical Engineering
Biomedical Research Foundation, Academy of Athens
Demokritos National Center for Scientific Research
Introduction to Bioinformatics, Information Technologies in Medicine and Biology