
Translation initiation start prediction in human cDNAs with high accuracy

A. G. Hatzigeorgiou

Paper Presentation

Introduction to Bioinformatics

Anaxagoras Fotopoulos | Marina Adamou - Tzani

21/01/2014


Introduction

• The primary objective of this research is to help define the coding part of a gene.

• The search is performed on cDNA sequences.

• Coding regions are flanked by untranslated regions (UTRs).

• The focus is on finding the translation initiation start (TIS), which marks the start of the coding region.

cDNA

Complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) template in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.


Previous Research

• Implementation of the ribosome scanning model (Kozak, 1996): the ribosome first attaches to a specific region at the 5' end of the mRNA and then scans the sequence for the first ATG.

• Positional conditional probability matrix (Salzberg, 1997).

• Generalized second-order profiles (Agarwal and Bafna, 1998a).

• No significant differences were observed between the above methods and a weight matrix.

• The above methods are treated together because they share a high rate of false positives.
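The scanning behaviour described above can be illustrated with a minimal sketch. It assumes the cDNA is given as a plain string in 5' to 3' orientation; the function name `first_atg` and the example sequence are hypothetical and not taken from the paper.

```python
def first_atg(cdna: str) -> int:
    """Return the 0-based index of the first ATG from the 5' end, or -1 if none.

    This mimics only the simplest reading of the ribosome scanning model:
    scan from the 5' end and stop at the first ATG encountered.
    """
    return cdna.upper().find("ATG")


# Hypothetical toy sequence: a 6-nucleotide leader followed by ATG.
example = "GCCACCATGGCGATGTAA"
print(first_atg(example))
```

Taking the first ATG ignores the sequence context around each candidate, which is the information the methods on the following slides try to exploit.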


Previous Research

• Pedersen and Nielsen, 1997: ANNs are used to recognize the local context and statistical properties around the TIS. The analysis region is large: 100 bases before and 100 bases after the start codon.

• Salamov et al., 1998: six characteristics are applied to analyze the region around the TIS, including a weight matrix and hexanucleotide difference.

• Zien et al., 2000: Support Vector Machines (SVMs) are used for TIS prediction.

All of the above methods give up to 85% correct predictions.


Methods – Suggested Model

[Diagram: 475 cDNAs (verified and checked) from Swiss-Prot are split into a training gene pool (training set + evaluation set, used for parameter estimation) and a test gene pool (test set, used for TIS prediction). A consensus NN scores the conserved motif and a coding NN scores the coding/non-coding potential; the two scores are combined by multiplication.]


Consensus Neural Network

• 12-nucleotide window

• 325 positive + 325 negative examples

• Binarization of the input

• Selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm
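The binarization step listed above can be sketched as follows. The slides do not state the exact encoding, so this example assumes a common choice: a one-hot code of 4 bits per nucleotide, which turns a 12-nucleotide window into 48 binary inputs. The names `ONE_HOT` and `binarize_window` are hypothetical.

```python
# Assumed encoding (not confirmed by the slides): 4 bits per nucleotide.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binarize_window(window: str) -> list[int]:
    """Encode a 12-nucleotide window as a flat list of 48 binary inputs."""
    if len(window) != 12:
        raise ValueError("expected a 12-nucleotide window")
    bits: list[int] = []
    for base in window.upper():
        bits.extend(ONE_HOT[base])  # raises KeyError on non-ACGT symbols
    return bits


encoded = binarize_window("GCCACCATGGCG")
print(len(encoded))
```

A one-hot code avoids imposing an artificial ordering on the four bases, which a single numeric value per nucleotide would do.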


Coding Neural Network

12-nucleotides long window

• 54-nucleotide window

• The Smith-Waterman algorithm is used to eliminate homologies between training and test data

• 282 genes with less than 70% homology were used for training

• 700 positive + 700 negative sequence regions extracted for training

• 250 positive + 250 negative sequence regions extracted for testing
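As an illustration of the homology filtering mentioned above, here is a minimal Smith-Waterman local alignment score with a linear gap penalty. The scoring parameters (match 1, mismatch -1, gap -1) and the normalization in `similarity` are assumptions made for this sketch; the slides do not say how the 70% homology threshold was computed.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a: str, b: str) -> float:
    """Assumed stand-in for 'percent homology': score relative to the
    best score the shorter sequence could reach (match score of 1)."""
    shortest = min(len(a), len(b))
    return smith_waterman_score(a, b) / shortest if shortest else 0.0


print(similarity("TTACGTTT", "GGACGTGG"))
```

Under these assumptions, a pair of genes would be kept apart in training and test data when `similarity` is at least 0.7.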

• A codon usage statistic is applied: all non-overlapping codons are counted in every window

• The sequence window is rescaled to 64 units, and every unit gives the normalized frequency of one codon in the window

• The resilient backpropagation algorithm is applied to a feed-forward NN
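The codon usage encoding described above can be sketched as follows: a 54-nucleotide window contains 18 non-overlapping codons, and each of the 64 possible codons gets one input unit holding its normalized frequency. The ordering of the 64 units and the names `CODONS` and `codon_frequencies` are assumptions for illustration.

```python
from itertools import product

# All 64 codons in a fixed (assumed) lexicographic order over A, C, G, T.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_frequencies(window: str) -> list[float]:
    """Map a window to 64 normalized frequencies of non-overlapping codons."""
    if len(window) == 0 or len(window) % 3 != 0:
        raise ValueError("window length must be a positive multiple of 3")
    counts = [0] * 64
    n_codons = len(window) // 3
    for k in range(0, len(window), 3):
        counts[INDEX[window[k:k + 3].upper()]] += 1
    return [c / n_codons for c in counts]


window = "ATGGCC" * 9  # hypothetical 54-nucleotide window
freqs = codon_frequencies(window)
print(freqs[INDEX["ATG"]], freqs[INDEX["GCC"]])
```

This representation discards codon order inside the window and keeps only composition, which keeps the input size fixed at 64 regardless of window length.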


Integrated method: Analysis of full-length mRNA sequences

• 1st stage: calculation of a coding score for every nucleotide of the mRNA sequence

• 2nd stage: calculation of the coding evidence of the coding region included in the longest ORF of the sequence

• 3rd stage: a consensus score is calculated for every in-frame ATG

• 4th stage: a coding difference score is calculated for the same in-frame ATG

The final score is obtained by combining the output of the consensus ANN and the coding difference.
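The stages above can be sketched in code under several assumptions. Here the longest ORF is taken to be the longest ATG-to-stop stretch in any of the three forward frames, the two networks are replaced by hypothetical callables `consensus_score` and `coding_difference` that return a score for a given ATG position, and the scores are combined by multiplication. None of these function names come from the paper.

```python
from typing import Callable

STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> tuple[int, int]:
    """Return (start, end) of the longest ATG..stop ORF; (0, 0) if none.
    ORFs without a terminating stop codon are ignored in this sketch."""
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


def in_frame_atgs(seq: str, orf: tuple[int, int]) -> list[int]:
    """List the positions of all ATGs in the reading frame of the ORF."""
    start, end = orf
    return [i for i in range(start, end - 2, 3) if seq[i:i + 3].upper() == "ATG"]


def rank_candidates(seq: str,
                    consensus_score: Callable[[int], float],
                    coding_difference: Callable[[int], float]) -> list[tuple[int, float]]:
    """Score every in-frame ATG and sort by final score, highest first."""
    orf = longest_orf(seq)
    scored = [(pos, consensus_score(pos) * coding_difference(pos))
              for pos in in_frame_atgs(seq, orf)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy usage with made-up scores standing in for the two networks.
seq = "CCATGAAAATGCCCTAAGG"
ranking = rank_candidates(seq, {2: 0.2, 8: 0.8}.get, {2: 0.5, 8: 0.9}.get)
print(ranking[0][0])
```

Restricting candidates to in-frame ATGs of the longest ORF keeps the number of positions to score small compared with scoring every ATG in the sequence.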


Integrated method: Analysis of full-length mRNA sequences

Las Vegas algorithm

A Las Vegas algorithm provides a correct prediction in some cases and has a “no answer” option in the remaining cases. That is, it always produces the correct result or reports failure.

Using the Las Vegas algorithm gives a confident decision: incorporating it leads to highly accurate recognition of the TIS in human cDNAs for 60% of the cases.

• This method provides only one prediction for every ORF.

• According to the results of the test group:
  • 94% of the TIS were correctly predicted
  • 6% of the predictions were false positives
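The “no answer” behaviour described above can be sketched as a decision rule that returns one position or abstains. The slides do not give the actual rule, so the two thresholds below (`min_score` and `min_margin`) and the function name `confident_tis` are purely hypothetical.

```python
from typing import Optional


def confident_tis(scores: dict[int, float],
                  min_score: float = 0.5,
                  min_margin: float = 0.2) -> Optional[int]:
    """Return the best-scoring ATG position, or None for 'no answer'.

    Assumed rule: answer only when the top score is high enough and
    clearly separated from the runner-up.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_pos, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_pos
    return None


print(confident_tis({148: 0.76, 255: 0.196, 270: 0.176}))
print(confident_tis({10: 0.50, 20: 0.45}))
```

Raising either threshold trades coverage for accuracy: fewer sequences receive an answer, but the answers that are given are more reliable.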


Results – Score Combination 1/3

Cod line: Score of coding ANN

Local line: Score and position of consensus ANN for all ATGs in coding frame

Nucleotide 255 : cod 0.98 – local 0.2

A score combination of coding ANN and consensus ANN gives a low final score.

Results – Score Combination 2/3



Nucleotide 270: cod 0.44 – local 0.4

A score combination of coding ANN and consensus ANN gives a low final score.

Results – Score Combination 3/3



Nucleotide 148: cod 0.95 – local 0.8

A score combination of coding ANN and consensus ANN gives a high final score.

Correct TIS
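The three examples above can be checked with a short calculation, assuming the two scores are combined by multiplication as the model diagram indicates. The variable names are hypothetical; the score values are the ones quoted on the slides.

```python
# (coding ANN score, consensus ANN score) for each candidate nucleotide.
examples = {
    255: (0.98, 0.2),
    270: (0.44, 0.4),
    148: (0.95, 0.8),
}

# Final score as the product of the two scores, rounded for display.
final = {pos: round(cod * local, 3) for pos, (cod, local) in examples.items()}
best = max(final, key=final.get)

for pos, score in final.items():
    print(pos, score)
print("highest final score at nucleotide", best)
```

The product is high only when both scores are high, so the strong coding score at nucleotide 255 cannot compensate for its weak consensus score.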


Results – Methods Comparison

Correct TIS positions


Results – Methods Comparison

Prediction for the 3 TIS positions with the highest scores


Results – Methods Comparison

Consensus motif scores (only for DIANA-TIS)



Final scores



Correct predictions



Prediction Analysis

• High prediction score difference

• TIS found, but a higher-scoring position exists

• TIS not found

TIS correct position: 471



Correct TIS positions

Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences



Length of signal peptide



Prediction for the 2 TIS positions with the highest scores



Consensus motif scores (only for DIANA-TIS)



Final scores



Prediction example #1: DIANA-TIS is able to distinguish between the TIS and other ATGs better than other ANN-based programs such as NetStart:

• 2 suitable ATGs are 12 nucleotides apart

• Coding/non-coding information is similar

• Consensus motif is completely different



Prediction example #2: A favorable prediction is not obtained for all examples:

• Consensus motif is completely different

• Combined score is much lower

In some signal peptide sequences the coding potential score is relatively low and can thus affect the combined score.



TIS prediction program                  TIS prediction rate
DIANA-TIS (2001)                        94%
Agarwal & Bafna (1998)                  85%
ATGpr (Salamov et al., 1998)            79%
NetStart (Pedersen & Nielsen, 1997)     78%

These methods allow more than one prediction per gene.

Note: The results come from different datasets, so these numbers should not be directly compared.


Thank you!

National & Kapodistrian University of Athens, Department of Informatics

Technological Education Institute of Athens, Department of Biomedical Engineering

Biomedical Research Foundation, Academy of Athens

Demokritos National Center for Scientific Research

Introduction to Bioinformatics, Information Technologies in Medicine and Biology
