CS5238 Combinatorial methods in bioinformatics Topic: Gene Finding – Promoter Recognition Cen Cen, Er Inn Inn, Miao Xiaoping, Piyush Kanti Bhunre, Yin Jun 1 November 2002


Page 1: CS5238 Combinatorial methods in bioinformatics

CS5238 Combinatorial methods in bioinformatics

Topic: Gene Finding – Promoter Recognition

Cen Cen, Er Inn Inn, Miao Xiaoping, Piyush Kanti Bhunre, Yin Jun

1 November 2002

Page 2: CS5238 Combinatorial methods in bioinformatics

Outline of Presentation

1. Biological Background
2. Gene Finding
3. Promoter Recognition
4. Dragon Promoter Finder
5. Open Problem and Future Research
6. New Algorithm
7. Conclusion

Page 3: CS5238 Combinatorial methods in bioinformatics

Biological Background

What is a gene?

A sequence of DNA that encodes a protein or an RNA molecule.

A gene has 4 regions: the coding region, the 5’ UTR, the 3’ UTR, and the regulatory region (the promoter, which regulates the transcription process).

The human genome has about 3 Gbp (3 billion base pairs), but only about 3% of it is coding region.

Page 4: CS5238 Combinatorial methods in bioinformatics

Central Dogma

Central dogma: the process by which a DNA sequence generates a protein, through transcription and translation.

Promoter: responsible for the initiation and regulation of transcription.

RNA polymerase binds at a TATA sequence (the TATA box) in the promoter region.

Page 5: CS5238 Combinatorial methods in bioinformatics

Central Dogma

Page 6: CS5238 Combinatorial methods in bioinformatics

Promoter Region

Core promoter elements: TATA-box, Initiator (Inr), downstream promoter element (DPE).

3 types of core promoter: TATA-box; TATA-less, Inr-containing; Inr + DPE.

Upstream promoter elements.

TSS (transcription start site): where transcription starts on the DNA.

Source: "The biology of eukaryotic promoter prediction – a review" by Pedersen, A.G. et al.


Page 8: CS5238 Combinatorial methods in bioinformatics

What is Gene Finding?

Generate predictions of gene locations from the primary genomic sequence (DNA sequence) by computational methods.

Task of gene finding: separate the coding regions, non-coding regions and intergenic regions.

Input: a DNA sequence X = x1x2…xn, where each xi belongs to {A, C, G, T}.

Output: a correct labeling of each element of X as belonging to a coding region (CR), a non-coding region (NCR), or an intergenic region.

Page 9: CS5238 Combinatorial methods in bioinformatics

Gene Finding

3 major kinds of gene finding strategies:

Content-based: uses overall properties of the sequence when making predictions.

Site-based: makes use of the presence or absence of a specific sequence, pattern or consensus.

Comparative: uses sequence homology (database searching).

Combinatorial approach: GeneMachine.

Programs: GRAIL, FGENEH, MZEF, GenScan, GeneID, GeneParser, HMMgene and so on.

Page 10: CS5238 Combinatorial methods in bioinformatics

Gene Finding – Open Problems

Overlapping genes: no existing method can deal with this problem.

Alternative splicing and alternative transcription/translation.

Sequencing errors.

Promoter regions (PR) and polyA signals are difficult to identify (high true positives come with high false positives).


Page 12: CS5238 Combinatorial methods in bioinformatics

Promoter Recognition

Accurate promoter recognition (PR) can help to:

Detect the respective gene more easily.

Determine the 5’ end of the respective gene more precisely.

Localize the regions that contain numerous different transcription control components.

Developing a perfect predictive model for PR is challenging.

Page 13: CS5238 Combinatorial methods in bioinformatics

Main Approach to PR

Pattern-driven strategy

1. Collect a set of real binding sites and build a characteristic definition, representation, pattern or profile from them.

2. Recognize individual potential binding sites by using their characteristic profiles.

3. Assemble the candidate binding sites following descriptions and rules about how they should be arranged.

Page 14: CS5238 Combinatorial methods in bioinformatics

Problem: Given a collection of known binding sites, how do we develop a representation of those sites that is useful for searching for them in new sequences?

Possible representations: consensus sequences, positional weight matrices (PWM), hidden Markov profiles, multilayer neural networks, and so on.
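As an illustration of one of these representations, the following is a minimal sketch of a single-nucleotide positional weight matrix built from aligned binding sites. The example sites, the pseudocount of 1 and the uniform background are assumptions made for the illustration, not values from the presentation, and the function names are hypothetical.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites.

    Assumes a uniform background (0.25 per base) and adds a pseudocount so that
    unseen bases do not produce log(0).
    """
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must be aligned"
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in alphabet})
    return pwm

def score_site(pwm, candidate):
    """Sum the per-position log-odds scores of a candidate of the same width."""
    return sum(column[base] for column, base in zip(pwm, candidate))

def best_hit(pwm, sequence):
    """Return the start index of the highest-scoring window in the sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score_site(pwm, sequence[i:i + width]))

# Hypothetical TATA-like training sites, for illustration only.
example_sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
example_pwm = build_pwm(example_sites)
hit = best_hit(example_pwm, "GGCGCTATAAAGGC")
```

Scanning every window and keeping the best score corresponds to step 2 of the pattern-driven strategy: recognizing individual candidate sites from a profile.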

Page 15: CS5238 Combinatorial methods in bioinformatics

Promoter Recognition Programs

Statistical approach + artificial intelligence techniques: Dragon Promoter Finder (DPF)

PromoterInspector

Promoter 2.0

Page 16: CS5238 Combinatorial methods in bioinformatics

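Accuracy Metric for PR

A common measure of prediction accuracy: sensitivity (SE) and specificity (SP).

SE = TP / (TP + FN)

SP = TN / (TN + FP)

Here TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.

The evaluation is largely influenced by the training and test sets.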

Page 17: CS5238 Combinatorial methods in bioinformatics

Prediction of Promoter

2 x 2 contingency table

Page 18: CS5238 Combinatorial methods in bioinformatics

Example of Prediction - DPF

Promoter positions (exact positions of the TSS):

2360, 2585, 4125, 5026, 5734, 7090, 8567, 10641, -2700, -12561, -12855

PREDICTED TRANSCRIPTION START SITES:

gi_59865_emb_X02138.1_HEHSV1SU Herpes simplex virus type 1 _HSV1_ short unique region DNA

Sequence length: 12979; # of bases: A=2286, C=4271, G=4078, T=2344

Predicted TSS

Forward strand: 4125 5733 7093 8567 10641 (# of guesses = 5)

Reverse complement strand: -12561 -2698 (# of guesses = 2)
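The counts behind this example can be reproduced with a small matching routine. This is only a sketch: the tolerance of 5 bp for calling a prediction correct is an assumption (the slides do not state the window used), and `match_predictions` is a hypothetical helper name. Negative positions denote the reverse complement strand, as in the listing.

```python
# Known TSS positions and DPF predictions, copied from the example.
KNOWN_TSS = [2360, 2585, 4125, 5026, 5734, 7090, 8567, 10641, -2700, -12561, -12855]
PREDICTED_TSS = [4125, 5733, 7093, 8567, 10641, -12561, -2698]

def match_predictions(known, predicted, tolerance=5):
    """Greedily match each prediction to an unused known TSS within `tolerance` bp.

    Returns (true positives, false positives, false negatives).
    The tolerance is an assumed value, not one given in the presentation.
    """
    unmatched = list(known)
    tp = 0
    for p in predicted:
        hit = next((k for k in unmatched if abs(k - p) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(predicted) - tp
    fn = len(unmatched)
    return tp, fp, fn

tp, fp, fn = match_predictions(KNOWN_TSS, PREDICTED_TSS)
print(tp, fp, fn)  # prints: 7 0 4
```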

Page 19: CS5238 Combinatorial methods in bioinformatics

Measurement

Dragon Promoter Finder, BIC-KRDL Singapore

                    Pred. positive   Pred. negative
Exp. positive             7                4
Exp. negative             0             6479

SE = 7/11 = 0.64

SP = 6479/6479 = 1
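The two values can be checked directly from the four counts of the measurement (TP = 7, FN = 4, FP = 0, TN = 6479). A minimal sketch, with hypothetical function names:

```python
def sensitivity(tp, fn):
    """SE = TP / (TP + FN): fraction of real promoters that were found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SP = TN / (TN + FP): fraction of non-promoters correctly rejected."""
    return tn / (tn + fp)

se = sensitivity(tp=7, fn=4)
sp = specificity(tn=6479, fp=0)
print(f"SE = {se:.2f}, SP = {sp:.2f}")  # prints: SE = 0.64, SP = 1.00
```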


Page 21: CS5238 Combinatorial methods in bioinformatics

Dragon Promoter Finder - Introduction

Dragon Promoter Finder (DPF):

Locates RNA polymerase II promoters in DNA sequences of vertebrates.

Predicts transcription start site (TSS) positions.

Is strand specific.

Components: nonlinear promoter recognition models, signal processing, artificial neural networks (ANNs), sensors.

Page 22: CS5238 Combinatorial methods in bioinformatics

Introduction (cont)

The latest version: Dragon Promoter Finder Ver. 1.3.

Main difference in the new version: the models are now specialized for C+G-rich and for C+G-poor sequences.

Page 23: CS5238 Combinatorial methods in bioinformatics

Structure

Overall Model: comprises a collection of a number of basic models.

Basic Model: made up of two sub-models, A and B, which are trained for different ranges of system sensitivity and trained separately for the best performance.

Sub-Model

Page 24: CS5238 Combinatorial methods in bioinformatics

Overall Model

Page 25: CS5238 Combinatorial methods in bioinformatics

Basic Model

The overall model is a composite collection of basic models.

The basic models possess identical structure.

Each is trained for a narrow specificity range.

Data processing in each model is analogous.

Page 26: CS5238 Combinatorial methods in bioinformatics

Basic Model

Page 27: CS5238 Combinatorial methods in bioinformatics

Sub-model

Page 28: CS5238 Combinatorial methods in bioinformatics

Sub-model

Three sensors, one for each specific functional region of a gene: promoter, coding exon, intron.

Each region is represented as positional distributions of overlapping pentamers.

ANNs

Page 29: CS5238 Combinatorial methods in bioinformatics

Sensors

Pentamers: all sequences of 5 consecutive nucleotides (AAAAA, AAAAC, AAAAG, …); there are 4^5 = 1024 pentamers.

The most significant 256 pentamers are selected from the 1024 according to statistical relevance.

Positional weight matrices (PWM): the positional distribution of the selected pentamers.

PWMs are generated for each of the 3 functional groups (promoter, exon and intron) by counting the frequencies of all selected pentamers at each position.
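The counting step can be sketched as follows. This is an illustration under stated assumptions: the training windows are toy strings, the set of selected pentamers is passed in as a parameter because the slides do not say how statistical relevance is measured, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# All 4**5 = 1024 possible pentamers over the DNA alphabet.
ALL_PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]

def overlapping_pentamers(window):
    """Return the L - 4 successive overlapping pentamers of a window of length L."""
    return [window[i:i + 5] for i in range(len(window) - 4)]

def positional_frequencies(windows, selected):
    """Count how often each selected pentamer occurs at each window position.

    windows:  equal-length training sequences of one functional group.
    selected: set of pentamers to keep (how the 256 are chosen is not shown here).
    Returns a list with one dict per position, mapping pentamer -> frequency.
    """
    n_positions = len(windows[0]) - 4
    counts = [Counter() for _ in range(n_positions)]
    for window in windows:
        for i, pentamer in enumerate(overlapping_pentamers(window)):
            if pentamer in selected:
                counts[i][pentamer] += 1
    return [{p: c / len(windows) for p, c in column.items()} for column in counts]

# Toy training windows, for illustration only.
toy_windows = ["ACGTACG", "ACGTTCG", "TCGTACG"]
toy_freq = positional_frequencies(toy_windows, set(ALL_PENTAMERS))
```

One such table would be built per functional group (promoter, exon, intron), each from its own training windows.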

Page 30: CS5238 Combinatorial methods in bioinformatics

How to analyze the content of a data window:

Sequence W = n1n2…nL-1nL, where each ni belongs to {A, C, G, T}.

Sequence P of successive overlapping pentamers pi: P = p1p2…pL–5pL–4.

$$S = \frac{\sum_{i=1}^{L-4} p_i \otimes f_{j,i}}{\sum_{i=1}^{L-4} \max_j f_{j,i}}, \qquad p_i \otimes f_{j,i} = \begin{cases} f_{j,i}, & \text{if } p_i = p_j^i \\ 0, & \text{if } p_i \neq p_j^i \end{cases}$$

$p_j^i$: the jth pentamer at position i

$f_{j,i}$: the frequency of the jth pentamer at position i

S: the score for each data window

The higher S is, the more likely the data window represents the respective functional region.

These scores are input to a nonlinear signal processing block (SPB).

The output from the SPB is then input to the ANN.
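A sensor score of this kind can be sketched as below. The reading assumed here is that the score sums, over all window positions, the stored frequency of the pentamer actually observed at that position, and divides by the sum of the largest stored frequency at each position, so that S lies between 0 and 1. The frequency table is a toy example and `window_score` is a hypothetical name.

```python
def window_score(window, freq):
    """Score a data window against a positional pentamer frequency table.

    freq[i] maps pentamer -> frequency at position i; pentamers that are
    absent from freq[i] contribute 0. The result is normalized by the best
    achievable sum, so a window matching the top pentamer everywhere scores 1.
    """
    pentamers = [window[i:i + 5] for i in range(len(window) - 4)]
    matched = sum(freq[i].get(p, 0.0) for i, p in enumerate(pentamers))
    best = sum(max(freq[i].values(), default=0.0) for i in range(len(pentamers)))
    return matched / best if best > 0 else 0.0

# Toy frequency table for windows of length 7 (3 pentamer positions).
toy_table = [
    {"ACGTA": 0.6, "TCGTA": 0.2},
    {"CGTAC": 0.5},
    {"GTACG": 0.4, "GTTCG": 0.1},
]
high = window_score("ACGTACG", toy_table)
low = window_score("TCGTTCG", toy_table)
```

In DPF, three such scores (promoter, exon and intron sensors) would be computed per window before being passed on.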

Page 31: CS5238 Combinatorial methods in bioinformatics

ANNs

Inputs: scores (the outputs of the sensors).

The ANN performs multi-sensor integration.

It is trained by the Bayesian regularization method to separate promoter regions from non-promoter regions.

The threshold that best separated promoters from non-promoters was selected.

If the ANN output exceeds the threshold, the data window is predicted to be a promoter region, with the TSS at a position 50 bp before the data window’s end.

Page 32: CS5238 Combinatorial methods in bioinformatics

Evaluation

DPF successfully recognizes both CpG island-related and CpG island-nonrelated promoters.

Its performance on several large sets (A, B, and human chromosome 22) is reasonably consistent.

On average, its expected maximum sensitivity is approximately 66 percent.

In general, DPF produces many times fewer FP predictions than the systems it was compared against at the same sensitivity level.

Page 33: CS5238 Combinatorial methods in bioinformatics

Comparison


Page 35: CS5238 Combinatorial methods in bioinformatics

Open Problem & Future Research

Open problems:

Lack of biological information on the transcription process.

Characteristics of promoters, leading to a low accuracy rate.

Future research work:

Designing specific algorithms for either classes of promoters or species-specific promoters.

Comparative sequence analysis.

Combinatorial approach.

Data mining tools.


Page 37: CS5238 Combinatorial methods in bioinformatics

Gene Recognition Algorithm Using Dynamic Programming Approach

Presented by: Yin Jun

Page 38: CS5238 Combinatorial methods in bioinformatics

Dynamic Programming Algorithm

Existing dynamic programming algorithms for gene finding:

Snyder and Stormo’s method: GeneParser

Solovyev et al.’s method: FGENEH

MORGAN’s DP algorithm

Page 39: CS5238 Combinatorial methods in bioinformatics

Goal of Those Algorithms

1. Divide the DNA sequence into alternating intron and exon regions.

2. Define a score for each division and find the division with the maximum score. The higher the score, the better the division.

Page 40: CS5238 Combinatorial methods in bioinformatics

Advantages and Disadvantages of Snyder and Stormo’s Algorithm

Advantages: the donor and the acceptor site; HMM hidden states.

Disadvantages: cannot recognize promoters; 3-mer based.

Page 41: CS5238 Combinatorial methods in bioinformatics

Our Algorithm

Combines the ideas of “Dragon Promoter Finder” and “Snyder and Stormo’s algorithm”.

Can deal with promoters.

Uses pentamers instead of 3-mers, which is more efficient.

Uses dynamic programming.

Page 42: CS5238 Combinatorial methods in bioinformatics

Training Phase

Pentamer: 5 consecutive bases, for example “ACGGT”.

There are 4^5 = 1024 different kinds of pentamers.

Divide a DNA sequence into pentamers.

From the training data, we can obtain the probability that each kind of pentamer belongs to a promoter, an intron or an exon.
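The training phase might look like the following sketch. It assumes that each training sequence carries a single region label, that sequences are cut into non-overlapping pentamers (a trailing remainder shorter than 5 bases is dropped), and that the probability is the relative frequency of each region among all occurrences of a pentamer. The training data and the function name are hypothetical.

```python
from collections import Counter, defaultdict

REGIONS = ("promoter", "intron", "exon")

def train_probability_table(labeled_sequences):
    """Estimate, for each pentamer, the probability of each region.

    labeled_sequences: iterable of (dna_string, region) pairs.
    Returns {pentamer: {region: probability}}, each row summing to 1.
    """
    counts = defaultdict(Counter)
    for sequence, region in labeled_sequences:
        # Step through the sequence in non-overlapping blocks of 5 bases.
        for i in range(0, len(sequence) - 4, 5):
            counts[sequence[i:i + 5]][region] += 1
    table = {}
    for pentamer, region_counts in counts.items():
        total = sum(region_counts.values())
        table[pentamer] = {r: region_counts[r] / total for r in REGIONS}
    return table

# Toy labeled training data, for illustration only.
toy_training = [
    ("ACGGTACGGT", "exon"),
    ("ACGGTCGATA", "intron"),
    ("TATAAACGGT", "promoter"),
]
toy_probabilities = train_probability_table(toy_training)
```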

Page 43: CS5238 Combinatorial methods in bioinformatics

Probability Table

Pentamer     Promoter   Intron   Exon
A: ACGGT     0.13       0.20     0.67
B: CGATA     0.10       0.44     0.46
C: ATGCC     0.87       0.07     0.06
D: TAGTG     0.24       0.49     0.27

Page 44: CS5238 Combinatorial methods in bioinformatics

Principle of Division (1)

Color key: red = promoter; green = intron; blue = exon.

Good: C AA B B C BC D D D

Bad (low sum of probability): C AA B B C BC D D D

Page 45: CS5238 Combinatorial methods in bioinformatics

Principle of Division (2)

Color key: red = promoter; green = intron; blue = exon.

Good: C AA B B C BC D D D

Bad (too frequent mutation): C AA B B C BC D D D

Page 46: CS5238 Combinatorial methods in bioinformatics

Mutation Penalty

M(x, x) should be 0 for x ∈ {1, 2, 3}.

1: promoter; 2: intron; 3: exon

Example (rows: from; columns: to):

From \ To    1      2      3
1            0      4.1    4.4
2            8      0      2.6
3            7.1    3.2    0

Page 47: CS5238 Combinatorial methods in bioinformatics

Notation

P(p, r): the probability that pentamer p belongs to region r. Obtained from the training data.

M(s, t): the mutation penalty. Parameters to be specified.

pi (1 ≤ i ≤ n): the i-th pentamer in the DNA sequence. Input data (testing data).

a(pi): the region assignment result, a(pi) ∈ {1, 2, 3}. Output data.

Page 48: CS5238 Combinatorial methods in bioinformatics

Score Function

For a division assignment a, its score is

$$S(a) = \sum_{i=1}^{n} P(p_i, a(p_i)) - \sum_{i=2}^{n} M(a(p_{i-1}), a(p_i))$$

We use a dynamic programming algorithm to find the best division assignment, the one whose score is the highest.
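To make the score concrete, here is a minimal sketch that evaluates one assignment. It reads the score as the sum of the region probabilities of all pentamers minus the sum of the mutation penalties between neighbouring pentamers. The values are the example probabilities and penalties from the slides, keyed by the pentamer labels A to D; the function name is hypothetical.

```python
# P[label] = (promoter, intron, exon) probabilities from the example table.
P = {
    "A": (0.13, 0.20, 0.67),
    "B": (0.10, 0.44, 0.46),
    "C": (0.87, 0.07, 0.06),
    "D": (0.24, 0.49, 0.27),
}
# M[from - 1][to - 1] = mutation penalty; regions: 1 promoter, 2 intron, 3 exon.
M = [
    [0.0, 4.1, 4.4],
    [8.0, 0.0, 2.6],
    [7.1, 3.2, 0.0],
]

def assignment_score(pentamers, assignment):
    """Score an assignment of regions (1, 2 or 3) to a list of pentamer labels."""
    gain = sum(P[p][a - 1] for p, a in zip(pentamers, assignment))
    penalty = sum(
        M[assignment[i - 1] - 1][assignment[i] - 1]
        for i in range(1, len(assignment))
    )
    return gain - penalty

all_exon = assignment_score(["C", "A", "A"], [3, 3, 3])      # 0.06 + 0.67 + 0.67
with_switch = assignment_score(["C", "A", "A"], [1, 3, 3])   # 2.21 minus penalty 4.4
```

With these example penalties a single region change costs more than it can gain over three pentamers, which illustrates why divisions with too frequent mutation score badly.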

Page 49: CS5238 Combinatorial methods in bioinformatics

Bases

Let F(i, j, s, t) be the optimal score for the consecutive segment of pentamers from the i-th to the j-th, where the i-th pentamer is assigned region s and the j-th pentamer is assigned region t.

$$F(i, i-1, s, t) = 0$$

$$F(i, i, s, s) = P(p_i, s)$$

Page 50: CS5238 Combinatorial methods in bioinformatics

Recursive Definition

$$F(i, j, s, t) = \max_{k,\,u,\,v} \bigl( F(i, k, s, u) + F(k+1, j, v, t) - M(u, v) \bigr)$$

Here k ranges over i ≤ k < j, and u, v ∈ {1, 2, 3}.

Finally, we get F(1, n, s, t), where s, t ∈ {1, 2, 3}.

Pick the highest of the 9 scores.
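The interval recursion can be sketched with memoization and checked against exhaustive search on a short input. Several details are assumptions of this sketch: regions are 0-based (0 promoter, 1 intron, 2 exon), a single-pentamer segment whose two end labels differ is treated as impossible (minus infinity), and the split point k runs from i to j - 1. The tables are the example values from the slides; the function names are hypothetical.

```python
from functools import lru_cache
from itertools import product

# Example values from the slides; regions are 0-based here.
P = {"A": (0.13, 0.20, 0.67), "B": (0.10, 0.44, 0.46),
     "C": (0.87, 0.07, 0.06), "D": (0.24, 0.49, 0.27)}
M = ((0.0, 4.1, 4.4), (8.0, 0.0, 2.6), (7.1, 3.2, 0.0))

def score(pentamers, assignment, P=P, M=M):
    """Sum of region probabilities minus penalties between neighbours."""
    gain = sum(P[p][r] for p, r in zip(pentamers, assignment))
    penalty = sum(M[u][v] for u, v in zip(assignment, assignment[1:]))
    return gain - penalty

def best_score_dp(pentamers, P=P, M=M):
    """Best score over all assignments, via the interval recursion on F(i, j, s, t)."""
    pentamers = tuple(pentamers)
    n = len(pentamers)
    neg_inf = float("-inf")

    @lru_cache(maxsize=None)
    def F(i, j, s, t):
        if i == j:
            # One pentamer has one region, so both end labels must agree.
            return P[pentamers[i]][s] if s == t else neg_inf
        # Split into [i..k] ending in region u and [k+1..j] starting in region v.
        return max(
            F(i, k, s, u) + F(k + 1, j, v, t) - M[u][v]
            for k in range(i, j) for u in range(3) for v in range(3)
        )

    return max(F(0, n - 1, s, t) for s in range(3) for t in range(3))

def best_score_brute_force(pentamers, P=P, M=M):
    """Reference answer: try all 3**n assignments (only feasible for small n)."""
    return max(score(pentamers, a, P, M)
               for a in product(range(3), repeat=len(pentamers)))

example = list("CAABBCD")
dp_result = best_score_dp(example)
```

Because the score only couples neighbouring pentamers, a left-to-right recurrence over positions would reach the same optimum in time linear in n; the interval form is kept here because it mirrors the recursion in the slides.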

Page 51: CS5238 Combinatorial methods in bioinformatics

Time Complexity

There are 9n^2/2 = O(n^2) entries in the dynamic programming table.

Filling each entry needs n/2 = O(n) time on average.

The total time complexity is O(n^3).


Page 53: CS5238 Combinatorial methods in bioinformatics

Conclusion

Significant achievements in promoter recognition techniques and algorithms contribute to major advances in gene finding.

There is still room for improvement in promoter recognition.

A new algorithm is proposed for gene recognition.