Poisson Approximation for Palindrome Distributions in DNA Viral Genomes

Ming-Ying Leung, Ph.D., Director, Bioinformatics Program; Professor, Department of Mathematical Sciences, The University of Texas at El Paso, El Paso, Texas, U.S.A.

Post on 30-Dec-2021
TRANSCRIPT

Page 1: Poisson Approximation for Palindrome Distributions in DNA

Poisson Approximation for Palindrome Distributions in DNA Viral Genomes

Ming-Ying Leung, Ph.D.
Director, Bioinformatics Program
Professor, Department of Mathematical Sciences
The University of Texas at El Paso
El Paso, Texas, U.S.A.

Page 2: Poisson Approximation for Palindrome Distributions in DNA

Law of Small Numbers (Poisson, 1837)

Let $X_1, X_2, \ldots, X_n$ be i.i.d. Bernoulli random variables with success probability $p$, and let

\[ W = \sum_{i=1}^{n} X_i . \]

If $n \to \infty$ and $p \to 0$ in such a way that $np = \lambda > 0$, then for $k = 0, 1, 2, \ldots$

\[ P(W = k) = \binom{n}{k} p^k (1-p)^{n-k} \;\to\; e^{-\lambda} \frac{\lambda^k}{k!} = P(Y = k), \]

where $Y$ is a Poisson random variable with parameter $\lambda$.
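The convergence stated above can be checked numerically. The following is a minimal sketch, not part of the original slides: it fixes $\lambda$, sets $p = \lambda/n$, and reports the largest pointwise gap between the binomial and Poisson probabilities as $n$ grows. The function names and the cutoff `kmax` are illustrative choices.

```python
import math


def binomial_pmf(n, p, k):
    """P(W = k) for W ~ Binomial(n, p); zero when k > n."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(lam, k):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(n, lam, kmax=30):
    """Largest |P(W = k) - P(Y = k)| over k = 0..kmax, with p = lam / n."""
    p = lam / n
    return max(
        abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(kmax + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, max_pmf_gap(n, lam=2.0))
```

The printed gaps shrink as $n$ increases, which is the behaviour the law of small numbers describes.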

Page 3: Poisson Approximation for Palindrome Distributions in DNA

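The Chen-Stein Method (Chen, 1970)

Let $I$ be an index set. For any $\alpha \in I$, $X_\alpha$ is a Bernoulli random variable with success probability $p_\alpha$, and $B(\alpha)$ is a subset of $I$ containing $\alpha$, called the neighborhood of dependence of $\alpha$. Let

\[ W = \sum_{\alpha \in I} X_\alpha \]

and let $Y$ be a Poisson random variable with parameter

\[ \lambda = \sum_{\alpha \in I} p_\alpha . \]

Define

\[ b_1 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha)} p_\alpha p_\beta \]

\[ b_2 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha) \setminus \{\alpha\}} p_{\alpha\beta}, \quad \text{where } p_{\alpha\beta} = E[X_\alpha X_\beta] \]

\[ b_3 = \sum_{\alpha \in I} s_\alpha, \quad \text{where } s_\alpha = E\,\bigl| E[X_\alpha - p_\alpha \mid X_\beta,\ \beta \notin B(\alpha)] \bigr| \]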

then

\[ d_{TV}(W, Y) \le (b_1 + b_2)\,\frac{1 - e^{-\lambda}}{\lambda} + b_3 \min\left\{ 1, \sqrt{\frac{2}{\lambda e}} \right\} . \]
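A small numerical illustration may help here. The sketch below is not from the slides and covers only the simplest special case: independent indicators with $B(\alpha) = \{\alpha\}$, so that $b_2 = b_3 = 0$ and $b_1 = \sum_\alpha p_\alpha^2$. It assumes the usual factor $(1 - e^{-\lambda})/\lambda$ multiplying $b_1 + b_2$ in the Chen-Stein bound, and compares that bound with the exact total variation distance obtained by convolving the Bernoulli distributions. All function names are illustrative.

```python
import math


def bernoulli_sum_pmf(probs):
    """Exact pmf of a sum of independent Bernoulli(p) variables, by convolution."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf


def poisson_pmf_list(lam, kmax):
    """Poisson(lam) probabilities for k = 0..kmax, computed iteratively."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def total_variation_to_poisson(probs):
    """d_TV between the Bernoulli sum W and Poisson(sum of probs)."""
    lam = sum(probs)
    pmf_w = bernoulli_sum_pmf(probs)
    pmf_y = poisson_pmf_list(lam, len(probs))
    diff = sum(abs(a - b) for a, b in zip(pmf_w, pmf_y))
    # Poisson mass beyond len(probs), where W has no mass at all.
    tail = max(0.0, 1.0 - sum(pmf_y))
    return 0.5 * (diff + tail)


def chen_stein_bound_independent(probs):
    """Bound b1 * (1 - exp(-lam)) / lam for the independent case (b2 = b3 = 0)."""
    lam = sum(probs)
    b1 = sum(p * p for p in probs)
    return b1 * (1.0 - math.exp(-lam)) / lam


if __name__ == "__main__":
    probs = [0.01 * (1 + i % 5) for i in range(50)]
    print(total_variation_to_poisson(probs), chen_stein_bound_independent(probs))
```

For dependent indicators, such as overlapping palindrome occurrences, $b_2$ and possibly $b_3$ are nonzero and have to be worked out from the specific neighborhood structure; this sketch does not attempt that.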

Page 4: Poisson Approximation for Palindrome Distributions in DNA
Page 5: Poisson Approximation for Palindrome Distributions in DNA

DNA

• DNA is deoxyribonucleic acid, made up of 4 nucleotide bases:
  – Adenine (A)
  – Cytosine (C)
  – Guanine (G)
  – Thymine (T)

• The bases A and T form a complementary pair, as do C and G.

[Figure: paired bases labeled A, C, G, T]

Page 6: Poisson Approximation for Palindrome Distributions in DNA

Replication Origins and Palindromes

• A high concentration of palindromes exists around the replication origins of other herpesviruses.

• Locating clusters of palindromes (above a minimal length) on the CMV genome sequence might reveal likely locations of its replication origins.

Page 7: Poisson Approximation for Palindrome Distributions in DNA

Palindromes in Letter Sequences

Odd Palindrome: “A nut for a jar of tuna”

Remove spaces and capitalize:

ANUTFORA J AROFTUNA

Even Palindrome: “Step on no pets”

STEPON NOPETS

Page 8: Poisson Approximation for Palindrome Distributions in DNA

DNA Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A DNA palindrome must be even in length, e.g., a palindrome of length 10:

5’ ….. GCAATATTGC ….. 3’
3’ ….. CGTTATAACG ….. 5’

Position:  j−L+1  j−L+2  …  j    j+1      …  j+L−1     j+L
Base:      b_1    b_2    …  b_L  b_{L+1}  …  b_{2L−1}  b_{2L}

We say that a palindrome of length 2L occurs at position j when the (j−i+1)st and the (j+i)th bases are complementary to each other for i = 1, …, L. In an i.i.d. sequence model this occurs with probability

\[ \left[ 2 (p_A p_T + p_C p_G) \right]^L . \]
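The definition above translates directly into code. The sketch below is illustrative only (the function names are not from the slides): it checks the reverse-complement property, lists the 1-based positions j at which a palindrome of length 2L occurs, and evaluates the i.i.d. occurrence probability $[2(p_A p_T + p_C p_G)]^L$.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse the sequence and replace each base by its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_dna_palindrome(seq):
    """True if seq has even length and equals its own reverse complement."""
    return len(seq) % 2 == 0 and seq == reverse_complement(seq)


def palindrome_positions(seq, L):
    """1-based positions j where bases j-i+1 and j+i are complementary, i = 1..L."""
    n = len(seq)
    hits = []
    for j in range(L, n - L + 1):
        # 1-based position j-i+1 is index j-i; 1-based position j+i is index j+i-1.
        if all(COMPLEMENT[seq[j - i]] == seq[j + i - 1] for i in range(1, L + 1)):
            hits.append(j)
    return hits


def palindrome_probability(L, p_a, p_t, p_c, p_g):
    """Probability of a length-2L palindrome at a given position, i.i.d. model."""
    return (2.0 * (p_a * p_t + p_c * p_g)) ** L


if __name__ == "__main__":
    print(is_dna_palindrome("GCAATATTGC"))
    print(palindrome_positions("GCAATATTGC", 5))
    print(palindrome_probability(5, 0.25, 0.25, 0.25, 0.25))
```

Note that `palindrome_positions` reports every position where the length-2L condition holds; it does not extend palindromes to their full length.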

Page 9: Poisson Approximation for Palindrome Distributions in DNA

Poisson Process Approximation of Palindrome Distribution

Let $\Xi$ be the process representing the palindrome occurrences on a random nucleotide sequence of length $n$ generated by the i.i.d. model, and let $Z_\lambda$ be the Poisson process with rate $\lambda$.

Proposition (Leung et al. 2005, J. Computational Biology). Assume $p_A = p_T$ and $p_C = p_G$, and suppose that $n, L \to \infty$ such that $n\theta^L = \lambda$, where $\theta = 2(p_A p_T + p_C p_G)$ and $\lambda \ge 1/32$ is a fixed positive constant. Then

\[ d_2(\Xi, Z_\lambda) \le c L \theta^{L/2} \to 0 . \]

Here $d_2$ stands for the Wasserstein distance, and $c$ is a constant $\le 131$.

Page 10: Poisson Approximation for Palindrome Distributions in DNA

The Scan Statistic

$X_1, X_2, \ldots, X_n \sim$ i.i.d. Uniform(0,1)

$S_i = X_{(i+1)} - X_{(i)}$ = $i$th spacing

$A_r(i) = S_i + \cdots + S_{i+r-1}$ = sum of $r$ adjoining spacings

$r$-Scan Statistic: $A_r = \min_i A_r(i)$
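Because the sum of $r$ adjoining spacings telescopes to $X_{(i+r)} - X_{(i)}$, the $r$-scan statistic can be computed from the sorted points in one pass. The following is a minimal sketch with an illustrative function name; it works for any list of positions, not only uniform samples.

```python
def r_scan_statistic(points, r):
    """Minimum over i of X_(i+r) - X_(i), the smallest sum of r adjoining spacings."""
    xs = sorted(points)
    if r < 1 or r >= len(xs):
        raise ValueError("r must satisfy 1 <= r < number of points")
    return min(xs[i + r] - xs[i] for i in range(len(xs) - r))


if __name__ == "__main__":
    positions = [0.10, 0.50, 0.55, 0.90, 0.60]
    print(r_scan_statistic(positions, 1))
    print(r_scan_statistic(positions, 2))
```

A small value of the statistic indicates a tight cluster of $r + 1$ points, which is how it is used to flag unusually dense groups of palindromes.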

Page 11: Poisson Approximation for Palindrome Distributions in DNA

Palindrome Length Score (PLS)

Chew et al. (Nucleic Acids Res 33: e134, 2005):
(1) Identify all palindromes of length at least 2L (using EMBOSS).
(2) Score a fully extended palindrome: if its length is 2s, the palindrome is given a score s.
(3) Window score: W_i is defined as the sum of the scores of all the palindromes whose centers lie in the ith window.

Nonparametric approach with PLS to predict replication origins in herpesviruses:

Sensitivity = 67%
Positive Predictive Value (PPV) = 15%
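The window-scoring step (3) can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Chew et al.: palindromes are given as hypothetical `(center, s)` pairs with 1-based center positions and half-lengths `s`, and the genome is cut into consecutive non-overlapping windows of equal length. The slides do not say whether the windows overlap or how they are placed.

```python
def window_scores(palindromes, genome_length, window_length):
    """Sum palindrome scores per window.

    palindromes: iterable of (center, s) pairs, center 1-based, s = half-length.
    Windows are assumed consecutive and non-overlapping (an assumption of this
    sketch); the last window may be shorter than window_length.
    """
    n_windows = -(-genome_length // window_length)  # ceiling division
    scores = [0] * n_windows
    for center, s in palindromes:
        if not 1 <= center <= genome_length:
            raise ValueError("palindrome center outside the genome")
        scores[(center - 1) // window_length] += s
    return scores


if __name__ == "__main__":
    found = [(5, 6), (120, 5), (130, 7), (250, 10)]
    print(window_scores(found, genome_length=300, window_length=100))
```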

Page 12: Poisson Approximation for Palindrome Distributions in DNA

Compound Poisson Random Variable

Let N be a Poisson random variable with mean $\lambda = w\theta^L$, where

  w = window length
  θ = $2(p_A p_T + p_C p_G)$
  L = minimal palindrome half-length (palindromes have length at least 2L)

PLS for a window is defined as

\[ Z = \sum_{j=1}^{N} Y_j \]

where
  N = number of (fully extended) palindromes
  $Y_j$ = score given to the jth palindrome

Page 13: Poisson Approximation for Palindrome Distributions in DNA

Compound Poisson Random Variable Cont’d

With an i.i.d. model for the nucleotide sequence, the probability mass function of $Y_j$ can be written as

\[ p_Y(l) = \begin{cases} \theta^{\,l-L}(1-\theta) & \text{if } L \le l < M \\ \theta^{\,M-L} & \text{if } l = M \end{cases} \]

where M is a prescribed upper bound for palindrome lengths used. (M = 3L for the herpes dataset.)

Then the probability mass function of Z is computed using the recursive formula derived from Stein’s identity (Barbour et al. 1992):

\[ P(Z = k) = \frac{\lambda}{k} \sum_{l=1}^{k} l\, p_Y(l)\, P(Z = k - l) . \]
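The recursion above can be run directly once a starting value is fixed. The sketch below uses $P(Z = 0) = e^{-\lambda}$, which holds because every score is at least $L \ge 1$; that starting value and the function names are additions for illustration, and the parameter values in the usage lines are arbitrary rather than taken from the herpesvirus data.

```python
import math


def score_pmf(theta, L, M):
    """Truncated geometric pmf of a palindrome score on L..M."""
    pmf = {l: theta ** (l - L) * (1.0 - theta) for l in range(L, M)}
    pmf[M] = theta ** (M - L)
    return pmf


def compound_poisson_pmf(lam, pmf_y, kmax):
    """P(Z = k) for k = 0..kmax, where Z is a Poisson(lam) sum of scores.

    Uses P(Z = k) = (lam / k) * sum_l l * p_Y(l) * P(Z = k - l),
    started from P(Z = 0) = exp(-lam) (valid because all scores are >= 1).
    """
    pz = [math.exp(-lam)] + [0.0] * kmax
    for k in range(1, kmax + 1):
        total = sum(l * p * pz[k - l] for l, p in pmf_y.items() if l <= k)
        pz[k] = lam / k * total
    return pz


if __name__ == "__main__":
    pmf_y = score_pmf(theta=0.25, L=5, M=15)
    pz = compound_poisson_pmf(lam=0.8, pmf_y=pmf_y, kmax=60)
    print(sum(pmf_y.values()), sum(pz))
```

Critical scores at a chosen significance level can then be read off from the upper tail of the resulting distribution.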

Page 14: Poisson Approximation for Palindrome Distributions in DNA

Compound Poisson Approximation (CPA)

Kolmogorov distance between random variables:

\[ d_K(X, Y) = \sup_l \left| P(X \le l) - P(Y \le l) \right| \]

Values of $d_K$ between the CPA and empirical PLS distributions from simulated sequences using Markov models of orders 0 to 3:

            M0       M1       M2       M3       BIC
minimum     0.00246  0.00251  0.00103  0.00190  0.00275
maximum     0.01868  0.02683  0.01996  0.02743  0.02743
mean        0.00799  0.00939  0.00878  0.00778  0.00788
std. dev.   0.00362  0.00457  0.00504  0.00578  0.00569
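For two distributions on the nonnegative integers, the Kolmogorov distance is the largest gap between their cumulative distribution functions. The sketch below shows one way to compute it from two pmf lists indexed by score value; the function name and input format are illustrative, and this is not the code that produced the table.

```python
from itertools import accumulate


def kolmogorov_distance(pmf_x, pmf_y):
    """sup over l of |P(X <= l) - P(Y <= l)| for pmfs given as lists on 0, 1, 2, ..."""
    n = max(len(pmf_x), len(pmf_y))
    # Pad the shorter pmf with zeros so both cover the same support.
    px = list(pmf_x) + [0.0] * (n - len(pmf_x))
    py = list(pmf_y) + [0.0] * (n - len(pmf_y))
    return max(abs(a - b) for a, b in zip(accumulate(px), accumulate(py)))


if __name__ == "__main__":
    print(kolmogorov_distance([0.5, 0.5], [0.25, 0.25, 0.5]))
```

In the setting of the table, one list would hold the compound Poisson probabilities and the other the empirical frequencies of window scores from simulated sequences.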

Page 15: Poisson Approximation for Palindrome Distributions in DNA

Compound Poisson Approximation of Palindrome Length Score in Herpesviruses

Table 5. Windows with scores exceeding the critical score at 5% for the BIC scheme. Column 2 shows the Markov model selected for each virus using the BIC. Rows on upper half list viruses with known replication origins, those on lower half without. Entries in bold indicate that window score is also significantly high at 1%. Underlined entries indicate that window is within 2 mu of some known ORI.

Virus    Model  L used  Mid point of Window
bohv1    M3     6       105901, 113401, 124501, 132301, 77401, 87601, 82801, 30901, 35701
bohv4    M2     5
bohv5    M3     6       78001, 108301, 134701, 19201, 36901, 6601, 33901, 61801, 84301, 31501, 67501
cehv1    M3     6       133001, 149451, 61601, 113051, 125301, 701, 102201, 109901, 32551, 36401, 50751, 117601
cehv2    M3     6       129501, 144201, 61601, 123551, 22051, 75951, 107101, 92401
cehv7    M2     5
cehv16   M3     6       118301, 8751, 21001, 37801, 137201, 33251, 154001
ebv      M3     5       7601, 141201, 41201
ebv2     M3     5       7601, 142001, 41201
ehv1     M3     5       116201, 146651
ehv4     M2     5
gahv1    M2     5
hcmv     M3     5       94051
hhv6     M2     5
hhv6b    M2     5       90401
hhv7     M3     5
hsv1     M3     6       62301, 129851, 148401, 72801
hsv2     M3     6       74551, 7351, 119701, 28001, 128801, 152951
rcmv     M3     5       75901
shv1     M3     6       37801, 58451, 93101, 30451, 85051, 78751, 124601, 75251
vzv      M2     5       119401, 110101

Page 16: Poisson Approximation for Palindrome Distributions in DNA

Prediction Performance of PLS with CPA

              PLS             CPA             CPA+
              (10 windows)    0.01    0.05    0.01    0.05
Sensitivity   67%             48%     52%     65%     65%
PPV           15%             37%     31%     37%     31%

Page 17: Poisson Approximation for Palindrome Distributions in DNA

Further Questions

• Can the compound Poisson approximation be generalized to other scoring schemes, e.g., the base weighted scheme (BWS), which gives a higher score to palindromes that have lower probabilities of occurring at random?

• Replication origin prediction for other DNA genomes (viral, bacterial, and eukaryotic)?

• Other sequence features important to replication origin predictions?

Page 18: Poisson Approximation for Palindrome Distributions in DNA

Acknowledgments

Collaborators
David Chew, University of Southern California
Kwok Pui Choi, National University of Singapore
Raul Cruz-Cano, Texas A&M University at Texarkana
Deepak Chandran, University of Washington

Funding Support
Texas Advanced Research Program: 003661-0008-2006, 003661-0013-2007
National Science Foundation: DMS0800272
National Institutes of Health: 1T36GM078000-01, S06GM08012-35, 5G12RR008124-11, 1R01AI077413