Poisson Approximation for Palindrome Distributions in DNA Viral Genomes
Ming-Ying Leung, Ph.D.
Director, Bioinformatics Program
Professor, Department of Mathematical Sciences
The University of Texas at El Paso
El Paso, Texas, U.S.A.
Law of Small Numbers (Poisson, 1837)
Let $X_1, X_2, \ldots, X_n$ be i.i.d. Bernoulli random variables with success probability $p$, and let $W = \sum_{i=1}^{n} X_i$.
If $n \to \infty$ and $p \to 0$ in such a way that $np = \lambda > 0$, then for $k = 0, 1, 2, \ldots$
$$P(W = k) = \binom{n}{k} p^k (1-p)^{n-k} \to e^{-\lambda} \frac{\lambda^k}{k!} = P(Y = k),$$
where $Y$ is a Poisson random variable with parameter $\lambda$.
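As a numerical illustration of the limit above, the following minimal sketch compares the Binomial(n, λ/n) and Poisson(λ) probability mass functions and reports their total variation distance for growing n. The function names and the choice λ = 2 are illustrative assumptions, not part of the original slides.

```python
import math

def binomial_pmf(n, p, k):
    """P(W = k) for W ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(lam, k):
    """P(Y = k) for Y ~ Poisson(lam), also in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def total_variation(n, lam):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam).

    Assumes n > lam so that p = lam/n lies strictly between 0 and 1.
    """
    p = lam / n
    diff = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
               for k in range(n + 1))
    # Poisson mass above n has no binomial counterpart; add it to the difference.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)))
    return 0.5 * (diff + poisson_tail)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, total_variation(n, 2.0))
```

The printed distances should shrink as n grows with np held fixed, which is the behavior the law of small numbers describes.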
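The Chen-Stein Method (Chen, 1970)
Let $I$ be an index set. For any $\alpha \in I$, $X_\alpha$ is a Bernoulli random variable with success probability $p_\alpha$, and $B(\alpha)$ is a subset of $I$ containing $\alpha$, called the neighborhood of dependence of $\alpha$. Let $W = \sum_{\alpha \in I} X_\alpha$ and let $Y$ be a Poisson random variable with parameter $\lambda = \sum_{\alpha \in I} p_\alpha$.
Define
$$b_1 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha)} p_\alpha p_\beta,$$
$$b_2 = \sum_{\alpha \in I} \sum_{\beta \in B(\alpha) \setminus \{\alpha\}} p_{\alpha\beta}, \quad \text{where } p_{\alpha\beta} = E[X_\alpha X_\beta],$$
$$b_3 = \sum_{\alpha \in I} s_\alpha, \quad \text{where } s_\alpha = E\left| E[X_\alpha - p_\alpha \mid X_\beta, \beta \notin B(\alpha)] \right|;$$
then
$$d_{TV}(W, Y) \le (b_1 + b_2)\,\frac{1 - e^{-\lambda}}{\lambda} + b_3 \min\left(1, \sqrt{\frac{2}{e\lambda}}\right).$$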
DNA
• DNA is deoxyribonucleic acid, made up of 4 nucleotide bases
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)
• The bases A and T form a complementary pair, as do C and G.
Replication Origins and Palindromes
• High concentration of palindromes exists around replication origins of other herpesviruses
• Locating clusters of palindromes (above a minimal length) on the CMV genome sequence might reveal likely locations of its replication origins.
Palindromes in Letter Sequences
Odd Palindrome: “A nut for a jar of tuna”
Remove spaces and capitalize: ANUTFORA J AROFTUNA
Even Palindrome: “Step on no pets”
Remove spaces and capitalize: STEPON NOPETS
DNA Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A DNA palindrome must be even in length, e.g., a palindrome of length 10:
5’ ….. GCAATATTGC ….. 3’
3’ ….. CGTTATAACG ….. 5’
Positions: $j-L+1, \ldots, j, j+1, \ldots, j+L$
Bases: $b_1\, b_2 \ldots b_L\, b_{L+1} \ldots b_{2L-1}\, b_{2L}$
We say that a palindrome of length $2L$ occurs at position $j$ when the $(j-i+1)$th and the $(j+i)$th bases are complementary to each other for $i = 1, \ldots, L$. In an i.i.d. sequence model this occurs with probability $[2(p_A p_T + p_C p_G)]^L$.
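The definition above translates directly into a position-by-position check. The sketch below is a minimal illustration under that definition; the function names `is_palindrome_at` and `palindrome_probability` are hypothetical and chosen only for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindrome_at(seq, j, L):
    """True if a palindrome of length 2L occurs at 1-based position j,
    i.e. bases j-i+1 and j+i are complementary for i = 1, ..., L."""
    if j < L or j + L > len(seq):
        return False  # the 2L bases would run off the sequence
    return all(COMPLEMENT.get(seq[j - i]) == seq[j + i - 1]
               for i in range(1, L + 1))

def palindrome_probability(p_a, p_c, p_g, p_t, L):
    """Occurrence probability [2(pA pT + pC pG)]^L under the i.i.d. model."""
    theta = 2.0 * (p_a * p_t + p_c * p_g)
    return theta ** L

if __name__ == "__main__":
    print(is_palindrome_at("GCAATATTGC", 5, 5))            # the slide's example
    print(palindrome_probability(0.25, 0.25, 0.25, 0.25, 5))
```

With equal base frequencies each pair is complementary with probability 1/4, so a palindrome of length 10 occurs at a given position with probability (1/4)^5.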
Poisson Process Approximation of Palindrome Distribution
Let $\Xi$ be the process representing the palindrome occurrences on a random nucleotide sequence generated by the i.i.d. model, and let $Z_\lambda$ be the Poisson process with rate $\lambda$.
Proposition (Leung et al. 2005, J. Computational Biology): Assume $p_A = p_T$ and $p_C = p_G$, and suppose that $n, L \to \infty$ such that $n\theta^L = \lambda$, where $\lambda \ge 1/32$ is a fixed positive constant. Then
$$d_2(\Xi, Z_\lambda) \le c L \theta^{L/2} \to 0.$$
Here $\theta = 2(p_A p_T + p_C p_G)$ is the probability that two bases are complementary, $d_2$ stands for the Wasserstein distance, and $c$ is a constant $\le 131$.
The Scan Statistic
$X_1, X_2, \ldots, X_n \sim$ i.i.d. Uniform(0,1)
$S_i = X_{(i+1)} - X_{(i)}$ = $i$th spacing
$A_r(i) = S_i + \cdots + S_{i+r-1}$ = sum of $r$ adjoining spacings
$r$-Scan Statistic: $A_r = \min_i A_r(i)$
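The r-scan statistic defined above can be computed from the sorted points in a few lines. This is a minimal sketch; the function name `r_scan_statistic` is a hypothetical choice for illustration.

```python
def r_scan_statistic(points, r):
    """Minimum over i of the sum of r adjoining spacings of the sorted points."""
    xs = sorted(points)
    spacings = [b - a for a, b in zip(xs, xs[1:])]  # S_i = X_(i+1) - X_(i)
    if r < 1 or r > len(spacings):
        raise ValueError("r must be between 1 and the number of spacings")
    # A_r(i) = S_i + ... + S_(i+r-1); the statistic is the smallest such sum.
    return min(sum(spacings[i:i + r]) for i in range(len(spacings) - r + 1))

if __name__ == "__main__":
    print(r_scan_statistic([0.1, 0.5, 0.55, 0.6, 0.9], 2))
```

A small value of the statistic signals that r + 1 points fall unusually close together, which is how a cluster of palindrome positions would show up.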
Palindrome Length Score (PLS)
Chew et al. (Nucleic Acids Res 33: e134, 2005):
(1) Identify all palindromes of length at least 2L (using EMBOSS).
(2) Score a fully extended palindrome: if its length is 2s, then this palindrome is given a score s.
(3) Window score: W_i is defined as the sum of the scores of all the palindromes whose centers lie in the ith window.
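The three steps above can be sketched in plain Python. This is not the EMBOSS-based pipeline from the slides: the palindrome search is a simple stand-in, the windows are assumed to be non-overlapping with a hypothetical length `w`, and both function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def extended_palindromes(seq, L):
    """Return (j, s) for every fully extended palindrome with half-length s >= L.

    The center lies between 1-based positions j and j+1; the palindrome is
    extended outward for as long as the flanking bases stay complementary.
    """
    n = len(seq)
    found = []
    for j in range(1, n):
        s = 0
        while (j - s >= 1 and j + s + 1 <= n
               and COMPLEMENT.get(seq[j - s - 1]) == seq[j + s]):
            s += 1
        if s >= L:
            found.append((j, s))
    return found

def window_scores(seq_len, palindromes, w):
    """Sum the scores s of palindromes whose center falls in each window.

    Assumption: windows are consecutive, non-overlapping blocks of length w.
    """
    n_windows = -(-seq_len // w)  # ceiling division
    scores = [0] * n_windows
    for j, s in palindromes:
        scores[(j - 1) // w] += s
    return scores

if __name__ == "__main__":
    seq = "GCAATATTGC"
    pals = extended_palindromes(seq, 2)
    print(pals, window_scores(len(seq), pals, 5))
```

Scoring only fully extended palindromes means that a long palindrome contributes once, with its full half-length, rather than once for every shorter palindrome nested inside it.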
Nonparametric approach with PLS to predict replication origins in herpesviruses:
Sensitivity = 67%
Positive Predictive Value (PPV) = 15%
Compound Poisson Random Variable
Let $N$ be a Poisson random variable with mean $\lambda = w\theta^L$, where
$w$ = window length
$\theta = 2(p_A p_T + p_C p_G)$
$L$ = minimal palindrome half-length (palindromes of length at least $2L$ are scored)
PLS for a window is defined as
$$Z = \sum_{j=1}^{N} Y_j$$
where
$N$ = number of (fully extended) palindromes
$Y_j$ = score given to the $j$th palindrome
Compound Poisson Random Variable Cont’d
With an i.i.d. model for the nucleotide sequence, the probability mass function of $Y_j$ can be written as
$$p_Y(l) = \begin{cases} (1-\theta)\,\theta^{\,l-L} & \text{if } L \le l < M, \\ \theta^{\,M-L} & \text{if } l = M, \end{cases}$$
where $M$ is a prescribed upper bound for palindrome lengths used. ($M = 3L$ for the herpes dataset.)
Then the probability mass function of $Z$ is computed using the recursive formula derived from Stein’s identity (Barbour et al. 1992):
$$P(Z = k) = \frac{\lambda}{k} \sum_{l=1}^{k} l\, p_Y(l)\, P(Z = k - l).$$
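The recursion above can be run directly once a starting value is fixed. The sketch below assumes P(Z = 0) = e^{-λ}, which holds because every score is at least L ≥ 1, so Z = 0 only when no palindrome occurs. The function names and the parameter values in the demo (θ = 0.25, L = 5, M = 15, w = 1000) are illustrative assumptions.

```python
import math

def score_pmf(theta, L, M):
    """pmf of a single palindrome score Y: geometric on L..M-1, remainder at M."""
    pmf = {l: (1.0 - theta) * theta ** (l - L) for l in range(L, M)}
    pmf[M] = theta ** (M - L)
    return pmf

def compound_poisson_pmf(lam, p_y, k_max):
    """P(Z = k) for k = 0..k_max via P(Z=k) = (lam/k) * sum_l l p_Y(l) P(Z=k-l)."""
    P = [math.exp(-lam)] + [0.0] * k_max  # assumed start: P(Z = 0) = exp(-lam)
    for k in range(1, k_max + 1):
        P[k] = (lam / k) * sum(l * p_y.get(l, 0.0) * P[k - l]
                               for l in range(1, k + 1))
    return P

if __name__ == "__main__":
    theta, L, M, w = 0.25, 5, 15, 1000
    lam = w * theta ** L
    pmf_z = compound_poisson_pmf(lam, score_pmf(theta, L, M), 60)
    print(pmf_z[0], pmf_z[5], sum(pmf_z))
```

A critical score at a given significance level can then be read off as the smallest k whose upper tail probability falls below that level.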
Compound Poisson Approximation (CPA)
Kolmogorov distance between random variables: $d_K(X, Y) = \sup_l |P(X \le l) - P(Y \le l)|$
Values of $d_K$ between the CPA and empirical PLS distributions from simulated sequences using Markov models of orders 0 to 3:
            M0        M1        M2        M3        BIC
minimum     0.00246   0.00251   0.00103   0.00190   0.00275
maximum     0.01868   0.02683   0.01996   0.02743   0.02743
mean        0.00799   0.00939   0.00878   0.00778   0.00788
std. dev.   0.00362   0.00457   0.00504   0.00578   0.00569
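For integer-valued scores, the Kolmogorov distance reduces to the largest gap between two cumulative sums. The sketch below is a minimal illustration; `empirical_pmf` and `kolmogorov_distance` are hypothetical helper names, and pmfs are assumed to be lists indexed by score 0, 1, 2, ….

```python
from itertools import accumulate

def empirical_pmf(samples, k_max):
    """Relative frequencies of integer samples on 0..k_max (all samples <= k_max)."""
    counts = [0] * (k_max + 1)
    for s in samples:
        counts[s] += 1
    return [c / len(samples) for c in counts]

def kolmogorov_distance(pmf_x, pmf_y):
    """max over l of |P(X <= l) - P(Y <= l)| for pmfs given as lists on 0, 1, 2, ..."""
    n = max(len(pmf_x), len(pmf_y))
    px = list(pmf_x) + [0.0] * (n - len(pmf_x))  # pad the shorter pmf with zeros
    py = list(pmf_y) + [0.0] * (n - len(pmf_y))
    return max(abs(a - b) for a, b in zip(accumulate(px), accumulate(py)))

if __name__ == "__main__":
    simulated_scores = [0, 1, 1, 3]  # toy data, not from the study
    print(kolmogorov_distance(empirical_pmf(simulated_scores, 3),
                              [0.25, 0.25, 0.25, 0.25]))
```

In the setting of the table, one argument would be the compound Poisson pmf and the other the empirical pmf of window scores from simulated sequences.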
Compound Poisson Approximation of Palindrome Length Score in Herpesviruses
Table 5. Windows with scores exceeding the critical score at 5% for the BIC scheme. Column 2 shows the Markov model selected for each virus using the BIC. Rows on upper half list viruses with known replication origins, those on lower half without. Entries in bold indicate that window score is also significantly high at 1%. Underlined entries indicate that window is within 2 mu of some known ORI.
Virus    Model  L used  Mid point of Window
bohv1    M3     6       105901,113401,124501,132301,77401,87601,82801,30901,35701
bohv4    M2     5
bohv5    M3     6       78001,108301,134701,19201,36901,6601,33901,61801,84301,31501,67501
cehv1    M3     6       133001,149451,61601,113051,125301,701,102201,109901,32551,36401,50751,117601
cehv2    M3     6       129501,144201,61601,123551,22051,75951,107101,92401
cehv7    M2     5
cehv16   M3     6       118301,8751,21001,37801,137201,33251,154001
ebv      M3     5       7601,141201,41201
ebv2     M3     5       7601,142001,41201
ehv1     M3     5       116201,146651
ehv4     M2     5
gahv1    M2     5
hcmv     M3     5       94051
hhv6     M2     5
hhv6b    M2     5       90401
hhv7     M3     5
hsv1     M3     6       62301,129851,148401,72801
hsv2     M3     6       74551,7351,119701,28001,128801,152951
rcmv     M3     5       75901
shv1     M3     6       37801,58451,93101,30451,85051,78751,124601,75251
vzv      M2     5       119401,110101
Prediction Performance of PLS with CPA
              PLS (10 windows)   CPA 0.01   CPA 0.05   CPA+ 0.01   CPA+ 0.05
Sensitivity   67%                48%        52%        65%         65%
PPV           15%                37%        31%        37%         31%
Further Questions
• Can the compound Poisson approximation be generalized to other scoring schemes, e.g., the base weighted scheme (BWS), which gives a higher score to palindromes that are less likely to occur at random?
• Replication origin prediction for other DNA genomes (viral, bacterial, and eukaryotic)?
• Other sequence features important to replication origin predictions?
Acknowledgments
Collaborators
David Chew, University of Southern California
Kwok Pui Choi, National University of Singapore
Raul Cruz-Cano, Texas A&M University at Texarkana
Deepak Chandran, University of Washington
Funding Support
Texas Advanced Research Program: 003661-0008-2006, 003661-0013-2007
National Science Foundation: DMS0800272
National Institutes of Health: 1T36GM078000-01, S06GM08012-35, 5G12RR008124-11, 1R01AI077413