CIS: Compound Importance Sampling for Binding Site p-value Estimation
The Hebrew University, Jerusalem, Israel
Yoseph Barash, Gal Elidan, Tommy Kaplan, Nir Friedman
2
Detecting Target Genes
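[Figure: a promoter upstream of a gene, with several candidate locations each marked "binding site?"]
Probabilistic framework. Log-odds score:
$$\mathrm{Score}(s_1,\ldots,s_k) = \log \frac{P_M(s_1,\ldots,s_k)}{P_0(s_1,\ldots,s_k)}$$
[Figure: PSSM with rows A, C, G, T and columns 1, 2, …, k]
p[i,c] – probability of letter c at position i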
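A minimal sketch of the log-odds score, assuming a position-independent motif model (PSSM) for P_M and a uniform i.i.d. background for P_0. The function name `log_odds_score` and all matrix values are illustrative and do not come from the slides.

```python
import math

ALPHABET = "ACGT"

def log_odds_score(seq, pssm, background):
    """Return log P_M(seq) / P_0(seq) for a position-independent motif model.

    pssm[i][c] plays the role of p[i,c]: the probability of letter c at
    position i. background[c] is an assumed i.i.d. background distribution.
    """
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal the motif length k")
    return sum(math.log(pssm[i][c] / background[c]) for i, c in enumerate(seq))

# Hypothetical 3-position motif that prefers "ACG" (values are made up).
pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {c: 0.25 for c in ALPHABET}

consensus_score = log_odds_score("ACG", pssm, background)   # positive
off_target_score = log_odds_score("TTT", pssm, background)  # negative
```

Under these assumptions the score is a sum of per-position terms, so scanning a promoter amounts to sliding a window of length k and scoring each window.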
3
Detecting target genes (2)
[Figure: score (y-axis, 7 to 15) along the promoter upstream region (x-axis, positions -180 to -60); two candidate peaks are marked "?"]
4
p-value of Scores
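$$p\text{-value}(s) = P_0(\mathrm{Score} \ge s)$$
[Figure: probability distribution of scores under P0 (x-axis: Score, y-axis: Prob), with a score S marked]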
5
Detecting target genes (3)
Gal4 regulates Gal80
p-value score:
• "Universal"
• Interpretable
• Controls the false positive error rate
[Figure: score (7 to 15) and the corresponding p-value (10^-2 to 10^-7) along the promoter upstream region (positions -180 to -60), with the threshold "Bonferroni corrected p-value ≤ 0.01" marked]
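The Bonferroni-corrected threshold named on this slide can be sketched as follows. This assumes the standard correction (multiply each raw p-value by the number of tests); the function name and the example p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Flag p-values that stay at or below alpha after Bonferroni correction.

    Each raw p-value is multiplied by the number of tests m (capped at 1),
    which is equivalent to comparing the raw p-value with alpha / m.
    """
    m = len(p_values)
    corrected = [min(1.0, p * m) for p in p_values]
    return [(p, q, q <= alpha) for p, q in zip(p_values, corrected)]

# Hypothetical raw p-values for 5 candidate positions; a real scan tests far more.
raw = [1e-7, 3e-4, 0.003, 0.04, 0.5]
results = bonferroni_significant(raw, alpha=0.01)
flags = [significant for _, _, significant in results]
```

Because the correction scales with the number of tested positions, a genome-wide scan pushes the required raw p-values down to very small values.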
6
p-value Estimation
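Problem 1: naïve enumeration is infeasible (#seq = 4^k)
Estimate the p-value by sampling from P0:
$$p\text{-value}(s^*) \approx \frac{1}{n} \sum_{i} \mathbf{1}\{s_i \ge s^*\}$$
sampled scores: s_1 … s_n
[Figure: probability distribution of scores under P0 (x-axis: Score, y-axis: Prob), with the threshold S* marked]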
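A small sketch contrasting exact enumeration with the naive sampling estimate from P0. It assumes a PSSM motif model and a uniform background; the motif values, the threshold and the function names are made up for illustration. Enumeration is only possible here because k = 3.

```python
import itertools
import math
import random

ALPHABET = "ACGT"

def score(seq, pssm, background):
    """Log-odds score of seq under a PSSM versus an i.i.d. background."""
    return sum(math.log(pssm[i][c] / background[c]) for i, c in enumerate(seq))

def exact_p_value(s_star, pssm, background):
    """Enumerate all 4^k sequences; only feasible for very small k."""
    total = 0.0
    for seq in itertools.product(ALPHABET, repeat=len(pssm)):
        if score(seq, pssm, background) >= s_star:
            total += math.prod(background[c] for c in seq)
    return total

def naive_p_value(s_star, pssm, background, n, rng):
    """Fraction of n background samples whose score reaches s_star."""
    letters = list(background)
    weights = [background[c] for c in letters]
    hits = 0
    for _ in range(n):
        seq = rng.choices(letters, weights=weights, k=len(pssm))
        if score(seq, pssm, background) >= s_star:
            hits += 1
    return hits / n

# Hypothetical 3-position motif (values are made up for illustration).
pssm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
background = {c: 0.25 for c in ALPHABET}

s_star = 1.0  # reached only by sequences matching at least 2 of 3 positions
exact = exact_p_value(s_star, pssm, background)
estimate = naive_p_value(s_star, pssm, background, n=20000, rng=random.Random(0))
```

The enumeration loop visits 4^k sequences, which is why it stops being an option at realistic motif lengths, while the sampling loop costs n score evaluations regardless of k.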
7
p-value Estimation
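Problem 2: multiple hypothesis testing requires low p-values (10^-7)
Need ~10^7 attempts to get a sample with p-value < 10^-7
[Figure: probability distribution of scores under P0 (x-axis: Score, y-axis: Prob), with the threshold S* marked]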
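The ~10^7 figure can be checked with a short calculation. This is a sketch that assumes n independent samples from P0 and writes p for the true p-value at S* and \hat{p} for the naive estimate:

```latex
\[
\#\{i : s_i \ge s^*\} \sim \mathrm{Binomial}(n, p), \qquad
E\big[\#\{i : s_i \ge s^*\}\big] = n p, \qquad
\frac{\mathrm{sd}(\hat{p})}{p} = \sqrt{\frac{1 - p}{n p}} .
\]
```

With p = 10^-7, n = 10^7 samples yield a single expected hit and a relative error of about 1; a relative error of 10% would need roughly n = 10^9 samples.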
8
Importance Sampling Approach
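1. "Cheat": sample from Q(s_1…s_k) to get high-scoring samples
2. Get "absolution": weigh each sample
Sampling from P0:
$$p\text{-value}(s^*) \approx \frac{1}{n} \sum_{i} \mathbf{1}\{s_i \ge s^*\}$$
Sampling from Q, with weights w(s_i):
$$p\text{-value}(s^*) \approx \frac{1}{\sum_{i} w(s_i)} \sum_{i} w(s_i)\, \mathbf{1}\{s_i \ge s^*\}$$
Empirical p-value ~ 10^-8
N ~ 10^4
[Figure: probability distributions of scores (x-axis: Score, y-axis: Prob), with the threshold S* marked]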
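A minimal sketch of the two steps above, assuming a position-independent motif, a uniform background P0, weights w = P0/Q, and the motif itself as the sampling distribution Q. The slides do not fix these choices, and all names and numbers below are illustrative.

```python
import itertools
import math
import random

ALPHABET = "ACGT"

def seq_prob(seq, columns):
    """Probability of seq under a position-independent model (one dict per position)."""
    return math.prod(col[c] for col, c in zip(columns, seq))

def sample_seq(columns, rng):
    """Draw one sequence, position by position, from a position-independent model."""
    return [rng.choices(ALPHABET, weights=[col[c] for c in ALPHABET])[0]
            for col in columns]

def log_odds(seq, motif, p0):
    return math.log(seq_prob(seq, motif) / seq_prob(seq, p0))

def importance_p_value(s_star, motif, p0, q, n, rng):
    """Self-normalized importance sampling estimate of P0(Score >= s_star)."""
    weighted_hits = 0.0
    weight_sum = 0.0
    for _ in range(n):
        seq = sample_seq(q, rng)                  # 1. "cheat": sample from Q
        w = seq_prob(seq, p0) / seq_prob(seq, q)  # 2. "absolution": w = P0 / Q
        weight_sum += w
        if log_odds(seq, motif, p0) >= s_star:
            weighted_hits += w
    return weighted_hits / weight_sum

def exact_p_value(s_star, motif, p0):
    """Reference value by enumeration; only feasible for tiny k."""
    return sum(seq_prob(seq, p0)
               for seq in itertools.product(ALPHABET, repeat=len(motif))
               if log_odds(seq, motif, p0) >= s_star)

# Hypothetical 3-position motif and uniform background (values are made up).
motif = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
p0 = [{c: 0.25 for c in ALPHABET} for _ in motif]

s_star = 3.0  # reached only by the consensus sequence "ACG"
exact = exact_p_value(s_star, motif, p0)
estimate = importance_p_value(s_star, motif, p0, q=motif, n=20000,
                              rng=random.Random(1))
```

Sampling from the motif makes high-scoring sequences common, and the weights shrink their contribution back to what it would be under P0.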
9
Why is this allowed?
x = subsequence
Desired estimate: expectation of log-odds
$$E_{P_0}[f(x)] = \sum_x f(x)\,P_0(x)$$
Sample from P0(x) and count
Multiply and divide by Q(x):
$$E_{P_0}[f(x)] = \sum_x f(x)\,\frac{P_0(x)}{Q(x)}\,Q(x) = E_Q[f(x)\,W(x)], \qquad W(x) = \frac{P_0(x)}{Q(x)}$$
Importance sampling: sample from Q(x) and reweight
How to choose Q?
10
Choosing Sampling Distribution
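[Figure: score densities (x-axis: Score, y-axis: Density) of candidate sampling distributions Q1 = Background, Q5, and Q10 = Motif, with an under-sampled region marked]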
11
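Choosing Sampling Distribution
Rescale; Combine
Comprehensive coverage
Sampling distribution:
$$Q(s_1 \ldots s_k) = \sum_i \alpha(i)\, Q_i(s_1 \ldots s_k)$$
where α(i) is the mixing ratio of component Q_i
[Figure: score densities of the rescaled components and of their combination (x-axis: Score, y-axis: Density)]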
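A sketch of sampling from a mixture of distributions that range from background to motif. The slides do not say how the intermediate components are built; the geometric interpolation in `tempered`, the uniform mixing ratios, and all names and values here are assumptions for illustration only.

```python
import itertools
import math
import random

ALPHABET = "ACGT"

def tempered(motif, p0, t):
    """Assumed component: per position, proportional to p0^(1-t) * motif^t."""
    columns = []
    for m_col, b_col in zip(motif, p0):
        raw = {c: b_col[c] ** (1.0 - t) * m_col[c] ** t for c in ALPHABET}
        z = sum(raw.values())
        columns.append({c: v / z for c, v in raw.items()})
    return columns

def seq_prob(seq, columns):
    return math.prod(col[c] for col, c in zip(columns, seq))

def mixture_prob(seq, components, alphas):
    """Q(seq) = sum_i alpha_i * Q_i(seq)."""
    return sum(a * seq_prob(seq, q) for a, q in zip(alphas, components))

def sample_mixture(components, alphas, rng):
    """Pick a component by its mixing ratio, then sample a sequence from it."""
    q = rng.choices(components, weights=alphas)[0]
    return [rng.choices(ALPHABET, weights=[col[c] for c in ALPHABET])[0]
            for col in q]

def log_odds(seq, motif, p0):
    return math.log(seq_prob(seq, motif) / seq_prob(seq, p0))

def mixture_p_value(s_star, motif, p0, components, alphas, n, rng):
    """Self-normalized estimate; weights use the full mixture density Q."""
    weighted_hits = weight_sum = 0.0
    for _ in range(n):
        seq = sample_mixture(components, alphas, rng)
        w = seq_prob(seq, p0) / mixture_prob(seq, components, alphas)
        weight_sum += w
        if log_odds(seq, motif, p0) >= s_star:
            weighted_hits += w
    return weighted_hits / weight_sum

# Hypothetical motif, uniform background, five components from t=0 to t=1.
motif = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
p0 = [{c: 0.25 for c in ALPHABET} for _ in motif]
components = [tempered(motif, p0, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
alphas = [0.2] * 5

exact = sum(seq_prob(s, p0) for s in itertools.product(ALPHABET, repeat=3)
            if log_odds(s, motif, p0) >= 3.0)
estimate = mixture_p_value(3.0, motif, p0, components, alphas, n=20000,
                           rng=random.Random(2))
```

Because the background is one of the components, the weight P0/Q can never exceed 1/α for that component (5 in this example), which keeps any single sample from dominating the estimate.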
12
PSSM Example
[Figure: p-value (0 to 6e-5) vs. score (10 to 22) for a PSSM, comparing Naive sampling (10,000,000 samples), MAST (Bailey et al. 98), Normal approximation, and CIS (40,000 samples)]
What if we want something else?
13
Beyond PSSM Models
Dependency models - many possible variants: trees, mixtures of PSSMs, mixtures of trees, etc.
Suggested by several recent papers: Barash et al. (2003), King & Roth (2003), Zhou & Liu (2004), …
Main point: capturing dependencies between binding site positions improves site predictions
Tree example: [Figure: tree over positions X1 X2 X3 X4 X5]
Challenge: compute p-values for general models
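A sketch of scoring under a tree-structured dependency model over five positions. The tree shape (X2 and X3 depend on X1, X5 depends on X4), the probability tables and the function names are hypothetical; the slide's actual tree is not recoverable from the transcript.

```python
import itertools
import math

ALPHABET = "ACGT"

def tree_log_prob(seq, parents, root_tables, cond_tables):
    """log P(seq) for a tree model: product over i of P(x_i | x_parent(i))."""
    total = 0.0
    for i, c in enumerate(seq):
        parent = parents[i]
        if parent is None:
            total += math.log(root_tables[i][c])
        else:
            total += math.log(cond_tables[i][seq[parent]][c])
    return total

def tree_log_odds(seq, parents, root_tables, cond_tables, background):
    """Log-odds of the tree model against an i.i.d. background."""
    log_p0 = sum(math.log(background[c]) for c in seq)
    return tree_log_prob(seq, parents, root_tables, cond_tables) - log_p0

# Hypothetical structure: positions 0 and 3 are roots, the rest copy a parent.
parents = [None, 0, 0, None, 3]
root = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}
copy_parent = {a: {c: (0.7 if c == a else 0.1) for c in ALPHABET}
               for a in ALPHABET}
root_tables = {0: root, 3: root}
cond_tables = {1: copy_parent, 2: copy_parent, 4: copy_parent}
background = {c: 0.25 for c in ALPHABET}

score_consistent = tree_log_odds("AAAAA", parents, root_tables, cond_tables, background)
score_broken = tree_log_odds("ACAAA", parents, root_tables, cond_tables, background)
```

The score no longer decomposes into independent per-position terms, which is what rules out methods that rely on that decomposition, while sampling-based estimates only need to evaluate and sample from the model.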
14
Tree Model Example
X "Naïve" sampling: not efficient
X MAST (Bailey et al. 98): not applicable
X Normal approximation: not accurate
[Figure: p-value (0 to 1e-4) vs. score (10 to 20) for a tree model, comparing Naive sampling (10,000,000 samples), Normal approximation, and CIS (40,000 samples)]
15
Decreased Estimator Variability
[Figure: p-value (0 to 1e-4) vs. score (10 to 20) over 10 repeats of sampling, comparing Naive sampling (10 x 10,000,000 samples), Normal approximation, and CIS (10 x 40,000 samples)]
16
CIS - Summary
√ General form – wide range of probabilistic models
√ Computationally efficient
√ Handles low p-values accurately
√ Available online, at: http://compbio.cs.huji.ac.il/CIS
17
Thank you
http://compbio.cs.huji.ac.il/CIS
Joint Work with:
Nir Friedman
Gal Elidan
Tommy Kaplan