CIS: Compound Importance Sampling for Binding Site p-value Estimation

The Hebrew University, Jerusalem, Israel

Yoseph Barash, Gal Elidan, Tommy Kaplan, Nir Friedman

Detecting Target Genes

[Figure: a gene and its promoter, with candidate positions along the promoter each marked "binding site?".]

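Probabilistic framework

Log odds Score:

$$\mathrm{Score}(s_1,\ldots,s_k) = \log \frac{P_M(s_1,\ldots,s_k)}{P_0(s_1,\ldots,s_k)}$$

[Figure: PSSM with rows A, C, G, T and columns for positions 1, 2, …, k.]

p[i,c]: probability of letter c at position i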
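The log-odds score above can be sketched in Python for a PSSM. This is a minimal illustration, not the authors' code: the 3-position matrix, the uniform background P0, the natural logarithm, and the names `PSSM`, `BACKGROUND`, `log_odds_score` and `scan` are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT"

# Hypothetical 3-position motif model: PSSM[i][c] plays the role of p[i,c].
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed background model P0: independent, uniform letters.
BACKGROUND = {c: 0.25 for c in ALPHABET}


def log_odds_score(seq, pssm=PSSM, background=BACKGROUND):
    """Return log(P_M(seq) / P_0(seq)) for a sequence as long as the PSSM."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM positions")
    # Positions are independent under both models, so the log-ratio is a sum.
    return sum(math.log(pssm[i][c] / background[c]) for i, c in enumerate(seq))


def scan(promoter, pssm=PSSM, background=BACKGROUND):
    """Score every length-k window of a promoter; returns (offset, score) pairs."""
    k = len(pssm)
    return [
        (i, log_odds_score(promoter[i:i + k], pssm, background))
        for i in range(len(promoter) - k + 1)
    ]
```

Calling `scan` on a promoter string gives one score per offset, which is the kind of per-position score profile the following slides plot.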

Detecting target genes (2)

[Figure: score (7 to 15) at each position of the promoter upstream region (-180 to -60); two candidate sites are marked "?".]

p-value of Scores

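[Figure: distribution of Score under P0 (Prob vs. Score), with an observed score S marked.]

$$\text{p-value}(s) = P_0(\mathrm{Score} \ge s)$$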

Detecting target genes (3)

p-value score:

• “Universal”

• Interpretable

• Control false positive error rate

[Figure: Gal4 regulates Gal80. Score (7 to 15) and p-value (10^-2 to 10^-7) along the promoter upstream region (positions -180 to -60), with the threshold Bonferroni corrected p-value ≤ 0.01.]

p-value Estimation

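Problem 1: naïve enumeration infeasible, #seq = 4^k

[Figure: distribution of Score under P0 (Prob vs. Score), with the threshold S* marked.]

Estimate the p-value by sampling from P0:

$$\text{p-value}(s^*) \approx \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{s_i \ge s^*\}$$

sampled scores: s1…sn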
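A minimal sketch of the two options on this slide, exact enumeration over all 4^k sequences and the naive sampling estimate. The toy 3-position PSSM, the uniform background P0 and the function names are assumptions for illustration only.

```python
import itertools
import math
import random

ALPHABET = "ACGT"

# Hypothetical 3-position PSSM; PSSM[i][c] plays the role of p[i,c].
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def score(seq):
    """Log-odds of seq under the PSSM versus a uniform background (assumed P0)."""
    return sum(math.log(PSSM[i][c] / 0.25) for i, c in enumerate(seq))


def exact_p_value(s_star):
    """Enumerate all 4^k sequences; only feasible for very small k."""
    k = len(PSSM)
    hits = sum(
        score("".join(x)) >= s_star
        for x in itertools.product(ALPHABET, repeat=k)
    )
    # Under a uniform background every sequence has probability 1 / 4^k.
    return hits / 4 ** k


def naive_p_value(s_star, n, seed=0):
    """Fraction of n sequences sampled from P0 whose score reaches s_star."""
    rng = random.Random(seed)
    k = len(PSSM)
    hits = 0
    for _ in range(n):
        seq = "".join(rng.choice(ALPHABET) for _ in range(k))
        hits += score(seq) >= s_star
    return hits / n
```

For k = 3 both routes are cheap and should agree closely; the enumeration loop grows as 4^k, which is why it stops being an option for realistic motif lengths.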

p-value Estimation

Problem 2: multiple hypothesis testing, low p-values (10^-7)

Need ~10^7 attempts to get a sample with p-value < 10^-7

[Figure: distribution of Score under P0 (Prob vs. Score), with the threshold S* marked.]
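The arithmetic behind this problem can be sketched as follows. The Bonferroni cutoff of 0.01 is from the slides; the figure of 100,000 tested positions is a hypothetical number chosen only because it yields a per-test cutoff of 10^-7, and the function names are illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


def expected_draws_per_hit(p):
    """Mean number of draws from P0 needed to see one sample in a tail of mass p."""
    return 1.0 / p


def naive_relative_std_error(p, n):
    """Relative standard error of the naive estimate hits/n when the true p-value is p."""
    # hits is Binomial(n, p), so Var(hits / n) = p * (1 - p) / n.
    return math.sqrt((1.0 - p) / (n * p))


# Hypothetical setting: 100,000 candidate positions tested at family-wise level 0.01.
per_test = bonferroni_threshold(0.01, 100_000)
draws = expected_draws_per_hit(per_test)
rel_err = naive_relative_std_error(per_test, 10_000_000)
```

Under these assumptions `per_test` is about 10^-7 and `draws` about 10^7. `rel_err` is close to 1, meaning that even 10^7 naive samples leave a relative error of roughly 100% at that p-value.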

Importance Sampling Approach

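1. “Cheat”: sample from Q(s1…sk) to get high-scoring samples

2. Get “absolution”: weigh each sample

[Figure: distribution of Score (Prob vs. Score), with the threshold S* marked.]

Unweighted estimate:

$$\text{p-value}(s^*) \approx \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{s_i \ge s^*\}$$

Weighted estimate:

$$\text{p-value}(s^*) \approx \frac{1}{\sum_i w(s_i)}\sum_i w(s_i)\,\mathbf{1}\{s_i \ge s^*\}$$

Empirical p-value ~ 10^-8

N ~ 10^4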
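The weighted estimate can be sketched in Python with weights w = P0/Q. This is an illustration under stated assumptions rather than the CIS implementation: the toy PSSM, the uniform background, the choice of the motif model itself as Q, and all function names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

# Hypothetical 3-position PSSM, also used here as the sampling distribution Q.
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumed uniform P0


def score(seq):
    """Log-odds of seq under the PSSM versus the background."""
    return sum(math.log(PSSM[i][c] / BACKGROUND[c]) for i, c in enumerate(seq))


def sample_from(columns, rng):
    """Draw one sequence from a product distribution given as per-position dicts."""
    return "".join(
        rng.choices(ALPHABET, weights=[col[c] for c in ALPHABET])[0]
        for col in columns
    )


def prob(seq, columns):
    """Probability of seq under a product distribution."""
    return math.prod(col[c] for col, c in zip(columns, seq))


def importance_p_value(s_star, q_columns, n, seed=0):
    """Self-normalized importance-sampling estimate of P0(Score >= s_star)."""
    rng = random.Random(seed)
    p0_columns = [BACKGROUND] * len(q_columns)
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(n):
        x = sample_from(q_columns, rng)               # "cheat": sample from Q
        w = prob(x, p0_columns) / prob(x, q_columns)  # "absolution": w = P0 / Q
        total_weight += w
        if score(x) >= s_star:
            weighted_hits += w
    return weighted_hits / total_weight
```

For this toy model the exact p-value of the best-scoring sequence is 1/64, so the estimate is easy to check; the benefit over naive sampling only becomes visible for much smaller tail probabilities.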

Why is this allowed?

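x = subsequence

Desired estimate: expectation under P0 of a function f of the log-odds score, e.g. the indicator that the score of x is at least s*

$$E_{P_0}[f(x)] = \sum_x f(x)\, P_0(x)$$

Sample from P0(x) and count

Importance Sampling: multiply and divide by Q(x)

$$\sum_x f(x)\, P_0(x) = \sum_x f(x)\, \frac{P_0(x)}{Q(x)}\, Q(x) = E_Q[f(x)\, W(x)], \qquad W(x) = \frac{P_0(x)}{Q(x)}$$

Sample from Q(x) and reweight (requires Q(x) > 0 wherever f(x) P0(x) > 0)

How to choose Q?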

Choosing Sampling Distribution

[Figure: density vs. score for candidate sampling distributions Q1 = Background, Q5 and Q10 = Motif, with an under-sampled region marked.]

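Choosing Sampling Distribution

Rescale, combine: comprehensive coverage

[Figure: density vs. score of the rescaled component distributions and of the combined sampling distribution.]

Sampling distribution, with mixing ratio λ(i):

$$Q(s_1 \ldots s_k) = \sum_i \lambda(i)\, Q_i(s_1 \ldots s_k)$$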
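A sketch of sampling from such a mixture and reweighting by P0/Q. The slides do not spell out how the components Qi are rescaled, so the geometric interpolation between background and motif used in `tempered_columns` is an assumption, as are the equal mixing ratios, the toy PSSM and all function names.

```python
import math
import random

ALPHABET = "ACGT"
PSSM = [  # hypothetical 3-position motif model
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumed uniform P0


def score(seq):
    """Log-odds of seq under the PSSM versus the background."""
    return sum(math.log(PSSM[i][c] / BACKGROUND[c]) for i, c in enumerate(seq))


def tempered_columns(t):
    """Assumed rescaling: t = 0 gives the background, t = 1 gives the motif."""
    columns = []
    for col in PSSM:
        raw = {c: BACKGROUND[c] ** (1 - t) * col[c] ** t for c in ALPHABET}
        z = sum(raw.values())
        columns.append({c: v / z for c, v in raw.items()})
    return columns


def prob(seq, columns):
    """Probability of seq under a product distribution."""
    return math.prod(col[c] for col, c in zip(columns, seq))


def sample_from(columns, rng):
    """Draw one sequence from a product distribution."""
    return "".join(
        rng.choices(ALPHABET, weights=[col[c] for c in ALPHABET])[0]
        for col in columns
    )


def compound_p_value(s_star, components, ratios, n, seed=0):
    """Self-normalized estimate of P0(Score >= s_star) using Q = sum_i ratio_i * Q_i."""
    rng = random.Random(seed)
    p0_columns = [BACKGROUND] * len(PSSM)
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(n):
        component = rng.choices(components, weights=ratios)[0]  # pick i by mixing ratio
        x = sample_from(component, rng)                          # sample from Q_i
        q = sum(r * prob(x, c) for r, c in zip(ratios, components))
        w = prob(x, p0_columns) / q                              # weight uses the full mixture
        total_weight += w
        if score(x) >= s_star:
            weighted_hits += w
    return weighted_hits / total_weight


# Three components spanning background to motif, mixed in equal ratios (an assumption).
COMPONENTS = [tempered_columns(t) for t in (0.0, 0.5, 1.0)]
RATIOS = [1 / 3, 1 / 3, 1 / 3]
```

Because the background is one of the components, the weight P0/Q never exceeds 1/λ for that component (here 3), which keeps single samples from dominating the estimate.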

PSSM Example

[Figure: p-value (0 to 6e-5) vs. score (10 to 22) for a PSSM, comparing Naive sampling (10 000 000 samples), MAST (Bailey et al. 98), Normal approximation and CIS (40 000 samples).]

What if we want something else?

Beyond PSSM Models

Main point: capture dependencies between binding site positions, improve site predictions

Suggested by several recent papers: Barash et al. (2003), King & Roth (2003), Zhou & Liu (2004), …

Dependency models - many possible variants: Trees, Mixture of PSSMs, Mixture of Trees, etc.

Tree example: [Figure: tree-structured dependencies over positions X1 X2 X3 X4 X5.]

Challenge: compute p-values for general models
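To make the tree variant concrete, here is a minimal sketch of scoring a sequence under a tree model, where each position depends on at most one parent position. The tree shape in `PARENT`, the root distribution, the conditional tables and the uniform background are all hypothetical; the structure in the slide's figure is not recoverable from the text.

```python
import itertools
import math

ALPHABET = "ACGT"

# Hypothetical tree over 5 positions (0-based): PARENT[i] is the parent of
# position i, or None for the root.
PARENT = [None, 0, 0, 2, 2]
ROOT_DIST = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}


def make_cpt(match=0.7):
    """Hypothetical conditional table: the child repeats its parent's letter with prob. `match`."""
    other = (1 - match) / 3
    return {p: {c: (match if c == p else other) for c in ALPHABET} for p in ALPHABET}


# One conditional table per non-root position (all identical in this toy example).
CPTS = [None] + [make_cpt() for _ in range(4)]


def tree_log_prob(seq):
    """log P_M(seq) = log P(x_root) + sum_i log P(x_i | x_parent(i))."""
    total = 0.0
    for i, c in enumerate(seq):
        parent = PARENT[i]
        if parent is None:
            total += math.log(ROOT_DIST[c])
        else:
            total += math.log(CPTS[i][seq[parent]][c])
    return total


def tree_log_odds(seq, background=0.25):
    """Log-odds score against an assumed uniform background P0."""
    return tree_log_prob(seq) - len(seq) * math.log(background)


def total_mass():
    """Sanity check: probabilities of all 4^5 sequences should sum to 1."""
    return sum(
        math.exp(tree_log_prob("".join(x)))
        for x in itertools.product(ALPHABET, repeat=len(PARENT))
    )
```

The score is no longer a sum of independent per-position terms, which is why methods that rely on that independence do not carry over, while sampling-based estimates only need to draw from and evaluate the model.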

Tree Model Example

[Figure: p-value (0 to 1e-4) vs. score (10 to 20) for a tree model, comparing Naive sampling (10 000 000 samples), Normal approximation and CIS (40 000 samples).]

“Naïve” sampling: X Not efficient

MAST (Bailey et al. 98): X Not applicable

Normal approx.: X Not accurate

Decreased Estimator Variability

[Figure: p-value (0 to 1e-4) vs. score (10 to 20) over 10 repeats of sampling, comparing Naive sampling (10 x 10 000 000 samples), Normal approximation and CIS (10 x 40 000 samples).]

CIS - Summary

√ General form: wide range of probabilistic models

√ Computationally efficient

√ Handles low p-values accurately

√ Available online, at: http://compbio.cs.huji.ac.il/CIS

Thank you

http://compbio.cs.huji.ac.il/CIS

Joint Work with:

Nir Friedman

Gal Elidan

Tommy Kaplan

http://www.cs.huji.ac.il/~nirf/
