Art B. Owen, Stanford University. statweb.stanford.edu/~owen/pubtalks/AdaptiveISweb.pdf

Adaptive Importance Sampling 1

Adaptive Importance Sampling

Art B. Owen

Stanford University

Grid Science Winter School, January 2017


Adaptive Importance Sampling 2

These are slides that I presented at Grid Science 2017 in Los Alamos, NM, on Wednesday, January 11.

I have amended them, correcting typos and adding a few more references. I have also added a few interstitial slides, like this one, to cover things that I either said or should have said during the presentation.

Adaptive Importance Sampling 3

Outline

1) Basic Importance Sampling: often our only chance to solve a problem.

2) Mixture Distributions: the key to effective use of IS. Also an Oscar.

3) What-if Simulations: using data from one sim to learn how another would have worked.

4) Adaptive Importance Sampling: still mostly intuitive and ad hoc.

Adaptive Importance Sampling 4

What is missing

1) Some paywalled articles.

2) Many articles from application areas.

3) Some very advanced / hard to explain developments.

Google Scholar, Dec 2016, "adaptive importance sampling":
About 3,040 results (0.07 sec). Since 2012: About 1,070 results (0.10 sec).

Adaptive Importance Sampling 5

At the meeting, several people told me about the importance sampling work of Ian Dobson on

estimating rare event probabilities for power system cascades. So I’m looking into those now.


Adaptive Importance Sampling 6

The canonical problem

Approximate $\mu = \int_{\mathbb{R}^d} h(x)\,dx$.

Use the first of these that works:

1) Symbolic / closed form solutions

2) Numerical quadrature

3) Plain Monte Carlo

4) Importance Sampling

5) Adaptive Importance Sampling

Notes

• Some problems have $d = \infty$ and/or discrete $x$.

• We may need Markov chain Monte Carlo (MCMC) or sequential Monte Carlo (SMC).

• Sometimes we can use quasi-Monte Carlo (QMC).

Adaptive Importance Sampling 7

There are problems where none of the five approaches on the previous slide will get the answer.

Quasi-Monte Carlo sampling can be used to great effect when the underlying integrands are just a little more regular than plain Monte Carlo requires. With randomized QMC (RQMC), one gets an answer that is seldom worse than plain MC or plain QMC and can be much better. This talk is just about MC.

MCMC and SMC can be brought to bear on problems that are less regular or favorable than plain MC requires. Those methods can handle even harder problems because they involve flexible explorations of the problem space. It can be hard to tell if/when they work.

Adaptive Importance Sampling 8

Closed form

Sometimes we can get a closed form. Symbolic computation (Mathematica, Maple, Sage, etc.) helps.

Unfortunately

We can seldom do this for problems with real-world complexity,

so we resort to numerical methods.


Adaptive Importance Sampling 9

Numerical quadrature

For instance, tensor products of Simpson's rule. Evaluate $h$ at $n$ points $x_i \in \mathbb{R}^d$. Can attain an error rate of $O(n^{-r/d})$ for $h \in C^{(r)}(\mathbb{R}^d)$.

Unfortunately

Low smoothness (small r) or high dimension (large d) make the methods ineffective.


Adaptive Importance Sampling 10

Monte Carlo

Write $h(x) = f(x)p(x)$ for a probability density function $p(\cdot)$. Then
$$\mu = \int h(x)\,dx = \int f(x)p(x)\,dx = \mathbb{E}(f(x)), \quad x \sim p.$$

Estimate
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} f(x_i), \quad x_i \sim p, \qquad \sqrt{n}(\hat\mu - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \ \text{(CLT)}, \qquad \sigma^2 = \mathbb{E}\big((f(x)-\mu)^2\big).$$

Upshot
Very competitive rate $O(n^{-1/2})$ when $r/d < 1/2$. We get confidence intervals too.

Unfortunately
$\dfrac{\sigma/\sqrt{n}}{\mu}$ might be large, or even infinite.

Adaptive Importance Sampling 11

Places where MC has trouble

Rare events
• Bankruptcy / default for finance / insurance
• Energy grid failure, Bent, Backhaus & Yamangil (2014); cascade simulations, Chertkov (2012)
• Product failure, e.g., battery
• Monograph by Rubino & Tuffin (2009)

Spiky/singular integrands

• High energy physics (Feynman diagrams), Aldins, Brodsky, Dufner, Kinoshita (1970): multiple singularities in $[0,1]^7$ that dominate the integral
• Graphical rendering, Kollig & Keller (2006): the integrand includes factors like $\|x - x_0\|^{-1}$
• Particle transport, e.g., T. Booth

Adaptive Importance Sampling 12

Rare events
$$f(x) = 1_A(x) = \begin{cases} 1, & x \in A\\ 0, & \text{else,}\end{cases} \qquad \mu = \mathbb{P}(x \in A) = \int_A p(x)\,dx = \varepsilon$$

Coefficient of variation of $\hat\mu$
$$\frac{\sigma/\sqrt{n}}{\mu} = \frac{\sqrt{\varepsilon(1-\varepsilon)}}{\sqrt{n}\,\varepsilon} \approx \frac{1}{\sqrt{n\varepsilon}}$$

To get cv = 0.1 takes $n > 100/\varepsilon$.

Adaptive Importance Sampling 13

Singularities
$$f(x) \approx \|x - x_0\|^{-r}, \quad x \in A \subset \mathbb{R}^d, \qquad 0 < a \le p(x) \le b < \infty \ \text{for } x \in A.$$

$f$ is integrable over bounded $A$ containing $x_0$ iff $r < d$. Then $f^2$ is integrable iff $2r < d$.

For $r \in [d/2, d)$, $\mu$ exists but $\sigma^2 = \infty$. Monte Carlo converges, but more slowly than $O_p(1/\sqrt{n})$.

Other singularities
$$\|x - x_0\|_p^{-r}, \qquad \min(x_1, x_2, \dots, x_d)^{-r}, \qquad \min(x_1, 1-x_1, x_2, 1-x_2, \dots, x_d, 1-x_d)^{-r}$$

Adaptive Importance Sampling 14

Very often singularities are simply not a big problem for MC, and the interesting thing is that high dimension can make singularities less harmful. This is a 'concentration of measure' result, as described below.

Let $f$ be singular on some subset of the input space. For simplicity, take subsets of the form $\Omega_{d,p}(\rho) = \{x \in \mathbb{R}^d \mid \|x\|_p \le \rho\}$ for integer $d \ge 1$ and $1 \le p \le \infty$, and $f(x) = \|x\|_p^{-r}$. Let $x$ be uniformly distributed on $\Omega_{d,p} = \Omega_{d,p}(1)$ of volume $|\Omega_{d,p}|$. Then $\mu = \int_0^1 y^{d-1-r}\,dy\,|\Omega_{d-1,p}|/|\Omega_{d,p}| = (d-r)^{-1}|\Omega_{d-1,p}|/|\Omega_{d,p}|$. The singularity is severe if, say, half of $\mu$ comes from a very tiny region around 0.

The radius $\rho$ that captures half of the integral satisfies $\int_0^\rho y^{-r+d-1}\,dy = (1/2)\int_0^1 y^{-r+d-1}\,dy$, and it is $\rho = 2^{-1/(d-r)}$. The relative volume is $|\Omega_{d,p}(\rho)|/|\Omega_{d,p}(1)| = \rho^d = 2^{-d/(d-r)}$. If $r$ is close to $d$ then half of the integral comes from a very tiny fraction of the volume. If $r = 1$ and $d$ grows then half of the integral comes from about half of the volume, hence not much of a spike. If $r < d/2$, so the variance is finite, then this volume is larger than $2^{-d/(d-d/2)} = 1/4$. Once again, not a very prominent spike.

For singularities on a $k$-dimensional manifold, replace $d$ by $d - k$.

Conclusion: rare events described by 0/1 random variables can be much harder to handle than unbounded random variables.

Adaptive Importance Sampling 15

Importance sampling

For rare events and (some) singularities there is a region $A$ of relatively small $\mathbb{P}(x \in A)$ that dominates the integral. Taking $x \sim p$ does not get enough data from the important region $A$.

Main idea
Get more data from $A$, and then correct the bias.
Kahn (1950), Kahn & Marshall (1953), Trotter & Tukey (1956)

Good news: importance sampling can solve these sampling problems.
Bad news: it can also fail spectacularly.

For a density $q$,
$$f(x)p(x) = \frac{f(x)p(x)}{q(x)}\,q(x), \qquad \hat\mu_q = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{q(x_i)}, \quad x_i \sim q.$$
For now, assume that $p(x)/q(x)$ is computable.
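As a concrete illustration of the estimator $\hat\mu_q$ above, here is a minimal numerical sketch (not from the talk): it estimates the Gaussian tail probability $\mathbb{P}(X > 4)$ under $p = \mathcal{N}(0,1)$ with the shifted proposal $q = \mathcal{N}(4,1)$; the threshold and sample size are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# p = N(0,1), q = N(4,1): p(x)/q(x) = exp(-4x + 8)
x = rng.normal(4.0, 1.0, size=n)         # x_i ~ q
w = np.exp(-4.0 * x + 8.0)               # importance weights p/q
f = (x > 4.0).astype(float)              # f(x) = 1{x > 4}

mu_hat = np.mean(f * w)                  # IS estimate of P(X > 4), about 3.17e-5
se = np.std(f * w, ddof=1) / np.sqrt(n)  # its standard error
print(mu_hat, se)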

Adaptive Importance Sampling 16

Unbiasedness

If $q(x) > 0$ whenever $f(x)p(x) \ne 0$ then for $x_i \sim q$,
$$\mathbb{E}_q(\hat\mu_q) = \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx = \int f(x)p(x)\,dx = \mu.$$

Sufficient condition
$q(x) > 0$ whenever $p(x) > 0$. Yields unbiasedness for generic $f$.

Notation
$\mathbb{E}_q(\cdot)$, $\mathrm{Var}_q(\cdot)$, etc. are for $x_i \sim q$.
$$w(x) \equiv \frac{p(x)}{q(x)}, \qquad \mathbb{E}_p(g(x)) = \mathbb{E}_q(g(x)w(x)).$$

Adaptive Importance Sampling 17

Variance of IS

Based on O (2013), Ch 9.1.
$$\mathrm{Var}_q(\hat\mu_q) = \frac{\sigma_q^2}{n}, \quad \text{where } \sigma_q^2 = \mathrm{Var}_q\!\big((fp)/q\big),$$
$$\sigma_q^2 = \mathbb{E}_q\!\left(\Big(\frac{fp}{q}\Big)^{2}\right) - \mu^2 = \int \frac{(fp)^2}{q}\,dx - \mu^2.$$

After some algebra
$$\sigma_q^2 = \int \frac{(fp - \mu q)^2}{q}\,dx.$$

Lessons
1) $q(x) \propto f(x)p(x)$ is best, when $f > 0$.
2) $q$ near 0 is dangerous.
3) Bounding $(fp)^2/q$ is useful theoretically.

Adaptive IS is about using sample data to select $q$.

Adaptive Importance Sampling 18

Proportionality

For $f > 0$, taking $q = fp/\mu$ yields $\sigma_q^2 = 0$. Specifically
$$\hat\mu_q = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{q(x_i)} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{f(x_i)p(x_i)/\mu} = \mu,$$
but we need to know $\mu$ to use it. We look for $q$ nearly proportional to $fp$.

More generally
$q \propto |f|p$ is always optimal, but not zero variance if $f$ takes both positive and negative values.

For $f > 0$
$q \propto f$ is not best (unless $p$ is flat).
$q \propto p$ is not best (unless $f$ is flat).
Importance involves both $f$ and $p$, via $fp$.

Adaptive Importance Sampling 19

Example

[Figure: densities $f$, $q$, and $p$ on a common axis.]

$\mu \doteq 1.74 \times 10^{-24}$, $\mathrm{Var}(\hat\mu_q) = 0$.

Gaussian $p$ and $f$ $\implies$ the Gaussian optimal $q$ lies between them.

Adaptive Importance Sampling 20

Here is a positivisation trick from O & Zhou (2000).

If $f$ takes both positive and negative values then we can write
$$\mu = \int f(x)p(x)\,dx = \int f_+(x)p(x)\,dx - \int f_-(x)p(x)\,dx$$
for nonnegative functions $f_+(x) = \max(0, f(x))$ and $f_-(x) = \max(0, -f(x))$. Then we can use
$$\hat\mu = \frac{1}{n_+}\sum_{i=1}^{n_+}\frac{f_+(x_i)p(x_i)}{q_+(x_i)} - \frac{1}{n_-}\sum_{i=1}^{n_-}\frac{f_-(x_i)p(x_i)}{q_-(x_i)}$$
for $x_i$ from $q_\pm$. There then exist zero variance sampling distributions $q_\pm$ that we might approximate. More generally we can use
$$\hat\mu = c + \frac{1}{n_+}\sum_{i=1}^{n_+}\frac{(f(x_i)-c)_+\,p(x_i)}{q_+(x_i)} - \frac{1}{n_-}\sum_{i=1}^{n_-}\frac{(f(x_i)-c)_-\,p(x_i)}{q_-(x_i)}$$
or
$$\hat\mu = \mathbb{E}_p(g(x)) + \frac{1}{n_+}\sum_{i=1}^{n_+}\frac{(f(x_i)-g(x_i))_+\,p(x_i)}{q_+(x_i)} - \frac{1}{n_-}\sum_{i=1}^{n_-}\frac{(f(x_i)-g(x_i))_-\,p(x_i)}{q_-(x_i)}$$
for some $g$ with known expectation.

Adaptive Importance Sampling 21

Tails of q
$$\sigma_q^2 = \int \frac{(fp-\mu q)^2}{q}\,dx = \int \frac{(fp)^2}{q}\,dx - \mu^2$$

The term depending on q
Recall $w(x) = p(x)/q(x)$.
$$\int \frac{(fp)^2}{q}\,dx = \int f^2 w^2 q\,dx = \mathbb{E}_q(f^2w^2) = \int f^2 w\,p\,dx = \mathbb{E}_p(f^2 w)$$

Consequence
$$\mu^2 + \sigma_q^2 = \mathbb{E}_p(f^2 w) = \mathbb{E}_q(f^2 w^2)$$
∴ bounding $w(x) = p(x)/q(x)$ is advantageous for many $f$.

Adaptive Importance Sampling 22

Gaussian p with t-distributed q

Multivariate Gaussian, $\theta$, $\Sigma$:
$$p(x) \propto \exp\big(-(x-\theta)^{\mathsf{T}}\Sigma^{-1}(x-\theta)/2\big)$$
Multivariate t, $\theta$, $\Sigma$, degrees of freedom $\nu > 0$:
$$q(x) \propto \big(1 + (x-\theta)^{\mathsf{T}}\Sigma^{-1}(x-\theta)\big)^{-(\nu+d)/2}$$

The t distribution has heavier tails than the Gaussian:
$\sup_x p(x)/q(x)$ is bounded, and $\sup_x p(x)\exp(\theta^{\mathsf{T}}x)/q(x)$ is bounded.
Geweke (1989): split t (at 0).

Reversing the roles of t and Gaussian leads to unbounded $w$.

Adaptive Importance Sampling 23

q nearly proportional to p

[Figure: nearly proportional densities, a Gaussian and a scaled t (25 df), and their density ratios on a log scale.]

Light tailed $q$ can make $p/q$ diverge. Can have $\sigma_q^2 = \infty$ for $f(x) = 1$.

Ironically, it is the unimportant region that causes trouble.

Adaptive Importance Sampling 24

Beyond variance

Chatterjee & Diaconis (2015) show that the sample size required is roughly $n \approx \exp(\text{KL distance between } p \text{ and } q)$ for generic $f$.

This analysis involves $\mathbb{E}_q(|\hat\mu_q - \mu|)$ and $\mathbb{P}_q(|\hat\mu_q - \mu| > \varepsilon)$ instead of $\mathrm{Var}_q(\hat\mu_q)$. It avoids working with variances, which can be harder to handle than means.

Taking $\varepsilon = .025$ in their Theorem 1.2 (for 95% confidence of a small error) shows that we succeed with $n > 6.55 \times 10^{12} \times \exp(\mathrm{KL})$. Similarly, poor results are very likely for $n$ much smaller than $\exp(\mathrm{KL})$.

The range for $n$ is not precisely determined by these considerations (yet).

Adaptive Importance Sampling 25

Self-normalized importance sampling

Based on O (2013), Ch 9.2.
$$p(x) = c_p\,p_u(x), \quad q(x) = c_q\,q_u(x), \quad 0 < c_p, c_q < \infty$$
$$\hat\mu_q = \frac{c_p}{c_q}\,\frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)\,p_u(x_i)}{q_u(x_i)}.$$

If $p_u$, $q_u$ are computable, but $c_p/c_q$ is unknown:
$$w_u(x_i) = w_{u,q}(x_i) = p_u(x_i)/q_u(x_i), \qquad w_i = w_{i,q} = w_u(x_i)\Big/\sum_{i'=1}^{n} w_u(x_{i'}),$$
$$\tilde\mu_q = \sum_{i=1}^{n} w_i f(x_i).$$
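Here is a minimal self-normalized IS sketch (illustrative, not from the talk): the target is known only through $p_u(x) = e^{-x^2/2}$ (so $p = \mathcal{N}(0,1)$), the proposal is $\mathcal{N}(0,2)$ used through its unnormalized density, and the integrand $f(x) = x^2$ has $\mathbb{E}_p(f) = 1$.

import numpy as np

rng = np.random.default_rng(1)
n = 50_000

x = rng.normal(0.0, np.sqrt(2.0), size=n)   # x_i ~ q = N(0, 2)
p_u = np.exp(-x**2 / 2)                     # unnormalized target density
q_u = np.exp(-x**2 / 4)                     # unnormalized proposal density
w = p_u / q_u
w = w / w.sum()                             # self-normalized weights

f = x**2                                    # E_p[x^2] = 1
mu_tilde = np.sum(w * f)                    # SNIS estimate
ne = 1.0 / np.sum(w**2)                     # effective sample size (see later slides)
print(mu_tilde, ne)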

Adaptive Importance Sampling 26

Self-normalized importance sampling

• $\tilde\mu_q$ has some bias, asymptotically negligible.
• Self-normalized sampling competes with acceptance-rejection.
• It does not attain zero variance for any $q$ (and non-trivial $f$).

Adaptation
There are adaptive versions of both IS and SNIS.

Adaptive Importance Sampling 27

SNIS is consistent

By the Law of Large Numbers,
$$\tilde\mu_q = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,\frac{p_u(x_i)}{q_u(x_i)} \Big/ \frac{1}{n}\sum_{i=1}^{n}\frac{p_u(x_i)}{q_u(x_i)}$$
$$\to \int f(x)\frac{p_u(x)}{q_u(x)}\,q(x)\,dx \Big/ \int \frac{p_u(x)}{q_u(x)}\,q(x)\,dx \quad \text{(LLN)}$$
$$= \frac{c_q}{c_p}\int f(x)p(x)\,dx \Big/ \frac{c_q}{c_p}\int p(x)\,dx = \mu.$$

Recall the weights
$$w_i = \frac{p_u(x_i)}{q_u(x_i)} \Big/ \sum_{i'=1}^{n}\frac{p_u(x_{i'})}{q_u(x_{i'})}.$$

Adaptive Importance Sampling 28

SNIS variance

Delta method variance approximation for a ratio estimate:
$$\mathrm{Var}(\tilde\mu_q) = \frac{\sigma_{q,\mathrm{sn}}^2}{n}, \qquad \sigma_{q,\mathrm{sn}}^2 = \mathbb{E}_q\big(w(x)^2 (f(x)-\mu)^2\big).$$

Rewrite and compare
$$\sigma_{q,\mathrm{sn}}^2 = \int \frac{p^2}{q}(f-\mu)^2\,dx = \int \frac{(fp - \mu p)^2}{q}\,dx \qquad \text{vs} \qquad \sigma_q^2 = \int \frac{(fp - \mu q)^2}{q}\,dx.$$

No $q$ can make $\sigma_{q,\mathrm{sn}}^2 = 0$ (unless $f$ is constant).

Adaptive Importance Sampling 29

Exactness
$$\hat\mu_q = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{q(x_i)}, \qquad \tilde\mu_q = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,\frac{p_u(x_i)}{q_u(x_i)} \Big/ \frac{1}{n}\sum_{i=1}^{n}\frac{p_u(x_i)}{q_u(x_i)}.$$

Constants $f(x) = c$
$\tilde\mu_q = c$, while generally $\hat\mu_q \ne c$.

We don't usually try to average constants, but getting that wrong means:
1) $\hat{\mathbb{P}}(x \in B) + \hat{\mathbb{P}}(x \notin B) \ne 1$, and
2) $\hat{\mathbb{E}}(f(x) + c) \ne \hat{\mathbb{E}}(f(x)) + c$.

Adaptive Importance Sampling 30

Optimal SNIS $q_{\mathrm{opt}}$
$$q_{\mathrm{opt}} \propto |f(x) - \mu|\,p(x) \qquad \text{Hesterberg (1988)}$$
$$\therefore\ \sigma_{q,\mathrm{sn}}^2 \ge \mathbb{E}_p(|f(x)-\mu|)^2.$$

Best possible variance reduction from SNIS
$$\frac{\sigma^2}{\sigma_{q_{\mathrm{opt}},\mathrm{sn}}^2} = \frac{\mathbb{E}_p\big((f(x)-\mu)^2\big)}{\mathbb{E}_p\big(|f(x)-\mu|\big)^2}.$$

Versus plain IS
For $f > 0$, the best IS gets $\sigma_q^2 = 0$.

Adaptive Importance Sampling 31

Rare events
$f(x) = 1_{x\in A}$, $\mu = \mathbb{P}(x \in A) = \varepsilon$.

Optimal q
$$q_{\mathrm{opt}}(x) \propto q_u^{\mathrm{opt}}(x) = |f(x)-\mu|\,p(x) = \begin{cases}(1-\varepsilon)\,p(x), & x \in A\\ \varepsilon\,p(x), & x \notin A.\end{cases}$$

Normalize q
$$\int q_u^{\mathrm{opt}}(x)\,dx = 2\varepsilon(1-\varepsilon), \qquad \int_A q_{\mathrm{opt}}(x)\,dx = \frac{\varepsilon(1-\varepsilon)}{2\varepsilon(1-\varepsilon)} = \frac{1}{2}\ (!!)$$

Versus plain IS
The best plain IS $q = pf/\mu$ puts all probability on $A$, not half.

Adaptive Importance Sampling 32

Rare event efficiency
$$\sigma_{q_{\mathrm{opt}},\mathrm{sn}} = \mathbb{E}_p(|f(x)-\mu|) = \varepsilon(1-\varepsilon) + (1-\varepsilon)\varepsilon = 2\varepsilon(1-\varepsilon) \approx 2\varepsilon.$$
The best possible coefficient of variation $\sigma_{q,\mathrm{sn}}/\mu$ is $\approx 2$.

Versus plain MC as $\mu = \varepsilon \to 0$
$$\frac{\sqrt{\mathrm{Var}(\tilde\mu_{q_{\mathrm{opt}},\mathrm{sn}})}}{\mu} \approx \frac{2}{\sqrt{n}}\ \text{(optimal SNIS)} \qquad \text{vs} \qquad \frac{\sqrt{\varepsilon(1-\varepsilon)/n}}{\mu} \approx \frac{1}{\sqrt{\varepsilon}}\,\frac{1}{\sqrt{n}}\ \text{(plain MC)}.$$

Optimal SNIS would be very good: it gets cv = 0.1 for $n = 400$ vs $100/\varepsilon$.
Plain IS can be even better (lower bound 0).

Adaptive Importance Sampling 33

When to use SNIS

If we only have $p_u$ or $q_u$ then we cannot use plain IS, but SNIS is available. I.e., use SNIS when we can draw $x \sim q$ and evaluate $p_u/q_u$ but not $p/q$.

Suppose we could do both?
We can then use $\int p(x)\,dx = \mathbb{E}_q(w(x)) = 1$ to define a control variate. (Much more later.)
The resulting estimate is then exact for $q = fp/\mu$ and for $f(x) = c$.
Asymptotically this combination beats both plain IS and SNIS.
That is, when we have normalized $p$ and $q$, SNIS is not best.

Adaptive Importance Sampling 34

Effects of dimension
$$p(x) = \prod_{j=1}^{d} p_j(x_j \mid x_1,\dots,x_{j-1}), \qquad q(x) = \prod_{j=1}^{d} q_j(x_j \mid x_1,\dots,x_{j-1}),$$
$$w(x) = \prod_{j=1}^{d}\frac{p_j(x_j \mid x_1,\dots,x_{j-1})}{q_j(x_j \mid x_1,\dots,x_{j-1})} \equiv \prod_{j=1}^{d} w_j(x_j \mid x_1,\dots,x_{j-1}).$$
We can write $p$ and $q$ as above even if we generate them some other way.

Each $\mathbb{E}_q(w_j(\cdot\mid\cdot)) = 1$ and $\mathbb{E}_q(w_j(\cdot\mid\cdot)^2) \ge 1$.
As a result $\mathbb{E}_q(w(x)^2)$ can grow exponentially with $d$, affecting both IS and SNIS.
If $q_j$ gets closer to $p_j$ as $j$ increases we might be OK. Or if $f$ is specially related to $p/q$.

Adaptive Importance Sampling 35

Gaussian p and q

For $p = \mathcal{N}(0, I)$ and $q = \mathcal{N}(\theta, I)$,
$$w(x) = \exp(-\theta^{\mathsf{T}}x + \theta^{\mathsf{T}}\theta/2), \qquad \mathbb{E}_q(w(x)^2) = \exp(\theta^{\mathsf{T}}\theta).$$
For $\theta = (1,1,\dots,1)^{\mathsf{T}}$ we get $\mathbb{E}_q(w(x)^2) = \exp(d)$.
Similar results follow for $\theta = (1,0,0,\dots,0)^{\mathsf{T}}$ and $\theta = (1/\sqrt{d}, 1/\sqrt{d}, \dots, 1/\sqrt{d})^{\mathsf{T}}$.

Variance change
Taking $q = \mathcal{N}(0, \sigma^2 I)$,
$$\mathbb{E}_q(w(x)^2) = \begin{cases}\Big(\dfrac{\sigma^2}{\sqrt{2\sigma^2-1}}\Big)^{d}, & \sigma^2 > 1/2,\\ \infty, & \text{else.}\end{cases}$$
We need $\sigma^2$ closer to 1 as $d$ grows, e.g., $1 + O(1/d)$.

Adaptive Importance Sampling 36

Infinite dimension

$x = (x_1, x_2, \dots)$. The function $f(x)$ depends on only the first $M = M(x)$ components of $x$.

Assume M is a stopping time
I.e., we can tell whether $M(x) = m$ from $x_1, x_2, \dots, x_m$. For example:
$$M = \min\Big\{m \ge 1 \ \Big|\ \sum_{i=1}^{m} x_i > 100\Big\}.$$
We must choose $q(\cdot)$ so that $\mathbb{P}_q(M < \infty) = 1$, so that we can compute $f$ in finite time.

The estimate
$$\hat\mu_q = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\prod_{j=1}^{M(x_i)}\frac{p(x_{ij}\mid x_{i1},\dots,x_{i,j-1})}{q(x_{ij}\mid x_{i1},\dots,x_{i,j-1})}.$$
Note: the weight $w(x)$ only uses $\prod_j p/q$ for $j \le M$. We don't need the infinite product.

Adaptive Importance Sampling 37

Infinite dimension, continued

Technicality
$$\mathbb{E}_q(\hat\mu_q) = \mathbb{E}_p\big(f(x)\,1_{M(x)<\infty}\big).$$
If $\mathbb{P}_p(M(x) < \infty) = 1$ then $\mathbb{E}_q(\hat\mu_q) = \mathbb{E}_p(f(x))$.

The technicality is useful
Suppose we want $\mathbb{P}_p(M(x) < \infty)$. Choose $f(x) = 1$. Then
$$\mathbb{E}_q(\hat\mu) = \mathbb{P}_p(M(x) < \infty).$$
Siegmund (1976)

Adaptive Importance Sampling 38

Weights and effective sample size

Kong (1992)
$$\hat\mu_q = \frac{1}{n}\sum_{i=1}^{n} w_i f(x_i), \qquad w_i = \frac{p(x_i)}{q(x_i)}, \qquad W \equiv \sum_i w_i.$$
If a small number of the $w_i$ dominate the others then $\hat\mu_q$ can be unstable. An effectively smaller number of the $f(x_i)$ are being averaged.

Effective sample size
$$n_e = \frac{\big(\sum_i w_i\big)^2}{\sum_i w_i^2}$$
$n_e = n$ if all $w_i$ are equal; $n_e = 1$ if only one $w_i$ is positive.

Slippery derivation
It follows from the variance of a weighted sum of IID random variables. In IS, the random variables and the weights are both functions of the same $x$. $n_e$ is the same for IS and SNIS.

Adaptive Importance Sampling 39

Effective sample size
$$n_e = \frac{\big(\sum_i w_i\big)^2}{\sum_i w_i^2} \approx \frac{n}{1 + \mathrm{cv}(w)^2}$$

Population counterpart
$$n_e^{*} = \frac{n\,\mathbb{E}_q(w)^2}{\mathbb{E}_q(w^2)} = \frac{n}{\mathbb{E}_q(w^2)} = \frac{n}{\mathbb{E}_p(w)}$$
Larger $w$ are like getting fewer data points.

Interpretation
Plain MC has $n_e = n$ and we don't want that. IS has $n_e < n$.
We just want to detect very badly imbalanced weights via small $n_e$.

Integrand specific version
Evans & Swartz (1995)
Use $w_i(f) = |f(x_i)|\,p(x_i)/q(x_i)$, $W(f) = \sum_i w_i(f)$, $\bar w_i(f) = w_i(f)/W(f)$.
$$n_e(f) = \frac{1}{\sum_{i=1}^{n} \bar w_i(f)^2}$$
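Both diagnostics are one-liners; here is a small sketch (the weights shown are illustrative, not from the talk).

import numpy as np

def effective_sample_size(w):
    # Kong's n_e = (sum w)^2 / sum w^2 for raw importance weights w
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def effective_sample_size_f(f, w):
    # integrand-specific version using |f| * w (Evans & Swartz)
    v = np.abs(np.asarray(f, dtype=float)) * np.asarray(w, dtype=float)
    vbar = v / v.sum()
    return 1.0 / np.sum(vbar ** 2)

w = np.array([1.0, 1.0, 1.0, 10.0])     # one dominating weight
print(effective_sample_size(w))          # about 1.66, far below n = 4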

Adaptive Importance Sampling 40

PERT example

Ten steps in a software project. Some cannot start until previous ones are done.

j   Task                           Predecessors   Mean duration
1   Planning                       None           4
2   Database Design                1              4
3   Module Layout                  1              2
4   Database Capture               2              5
5   Database Interface             2              2
6   Input Module                   3              3
7   Output Module                  3              2
8   GUI Structure                  3              3
9   I/O Interface Implementation   5,6,7          2
10  Final Testing                  4,8,9          2

Table 1: A PERT problem from Chinneck (2009, Chapter 11)

Adaptive Importance Sampling 41

PERT, continued

[Figure: PERT graph (activities on nodes) for tasks 1 through 10, with edges given by the predecessor lists in Table 1.]

Adaptive Importance Sampling 42

PERT, continued

For the given (mean) times, the project completes in $E = E_{10} = 15$ days. (They seem optimistic.)

Start time $S_j = \max_{k \to j} E_k$, time taken $T_j$, end time $E_j = S_j + T_j$.
E.g., from the graph: $S_9 = \max(E_5, E_6, E_7)$.

For independent exponential random variables $T_j$, we get $\mathbb{E}(E_{10}) \doteq 18$.
Suppose there's a large penalty when $E_{10} > 70$.

Plain MC
Got $E_{10} > 70$ twice in $n$ = 10,000 tries. Let's try importance sampling.

Adaptive Importance Sampling 43

Change the exponential rates

Change $T_j$ from exponential with mean $\theta_j$ to mean $\lambda_j$.
$$\hat\mu_\lambda = \frac{1}{n}\sum_{i=1}^{n} 1_{E_{i,10}>70}\prod_{j=1}^{10}\frac{\exp(-T_{ij}/\theta_j)/\theta_j}{\exp(-T_{ij}/\lambda_j)/\lambda_j}.$$
We can make $\mathbb{E}_\lambda(E_{10}) \approx 70$ by taking $\lambda = 4\theta$.
We get $\hat\mu_\lambda \doteq 2.01 \times 10^{-5}$ with a standard error of $4.7 \times 10^{-6}$.

Diagnostics
We also get $n_e \doteq 4.90$. This is quite low. The largest weight was about 43% of the total.
The standard error did not warn us about this problem.
The standard error is more vulnerable to large weights than the mean is.
An effective sample size for the se is $n_{e,\sigma} \doteq 1.15$. O (2013), Chapter 9.3.
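A minimal simulation sketch of this setup (not the code behind the slides): exponential task durations with the Table 1 means, and importance sampling by scaling the means of tasks 1, 2, 4, and 10; the numerical output will differ from the slide's figures.

import numpy as np

rng = np.random.default_rng(2)

theta = np.array([4, 4, 2, 5, 2, 3, 2, 3, 2, 2], dtype=float)         # mean durations, tasks 1..10
pred = [[], [0], [0], [1], [1], [2], [2], [2], [4, 5, 6], [3, 7, 8]]  # predecessor lists (0-based)

def completion_time(T):
    # project completion time E_10 for one vector of task durations T
    E = np.zeros(10)
    for j in range(10):
        S = max((E[k] for k in pred[j]), default=0.0)   # start after all predecessors finish
        E[j] = S + T[j]
    return E[9]

kappa, n = 4.0, 10_000
lam = theta.copy()
lam[[0, 1, 3, 9]] *= kappa                               # scale the critical path tasks

T = rng.exponential(lam, size=(n, 10))                   # T_ij ~ Exp(mean lam_j)
logw = np.sum(np.log(lam / theta) + T * (1 / lam - 1 / theta), axis=1)  # log p/q
w = np.exp(logw)
f = np.array([completion_time(t) > 70 for t in T], dtype=float)

mu_hat = np.mean(f * w)
se = np.std(f * w, ddof=1) / np.sqrt(n)
ne = w.sum() ** 2 / np.sum(w ** 2)                       # effective sample size
print(mu_hat, se, ne)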

Adaptive Importance Sampling 44

Try different λ

Replace $\lambda = 4\theta$ by $\lambda = \kappa\theta$ and search for a good $\kappa$. (None seemed good.)

Unequal scaling
Adding a day to some tasks does not always add a day to the project.
Details: $T_j = \theta_j + 1$, $T_k = \theta_k$ for $k \ne j$. Critical path: $j \in \{1, 2, 4, 10\}$.
Even doubling some tasks' time might not cause a delay.
Details: $T_j = 2\theta_j$, $T_k = \theta_k$ for $k \ne j$. Changes for: $j \notin \{3, 7, 8\}$.

Scale tasks 1, 2, 4, 10 by κ

κ                3.0    3.5    4.0    4.5    5.0
10^5 × µ̂         3.3    3.6    3.1    3.1    3.1
10^6 × se(µ̂)     1.8    2.1    1.6    1.5    1.6
n_e              939    564    359    239    165
n_{e,σ}          165     88     52     33     23

Adaptive Importance Sampling 45

Follow-up

Using $\kappa = 4$ on the critical path with $n$ = 200,000 yields
$$\hat\mu_q = 3.18 \times 10^{-5}, \quad \mathrm{se} = 3.62 \times 10^{-7}, \quad n_e \doteq 7470, \quad n_{e,\sigma} \doteq 992, \quad n_e(f) \doteq 7400.$$

Vs plain MC
Plain MC has variance $\varepsilon(1-\varepsilon)/n$. IS has variance about 1200-fold smaller here.

Vs SNIS
In this case SNIS had about double the estimated variance of plain IS.

Further improvements
Much more subtle tuning could be applied.
The 10th random variable can be 'integrated out' in closed form, reducing variance.
Control variates could be applied. Antithetics could be applied.

Adaptive Importance Sampling 46

Summary of basic IS

• IS helps with rare events or spiky integrands

• It becomes unbiased by reweighting

• A good importance sampler q is nearly proportional to fp

• Take extreme care that q is not small

• Self-normalized IS is more widely applicable, but less effective

• Dimension brings challenges

PERT example

This was adaptive IS done by hand.


Adaptive Importance Sampling 47

Devising q

We can go through a storehouse of distributions to find $q$. For instance, Devroye (1986).

General techniques

• Exponential tilting

• Laplace approximations

• Mixtures


Adaptive Importance Sampling 48

Exponential tilting

Also called exponential twisting.

Exponential family
$$p(x; \theta) = \exp\big(\theta^{\mathsf{T}}x - A(x) - C(\theta)\big), \quad x \in \mathcal{X}$$
Here $p(x) = p(x; \theta_0)$ and $q(x) = p(x; \theta)$.
Includes: $\mathcal{N}(\theta, I)$, Poisson($e^\theta$), Binomial, and Gamma distributions.

More general case
$$p(x; \theta) = \exp\big(\eta(\theta)^{\mathsf{T}}T(x) - A(x) - C(\theta)\big), \quad x \in \mathcal{X}$$
Here $\eta(\cdot)$ and $T(\cdot)$ are known functions.

Adaptive Importance Sampling 49

IS with tilting
$$\hat\mu_\theta = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\frac{p(x_i;\theta_0)}{p(x_i;\theta)}, \quad x_i \sim p(\cdot\,;\theta)$$
$$= \frac{1}{n}\sum_{i=1}^{n} f(x_i)\frac{\exp\big(x_i^{\mathsf{T}}\theta_0 - A(x_i) - C(\theta_0)\big)}{\exp\big(x_i^{\mathsf{T}}\theta - A(x_i) - C(\theta)\big)} = \frac{c(\theta)}{c(\theta_0)}\,\frac{1}{n}\sum_{i=1}^{n} f(x_i)\,e^{x_i^{\mathsf{T}}(\theta_0 - \theta)},$$
where $c(\theta) = \exp(C(\theta))$. This often simplifies.

For $\mathcal{N}(\theta, \Sigma)$
$$\hat\mu_\theta = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,e^{x_i^{\mathsf{T}}\Sigma^{-1}(\theta_0-\theta)}\times \underbrace{e^{\frac{1}{2}\theta^{\mathsf{T}}\Sigma^{-1}\theta - \frac{1}{2}\theta_0^{\mathsf{T}}\Sigma^{-1}\theta_0}}_{c(\theta)/c(\theta_0)}.$$

Adaptive Importance Sampling 50

Further tilting

Define a family $q(x; \theta) \propto p(x)\,e^{h(x)^{\mathsf{T}}\theta}$.
The function $h(\cdot)$ defines 'features' of $x$. $\theta_j > 0$ makes higher values of $h_j(x)$ more probable. $q(x; 0) = p(x)$.

If $p$ is not in the exponential family, then we might not be able to compute $w(x)$. If we can sample from $q(x;\theta)$ then we can use SNIS.

Adaptive Importance Sampling 51

Hessian at the mode

Suppose that we find the mode $x_*$ of $p(x)$, or better yet, of $h(x) \equiv p(x)f(x)$.

Taylor approximation
$$\log(h(x)) \approx \log(h(x_*)) - \frac{1}{2}(x-x_*)^{\mathsf{T}}H_*(x-x_*),$$
$$h(x) \approx h(x_*)\exp\Big(-\frac{1}{2}(x-x_*)^{\mathsf{T}}H_*(x-x_*)\Big), \quad \text{suggesting } q = \mathcal{N}(x_*, H_*^{-1}).$$
The Hessian of $\log(h)$ at $x_*$ is $-H_*$. Requires positive definite $H_*$.

For safety
Use a t distribution instead of $\mathcal{N}$.

Adaptive Importance Sampling 52

Mixtures

What if there are multiple modes?
https://en.wikipedia.org/wiki/Multimodal_distribution

Closely sampling the highest mode might lead us to undersample or even miss some others. A mixture of unimodal densities can capture all the modes (that we know of).

Adaptive Importance Sampling 53

Mixture distributions

Sample $j$ randomly from $\{1, 2, \dots, J\}$ with probabilities $\alpha_1, \dots, \alpha_J$. Here $\alpha_j > 0$ and $\sum_{j=1}^{J}\alpha_j = 1$. Given $j$, take $x \sim q_j$.

For example
$$\sum_{j=1}^{J}\alpha_j\,\mathcal{N}(\theta_j, \sigma^2 I_d)$$
With large $J$, we get kernel density approximations. These can approximate generic densities. The approximation gets more difficult with large dimension.
West (1993), Oh & Berger (1993)

Adaptive Importance Sampling 54

Multiple rare event sets

Suppose that x describes a bridge and its load.

There is a failure if x ∈ A1 or x ∈ A2.

Oversampling A1 might lead us to miss A2.

Finance example

Portfolio might fail to meet its goal under:

A1: oil sector underperforms

A2: dollar becomes too weak

A3: interest rates in China are too high

A4: pharmaceutical stocks do too well (portfolio shorted them)

Mixtures

One component can oversample each failure mechanism (that we know of).


Adaptive Importance Sampling 55

Multiple integrands of interest

We want $\mu_j = \int f_j(x)p(x)\,dx$ for $j = 1, \dots, J$.

Statistics: numerous posterior moments.

Graphics: red, green, blue for each pixel.

Grid science: J different designs.

Each fj(·)p(·) has its own peak.

Mixtures

One mixture component can be directed towards each integrand (that we have planned for).

We will have to take care that optimizing for f1 does not spoil things for f2. The same issue

arises for multiple modes and failure regions.


Adaptive Importance Sampling 56

Multiple nominal distributions

We want $\mu_j = \int f(x)p_j(x)\,dx$ for $j = 1, \dots, J$.
The $p_j$ may correspond to different scenarios (that we are aware of).

Mixtures
Multiple $p_j$ and multiple $f_k$ amount to multiple products $(pf)_{jk}$. We can use a component for each.

Adaptive Importance Sampling 57

IS with mixtures
$$q_\alpha(x) = \sum_{j=1}^{J}\alpha_j q_j(x),$$
$$\hat\mu_\alpha = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{q_\alpha(x_i)} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{\sum_{j=1}^{J}\alpha_j q_j(x_i)}. \quad (**)$$
(∗∗) Balance heuristic of Veach & Guibas; also the Horvitz-Thompson estimator.

Alternative
Suppose that $x_i$ came from component $j(i)$. We could also use
$$\frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{q_{j(i)}(x_i)}.$$
It is also unbiased. This alternative has higher variance. But you don't have to compute every $q_j$ for every $x_i$.
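Here is a small sketch of the balance-heuristic estimator (∗∗) (an illustrative setup of my own, not from the talk), assuming $p = \mathcal{N}(0,1)$, $f(x) = 1\{|x| > 4\}$, and two components aimed at the two tails.

import math
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 20_000, np.array([0.5, 0.5])
means = np.array([-4.0, 4.0])                     # one component per tail

def norm_pdf(x, m):
    return np.exp(-(x - m) ** 2 / 2) / np.sqrt(2 * np.pi)

j = rng.choice(2, size=n, p=alpha)                # mixture label
x = rng.normal(means[j], 1.0)                     # x_i ~ q_{j(i)}
q_alpha = alpha[0] * norm_pdf(x, means[0]) + alpha[1] * norm_pdf(x, means[1])
w = norm_pdf(x, 0.0) / q_alpha                    # p / q_alpha (balance heuristic)
f = (np.abs(x) > 4.0).astype(float)

mu_hat = np.mean(f * w)
print(mu_hat, math.erfc(4 / math.sqrt(2)))        # estimate vs exact P(|X| > 4)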

Adaptive Importance Sampling 58

Why higher variance?

Argument by convexity. Simple case: $\alpha = (\tfrac{1}{2}, \tfrac{1}{2})$ for $q_1$ and $q_2$.
Recall: $\sigma_q^2 = \int (fp)^2/q\,dx - \mu^2$.

Horvitz-Thompson
$$\frac{1}{n}\left(\int\frac{(fp)^2}{(q_1+q_2)/2}\,dx - \mu^2\right)$$
Alternative
$$\frac{1}{2n}\left(\int\frac{(fp)^2}{q_1}\,dx - \mu^2\right) + \frac{1}{2n}\left(\int\frac{(fp)^2}{q_2}\,dx - \mu^2\right)$$
Horvitz-Thompson is better because $1/x$ is convex. We gave the alternative a slight advantage by taking exactly $n/2$ observations from each $q_j$.

Adaptive Importance Sampling 59

Defensive mixtures

From Hesterberg (1995).
For our best guess $q(\cdot)$, take $q_1 \equiv p$ and let $q_2 = q$. Now
$$w(x) = \frac{p(x)}{q_\alpha(x)} = \frac{p(x)}{\alpha_1 p(x) + \alpha_2 q_2(x)} \le \frac{1}{\alpha_1}, \quad \forall x.$$
Perhaps $\alpha_1 = 1/10$ or $1/2$. $q$ does not need to be heavy tailed any more, because $q_\alpha$ is.

Variance bounds
$$\mathrm{Var}(\hat\mu_{q_\alpha}) \le \frac{1}{n\alpha_1}\big(\sigma_p^2 + \alpha_2\mu^2\big), \qquad \mathrm{Var}(\tilde\mu_{q_\alpha,\mathrm{sn}}) \le \frac{1}{n\alpha_1}\sigma_{p,\mathrm{sn}}^2 = \frac{1}{n\alpha_1}\sigma_p^2, \qquad \mathrm{Var}(\tilde\mu_{q_\alpha,\mathrm{sn}}) \le \frac{1}{n\alpha_2}\sigma_{q,\mathrm{sn}}^2.$$
If however $\sigma_q^2$ was very small, defensive IS can lose out. We don't have a bound vs $\sigma_q^2$.
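A short defensive-mixture sketch (illustrative choices: $p = \mathcal{N}(0,1)$, guess $q_2 = \mathcal{N}(4,1)$, $\alpha = (0.1, 0.9)$), mainly to show the weight bound $1/\alpha_1$ in action.

import numpy as np

rng = np.random.default_rng(4)
n, a1, a2 = 50_000, 0.1, 0.9

def norm_pdf(x, m):
    return np.exp(-(x - m) ** 2 / 2) / np.sqrt(2 * np.pi)

from_p = rng.random(n) < a1                                   # which component each draw uses
x = np.where(from_p, rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n))
w = norm_pdf(x, 0.0) / (a1 * norm_pdf(x, 0.0) + a2 * norm_pdf(x, 4.0))
f = (x > 4.0).astype(float)

print(np.mean(f * w), w.max())   # estimate of P(X > 4); max weight never exceeds 1/a1 = 10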

Adaptive Importance Sampling 60

Loss of proportionality

[Figure: four panels showing the integrand $f$, the sampling densities, the likelihood ratios, and the resulting integrands under defensive importance sampling.]

Defensive importance sampling
• $f$ Gaussian near 0.7
• $p$ Uniform
• $q$ is a Beta density $\approx f$
• $q_\alpha$ is the dotted mixture
• $\alpha = (0.2, 0.8)$
• $fp/q$ unbounded
• $q_\alpha$ not very proportional to $fp = f$
• Solution: control variates.

Adaptive Importance Sampling 61

Control variates

Suppose that we know $\mathbb{E}(h(x)) = \theta$ in closed form. If $h \approx f$ then we can Monte Carlo the difference
$$\frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - h(x_i)\big) + \theta$$
for greatly reduced variance.

More generally
For known $\mathbb{E}(h(x)) = \theta \in \mathbb{R}^d$, let
$$\hat\mu_\beta = \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - \beta^{\mathsf{T}}h(x_i)\big) + \beta^{\mathsf{T}}\theta.$$
The optimal $\beta$ can be estimated from the same data.

Adaptive Importance Sampling 62

It is easy to see that $\mathbb{E}(\hat\mu_\beta) = \mu$ (unbiased) for any $\beta$.

The optimal $\beta$ minimizes $\mathrm{Var}(\hat\mu_\beta)$. This $\beta_{\mathrm{opt}}$ is unknown and could well be harder to estimate than our ultimate goal $\mu$.

A harder intermediate problem is usually to be avoided, but in this case we have a loophole. Getting $\beta$ wrong by $\varepsilon$ will mean our variance is too large by a factor of $1 + O(\varepsilon^2)$. We do not need a precise estimate of $\beta$. The typical $\varepsilon$ is proportional to $1/\sqrt{n}$. We can ordinarily analyze control variates as if we were using the true unknown $\beta$.

[Aside: this asymptotic approximation is poor if the dimension of $\beta$ is comparable to $n$.]

Adaptive Importance Sampling 63

Control variates via regression

Define $Y_i = f(x_i)$ and $Z_{ij} = h_j(x_i) - \theta_j$.
Fit $Y_i \doteq \beta_0 + \sum_{j=1}^{J}\beta_j Z_{ij}$ by least squares.
The fitted intercept $\hat\beta_0$ is the estimate $\hat\mu$.
Regression code also gives a standard error for the estimate.
Details in O (2013), Algorithm 8.3. Centering the $Z_{ij}$ is essential.

Using estimated β
$\hat\beta - \beta_{\mathrm{opt}} = O_p(n^{-1/2})$, so we are about $n^{-1/2}$ from the optimum.
$\sigma_\beta^2$ is quadratic in $\beta \implies \sigma_{\hat\beta}^2 = \sigma_{\beta_{\mathrm{opt}}}^2\big(1 + O(1/n)\big)$.
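A minimal regression-based control variate sketch (a toy problem of my own choosing, not from the talk): $x \sim U(0,1)$, $f(x) = e^x$, control variate $h(x) = x$ with known mean $\theta = 1/2$; the intercept of the least squares fit estimates $\mu$.

import numpy as np

rng = np.random.default_rng(5)
n = 10_000

x = rng.uniform(size=n)
Y = np.exp(x)                    # f(x); true mean is e - 1
Z = x - 0.5                      # centered control variate h(x) - theta

X = np.column_stack([np.ones(n), Z])               # design matrix: intercept + Z
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)       # least squares fit
resid = Y - X @ coef
se = np.sqrt(np.sum(resid**2) / (n - 2) / n)       # rough standard error of the intercept

print(coef[0], se, np.exp(1) - 1)                  # estimate, se, exact value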

Adaptive Importance Sampling 64

Control variates and mixture IS

For normalized $q_j(\cdot)$ we know
$$\int q_j(x)\,dx = 1, \quad j = 1,\dots,J, \qquad \mathbb{E}_{q_\alpha}\!\left(\frac{q_j(x)}{q_\alpha(x)}\right) = \int q_j(x)\,dx = 1.$$

Unbiased estimate
$$\hat\mu_{\alpha,\beta} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i) - \sum_{j=1}^{J}\beta_j q_j(x_i)}{\sum_{j=1}^{J}\alpha_j q_j(x_i)} + \sum_{j=1}^{J}\beta_j.$$
Additional control variates can be added too.

Via regression
Regress $Y_i = f(x_i)p(x_i)/q_\alpha(x_i)$ on $Z_{ij} = q_j(x_i)/q_\alpha(x_i) - 1$. Get $\hat\mu = \hat\beta_0$ (intercept) and its se.
$\sum_j \alpha_j Z_{ij} = 0$ for all $i$, so drop one predictor.

Adaptive Importance Sampling 65

Mixture IS results

O & Zhou (2000)
$$\mathrm{Var}(\hat\mu_{\alpha,\beta_{\mathrm{opt}}}) \le \min_{1\le j\le J}\frac{\sigma_j^2}{n\alpha_j}$$

Properties
1) $\hat\mu_{\alpha,\beta}$ is unbiased if $q_\alpha > 0$ whenever $fp \ne 0$.
2) If any $q_j$ has $\sigma_{q_j}^2 = 0$ we get $\mathrm{Var}(\hat\mu_{\alpha,\beta_{\mathrm{opt}}}) = 0$.
3) It is even better to take exactly $n\alpha_j$ observations from $q_j$.

We could not expect better in general. We might have $\sigma_j^2 = \infty$ for all but one $j$. The bound is $\sigma_j^2/(n\alpha_j)$, as if we had just used the good one. (Without knowing which one it was. Indeed it might be a different one for each of several different integrands.)

Adaptive Importance Sampling 66

Multiple importance sampling

Veach & Guibas (1995). Widely used in computer graphics. Eric Veach got an Oscar for it in 2014.
$$\hat\mu = \sum_{j=1}^{J}\frac{1}{n_j}\sum_{i=1}^{n_j}\omega_j(x_{ij})\frac{f(x_{ij})p(x_{ij})}{q_j(x_{ij})}, \qquad q_j = j\text{'th light path sampler}.$$

Partition of unity
$\omega_j(x) \ge 0$ and $\sum_{j=1}^{J}\omega_j(x) = 1$ for all $x$.
Can attain Horvitz-Thompson by the 'balance heuristic' $\omega_j(\cdot) \propto n_j q_j(\cdot)$.
The weight on $f(x_{ij})p(x_{ij})$ can depend on $j$ in many ways.

Adaptive Importance Sampling 67

Summary of mixtures

Using mixtures, we can

• bound the importance ratio

• place a distribution near each singularity

• place a distribution near each failure mode

• tune a distribution to each fj of interest

• tune a distribution to each pj of interest

• use control variates to be almost as good as the optimal component

The mixture components can be based on intuition, tilting, Hessians.


Adaptive Importance Sampling 68

What-if simulations

We estimated $\mu = \int f(x)p(x)\,dx$ using $x_i \sim q$. From the sample values $f(x_i)$ we can estimate

1) $\int f(x)p'(x)\,dx$ for a different density $p' \ne p$,
2) $\mathrm{Var}_q(\hat\mu_q)$ for $x_i \sim q$,
3) $\mathrm{Var}_{q'}(\hat\mu_{q'})$ for $x_i \sim q'$, for a different density $q' \ne q$.

Estimating what would have happened with a different q is the key to adaptive IS.

Adaptive Importance Sampling 69

What-if simulations

Family $p(x; \theta)$, $\theta \in \Theta$. We want $\mu(\theta) = \mathbb{E}(f(x); \theta) = \int f(x)p(x;\theta)\,dx$ for $\theta \in \Theta$.

Sample from θ0 and reweight for the others
$$\hat\mu(\theta) = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\frac{p(x_i;\theta)}{p(x_i;\theta_0)}, \quad x_i \sim p(\cdot\,;\theta_0).$$
We can recycle our $f(x_i)$ values.

Common heavy-tailed q
$$\hat\mu(\theta) = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\frac{p(x_i;\theta)}{q(x_i)}, \quad x_i \sim q(\cdot).$$

Paraphrase (from memory) of Trotter and Tukey (1956): "Any sample can come from any distribution."
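A small reweighting sketch (an illustrative setup of my own): one sample from $p(\cdot\,;\theta_0) = \mathcal{N}(0,1)$ is reused to estimate $\mu(\theta) = \mathbb{P}_\theta(X > 2)$ over a grid of $\theta$ values.

import numpy as np

rng = np.random.default_rng(6)
n = 100_000

theta0 = 0.0
x = rng.normal(theta0, 1.0, size=n)          # one sample from p(.; theta0)
f = (x > 2.0).astype(float)                  # f(x) = 1{x > 2}

def mu_hat(theta):
    # w = p(x; theta) / p(x; theta0) for unit-variance Gaussians
    logw = -(x - theta) ** 2 / 2 + (x - theta0) ** 2 / 2
    return np.mean(f * np.exp(logw))

for theta in [0.0, 0.25, 0.5, 1.0]:
    print(theta, mu_hat(theta))              # accuracy degrades as theta moves away from theta0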

Adaptive Importance Sampling 70

What-if simulations

They may be accurate for $\theta$ near $\theta_0$ but otherwise may fail.

For $p = \mathcal{N}(\theta_0, \Sigma)$ and $q = \mathcal{N}(\theta, \Sigma)$,
$$n_e^{*} \ge \frac{n}{100} \quad \text{requires} \quad (\theta-\theta_0)^{\mathsf{T}}\Sigma^{-1}(\theta-\theta_0) \le \log(10) \doteq 2.30. \qquad \text{O (2013), Chapter 9.14.}$$

Example
$f(x_i; \nu) = \max(x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5})$ where $x_{ij} \sim \mathrm{Poi}(\nu)$.
Importance sample from $\lambda$ instead of $\nu$:
$$\hat\mu_\nu = \frac{1}{n}\sum_{i=1}^{n}\max(x_{i1},\dots,x_{i5})\prod_{j=1}^{5}\frac{e^{-\nu}\nu^{x_{ij}}/x_{ij}!}{e^{-\lambda}\lambda^{x_{ij}}/x_{ij}!} = \frac{1}{n}\sum_{i=1}^{n}\max(x_{i1},\dots,x_{i5})\,e^{5(\lambda-\nu)}\Big(\frac{\nu}{\lambda}\Big)^{\sum_j x_{ij}}.$$

Adaptive Importance Sampling 71

Poisson example

[Figure: effective sample sizes and means with 99% confidence intervals as ν varies, with $x_{ij} \sim \mathrm{Poi}(\lambda)$; the solid curve is the population effective sample size, asterisks are sample estimates.]

When $\nu > \lambda$, $q$ has lighter tails than $p$; hence wide CIs and poor $n_e$.

Adaptive Importance Sampling 72

What-if for variance

For $q(x;\theta)$,
$$\sigma_\theta^2 = \int\frac{(f(x)p(x))^2}{q(x;\theta)}\,dx - \mu^2 \equiv \mathrm{MS}_\theta - \mu^2.$$
Now
$$\mathrm{MS}_\theta = \int\frac{(f(x)p(x))^2}{q(x;\theta)}\,dx = \int\frac{(f(x)p(x))^2}{q(x;\theta)\,q(x;\theta_0)}\,q(x;\theta_0)\,dx.$$

Sample estimate
For $x_i \sim q(\cdot\,;\theta_0)$,
$$\widehat{\mathrm{MS}}_\theta(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{(f(x_i)p(x_i))^2}{q(x_i;\theta)\,q(x_i;\theta_0)}.$$

Adaptive Importance Sampling 73

Summary of what-if simulations

• We can learn about alternative p’s and q’s

• Effectiveness is over a limited range


Adaptive Importance Sampling 74

The next slides are a survey of a selected set of adaptive importance sampling methods.

These are the ones that I thought were interesting and potentially useful in grid science.

Of course there are many others.


Adaptive Importance Sampling 75

Adaptive importance sampling

Overview: use the $x_i$ sampled so far to change $q$, and sample some more.

Two stage
Get a pilot sample $x_{i1} \sim q_1$. Use it to choose $q_2$. Take $x_{i2} \sim q_2$ to estimate $\mu$.

Multi-stage
Given $q_1$, for $k = 1, \dots, K-1$, use $x_{ik}$, $i = 1, \dots, n_k$, to choose $q_{k+1}$. Or use $x_{ij}$, $1 \le i \le n_j$, $1 \le j \le k$, to choose $q_{k+1}$.

n-stage
Given $q_1$, take $x_i$ from $q_i$ computed from $x_1, \dots, x_{i-1}$.

Adaptive Importance Sampling 76

AIS details

We have to pick a family $\mathcal{Q}$ of proposal distributions.
We need an initial $q \in \mathcal{Q}$.
We have to choose a termination criterion: e.g., K steps, total N observations, or confidence interval width below ε.
We need a way to choose $q_{k+1} \in \mathcal{Q}$ from prior data.
We need a way to combine information from all K stages.

Adaptive Importance Sampling 77

Combination

Suppose we get $\hat\mu_k$ for $k = 1, \dots, K$, unbiased with variances $\sigma_k^2$.

Best linear unbiased combination
$$\sum_{k=1}^{K}\frac{\hat\mu_k}{\mathrm{Var}(\hat\mu_k)} \Big/ \sum_{k=1}^{K}\frac{1}{\mathrm{Var}(\hat\mu_k)}$$
It is not safe to replace $\mathrm{Var}(\hat\mu_k)$ by $\widehat{\mathrm{Var}}(\hat\mu_k)$.
It is common for $\widehat{\mathrm{Var}}(\hat\mu_k)$ to be correlated with $\hat\mu_k$; e.g., the largest $\hat\mu_k$ may coincide with the largest $\widehat{\mathrm{Var}}(\hat\mu_k)$. Plug-in estimates could bias the estimate down.
Deterministic weighting is less risky.

Adaptive Importance Sampling 78

Deterministic weights

Suppose the true variance improves: $\mathrm{Var}(\hat\mu_k) \propto k^{-r_0}$ for some $0 \le r_0 \le 1$.
We should weight proportionally to $k^{r_0}$, but we might not know $r_0$. So we use some $r_1$.
If we take $r_1 = 1/2$ then we never raise variance by more than 12.5%:
$$\sup_{1\le K<\infty}\ \max_{0\le r_0\le 1}\ \frac{\mathrm{Var}(\text{using } r_1 = 1/2)}{\mathrm{Var}(\text{using } r_0)} = \frac{9}{8}.$$

Upshot
Only a small efficiency loss from weighting stage $k$ proportionally to $\sqrt{k}$:
$$\sum_{k=1}^{K}\sqrt{k}\,\hat\mu_k \Big/ \sum_{k=1}^{K}\sqrt{k}.$$
O & Zhou (1999)
Not efficient for exponential convergence (see below), but that is a rare circumstance.
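The combination rule itself is a one-liner; here is a sketch (the stage estimates shown are hypothetical placeholders, not results from the talk).

import numpy as np

def combine_stages(mu_k):
    # deterministic sqrt(k) weighting of per-stage AIS estimates
    mu_k = np.asarray(mu_k, dtype=float)
    w = np.sqrt(np.arange(1, len(mu_k) + 1))
    return np.sum(w * mu_k) / np.sum(w)

print(combine_stages([2.3e-5, 2.0e-5, 1.9e-5, 2.1e-5]))   # hypothetical stage estimates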

Adaptive Importance Sampling 79

Combination

Put weights $\omega_k$ on $\hat\mu_k$, such as $\omega_k \propto \sqrt{k}$. Final estimate $\hat\mu = \sum_{k=1}^{K}\omega_k\hat\mu_k$.

Assume
$$\mathbb{E}(\hat\mu_k \mid \text{data to step } k-1) = \mu, \qquad \mathbb{E}\big(\widehat{\mathrm{Var}}(\hat\mu_k) \mid \text{data to step } k-1\big) = \mathrm{Var}(\hat\mu_k \mid \text{data to step } k-1).$$
Then by martingale arguments
$$\mathbb{E}(\hat\mu) = \mu, \quad \text{and} \quad \widehat{\mathrm{Var}}(\hat\mu) \equiv \sum_k \omega_k^2\,\widehat{\mathrm{Var}}(\hat\mu_k) \ \text{ satisfies } \ \mathbb{E}\big(\widehat{\mathrm{Var}}(\hat\mu)\big) = \mathrm{Var}(\hat\mu).$$

Adaptive Importance Sampling 80

Exponential convergence

Parametric set $\mathcal{Q} = \{q_\theta \mid \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^d$.

Best case scenario
If $q_* = fp/\mu \in \mathcal{Q}$ then we can approach $\sigma_q^2 = 0$.
Let $q_* = q_{\theta_*}$. If we can estimate $\theta_*$ to within $O(n^{-1/2})$ from a first estimate, we get much smaller variance at the second step. The result can be a virtuous cycle of $\theta^{(k)} \to \theta_*$ and $\sigma_k^2 \to 0$.

Examples
1) Toy example with $p = \mathcal{N}(0,1)$ and $f(x) = \exp(Ax)$.
2) Theory from Kollman (1993) and others for particle transport.

Adaptive Importance Sampling 81

Toy example

$p = \mathcal{N}(0,1)$, $f(x) = \exp(Ax)$ for $A > 0$, and $q_\theta = \mathcal{N}(\theta, 1)$.
We actually know $\mu = \mathbb{E}(f(x)) = \exp(A^2/2)$, so we don't really need MC.
Suppose we also know that $\theta_* = A$ and $\mathbb{E}_p(f(x)) = \exp(A^2/2)$, that is, $\theta_* = \sqrt{2\log(\mathbb{E}(f(x)))}$.

Cycle through
$$x_i \sim q_{\theta^{(k)}}, \quad i = 1, \dots, n,$$
$$\hat\mu^{(k)} \leftarrow \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)p(x_i)}{q_{\theta^{(k)}}(x_i)},$$
$$\theta^{(k+1)} \leftarrow \sqrt{2\log\big(\max(1, \hat\mu^{(k)})\big)}.$$
Without the $\max(1, \cdot)$ the algorithm can fail.
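This cycle is only a few lines of code; here is a sketch of it (A = 2 is an illustrative choice, and n = 20 matches the step size quoted on the next slide).

import numpy as np

rng = np.random.default_rng(7)

A, n, K = 2.0, 20, 20
mu_true = np.exp(A**2 / 2)

theta = 0.0
for k in range(K):
    x = rng.normal(theta, 1.0, size=n)              # x_i ~ q_theta = N(theta, 1)
    logw = -x**2 / 2 + (x - theta)**2 / 2           # log p(x)/q_theta(x)
    mu_hat = np.mean(np.exp(A * x + logw))          # average of f(x) p(x)/q(x)
    theta = np.sqrt(2 * np.log(max(1.0, mu_hat)))   # update toward theta* = A
    print(k, mu_hat, abs(mu_hat - mu_true))         # error shrinks rapidly across iterations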

Adaptive Importance Sampling 82

Exponential convergence

[Figure: absolute error of the exponentially converging AIS estimate versus iteration, on a log scale, falling from about 1 to about 1e-16 over roughly 20 iterations.]

Steps with n = 20.

Adaptive Importance Sampling 83

Transport problems

Booth (1985) observed exponential convergence in some particle transport problems.
Kollman (1993) established exponential convergence under strong conditions: start one simulation from each state of a Markov chain.
Extensions from Kollman, Baggerly, Cox, Picard (1999), Baggerly, Cox, Picard (2000), and Kong, Ambrose & Spanier (2009).

Story
Markovian particles: $\mathbb{P}(x_{n+1} = x \mid x_0, \dots, x_n) = \mathbb{P}(x_{n+1} = x \mid x_n)$.
Terminal state $\Delta$ (absorption, or the particle left the region of interest), with $\lim_{n\to\infty}\mathbb{P}(x_n = \Delta) = 1$.
$$f(x_1, x_2, \dots) = \sum_{n=1}^{\infty} s(x_{n-1}\to x_n) = \sum_{n=1}^{\tau} s(x_{n-1}\to x_n),$$
with hitting time $\tau$ and score function $s(\cdot\to\cdot)$ satisfying $s(\Delta\to\Delta) = 0$.

Adaptive Importance Sampling 84

Particle transport

The particle follows a Markov chain with transition matrix $P$. We use transitions $Q$ instead, with $Q_{ij} > 0$ whenever $P_{ij} > 0$.

Optimal Q
$$Q_{ij} = \frac{P_{ij}(s_{ij} + \mu_j)}{P_{i\Delta}s_{i\Delta} + \sum_{\ell=1}^{d}P_{i\ell}(s_{i\ell} + \mu_\ell)}, \qquad Q_{i\Delta} = \frac{P_{i\Delta}s_{i\Delta}}{P_{i\Delta}s_{i\Delta} + \sum_{\ell=1}^{d}P_{i\ell}(s_{i\ell} + \mu_\ell)}, \quad \text{where}$$
$$\mu_i = \mathbb{E}_Q\Big(\sum_{n=1}^{\tau} s(x_{n-1}\to x_n)\,w_n(x_1,\dots,x_n)\ \Big|\ x_0 = i\Big), \quad \text{and} \quad w_n = \prod_{j=1}^{n}\frac{P_{x_{j-1}x_j}}{Q_{x_{j-1}x_j}}.$$

Algorithm
Alternates between estimating $\mu$ and sampling from $Q(\hat\mu)$.

Adaptive Importance Sampling 85

Cross-entropy

Rubinstein (1997), Rubinstein & Kroese (2004)
$\mu = \int f(x)p(x)\,dx$ for $f > 0$ and $\mu > 0$.
We use $q(\cdot\,;\theta) = q_\theta$ for $\theta \in \Theta$.
There is an optimal $q \propto fp$ but it is not usually in our family.

We would like
$$\theta = \arg\min_{\theta\in\Theta}\mathrm{MS}(\theta) \quad \text{where} \quad \mathrm{MS}(\theta) = \int\frac{(fp)^2}{q_\theta}\,dx.$$

Variance based update
$$\theta^{(k+1)} = \arg\min_{\theta\in\Theta}\frac{1}{n_k}\sum_{i=1}^{n_k}\frac{(f(x_i)p(x_i))^2}{q_\theta(x_i)}, \quad x_i = x_i^{(k)} \sim q_{\theta^{(k)}}.$$
The optimization may be too hard. Switch to entropy.

Adaptive Importance Sampling 86

Cross-entropy

Use an exponential family
$$q_\theta = \exp\big(\theta^{\mathsf{T}}x - A(x) - C(\theta)\big).$$
Replace variance by KL distance
$$D(q_*\|q_\theta) = \mathbb{E}_{q_*}\!\left(\log\frac{q_*(x)}{q_\theta(x)}\right).$$
This distance has $D(q_*\|q_\theta) \ge 0$ and $D(q_*\|q_*) = 0$.
Seek $\theta$ to minimize
$$D(q_*\|q_\theta) = \mathbb{E}_{q_*}\big(\log(q_*(x)) - \log(q(x;\theta))\big),$$
i.e., maximize $\mathbb{E}_{q_*}(\log(q(x;\theta)))$.

Adaptive Importance Sampling 87

Cross-entropyFor q > 0 whenever q∗ > 0,

Eq∗(log(q(x; θ))) = Eq( log(q(x; θ))q∗(x)

q(x)

)=

1

µEq( log(q(x; θ))f(x)p(x)

q(x)

)Choose θ(k+1) to maximize

G(k)(θ) =1

nk

nk∑i=1

p(xi)f(xi)

q(xi; θ(k))log(q(x; θ))

≡ 1

nk

k∑i=1

Hi log(q(xi; θ))

≡ 1

nk

k∑i=1

Hi

(θTxi −A(xi)− C(θ)

)C(θ) is a known convex function. For Hi > 0. Assume (for now) that

∑iHi > 0.

The update often takes a simple moment matching form. Grid Science Winter School, January 2017

Adaptive Importance Sampling 88

Cross-entropy update

∑_i H_i x_iᵀ / ∑_i H_i = ∂_θ C(θ^(k+1))

For q_θ = N(θ, I):

θ^(k+1) ← ∑_i H_i x_i / ∑_i H_i

For q_θ = N(θ, Σ):

θ^(k+1) ← Σ^{−1} ∑_i H_i x_i / ∑_i H_i

Other exponential family updates are typically closed form functions of sample moments.

What if all H_i = 0?

Consider f(x) = 1_{g(x)>τ} for large τ.

Use some other τ^(k) instead, for example the 99th percentile of the g(x_i).

Ideally τ^(k) → τ.
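
Here is a minimal sketch (mine, not from the talk) of this recipe for µ = Pr(max(x) > τ) with x ∼ N(0, I) in d = 2 dimensions, using the N(θ, I) family, the moment-matching update above, and the 99th-percentile working threshold τ^(k); the constants are illustrative.

import numpy as np

rng = np.random.default_rng(1)
d, tau, n, K = 2, 6.0, 1000, 10

def g(x):                         # statistic whose exceedance defines the rare event
    return x.max(axis=1)

def logratio(x, theta):           # log of p(x)/q_theta(x) for p = N(0, I), q_theta = N(theta, I)
    return -x @ theta + 0.5 * theta @ theta

theta = np.zeros(d)
for k in range(K):
    x = theta + rng.standard_normal((n, d))            # x_i ~ q_theta
    tau_k = min(tau, np.quantile(g(x), 0.99))           # working threshold, ideally -> tau
    H = (g(x) >= tau_k) * np.exp(logratio(x, theta))    # H_i = f(x_i) p(x_i) / q(x_i)
    if H.sum() > 0:
        theta = (H[:, None] * x).sum(axis=0) / H.sum()  # moment-matching CE update

x = theta + rng.standard_normal((n, d))                 # final IS estimate with adapted theta
mu_hat = np.mean((g(x) > tau) * np.exp(logratio(x, theta)))
print(theta, mu_hat)

With f based on max(x), runs of this kind can drift toward a single corner and underestimate µ, as the next two slides illustrate.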

Adaptive Importance Sampling 89

Cross-ent examples

[Figure: cross-entropy iterates for two bivariate Gaussian problems: Gaussian, Pr(min(x) > 6) (left) and Gaussian, Pr(max(x) > 6) (right).]

θ_1 = (0, 0)ᵀ.

Take K = 10 steps with n = 1000 each.

Adaptive Importance Sampling 90

Cross-ent examples

[Figure (repeated from the previous slide): Gaussian, Pr(min(x) > 6) (left) and Gaussian, Pr(max(x) > 6) (right).]

θ_1 = (0, 0)ᵀ.

For min(x), θ_k heads Northeast, and is OK.

For max(x), θ_k heads North or East, and underestimates µ by about 1/2.

Adaptive Importance Sampling 91

Nonparametric AIS

Flexible / infinite-dimensional families Q.

Often based on recursive rectangular partitioning of a region like [0, 1]^d.

1) Vegas, Lepage (1978)

2) Divonne, Friedman & Wright (1979, 1981)

3) Miser, Press & Farrar (1990)

4) Split Vegas, Zhou (1998) ≈ Vegas within Miser

Adaptive Importance Sampling 92

Vegas

q(x) = ∏_{j=1}^d q_j(x_j)

Each q_j is piecewise constant. Equal content, not equal width.

[Figure: a Vegas-style piecewise constant density; density vs. x on [0, 1].]

Very amenable to sampling spikes.
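
As a rough sketch (mine), here is how one such piecewise constant factor can be built with equal-content bins from weighted pilot points, then sampled and evaluated for importance sampling; fit_bins, density and sample are illustrative names.

import numpy as np

rng = np.random.default_rng(2)

def fit_bins(x, w, n_bins=50):
    """Bin edges in [0, 1] so each bin carries equal total weight (equal content, not width)."""
    order = np.argsort(x)
    x, w = x[order], np.maximum(w[order], 1e-12)
    cum = np.cumsum(w) / w.sum()
    inner = np.interp(np.linspace(0.0, 1.0, n_bins + 1)[1:-1], cum, x)
    return np.unique(np.concatenate(([0.0], inner, [1.0])))   # drop duplicate edges

def density(u, edges):
    """Piecewise constant density: mass 1/n_bins per bin, height 1/(n_bins * width)."""
    n_bins = len(edges) - 1
    idx = np.clip(np.searchsorted(edges, u, side="right") - 1, 0, n_bins - 1)
    return 1.0 / (n_bins * np.diff(edges)[idx])

def sample(n, edges):
    """Pick a bin uniformly at random, then a uniform point inside it."""
    n_bins = len(edges) - 1
    idx = rng.integers(0, n_bins, size=n)
    return edges[idx] + rng.uniform(size=n) * np.diff(edges)[idx]

def f(x):                                           # spiky toy integrand on [0, 1]
    return 1.0 / np.sqrt(np.abs(x - 0.3) + 1e-3)

x = rng.uniform(size=5000)                          # pilot points weighted by |f|
edges = fit_bins(x, f(x))
u = sample(10000, edges)
print(np.mean(f(u) / density(u, edges)))            # importance sampling estimate of the integral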

Adaptive Importance Sampling 93

Comments on Vegas

The optimal q_j given the other q_ℓ is

q_j(x_j) ∝ √( ∫_{(0,1)^{d−1}} [ f(x)² / ∏_{ℓ≠j} q_ℓ(x_ℓ) ] ∏_{ℓ≠j} dx_ℓ ).

Update starts by iterating the above. Some complex steps.

A problem with Vegas

Suppose f(x) has two modes in [0, 1]^10, one near (0, 0, ..., 0) and one near (1, 1, ..., 1).

Each q_j needs two modes.

So q has 2^10 modes, mostly irrelevant.

Adaptive Importance Sampling 94

Divonne

Friedman & Wright (1979, 1981)

p = U[0, 1]^d and f has two continuous derivatives.

Local optimization has less curse of dimension than integration.

Newton steps cost O(d³) per iteration.

Recursive partitioning

Hyperrectangles R_ℓ ⊂ [0, 1]^d with ∪_{ℓ=1}^L R_ℓ = [0, 1]^d and vol(R_j ∩ R_k) = 0 for j ≠ k.

∫_{[0,1]^d} f(x) dx = ∑_{ℓ=1}^L ∫_{R_ℓ} f(x) dx

Start with L = 1 and R_1 = [0, 1]^d.

Adaptive Importance Sampling 95

Divonne splits

Find the min and max of f in each rectangle:

f_max(R) = max_{x∈R} f(x),   f_min(R) = min_{x∈R} f(x)

b(R) = vol(R) · |f_max(R) − f_min(R)|   (badness)

Split the rectangle with largest 'badness'.

Non-binary splits

To do the split, decide which of the max or min is farthest from the average of f in the rectangle.

Define a subrectangle to isolate that mode.

Partition the original rectangle into numerous other subrectangles that complement the one around the mode.

An issue: it is not designed for unbounded integrands, and they will give it trouble.
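
The sketch below (mine) is a heavily simplified version of that recursion: random search stands in for Divonne's local optimization when bounding f on a rectangle, and a binary bisection of the longest side stands in for the non-binary splits described above.

import numpy as np

rng = np.random.default_rng(3)

def f(x):                                   # toy integrand on [0, 1]^d
    return np.exp(-50.0 * np.sum((x - 0.3) ** 2, axis=-1))

def badness(lo, hi, n=200):
    """vol(R) * |max f - min f| over R, with the extremes found by crude random search."""
    x = lo + (hi - lo) * rng.uniform(size=(n, len(lo)))
    fx = f(x)
    return np.prod(hi - lo) * (fx.max() - fx.min())

d = 4
rects = [(np.zeros(d), np.ones(d))]         # start with L = 1 and R_1 = [0, 1]^d
for _ in range(20):
    b = [badness(lo, hi) for lo, hi in rects]
    lo, hi = rects.pop(int(np.argmax(b)))   # split the rectangle with largest badness
    j = int(np.argmax(hi - lo))             # here: bisect its longest side
    mid = 0.5 * (lo[j] + hi[j])
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[j], lo_right[j] = mid, mid
    rects += [(lo, hi_left), (lo_right, hi)]

est = sum(np.prod(hi - lo) * f(lo + (hi - lo) * rng.uniform(size=(500, d))).mean()
          for lo, hi in rects)              # stratified MC over the final partition
print(len(rects), est)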

Adaptive Importance Sampling 96

Miser

Press & Farrar (1990); see also Numerical Recipes.

Estimates ∫_{(0,1)^d} f(x) dx by recursive binary partitioning into rectangles.

Each rectangle can be split down the middle in d different directions.

The chosen direction is based on an estimate of which j will lead to the most accurate estimate.

They assume that Miser will be applied left and right, not Monte Carlo, and then employ an assumption about the Miser convergence rate to decide.

Having chosen the split, they allocate sample sizes left and right.

For a mode at (1/2, 1/2, ..., 1/2), split at 1/2 + U(−δ/2, δ/2) to avoid it.
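
A toy sketch (mine) in the spirit of this: a pilot sample picks the bisection direction by comparing sub-rectangle variability, and the remaining budget is allocated in proportion to the estimated standard deviations. The convergence-rate-based criterion actually used, and the jittered split point, are omitted.

import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.exp(-8.0 * np.sum((x - 0.5) ** 2, axis=-1))

def miser_like(lo, hi, n, n_pre=60, n_min=200):
    vol = np.prod(hi - lo)
    if n < n_min:                                    # small budget: plain Monte Carlo leaf
        x = lo + (hi - lo) * rng.uniform(size=(max(n, 10), len(lo)))
        return vol * f(x).mean()
    x = lo + (hi - lo) * rng.uniform(size=(n_pre, len(lo)))   # pilot sample
    fx = f(x)
    mid = 0.5 * (lo + hi)
    best = (0, np.inf, 1.0, 1.0)                     # (direction, score, sd_left, sd_right)
    for j in range(len(lo)):                         # try bisecting each coordinate
        left = x[:, j] < mid[j]
        if left.sum() < 2 or (~left).sum() < 2:
            continue
        sl, sr = fx[left].std(), fx[~left].std()
        if sl + sr < best[1]:                        # crude accuracy proxy
            best = (j, sl + sr, sl, sr)
    j, _, sl, sr = best
    n_rest = n - n_pre
    n_left = int(np.clip(n_rest * (sl + 1e-12) / (sl + sr + 2e-12), 1, n_rest - 1))
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[j], lo_right[j] = mid[j], mid[j]
    return miser_like(lo, hi_left, n_left) + miser_like(lo_right, hi, n_rest - n_left)

d = 3
print(miser_like(np.zeros(d), np.ones(d), 20000))    # estimate of the integral over [0, 1]^3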

Adaptive Importance Sampling 97

Split Vegas

Dissertation of Zhou (1998).

Within a rectangular region, fit Vegas.

If one of the marginal distributions is strongly bimodal, then split the rectangle in that direction.

It can be subtle to identify whether a given margin has two or more modes.

Adaptive Importance Sampling 98

Mixture of products of betas

For ∫_{[0,1]^d} f(x) dx.

Beta density b(x; α, β) ∝ x^{α−1} (1 − x)^{β−1}, α, β > 0.

Can have a bump near α/(α + β) or a spike near 0 and/or near 1.

d-dimensional product of beta densities: ∏_{j=1}^d b(x_j; α_j, β_j)

Mixture of products of beta densities: q(x) = ∑_{m=1}^M γ_m ∏_{j=1}^d b(x_j; α_{jm}, β_{jm})

Adaptive part

Choose γ_m and α_{jm}, β_{jm} by Levenberg-Marquardt nonlinear least squares to minimize

∑_{i=1}^n ( |f(x_i)| − q(x_i; α, β, γ) )²

Owen & Zhou (1999). Many more details in the thesis of Zhou (1998).
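
A small sketch (mine) of the density and the fit, using scipy.optimize.least_squares in place of a hand-rolled Levenberg-Marquardt; the log-parameterization keeps γ_m, α_{jm}, β_{jm} positive, the γ_m are left unnormalized during the fit and normalized afterwards for sampling, and the toy integrand is illustrative.

import numpy as np
from scipy.optimize import least_squares
from scipy.stats import beta

rng = np.random.default_rng(5)
d, M, n = 2, 3, 2000

def q_mix(x, log_gamma, log_ab):
    """sum_m gamma_m prod_j b(x_j; alpha_jm, beta_jm); gamma_m > 0, unnormalized during the fit."""
    gamma = np.exp(log_gamma)
    ab = np.exp(log_ab).reshape(M, d, 2)              # positive alpha_jm, beta_jm
    dens = np.ones((x.shape[0], M))
    for m in range(M):
        for j in range(d):
            dens[:, m] *= beta.pdf(x[:, j], ab[m, j, 0], ab[m, j, 1])
    return dens @ gamma

def f(x):                                             # toy integrand with a spike near the origin
    return 1.0 / np.sqrt(np.sum(x ** 2, axis=1) + 1e-2)

x = rng.uniform(size=(n, d))                          # design points in [0, 1]^d
theta0 = np.zeros(M + M * d * 2)                      # gamma_m = 1, alpha_jm = beta_jm = 1

def residuals(theta):
    return np.abs(f(x)) - q_mix(x, theta[:M], theta[M:])

fit = least_squares(residuals, theta0)                # damped nonlinear least squares
gamma = np.exp(fit.x[:M])
gamma /= gamma.sum()                                  # normalize fitted weights for sampling
print(gamma)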

Adaptive Importance Sampling 99

Particle physics integrand

From Kinoshita (1970). 7-dimensional, with positive and negative integrable singularities.

           MC      Vegas   MISER   ADBayes   MixBeta
Err/µ      0.154   0.092   0.073   0.210     0.035
Err/Êrr    1.163   34.0    0.750   19.5      0.394

K = 20 runs of n = 10^5 points each, except ADBayes.

Known that µ ≈ 0.371005292 as of 1991. (!!)

(Don't recall non-sampling costs.)

Adaptive Importance Sampling 100

AMIS

Adaptive Multiple Importance Sampling, Cornuet, Marin, Mira, Robert (2012)

1) Sample n_1 observations using θ_1

2) Estimate θ_2 from the data and sample more

3) Keep estimating new θ_k and sampling more

4) Combine rounds by multiple importance sampling methods

The weights on observations from one round can change later due to data from other rounds.

This breaks the martingale property, so it is hard to get unbiased estimates.

SNIS is used throughout because the motivation is from Bayesian problems where p is usually not normalized.
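
A compact sketch (mine, not the authors' exact scheme) of the bookkeeping: after each round, every draw from every round is re-weighted against the mixture of all proposals used so far (a deterministic-mixture / balance-heuristic weight), SNIS is applied to the pooled sample, and the next θ is taken from the pooled weighted mean. Gaussian proposals with identity covariance are an illustrative choice.

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(6)
d = 2

def p_unnorm(x):                             # unnormalized target, e.g. a posterior density
    return np.exp(-0.5 * np.sum((x - 3.0) ** 2, axis=1))

def h(x):                                    # integrand whose posterior mean we want
    return x[:, 0]

thetas, xs, counts = [np.zeros(d)], [], []
for k in range(5):
    x = thetas[-1] + rng.standard_normal((1000, d))
    xs.append(x); counts.append(len(x))
    X = np.vstack(xs)
    # deterministic-mixture weights: denominator mixes every proposal used so far
    denom = sum(c * mvn.pdf(X, mean=t, cov=np.eye(d))
                for c, t in zip(counts, thetas)) / sum(counts)
    w = p_unnorm(X) / denom
    w_bar = w / w.sum()                      # SNIS weights (p need not be normalized)
    thetas.append((w_bar[:, None] * X).sum(axis=0))   # next proposal mean

print(np.sum(w_bar * h(np.vstack(xs))))      # SNIS estimate from all rounds combined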

Adaptive Importance Sampling 101

Optimal mixture weights

MS_{α,β} ≡ ∫ ( f p + ∑_j β_j (q_j − p) )² / ( ∑_j α_j q_j ) dx

is jointly convex in α (in the simplex) and β.

So is a sample estimate of it.

Therefore we can estimate optimal mixture weights.

We can also constrain α_j ≥ ε > 0 in a convex optimization, and the variance does not increase much.

He & Owen (2014) sequentially optimize both α and β.
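
As a sketch (mine, loosely in the spirit of He & Owen), here is one way to minimize a sample estimate of MS_{α,β} over α in the simplex with α_j ≥ ε and free β, using scipy.optimize.minimize with SLSQP; the component densities, sample sizes and ε are placeholders.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
J, n, eps = 3, 4000, 0.05
means = np.array([0.0, 2.0, 4.0])                    # placeholder component locations

p_den = lambda x: norm.pdf(x, 0.0, 1.0)              # nominal density p
f = lambda x: (x > 3.0).astype(float)                # rare-ish event indicator

x = means[rng.integers(0, J, size=n)] + rng.standard_normal(n)   # draws from equal-weight mixture
Q = np.stack([norm.pdf(x, m, 1.0) for m in means], axis=1)       # q_j(x_i), shape (n, J)
q_samp = Q.mean(axis=1)                                          # density actually sampled from
fp = f(x) * p_den(x)

def objective(ab):                                   # sample estimate of MS_{alpha, beta}
    alpha, beta_ = ab[:J], ab[J:]
    num = (fp + (Q - p_den(x)[:, None]) @ beta_) ** 2
    return np.mean(num / (Q @ alpha) / q_samp)

cons = [{"type": "eq", "fun": lambda ab: ab[:J].sum() - 1.0}]    # alpha on the simplex
bounds = [(eps, 1.0)] * J + [(None, None)] * J                   # alpha_j >= eps; beta free
ab0 = np.concatenate([np.full(J, 1.0 / J), np.zeros(J)])
res = minimize(objective, ab0, method="SLSQP", bounds=bounds, constraints=cons)
print(res.x[:J], res.x[J:])                          # estimated alpha and beta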

Adaptive Importance Sampling 102

M-PMC

Mixture-Population Monte Carlo, Cappé, Douc, Guillin, Marin & Robert (2008)

q(x) = ∑_{j=1}^J α_j q_j(x; θ_j)

They target the Kullback-Leibler distance to the target distribution.

Also use SNIS.

Aim to optimize over both α_j and θ_j (non-convex).

Use an EM algorithm for the θ updates.

Earlier, Douc, Guillin, Marin & Robert (2007) target a specific integrand.
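
A rough sketch (mine, from memory of the Rao-Blackwellised scheme) of one adaptation step for the mixture weights only, with Gaussian components whose means and covariances are held fixed: SNIS weights are computed under the current mixture, each point is softly assigned to components, and the new α is the SNIS-weighted average of those assignments. The EM updates of the θ_j are omitted.

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(8)
d, J, n = 2, 3, 3000
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # fixed component means (illustrative)

def p_unnorm(x):                                          # unnormalized target
    return np.exp(-0.5 * np.sum((x - 2.0) ** 2, axis=1))

alpha = np.full(J, 1.0 / J)
for t in range(10):
    comp = rng.choice(J, size=n, p=alpha)                 # sample from the current mixture
    x = means[comp] + rng.standard_normal((n, d))
    dens = np.stack([mvn.pdf(x, mean=m, cov=np.eye(d)) for m in means], axis=1)  # (n, J)
    q = dens @ alpha
    w = p_unnorm(x) / q
    w_bar = w / w.sum()                                   # SNIS weights
    rho = (alpha * dens) / q[:, None]                     # soft assignment rho_ij
    alpha = w_bar @ rho                                   # Rao-Blackwellised weight update
print(alpha)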

Adaptive Importance Sampling 103

APIS

Adaptive Population Importance Sampler, Martino, Elvira, Luengo, Corander (2015)

For an unnormalized p(·).

Choose n normalized distributions q_i(·; θ_i, C_i) with mean θ_i and covariance C_i.

Sample x_i ∼ q_i. Get SNIS weights w_i ∝ p(x_i) / ∑_{i′} q_{i′}(x_i; θ_{i′}, C_{i′}).

Every m'th iteration, update the means θ_i but not the covariances C_i, using the previous m − 1 iterations' data.

Avoids "particle collapse".

Empirical assessment.

Adaptive Importance Sampling 104

Stochastic convex programming

Ryu & Boyd (2015)

They take up the exponential family setting of cross-entropy.

They find that in exponential families the variance is a convex function of θ.

So they don't have to use Kullback-Leibler.

µ̂ = (1/n) ∑_{i=1}^n f(x_i) p(x_i) / q(x_i; θ_i)

Potentially a new θ_i for each x_i (or small batches).

Stochastic gradient descent

θ_{n+1} = Proj(θ_n − α_n g_n)

Sequence α_n → 0 with ∑ α_n = ∞.

Proj projects onto Θ.

g_n = g_n(θ_n, x_n) is the n'th gradient estimate.

µ̂_n attains the optimal variance × (1 + O(n^{−1/2})).
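
A bare-bones sketch (mine) of that recursion for p = N(0, I) and the q_θ = N(θ, I) family: each draw feeds a running estimate µ̂ that uses its own θ_n, and an unbiased gradient estimate of ∫ (fp)² / q_θ dx, with θ projected back onto a box Θ. The constants and the box are illustrative.

import numpy as np

rng = np.random.default_rng(9)
d, tau, N = 2, 3.0, 100000
box = 8.0                                        # Theta = [-8, 8]^d (illustrative)

def f(x):                                        # rare event indicator
    return float(x.max() > tau)

def logratio(x, th):                             # log p(x)/q_theta(x), p = N(0, I), q = N(theta, I)
    return -x @ th + 0.5 * th @ th

theta, total = np.zeros(d), 0.0
for n in range(1, N + 1):
    x = theta + rng.standard_normal(d)           # x_n ~ q_{theta_n}
    w = np.exp(logratio(x, theta))
    total += f(x) * w                            # running sum for mu_hat
    g = -(f(x) * w) ** 2 * (x - theta)           # unbiased gradient of int (fp)^2 / q_theta
    theta = np.clip(theta - (0.5 / np.sqrt(n)) * g, -box, box)   # step alpha_n, then Proj

print(theta, total / N)                          # adapted theta and mu_hat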

Adaptive Importance Sampling 105

Kriging based estimators

Picheny (2009), Balesdent, Morio & Marzat (2013), Dalbey & Swiler (2014)

Rare event φ(x) > τ. We get φ(x_i), not just f(x) = 1_{φ(x)>τ}.

Kriging

Given φ(x_1), ..., φ(x_n) it can predict φ(x) elsewhere by Gaussian process interpolation, giving a posterior mean φ̂(x) and standard deviation s(x).

Swiler & West (2010) report that ∫ p(x) 1_{φ̂(x)>τ} dx does not work well.

Better to use ∫ p(x) Φ((τ − φ̂(x))/s(x)) dx. Chevalier, Ginsbourger, Picheny, Richet, Bect, Vazquez (2014).

Dalbey & Swiler (2014) use φ̂ to update q.
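
A small sketch (mine) with scikit-learn's Gaussian process regressor: fit the kriging model to a handful of φ evaluations, then average the posterior exceedance probability Φ((φ̂(x) − τ)/s(x)) over draws from p. I use the exceedance form of the smoothed indicator for the event φ(x) > τ; the toy φ, kernel and sample sizes are illustrative, and the adaptive step of using φ̂ to update q is not shown.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(10)
d, tau = 2, 2.2

def phi(x):                                   # expensive response surface (toy stand-in)
    return np.sum(x ** 2, axis=1) ** 0.5 + 0.3 * np.sin(4.0 * x[:, 0])

X_train = rng.standard_normal((60, d))        # small design where phi was actually evaluated
y_train = phi(X_train)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)

x = rng.standard_normal((20000, d))           # draws from p = N(0, I)
m, s = gp.predict(x, return_std=True)         # kriging mean and posterior std. dev.
p_smooth = np.mean(norm.cdf((m - tau) / np.maximum(s, 1e-9)))   # smoothed indicator
p_plugin = np.mean(m > tau)                   # hard plug-in, reported to work less well
print(p_smooth, p_plugin)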

Adaptive Importance Sampling 106

Thanks

• Yi Zhou, Hera He, co-authors

• Minyong Lee, discussion

• Toichiro Kinoshita (particle physics code and integral)

• U.S. National Science Foundation DMS-1521145, DMS-1407397

• Michael Chertkov

• Kacy Hopwood

Grid Science Winter School, January 2017