TRANSCRIPT
Adaptive Importance Sampling 1
Adaptive Importance Sampling
Art B. Owen
Stanford University
Grid Science Winter School, January 2017
Adaptive Importance Sampling 2
These are slides that I presented at Grid Science 2017 in Los Alamos NM, on Wednesday
January 11.
I have amended them, correcting typos and adding a few more references. I have also added a few
interstitial slides like this one to cover things that I either said or should have said during the
presentation.
Adaptive Importance Sampling 3
Outline
1) Basic Importance Sampling.
   Often our only chance to solve a problem.
2) Mixture Distributions.
   The key to effective use of IS. Also an Oscar.
3) What-if Simulations.
   Using data from one sim to learn how another would have worked.
4) Adaptive Importance Sampling.
   Still mostly intuitive and ad hoc.
Adaptive Importance Sampling 4
What is missing
1) Some paywalled articles.
2) Many articles from application areas.
3) Some very advanced / hard to explain developments.
Google Scholar, Dec 2016
“adaptive importance sampling”
About 3,040 results (0.07 sec)
Since 2012: About 1,070 results (0.10 sec)
Adaptive Importance Sampling 5
At the meeting, several people told me about the importance sampling work of Ian Dobson on
estimating rare event probabilities for power system cascades. So I’m looking into those now.
Adaptive Importance Sampling 6
The canonical problem
Approximate µ = ∫_{R^d} h(x) dx.

Use the first one that works of:
1) Symbolic / closed form solutions
2) Numerical quadrature
3) Plain Monte Carlo
4) Importance Sampling
5) Adaptive Importance Sampling

Notes
• Some problems have d = ∞ and/or discrete x.
• We may need Markov chain Monte Carlo (MCMC) or sequential Monte Carlo (SMC).
• Sometimes we can use quasi-Monte Carlo (QMC).
Adaptive Importance Sampling 7
There are problems where none of the five approaches on the previous slide will get the answer.
Quasi-Monte Carlo sampling can be used to great effect when the underlying integrands are just
a little bit more regular than Monte Carlo uses. With randomized QMC, RQMC, one gets an
answer that is seldom worse than plain MC or plain QMC and can be much better. This talk was
just about MC.
MCMC and SMC can be brought to bear on problems that are less regular/favorable than plain
MC uses. Those methods can handle even harder problems because they involve flexible
explorations of the problem space. It can be hard to tell if/when they work.
Adaptive Importance Sampling 8
Closed form
Sometimes we can get a closed form.
Symbolic computation (Mathematica, Maple, Sage, etc.) helps.
Unfortunately
We can seldom do this for problems with real-world complexity,
so we resort to numerical methods.
Adaptive Importance Sampling 9
Numerical quadrature
For instance, tensor products of Simpson's rule.
Evaluate h at n points x_i ∈ R^d.
Can attain an error rate of O(n^{−r/d}) for h ∈ C^{(r)}(R^d).
Unfortunately
Low smoothness (small r) or high dimension (large d) make the methods ineffective.
Adaptive Importance Sampling 10
Monte Carlo
Write h(x) = f(x)p(x) for a probability density function p(·). Then

µ = ∫ h(x) dx = ∫ f(x)p(x) dx = E(f(x)),  x ∼ p.

Estimate
µ̂ = (1/n) ∑_{i=1}^n f(x_i),  x_i ∼ p

√n (µ̂ − µ) →d N(0, σ²)  (CLT),  σ² = E((f(x) − µ)²)

Upshot
Very competitive rate O(n^{−1/2}) when r/d < 1/2. We get confidence intervals too.

Unfortunately
(σ/√n)/µ might be large, or even infinite.
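As a minimal illustration of the plain MC estimate and its CLT standard error (my own example, not from the slides): take p = N(0,1) and f(x) = x², so the true mean is E(x²) = 1 and σ² = Var(x²) = 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)          # x_i ~ p = N(0,1)
fx = x**2                           # f(x_i); true mean is E(x^2) = 1
mu_hat = fx.mean()
se = fx.std(ddof=1) / np.sqrt(n)    # CLT-based standard error, sigma/sqrt(n)
print(mu_hat, se)                   # se should be near sqrt(2)/1000
```

With n = 10⁶, the standard error is about √2/1000 ≈ 0.0014, illustrating the O(n^{−1/2}) rate.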
Adaptive Importance Sampling 11
Places where MC has trouble
Rare events
• Bankruptcy / default for finance / insurance
• Energy grid failure, Bent, Backhaus & Yamangil (2014);
  cascade simulations, Chertkov (2012)
• Product failure, e.g., battery
• Monograph by Rubino & Tuffin (2009)
Spiky/singular integrands
• High energy physics (Feynman diagrams),
  Aldins, Brodsky, Dufner, Kinoshita (1970);
  multiple singularities in [0, 1]^7 that dominate the integral
• Graphical rendering,
  Kollig & Keller (2006); integrand includes factors like ‖x − x0‖^{−1}
• Particle transport, e.g., T. Booth
Adaptive Importance Sampling 12
Rare events
f(x) = 1_A(x) = { 1, x ∈ A;  0, else },

µ = P(x ∈ A) = ∫_A p(x) dx = ε

Coefficient of variation of µ̂:
(σ/√n)/µ = √(ε(1−ε)) / (√n ε) ≈ 1/√(nε)
To get cv = 0.1 takes n > 100/ε.
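A quick numerical check of that count (ε = 10⁻³ here is my own choice, not a value from the slides): with n = 100/ε draws, the coefficient of variation of the plain MC estimate of ε is about 0.1.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 1e-3
n = int(100 / eps)                  # n = 100/eps draws
# theoretical cv of the plain MC estimate of eps
cv = np.sqrt(eps * (1 - eps)) / (np.sqrt(n) * eps)
# simulate: each draw hits the rare set A with probability eps
hits = rng.random(n) < eps
mu_hat = hits.mean()
rel_err = abs(mu_hat - eps) / eps
print(cv, mu_hat, rel_err)
```

The relative error of a single run is then typically around 10%, and shrinking ε without growing n makes it worse.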
Adaptive Importance Sampling 13
Singularities
f(x) ≈ ‖x − x0‖^{−r},  x ∈ A ⊂ R^d
0 < a ⩽ p(x) ⩽ b < ∞, for x ∈ A.
f is integrable over bounded A containing x0 iff r < d.
Then f² is integrable iff 2r < d.
For r ∈ [d/2, d):
µ exists but σ² = ∞;
Monte Carlo converges, but more slowly than O_p(1/√n).

Other singularities
‖x − x0‖_p^{−r},  min(x1, x2, …, xd)^{−r},
min(x1, 1−x1, x2, 1−x2, …, xd, 1−xd)^{−r}
Adaptive Importance Sampling 14
Very often singularities are simply not a big problem for MC, and the interesting thing is that high
dimension can make singularities less harmful. This is a 'concentration of measure' result, as
described below.

Let f be singular on some subset of the input space. For simplicity, take subsets of the form
Ω_{d,p}(ρ) = {x ∈ R^d | ‖x‖_p ⩽ ρ} for integer d ⩾ 1 and 1 ⩽ p ⩽ ∞, and f(x) = ‖x‖_p^{−r}.
Let x be uniformly distributed on Ω_{d,p} = Ω_{d,p}(1) of volume |Ω_{d,p}|. Then
µ = ∫_0^1 y^{d−1−r} dy · |Ω_{d−1,p}|/|Ω_{d,p}| = (d−r)^{−1} |Ω_{d−1,p}|/|Ω_{d,p}|. The singularity is severe
if, say, half of µ comes from a very tiny region around 0.
The radius ρ that captures half of the integral satisfies ∫_0^ρ y^{d−1−r} dy = (1/2) ∫_0^1 y^{d−1−r} dy,
and it is ρ = 2^{−1/(d−r)}. The relative volume is
|Ω_{d,p}(ρ)|/|Ω_{d,p}(1)| = ρ^d = 2^{−d/(d−r)}. If r is close to d, then half of the integral comes from
a very tiny fraction of the volume. If r = 1 and d grows, then half of the integral comes from
about half of the volume, hence not much of a spike. If r < d/2, so the variance is finite, then
this volume is larger than 2^{−d/(d−d/2)} = 1/4. Once again, not a very prominent spike.
For singularities on a k-dimensional manifold, replace d by d − k.
Conclusion: rare events described by 0/1 random variables can be much harder to handle than
unbounded random variables.
Adaptive Importance Sampling 15
Importance sampling
For rare events and (some) singularities there is a region A of relatively small P(x ∈ A) that
dominates the integral. Taking x ∼ p does not get enough data from the important region A.

Main idea
Get more data from A, and then correct the bias.
Kahn (1950), Kahn & Marshall (1953), Trotter & Tukey (1956)

Good news: Importance sampling can solve these sampling problems.
Bad news: It can also fail spectacularly.

For a density q,
f(x)p(x) = (f(x)p(x)/q(x)) q(x)

µ̂_q = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q(x_i),  x_i ∼ q

For now, assume that p(x)/q(x) is computable.
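Here is a hypothetical rare-event instance of the estimator above (the densities and threshold are my own choices): p = N(0,1), f(x) = 1{x > 4}, and a sampler q = N(4,1) centered on the rare set.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)
n = 100_000
truth = 0.5 * erfc(4 / sqrt(2))            # P(x > 4) for x ~ N(0,1), about 3.17e-5

# Importance sample from q = N(4,1), centered on the rare set A = {x > 4}
x = rng.standard_normal(n) + 4.0
w = np.exp(-x**2 / 2 + (x - 4.0)**2 / 2)   # likelihood ratio p(x)/q(x) = exp(8 - 4x)
fw = (x > 4.0) * w
mu_hat = fw.mean()
se = fw.std(ddof=1) / np.sqrt(n)
print(mu_hat, se, truth)
```

Plain MC would need n of order 100/ε ≈ 3 × 10⁶ just to get cv = 0.1 here; the IS estimate has a far smaller standard error at n = 10⁵.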
Adaptive Importance Sampling 16
Unbiasedness
If q(x) > 0 whenever f(x)p(x) ≠ 0, then for x_i ∼ q,

E_q(µ̂_q) = ∫ f(x) (p(x)/q(x)) q(x) dx = ∫ f(x)p(x) dx = µ.

Sufficient condition
q(x) > 0 whenever p(x) > 0.
Yields unbiasedness for generic f.

Notation
E_q(·), Var_q(·), etc., are for x_i ∼ q.
w(x) ≡ p(x)/q(x),  E_p(g(x)) = E_q(g(x)w(x)).
Adaptive Importance Sampling 17
Variance of IS
Based on O (2013), Ch. 9.1

Var_q(µ̂_q) = σ²_q/n, where σ²_q = Var_q((fp)/q)

σ²_q = E_q( ((fp)/q)² ) − µ² = ∫ (fp)²/q dx − µ²

After some algebra,

σ²_q = ∫ (fp − µq)²/q dx

Lessons
1) q(x) ∝ f(x)p(x) is best, when f ⩾ 0.
2) q near 0 is dangerous.
3) Bounding (fp)²/q is useful theoretically.
Adaptive IS is about using sample data to select q.
Adaptive Importance Sampling 18
Proportionality
For f ⩾ 0, taking q = fp/µ yields σ²_q = 0. Specifically,

µ̂_q = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q(x_i) = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/(f(x_i)p(x_i)/µ) = µ

but we need to know µ to use it.
We look for q nearly proportional to fp.

More generally
q ∝ |f|p is always optimal, but not zero variance if f takes both positive and negative values.

For f ⩾ 0:
q ∝ f is not best (unless p is flat),
q ∝ p is not best (unless f is flat).
Importance involves both f and p via fp.
Adaptive Importance Sampling 19
Example
[Figure: densities f, q, and p plotted together for x from −5 to 15.]
µ ≐ 1.74 × 10^{−24},  Var(µ̂_q) = 0
Gaussian p and f ⟹ the Gaussian optimal q lies between them.
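The zero-variance phenomenon can be checked numerically with a hypothetical one-dimensional version (my own numbers, not those in the figure): p = N(0,1) and f(x) = exp(−(x−10)²/2) ⩾ 0, so fp is proportional to a N(5, 1/2) density; taking q = N(5, 1/2) makes f(x)p(x)/q(x) constant.

```python
import numpy as np
from math import exp, sqrt, pi

rng = np.random.default_rng(3)
n = 10_000
mu_true = exp(-25) / sqrt(2)        # closed form for the integral of f(x)p(x)

def f(x):                           # nonnegative Gaussian-bump integrand
    return np.exp(-(x - 10.0)**2 / 2)

def p(x):                           # nominal density N(0,1)
    return np.exp(-x**2 / 2) / sqrt(2 * pi)

def q(x):                           # optimal q = fp/mu, here the N(5, 1/2) density
    return np.exp(-(x - 5.0)**2) / sqrt(pi)

x = rng.normal(5.0, sqrt(0.5), n)   # draws from q
vals = f(x) * p(x) / q(x)           # every term equals mu, up to rounding
print(vals.mean(), vals.std())
```

Every sampled ratio equals µ, so the estimator has (numerically) zero variance; of course, constructing this q required knowing µ.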
Adaptive Importance Sampling 20
Here is a positivisation trick from O & Zhou (2000).
If f takes both positive and negative values then we can write

µ = ∫ f(x)p(x) dx = ∫ f₊(x)p(x) dx − ∫ f₋(x)p(x) dx

for nonnegative functions f₊(x) = max(0, f(x)) and f₋(x) = max(0, −f(x)). Then we can use

µ̂ = (1/n₊) ∑_{i=1}^{n₊} f₊(x_i)p(x_i)/q₊(x_i) − (1/n₋) ∑_{i=1}^{n₋} f₋(x_i)p(x_i)/q₋(x_i)

for x_i from q±. There then exist zero variance sampling distributions q± that we might
approximate. More generally we can use

µ̂ = c + (1/n₊) ∑_{i=1}^{n₊} (f(x_i) − c)₊ p(x_i)/q₊(x_i) − (1/n₋) ∑_{i=1}^{n₋} (f(x_i) − c)₋ p(x_i)/q₋(x_i)

or

µ̂ = E_p(g(x)) + (1/n₊) ∑_{i=1}^{n₊} (f(x_i) − g(x_i))₊ p(x_i)/q₊(x_i) − (1/n₋) ∑_{i=1}^{n₋} (f(x_i) − g(x_i))₋ p(x_i)/q₋(x_i)

for some g with known expectation.
Adaptive Importance Sampling 21
Tails of q

σ²_q = ∫ (fp − µq)²/q dx = ∫ (fp)²/q dx − µ²

The term depending on q
Recall w(x) = p(x)/q(x).

∫ (fp)²/q dx = ∫ f²w²q dx = E_q(f²w²) = ∫ f²w p dx = E_p(f²w)

Consequence
µ² + σ²_q = E_p(f²w) = E_q(f²w²)
∴ bounding w(x) = p(x)/q(x) is advantageous for many f.
Adaptive Importance Sampling 22
Gaussian p with t-distributed q
Multivariate Gaussian, θ, Σ:
p(x) ∝ exp(−(x − θ)ᵀΣ^{−1}(x − θ)/2)
Multivariate t, θ, Σ, degrees of freedom ν > 0:
q(x) ∝ (1 + (x − θ)ᵀΣ^{−1}(x − θ)/ν)^{−(ν+d)/2}

The t distribution has heavier tails than the Gaussian:
sup_x p(x)/q(x) is bounded;
sup_x p(x)exp(θᵀx)/q(x) is bounded.
Geweke (1989): split t (at 0).
Reversing the roles of t and Gaussian leads to unbounded w.
Adaptive Importance Sampling 23
q nearly proportional to p
[Figure: left panel, nearly proportional densities (Gaussian and scaled t, 25 df); right panel, their density ratios on a log scale, for x from −6 to 6.]
A light-tailed q can make p/q diverge.
Can have σ²_q = ∞ even for f(x) = 1.
Ironically, it is the unimportant region that causes trouble.
Adaptive Importance Sampling 24
Beyond variance
Chatterjee & Diaconis (2015) show that the sample size required is roughly
n ≈ exp(KL(p‖q)) for generic f.
This analysis involves E_q(|µ̂_q − µ|) and P_q(|µ̂_q − µ| > ε) instead of Var_q(µ̂_q). It avoids
working with variances, which can be harder to handle than means.
Taking ε = .025 in their Theorem 1.2 (for 95% confidence of a small error) shows that we succeed
with n ⩾ 6.55 × 10¹² × exp(KL). Similarly, poor results are very likely for n much smaller
than exp(KL).
The range for n is not precisely determined by these considerations (yet).
Adaptive Importance Sampling 25
Self-normalized importance sampling
Based on O (2013, ch 9.2)

p(x) = c_p p_u(x),  q(x) = c_q q_u(x),  0 < c_p, c_q < ∞

µ̂_q = (c_p/c_q) (1/n) ∑_{i=1}^n f(x_i)p_u(x_i)/q_u(x_i).

If p_u, q_u are computable, but c_p/c_q is unknown:

w_u(x_i) = w_{u,q}(x_i) = p_u(x_i)/q_u(x_i)
w_i = w_{i,q} = w_u(x_i) / ∑_{i'=1}^n w_u(x_{i'})

µ̃_q = ∑_{i=1}^n w_i f(x_i)
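A sketch of the SNIS estimator with unnormalized densities (my own toy setup): p_u ∝ N(0,1) with its constant dropped, q = N(0, 2²) likewise unnormalized, and f(x) = x², whose true mean under p is 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

def p_u(x):                         # unnormalized target, proportional to N(0,1)
    return np.exp(-x**2 / 2)

def q_u(x):                         # unnormalized proposal, proportional to N(0,4)
    return np.exp(-x**2 / 8)

x = rng.normal(0.0, 2.0, n)         # draws from q
w_u = p_u(x) / q_u(x)               # unnormalized weights w_u = p_u/q_u
w = w_u / w_u.sum()                 # self-normalized weights, summing to 1
f = x**2                            # E_p f(x) = 1 for p = N(0,1)
mu_tilde = np.sum(w * f)
print(mu_tilde)
```

Neither normalizing constant is used anywhere; the normalization of the weights cancels c_p/c_q.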
Adaptive Importance Sampling 26
Self-normalized importance sampling
• µ̃_q has some bias, asymptotically negligible
• Self-normalized sampling competes with acceptance-rejection
• It does not attain zero variance for any q (and non-trivial f)

Adaptation
There are adaptive versions of both IS and SNIS.
Adaptive Importance Sampling 27
SNIS is consistent
By the Law of Large Numbers,

µ̃_q = (1/n) ∑_{i=1}^n f(x_i)p_u(x_i)/q_u(x_i) ÷ (1/n) ∑_{i=1}^n p_u(x_i)/q_u(x_i)
    → ∫ f(x) (p_u(x)/q_u(x)) q(x) dx ÷ ∫ (p_u(x)/q_u(x)) q(x) dx   (LLN)
    = (c_q/c_p) ∫ f(x)p(x) dx ÷ (c_q/c_p) ∫ p(x) dx
    = µ

Recall the weights
w_i = (p_u(x_i)/q_u(x_i)) / ∑_{i'=1}^n p_u(x_{i'})/q_u(x_{i'})
Adaptive Importance Sampling 28
SNIS variance
Delta method variance approximation for a ratio estimate:

Var(µ̃_q) ≈ σ²_{q,sn}/n,  σ²_{q,sn} = E_q(w(x)²(f(x) − µ)²)

Rewrite and compare

σ²_{q,sn} = ∫ (p²/q)(f − µ)² dx = ∫ (fp − µp)²/q dx   vs   σ²_q = ∫ (fp − µq)²/q dx

No q can make σ²_{q,sn} = 0 (unless f is constant).
Adaptive Importance Sampling 29
Exactness

µ̂_q = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q(x_i)

µ̃_q = (1/n) ∑_{i=1}^n f(x_i)p_u(x_i)/q_u(x_i) ÷ (1/n) ∑_{i=1}^n p_u(x_i)/q_u(x_i)

Constants f(x) = c
µ̃_q = c, while generally µ̂_q ≠ c.
We don't usually try to average constants, but getting that wrong means:
1) P̂(x ∈ B) + P̂(x ∉ B) ≠ 1, and
2) Ê(f(x) + c) ≠ Ê(f(x)) + c.
Adaptive Importance Sampling 30
Optimal SNIS q

q_opt ∝ |f(x) − µ| p(x)   Hesterberg (1988)
∴ σ²_{q,sn} ⩾ E_p(|f(x) − µ|)²

Best possible variance reduction from SNIS:

σ²/σ²_{q_opt,sn} = E_p((f(x) − µ)²) / E_p(|f(x) − µ|)²

Versus plain IS
For f ⩾ 0, the best IS gets σ²_q = 0.
Adaptive Importance Sampling 31
Rare events
f(x) = 1_{x∈A},  µ = P(x ∈ A) = ε

Optimal q
q_opt(x) ∝ q_{opt,u}(x) = |f(x) − µ| p(x) = { (1−ε)p(x), x ∈ A;  εp(x), x ∉ A }

Normalize q
∫ q_{opt,u}(x) dx = 2ε(1−ε)
∫_A q_opt(x) dx = ε(1−ε)/(2ε(1−ε)) = 1/2  (!!)

Versus plain IS
The best plain IS q = pf/µ puts all probability on A, not half.
Adaptive Importance Sampling 32
Rare event efficiency
σ_{q_opt,sn} = E_p(|f(x) − µ|) = ε(1−ε) + (1−ε)ε = 2ε(1−ε) ≈ 2ε

Best possible coefficient of variation σ_{q,sn}/µ is ≈ 2.

Versus plain MC as µ = ε → 0:
√Var(µ̃_{q_opt,sn})/µ ≈ 2/√n   (optimal SNIS)
√(ε(1−ε)/n)/µ ≈ (1/√ε)(1/√n)   (plain MC)

Optimal SNIS would be very good: it gets cv = 0.1 for n = 400 vs 100/ε.
Plain IS can be even better (lower bound 0).
Adaptive Importance Sampling 33
When to use SNIS
If we only have p_u or q_u then we cannot use IS, and SNIS is available.
I.e., use SNIS when we can draw x ∼ q and evaluate p_u/q_u but not p/q.

Suppose we could do both?
We can then use ∫ p(x) dx = E_q(w(x)) = 1 to define a control variate.
(Much more later.)
The resulting estimate is then exact for q = fp/µ and for f(x) = c.
Asymptotically this combination beats both plain IS and SNIS.
That is, when we have normalized p and q, SNIS is not best.
Adaptive Importance Sampling 34
Effects of dimension
p(x) = ∏_{j=1}^d p_j(x_j | x_1, …, x_{j−1})
q(x) = ∏_{j=1}^d q_j(x_j | x_1, …, x_{j−1})

w(x) = ∏_{j=1}^d p_j(x_j | x_1, …, x_{j−1}) / q_j(x_j | x_1, …, x_{j−1}) ≡ ∏_{j=1}^d w_j(x_j | x_1, …, x_{j−1}).

We can write p and q as above even if we generate them some other way.
Each E_q(w_j(· | ·)) = 1 and E_q(w_j(· | ·)²) ⩾ 1.
As a result E_q(w(x)²) can grow exponentially with d, affecting both IS and SNIS.
If q_j gets closer to p_j as j increases we might be ok.
Or if f is specially related to p/q.
Adaptive Importance Sampling 35
Gaussian p and q
For p = N(0, I) and q = N(θ, I):

w(x) = exp(−θᵀx + θᵀθ/2),  E_q(w(x)²) = exp(θᵀθ).

For θ = (1, 1, 1, …, 1)ᵀ we get E_q(w(x)²) = exp(d).
Similar results hold for θ = (1, 0, 0, …, 0)ᵀ and θ = (1/√d, 1/√d, …, 1/√d)ᵀ.

Variance change
Taking q = N(0, σ²I):

E_q(w(x)²) = ( σ²/√(2σ²−1) )^d for σ² > 1/2, and ∞ otherwise.

We need σ² closer to 1 as d grows, e.g., 1 + O(1/d).
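A small numerical check of E_q(w(x)²) = exp(θᵀθ), using my own parameters (d = 5 and a modest shift θ = (0.2, …, 0.2) so that the Monte Carlo noise is low):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 400_000
theta = np.full(d, 0.2)                      # modest shift, theta'theta = 0.2
x = rng.standard_normal((n, d)) + theta      # x ~ q = N(theta, I)
# w(x) = p(x)/q(x) for p = N(0, I): exp(-theta'x + theta'theta/2)
w = np.exp(-x @ theta + theta @ theta / 2)
est = np.mean(w**2)
print(est, np.exp(theta @ theta))            # both close to exp(0.2)
```

For larger shifts the check still holds in theory, but the sample average of w² becomes dominated by rare huge weights, which is exactly the exponential-in-d trouble the slide describes.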
Adaptive Importance Sampling 36
Infinite dimension
x = (x_1, x_2, …).
The function f(x) depends on only the first M = M(x) components of x.

Assume M is a stopping time
I.e., we can tell whether M(x) = m from x_1, x_2, …, x_m. For example:
M = min{m ⩾ 1 | ∑_{i=1}^m x_i > 100}.

We must choose q(·) so that P_q(M < ∞) = 1, so that we can compute f in finite time.

The estimate

µ̂_q = (1/n) ∑_{i=1}^n f(x_i) ∏_{j=1}^{M(x_i)} p(x_{ij} | x_{i1}, …, x_{i,j−1}) / q(x_{ij} | x_{i1}, …, x_{i,j−1})

Note: the weight w(x) only uses ∏_j p/q for j ⩽ M.
We don't need the infinite product.
Adaptive Importance Sampling 37
Infinite dimension, ctd.
Technicality
E_q(µ̂_q) = E_p(f(x) 1_{M(x)<∞}).
If P_p(M(x) < ∞) = 1 then E_q(µ̂_q) = E_p(f(x)).

The technicality is useful
Suppose we want P_p(M(x) < ∞). Choose f(x) = 1. Then
E_q(µ̂_q) = P_p(M(x) < ∞).
Siegmund (1976)
Adaptive Importance Sampling 38
Weights and effective sample size
Kong (1992)

µ̂_q = (1/n) ∑_{i=1}^n w_i f(x_i),  w_i = p(x_i)/q(x_i),  W ≡ ∑_i w_i

If a small number of w_i dominate the others then µ̂_q can be unstable.
An effectively smaller number of f(x_i) are being averaged.

Effective sample size
n_e = (∑_i w_i)² / ∑_i w_i²

n_e = n if all w_i are equal. n_e = 1 if only one w_i is positive.

Slippery derivation
It follows from the variance of a weighted sum of IID random variables.
In IS, the random variables and the weights are both functions of the same x.
n_e is the same for IS and SNIS.
Adaptive Importance Sampling 39
Effective sample size
n_e = (∑_i w_i)² / ∑_i w_i² ≈ n/(1 + cv(w)²)

Population counterpart
n_e* = n E_q(w)²/E_q(w²) = n/E_q(w²) = n/E_p(w)

Larger w are like getting fewer data points.

Interpretation
Plain MC has n_e = n and we don't want that. IS has n_e < n.
We just want to detect very badly imbalanced weights via a small n_e.

Integrand-specific version
Evans & Swartz (1995)
Use w_i(f) = |f(x_i)|p(x_i)/q(x_i), W(f) = ∑_i w_i(f), w̄_i(f) = w_i(f)/W(f).
n_e(f) = 1 / ∑_{i=1}^n w̄_i(f)²
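A sketch of these diagnostics for a Gaussian shift sampler (my own example): p = N(0,1), q = N(1,1), so θᵀθ = 1 and the population value is n_e* = n/E_q(w²) = n e^{−1}.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
x = rng.standard_normal(n) + 1.0    # draws from q = N(1,1); nominal p = N(0,1)
w = np.exp(-x + 0.5)                # w = p/q = exp(-theta*x + theta^2/2), theta = 1
ne = w.sum()**2 / (w**2).sum()      # sample effective sample size
ne_pop = n / np.exp(1.0)            # population value n / E_q(w^2) = n exp(-theta'theta)
print(ne, ne_pop)
```

The sample n_e is itself a weighted-moment estimate, so it is noisy when the weights are badly imbalanced; here the imbalance is mild and the two values agree well.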
Adaptive Importance Sampling 40
PERT example
Ten steps in a software project. Some cannot start until previous ones are done.

j   Task                           Predecessors   Mean Duration
1   Planning                       None           4
2   Database Design                1              4
3   Module Layout                  1              2
4   Database Capture               2              5
5   Database Interface             2              2
6   Input Module                   3              3
7   Output Module                  3              2
8   GUI Structure                  3              3
9   I/O Interface Implementation   5,6,7          2
10  Final Testing                  4,8,9          2

Table 1: A PERT problem from Chinneck (2009, Chapter 11)
Adaptive Importance Sampling 41
PERT ctd
[Figure: PERT graph (activities on nodes 1–10).]
Adaptive Importance Sampling 42
PERT ctd
For the given mean times, the project completes in E = E_10 = 15 days.
(They seem optimistic.)

Start time            Time taken   End time
S_j = max_{k→j} E_k   T_j          E_j = S_j + T_j

E.g., from the graph: S_9 = max(E_5, E_6, E_7).
For independent exponential random variables T_j, we get E(E_10) ≐ 18.
Suppose there's a large penalty when E > 70.

Plain MC
Got E > 70 twice in n = 10,000 tries.
Let's try importance sampling.
Adaptive Importance Sampling 43
Change the exponential rates
Change T_j from exponential with mean θ_j to mean λ_j.

µ̂_λ = (1/n) ∑_{i=1}^n 1_{E_{i,10}>70} ∏_{j=1}^{10} [exp(−T_{ij}/θ_j)/θ_j] / [exp(−T_{ij}/λ_j)/λ_j].

We can make E_λ(E_10) ≈ 70 by taking λ = 4θ.
We get µ̂_λ ≐ 2.01 × 10^{−5} with a standard error of 4.7 × 10^{−6}.

Diagnostics
We also get n_e ≐ 4.90. This is quite low.
The largest weight was about 43% of the total.
The standard error did not warn us about this problem.
The standard error is more vulnerable to large weights than the mean is.
An effective sample size for the se is n_{e,σ} ≐ 1.15. O (2013), Chapter 9.3
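A sketch of the λ = 4θ experiment, using the task means and precedence structure from Table 1 (the particular numbers it prints depend on the random seed, and my run need not reproduce the slide's values exactly):

```python
import numpy as np

rng = np.random.default_rng(7)
theta = np.array([4, 4, 2, 5, 2, 3, 2, 3, 2, 2], float)   # mean durations, Table 1
pred = {1: [], 2: [1], 3: [1], 4: [2], 5: [2], 6: [3],
        7: [3], 8: [3], 9: [5, 6, 7], 10: [4, 8, 9]}
lam = 4 * theta                     # sample with inflated means lambda = 4*theta
n = 100_000
T = rng.exponential(lam, size=(n, 10))

# End times E_j = S_j + T_j with start S_j = max over predecessors' end times
E = np.zeros((n, 10))
for j in range(1, 11):
    S = np.zeros(n)
    for k in pred[j]:
        S = np.maximum(S, E[:, k - 1])
    E[:, j - 1] = S + T[:, j - 1]

# Likelihood ratio p/q for independent exponentials
w = np.prod((lam / theta) * np.exp(-T / theta + T / lam), axis=1)
fw = (E[:, 9] > 70) * w
mu_hat = fw.mean()
se = fw.std(ddof=1) / np.sqrt(n)
ne = w.sum()**2 / (w**2).sum()
print(mu_hat, se, ne)
```

As on the slide, the effective sample size comes out far below n, which is the diagnostic that this all-tasks tilting is a poor choice.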
Adaptive Importance Sampling 44
Try different λ
Replace λ = 4θ by λ = κθ and search for a good κ. (None seemed good.)

Unequal scaling
Adding a day to some tasks does not always add a day to the project.
Details: T_j = θ_j + 1, T_k = θ_k for k ≠ j. Critical path: j ∈ {1, 2, 4, 10}.
Even doubling some tasks' time might not cause a delay.
Details: T_j = 2θ_j, T_k = θ_k for k ≠ j. Changes for: j ∉ {3, 7, 8}.

Scale tasks 1, 2, 4, 10 by κ:

κ             3.0   3.5   4.0   4.5   5.0
10⁵ × µ̂       3.3   3.6   3.1   3.1   3.1
10⁶ × se(µ̂)   1.8   2.1   1.6   1.5   1.6
n_e           939   564   359   239   165
n_{e,σ}       165    88    52    33    23
Adaptive Importance Sampling 45
Followup
Using κ = 4 on the critical path with n = 200,000 yields
µ̂_q = 3.18 × 10^{−5}, se = 3.62 × 10^{−7}, n_e ≐ 7470, n_{e,σ} ≐ 992, n_e(f) ≐ 7400.

Vs plain MC
Plain MC has variance ε(1−ε)/n.
IS has variance about 1200-fold smaller here.

Vs SNIS
In this case SNIS had about double the estimated variance of plain IS.

Further improvements
Much more subtle tuning could be applied.
The 10th random variable can be 'integrated out' in closed form, reducing variance.
Control variates could be applied.
Antithetics could be applied.
Adaptive Importance Sampling 46
Summary of basic IS
• IS helps with rare events or spiky integrands
• It becomes unbiased by reweighting
• A good importance sampler q is nearly proportional to fp
• Take extreme care that q is not small
• Self-normalized IS is more widely applicable, but less effective
• Dimension brings challenges
PERT example
This was adaptive IS done by hand.
Grid Science Winter School, January 2017
Adaptive Importance Sampling 47
Devising q
We can go through a storehouse of distributions to find q.
For instance Devroye (1986).
General techniques
• Exponential tilting
• Laplace approximations
• Mixtures
Adaptive Importance Sampling 48
Exponential tilting
Also called exponential twisting.

Exponential family
p(x; θ) = exp( θᵀx − A(x) − C(θ) ),  x ∈ X

Here p(x) = p(x; θ_0) and q(x) = p(x; θ).
Includes: N(θ, I), Poisson(e^θ), Binomial, and Gamma distributions.

More general case
p(x; θ) = exp( η(θ)ᵀT(x) − A(x) − C(θ) ),  x ∈ X

Here η(·) and T(·) are known functions.
Adaptive Importance Sampling 49
IS with tilting

µ̂_θ = (1/n) ∑_{i=1}^n f(x_i) p(x_i; θ_0)/p(x_i; θ),  x_i ∼ p(·; θ)
    = (1/n) ∑_{i=1}^n f(x_i) exp(x_iᵀθ_0 − A(x_i) − C(θ_0)) / exp(x_iᵀθ − A(x_i) − C(θ))
    = (c(θ)/c(θ_0)) (1/n) ∑_{i=1}^n f(x_i) e^{x_iᵀ(θ_0−θ)}

where c(θ) = exp(C(θ)). This often simplifies.

For N(θ, Σ):

µ̂_θ = (1/n) ∑_{i=1}^n f(x_i) e^{x_iᵀΣ^{−1}(θ_0−θ)} × e^{θᵀΣ^{−1}θ/2 − θ_0ᵀΣ^{−1}θ_0/2}

where the last factor is c(θ)/c(θ_0).
Adaptive Importance Sampling 50
Further tilting
Define a family q(x; θ) ∝ p(x)e^{h(x)ᵀθ}.
The function h(·) defines 'features' of x.
θ_j > 0 makes higher values of h_j(x) more probable.
q(x; 0) = p(x).
If p is not in the exponential family, then we might not be able to compute w(x).
If we can sample from q(x; θ) then we can use SNIS.
Adaptive Importance Sampling 51
Hessian at the mode
Suppose that we find the mode x* of p(x), or better yet, of h(x) ≡ p(x)f(x).

Taylor approximation
log(h(x)) ≈ log(h(x*)) − (1/2)(x − x*)ᵀH*(x − x*)
h(x) ≈ h(x*) exp( −(1/2)(x − x*)ᵀH*(x − x*) ),  suggesting
q = N(x*, H*^{−1}).

The Hessian of log(h) at x* is −H*. Requires positive definite H*.

For safety
Use a t distribution instead of N.
Adaptive Importance Sampling 52
Mixtures
What if there are multiple modes?
https://en.wikipedia.org/wiki/Multimodal_distribution
Closely sampling the highest mode might lead us to undersample or even miss some others.
A mixture of unimodal densities can capture all the modes (that we know of).
Adaptive Importance Sampling 53
Mixture distributions
Sample j randomly from {1, 2, …, J} with probabilities α_1, …, α_J.
Here α_j > 0 and ∑_{j=1}^J α_j = 1.
Given j, take x ∼ q_j.

For example: ∑_{j=1}^J α_j N(θ_j, σ²I_d)

With large J, we get kernel density approximations.
These can approximate generic densities.
The approximation gets more difficult in high dimension.
West (1993), Oh & Berger (1993)
Adaptive Importance Sampling 54
Multiple rare event sets
Suppose that x describes a bridge and its load.
There is a failure if x ∈ A1 or x ∈ A2.
Oversampling A1 might lead us to miss A2.
Finance example
Portfolio might fail to meet its goal under:
A1: oil sector underperforms
A2: dollar becomes too weak
A3: interest rates in China are too high
A4: pharmaceutical stocks do too well (portfolio shorted them)
Mixtures
One component can oversample each failure mechanism (that we know of).
Adaptive Importance Sampling 55
Multiple integrands of interest
We want µ_j = ∫ f_j(x)p(x) dx for j = 1, …, J.
Statistics: numerous posterior moments.
Graphics: red, green, blue for each pixel.
Grid science: J different designs.
Each fj(·)p(·) has its own peak.
Mixtures
One mixture component can be directed towards each integrand (that we have planned for).
We will have to take care that optimizing for f1 does not spoil things for f2. The same issue
arises for multiple modes and failure regions.
Adaptive Importance Sampling 56
Multiple nominal distributions
We want µ_j = ∫ f(x)p_j(x) dx for j = 1, …, J.
The pj may correspond to different scenarios (that we are aware of).
Mixtures
Multiple pj and multiple fk amount to multiple products (pf)jk.
We can use a component for each.
Adaptive Importance Sampling 57
IS with mixtures
q_α(x) = ∑_{j=1}^J α_j q_j(x)

µ̂_α = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q_α(x_i) = (1/n) ∑_{i=1}^n f(x_i)p(x_i) / ∑_{j=1}^J α_j q_j(x_i).   (∗∗)

(∗∗) The balance heuristic of Veach & Guibas; also the Horvitz-Thompson estimator.

Alternative
Suppose that x_i came from component j(i). We could also use

(1/n) ∑_{i=1}^n f(x_i)p(x_i)/q_{j(i)}(x_i).

It is also unbiased.
This alternative has higher variance.
But you don't have to compute every q_j for every x_i.
Adaptive Importance Sampling 58
Why higher variance?
Argument by convexity. Simple case: α = (1/2, 1/2) for q_1 and q_2.
Recall: σ²_q = ∫ (fp)²/q dx − µ²

Horvitz-Thompson:
(1/n) ( ∫ (fp)²/((q_1 + q_2)/2) dx − µ² )

Alternative:
(1/(2n)) ( ∫ (fp)²/q_1 dx − µ² ) + (1/(2n)) ( ∫ (fp)²/q_2 dx − µ² )

Horvitz-Thompson is better because 1/x is convex.
We gave the alternative a slight advantage by taking exactly n/2 observations from each q_j.
Adaptive Importance Sampling 59
Defensive mixtures
From Hesterberg (1995)
For our best guess q(·), take q_1 ≡ p and let q_2 = q. Now

w(x) = p(x)/q_α(x) = p(x)/(α_1 p(x) + α_2 q_2(x)) ⩽ 1/α_1,  ∀x

Perhaps α_1 = 1/10 or 1/2.
q does not need to be heavy tailed any more, because q_α is.

Variance bounds
Var(µ̂_{q_α}) ⩽ (1/(nα_1)) (σ²_p + α_2 µ²)
Var(µ̃_{q_α,sn}) ⩽ (1/(nα_1)) σ²_{p,sn} = (1/(nα_1)) σ²_p
Var(µ̃_{q_α,sn}) ⩽ (1/(nα_2)) σ²_{q,sn}

If however σ²_q was very small, defensive IS can lose out.
We don't have a bound vs σ²_q.
Adaptive Importance Sampling 60
Loss of proportionality
[Figure: f, the sampling densities, the likelihood ratios, and the integrands, each on [0, 1].]

Defensive importance sampling
• f Gaussian near 0.7
• p uniform
• q is a Beta ≈ f
• q_α is the dotted mixture
• α = (0.2, 0.8)
• fp/q unbounded
• q_α not very proportional to fp = f
• Solution: control variates.
Grid Science Winter School, January 2017
Adaptive Importance Sampling 61
Control variates
Suppose that we know E(h(x)) = θ in closed form.
If h ≈ f we can Monte Carlo the difference

(1/n) ∑_{i=1}^n ( f(x_i) − h(x_i) ) + θ

for greatly reduced variance.

More generally
For known E(h(x)) = θ ∈ R^J, let

µ̂_β = (1/n) ∑_{i=1}^n ( f(x_i) − βᵀh(x_i) ) + βᵀθ

The optimal β can be estimated from the same data.
Adaptive Importance Sampling 62
It is easy to see that E(µ̂_β) = µ (unbiased) for any β.
The optimal β minimizes Var(µ̂_β). This β_opt is unknown, and could well be harder to estimate
than our ultimate goal µ.
A harder intermediate problem is usually to be avoided, but in this case we have a loophole.
Getting β wrong by ε will mean our variance is too large by a factor of 1 + O(ε²). We do not
need a precise estimate of β. The typical ε is proportional to 1/√n. We can ordinarily analyze
control variates as if we were using the true unknown β.
[Aside: this asymptotic approximation is poor if the dimension of β is comparable to n.]
Adaptive Importance Sampling 63
Control variates via regression
Define Y_i = f(x_i) and Z_ij = h_j(x_i) − θ_j.
Fit Y_i ≐ β_0 + ∑_{j=1}^J β_j Z_ij by least squares.
The intercept β̂_0 is the estimate of µ.
Regression code also gives a standard error for the estimate.
Details in O (2013), Algorithm 8.3.
Centering Z_ij is essential.

Using estimated β
β̂ − β_opt = O_p(n^{−1/2}), so we are about n^{−1/2} from the optimum.
σ²_β is quadratic in β ⟹ σ²_{β̂} = σ²_{β_opt}(1 + O(1/n)).
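A sketch of the regression recipe on a toy problem of my own: estimate µ = E(e^U) = e − 1 for U ∼ Uniform(0,1), with the single control variate h(U) = U, whose mean θ = 1/2 is known.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
u = rng.random(n)
y = np.exp(u)                       # Y_i = f(x_i); true mean is e - 1
z = u - 0.5                         # centered control variate, E(u) = 1/2 known

# Least squares fit y ~ beta0 + beta1 * z; the intercept estimates mu
X = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = beta[0]
plain = y.mean()
print(mu_hat, plain, np.e - 1)
```

Because e^U is nearly linear on [0,1], the regression estimate is far more precise than the plain average; the residual standard deviation drops from about 0.49 to about 0.06.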
Adaptive Importance Sampling 64
Control variates and mixture IS
For normalized q_j(·) we know ∫ q_j(x) dx = 1, j = 1, …, J:

E_{q_α}( q_j(x)/q_α(x) ) = ∫ q_j(x) dx = 1

Unbiased estimate

µ̂_{α,β} = (1/n) ∑_{i=1}^n [ f(x_i)p(x_i) − ∑_{j=1}^J β_j q_j(x_i) ] / ∑_{j=1}^J α_j q_j(x_i) + ∑_{j=1}^J β_j

Additional control variates can be added too.

Via regression
Regress Y_i = f(x_i)p(x_i)/q_α(x_i) on Z_ij = q_j(x_i)/q_α(x_i) − 1.
Get µ̂ = β̂_0 (intercept) and its se.
∑_j α_j Z_ij = 0 for all i, so drop one predictor.
Adaptive Importance Sampling 65
Mixture IS results
O & Zhou (2000)

Var(µ̂_{α,β_opt}) ⩽ min_{1⩽j⩽J} σ²_j/(nα_j)

Properties
1) µ̂_{α,β} is unbiased if q_α > 0 whenever fp ≠ 0
2) If any q_j has σ²_{q_j} = 0 we get Var(µ̂_{α,β_opt}) = 0
3) Even better to take exactly nα_j observations from q_j.

We could not expect better in general.
We might have σ²_j = ∞ for all but one j.
The bound is σ²_j/(nα_j), as if we had just used the good one.
(Without knowing which one it was. Indeed it might be a different one for each of several
different integrands.)
Adaptive Importance Sampling 66
Multiple importance sampling
Veach & Guibas (1995)
Widely used in computer graphics. Eric Veach got an Oscar for it in 2014.

µ̂ = ∑_{j=1}^J (1/n_j) ∑_{i=1}^{n_j} ω_j(x_{ij}) f(x_{ij})p(x_{ij})/q_j(x_{ij})

q_j = j'th light path sampler

Partition of unity
ω_j(x) ⩾ 0 and ∑_{j=1}^J ω_j(x) = 1, ∀x

Can attain Horvitz-Thompson via the 'balance heuristic' ω_j(·) ∝ n_j q_j(·).
The weight on f(x_{ij})p(x_{ij}) can depend on j in many ways.
Adaptive Importance Sampling 67
Summary of mixtures
Using mixtures, we can
• bound the importance ratio
• place a distribution near each singularity
• place a distribution near each failure mode
• tune a distribution to each fj of interest
• tune a distribution to each pj of interest
• use control variates to be almost as good as the optimal component
The mixture components can be based on intuition, tilting, Hessians.
Adaptive Importance Sampling 68
What-if simulations
We estimated µ = ∫ f(x)p(x) dx using x_i ∼ q.
From the sample values f(x_i) we can estimate:
1) ∫ f(x)p′(x) dx for a different density p′ ≠ p
2) Var_q(µ̂_q) for x_i ∼ q
3) Var_{q′}(µ̂_{q′}) as if x_i ∼ q′, for a different density q′ ≠ q
Estimating what would have happened with a different q is the key to adaptive IS.
Adaptive Importance Sampling 69
What-if simulations
Family p(x; θ), θ ∈ Θ.
We want µ(θ) = E(f(x); θ) = ∫ f(x)p(x; θ) dx, θ ∈ Θ.

Sample from θ_0 and reweight for the others:

µ̂(θ) = (1/n) ∑_{i=1}^n f(x_i)p(x_i; θ)/p(x_i; θ_0),  x_i ∼ p(·; θ_0).

We can recycle our f(x_i) values.

Common heavy-tailed q

µ̂(θ) = (1/n) ∑_{i=1}^n f(x_i)p(x_i; θ)/q(x_i),  x_i ∼ q(·)

Paraphrase (from memory) of Trotter & Tukey (1956):
"Any sample can come from any distribution."
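A sketch of this reweighting on a hypothetical family (my own example): sample once at θ_0 = 0 from p(·; θ) = N(θ, 1), then estimate µ(θ) = P_θ(x > 2) for any θ without new draws.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(10)
n = 200_000
theta0 = 0.0
x = rng.standard_normal(n) + theta0          # x_i ~ p(.; theta0) = N(theta0, 1)
fx = (x > 2.0).astype(float)                 # f(x) = 1{x > 2}; recycled for every theta

def mu_hat(theta):
    # reweight by p(x; theta)/p(x; theta0) for unit-variance Gaussians
    w = np.exp((theta - theta0) * x - (theta**2 - theta0**2) / 2)
    return np.mean(fx * w)

est = mu_hat(0.5)
truth = 0.5 * erfc((2.0 - 0.5) / sqrt(2))    # P(N(0.5,1) > 2) = P(Z > 1.5)
print(est, truth)
```

One sample serves a whole curve of θ values, though (as the next slide warns) the accuracy degrades quickly once θ is far from θ_0.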
Adaptive Importance Sampling 70
What-if simulations
These may be accurate for θ near θ_0, but otherwise may fail.

For p = N(θ_0, Σ) and q = N(θ, Σ):
n_e* ⩾ n/10 requires (θ − θ_0)ᵀΣ^{−1}(θ − θ_0) ⩽ log(10) ≐ 2.30. O (2013), Chapter 9.14.

Example
f(x_i) = max(x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}) where x_{ij} ∼ Poi(ν).
Importance sample from λ instead of ν:

µ̂_ν = (1/n) ∑_{i=1}^n max(x_{i1}, …, x_{i5}) ∏_{j=1}^5 [e^{−ν}ν^{x_{ij}}/x_{ij}!] / [e^{−λ}λ^{x_{ij}}/x_{ij}!]
    = (1/n) ∑_{i=1}^n max(x_{i1}, …, x_{i5}) e^{5(λ−ν)} (ν/λ)^{∑_j x_{ij}}
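A numerical check of the Poisson reweighting (ν = 1 and λ = 1.5 are my own choices): the reweighted estimate from λ should agree with direct simulation at ν.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
nu, lam = 1.0, 1.5

# Direct simulation at nu
x_direct = rng.poisson(nu, size=(n, 5))
mc = x_direct.max(axis=1).mean()

# Importance sample at lam and reweight back to nu
x = rng.poisson(lam, size=(n, 5))
s = x.sum(axis=1)
w = np.exp(5 * (lam - nu)) * (nu / lam) ** s     # product of the 5 Poisson ratios
est = np.mean(x.max(axis=1) * w)
print(est, mc)
```

Here λ > ν so q has heavier tails than p; in the reverse direction (ν > λ) the weights become unbounded and the what-if estimates degrade, as the next slide's figure shows.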
Adaptive Importance Sampling 71
Poisson example
[Figure: left panel, effective sample sizes vs λ for ν from 0.5 to 2.5 (solid curve is the population effective sample size; asterisks are sample estimates); right panel, means and 99% C.I.s vs λ, for x_ij ∼ Poi(λ).]
When ν > λ, q has lighter tails than p. Hence wide CIs and poor n_e.
Adaptive Importance Sampling 72
What-if for variance
For q(x; θ),
σ²θ = ∫ (f(x)p(x))²/q(x; θ) dx − µ² ≡ MSθ − µ²
Now
MSθ = ∫ (f(x)p(x))²/q(x; θ) dx = ∫ [(f(x)p(x))² / (q(x; θ) q(x; θ0))] q(x; θ0) dx
Sample estimate
For xi ∼ q(x; θ0),
MŜθ(θ0) = (1/n) ∑_{i=1}^n (f(xi)p(xi))² / (q(xi; θ) q(xi; θ0))
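A minimal sketch of this what-if variance estimate, under the hypothetical choices p = N(0, 1), f(x) = 1{x > 2}, and q(· ; θ) = N(θ, 1): one sample drawn at θ0 lets us scan MŜθ over a grid of θ and pick a promising proposal.

```python
import numpy as np

rng = np.random.default_rng(4)

def phi(x, theta):                      # N(theta, 1) density
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

# Hypothetical setup: p = N(0,1), f(x) = 1{x > 2}, proposals q(.; theta) = N(theta, 1)
theta0 = 1.0
n = 100_000
x = rng.normal(theta0, 1.0, size=n)
num = (x > 2.0) * phi(x, 0.0)           # f(x) p(x)

def ms_hat(theta):
    # What-if estimate of MS_theta, using only the sample drawn at theta0
    return np.mean(num ** 2 / (phi(x, theta) * phi(x, theta0)))

thetas = np.linspace(0.0, 3.0, 31)
best = thetas[np.argmin([ms_hat(t) for t in thetas])]
print(best)     # near the good mean shift for this rare event
```

This is the basic ingredient of adaptive IS: the scan over θ costs no new function evaluations.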
Summary of what-if simulations
• We can learn about alternative p’s and q’s
• Effectiveness is over a limited range
The next slides are a survey of a selected set of adaptive importance sampling methods.
These are the ones that I thought were interesting and potentially useful in grid science.
Of course there are many others.
Adaptive importance sampling
Overview: use the xi sampled so far to change q and sample some more.
Two stage
Get a pilot sample xi1 ∼ q1. Use them to choose q2. Take xi2 ∼ q2 to estimate µ.
Multi-stage
Given q1, for k = 1, …, K − 1, use xik, i = 1, …, nk to choose qk+1.
Or use xij, 1 ≤ i ≤ nj, 1 ≤ j ≤ k to choose qk+1.
n-stage
Given q1, take xi from qi computed from x1, …, xi−1.
AIS details
We have to pick a family Q of proposal distributions.
We need an initial q ∈ Q.
We have to choose a termination criterion:
e.g., K steps, total N observations, confidence interval width below ε.
We need a way to choose qk+1 ∈ Q from prior data.
We need a way to combine information from all K stages.
Combination
Suppose we get µ̂k for k = 1, …, K, unbiased with variances σ²k.
Best linear unbiased combination
∑_{k=1}^K µ̂k/Var(µ̂k) / ∑_{k=1}^K 1/Var(µ̂k)
It is not safe to replace Var(µ̂k) by the estimate V̂ar(µ̂k).
It is common for V̂ar(µ̂k) to be correlated with µ̂k.
E.g., the largest µ̂k may coincide with the largest V̂ar(µ̂k).
Plug-in estimates could bias the estimate down.
Deterministic weighting is less risky.
Deterministic weights
Suppose the true variance improves: Var(µ̂k) ∝ k^{−r0} for some 0 ≤ r0 ≤ 1.
We should weight proportionally to k^{r0} but we might not know r0. So we use r1.
If we take r1 = 1/2 then we never raise the variance by more than 12.5%:
sup_{1≤K<∞} max_{0≤r0≤1} Var(using r1 = 1/2)/Var(using r0) = 9/8.
Upshot
Only a small efficiency loss from weighting stage k proportionally to √k:
∑_{k=1}^K √k µ̂k / ∑_{k=1}^K √k.
O & Zhou (1999)
Not efficient for exponential convergence (see below) but that is a rare circumstance.
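The √k combination rule above is a one-liner; a sketch (the helper name is mine):

```python
import numpy as np

# Combine per-stage estimates mu_hat_k with deterministic weights proportional
# to sqrt(k), as on the slide: sum_k sqrt(k) mu_hat_k / sum_k sqrt(k).
def combine_sqrt_k(mu_hats):
    k = np.arange(1, len(mu_hats) + 1)
    w = np.sqrt(k)
    return np.sum(w * np.asarray(mu_hats)) / np.sum(w)

# A constant sequence is returned unchanged (up to rounding),
# and later, lower-variance stages get more weight.
print(combine_sqrt_k([3.0, 3.0, 3.0]))
print(combine_sqrt_k([0.0, 1.0]))
```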
Combination
Put weights ωk on µ̂k, such as ωk ∝ √k.
Final estimate µ̂ = ∑_{k=1}^K ωk µ̂k.
Assume
E(µ̂k | data to step k − 1) = µ
E(V̂ar(µ̂k) | data to step k − 1) = Var(µ̂k | data to step k − 1)
Then by martingale arguments
E(µ̂) = µ, and
V̂ar(µ̂) ≡ ∑k ω²k V̂ar(µ̂k) satisfies
E(V̂ar(µ̂)) = Var(µ̂)
Exponential convergence
Parametric set Q = {qθ | θ ∈ Θ}, Θ ⊂ R^d
Best case scenario
If q∗ = fp/µ ∈ Q then we can approach σ²q = 0.
Let q∗ = qθ∗. If we can estimate θ∗ to within O(n^{−1/2}) from a first estimate we get much smaller variance at the second step.
The result can be a virtuous cycle of θ(k) → θ∗ and σ²k → 0.
Examples
1) Toy example with p = N(0, 1) and f(x) = exp(Ax).
2) Theory from Kollman (1993) and others for particle transport.
Toy example
p = N(0, 1), f(x) = exp(Ax) for A > 0, and qθ = N(θ, 1).
We actually know µ = E(f(x)) = exp(A²/2) so we don't really need MC.
Suppose we also know that θ∗ = A and Ep(f(x)) = exp(A²/2).
That is θ∗ = √(2 log(E(f(x)))).
Cycle through
xi ∼ qθ(k), i = 1, …, n
µ̂(k) ← (1/n) ∑_{i=1}^n f(xi) p(xi)/qθ(k)(xi)
θ(k+1) ← √(2 log(max(1, µ̂(k)))).
Without max(1, ·) the algorithm can fail.
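The cycle above can be sketched as follows; A, n, and K are arbitrary choices for illustration. At θ = A the integrand f·p/qθ is constant, so once θ(k) gets close the error collapses:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy AIS cycle from the slide: p = N(0,1), f(x) = exp(A x), q_theta = N(theta, 1).
# True value mu = exp(A^2/2), and the optimal shift is theta* = A.
A = 1.0
mu_true = np.exp(A ** 2 / 2)
n, K = 100, 20

theta = 0.0
for k in range(K):
    x = rng.normal(theta, 1.0, size=n)
    # f(x) p(x) / q_theta(x) = exp(A x + theta^2/2 - theta x)
    mu_hat = np.mean(np.exp(A * x + theta ** 2 / 2 - theta * x))
    theta = np.sqrt(2 * np.log(max(1.0, mu_hat)))   # guard per the slide

print(abs(mu_hat - mu_true))   # tiny: a near-zero-variance sampler is reached
```

Each cycle multiplies the error of θ by roughly O(n^{−1/2}), which is the exponential convergence the next slide plots.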
Exponential convergence
[Figure: “Exponentially converging AIS error” — absolute error versus iteration on a log scale, decreasing exponentially.]
Steps with n = 20.
Transport problems
Booth (1985) observed exponential convergence in some particle transport problems.
Kollman (1993) established exponential convergence under strong conditions: start one simulation from each state of a Markov chain.
Extensions from Kollman, Baggerly, Cox & Picard (1999), Baggerly, Cox & Picard (2000), and Kong, Ambrose & Spanier (2009).
Story
Markovian particles: P(xn+1 = x | x0, …, xn) = P(xn+1 = x | xn)
Terminal state ∆ (absorption, or left the region of interest): limn→∞ P(xn = ∆) = 1.
f(x1, x2, …) = ∑_{n=1}^∞ s(xn−1 → xn) = ∑_{n=1}^τ s(xn−1 → xn)
hitting time τ, score function s(· → ·) with s(∆ → ∆) = 0.
Particle transport
The particle follows a Markov chain with transitions P.
We use transitions Q instead, with Qij > 0 whenever Pij > 0.
Optimal Q
Qij = Pij(sij + µj) / (Pi∆ si∆ + ∑_{ℓ=1}^d Piℓ(siℓ + µℓ))
Qi∆ = Pi∆ si∆ / (Pi∆ si∆ + ∑_{ℓ=1}^d Piℓ(siℓ + µℓ)), where
µi = EQ( ∑_{n=1}^τ s(xn−1 → xn) wn(x1, …, xn) | x0 = i ), and
wn = ∏_{j=1}^n P_{x_{j−1} x_j} / Q_{x_{j−1} x_j}.
Algorithm
Alternates between estimating µ and sampling from Q(µ̂).
Cross-entropy
Rubinstein (1997), Rubinstein & Kroese (2004)
µ = ∫ f(x)p(x) dx for f ≥ 0 and µ > 0.
We use q(· ; θ) = qθ for θ ∈ Θ.
There is an optimal q ∝ fp but it is not usually in our family.
We would like
θ = arg min_{θ∈Θ} MS(θ) where MS(θ) = ∫ (fp)²/qθ dx
Variance based update
θ(k+1) = arg min_{θ∈Θ} (1/nk) ∑_{i=1}^{nk} (f(xi)p(xi))²/qθ(xi), with xi = xi(k) ∼ qθ(k)
The optimization may be too hard. Switch to entropy.
Cross-entropy
Use an exponential family
qθ = exp(θᵀx − A(x) − C(θ))
Replace variance by the KL distance
D(q∗ ‖ qθ) = Eq∗( log(q∗(x)/qθ(x)) )
This distance has D(q∗ ‖ qθ) ≥ 0 and D(q∗ ‖ q∗) = 0.
Seek θ to minimize
D(q∗ ‖ qθ) = Eq∗(log(q∗(x)) − log(q(x; θ)))
I.e., maximize
Eq∗(log(q(x; θ)))
Cross-entropy
For q > 0 whenever q∗ > 0,
Eq∗(log(q(x; θ))) = Eq( log(q(x; θ)) q∗(x)/q(x) ) = (1/µ) Eq( log(q(x; θ)) f(x)p(x)/q(x) )
Choose θ(k+1) to maximize
G(k)(θ) = (1/nk) ∑_{i=1}^{nk} [p(xi)f(xi)/q(xi; θ(k))] log(q(xi; θ))
  ≡ (1/nk) ∑_{i=1}^{nk} Hi log(q(xi; θ))
  ≡ (1/nk) ∑_{i=1}^{nk} Hi (θᵀxi − A(xi) − C(θ))
C(θ) is a known convex function and Hi ≥ 0. Assume (for now) that ∑i Hi > 0.
The update often takes a simple moment matching form.
Cross-entropy update
∑i Hi xiᵀ / ∑i Hi = (∂/∂θ) C(θ(k+1))
For qθ = N(θ, I):
θ(k+1) ← ∑i Hi xi / ∑i Hi
For qθ = N(θ, Σ):
θ(k+1) ← Σ⁻¹ ∑i Hi xi / ∑i Hi
Other exponential family updates are typically closed form functions of sample moments.
What if all Hi = 0?
Consider f(x) = 1{g(x) > τ} for large τ.
Use some other τ(k) instead, for example the 99th percentile of the g(xi).
Ideally τ(k) → τ.
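A sketch of a cross-entropy iteration for a rare event, combining the moment-matching update with the provisional threshold τ(k). The concrete choices (g(x) = x, τ = 5, qθ = N(θ, 1), the sample sizes and seed) are hypothetical illustrations:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sketch of cross-entropy for mu = Pr(g(x) > tau), x ~ p = N(0,1),
# with proposals q_theta = N(theta, 1). When no sample exceeds tau,
# the 99th percentile of g(x_i) acts as the provisional threshold tau_k.
g = lambda x: x
tau, n, K = 5.0, 10_000, 10

theta = 0.0
for k in range(K):
    x = rng.normal(theta, 1.0, size=n)
    tau_k = min(tau, np.quantile(g(x), 0.99))    # provisional threshold
    # H_i = f(x_i) p(x_i)/q(x_i); for N(theta,1), p/q = exp(theta^2/2 - theta x)
    H = (g(x) > tau_k) * np.exp(theta ** 2 / 2 - theta * x)
    theta = np.sum(H * x) / np.sum(H)            # moment-matching update

# Final importance sampling estimate at the adapted theta
x = rng.normal(theta, 1.0, size=n)
mu_hat = np.mean((g(x) > tau) * np.exp(theta ** 2 / 2 - theta * x))
print(mu_hat)    # compare with the exact Pr(N(0,1) > 5), about 2.87e-7
```

The threshold climbs toward τ over the first few iterations, after which the update settles near the conditional mean of x given x > τ.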
Cross-ent examples
[Figure: two scatter panels over [−6, 6]²: “Gaussian, Pr(min(x) > 6)” and “Gaussian, Pr(max(x) > 6)”.]
θ1 = (0, 0)ᵀ.
Take K = 10 steps with n = 1000 each.
Cross-ent examples
[Figure: the same two panels, “Gaussian, Pr(min(x) > 6)” and “Gaussian, Pr(max(x) > 6)”, showing the paths of the θk.]
θ1 = (0, 0)ᵀ.
For min(x), θk heads Northeast, and is ok.
For max(x), θk heads North, or East, and underestimates µ by about 1/2.
Nonparametric AIS
Flexible / infinite dimensional families Q.
Often based on recursive rectangular partitioning of a region like [0, 1]^d.
1) Vegas, Lepage (1978)
2) Divonne, Friedman & Wright (1979, 1981)
3) Miser, Press & Farrar (1990)
4) Split Vegas, Zhou (1998) ≈ Vegas within Miser
Vegas
q(x) = ∏_{j=1}^d qj(xj)
Each qj is piece-wise constant. Equal content, not equal width.
[Figure: “Vegas-style piecewise constant density” on [0, 1].]
Very amenable to sampling spikes.
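A sketch of the equal-content representation in one dimension. The bin edges here are fixed by hand for illustration (Vegas adapts them iteratively); each bin carries probability 1/B, so narrow bins near 0 put high density on the spike:

```python
import numpy as np

rng = np.random.default_rng(7)

# Vegas-style 1-d sampler: bins with EQUAL content (probability 1/B each)
# but unequal widths, given by a sorted grid of bin edges on [0, 1].
def sample_equal_content(edges, n, rng):
    B = len(edges) - 1
    j = rng.integers(0, B, size=n)                 # uniform bin index
    u = rng.random(n)
    x = edges[j] + u * (edges[j + 1] - edges[j])   # uniform within the bin
    q = 1.0 / (B * (edges[j + 1] - edges[j]))      # piecewise constant density
    return x, q

# Hand-picked edges: narrow bins near 0 -> high density near the spike.
edges = np.array([0.0, 0.01, 0.05, 0.15, 0.4, 1.0])
f = lambda x: x ** (-0.25)      # integrable spike at 0; integral is 4/3
x, q = sample_equal_content(edges, 100_000, rng)
print(np.mean(f(x) / q))        # importance sampling estimate, near 4/3
```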
Comments on Vegas
The optimal qj given the other qℓ is
qj(xj) ∝ √( ∫_{(0,1)^{d−1}} [f(x)² / ∏_{ℓ≠j} qℓ(xℓ)] ∏_{ℓ≠j} dxℓ ).
The update starts by iterating the above. Some complex steps.
A problem with Vegas
Suppose f(x) has two modes in [0, 1]^{10}, one near (0, 0, …, 0) and one near (1, 1, …, 1).
Each qj needs two modes.
So q has 2^{10} modes, mostly irrelevant.
Divonne
Friedman & Wright (1979, 1981)
p = U[0, 1]^d and f has two continuous derivatives.
Local optimization has less curse of dimension than integration.
Newton steps cost O(d³) per iteration.
Recursive partitioning
Hyper-rectangles Rℓ ⊂ [0, 1]^d with ∪_{ℓ=1}^L Rℓ = [0, 1]^d and vol(Rj ∩ Rk) = 0 for j ≠ k.
∫_{[0,1]^d} f(x) dx = ∑_{ℓ=1}^L ∫_{Rℓ} f(x) dx
Start with L = 1 and R1 = [0, 1]^d.
Divonne splits
Find the min and max of f in each rectangle:
f̄(R) = max_{x∈R} f(x)
f̲(R) = min_{x∈R} f(x)
b(R) = vol(R) |f̄(R) − f̲(R)| (badness)
Split the rectangle with the largest ‘badness’.
Non-binary splits
To do the split, decide which of the max and min is farthest from the average of f in the rectangle.
Define a subrectangle to isolate that mode.
Partition the original rectangle into numerous other subrectangles to complement the one around the mode.
An issue: it is not designed for unbounded integrands, and they will give it trouble.
Miser
Press & Farrar (1990); see also Numerical Recipes.
Estimates ∫_{(0,1)^d} f(x) dx by recursive binary partitioning into rectangles.
Each rectangle can be split down the middle in d different directions.
The chosen direction is based on an estimate of which j will lead to the most accurate estimate.
They assume that Miser, not plain Monte Carlo, will be applied left and right, and they employ an assumption about the Miser convergence rate to decide.
Having chosen the split, they allocate sample sizes left and right.
For a mode at (1/2, 1/2, …, 1/2), split at 1/2 + U(−δ/2, δ/2) to avoid it.
Split Vegas
Dissertation of Zhou (1998).
Within a rectangular region, fit Vegas.
If one of the marginal distributions is strongly bimodal then split the rectangle in that direction.
It can be subtle to identify whether a given margin has two or more modes.
Mixture of products of beta
For ∫_{[0,1]^d} f(x) dx.
Beta density b(x; α, β) ∝ x^{α−1}(1 − x)^{β−1}, α, β > 0.
Can have a bump near α/(α + β), or a spike near 0 and / or near 1.
d-dimensional product of beta densities: ∏_{j=1}^d b(xj; αj, βj)
Mixture of products of beta densities: q(x) = ∑_{m=1}^M γm ∏_{j=1}^d b(xj; αjm, βjm)
Adaptive part
Choose γm and αjm, βjm by Levenberg-Marquardt nonlinear least squares to minimize
∑_{i=1}^n (|f(xi)| − q(xi; α, β, γ))²
O & Zhou (1999)
Many more details in the thesis of Zhou (1998).
Particle physics integrand
From Kinoshita (1970).
7 dimensional, positive and negative integrable singularities.

          MC      Vegas   MISER   ADBayes   MixBeta
Err/µ     0.154   0.092   0.073   0.210     0.035
Err/Êrr   1.163   34.0    0.750   19.5      0.394

K = 20 runs of n = 10^5 points each, except ADBayes.
Known that µ ≐ 0.371005292 as of 1991. (!!)
(Don't recall non-sampling costs.)
AMIS
Adaptive Multiple Importance Sampling
Cornuet, Marin, Mira, Robert (2012)
1) Sample n1 observations using θ1
2) Estimate θ2 from data and sample more
3) Keep estimating new θk and sampling more
4) Combine rounds by multiple importance sampling methods
The weights on observations from one round can change later due to data from other rounds.
This breaks the Martingale property, so it is hard to get unbiased estimates.
SNIS is used throughout because the motivation is from Bayesian problems where p is usually
not normalized.
Optimal mixture weights
MS_{α,β} ≡ ∫ (fp + ∑j βj(qj − p))² / ∑j αj qj dx
is jointly convex in α (in the simplex) and β.
So is a sample estimate of it.
Therefore we can estimate optimal mixture weights.
Also we can constrain αj ≥ ε > 0 in a convex optimization and the variance does not increase much.
He & Owen (2014) sequentially optimize both α and β.
M-PMC
Mixture-Population Monte Carlo
Cappé, Douc, Guillin, Marin & Robert (2008)
q(x) = ∑_{j=1}^J αj qj(x; θj)
They target the Kullback-Leibler distance to the target dist'n.
Also use SNIS.
Aim to optimize over both αj and θj (non-convex).
Use an EM algorithm for the θ updates.
Earlier, Douc, Guillin, Marin & Robert (2007) target a specific integrand.
APIS
Adaptive Population Importance Sampler
Martino, Elvira, Luengo, Corander (2015)
For an unnormalized p(·).
Choose n normalized distributions qi(· ; θi, Ci), with mean θi and covariance Ci.
Sample xi ∼ qi. Get SNIS weights wi ∝ p(xi)/∑_{i′} qi′(xi; θi′, Ci′).
Every m'th iteration, update the means θi but not the covariances Ci, using the previous m − 1 iterations' data.
Avoids “particle collapse”.
Empirical assessment.
Stochastic convex programming
Ryu & Boyd (2015)
They take up the exponential family setting of cross-entropy.
They find that in exponential families the variance is a convex function, so they don't have to use Kullback-Leibler.
µ̂ = (1/n) ∑_{i=1}^n f(xi) p(xi)/q(xi; θi)
Potentially a new θi for each xi (or small batches).
Stochastic gradient descent
θ_{n+1} = Proj(θn − αn gn)
Sequence αn → 0 with ∑n αn = ∞
Proj projects onto Θ
gn = gn(θn, xn) is the n'th gradient estimate
µ̂n attains the optimal variance × (1 + O(n^{−1/2})).
Kriging based estimators
Picheny (2009), Balesdent, Morio & Marzat (2013), Dalbey & Swiler (2014)
Rare event φ(x) > τ. We get φ(xi), not just f(x) = 1{φ(x) > τ}.
Kriging
Given φ(x1), …, φ(xn) it can predict φ(x) elsewhere by Gaussian process interpolation.
Swiler & West (2010) report that ∫ p(x) 1{φ̂(x) > τ} dx does not work well.
Better to use ∫ p(x) Φ((τ − φ̂(x))/s(x)) dx where kriging gives φ̂ and a posterior standard deviation s. Chevalier, Ginsbourger, Picheny, Richet, Bect, Vazquez (2014)
Dalbey & Swiler (2014) use φ̂ to update q.
Thanks
• Yi Zhou, Hera He, co-authors
• Minyong Lee, discussion
• Toichuro Kinoshita (particle physics code and integral)
• U.S. National Science Foundation DMS-1521145, DMS-1407397
• Michael Chertkov
• Kacy Hopwood