TRANSCRIPT
Adaptive Importance Sampling 1
Adaptive Importance Sampling
Art B. Owen
Stanford University
Grid Science Winter School, January 2017
Adaptive Importance Sampling 2
These are slides that I presented at Grid Science 2017 in Los Alamos NM, on Wednesday
January 11.
I have amended them, correcting typos and adding a few more references. I have also added a few
interstitial slides like this one to cover things that I either said or should have said during the
presentation.
Adaptive Importance Sampling 3
Outline
1) Basic Importance Sampling.
   Often our only chance to solve a problem.
2) Mixture Distributions.
   The key to effective use of IS. Also an Oscar.
3) What-if Simulations.
   Using data from one sim to learn how another would have worked.
4) Adaptive Importance Sampling.
   Still mostly intuitive and ad hoc.
Adaptive Importance Sampling 4
What is missing
1) Some paywalled articles.
2) Many articles from application areas.
3) Some very advanced / hard to explain developments.
Google Scholar, Dec 2016
“adaptive importance sampling”
About 3,040 results (0.07 sec)
Since 2012: About 1,070 results (0.10 sec)
Adaptive Importance Sampling 5
At the meeting, several people told me about the importance sampling work of Ian Dobson on
estimating rare event probabilities for power system cascades. So I’m looking into those now.
Adaptive Importance Sampling 6
The canonical problem
Approximate µ = ∫_{R^d} h(x) dx.

Use the first one that works of:
1) Symbolic / closed form solutions
2) Numerical quadrature
3) Plain Monte Carlo
4) Importance Sampling
5) Adaptive Importance Sampling

Notes
• Some problems have d = ∞ and/or discrete x.
• We may need Markov chain Monte Carlo (MCMC) or sequential Monte Carlo (SMC).
• Sometimes we can use quasi-Monte Carlo (QMC).
Adaptive Importance Sampling 7
There are problems where none of the five approaches on the previous slide will get the answer.
Quasi-Monte Carlo sampling can be used to great effect when the underlying integrands are just
a little bit more regular than Monte Carlo uses. With randomized QMC, RQMC, one gets an
answer that is seldom worse than plain MC or plain QMC and can be much better. This talk was
just about MC.
MCMC and SMC can be brought to bear on problems that are less regular/favorable than plain
MC uses. Those methods can handle even harder problems because they involve flexible
explorations of the problem space. It can be hard to tell if/when they work.
Adaptive Importance Sampling 8
Closed form
Sometimes we can get a closed form.
Symbolic computation (Mathematica, Maple, Sage, etc.) helps.
Unfortunately
We can seldom do this for problems with real-world complexity,
so we resort to numerical methods.
Adaptive Importance Sampling 9
Numerical quadrature
For instance, tensor products of Simpson's rule.
Evaluate h at n points x_i ∈ R^d.
Can attain an error rate of O(n^{−r/d}) for h ∈ C^{(r)}(R^d).
Unfortunately
Low smoothness (small r) or high dimension (large d) make the methods ineffective.
Adaptive Importance Sampling 10
Monte Carlo
Write h(x) = f(x)p(x) for a probability density function p(·). Then

µ = ∫ h(x) dx = ∫ f(x)p(x) dx = E(f(x)),  x ∼ p.

Estimate
µ̂ = (1/n) ∑_{i=1}^n f(x_i),  x_i ∼ p

√n (µ̂ − µ) →d N(0, σ²)  (CLT),  σ² = E((f(x) − µ)²)

Upshot
Very competitive rate O(n^{−1/2}) when r/d < 1/2. We get confidence intervals too.

Unfortunately
(σ/√n)/µ might be large, or even infinite.
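As a minimal illustration of the plain MC estimate and its CLT standard error (my own example, not from the slides): take p = N(0,1) and f(x) = x², so the true mean is E(x²) = 1 and σ² = Var(x²) = 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)          # x_i ~ p = N(0,1)
fx = x**2                           # f(x_i); true mean is E(x^2) = 1
mu_hat = fx.mean()
se = fx.std(ddof=1) / np.sqrt(n)    # CLT-based standard error, sigma/sqrt(n)
print(mu_hat, se)                   # se should be near sqrt(2)/1000
```

With n = 10⁶, the standard error is about √2/1000 ≈ 0.0014, illustrating the O(n^{−1/2}) rate.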
Adaptive Importance Sampling 11
Places where MC has trouble
Rare events
• Bankruptcy / default for finance / insurance
• Energy grid failure, Bent, Backhaus & Yamangil (2014);
  cascade simulations, Chertkov (2012)
• Product failure, e.g., battery
• Monograph by Rubino & Tuffin (2009)
Spiky/singular integrands
• High energy physics (Feynman diagrams),
  Aldins, Brodsky, Dufner, Kinoshita (1970);
  multiple singularities in [0, 1]^7 that dominate the integral
• Graphical rendering,
  Kollig & Keller (2006); integrand includes factors like ‖x − x0‖^{−1}
• Particle transport, e.g., T. Booth
Adaptive Importance Sampling 12
Rare events
f(x) = 1_A(x) = { 1, x ∈ A;  0, else },

µ = P(x ∈ A) = ∫_A p(x) dx = ε

Coefficient of variation of µ̂:
(σ/√n)/µ = √(ε(1−ε)) / (√n ε) ≈ 1/√(nε)
To get cv = 0.1 takes n > 100/ε.
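A quick numerical check of that count (ε = 10⁻³ here is my own choice, not a value from the slides): with n = 100/ε draws, the coefficient of variation of the plain MC estimate of ε is about 0.1.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 1e-3
n = int(100 / eps)                  # n = 100/eps draws
# theoretical cv of the plain MC estimate of eps
cv = np.sqrt(eps * (1 - eps)) / (np.sqrt(n) * eps)
# simulate: each draw hits the rare set A with probability eps
hits = rng.random(n) < eps
mu_hat = hits.mean()
rel_err = abs(mu_hat - eps) / eps
print(cv, mu_hat, rel_err)
```

The relative error of a single run is then typically around 10%, and shrinking ε without growing n makes it worse.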
Adaptive Importance Sampling 13
Singularities
f(x) ≈ ‖x − x0‖^{−r},  x ∈ A ⊂ R^d
0 < a ⩽ p(x) ⩽ b < ∞, for x ∈ A.
f is integrable over bounded A containing x0 iff r < d.
Then f² is integrable iff 2r < d.
For r ∈ [d/2, d):
µ exists but σ² = ∞;
Monte Carlo converges, but more slowly than O_p(1/√n).

Other singularities
‖x − x0‖_p^{−r},  min(x1, x2, …, xd)^{−r},
min(x1, 1−x1, x2, 1−x2, …, xd, 1−xd)^{−r}
Adaptive Importance Sampling 14
Very often singularities are simply not a big problem for MC, and the interesting thing is that high
dimension can make singularities less harmful. This is a 'concentration of measure' result, as
described below.

Let f be singular on some subset of the input space. For simplicity, take subsets of the form
Ω_{d,p}(ρ) = {x ∈ R^d | ‖x‖_p ⩽ ρ} for integer d ⩾ 1 and 1 ⩽ p ⩽ ∞, and f(x) = ‖x‖_p^{−r}.
Let x be uniformly distributed on Ω_{d,p} = Ω_{d,p}(1) of volume |Ω_{d,p}|. Then
µ = ∫_0^1 y^{d−1−r} dy · |Ω_{d−1,p}|/|Ω_{d,p}| = (d−r)^{−1} |Ω_{d−1,p}|/|Ω_{d,p}|. The singularity is severe
if, say, half of µ comes from a very tiny region around 0.
The radius ρ that captures half of the integral satisfies ∫_0^ρ y^{d−1−r} dy = (1/2) ∫_0^1 y^{d−1−r} dy,
and it is ρ = 2^{−1/(d−r)}. The relative volume is
|Ω_{d,p}(ρ)|/|Ω_{d,p}(1)| = ρ^d = 2^{−d/(d−r)}. If r is close to d, then half of the integral comes from
a very tiny fraction of the volume. If r = 1 and d grows, then half of the integral comes from
about half of the volume, hence not much of a spike. If r < d/2, so the variance is finite, then
this volume is larger than 2^{−d/(d−d/2)} = 1/4. Once again, not a very prominent spike.
For singularities on a k-dimensional manifold, replace d by d − k.
Conclusion: rare events described by 0/1 random variables can be much harder to handle than
unbounded random variables.
Adaptive Importance Sampling 15
Importance sampling
For rare events and (some) singularities there is a region A of relatively small P(x ∈ A) that
dominates the integral. Taking x ∼ p does not get enough data from the important region A.

Main idea
Get more data from A, and then correct the bias.
Kahn (1950), Kahn & Marshall (1953), Trotter & Tukey (1956)

Good news: Importance sampling can solve these sampling problems.
Bad news: It can also fail spectacularly.

For a density q,
f(x)p(x) = (f(x)p(x)/q(x)) q(x)

µ̂_q = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q(x_i),  x_i ∼ q

For now, assume that p(x)/q(x) is computable.
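Here is a hypothetical rare-event instance of the estimator above (the densities and threshold are my own choices): p = N(0,1), f(x) = 1{x > 4}, and a sampler q = N(4,1) centered on the rare set.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)
n = 100_000
truth = 0.5 * erfc(4 / sqrt(2))            # P(x > 4) for x ~ N(0,1), about 3.17e-5

# Importance sample from q = N(4,1), centered on the rare set A = {x > 4}
x = rng.standard_normal(n) + 4.0
w = np.exp(-x**2 / 2 + (x - 4.0)**2 / 2)   # likelihood ratio p(x)/q(x) = exp(8 - 4x)
fw = (x > 4.0) * w
mu_hat = fw.mean()
se = fw.std(ddof=1) / np.sqrt(n)
print(mu_hat, se, truth)
```

Plain MC would need n of order 100/ε ≈ 3 × 10⁶ just to get cv = 0.1 here; the IS estimate has a far smaller standard error at n = 10⁵.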
Adaptive Importance Sampling 16
Unbiasedness
If q(x) > 0 whenever f(x)p(x) ≠ 0, then for x_i ∼ q,

E_q(µ̂_q) = ∫ f(x) (p(x)/q(x)) q(x) dx = ∫ f(x)p(x) dx = µ.

Sufficient condition
q(x) > 0 whenever p(x) > 0.
Yields unbiasedness for generic f.

Notation
E_q(·), Var_q(·), etc., are for x_i ∼ q.
w(x) ≡ p(x)/q(x),  E_p(g(x)) = E_q(g(x)w(x)).
Adaptive Importance Sampling 17
Variance of IS
Based on O (2013), Ch. 9.1

Var_q(µ̂_q) = σ²_q/n, where σ²_q = Var_q((fp)/q)

σ²_q = E_q( ((fp)/q)² ) − µ² = ∫ (fp)²/q dx − µ²

After some algebra,

σ²_q = ∫ (fp − µq)²/q dx

Lessons
1) q(x) ∝ f(x)p(x) is best, when f ⩾ 0.
2) q near 0 is dangerous.
3) Bounding (fp)²/q is useful theoretically.
Adaptive IS is about using sample data to select q.
Adaptive Importance Sampling 18
Proportionality
For f ⩾ 0, taking q = fp/µ yields σ²_q = 0. Specifically,

µ̂_q = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q(x_i) = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/(f(x_i)p(x_i)/µ) = µ

but we need to know µ to use it.
We look for q nearly proportional to fp.

More generally
q ∝ |f|p is always optimal, but not zero variance if f takes both positive and negative values.

For f ⩾ 0:
q ∝ f is not best (unless p is flat),
q ∝ p is not best (unless f is flat).
Importance involves both f and p via fp.
Adaptive Importance Sampling 19
Example
[Figure: densities f, q, and p plotted together for x from −5 to 15.]
µ ≐ 1.74 × 10^{−24},  Var(µ̂_q) = 0
Gaussian p and f ⟹ the Gaussian optimal q lies between them.
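The zero-variance phenomenon can be checked numerically with a hypothetical one-dimensional version (my own numbers, not those in the figure): p = N(0,1) and f(x) = exp(−(x−10)²/2) ⩾ 0, so fp is proportional to a N(5, 1/2) density; taking q = N(5, 1/2) makes f(x)p(x)/q(x) constant.

```python
import numpy as np
from math import exp, sqrt, pi

rng = np.random.default_rng(3)
n = 10_000
mu_true = exp(-25) / sqrt(2)        # closed form for the integral of f(x)p(x)

def f(x):                           # nonnegative Gaussian-bump integrand
    return np.exp(-(x - 10.0)**2 / 2)

def p(x):                           # nominal density N(0,1)
    return np.exp(-x**2 / 2) / sqrt(2 * pi)

def q(x):                           # optimal q = fp/mu, here the N(5, 1/2) density
    return np.exp(-(x - 5.0)**2) / sqrt(pi)

x = rng.normal(5.0, sqrt(0.5), n)   # draws from q
vals = f(x) * p(x) / q(x)           # every term equals mu, up to rounding
print(vals.mean(), vals.std())
```

Every sampled ratio equals µ, so the estimator has (numerically) zero variance; of course, constructing this q required knowing µ.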
Adaptive Importance Sampling 20
Here is a positivisation trick from O & Zhou (2000).
If f takes both positive and negative values then we can write

µ = ∫ f(x)p(x) dx = ∫ f₊(x)p(x) dx − ∫ f₋(x)p(x) dx

for nonnegative functions f₊(x) = max(0, f(x)) and f₋(x) = max(0, −f(x)). Then we can use

µ̂ = (1/n₊) ∑_{i=1}^{n₊} f₊(x_i)p(x_i)/q₊(x_i) − (1/n₋) ∑_{i=1}^{n₋} f₋(x_i)p(x_i)/q₋(x_i)

for x_i from q±. There then exist zero variance sampling distributions q± that we might
approximate. More generally we can use

µ̂ = c + (1/n₊) ∑_{i=1}^{n₊} (f(x_i) − c)₊ p(x_i)/q₊(x_i) − (1/n₋) ∑_{i=1}^{n₋} (f(x_i) − c)₋ p(x_i)/q₋(x_i)

or

µ̂ = E_p(g(x)) + (1/n₊) ∑_{i=1}^{n₊} (f(x_i) − g(x_i))₊ p(x_i)/q₊(x_i) − (1/n₋) ∑_{i=1}^{n₋} (f(x_i) − g(x_i))₋ p(x_i)/q₋(x_i)

for some g with known expectation.
Adaptive Importance Sampling 21
Tails of q

σ²_q = ∫ (fp − µq)²/q dx = ∫ (fp)²/q dx − µ²

The term depending on q
Recall w(x) = p(x)/q(x).

∫ (fp)²/q dx = ∫ f²w²q dx = E_q(f²w²) = ∫ f²w p dx = E_p(f²w)

Consequence
µ² + σ²_q = E_p(f²w) = E_q(f²w²)
∴ bounding w(x) = p(x)/q(x) is advantageous for many f.
Adaptive Importance Sampling 22
Gaussian p with t-distributed q
Multivariate Gaussian, θ, Σ:
p(x) ∝ exp(−(x − θ)ᵀΣ^{−1}(x − θ)/2)
Multivariate t, θ, Σ, degrees of freedom ν > 0:
q(x) ∝ (1 + (x − θ)ᵀΣ^{−1}(x − θ)/ν)^{−(ν+d)/2}

The t distribution has heavier tails than the Gaussian:
sup_x p(x)/q(x) is bounded;
sup_x p(x)exp(θᵀx)/q(x) is bounded.
Geweke (1989): split t (at 0).
Reversing the roles of t and Gaussian leads to unbounded w.
Adaptive Importance Sampling 23
q nearly proportional to p
[Figure: left panel, nearly proportional densities (Gaussian and scaled t, 25 df); right panel, their density ratios on a log scale, for x from −6 to 6.]
A light-tailed q can make p/q diverge.
Can have σ²_q = ∞ even for f(x) = 1.
Ironically, it is the unimportant region that causes trouble.
Adaptive Importance Sampling 24
Beyond variance
Chatterjee & Diaconis (2015) show that the sample size required is roughly
n ≈ exp(KL(p‖q)) for generic f.
This analysis involves E_q(|µ̂_q − µ|) and P_q(|µ̂_q − µ| > ε) instead of Var_q(µ̂_q). It avoids
working with variances, which can be harder to handle than means.
Taking ε = .025 in their Theorem 1.2 (for 95% confidence of a small error) shows that we succeed
with n ⩾ 6.55 × 10¹² × exp(KL). Similarly, poor results are very likely for n much smaller
than exp(KL).
The range for n is not precisely determined by these considerations (yet).
Adaptive Importance Sampling 25
Self-normalized importance sampling
Based on O (2013, ch 9.2)

p(x) = c_p p_u(x),  q(x) = c_q q_u(x),  0 < c_p, c_q < ∞

µ̂_q = (c_p/c_q) (1/n) ∑_{i=1}^n f(x_i)p_u(x_i)/q_u(x_i).

If p_u, q_u are computable, but c_p/c_q is unknown:

w_u(x_i) = w_{u,q}(x_i) = p_u(x_i)/q_u(x_i)
w_i = w_{i,q} = w_u(x_i) / ∑_{i'=1}^n w_u(x_{i'})

µ̃_q = ∑_{i=1}^n w_i f(x_i)
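A sketch of the SNIS estimator with unnormalized densities (my own toy setup): p_u ∝ N(0,1) with its constant dropped, q = N(0, 2²) likewise unnormalized, and f(x) = x², whose true mean under p is 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

def p_u(x):                         # unnormalized target, proportional to N(0,1)
    return np.exp(-x**2 / 2)

def q_u(x):                         # unnormalized proposal, proportional to N(0,4)
    return np.exp(-x**2 / 8)

x = rng.normal(0.0, 2.0, n)         # draws from q
w_u = p_u(x) / q_u(x)               # unnormalized weights w_u = p_u/q_u
w = w_u / w_u.sum()                 # self-normalized weights, summing to 1
f = x**2                            # E_p f(x) = 1 for p = N(0,1)
mu_tilde = np.sum(w * f)
print(mu_tilde)
```

Neither normalizing constant is used anywhere; the normalization of the weights cancels c_p/c_q.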
Adaptive Importance Sampling 26
Self-normalized importance sampling
• µ̃_q has some bias, asymptotically negligible
• Self-normalized sampling competes with acceptance-rejection
• It does not attain zero variance for any q (and non-trivial f)

Adaptation
There are adaptive versions of both IS and SNIS.
Adaptive Importance Sampling 27
SNIS is consistent
By the Law of Large Numbers,

µ̃_q = (1/n) ∑_{i=1}^n f(x_i)p_u(x_i)/q_u(x_i) ÷ (1/n) ∑_{i=1}^n p_u(x_i)/q_u(x_i)
    → ∫ f(x) (p_u(x)/q_u(x)) q(x) dx ÷ ∫ (p_u(x)/q_u(x)) q(x) dx   (LLN)
    = (c_q/c_p) ∫ f(x)p(x) dx ÷ (c_q/c_p) ∫ p(x) dx
    = µ

Recall the weights
w_i = (p_u(x_i)/q_u(x_i)) / ∑_{i'=1}^n p_u(x_{i'})/q_u(x_{i'})
Adaptive Importance Sampling 28
SNIS variance
Delta method variance approximation for a ratio estimate:

Var(µ̃_q) ≈ σ²_{q,sn}/n,  σ²_{q,sn} = E_q(w(x)²(f(x) − µ)²)

Rewrite and compare

σ²_{q,sn} = ∫ (p²/q)(f − µ)² dx = ∫ (fp − µp)²/q dx   vs   σ²_q = ∫ (fp − µq)²/q dx

No q can make σ²_{q,sn} = 0 (unless f is constant).
Adaptive Importance Sampling 29
Exactness

µ̂_q = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q(x_i)

µ̃_q = (1/n) ∑_{i=1}^n f(x_i)p_u(x_i)/q_u(x_i) ÷ (1/n) ∑_{i=1}^n p_u(x_i)/q_u(x_i)

Constants f(x) = c
µ̃_q = c, while generally µ̂_q ≠ c.
We don't usually try to average constants, but getting that wrong means:
1) P̂(x ∈ B) + P̂(x ∉ B) ≠ 1, and
2) Ê(f(x) + c) ≠ Ê(f(x)) + c.
Adaptive Importance Sampling 30
Optimal SNIS q

q_opt ∝ |f(x) − µ| p(x)   Hesterberg (1988)
∴ σ²_{q,sn} ⩾ E_p(|f(x) − µ|)²

Best possible variance reduction from SNIS:

σ²/σ²_{q_opt,sn} = E_p((f(x) − µ)²) / E_p(|f(x) − µ|)²

Versus plain IS
For f ⩾ 0, the best IS gets σ²_q = 0.
Adaptive Importance Sampling 31
Rare events
f(x) = 1_{x∈A},  µ = P(x ∈ A) = ε

Optimal q
q_opt(x) ∝ q_{opt,u}(x) = |f(x) − µ| p(x) = { (1−ε)p(x), x ∈ A;  εp(x), x ∉ A }

Normalize q
∫ q_{opt,u}(x) dx = 2ε(1−ε)
∫_A q_opt(x) dx = ε(1−ε)/(2ε(1−ε)) = 1/2  (!!)

Versus plain IS
The best plain IS q = pf/µ puts all probability on A, not half.
Adaptive Importance Sampling 32
Rare event efficiency
σ_{q_opt,sn} = E_p(|f(x) − µ|) = ε(1−ε) + (1−ε)ε = 2ε(1−ε) ≈ 2ε

Best possible coefficient of variation σ_{q,sn}/µ is ≈ 2.

Versus plain MC as µ = ε → 0:
√Var(µ̃_{q_opt,sn})/µ ≈ 2/√n   (optimal SNIS)
√(ε(1−ε)/n)/µ ≈ (1/√ε)(1/√n)   (plain MC)

Optimal SNIS would be very good: it gets cv = 0.1 for n = 400 vs 100/ε.
Plain IS can be even better (lower bound 0).
Adaptive Importance Sampling 33
When to use SNIS
If we only have p_u or q_u then we cannot use IS, and SNIS is available.
I.e., use SNIS when we can draw x ∼ q and evaluate p_u/q_u but not p/q.

Suppose we could do both?
We can then use ∫ p(x) dx = E_q(w(x)) = 1 to define a control variate.
(Much more later.)
The resulting estimate is then exact for q = fp/µ and for f(x) = c.
Asymptotically this combination beats both plain IS and SNIS.
That is, when we have normalized p and q, SNIS is not best.
Adaptive Importance Sampling 34
Effects of dimension
p(x) = ∏_{j=1}^d p_j(x_j | x_1, …, x_{j−1})
q(x) = ∏_{j=1}^d q_j(x_j | x_1, …, x_{j−1})

w(x) = ∏_{j=1}^d p_j(x_j | x_1, …, x_{j−1}) / q_j(x_j | x_1, …, x_{j−1}) ≡ ∏_{j=1}^d w_j(x_j | x_1, …, x_{j−1}).

We can write p and q as above even if we generate them some other way.
Each E_q(w_j(· | ·)) = 1 and E_q(w_j(· | ·)²) ⩾ 1.
As a result E_q(w(x)²) can grow exponentially with d, affecting both IS and SNIS.
If q_j gets closer to p_j as j increases we might be ok.
Or if f is specially related to p/q.
Adaptive Importance Sampling 35
Gaussian p and q
For p = N(0, I) and q = N(θ, I):

w(x) = exp(−θᵀx + θᵀθ/2),  E_q(w(x)²) = exp(θᵀθ).

For θ = (1, 1, 1, …, 1)ᵀ we get E_q(w(x)²) = exp(d).
Similar results hold for θ = (1, 0, 0, …, 0)ᵀ and θ = (1/√d, 1/√d, …, 1/√d)ᵀ.

Variance change
Taking q = N(0, σ²I):

E_q(w(x)²) = ( σ²/√(2σ²−1) )^d for σ² > 1/2, and ∞ otherwise.

We need σ² closer to 1 as d grows, e.g., 1 + O(1/d).
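A small numerical check of E_q(w(x)²) = exp(θᵀθ), using my own parameters (d = 5 and a modest shift θ = (0.2, …, 0.2) so that the Monte Carlo noise is low):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 400_000
theta = np.full(d, 0.2)                      # modest shift, theta'theta = 0.2
x = rng.standard_normal((n, d)) + theta      # x ~ q = N(theta, I)
# w(x) = p(x)/q(x) for p = N(0, I): exp(-theta'x + theta'theta/2)
w = np.exp(-x @ theta + theta @ theta / 2)
est = np.mean(w**2)
print(est, np.exp(theta @ theta))            # both close to exp(0.2)
```

For larger shifts the check still holds in theory, but the sample average of w² becomes dominated by rare huge weights, which is exactly the exponential-in-d trouble the slide describes.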
Adaptive Importance Sampling 36
Infinite dimension
x = (x_1, x_2, …).
The function f(x) depends on only the first M = M(x) components of x.

Assume M is a stopping time
I.e., we can tell whether M(x) = m from x_1, x_2, …, x_m. For example:
M = min{m ⩾ 1 | ∑_{i=1}^m x_i > 100}.

We must choose q(·) so that P_q(M < ∞) = 1, so that we can compute f in finite time.

The estimate

µ̂_q = (1/n) ∑_{i=1}^n f(x_i) ∏_{j=1}^{M(x_i)} p(x_{ij} | x_{i1}, …, x_{i,j−1}) / q(x_{ij} | x_{i1}, …, x_{i,j−1})

Note: the weight w(x) only uses ∏_j p/q for j ⩽ M.
We don't need the infinite product.
Adaptive Importance Sampling 37
Infinite dimension, ctd.
Technicality
E_q(µ̂_q) = E_p(f(x) 1_{M(x)<∞}).
If P_p(M(x) < ∞) = 1 then E_q(µ̂_q) = E_p(f(x)).

The technicality is useful
Suppose we want P_p(M(x) < ∞). Choose f(x) = 1. Then
E_q(µ̂_q) = P_p(M(x) < ∞).
Siegmund (1976)
Adaptive Importance Sampling 38
Weights and effective sample size
Kong (1992)

µ̂_q = (1/n) ∑_{i=1}^n w_i f(x_i),  w_i = p(x_i)/q(x_i),  W ≡ ∑_i w_i

If a small number of w_i dominate the others then µ̂_q can be unstable.
An effectively smaller number of f(x_i) are being averaged.

Effective sample size
n_e = (∑_i w_i)² / ∑_i w_i²

n_e = n if all w_i are equal. n_e = 1 if only one w_i is positive.

Slippery derivation
It follows from the variance of a weighted sum of IID random variables.
In IS, the random variables and the weights are both functions of the same x.
n_e is the same for IS and SNIS.
Adaptive Importance Sampling 39
Effective sample size
n_e = (∑_i w_i)² / ∑_i w_i² ≈ n/(1 + cv(w)²)

Population counterpart
n_e* = n E_q(w)²/E_q(w²) = n/E_q(w²) = n/E_p(w)

Larger w are like getting fewer data points.

Interpretation
Plain MC has n_e = n and we don't want that. IS has n_e < n.
We just want to detect very badly imbalanced weights via a small n_e.

Integrand-specific version
Evans & Swartz (1995)
Use w_i(f) = |f(x_i)|p(x_i)/q(x_i), W(f) = ∑_i w_i(f), w̄_i(f) = w_i(f)/W(f).
n_e(f) = 1 / ∑_{i=1}^n w̄_i(f)²
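A sketch of these diagnostics for a Gaussian shift sampler (my own example): p = N(0,1), q = N(1,1), so θᵀθ = 1 and the population value is n_e* = n/E_q(w²) = n e^{−1}.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
x = rng.standard_normal(n) + 1.0    # draws from q = N(1,1); nominal p = N(0,1)
w = np.exp(-x + 0.5)                # w = p/q = exp(-theta*x + theta^2/2), theta = 1
ne = w.sum()**2 / (w**2).sum()      # sample effective sample size
ne_pop = n / np.exp(1.0)            # population value n / E_q(w^2) = n exp(-theta'theta)
print(ne, ne_pop)
```

The sample n_e is itself a weighted-moment estimate, so it is noisy when the weights are badly imbalanced; here the imbalance is mild and the two values agree well.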
Adaptive Importance Sampling 40
PERT example
Ten steps in a software project. Some cannot start until previous ones are done.

j   Task                           Predecessors   Mean Duration
1   Planning                       None           4
2   Database Design                1              4
3   Module Layout                  1              2
4   Database Capture               2              5
5   Database Interface             2              2
6   Input Module                   3              3
7   Output Module                  3              2
8   GUI Structure                  3              3
9   I/O Interface Implementation   5,6,7          2
10  Final Testing                  4,8,9          2

Table 1: A PERT problem from Chinneck (2009, Chapter 11)
Adaptive Importance Sampling 41
PERT ctd
[Figure: PERT graph (activities on nodes 1–10).]
Adaptive Importance Sampling 42
PERT ctd
For the given mean times, the project completes in E = E_10 = 15 days.
(They seem optimistic.)

Start time            Time taken   End time
S_j = max_{k→j} E_k   T_j          E_j = S_j + T_j

E.g., from the graph: S_9 = max(E_5, E_6, E_7).
For independent exponential random variables T_j, we get E(E_10) ≐ 18.
Suppose there's a large penalty when E > 70.

Plain MC
Got E > 70 twice in n = 10,000 tries.
Let's try importance sampling.
Adaptive Importance Sampling 43
Change the exponential rates
Change T_j from exponential with mean θ_j to mean λ_j.

µ̂_λ = (1/n) ∑_{i=1}^n 1_{E_{i,10}>70} ∏_{j=1}^{10} [exp(−T_{ij}/θ_j)/θ_j] / [exp(−T_{ij}/λ_j)/λ_j].

We can make E_λ(E_10) ≈ 70 by taking λ = 4θ.
We get µ̂_λ ≐ 2.01 × 10^{−5} with a standard error of 4.7 × 10^{−6}.

Diagnostics
We also get n_e ≐ 4.90. This is quite low.
The largest weight was about 43% of the total.
The standard error did not warn us about this problem.
The standard error is more vulnerable to large weights than the mean is.
An effective sample size for the se is n_{e,σ} ≐ 1.15. O (2013), Chapter 9.3
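A sketch of the λ = 4θ experiment, using the task means and precedence structure from Table 1 (the particular numbers it prints depend on the random seed, and my run need not reproduce the slide's values exactly):

```python
import numpy as np

rng = np.random.default_rng(7)
theta = np.array([4, 4, 2, 5, 2, 3, 2, 3, 2, 2], float)   # mean durations, Table 1
pred = {1: [], 2: [1], 3: [1], 4: [2], 5: [2], 6: [3],
        7: [3], 8: [3], 9: [5, 6, 7], 10: [4, 8, 9]}
lam = 4 * theta                     # sample with inflated means lambda = 4*theta
n = 100_000
T = rng.exponential(lam, size=(n, 10))

# End times E_j = S_j + T_j with start S_j = max over predecessors' end times
E = np.zeros((n, 10))
for j in range(1, 11):
    S = np.zeros(n)
    for k in pred[j]:
        S = np.maximum(S, E[:, k - 1])
    E[:, j - 1] = S + T[:, j - 1]

# Likelihood ratio p/q for independent exponentials
w = np.prod((lam / theta) * np.exp(-T / theta + T / lam), axis=1)
fw = (E[:, 9] > 70) * w
mu_hat = fw.mean()
se = fw.std(ddof=1) / np.sqrt(n)
ne = w.sum()**2 / (w**2).sum()
print(mu_hat, se, ne)
```

As on the slide, the effective sample size comes out far below n, which is the diagnostic that this all-tasks tilting is a poor choice.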
Adaptive Importance Sampling 44
Try different λ
Replace λ = 4θ by λ = κθ and search for a good κ. (None seemed good.)

Unequal scaling
Adding a day to some tasks does not always add a day to the project.
Details: T_j = θ_j + 1, T_k = θ_k for k ≠ j. Critical path: j ∈ {1, 2, 4, 10}.
Even doubling some tasks' time might not cause a delay.
Details: T_j = 2θ_j, T_k = θ_k for k ≠ j. Changes for: j ∉ {3, 7, 8}.

Scale tasks 1, 2, 4, 10 by κ:

κ             3.0   3.5   4.0   4.5   5.0
10⁵ × µ̂       3.3   3.6   3.1   3.1   3.1
10⁶ × se(µ̂)   1.8   2.1   1.6   1.5   1.6
n_e           939   564   359   239   165
n_{e,σ}       165    88    52    33    23
Adaptive Importance Sampling 45
Followup
Using κ = 4 on the critical path with n = 200,000 yields
µ̂_q = 3.18 × 10^{−5}, se = 3.62 × 10^{−7}, n_e ≐ 7470, n_{e,σ} ≐ 992, n_e(f) ≐ 7400.

Vs plain MC
Plain MC has variance ε(1−ε)/n.
IS has variance about 1200-fold smaller here.

Vs SNIS
In this case SNIS had about double the estimated variance of plain IS.

Further improvements
Much more subtle tuning could be applied.
The 10th random variable can be 'integrated out' in closed form, reducing variance.
Control variates could be applied.
Antithetics could be applied.
Adaptive Importance Sampling 46
Summary of basic IS
• IS helps with rare events or spiky integrands
• It becomes unbiased by reweighting
• A good importance sampler q is nearly proportional to fp
• Take extreme care that q is not small
• Self-normalized IS is more widely applicable, but less effective
• Dimension brings challenges
PERT example
This was adaptive IS done by hand.
Grid Science Winter School, January 2017
Adaptive Importance Sampling 47
Devising q
We can go through a storehouse of distributions to find q.
For instance Devroye (1986).
General techniques
• Exponential tilting
• Laplace approximations
• Mixtures
Adaptive Importance Sampling 48
Exponential tilting
Also called exponential twisting.

Exponential family
p(x; θ) = exp( θᵀx − A(x) − C(θ) ),  x ∈ X

Here p(x) = p(x; θ_0) and q(x) = p(x; θ).
Includes: N(θ, I), Poisson(e^θ), Binomial, and Gamma distributions.

More general case
p(x; θ) = exp( η(θ)ᵀT(x) − A(x) − C(θ) ),  x ∈ X

Here η(·) and T(·) are known functions.
Adaptive Importance Sampling 49
IS with tilting

µ̂_θ = (1/n) ∑_{i=1}^n f(x_i) p(x_i; θ_0)/p(x_i; θ),  x_i ∼ p(·; θ)
    = (1/n) ∑_{i=1}^n f(x_i) exp(x_iᵀθ_0 − A(x_i) − C(θ_0)) / exp(x_iᵀθ − A(x_i) − C(θ))
    = (c(θ)/c(θ_0)) (1/n) ∑_{i=1}^n f(x_i) e^{x_iᵀ(θ_0−θ)}

where c(θ) = exp(C(θ)). This often simplifies.

For N(θ, Σ):

µ̂_θ = (1/n) ∑_{i=1}^n f(x_i) e^{x_iᵀΣ^{−1}(θ_0−θ)} × e^{θᵀΣ^{−1}θ/2 − θ_0ᵀΣ^{−1}θ_0/2}

where the last factor is c(θ)/c(θ_0).
Adaptive Importance Sampling 50
Further tilting
Define a family q(x; θ) ∝ p(x)e^{h(x)ᵀθ}.
The function h(·) defines 'features' of x.
θ_j > 0 makes higher values of h_j(x) more probable.
q(x; 0) = p(x).
If p is not in the exponential family, then we might not be able to compute w(x).
If we can sample from q(x; θ) then we can use SNIS.
Adaptive Importance Sampling 51
Hessian at the mode
Suppose that we find the mode x* of p(x), or better yet, of h(x) ≡ p(x)f(x).

Taylor approximation
log(h(x)) ≈ log(h(x*)) − (1/2)(x − x*)ᵀH*(x − x*)
h(x) ≈ h(x*) exp( −(1/2)(x − x*)ᵀH*(x − x*) ),  suggesting
q = N(x*, H*^{−1}).

The Hessian of log(h) at x* is −H*. Requires positive definite H*.

For safety
Use a t distribution instead of N.
Adaptive Importance Sampling 52
Mixtures
What if there are multiple modes?
https://en.wikipedia.org/wiki/Multimodal_distribution
Closely sampling the highest mode might lead us to undersample or even miss some others.
A mixture of unimodal densities can capture all the modes (that we know of).
Adaptive Importance Sampling 53
Mixture distributions
Sample j randomly from {1, 2, …, J} with probabilities α_1, …, α_J.
Here α_j > 0 and ∑_{j=1}^J α_j = 1.
Given j, take x ∼ q_j.

For example: ∑_{j=1}^J α_j N(θ_j, σ²I_d)

With large J, we get kernel density approximations.
These can approximate generic densities.
The approximation gets more difficult in high dimension.
West (1993), Oh & Berger (1993)
Adaptive Importance Sampling 54
Multiple rare event sets
Suppose that x describes a bridge and its load.
There is a failure if x ∈ A1 or x ∈ A2.
Oversampling A1 might lead us to miss A2.
Finance example
Portfolio might fail to meet its goal under:
A1: oil sector underperforms
A2: dollar becomes too weak
A3: interest rates in China are too high
A4: pharmaceutical stocks do too well (portfolio shorted them)
Mixtures
One component can oversample each failure mechanism (that we know of).
Adaptive Importance Sampling 55
Multiple integrands of interest
We want µ_j = ∫ f_j(x)p(x) dx for j = 1, …, J.
Statistics: numerous posterior moments.
Graphics: red, green, blue for each pixel.
Grid science: J different designs.
Each fj(·)p(·) has its own peak.
Mixtures
One mixture component can be directed towards each integrand (that we have planned for).
We will have to take care that optimizing for f1 does not spoil things for f2. The same issue
arises for multiple modes and failure regions.
Adaptive Importance Sampling 56
Multiple nominal distributions
We want µ_j = ∫ f(x)p_j(x) dx for j = 1, …, J.
The pj may correspond to different scenarios (that we are aware of).
Mixtures
Multiple pj and multiple fk amount to multiple products (pf)jk.
We can use a component for each.
Adaptive Importance Sampling 57
IS with mixtures
q_α(x) = ∑_{j=1}^J α_j q_j(x)

µ̂_α = (1/n) ∑_{i=1}^n f(x_i)p(x_i)/q_α(x_i) = (1/n) ∑_{i=1}^n f(x_i)p(x_i) / ∑_{j=1}^J α_j q_j(x_i).   (∗∗)

(∗∗) The balance heuristic of Veach & Guibas; also the Horvitz-Thompson estimator.

Alternative
Suppose that x_i came from component j(i). We could also use

(1/n) ∑_{i=1}^n f(x_i)p(x_i)/q_{j(i)}(x_i).

It is also unbiased.
This alternative has higher variance.
But you don't have to compute every q_j for every x_i.
Adaptive Importance Sampling 58
Why higher variance?
Argument by convexity. Simple case: α = (1/2, 1/2) for q_1 and q_2.
Recall: σ²_q = ∫ (fp)²/q dx − µ²

Horvitz-Thompson:
(1/n) ( ∫ (fp)²/((q_1 + q_2)/2) dx − µ² )

Alternative:
(1/(2n)) ( ∫ (fp)²/q_1 dx − µ² ) + (1/(2n)) ( ∫ (fp)²/q_2 dx − µ² )

Horvitz-Thompson is better because 1/x is convex.
We gave the alternative a slight advantage by taking exactly n/2 observations from each q_j.
Adaptive Importance Sampling 59
Defensive mixtures
From Hesterberg (1995)
For our best guess q(·), take q_1 ≡ p and let q_2 = q. Now

w(x) = p(x)/q_α(x) = p(x)/(α_1 p(x) + α_2 q_2(x)) ⩽ 1/α_1,  ∀x

Perhaps α_1 = 1/10 or 1/2.
q does not need to be heavy tailed any more, because q_α is.

Variance bounds
Var(µ̂_{q_α}) ⩽ (1/(nα_1)) (σ²_p + α_2 µ²)
Var(µ̃_{q_α,sn}) ⩽ (1/(nα_1)) σ²_{p,sn} = (1/(nα_1)) σ²_p
Var(µ̃_{q_α,sn}) ⩽ (1/(nα_2)) σ²_{q,sn}

If however σ²_q was very small, defensive IS can lose out.
We don't have a bound vs σ²_q.
Adaptive Importance Sampling 60
Loss of proportionality
[Figure: f, the sampling densities, the likelihood ratios, and the integrands, each on [0, 1].]

Defensive importance sampling
• f Gaussian near 0.7
• p uniform
• q is a Beta ≈ f
• q_α is the dotted mixture
• α = (0.2, 0.8)
• fp/q unbounded
• q_α not very proportional to fp = f
• Solution: control variates.
Grid Science Winter School, January 2017
Adaptive Importance Sampling 61
Control variates
Suppose that we know E(h(x)) = θ in closed form.
If h ≈ f we can Monte Carlo the difference

(1/n) ∑_{i=1}^n ( f(x_i) − h(x_i) ) + θ

for greatly reduced variance.

More generally
For known E(h(x)) = θ ∈ R^J, let

µ̂_β = (1/n) ∑_{i=1}^n ( f(x_i) − βᵀh(x_i) ) + βᵀθ

The optimal β can be estimated from the same data.
Adaptive Importance Sampling 62
It is easy to see that E(µ̂_β) = µ (unbiased) for any β.
The optimal β minimizes Var(µ̂_β). This β_opt is unknown, and could well be harder to estimate
than our ultimate goal µ.
A harder intermediate problem is usually to be avoided, but in this case we have a loophole.
Getting β wrong by ε will mean our variance is too large by a factor of 1 + O(ε²). We do not
need a precise estimate of β. The typical ε is proportional to 1/√n. We can ordinarily analyze
control variates as if we were using the true unknown β.
[Aside: this asymptotic approximation is poor if the dimension of β is comparable to n.]
Adaptive Importance Sampling 63
Control variates via regression
Define Y_i = f(x_i) and Z_ij = h_j(x_i) − θ_j.
Fit Y_i ≐ β_0 + ∑_{j=1}^J β_j Z_ij by least squares.
The intercept β̂_0 is the estimate of µ.
Regression code also gives a standard error for the estimate.
Details in O (2013), Algorithm 8.3.
Centering Z_ij is essential.

Using estimated β
β̂ − β_opt = O_p(n^{−1/2}), so we are about n^{−1/2} from the optimum.
σ²_β is quadratic in β ⟹ σ²_{β̂} = σ²_{β_opt}(1 + O(1/n)).
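A sketch of the regression recipe on a toy problem of my own: estimate µ = E(e^U) = e − 1 for U ∼ Uniform(0,1), with the single control variate h(U) = U, whose mean θ = 1/2 is known.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
u = rng.random(n)
y = np.exp(u)                       # Y_i = f(x_i); true mean is e - 1
z = u - 0.5                         # centered control variate, E(u) = 1/2 known

# Least squares fit y ~ beta0 + beta1 * z; the intercept estimates mu
X = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = beta[0]
plain = y.mean()
print(mu_hat, plain, np.e - 1)
```

Because e^U is nearly linear on [0,1], the regression estimate is far more precise than the plain average; the residual standard deviation drops from about 0.49 to about 0.06.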
Adaptive Importance Sampling 64
Control variates and mixture IS
For normalized q_j(·) we know ∫ q_j(x) dx = 1, j = 1, …, J:

E_{q_α}( q_j(x)/q_α(x) ) = ∫ q_j(x) dx = 1

Unbiased estimate

µ̂_{α,β} = (1/n) ∑_{i=1}^n [ f(x_i)p(x_i) − ∑_{j=1}^J β_j q_j(x_i) ] / ∑_{j=1}^J α_j q_j(x_i) + ∑_{j=1}^J β_j

Additional control variates can be added too.

Via regression
Regress Y_i = f(x_i)p(x_i)/q_α(x_i) on Z_ij = q_j(x_i)/q_α(x_i) − 1.
Get µ̂ = β̂_0 (intercept) and its se.
∑_j α_j Z_ij = 0 for all i, so drop one predictor.
Adaptive Importance Sampling 65
Mixture IS results
O & Zhou (2000)

Var(µ̂_{α,β_opt}) ⩽ min_{1⩽j⩽J} σ²_j/(nα_j)

Properties
1) µ̂_{α,β} is unbiased if q_α > 0 whenever fp ≠ 0
2) If any q_j has σ²_{q_j} = 0 we get Var(µ̂_{α,β_opt}) = 0
3) Even better to take exactly nα_j observations from q_j.

We could not expect better in general.
We might have σ²_j = ∞ for all but one j.
The bound is σ²_j/(nα_j), as if we had just used the good one.
(Without knowing which one it was. Indeed it might be a different one for each of several
different integrands.)
Adaptive Importance Sampling 66
Multiple importance sampling
Veach & Guibas (1995)
Widely used in computer graphics. Eric Veach got an Oscar for it in 2014.

µ̂ = ∑_{j=1}^J (1/n_j) ∑_{i=1}^{n_j} ω_j(x_{ij}) f(x_{ij})p(x_{ij})/q_j(x_{ij})

q_j = j'th light path sampler

Partition of unity
ω_j(x) ⩾ 0 and ∑_{j=1}^J ω_j(x) = 1, ∀x

Can attain Horvitz-Thompson via the 'balance heuristic' ω_j(·) ∝ n_j q_j(·).
The weight on f(x_{ij})p(x_{ij}) can depend on j in many ways.
Adaptive Importance Sampling 67
Summary of mixtures
Using mixtures, we can
• bound the importance ratio
• place a distribution near each singularity
• place a distribution near each failure mode
• tune a distribution to each fj of interest
• tune a distribution to each pj of interest
• use control variates to be almost as good as the optimal component
The mixture components can be based on intuition, tilting, Hessians.
Adaptive Importance Sampling 68
What-if simulations
We estimated µ = ∫ f(x)p(x) dx using x_i ∼ q.
From the sample values f(x_i) we can estimate:
1) ∫ f(x)p′(x) dx for a different density p′ ≠ p
2) Var_q(µ̂_q) for x_i ∼ q
3) Var_{q′}(µ̂_{q′}) as if x_i ∼ q′, for a different density q′ ≠ q
Estimating what would have happened with a different q is the key to adaptive IS.
Adaptive Importance Sampling 69
What-if simulations
Family p(x; θ), θ ∈ Θ.
We want µ(θ) = E(f(x); θ) = ∫ f(x)p(x; θ) dx, θ ∈ Θ.

Sample from θ_0 and reweight for the others:

µ̂(θ) = (1/n) ∑_{i=1}^n f(x_i)p(x_i; θ)/p(x_i; θ_0),  x_i ∼ p(·; θ_0).

We can recycle our f(x_i) values.

Common heavy-tailed q

µ̂(θ) = (1/n) ∑_{i=1}^n f(x_i)p(x_i; θ)/q(x_i),  x_i ∼ q(·)

Paraphrase (from memory) of Trotter & Tukey (1956):
"Any sample can come from any distribution."
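A sketch of this reweighting on a hypothetical family (my own example): sample once at θ_0 = 0 from p(·; θ) = N(θ, 1), then estimate µ(θ) = P_θ(x > 2) for any θ without new draws.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(10)
n = 200_000
theta0 = 0.0
x = rng.standard_normal(n) + theta0          # x_i ~ p(.; theta0) = N(theta0, 1)
fx = (x > 2.0).astype(float)                 # f(x) = 1{x > 2}; recycled for every theta

def mu_hat(theta):
    # reweight by p(x; theta)/p(x; theta0) for unit-variance Gaussians
    w = np.exp((theta - theta0) * x - (theta**2 - theta0**2) / 2)
    return np.mean(fx * w)

est = mu_hat(0.5)
truth = 0.5 * erfc((2.0 - 0.5) / sqrt(2))    # P(N(0.5,1) > 2) = P(Z > 1.5)
print(est, truth)
```

One sample serves a whole curve of θ values, though (as the next slide warns) the accuracy degrades quickly once θ is far from θ_0.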
Adaptive Importance Sampling 70
What-if simulations
These may be accurate for θ near θ_0, but otherwise may fail.

For p = N(θ_0, Σ) and q = N(θ, Σ):
n_e* ⩾ n/10 requires (θ − θ_0)ᵀΣ^{−1}(θ − θ_0) ⩽ log(10) ≐ 2.30. O (2013), Chapter 9.14.

Example
f(x_i) = max(x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}) where x_{ij} ∼ Poi(ν).
Importance sample from λ instead of ν:

µ̂_ν = (1/n) ∑_{i=1}^n max(x_{i1}, …, x_{i5}) ∏_{j=1}^5 [e^{−ν}ν^{x_{ij}}/x_{ij}!] / [e^{−λ}λ^{x_{ij}}/x_{ij}!]
    = (1/n) ∑_{i=1}^n max(x_{i1}, …, x_{i5}) e^{5(λ−ν)} (ν/λ)^{∑_j x_{ij}}
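A numerical check of the Poisson reweighting (ν = 1 and λ = 1.5 are my own choices): the reweighted estimate from λ should agree with direct simulation at ν.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
nu, lam = 1.0, 1.5

# Direct simulation at nu
x_direct = rng.poisson(nu, size=(n, 5))
mc = x_direct.max(axis=1).mean()

# Importance sample at lam and reweight back to nu
x = rng.poisson(lam, size=(n, 5))
s = x.sum(axis=1)
w = np.exp(5 * (lam - nu)) * (nu / lam) ** s     # product of the 5 Poisson ratios
est = np.mean(x.max(axis=1) * w)
print(est, mc)
```

Here λ > ν so q has heavier tails than p; in the reverse direction (ν > λ) the weights become unbounded and the what-if estimates degrade, as the next slide's figure shows.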
Adaptive Importance Sampling 71
Poisson example
[Figure: left panel, effective sample sizes vs λ for ν from 0.5 to 2.5 (solid curve is the population effective sample size; asterisks are sample estimates); right panel, means and 99% C.I.s vs λ, for x_ij ∼ Poi(λ).]
When ν > λ, q has lighter tails than p. Hence wide CIs and poor n_e.
Adaptive Importance Sampling 72
What-if for variance
For q(x; θ),
σ²θ = ∫ (f(x)p(x))²/q(x; θ) dx − µ² ≡ MSθ − µ²
Now
MSθ = ∫ (f(x)p(x))²/q(x; θ) dx = ∫ [(f(x)p(x))² / (q(x; θ) q(x; θ0))] q(x; θ0) dx
Sample estimate
For xi ∼ q(x; θ0),
MŜθ(θ0) = (1/n) ∑_{i=1}^n (f(xi)p(xi))² / (q(xi; θ) q(xi; θ0))
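A minimal sketch of this what-if variance estimate, under the hypothetical choices p = N(0, 1), f(x) = 1{x > 2}, and q(· ; θ) = N(θ, 1): one sample drawn at θ0 lets us scan MŜθ over a grid of θ and pick a promising proposal.

```python
import numpy as np

rng = np.random.default_rng(4)

def phi(x, theta):                      # N(theta, 1) density
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

# Hypothetical setup: p = N(0,1), f(x) = 1{x > 2}, proposals q(.; theta) = N(theta, 1)
theta0 = 1.0
n = 100_000
x = rng.normal(theta0, 1.0, size=n)
num = (x > 2.0) * phi(x, 0.0)           # f(x) p(x)

def ms_hat(theta):
    # What-if estimate of MS_theta, using only the sample drawn at theta0
    return np.mean(num ** 2 / (phi(x, theta) * phi(x, theta0)))

thetas = np.linspace(0.0, 3.0, 31)
best = thetas[np.argmin([ms_hat(t) for t in thetas])]
print(best)     # near the good mean shift for this rare event
```

This is the basic ingredient of adaptive IS: the scan over θ costs no new function evaluations.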
Summary of what-if simulations
• We can learn about alternative p’s and q’s
• Effectiveness is over a limited range
The next slides are a survey of a selected set of adaptive importance sampling methods.
These are the ones that I thought were interesting and potentially useful in grid science.
Of course there are many others.
Adaptive importance sampling
Overview: use the xi sampled so far to change q and sample some more.
Two stage
Get a pilot sample xi1 ∼ q1. Use them to choose q2. Take xi2 ∼ q2 to estimate µ.
Multi-stage
Given q1, for k = 1, …, K − 1, use xik, i = 1, …, nk to choose qk+1.
Or use xij, 1 ≤ i ≤ nj, 1 ≤ j ≤ k to choose qk+1.
n-stage
Given q1, take xi from qi computed from x1, …, xi−1.
AIS details
We have to pick a family Q of proposal distributions.
We need an initial q ∈ Q.
We have to choose a termination criterion:
e.g., K steps, total N observations, confidence interval width below ε.
We need a way to choose qk+1 ∈ Q from prior data.
We need a way to combine information from all K stages.
Combination
Suppose we get µ̂k for k = 1, …, K, unbiased with variances σ²k.
Best linear unbiased combination
∑_{k=1}^K µ̂k/Var(µ̂k) / ∑_{k=1}^K 1/Var(µ̂k)
It is not safe to replace Var(µ̂k) by the estimate V̂ar(µ̂k).
It is common for V̂ar(µ̂k) to be correlated with µ̂k.
E.g., the largest µ̂k may coincide with the largest V̂ar(µ̂k).
Plug-in estimates could bias the estimate down.
Deterministic weighting is less risky.
Deterministic weights
Suppose the true variance improves: Var(µ̂k) ∝ k^{−r0} for some 0 ≤ r0 ≤ 1.
We should weight proportionally to k^{r0} but we might not know r0. So we use r1.
If we take r1 = 1/2 then we never raise the variance by more than 12.5%:
sup_{1≤K<∞} max_{0≤r0≤1} Var(using r1 = 1/2)/Var(using r0) = 9/8.
Upshot
Only a small efficiency loss from weighting stage k proportionally to √k:
∑_{k=1}^K √k µ̂k / ∑_{k=1}^K √k.
O & Zhou (1999)
Not efficient for exponential convergence (see below) but that is a rare circumstance.
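The √k combination rule above is a one-liner; a sketch (the helper name is mine):

```python
import numpy as np

# Combine per-stage estimates mu_hat_k with deterministic weights proportional
# to sqrt(k), as on the slide: sum_k sqrt(k) mu_hat_k / sum_k sqrt(k).
def combine_sqrt_k(mu_hats):
    k = np.arange(1, len(mu_hats) + 1)
    w = np.sqrt(k)
    return np.sum(w * np.asarray(mu_hats)) / np.sum(w)

# A constant sequence is returned unchanged (up to rounding),
# and later, lower-variance stages get more weight.
print(combine_sqrt_k([3.0, 3.0, 3.0]))
print(combine_sqrt_k([0.0, 1.0]))
```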
Combination
Put weights ωk on µ̂k, such as ωk ∝ √k.
Final estimate µ̂ = ∑_{k=1}^K ωk µ̂k.
Assume
E(µ̂k | data to step k − 1) = µ
E(V̂ar(µ̂k) | data to step k − 1) = Var(µ̂k | data to step k − 1)
Then by martingale arguments
E(µ̂) = µ, and
V̂ar(µ̂) ≡ ∑k ω²k V̂ar(µ̂k) satisfies
E(V̂ar(µ̂)) = Var(µ̂)
Exponential convergence
Parametric set Q = {qθ | θ ∈ Θ}, Θ ⊂ R^d
Best case scenario
If q∗ = fp/µ ∈ Q then we can approach σ²q = 0.
Let q∗ = qθ∗. If we can estimate θ∗ to within O(n^{−1/2}) from a first estimate we get much smaller variance at the second step.
The result can be a virtuous cycle of θ(k) → θ∗ and σ²k → 0.
Examples
1) Toy example with p = N(0, 1) and f(x) = exp(Ax).
2) Theory from Kollman (1993) and others for particle transport.
Toy example
p = N(0, 1), f(x) = exp(Ax) for A > 0, and qθ = N(θ, 1).
We actually know µ = E(f(x)) = exp(A²/2) so we don't really need MC.
Suppose we also know that θ∗ = A and Ep(f(x)) = exp(A²/2).
That is θ∗ = √(2 log(E(f(x)))).
Cycle through
xi ∼ qθ(k), i = 1, …, n
µ̂(k) ← (1/n) ∑_{i=1}^n f(xi) p(xi)/qθ(k)(xi)
θ(k+1) ← √(2 log(max(1, µ̂(k)))).
Without max(1, ·) the algorithm can fail.
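The cycle above can be sketched as follows; A, n, and K are arbitrary choices for illustration. At θ = A the integrand f·p/qθ is constant, so once θ(k) gets close the error collapses:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy AIS cycle from the slide: p = N(0,1), f(x) = exp(A x), q_theta = N(theta, 1).
# True value mu = exp(A^2/2), and the optimal shift is theta* = A.
A = 1.0
mu_true = np.exp(A ** 2 / 2)
n, K = 100, 20

theta = 0.0
for k in range(K):
    x = rng.normal(theta, 1.0, size=n)
    # f(x) p(x) / q_theta(x) = exp(A x + theta^2/2 - theta x)
    mu_hat = np.mean(np.exp(A * x + theta ** 2 / 2 - theta * x))
    theta = np.sqrt(2 * np.log(max(1.0, mu_hat)))   # guard per the slide

print(abs(mu_hat - mu_true))   # tiny: a near-zero-variance sampler is reached
```

Each cycle multiplies the error of θ by roughly O(n^{−1/2}), which is the exponential convergence the next slide plots.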
Exponential convergence
[Figure: “Exponentially converging AIS error” — absolute error versus iteration on a log scale, decreasing exponentially.]
Steps with n = 20.
Transport problems
Booth (1985) observed exponential convergence in some particle transport problems.
Kollman (1993) established exponential convergence under strong conditions: start one simulation from each state of a Markov chain.
Extensions from Kollman, Baggerly, Cox & Picard (1999), Baggerly, Cox & Picard (2000), and Kong, Ambrose & Spanier (2009).
Story
Markovian particles: P(xn+1 = x | x0, …, xn) = P(xn+1 = x | xn)
Terminal state ∆ (absorption, or left the region of interest): limn→∞ P(xn = ∆) = 1.
f(x1, x2, …) = ∑_{n=1}^∞ s(xn−1 → xn) = ∑_{n=1}^τ s(xn−1 → xn)
hitting time τ, score function s(· → ·) with s(∆ → ∆) = 0.
Particle transport
The particle follows a Markov chain with transitions P.
We use transitions Q instead, with Qij > 0 whenever Pij > 0.
Optimal Q
Qij = Pij(sij + µj) / (Pi∆ si∆ + ∑_{ℓ=1}^d Piℓ(siℓ + µℓ))
Qi∆ = Pi∆ si∆ / (Pi∆ si∆ + ∑_{ℓ=1}^d Piℓ(siℓ + µℓ)), where
µi = EQ( ∑_{n=1}^τ s(xn−1 → xn) wn(x1, …, xn) | x0 = i ), and
wn = ∏_{j=1}^n P_{x_{j−1} x_j} / Q_{x_{j−1} x_j}.
Algorithm
Alternates between estimating µ and sampling from Q(µ̂).
Cross-entropy
Rubinstein (1997), Rubinstein & Kroese (2004)
µ = ∫ f(x)p(x) dx for f ≥ 0 and µ > 0.
We use q(· ; θ) = qθ for θ ∈ Θ.
There is an optimal q ∝ fp but it is not usually in our family.
We would like
θ = arg min_{θ∈Θ} MS(θ) where MS(θ) = ∫ (fp)²/qθ dx
Variance based update
θ(k+1) = arg min_{θ∈Θ} (1/nk) ∑_{i=1}^{nk} (f(xi)p(xi))²/qθ(xi), with xi = xi(k) ∼ qθ(k)
The optimization may be too hard. Switch to entropy.
Cross-entropy
Use an exponential family
qθ = exp(θᵀx − A(x) − C(θ))
Replace variance by the KL distance
D(q∗ ‖ qθ) = Eq∗( log(q∗(x)/qθ(x)) )
This distance has D(q∗ ‖ qθ) ≥ 0 and D(q∗ ‖ q∗) = 0.
Seek θ to minimize
D(q∗ ‖ qθ) = Eq∗(log(q∗(x)) − log(q(x; θ)))
I.e., maximize
Eq∗(log(q(x; θ)))
Cross-entropy
For q > 0 whenever q∗ > 0,
Eq∗(log(q(x; θ))) = Eq( log(q(x; θ)) q∗(x)/q(x) ) = (1/µ) Eq( log(q(x; θ)) f(x)p(x)/q(x) )
Choose θ(k+1) to maximize
G(k)(θ) = (1/nk) ∑_{i=1}^{nk} [p(xi)f(xi)/q(xi; θ(k))] log(q(xi; θ))
  ≡ (1/nk) ∑_{i=1}^{nk} Hi log(q(xi; θ))
  ≡ (1/nk) ∑_{i=1}^{nk} Hi (θᵀxi − A(xi) − C(θ))
C(θ) is a known convex function and Hi ≥ 0. Assume (for now) that ∑i Hi > 0.
The update often takes a simple moment matching form.
Cross-entropy update
∑i Hi xiᵀ / ∑i Hi = (∂/∂θ) C(θ(k+1))
For qθ = N(θ, I):
θ(k+1) ← ∑i Hi xi / ∑i Hi
For qθ = N(θ, Σ):
θ(k+1) ← Σ⁻¹ ∑i Hi xi / ∑i Hi
Other exponential family updates are typically closed form functions of sample moments.
What if all Hi = 0?
Consider f(x) = 1{g(x) > τ} for large τ.
Use some other τ(k) instead, for example the 99th percentile of the g(xi).
Ideally τ(k) → τ.
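A sketch of a cross-entropy iteration for a rare event, combining the moment-matching update with the provisional threshold τ(k). The concrete choices (g(x) = x, τ = 5, qθ = N(θ, 1), the sample sizes and seed) are hypothetical illustrations:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sketch of cross-entropy for mu = Pr(g(x) > tau), x ~ p = N(0,1),
# with proposals q_theta = N(theta, 1). When no sample exceeds tau,
# the 99th percentile of g(x_i) acts as the provisional threshold tau_k.
g = lambda x: x
tau, n, K = 5.0, 10_000, 10

theta = 0.0
for k in range(K):
    x = rng.normal(theta, 1.0, size=n)
    tau_k = min(tau, np.quantile(g(x), 0.99))    # provisional threshold
    # H_i = f(x_i) p(x_i)/q(x_i); for N(theta,1), p/q = exp(theta^2/2 - theta x)
    H = (g(x) > tau_k) * np.exp(theta ** 2 / 2 - theta * x)
    theta = np.sum(H * x) / np.sum(H)            # moment-matching update

# Final importance sampling estimate at the adapted theta
x = rng.normal(theta, 1.0, size=n)
mu_hat = np.mean((g(x) > tau) * np.exp(theta ** 2 / 2 - theta * x))
print(mu_hat)    # compare with the exact Pr(N(0,1) > 5), about 2.87e-7
```

The threshold climbs toward τ over the first few iterations, after which the update settles near the conditional mean of x given x > τ.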
Cross-ent examples
[Figure: two scatter panels over [−6, 6]²: “Gaussian, Pr(min(x) > 6)” and “Gaussian, Pr(max(x) > 6)”.]
θ1 = (0, 0)ᵀ.
Take K = 10 steps with n = 1000 each.
Cross-ent examples
[Figure: the same two panels, “Gaussian, Pr(min(x) > 6)” and “Gaussian, Pr(max(x) > 6)”, showing the paths of the θk.]
θ1 = (0, 0)ᵀ.
For min(x), θk heads Northeast, and is ok.
For max(x), θk heads North, or East, and underestimates µ by about 1/2.
Nonparametric AIS
Flexible / infinite dimensional families Q.
Often based on recursive rectangular partitioning of a region like [0, 1]^d.
1) Vegas, Lepage (1978)
2) Divonne, Friedman & Wright (1979, 1981)
3) Miser, Press & Farrar (1990)
4) Split Vegas, Zhou (1998) ≈ Vegas within Miser
Vegas
q(x) = ∏_{j=1}^d qj(xj)
Each qj is piece-wise constant. Equal content, not equal width.
[Figure: “Vegas-style piecewise constant density” on [0, 1].]
Very amenable to sampling spikes.
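A sketch of the equal-content representation in one dimension. The bin edges here are fixed by hand for illustration (Vegas adapts them iteratively); each bin carries probability 1/B, so narrow bins near 0 put high density on the spike:

```python
import numpy as np

rng = np.random.default_rng(7)

# Vegas-style 1-d sampler: bins with EQUAL content (probability 1/B each)
# but unequal widths, given by a sorted grid of bin edges on [0, 1].
def sample_equal_content(edges, n, rng):
    B = len(edges) - 1
    j = rng.integers(0, B, size=n)                 # uniform bin index
    u = rng.random(n)
    x = edges[j] + u * (edges[j + 1] - edges[j])   # uniform within the bin
    q = 1.0 / (B * (edges[j + 1] - edges[j]))      # piecewise constant density
    return x, q

# Hand-picked edges: narrow bins near 0 -> high density near the spike.
edges = np.array([0.0, 0.01, 0.05, 0.15, 0.4, 1.0])
f = lambda x: x ** (-0.25)      # integrable spike at 0; integral is 4/3
x, q = sample_equal_content(edges, 100_000, rng)
print(np.mean(f(x) / q))        # importance sampling estimate, near 4/3
```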
Comments on Vegas
The optimal qj given the other qℓ is
qj(xj) ∝ √( ∫_{(0,1)^{d−1}} [f(x)² / ∏_{ℓ≠j} qℓ(xℓ)] ∏_{ℓ≠j} dxℓ ).
The update starts by iterating the above. Some complex steps.
A problem with Vegas
Suppose f(x) has two modes in [0, 1]^{10}, one near (0, 0, …, 0) and one near (1, 1, …, 1).
Each qj needs two modes.
So q has 2^{10} modes, mostly irrelevant.
Divonne
Friedman & Wright (1979, 1981)
p = U[0, 1]^d and f has two continuous derivatives.
Local optimization has less curse of dimension than integration.
Newton steps cost O(d³) per iteration.
Recursive partitioning
Hyper-rectangles Rℓ ⊂ [0, 1]^d with ∪_{ℓ=1}^L Rℓ = [0, 1]^d and vol(Rj ∩ Rk) = 0 for j ≠ k.
∫_{[0,1]^d} f(x) dx = ∑_{ℓ=1}^L ∫_{Rℓ} f(x) dx
Start with L = 1 and R1 = [0, 1]^d.
Divonne splits
Find the min and max of f in each rectangle:
f̄(R) = max_{x∈R} f(x)
f̲(R) = min_{x∈R} f(x)
b(R) = vol(R) |f̄(R) − f̲(R)| (badness)
Split the rectangle with the largest ‘badness’.
Non-binary splits
To do the split, decide which of the max and min is farthest from the average of f in the rectangle.
Define a subrectangle to isolate that mode.
Partition the original rectangle into numerous other subrectangles to complement the one around the mode.
An issue: it is not designed for unbounded integrands, and they will give it trouble.
Miser
Press & Farrar (1990); see also Numerical Recipes.
Estimates ∫_{(0,1)^d} f(x) dx by recursive binary partitioning into rectangles.
Each rectangle can be split down the middle in d different directions.
The chosen direction is based on an estimate of which j will lead to the most accurate estimate.
They assume that Miser, not plain Monte Carlo, will be applied left and right, and they employ an assumption about the Miser convergence rate to decide.
Having chosen the split, they allocate sample sizes left and right.
For a mode at (1/2, 1/2, …, 1/2), split at 1/2 + U(−δ/2, δ/2) to avoid it.
Split Vegas
Dissertation of Zhou (1998).
Within a rectangular region, fit Vegas.
If one of the marginal distributions is strongly bimodal then split the rectangle in that direction.
It can be subtle to identify whether a given margin has two or more modes.
Mixture of products of beta
For ∫_{[0,1]^d} f(x) dx.
Beta density b(x; α, β) ∝ x^{α−1}(1 − x)^{β−1}, α, β > 0.
Can have a bump near α/(α + β), or a spike near 0 and / or near 1.
d-dimensional product of beta densities: ∏_{j=1}^d b(xj; αj, βj)
Mixture of products of beta densities: q(x) = ∑_{m=1}^M γm ∏_{j=1}^d b(xj; αjm, βjm)
Adaptive part
Choose γm and αjm, βjm by Levenberg-Marquardt nonlinear least squares to minimize
∑_{i=1}^n (|f(xi)| − q(xi; α, β, γ))²
O & Zhou (1999)
Many more details in the thesis of Zhou (1998).
Particle physics integrand
From Kinoshita (1970).
7 dimensional, positive and negative integrable singularities.

          MC      Vegas   MISER   ADBayes   MixBeta
Err/µ     0.154   0.092   0.073   0.210     0.035
Err/Êrr   1.163   34.0    0.750   19.5      0.394

K = 20 runs of n = 10^5 points each, except ADBayes.
Known that µ ≐ 0.371005292 as of 1991. (!!)
(Don't recall non-sampling costs.)
AMIS
Adaptive Multiple Importance Sampling
Cornuet, Marin, Mira, Robert (2012)
1) Sample n1 observations using θ1
2) Estimate θ2 from data and sample more
3) Keep estimating new θk and sampling more
4) Combine rounds by multiple importance sampling methods
The weights on observations from one round can change later due to data from other rounds.
This breaks the Martingale property, so it is hard to get unbiased estimates.
SNIS is used throughout because the motivation is from Bayesian problems where p is usually
not normalized.
Optimal mixture weights
MS_{α,β} ≡ ∫ (fp + ∑j βj(qj − p))² / ∑j αj qj dx
is jointly convex in α (in the simplex) and β.
So is a sample estimate of it.
Therefore we can estimate optimal mixture weights.
Also we can constrain αj ≥ ε > 0 in a convex optimization and the variance does not increase much.
He & Owen (2014) sequentially optimize both α and β.
M-PMC
Mixture-Population Monte Carlo
Cappé, Douc, Guillin, Marin & Robert (2008)
q(x) = ∑_{j=1}^J αj qj(x; θj)
They target the Kullback-Leibler distance to the target dist'n.
Also use SNIS.
Aim to optimize over both αj and θj (non-convex).
Use an EM algorithm for the θ updates.
Earlier, Douc, Guillin, Marin & Robert (2007) target a specific integrand.
APIS
Adaptive Population Importance Sampler
Martino, Elvira, Luengo, Corander (2015)
For an unnormalized p(·).
Choose n normalized distributions qi(· ; θi, Ci), with mean θi and covariance Ci.
Sample xi ∼ qi. Get SNIS weights wi ∝ p(xi)/∑_{i′} qi′(xi; θi′, Ci′).
Every m'th iteration, update the means θi but not the covariances Ci, using the previous m − 1 iterations' data.
Avoids “particle collapse”.
Empirical assessment.
Stochastic convex programming
Ryu & Boyd (2015)
They take up the exponential family setting of cross-entropy.
They find that in exponential families the variance is a convex function, so they don't have to use Kullback-Leibler.
µ̂ = (1/n) ∑_{i=1}^n f(xi) p(xi)/q(xi; θi)
Potentially a new θi for each xi (or small batches).
Stochastic gradient descent
θ_{n+1} = Proj(θn − αn gn)
Sequence αn → 0 with ∑n αn = ∞
Proj projects onto Θ
gn = gn(θn, xn) is the n'th gradient estimate
µ̂n attains the optimal variance × (1 + O(n^{−1/2})).
Kriging based estimators
Picheny (2009), Balesdent, Morio & Marzat (2013), Dalbey & Swiler (2014)
Rare event φ(x) > τ. We get φ(xi), not just f(x) = 1{φ(x) > τ}.
Kriging
Given φ(x1), …, φ(xn) it can predict φ(x) elsewhere by Gaussian process interpolation.
Swiler & West (2010) report that ∫ p(x) 1{φ̂(x) > τ} dx does not work well.
Better to use ∫ p(x) Φ((τ − φ̂(x))/s(x)) dx where kriging gives φ̂ and a posterior standard deviation s. Chevalier, Ginsbourger, Picheny, Richet, Bect, Vazquez (2014)
Dalbey & Swiler (2014) use φ̂ to update q.
Thanks
• Yi Zhou, Hera He, co-authors
• Minyong Lee, discussion
• Toichuro Kinoshita (particle physics code and integral)
• U.S. National Science Foundation DMS-1521145, DMS-1407397
• Michael Chertkov
• Kacy Hopwood