MAS472/6004: Computational Inference
Chapter III: Simulating random variables
1 / 70
3.1 Generating Random Variables
I Inference techniques used so far have been based on simulation.
I We now consider how to simulate X from fX(x).
I In semester 1 we used MCMC, but simpler methods are needed in order to do MCMC.
I Starting point: generate U from the U[0, 1] distribution.
I Then consider a transformation g(U) to obtain a random draw from fX(x).
How could we generate U[0, 1] r.v.s with coin tosses?
2 / 70
Sampling from U(0, 1)
We need to simulate independent random variables uniformly distributed on [0, 1].
Definition: A sequence of pseudo-random numbers {ui} is a deterministic sequence of numbers in [0, 1] having the same statistical properties as a similar sequence of random numbers. (Ripley, 1987)
The sequence {ui} is reproducible provided u1 is known.
A good sequence would be “unpredictable to the uninitiated”.
3 / 70
Congruential generators (D.H. Lehmer, 1949)
The general form of a congruential generator is
Ni = (aNi−1 + c) mod M,
Ui = Ni/M, where the integers a, c ∈ [0, M − 1].
If c = 0, it is called a multiplicative congruential generator (otherwise, mixed). These numbers are restricted to the M possible values
0, 1/M, 2/M, . . . , (M − 1)/M.
Clearly, they are rational numbers, but if M is large they will practically cover the reals in [0, 1].
N1: the seed. It can be re-set so you can reproduce the same set of uniform random numbers. In R, use set.seed(i), where i is an integer.
4 / 70
As soon as some Ni repeats, say Ni = Ni+T, then the whole subsequence repeats, i.e. Ni+t = Ni+T+t, t = 1, 2, . . .. The least such T is called the period. A good generator will have a long period.
The period cannot be longer than M, and also depends on a and c. Several useful theorems exist concerning the periods of congruential generators. For example, for c > 0, T = M if and only if
1. c and M have no common factors (except 1),
2. a ≡ 1 (mod p) for every prime number p that divides M,
3. a ≡ 1 (mod 4) if 4 divides M.
5 / 70
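The recurrence and the full-period conditions above can be sketched in a few lines; the Python below is illustrative (the course itself uses R), with the seeds and the small test modulus M = 16, a = 5, c = 1 being arbitrary choices that happen to satisfy conditions 1-3.

```python
# Mixed congruential generator: N_i = (a*N_{i-1} + c) mod M, U_i = N_i/M.

def lcg(seed, n, a=69069, c=1, M=2**32):
    """Return n pseudo-random uniforms from the given seed."""
    us, N = [], seed
    for _ in range(n):
        N = (a * N + c) % M
        us.append(N / M)
    return us

us = lcg(seed=12345, n=5)          # reproducible: same seed, same sequence

# M = 16, a = 5, c = 1 satisfies the full-period conditions, so all 16
# possible states occur before the sequence repeats.
small = lcg(seed=0, n=16, a=5, c=1, M=16)
```

Re-running `lcg(seed=12345, n=5)` returns the same five uniforms, which is exactly the reproducibility property that set.seed gives in R.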
Usually M is chosen to make the modulus operation efficient,and then a and c are chosen to make the period as long aspossible. Ripley suggests c = 0 or c = 1 is usually a good choice.
The NAG Fortran Library G05CAF
M = 2^59, a = 13^13, c = 0.
Another recommended one is
M = 2^32, a = 69069, c = 1,
so that Ni = (69069Ni−1 + 1) mod 2^32
and Ui = 2^−32 Ni.
6 / 70
Lattice structure
Notice that for a congruential generator
Ni − aNi−1 = c − bM,
where b ≥ 0 is an integer. Therefore,
Ui − aUi−1 = c/M − b.
The LHS lies in (−a, 1) since Ui ∈ [0, 1). Therefore, b can take at most a + 1 distinct values.
If we plot the points (Ui−1, Ui), all the points will lie on at most a + 1 parallel lines.
7 / 70
[Figure: scatter plot of successive pairs (U_i, U_{i+1}) for the congruential generator with M = 2^11, a = 51, c = 1; the points fall on a small number of parallel lines, illustrating the lattice structure.]
All linear congruential generators exhibit this kind of lattice structure, not just for pairs (Ui−1, Ui), but also for triples (Ui−2, Ui−1, Ui), and in higher dimensions.
A good generator is expected to have a fine lattice structure, that is, the points (Ui−k+1, . . . , Ui−1, Ui) ∈ [0, 1)^k must lie on many hyperplanes in R^k for all small k (k ≪ M).
8 / 70
RANDU - lattice structure
M = 2^31, a = 2^16 + 3 = 65539, and c = 0.
Once very popular, RANDU was eventually found to be a rather poor generator.
9 / 70
RANDU - lattice structure II
Let Ui = Ni/M. Then for this generator
Ui+2 − 6Ui+1 + 9Ui = k, an integer.
Since 0 ≤ Ui < 1,
−6 < Ui+2 − 6Ui+1 + 9Ui < 10.
Therefore k = −5, −4, . . . , −1, 0, +1, . . . , 9. Hence k can take on only 15 integer values, and consequently (Ui−2, Ui−1, Ui) must lie on at most 15 parallel planes.
This is an example of coarse lattice structure: unsatisfactory coverage of [0, 1)^3.
10 / 70
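The 15-plane identity can be checked numerically; a short Python sketch (seed and sample size are arbitrary choices):

```python
# RANDU: N_i = 65539 * N_{i-1} mod 2^31.  Since 65539^2 = 6*65539 - 9
# (mod 2^31), every triple of outputs satisfies
# U_{i+2} - 6*U_{i+1} + 9*U_i = k, an integer with -5 <= k <= 9.

M, a = 2**31, 65539

def randu(seed, n):
    us, N = [], seed
    for _ in range(n):
        N = (a * N) % M
        us.append(N / M)
    return us

us = randu(seed=1, n=1000)
ks = [round(us[i + 2] - 6 * us[i + 1] + 9 * us[i]) for i in range(len(us) - 2)]
```

Every entry of `ks` is (up to floating-point exactness) an integer in {−5, . . . , 9}, so at most 15 distinct values appear, matching the 15-plane argument above.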
Generation from non-U(0, 1)
We have a sequence U1, U2, U3, . . . of independent uniform random numbers in [0, 1].
We want X1, X2, . . . distributed independently and identically from some specified distribution. The answer is to transform the U1, U2, . . . sequence into the X1, X2, . . . sequence. The idea is to find a function g(U1, U2, U3, . . .) that has the required distribution. There are always many ways of doing this. A good algorithm should be quick, because millions of random numbers may be required.
11 / 70
3.2 The inversion method
Let X be any continuous random variable and define Y = FX(X), where FX is the distribution function of X: FX(x) = P(X ≤ x).
Claim: Y ∼ U[0, 1].
Proof: Y ∈ [0, 1], and the distribution function of Y is
FY(y) = P(Y ≤ y) = P(FX(X) ≤ y) = P(X ≤ FX^−1(y)) = FX(FX^−1(y)) = y,
which is the distribution function of a uniform random variable on [0, 1].
So whatever the distribution of X, Y = FX(X) is uniformly distributed on [0, 1]. The inversion method turns this backwards. Let U = FX(X); then X = FX^−1(U).
I So to generate X ∼ FX, take a single uniform variable U and set X = FX^−1(U).
12 / 70
Example: exponential distribution
Let X ∼ Exp(1/λ) (mean λ), i.e.
f(x) = λ^−1 e^−x/λ (x ≥ 0),
F(x) = ∫_0^x λ^−1 e^−z/λ dz = [−e^−z/λ]_0^x = 1 − e^−x/λ.
Set U = 1 − e^−X/λ and solve for X:
X = −λ ln(1 − U).
Note that 1 − U is also uniformly distributed on [0, 1], so we might as well use
X = −λ ln U.
Question: What are the limitations of the inversion method?
13 / 70
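A quick Python sketch of the inversion step (mean λ = 2 and the seed are arbitrary choices; 1 − U is used inside the log purely to avoid log 0):

```python
import math
import random

# Inversion for the exponential with mean lam: X = -lam * log(U), U ~ U(0,1).

def rexp(n, lam, rng):
    return [-lam * math.log(1.0 - rng.random()) for _ in range(n)]

rng = random.Random(42)
xs = rexp(100_000, lam=2.0, rng=rng)
```

The sample mean settles near λ, as expected from E(X) = λ.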
Discrete distributions
The inversion method works for discrete random variables in the following sense. Let X be discretely distributed with possible values xi having probabilities pi. So
P(X = xi) = pi, Σ_{i=1}^k pi = 1.
Then FX(x) = Σ_{xi ≤ x} pi is a step function.
Inversion gives X = xi if
Σ_{xj < xi} pj < U ≤ Σ_{xj ≤ xi} pj,
which clearly gives the right probability values.
I Think of this as splitting [0, 1] into intervals of length pi. The interval in which U falls is the value of X.
Question: What problems might we face using this method? E.g. consider a Poisson(100) distribution.
14 / 70
Discrete distributions - example
Let X ∼ Bin(4, 0.3). The probabilities are
P(X = 0) = .2401, P(X = 1) = .4116, P(X = 2) = .2646,
P(X = 3) = .0756, P(X = 4) = .0081.
The algorithm says X = 0 if 0 ≤ U ≤ .2401,
X = 1 if .2401 < U ≤ .6517,
X = 2 if .6517 < U ≤ .9163,
X = 3 if .9163 < U ≤ .9919,
X = 4 if .9919 < U ≤ 1.
Carrying out the binomial algorithm means the following. Let U ∼ U(0, 1).
1. Test U ≤ .2401. If true, return X = 0.
2. If false, test U ≤ .6517. If true, return X = 1.
3. If false, test U ≤ .9163. If true, return X = 2.
4. If false, test U ≤ .9919. If true, return X = 3.
5. If false, return X = 4.
15 / 70
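The step-through above can be sketched directly in Python (probabilities taken from the slide; seed and sample size arbitrary):

```python
import random

# Discrete inversion for X ~ Bin(4, 0.3): walk through the cumulative
# probabilities .2401, .6517, .9163, .9919, 1 until U falls in an interval.

probs = [0.2401, 0.4116, 0.2646, 0.0756, 0.0081]

def sample_binom(u):
    cum = 0.0
    for x, p in enumerate(probs):
        cum += p
        if u <= cum:
            return x
    return len(probs) - 1          # guard against rounding when u is near 1

rng = random.Random(1)
draws = [sample_binom(rng.random()) for _ in range(50_000)]
```

The empirical frequencies match the pi, and the sample mean is close to E(X) = np = 1.2.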
Discrete distributions - example
Consider the speed of this. The expected number of steps (which roughly equates to speed) is
1 × .2401 + 2 × .4116 + 3 × .2646 + 4 × .0756 + 4 × .0081
= 1 + E(X) − 0.0081 = 2.1919.
To speed things up we can rearrange the order so that the later steps are less likely.
1. Test U ≤ .4116. If true return X = 1.
2. If false, test U ≤ .6762. If true return X = 2.
3. If false, test U ≤ .9163. If true return X = 0.
4. and 5. as before.
Expected number of steps:
1 × .4116 + 2 × .2646 + 3 × .2401 + 4 × (0.0756 + 0.0081) = 1.9959.
Approximately a 10% speed increase.
16 / 70
3.3 Other Transformations
(a) If U ∼ U(0, 1), set V = (b − a)U + a; then V ∼ U(a, b), where a < b.
(b) If the Yi are iid exponential with parameter λ, then
X = Σ_{i=1}^n Yi = −(1/λ) Σ_{i=1}^n log Ui = −(1/λ) log(Π_{i=1}^n Ui)
has a Ga(n, λ) distribution.
(c) If X1 ∼ Ga(p, 1) and X2 ∼ Ga(q, 1), with X1 and X2 independent, then Y = X1/(X1 + X2) ∼ Be(p, q).
(d) Composition: if
f = Σ_{i=1}^r pi fi,
where Σ pi = 1 and each fi is a density, then we can sample from f by first sampling I from the discrete distribution p = {p1, . . . , pr} and then taking a sample from fI.
17 / 70
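Transformation (b) can be sketched as follows (shape 3 and rate 2 are arbitrary illustrative values; 1 − U inside the log only avoids log 0):

```python
import math
import random

# Transformation (b): minus the sum of n logs of uniforms, divided by lam,
# gives a Ga(n, lam) variable, with mean n/lam and variance n/lam^2.

def rgamma_int(n_shape, lam, rng):
    return -sum(math.log(1.0 - rng.random()) for _ in range(n_shape)) / lam

rng = random.Random(7)
xs = [rgamma_int(3, 2.0, rng) for _ in range(50_000)]
```

The sample mean and variance settle near n/λ = 1.5 and n/λ² = 0.75, consistent with a Ga(3, 2) distribution.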
The Box-Muller algorithm for the normal distribution
We cannot generate a normal random variable by inversion, because FX is not known in closed form (nor is its inverse). The Box-Muller method (1958): let U1, U2 ∼ U[0, 1]. Calculate
X1 = √(−2 ln U1) cos(2πU2),
X2 = √(−2 ln U1) sin(2πU2).
Then X1 and X2 are independent N(0, 1) variables. The method is not particularly fast, but is easy to program and quite memorable.
18 / 70
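A direct Python sketch of the two formulas above (seed arbitrary; u1 is taken in (0, 1] to avoid log 0):

```python
import math
import random

# Box-Muller: two independent uniforms give two independent N(0,1) draws.

def box_muller(rng):
    u1, u2 = 1.0 - rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

rng = random.Random(0)
zs = [z for _ in range(25_000) for z in box_muller(rng)]
```

The pooled draws have sample mean near 0 and sample variance near 1, as required of N(0, 1) variables.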
3.4 Rejection Algorithm
Fundamental Theorem of Simulation: Simulating
X ∼ f(x)
is equivalent to simulating
(X, U) ∼ U{(x, u) : 0 < u < f(x)}.
Note that f(x, u) = I_{0<u<f(x)}, so that
∫ f(x, u) du = ∫_0^{f(x)} du = f(x),
as required.
Hence f is the marginal density of the joint distribution (X, U) ∼ U{(x, u) : 0 < u < f(x)}.
19 / 70
Rejection Algorithm Explained
The problem with this result is that simulating uniformly from the set
{(x, u) : 0 < u < f(x)}
may not be possible. A solution is to simulate the pair (X, U) in a bigger set, where simulation is easier, and then keep the pair only if the constraint is satisfied.
[Figure: a density f(x) on [0, 1] (y-axis: Density, 0 to 2.5) enclosed in a bounding rectangle; pairs simulated uniformly in the rectangle are accepted if they fall below the curve.]
20 / 70
Rejection: Uniform bounding box
Suppose that f(x) is zero outside the interval [a, b] (so that ∫_a^b f(x)dx = 1) and that f is bounded above by m.
I Simulate the pair (Y, U) ∼ U[a, b] × [0, m] (Y ∼ U[a, b], U ∼ U[0, m] independently).
I Accept the pair if the constraint 0 < U < f(Y) is satisfied.
This results in the correct distribution for the accepted Y value, call it X:
P(X ≤ x) = P(Y ≤ x | U < f(Y)) = (∫_a^x ∫_0^{f(y)} du dy) / (∫_a^b ∫_0^{f(y)} du dy) = ∫_a^x f(y) dy.
Note: we can use the rejection algorithm even if we only know f up to a normalising constant (as is often the case in Bayesian statistics - see Chapter 4).
21 / 70
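The bounding-box version can be sketched as follows; the Beta(2, 2) density f(x) = 6x(1 − x) on [0, 1], with bound m = 1.5, is an illustrative choice not taken from the slides.

```python
import random

# Bounding-box rejection for a density f on [a, b] with f <= m.

def reject_box(f, a, b, m, rng):
    while True:
        y = a + (b - a) * rng.random()    # Y ~ U[a, b]
        u = m * rng.random()              # U ~ U[0, m], independent of Y
        if u < f(y):                      # keep the pair if it lies under f
            return y

beta22 = lambda x: 6.0 * x * (1.0 - x)    # Beta(2,2) density, max 1.5 at x=0.5
rng = random.Random(3)
xs = [reject_box(beta22, 0.0, 1.0, 1.5, rng) for _ in range(40_000)]
```

The accepted values are symmetric about 0.5 with mean 0.5, as a Beta(2, 2) sample should be.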
Example: Sampling from a beta distribution
Consider sampling from X ∼ Beta(α, β) for α, β > 1, which has pdf
f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}, 0 < x < 1.
We note
f(x) ∝ f1(x) = x^{α−1}(1 − x)^{β−1}, 0 < x < 1,
and that M = sup_{0<x<1} x^{α−1}(1 − x)^{β−1} occurs at x = (α − 1)/(α + β − 2) (the mode), and hence
M = (α − 1)^{α−1}(β − 1)^{β−1} / (α + β − 2)^{α+β−2}.
The rejection algorithm is
1. Generate Y ∼ U(0, 1) and U ∼ U(0, M).
2. If U ≤ f1(Y) = Y^{α−1}(1 − Y)^{β−1}, then let X = Y (accept); else go to 1 (reject).
22 / 70
Generalising the Rejection Idea
If the support of f is not finite, then bounding it within a rectangle will not work. Instead of using a box to bound the density f(x) (i.e. requiring f(x) < m for some constant m), we can use a function m(x) such that f(x) ≤ m(x) for all x.
Suppose the larger bounding set is
L = {(y, u) : 0 < u < m(y)};
then all we require is that simulating a uniform from L is feasible. Note:
I The closer m is to f, the more efficient our algorithm.
I Because m(x) ≥ f(x), m cannot be a probability density. We write
m(x) = Mg(x), where ∫ m(x)dx = ∫ Mg(x)dx = M,
for some density g.
23 / 70
Generalising the Rejection Idea II
This suggests a more general implementation of the fundamental theorem:
Corollary: Let X ∼ f(x) and let g(x) be a density function that satisfies f(x) ≤ Mg(x) for some constant M ≥ 1. Then, to simulate X ∼ f, it is sufficient to generate
Y ∼ g and U | Y = y ∼ U(0, Mg(y)),
and set X = Y if U ≤ f(Y).
Proof:
P(X ∈ A) = P(Y ∈ A | U ≤ f(Y)) = (∫_A ∫_0^{f(y)} [du/(Mg(y))] g(y)dy) / (∫ ∫_0^{f(y)} [du/(Mg(y))] g(y)dy) = ∫_A f(y)dy.
24 / 70
The Rejection Algorithm
The rejection algorithm is usually stated in a slightly modified form:
Rejection Algorithm
If g is such that f/g is bounded, so there exists M such that Mg(x) ≥ f(x) for all x, then
1. Generate Y from density g, and U from U(0, 1).
2. If U ≤ f(Y)/(Mg(Y)), set X = Y. Otherwise, return to step 1.
This produces simulations from f.
We keep sampling new Y and U until the condition is satisfied.
Exercise: Convince yourself that these two descriptions of therejection algorithm are the same.
25 / 70
Example: Sampling from a beta distribution revisited
Use rejection to sample from X ∼ Beta(α, β). Let g(y) = αy^{α−1}, 0 < y < 1; then
f1(x)/g(x) = (1 − x)^{β−1}/α is bounded if and only if β ≥ 1.
Then M = sup_x {f1(x)/g(x)} = 1/α, attained at x = 0.
1. Simulate Y with pdf g(y) = αy^{α−1}, 0 < y < 1, and U ∼ U(0, 1).
2. If U ≤ f1(Y)/(Mg(Y)) = (1 − Y)^{β−1}/((1/α)α) = (1 − Y)^{β−1}, then set X = Y; else go to 1.
26 / 70
How to simulate Y with pdf g(y) = αy^{α−1}?
I We note that the cdf of Y is G(y) = y^α, 0 < y < 1.
I Therefore we can use inversion. Let Z ∼ U(0, 1), then solve
Z = G(Y) = Y^α, and so Y = Z^{1/α}.
The full algorithm is:
1. Generate U ∼ U(0, 1) and Z ∼ U(0, 1). Let Y = Z^{1/α}.
2. If U ≤ (1 − Y)^{β−1}, then set X = Y; else go to 1.
[Figure: plot on [0, 1] (y-axis: Density) illustrating the beta rejection algorithm.]
27 / 70
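The full algorithm above translates almost line for line into Python (α = 2, β = 3 and the seed are arbitrary illustrative choices):

```python
import random

# Beta(alpha, beta) rejection, beta >= 1: Y = Z^(1/alpha) has density
# alpha*y^(alpha-1) by inversion, and Y is accepted when U <= (1-Y)^(beta-1).

def rbeta(alpha, beta, rng):
    while True:
        y = rng.random() ** (1.0 / alpha)          # step 1: Y = Z^(1/alpha)
        if rng.random() <= (1.0 - y) ** (beta - 1.0):  # step 2: accept test
            return y

rng = random.Random(11)
xs = [rbeta(2.0, 3.0, rng) for _ in range(40_000)]
```

For Beta(2, 3) the mean is α/(α + β) = 0.4, and the sample mean settles close to that value.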
Efficiency of the rejection method
Each time we generate a (Y, U) pair,
P(Reject) = P(U ≥ f(Y)/(Mg(Y))) = 1 − 1/M, P(Accept) = 1/M.
The number of tries until we accept a Y is a geometric random variable with expectation M.
Note that M here must be calculated with the normalised density f, i.e., M = sup f(x)/g(x).
If we used an unnormalised density f1(x), where ∫ f1(x)dx = c, so that f(x) = (1/c)f1(x), and we used
M = sup f1(x)/g(x),
then the acceptance rate is
P(Accept) = c/M.
28 / 70
For maximum efficiency, we want M as small as possible, i.e.sup f(x)/g(x) as small as possible. This means finding a g that
(a) we can sample from efficiently, and
(b) mimics f as closely as possible.
There are many good generators based on rejection from awell-chosen g.
29 / 70
Rejection Example III
Let θ have the von Mises distribution with pdf
f(θ) = exp(k cos θ)/(2πI(k)), 0 < θ < 2π (k ≥ 0),
where I(k) is the normalising constant.
Let f1(θ) = (1/2π) exp(k cos θ), 0 < θ < 2π.
Take Y ∼ U(0, 2π), so that g(y) = 1/(2π), 0 < y < 2π.
Then
M = sup_θ {f1(θ)/g(θ)} = sup_θ {exp(k cos θ)} = exp(k).
Let U ∼ U(0, 1). If
U ≤ f1(Y)/(Mg(Y)) = [exp(k cos Y)/(2π)] / [(1/(2π)) exp(k)] = exp(k(cos Y − 1)),
we accept θ = Y; otherwise reject.
30 / 70
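The acceptance rule exp(k(cos Y − 1)) gives a very short sampler; a Python sketch (k = 1.5, the seed, and the sample size are arbitrary choices):

```python
import math
import random

# Von Mises rejection with a uniform proposal on (0, 2*pi):
# accept Y with probability exp(k*(cos(Y) - 1)).

def rvonmises(k, rng):
    while True:
        y = 2.0 * math.pi * rng.random()
        if rng.random() <= math.exp(k * (math.cos(y) - 1.0)):
            return y

rng = random.Random(5)
thetas = [rvonmises(1.5, rng) for _ in range(30_000)]
```

The density is symmetric in cos θ about θ = π, so the sample mean of sin θ is near 0, while the mean of cos θ is positive (concentration around θ = 0 grows with k).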
Truncated distributions
Suppose we wish to sample X from the following distribution:
fX(x) ∝ gX(x) for x ∈ A, and 0 otherwise,
where gX(x) is a known density that we can sample from, e.g. gX(x) is the N(0, 1) density and A = [0, ∞). That is,
fX(x) = k gX(x) for x ∈ A, and 0 otherwise,
where k is a normalising constant, given by
k^−1 = ∫_A gX(x)dx.
31 / 70
fX(x) ∝ gX(x) for x ∈ A, and 0 otherwise.
Consider using the rejection method to sample X from fX(x). We sample Y from the full (non-truncated) density gX(x). Then
fX(x)/gX(x) = k if x ∈ A, and 0 otherwise.
32 / 70
So M = sup_x fX(x)/gX(x) = k.
Rejection algorithm: sample u from U[0, 1] and y from gX(y), and accept X = y if u ≤ fX(y)/(M gX(y)).
But since
fX(x)/(M gX(x)) = fX(x)/(k gX(x)) = 1 if x ∈ A, and 0 otherwise,
we will always have u ≤ fX(y)/(M gX(y)) if y ∈ A, and u > fX(y)/(M gX(y)) if y ∉ A. So we don't need to sample u. We can just do:
1. generate y from gX(y)
2. if y ∈ A, accept X = y
3. otherwise, return to step 1.
As usual, the acceptance probability will be high if M is small, i.e. if ∫_A gX(y) dy is near 1. So if the region removed by the truncation is large, rejection sampling will be inefficient.
33 / 70
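The shortcut above, for the slide's own example gX = N(0, 1) and A = [0, ∞), can be sketched as follows (Box-Muller is used for the normal draws; seed and sample size arbitrary):

```python
import math
import random

# Truncation shortcut: sample Y from the full N(0,1) density and keep it
# only if it lands in A = [0, infinity).

def rnorm(rng):
    u1 = 1.0 - rng.random()               # in (0, 1], avoids log(0)
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * rng.random())

def rtruncnorm(rng):
    while True:
        y = rnorm(rng)
        if y >= 0.0:                      # y in A: accept without drawing u
            return y

rng = random.Random(9)
xs = [rtruncnorm(rng) for _ in range(40_000)]
```

Here ∫_A gX(y)dy = 1/2, so about half the proposals are accepted, and the sample mean settles near the half-normal mean √(2/π) ≈ 0.798.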
3.5 Multivariate generators
Now suppose we want to generate a random vector X = (X1, . . . , Xp) from density f(x). We can note the following simple points.
1. If the elements of X are to be independent, i.e.
f(x) = f1(x1)f2(x2) . . . fp(xp),
then we can separately generate X1 from f1, X2 from f2, . . . , Xp from fp using different uniforms.
2. Inversion no longer works, as the theorem can't be generalised.
3. Rejection does work: if we can generate from g(x) (and g may be a product of independent components) and find
M ≥ sup_x f(x)/g(x),
then we accept Y ∼ g with probability f(Y)/(Mg(Y)) and otherwise reject.
34 / 70
Sequential methods
We can obviously write
f(x) = f1(x1)f2(x2|x1)f3(x3|x1, x2) . . . .
So we can first generate X1 from f1. Then for that given valueof X1, generate X2 from f2, and so on.
35 / 70
Example
Suppose we wish to sample (x1, x2) from the density function
f(x1, x2) ∝ x2^{−1/2} x2^{−(α+1)} exp{−(2β + λ(x1 − µ)²)/(2x2)}.
Firstly, consider the conditional distribution of x1 given x2:
f(x1|x2) ∝ exp{−λ(x1 − µ)²/(2x2)},
as we can ignore factors not depending on x1.
Thus we can recognise that
x1 | x2 ∼ N(µ, x2/λ).
36 / 70
Next consider the marginal of x2:
f(x2) ∝ ∫ f(x1, x2)dx1 ∝ x2^{−1/2} x2^{−(α+1)} e^{−β/x2} (x2/λ)^{1/2} ∝ x2^{−(α+1)} e^{−β/x2},
where the (x2/λ)^{1/2} term on the right is the missing normalising constant from the N(µ, x2/λ) distribution.
We can recognise this as an inverse gamma distribution, x2 ∼ Γ^−1(α, β).
So to simulate random variables from f, we can first simulate x2 from an inverse-gamma distribution (e.g. by rejection sampling) and then simulate x1 ∼ N(µ, x2/λ) using, e.g., Box-Muller.
37 / 70
Multivariate normal distributions
How can we generate X from a N(m, V ) distribution, for somenon-diagonal matrix V ?
We know how to generate iid N(0, 1) rvs from the Box-Mulleralgorithm, so perhaps we can take a sequence of independentstandard normal random variables Z1, Z2, . . . and transformthese in some way?
One technique involves the use of the Cholesky square root of the matrix V. For any (symmetric, square) positive definite matrix V, we can find a square root U (called the Cholesky decomposition), such that U^T U = V.
To find the Cholesky square root of a matrix V in R, typechol(V).
38 / 70
Multivariate normal distributions II
Set Z = (Z1, . . . , Zn)^T, where Zi ∼ N(0, 1) and n = dim X.
Consider
Y = m + U^T Z.
Then Y must have a multivariate normal distribution (why?), and
E(m + U^T Z) = m,
Var(m + U^T Z) = U^T In U = V
(with In the n × n identity matrix = Var Z).
Hence to generate X, we generate independent standard normal random variables Z, and then transform them by m + U^T Z to obtain X.
39 / 70
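A two-dimensional sketch in pure Python (in R one would simply call chol(V)); the lower-triangular factor L below corresponds to the slides' U^T, and the m and V values are made up for illustration:

```python
import math
import random

# Bivariate normal via Cholesky: V = L L^T with L lower-triangular,
# so m + L z ~ N(m, V) when z has iid N(0,1) entries.

m = [1.0, -2.0]
V = [[2.0, 0.6],
     [0.6, 1.0]]

# 2x2 Cholesky factor, computed by hand
l11 = math.sqrt(V[0][0])
l21 = V[0][1] / l11
l22 = math.sqrt(V[1][1] - l21 * l21)

def rnorm(rng):                           # one N(0,1) draw via Box-Muller
    u1 = 1.0 - rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * rng.random())

def rmvnorm(rng):
    z1, z2 = rnorm(rng), rnorm(rng)
    return (m[0] + l11 * z1, m[1] + l21 * z1 + l22 * z2)

rng = random.Random(8)
draws = [rmvnorm(rng) for _ in range(50_000)]
```

The sample means, variances, and cross-covariance settle near m and the entries of V, which is the point of the transformation.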
3.6 Importance sampling
In order to estimate an integral of the form ∫ h(x)f(x)dx, we find that it is sometimes better to generate values not from the distribution f(x), but instead from some other distribution g(x), and then to account for this by using a weighting. This is the idea behind importance sampling.
To introduce the idea we consider a simple example.
40 / 70
Example of Monte Carlo/Importance Sampling
Let X be Cauchy: f(x) = 1/(π(1 + x²)), −∞ < x < ∞.
Let θ = P(X > 2) = I = ∫_2^∞ 1/(π(1 + x²)) dx (= 0.1476).
Use Monte Carlo methods to estimate θ.
(i) Generate n Cauchy variates X1, . . . , Xn. Let Y1 be the number that are greater than 2, Y1 = Σ I_{Xi>2}. Then Y1 ∼ B(n, θ), so that
E(Y1) = nθ, V(Y1) = nθ(1 − θ).
Take θ1 = Y1/n. Then
E(θ1) = E(Y1)/n = nθ/n = θ
and
V(θ1) = V(Y1)/n² = nθ(1 − θ)/n² = θ(1 − θ)/n = 0.126/n.
41 / 70
Example of Monte Carlo/Importance Sampling - II
(ii) Note that θ = (1/2)P(|X| > 2); we want to use this to reduce the variance of our estimator of θ. Generate n Cauchy variates. Let Y2 be the number that are greater than 2 in modulus; then Y2 ∼ B(n, 2θ)
and θ2 = (1/2)(Y2/n), so
E(θ2) = (1/2)E(Y2)/n = (1/2)(n2θ)/n = θ
and
V(θ2) = V(Y2)/(2²n²) = n2θ(1 − 2θ)/(2²n²) = θ(1 − 2θ)/(2n) = 0.052/n.
42 / 70
Example of Monte Carlo/Importance Sampling - III
(iii) The relative inefficiency of these methods is due to the generation of values outside the domain of interest [2, ∞). Alternatively, note we can write
θ = 1/2 − ∫_0^2 1/(π(1 + x²)) dx.
This integral can be considered the expectation of h(X) = 2/(π(1 + X²)) where X ∼ U[0, 2], as the density of U[0, 2] is g(x) = 1/2. An alternative method of evaluating θ is therefore
θ3 = 1/2 − (1/n) Σ_{i=1}^n h(Ui),
where Ui ∼ U[0, 2].
43 / 70
Example of Monte Carlo/Importance Sampling - IV
We can see that
E(θ3) = 1/2 − (1/n) Σ_{i=1}^n ∫_0^2 h(x)·(1/2) dx = 1/2 − P(0 < X < 2),
where X ∼ Cauchy, so that it too is an unbiased estimator.
The variance of θ3 is Var(h(U))/n, and we can see that
E h(U) = ∫_0^2 h(x)·(1/2) dx = 0.5 − 0.1475 = 0.3525,
E h(U)² = ∫_0^2 h(x)²·(1/2) dx = ∫_0^2 2/(π²(1 + x²)²) dx = (1/π²)[x/(x² + 1) + tan^−1(x)]_0^2 = 0.1527.
Hence Var(h(U)) = 0.1527 − 0.3525² = 0.02851, and thus
Var(θ3) = 0.02851/n.
44 / 70
Example of Monte Carlo/Importance Sampling - V
(iv) Finally, note that another possibility is to substitute y = 1/x:
θ = ∫_2^∞ 1/(π(1 + x²)) dx = ∫_0^{1/2} y^−2/(π(1 + y^−2)) dy.
This can be seen as the expectation of h(X) = X^−2/(2π(1 + X^−2)) where X ∼ U[0, 1/2] (which has density 2). We can estimate this as
θ4 = (1/n) Σ_{i=1}^n h(Ui),
where U1, . . . , Un ∼ U[0, 1/2]. Again, we have E θ4 = θ, and now
E h(U)² = ∫_0^{1/2} h(x)²·2 dx = (1/(4π²))[x/(x² + 1) + tan^−1(x)]_0^{1/2} = 0.02188.
Hence Var(θ4) = (0.02188 − 0.1476²)/n = 0.0000955/n.
45 / 70
Summary of Example
We found 4 unbiased estimators of θ, each with a different variance:
Var(θ1) = 0.126/n, Var(θ2) = 0.052/n,
Var(θ3) = 0.02851/n, Var(θ4) = 0.0000955/n.
The best estimator is the one with the smallest variance, namely θ4. Compared with θ1, for the same n the standard error of θ4 is √(0.126/0.0000955) ≈ 36 times smaller; equivalently, θ1 requires roughly 0.126/0.0000955 ≈ 1300 times as many simulations to achieve the same precision.
By carefully considering our simulation method we can hope to get more accurate estimates.
Estimators θ3 and θ4 are both types of importance sampling.
46 / 70
Importance Sampling
Consider calculating the integral
I = E_f h(X) = ∫ h(x)f(x) dx.
Importance sampling
Let X1, X2, . . . , Xn be independently and identically distributed random variables with common density g(x). Define w(x) = f(x)/g(x), so that
E_g{h(Xi)w(Xi)} = ∫ h(x)w(x)g(x) dx = ∫ h(x)f(x) dx = I.
Therefore
Î = (1/n) Σ_{i=1}^n w(Xi)h(Xi)   (1)
is an unbiased estimator of I.
47 / 70
Some comments:
I g(x) is called the importance function, and w(Xi) arecalled the importance weights.
I The sum (1) will converge for the same reasons the MonteCarlo sum does.
I Notice that this sum is valid for any choice of the distribution g, as long as supp(f) ⊆ supp(g).
I This is a very general representation that expresses the fact that a given integral is not intrinsically associated with a given distribution.
I Because very little restriction is put on the choice of g, we can choose a distribution which is easy to sample from, and one which gives nice properties for the sum.
48 / 70
Cauchy example revisited
We can now understand the estimator θ4 in the Cauchy example. Recall that we want to estimate
E I_{X>2} = ∫ h(x)f(x)dx,
where h(x) = I_{x>2} and f(x) = 1/(π(1 + x²)).
Noticing that for large x, f(x) is similar to the density
g(x) = 2/x² for x > 2
suggests g might be a good importance density. We can sample from g by letting Xi = 1/Ui, where Ui ∼ U[0, 1/2] (inversion method). Thus our estimator is
θ̂ = (1/n) Σ_{i=1}^n h(xi)f(xi)/g(xi) = (1/n) Σ_{i=1}^n xi²/(2π(1 + xi²)) = (1/n) Σ_{i=1}^n ui^−2/(2π(1 + ui^−2)) = θ4.
49 / 70
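The weighted estimator can be written out directly; a Python sketch (seed and sample size arbitrary; 1 − U inside the inversion step only avoids division by zero):

```python
import math
import random

# Importance sampling for theta = P(X > 2), X ~ Cauchy, with importance
# density g(x) = 2/x^2 on (2, inf), sampled by inversion as X = 1/U.

def theta_hat(n, rng):
    total = 0.0
    for _ in range(n):
        u = 0.5 * (1.0 - rng.random())        # U ~ U(0, 1/2]
        x = 1.0 / u                            # X has density g, x > 2
        f = 1.0 / (math.pi * (1.0 + x * x))    # target (Cauchy) density
        g = 2.0 / (x * x)                      # importance density
        total += f / g                         # h(x) = 1 since x > 2 always
    return total / n

rng = random.Random(2)
est = theta_hat(20_000, rng)
```

Because every draw lands in (2, ∞), no samples are wasted, and with the per-sample variance 0.0000955 derived above, even 20,000 draws pin the answer down to several decimal places.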
The variance of the estimator
Since the Xi are iid, Var(Î) = σ²/n, where
σ² = Var_g{h(X)w(X)} = E{h(X)²w(X)²} − E{h(X)w(X)}²
= ∫ h(x)²w(x)²g(x) dx − I²
= ∫ h(x)²f(x)²/g(x) dx − I², since g(x) = f(x)/w(x).
We do not of course know σ² in practice, but we can see that Î will be a better estimator if we can make w(X) less variable. Our objective, therefore, is to find a distribution g(x) that we know how to obtain independent samples from, and which mimics h(x)f(x) as closely as possible.
50 / 70
Optimal choice of g
Theorem: The choice g = g* = |h(x)|f(x) / ∫|h(z)|f(z)dz minimises the variance of the estimator (1).
Proof: We've seen that it is sufficient to minimise
∫ h²(x)f²(x)/g(x) dx = E_g(h²(X)f²(X)/g²(X)),
and using Jensen's inequality we can see that
E_g(h²(X)f²(X)/g²(X)) ≥ (E_g[|h(X)|f(X)/g(X)])² = (∫|h(x)|f(x)dx)²,
and that this lower bound is achieved by choosing g = g*.
NB: We won't be able to calculate g*! But the theorem suggests that choosing g to look like hf will be a good choice.
51 / 70
Unnormalised densities
Suppose we only know f up to a normalising constant, i.e., we know
f(x) = f1(x)/c, where c = ∫ f1(x)dx.
We can still use importance sampling.
Importance sampling with unnormalised densities
Let X1, X2, . . . , Xn be independently and identically distributed random variables with common density g(x). Define w(x) = f1(x)/g(x). Estimate I by
Î = Σ_{i=1}^n w(Xi)h(Xi) / Σ_{i=1}^n w(Xi).
Alternatively, we can write this as
Î = Σ_{i=1}^n wi h(Xi), where wi = w(Xi)/Σ w(Xi).
52 / 70
(1/n) Σ w(Xi) is an unbiased estimator of c, as
E_g w(X) = ∫ (f1(x)/g(x)) g(x)dx = ∫ f1(x)dx = c.
When we use unnormalised densities, Î is a biased estimator of I; however, it is possible to prove that we still have Î → I almost surely as n → ∞.
This will be important when we use importance sampling toestimate Bayesian quantities.
53 / 70
Effective sample size
How variable the weights are tells us how efficient our choice of g is.
In the best case, where g = f, we have w(X) = 1, so that wi = 1/n, which is the case in plain Monte Carlo. In this case Var(w(X)) = 0.
If f and g are very different, then the weights will be very variable, and we can find that one or two particles (Xi) dominate the sum.
We often calculate the effective sample size
ESS = 1/Σ wi².
I In the best case, wi = 1/n and ESS = n - so we have an effective sample size equal to the true sample size.
I The worst case is when one of the wi = 1 and all the others are equal to zero. Then ESS = 1, i.e., we effectively have only a single sample.
We want to choose g so that the ESS is large.
54 / 70
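The two extreme cases above can be checked in a few lines of Python:

```python
# Effective sample size from raw (unnormalised) importance weights:
# normalise, then ESS = 1 / sum(w_i^2).

def ess(raw_weights):
    s = sum(raw_weights)
    w = [wi / s for wi in raw_weights]
    return 1.0 / sum(wi * wi for wi in w)

equal = ess([1.0] * 100)              # best case: equal weights, ESS = n
degenerate = ess([1.0] + [0.0] * 99)  # worst case: one particle, ESS = 1
```

Intermediate cases fall in between; for example, weights (3, 1, 1, 1) give ESS = 3, reflecting the fact that one particle carries half the total weight.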
3.7 Variance reduction techniques
Antithetic variables
The method of antithetic variables uses two correlated estimators and combines them to get an estimator with a lower variance (i.e. a better estimator). Suppose we have two different estimators θ1 and θ2 of θ,
I with the same mean and variance,
I but which are negatively correlated.
Define θ3 = (1/2)(θ1 + θ2). Then
Var(θ3) = (1/4)(Var(θ1) + Var(θ2) + 2Cov(θ1, θ2))
= (1/2)(Var(θ1) + Cov(θ1, θ2))
< (1/2)Var(θ1).
This is twice the cost of computing θ1, but the variance is more than halved!
55 / 70
Antithetic variables - II
We need to find two estimators which are negatively correlated. This can be done as follows:
I If U ∼ U[0, 1], then 1 − U ∼ U[0, 1] also.
I If F is the distribution function of X, then X1 = F^−1(U) and X2 = F^−1(1 − U) are both distributed according to F,
I and Cov(X1, X2) < 0.
Proof (non-examinable):
Let h(u) = F^−1(u). Then h(u) is a non-decreasing function. We need to show
E h(U)h(1 − U) ≤ (E h(U))².
Let Q = E h(U). Then since h is non-decreasing on [0, 1],
h(0) ≤ Q ≤ h(1).
56 / 70
Let f(y) = ∫_0^y h(1 − x)dx − Qy on [0, 1].
Then f(0) = f(1) = 0, and
f′(y) = h(1 − y) − Q
is a non-increasing function. Since f′(0) = h(1) − Q ≥ 0 and f′(1) = h(0) − Q ≤ 0, we must have
f(u) ≥ 0 on [0, 1].
Therefore
0 ≤ ∫_0^1 f(y)h′(y)dy = [fh]_0^1 − ∫_0^1 f′(y)h(y)dy = −∫_0^1 f′(y)h(y)dy.
Therefore
∫_0^1 f′(y)h(y)dy = ∫_0^1 h(y)(h(1 − y) − Q)dy = ∫_0^1 h(y)h(1 − y)dy − Q² ≤ 0.
Hence ∫_0^1 h(y)h(1 − y)dy ≤ Q², as required.
57 / 70
Cauchy Example Revisited
Above we used
θ3 = 1/2 − (2/n) Σ_{i=1}^n [1/(π(1 + ui²))]
as an estimator of P(X > 2), where X ∼ Cauchy. An estimator with a smaller variance can be found using antithetic variables:
(1/2)(1/2 − (2/n) Σ_{i=1}^n [1/(π(1 + ui²))] + 1/2 − (2/n) Σ_{i=1}^n [1/(π(1 + (2 − ui)²))]),
which gives
θantithetic = 1/2 − (1/n) Σ_{i=1}^n [1/(π(1 + ui²)) + 1/(π(1 + (2 − ui)²))].
For n = 10 we find the variance of θ3 is 2.7 × 10^−4, whereas the variance of θantithetic is 5.5 × 10^−6 - a substantial improvement.
58 / 70
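The variance reduction can be checked empirically; the Python sketch below repeats both estimators over many replications and compares their variances (the per-estimate sample size n = 10, the seed, and the number of replications are illustrative choices):

```python
import math
import random

# Antithetic variables for the Cauchy example: each u_i ~ U(0, 2) is
# paired with 2 - u_i.

def c(u):
    return 1.0 / (math.pi * (1.0 + u * u))

def theta3(us):                     # plain estimator based on u_i ~ U(0, 2)
    return 0.5 - 2.0 * sum(c(u) for u in us) / len(us)

def theta_anti(us):                 # antithetic version
    return 0.5 - sum(c(u) + c(2.0 - u) for u in us) / len(us)

rng = random.Random(4)
samples = [[2.0 * rng.random() for _ in range(10)] for _ in range(4000)]

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

ests3 = [theta3(us) for us in samples]
ests_anti = [theta_anti(us) for us in samples]
v3 = variance(ests3)
vanti = variance(ests_anti)
```

Across replications, both estimators centre on θ ≈ 0.1476, while the antithetic variance is dramatically smaller than the plain variance, reflecting the strong negative correlation between c(u) and c(2 − u).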
3.8 Bayesian inference
Unnormalised densities frequently occur when we are doing Bayesian inference.
Suppose we are interested in some posterior expectation, for example, the posterior mean:
I = E(θ|x) = ∫ θf(θ|x)dθ,
where
f(θ|x) = f(θ)f(x|θ)/f(x), by Bayes' theorem.
The denominator f(x) = ∫ f(θ)f(x|θ)dθ is often intractable and unknown, and so we instead work with the unnormalised density
f1(θ|x) = f(θ)f(x|θ) = prior × likelihood.
59 / 70
Rejection sampling for Bayesian inference
You may have seen in MAS364 (or the Autumn of MAS6004) how to sample from a posterior distribution using MCMC. We can also use rejection sampling, or estimate posterior expectations using importance sampling.
To sample θ from the posterior f(θ|x), using proposal density g (assuming f1(θ|x)/g(θ) ≤ M for all θ), we can do:
1. Simulate θ ∼ g(·).
2. Accept θ with probability
f(θ)f(x|θ)/(Mg(θ)),
otherwise reject θ.
If we use g(θ) = f(θ), i.e. use the prior as the proposal, then this reduces to accepting θ with probability f(x|θ)/M, but this is usually inefficient (i.e. M is large, so the acceptance rate 1/M is small).
60 / 70
Importance sampling for Bayesian inference
Suppose we wish to estimate the posterior expectation
E(r(θ)|x) = ∫ r(θ)f(θ|x)dθ.
We could use importance sampling, using the prior distribution as the importance distribution, i.e. g = f. If we do not know f(x), then we can use the following importance sampling approach:
I Simulate θ1, . . . , θn from the prior f(θ).
I Set wi = f(x|θi).
I Set wi = wi/Σ wj and estimate E(r(θ)|x) by
Σ_{i=1}^n wi r(θi).
This is inefficient if the prior is very different from the posterior, as we will spend too much time sampling θi where the likelihood is very small, and so the weights w(θi) will also be very small.
If this is the case, then the effective sample size will be small, and our estimates of E(r(θ)|x) will be dominated by just a few of the θ samples.
61 / 70
Choice of g and the normal approximation
A more efficient alternative to using the prior distribution for g is to build a normal approximation to the posterior and use this as g.
Let h(θ) = log f(θ|x). Now define m to be the posterior mode of θ, so m maximises both f(θ|x) and h(θ).
We may need to use numerical optimisation to find m, e.g. using the optim command in R. We can then use a Taylor expansion of h(θ) around m,
h(θ) = h(m) + (θ − m)^T h′(m) + (1/2)(θ − m)^T M(θ − m) + . . . ,
to build a Gaussian approximation to the posterior (known as the Laplace approximation). Here h′(m) is the vector of first derivatives of h(θ), and M the matrix of second derivatives of h(θ), both evaluated at θ = m.
62 / 70
Since m maximises h(θ), we have h′(m) = 0. Hence
f(θ|x) = exp{h(θ)} ≈ exp{h(m)} exp{−(1/2)(θ − m)^T V^−1(θ − m)},   (2)
where −V^−1 = M.
Thus, our approximation of f(θ|x) is a multivariate normal distribution with mean vector m and variance matrix −M^−1. This will be a good approximation if the posterior mass is concentrated around m.
NB: We do not need f(x) to obtain M, since
h(θ) = log f(θ|x) = log f(θ) + log f(x|θ) − log f(x),
so log f(x) will disappear when we differentiate h(θ).
63 / 70
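A one-dimensional Laplace sketch makes steps (2) concrete; the Gamma(a, b) "posterior" below is a made-up stand-in for f(θ|x), chosen because its mode and second derivative are available in closed form:

```python
import math

# Laplace approximation in 1-D: centre a normal at the mode m of the log
# density h, with variance V = -1/h''(m).

a, b = 5.0, 2.0                 # assumed shape/rate, for illustration only

def h(t):                       # log unnormalised density of Gamma(a, b)
    return (a - 1.0) * math.log(t) - b * t

mode = (a - 1.0) / b            # solves h'(t) = (a-1)/t - b = 0
d2 = -(a - 1.0) / mode ** 2     # h''(t) = -(a-1)/t^2 at t = mode
V = -1.0 / d2                   # Laplace variance
```

Here the mode is (a − 1)/b = 2 and V = mode²/(a − 1) = 1, so the approximating g would be N(2, 1); in the multi-parameter case, the mode and the Hessian M would instead come from a numerical optimiser (optim in R), exactly as in the leukaemia example that follows.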
Assessing convergence
Suppose we wish to estimate E{r(θ)|x} for some r(θ). If f(x) is known, then
Ê{r(θ)|x} = (1/n) Σ_{i=1}^n r(θi)w(θi),
and we can use the central limit theorem to obtain a confidence interval for E{r(θ)|x}, as in MC integration. We can check our estimate by:
1) Increasing the sample size n, to check the stability of any estimate.
2) Increasing the standard deviation in the g(θ) density, to check stability to the choice of g; e.g., if we're using a normal approximation, we could multiply V by 4, etc.
64 / 70
Example: leukaemia data
Patients suffering from leukaemia are given a drug, 6-mercaptopurine (6-MP), and the number of days xi until freedom from symptoms is recorded for patient i:
6*, 6, 6, 6, 7, 9*, 10*, 10, 11*, 13, 16, 17*,
19*, 20*, 22, 23, 25*, 32*, 32*, 34*, 35*.
A * denotes a censored observation. We will suppose that the time x to the event of interest follows a Weibull distribution:
f(x|α, β) = αβ(βx)^{α−1} exp{−(βx)^α}
for x > 0. For censored observations, we have
P(x > t|α, β) = exp{−(βt)^α}.
65 / 70
Example: leukaemia data
Likelihood
Define:
I d: the number of uncensored observations,
I Σ_u log xi: the sum of the logs of all uncensored observations.
Writing θ = (α, β)^T, the log likelihood is then given by
log f(x|θ) = d log α + αd log β + (α − 1) Σ_u log xi − β^α Σ_{i=1}^n xi^α.
Suppose our prior distributions for α and β are both exponential with
f(α) = 0.001 exp(−0.001α),
f(β) = 0.001 exp(−0.001β).
66 / 70
Example: leukaemia data
Building an approximation to the posterior
1) Obtain the posterior mode of θ. Maximise the log posterior, i.e.
h(θ) = d log α + αd log β + (α − 1) Σ_u log xi − β^α Σ_{i=1}^n xi^α − 0.001α − 0.001β + K,
for some constant K.
In R, we can find the mode to be m = (1.354, 0.030) using the optim command.
67 / 70
2) Derive the matrix of second derivatives of h(θ):
M = ( ∂²h/∂α²   ∂²h/∂α∂β
      ∂²h/∂α∂β  ∂²h/∂β²  ),
evaluated at θ = m, where
∂²h/∂α² = −d/α² − Σ (βxi)^α (log(βxi))²,
∂²h/∂β² = (1/β²){β^α α(1 − α) Σ_{i=1}^n xi^α − dα},
∂²h/∂α∂β = (1/β)[d − β^α{α log β Σ_{i=1}^n xi^α + Σ_{i=1}^n xi^α + α Σ_{i=1}^n xi^α log xi}].
Evaluating at the mode gives
M = ( −31.618    175.442
       175.442  −18806.085 ).
68 / 70
3) Obtain the normal approximation to use as g(θ). g(θ): bivariate normal, mean m, variance matrix V = −M^−1:
θ ∼ N{ (1.354, 0.030)^T, ( 0.0334  0.0003
                            0.0003  0.00006 ) }.
4) Sample θ1, . . . , θn from g(θ) and compute the importance weights w(θ1), . . . , w(θn). The weights are given by
wi = w(θi)/Σ_{j=1}^n w(θj), with w(θi) = f(θi)f(x|θi)/g(θi).
NB: the Gaussian approximation may give us negative samples. Since α > 0 and β > 0, we should simply discard negative θ values, i.e., use a truncated normal density for g(θ).
Note that when we compute w(θi), it is not necessary to rescale g(θ) so that it integrates to 1, as any normalising constant in g(θ) will cancel.
69 / 70
5) Estimate the posterior mean of θ. We compute the estimate
Ê(θ|x) = Σ_{i=1}^n θi wi.
In R, with n = 100000, this gives Ê(θ|x) = (1.346, 0.031)^T.
6) Check for convergence. We repeat steps 4 and 5 with more dispersion in g(θ):
g(θ)        Ê(θ|x)
N(m, V)     (1.346, 0.031)^T
N(m, 4V)    (1.384, 0.031)^T
N(m, 16V)   (1.380, 0.031)^T
Finally, double the sample size (no effect observed). For percentiles, we can do resampling in R.
See computer class 5 for more details and code to implement this approach.
70 / 70