Probabilities, Bayes Rule, Markov Chain Monte Carlo
TRANSCRIPT
High Performance Computing for Science and Engineering II, Pantelis Vlachas
Computational Science and Engineering Lab, ETH Zürich
Structure
Bayes Rule
Markov Chain Monte Carlo
Computing the posterior (coin toss)
Conjugate Priors
Bayes Rule

P(A|B) = P(B|A) P(A) / P(B)

• We assume a model M(θ) (usually omitted/self-explained and absorbed by θ)
• We look for a parametrisation θ of M that “explains” the data
• D is some observed data
• P(D|θ) is the likelihood of observing the data, given that we have a model of the reality
• p(θ) / p(θ|M) is the prior
• p(D) / p(D|M) is the data evidence

GENERAL FORM:
p(θ|D, M) = p(D|θ, M) p(θ|M) / p(D|M)

MODEL ABSORBED in θ:
p(θ|D) = p(D|θ) p(θ) / p(D)

DATA EVIDENCE (does not depend on θ):
p(D|M) = ∫ p(D|θ′, M) p(θ′|M) dθ′   or   p(D) = ∫ p(D|θ′) p(θ′) dθ′
Bayes Rule
=p(D |θ, M)
LIKELIHOOD ℒ
p(θ |M)
PRIOR π(θ)
p(θ |D, M)
POSTERIOR
p(D |M)DATA EVIDENCE
(does not depend on )θ
Bayes Rule

THE POSTERIOR IS A COMPROMISE BETWEEN THE PRIOR AND THE DATA (LIKELIHOOD):
p(θ|D) ∝ ℒ(D|θ) π(θ), with prior p(θ) = π(θ) and likelihood p(D|θ) = ℒ(D|θ)

Speagle, J. S. (2021). “A conceptual introduction to Markov chain Monte Carlo methods”, arXiv preprint arXiv:1909.12313.
Update Belief Based on Data: Computation of the Posterior

Experiment 1 (prior p0(θ), data x1):
p(θ|x1) = p(x1|θ) p0(θ) / p(x1)

Today’s posterior is the prior of tomorrow: p1(θ) =̂ p(θ|x1)

Experiment 2 (prior p1(θ), data x2):
p(θ|x2) = p(x2|θ) p1(θ) / p(x2)

Iterating over all experiments yields an accurate estimate of p(θ|x1, …, xN).
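A minimal sketch of this sequential update on a discretized grid of θ values, assuming a Bernoulli coin-flip model (the data, grid resolution, and names such as theta_grid are illustrative choices, not part of the lecture): updating one observation at a time gives the same posterior as conditioning on all the data at once.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
data = rng.random(50) < theta_true          # x_1, ..., x_N: 50 coin flips

theta_grid = np.linspace(0.0, 1.0, 1001)
posterior = np.ones_like(theta_grid)        # p_0(theta): uniform prior on [0, 1]

for x in data:
    # Likelihood of a single flip: theta for heads, (1 - theta) for tails
    likelihood = theta_grid if x else (1.0 - theta_grid)
    posterior = posterior * likelihood               # p(x_k|theta) p_{k-1}(theta)
    posterior /= np.trapz(posterior, theta_grid)     # divide by the evidence p(x_k)

print("posterior mean of theta:", np.trapz(theta_grid * posterior, theta_grid))
```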
Conjugate Priors

p(θ|D) = p(D|θ) p(θ) / p(D) = p(D|θ) p(θ) / ∫ p(D|θ′) p(θ′) dθ′ = p(D|θ) p(θ) / ∫ p(D, θ′) dθ′

• Given some prior knowledge of the “data generating process” (model M, etc.), the form of the likelihood p(D|θ) is fixed and well-defined.
• The choice (form) of the prior p(θ) affects both the numerator and the denominator, and determines the form of the posterior.
• In applications, we need either to (1) have an analytic form of the posterior (resolve Z, where p(θ|x) = (1/Z) p(D|θ) p(θ)), or (2) be able to sample from it.
• For certain choices of the prior p(θ), the posterior p(θ|x) has the same form (belongs to the same family, i.e. with different parameters)!
  A. Then p(θ) is conjugate to the likelihood p(D|θ).
  B. The normal distribution is a conjugate prior to a normal likelihood (see the sketch below)!
  C. Conjugate priors make the Bayesian update rule easy; otherwise numerical integration is needed.
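A short sketch of statement B, assuming a normal likelihood with known variance σ² and a normal prior N(μ0, τ0²); the numbers and variable names are illustrative. The analytic conjugate update is compared against a brute-force grid posterior, where the normalization Z is resolved by numerical integration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                        # known std of the normal likelihood
mu0, tau0 = 0.0, 5.0               # prior: theta ~ N(mu0, tau0^2)
x = rng.normal(1.5, sigma, size=20)
n, xbar = len(x), x.mean()

# Analytic conjugate update (normal prior x normal likelihood -> normal posterior)
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + n * xbar / sigma**2)

# Brute-force posterior on a grid: resolve Z by numerical integration
theta = np.linspace(-10.0, 10.0, 20001)
log_post = -0.5 * (((x[:, None] - theta) / sigma) ** 2).sum(axis=0) \
           - 0.5 * ((theta - mu0) / tau0) ** 2
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, theta)

print("analytic posterior mean/std:", mu_n, np.sqrt(tau_n2))
print("numerical posterior mean   :", np.trapz(theta * post, theta))
```

The closed-form update is exact; the grid integration is the fallback that statement C refers to when no conjugate prior is available.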
Coin Toss Experiment

• You are given a coin which is probably counterfeit, and you perform experiments by flipping it.
• Repeated runs (sampling) from a Bernoulli distribution Bern(θ).
• Suppose the probability of a head toss is P(H) = θ (unknown). [Illustration: a biased coin with P(H) = 0.3 and P(T) = 0.7.]
• If you knew θ, what is the probability of NH head tosses in N trials (likelihood)?
  • NH head tosses, each with probability θ
  • N − NH tail tosses, each with probability 1 − θ
  • (N choose NH): the number of permutations of the N total tosses that have NH head tosses

LIKELIHOOD (Binomial distribution):
p(NH) = (N choose NH) θ^NH (1 − θ)^(N−NH)
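As a quick check, a sketch evaluating this binomial likelihood over a grid of θ values; scipy and the specific counts N = 10, NH = 7 are our own assumptions:

```python
import numpy as np
from scipy.stats import binom

N, NH = 10, 7                           # e.g. 7 heads in 10 tosses
theta = np.linspace(0.0, 1.0, 101)
likelihood = binom.pmf(NH, N, theta)    # (N choose NH) theta^NH (1-theta)^(N-NH)
print("theta with maximum likelihood:", theta[np.argmax(likelihood)])  # NH/N = 0.7
```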
How to select a prior?

• In this case: a conjugate prior to the binomial likelihood?
• Random variable on which the prior is defined: the parametrization of the model, P(H) = θ.
• Support? θ ∈ [0, 1].
• Initially we might assume that we do not know anything about the coin (uninformative prior).
• Uninformative prior: Uniform U[0, 1].
• This is a special case of the Beta distribution.
An Informative Prior

• Suppose that we do have information about the coin: we know that most probably it is a fair coin (why shouldn’t it be?).
• We want to incorporate this information into the prior belief p(θ).
• Selection of a prior belief p(θ) concentrated around θ = P(H) = 0.5.
• The Beta distribution is flexible enough to allow this (choose α = β)!
• The magnitude of the shape parameters controls our confidence (see the sketch below).
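A tiny sketch of this effect, assuming scipy is available (the specific α values are our own choices): Beta(α, α) is always centered at θ = 0.5, and larger shape parameters concentrate the prior mass, i.e. express more confidence.

```python
from scipy.stats import beta

# alpha = beta centers the prior at theta = 0.5; larger values -> more confidence
for a in (1, 2, 10, 100):
    print(f"Beta({a}, {a}): mean = {beta.mean(a, a):.2f}, std = {beta.std(a, a):.3f}")
```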
Conjugate to Binomial Likelihood

• The Beta function is defined as B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt.
• How to choose the prior? Support: P(H) = θ ∈ [0, 1].
• The prior for θ =̂ P(H) is selected as the Beta distribution Beta(α, β):
  p(θ) =̂ Beta(θ; α, β) =̂ θ^(α−1) (1 − θ)^(β−1) / B(α, β)
• The Beta distribution: a distribution over the parametrisation of another distribution! It describes how likely a random variable (a probability, with support [0, 1]) is to take a value x; it is parametrized by the shape parameters α, β.
• Assume that you conduct the experiment and get NH heads and NT = N − NH tails.

LIKELIHOOD (Binomial distribution; data: N, NH; model parameter: θ = x):
p(N, NH | θ = x) = (N choose NH) x^NH (1 − x)^(N−NH)

PRIOR (Beta):
p(θ = x) =̂ x^(α−1) (1 − x)^(β−1) / B(α, β)
Conjugate to Binomial Likelihood

POSTERIOR (data: N, NH; model parameter: θ = x):

p(θ = x | N, NH) = p(N, NH | θ = x) p(θ = x) / p(N, NH)

= [(N choose NH) x^NH (1 − x)^(N−NH) · x^(α−1) (1 − x)^(β−1) / B(α, β)] / ∫₀¹ (N choose NH) y^NH (1 − y)^(N−NH) · y^(α−1) (1 − y)^(β−1) / B(α, β) dy

= x^(NH+α−1) (1 − x)^(N−NH+β−1) / ∫₀¹ y^(NH+α−1) (1 − y)^(N−NH+β−1) dy

= x^(NH+α−1) (1 − x)^(N−NH+β−1) / B(α + NH, β + N − NH)

= Beta(α + NH, β + N − NH)

using the LIKELIHOOD p(N, NH | θ = x) = (N choose NH) x^NH (1 − x)^(N−NH) (Binomial), the PRIOR p(θ = x) =̂ x^(α−1) (1 − x)^(β−1) / B(α, β) (Beta), and the Beta function B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt.
Conjugate to Binomial Likelihood: BETA

Posterior ∝ Prior × Likelihood
p(θ|N, NH) ∝ p(θ) × p(NH, N|θ)
p(θ|N, NH) ∝ Beta(α, β) × Binomial(N, NH)
p(θ|N, NH) = Beta(α + NH, β + N − NH)

The Beta distribution is a conjugate distribution to the binomial likelihood!

Coding …
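A minimal sketch of this conjugate update in code; the seed, the assumed bias theta_true = 0.3, the prior Beta(5, 5), and N = 100 tosses are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)
theta_true = 0.3                    # the (hidden) bias of the counterfeit coin
N = 100
NH = int((rng.random(N) < theta_true).sum())    # observed number of heads

a, b = 5.0, 5.0                     # informative Beta prior centered at 0.5
a_post, b_post = a + NH, b + N - NH # posterior: Beta(a + NH, b + N - NH)

print(f"observed heads: {NH}/{N}")
print(f"posterior mean of theta: {a_post / (a_post + b_post):.3f}")
print("95% credible interval:", beta.ppf([0.025, 0.975], a_post, b_post))
```

No numerical integration is needed: the update is just parameter arithmetic, which is exactly what makes conjugate priors convenient.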
Markov Chain Monte Carlo

p(θ|D, M) = ℒ(D|θ, M) π(θ|M) / p(D|M)
(POSTERIOR = LIKELIHOOD ℒ × PRIOR π(θ) / DATA EVIDENCE, which does not depend on θ)

In practice:
• Conjugate priors exist only for simple/academic examples.
• In MCMC we estimate/sample from the posterior without the normalization factor Z = p(D|M).
• A very important factor: the SELECTION OF THE PRIOR (prior knowledge, selection of the distribution, range, many issues, “informative priors”).
• Numerical estimation of model parameters and their uncertainty.
• Calculation of high-dimensional integrals over complex surfaces.
• E.g. a particle moving in a potential V(x): the probability of its location is p(x) ∝ exp(−V(x)), whose normalisation constant is difficult to evaluate; the goal is to calculate physical quantities (mean position, etc.) by integration, ∫ f(x) p(x) dx. How? Use the simulated values (Markov chain) for posterior analysis.
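A sketch of this motivating example in one dimension, with a double-well potential of our own choosing: here the normalisation Z and the mean position are still tractable by quadrature, which is exactly what breaks down in high dimensions.

```python
import numpy as np

V = lambda x: (x**2 - 1.0) ** 2              # double-well potential (our choice)
x = np.linspace(-3.0, 3.0, 10001)
p_unnorm = np.exp(-V(x))                     # p(x) ∝ exp(-V(x)), unnormalized
Z = np.trapz(p_unnorm, x)                    # tractable by quadrature only in 1D
mean_pos = np.trapz(x * p_unnorm, x) / Z     # ∫ x p(x) dx, the mean position
print("Z =", Z, " mean position =", mean_pos)  # mean ≈ 0 by symmetry
```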
Markov Chain Monte Carlo

GIVEN:
• MODEL M(θ)
• LIKELIHOOD ℒ(D|θ)
• PRIOR p(θ) = π(θ)
• We can evaluate the posterior up to a constant (Bayes rule): p(θ|D, M) ∝ ℒ(D|θ, M) π(θ|M)

[Figure: samples in the (θ1, θ2) parameter plane; annotations: the initial guess θ^0 in a region of low likelihood and low posterior; the high-probability region, where the true answer lies somewhere; a point with high likelihood but low posterior.]

METROPOLIS SAMPLING
• Initial guess θ^0 = (θ^0_1, θ^0_2) drawn from π; low likelihood, low posterior P^0 = p(θ^0|D).
• Sample a candidate θ⋆ from a proposal distribution, θ⋆ ∼ p(θ^(k+1)|θ^k), and evaluate P⋆ = p(θ⋆|D).
• If P⋆ > P^k, accept the jump: θ^(k+1) = θ⋆.
• If P⋆ ≤ P^k, accept with probability P⋆/P^k.
• The evidence does not matter!

1. The endless jumps form a chain.
2. The initial burn-in steps should be removed.

See the lecture for: a more elaborate explanation, the form of the proposal distribution, more sophisticated algorithms, and convergence. A minimal sketch of the sampler follows below.
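Below, a minimal sketch of the Metropolis sampler just described, applied to the double-well target p(x) ∝ exp(−V(x)) from the earlier example; the chain length, proposal width, and burn-in cutoff are illustrative assumptions, not tuned values. Note that only the unnormalized density is ever evaluated: the evidence Z cancels in the ratio P⋆/P^k.

```python
import numpy as np

V = lambda x: (x**2 - 1.0) ** 2              # double-well potential from above
log_p = lambda x: -V(x)                      # log of the unnormalized density

rng = np.random.default_rng(0)
n_steps, step = 50_000, 1.0                  # chain length and proposal width
chain = np.empty(n_steps)
x, lp = 0.0, log_p(0.0)                      # initial guess and its log-density

for k in range(n_steps):
    x_star = x + step * rng.normal()         # candidate from a Gaussian proposal
    lp_star = log_p(x_star)
    # Accept if P* > P^k, otherwise accept with probability P*/P^k
    # (done in log space; the normalization Z cancels and never appears):
    if np.log(rng.random()) < lp_star - lp:
        x, lp = x_star, lp_star
    chain[k] = x

samples = chain[5_000:]                      # remove the initial burn-in steps
print("mean position from the chain:", samples.mean())  # ≈ 0 by symmetry
```

The simulated values of the chain can then be used directly for posterior analysis, e.g. estimating ∫ f(x) p(x) dx as the average of f over the retained samples.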