
Variational Inference

Note: Much (meaning almost all) of this has been liberated from John Winn's and Matthew Beal's theses, and David MacKay's book.

Overview

• Probabilistic models & Bayesian inference

• Variational Inference

• Univariate Gaussian Example

• GMM Example

• Variational Message Passing

Bayesian networks

• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:

P(C, L, S, I) = P(L) P(C) P(S|C) P(I|L,S)

[Figure: Bayesian network with nodes Object class (C), Lighting color (L), Surface color (S) and Image color (I), with conditionals P(C), P(L), P(S|C), P(I|L,S)]

Bayesian inference

[Figure: the same network, with nodes marked as hidden or observed]

• Observed variables D and hidden variables H.

• Hidden variables include parameters and latent variables.

• Learning/inference involves finding:
• P(H1, H2, … | D), or
• P(H | D, M) explicitly for a generative model.


Bayesian inference vs. ML/MAP

• Consider learning one parameter θ
• How should we represent this posterior distribution?

P(D | θ) P(θ)

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ

[Figure: plot of P(D | θ) P(θ) against θ, with θMAP marked at the maximum of P(D | θ) P(θ)]

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ

[Figure: the same plot, contrasting the region of high probability density around θMAP with the region of high probability mass]

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ

[Figure: the same plot with θML marked and samples drawn along θ]

Bayesian inference vs. ML/MAP

• Consider learning one parameter θ

[Figure: the same plot with θML marked and a variational approximation Q(θ) overlaid]

Variational Inference

1. Choose a family of variational distributions Q(H).

2. Use Kullback-Leibler divergence KL(Q||P) as a measure of ‘distance’ between P(H|D) and Q(H).

3. Find Q which minimizes divergence.

(in three easy steps…)

Choose Variational Distribution

• P(H|D) ≈ Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so it can “get” close to P and is tractable – factorized, conjugate.

Kullback-Leibler Divergence

• Derived from Variational Free Energy by Feynman and Bogoliubov

• Relative entropy between two probability distributions
• KL(Q||P) ≥ 0 for any Q (Jensen's inequality)
• KL(Q||P) = 0 iff P = Q
• Not a true distance measure – not symmetric

KL(Q \| P) = \sum_{x} Q(x) \ln \frac{Q(x)}{P(x)}
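As a quick concrete check of this definition, here is a minimal NumPy sketch (my own toy numbers, not from the slides) that evaluates KL(Q||P) for two discrete distributions and shows it is non-negative and asymmetric:

import numpy as np

def kl(q, p):
    # KL(Q||P) = sum_x Q(x) ln( Q(x) / P(x) ) for discrete distributions
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                       # treat 0 * ln 0 as 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])          # "true" distribution P
q = np.array([0.4, 0.4, 0.2])          # approximation Q

print(kl(q, p))    # >= 0
print(kl(p, q))    # a different value: KL is not symmetric
print(kl(p, p))    # exactly 0 when Q = P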

Kullback-Leibler Divergence

Minimising KL(Q || P) – exclusive:

KL(Q \| P) = \sum_{H} Q(H) \ln \frac{Q(H)}{P(H \mid D)}

Minimising KL(P || Q) – inclusive:

KL(P \| Q) = \sum_{H} P(H \mid D) \ln \frac{P(H \mid D)}{Q(H)}

[Figure: P with an exclusive Q fit]
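The exclusive/inclusive distinction can be seen numerically. The sketch below (my own construction, not taken from the slides) discretises a bimodal P on a grid and fits a single Gaussian Q by brute-force grid search, once minimising KL(Q||P) and once KL(P||Q); the exclusive fit locks onto one mode while the inclusive fit spreads over both:

import numpy as np

x = np.linspace(-6, 6, 601)

def normalise(f):
    return f / f.sum()

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2)

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

# Bimodal target P: equal mixture of two well-separated Gaussians.
p = normalise(0.5 * gauss(x, -2, 0.5) + 0.5 * gauss(x, 2, 0.5))

best_ex, best_in = None, None
for m in np.linspace(-4, 4, 81):           # candidate means for Q
    for s in np.linspace(0.3, 4, 38):      # candidate widths for Q
        q = normalise(gauss(x, m, s))
        ex, inc = kl(q, p), kl(p, q)       # exclusive and inclusive divergences
        if best_ex is None or ex < best_ex[0]:
            best_ex = (ex, m, s)
        if best_in is None or inc < best_in[0]:
            best_in = (inc, m, s)

print("min KL(Q||P): mean %.1f, width %.1f" % best_ex[1:])   # hugs one mode
print("min KL(P||Q): mean %.1f, width %.1f" % best_in[1:])   # covers both modes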

Kullback-Leibler Divergence

KL(Q \| P) = \sum_{H} Q(H) \ln \frac{Q(H)}{P(H \mid D)}

Bayes' rule, P(H \mid D) = P(H, D) / P(D):

KL(Q \| P) = \sum_{H} Q(H) \ln \frac{Q(H) \, P(D)}{P(H, D)}

Log property:

KL(Q \| P) = \sum_{H} Q(H) \ln \frac{Q(H)}{P(H, D)} + \sum_{H} Q(H) \ln P(D)

Sum over H (\sum_{H} Q(H) = 1):

KL(Q \| P) = \sum_{H} Q(H) \ln \frac{Q(H)}{P(H, D)} + \ln P(D)

Kullback-Leibler Divergence

DEFINE

L(Q) = \sum_{H} Q(H) \ln P(H, D) - \sum_{H} Q(H) \ln Q(H)

• L is the difference between the expectation of ln P(H, D) with respect to Q and the expectation of ln Q(H), i.e. plus the entropy of Q.

• Maximizing L(Q) is equivalent to minimizing the KL divergence.

• We could not do the same trick for KL(P||Q); thus we approximate the likelihood with a function that has its mass where the likelihood is most probable (exclusive).

KL(Q \| P) = \sum_{H} Q(H) \ln \frac{Q(H)}{P(H, D)} + \ln P(D)

KL(Q \| P) = \ln P(D) - L(Q)

Summarize

• For arbitrary Q(H):

\ln P(D) = L(Q) + KL(Q \| P)

where ln P(D) is fixed, so maximising L(Q) is equivalent to minimising KL(Q || P).

• KL(Q || P) is still difficult in general to calculate directly, since it involves the posterior.
• We choose a family of Q distributions where L(Q) is tractable to compute.
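This decomposition is easy to verify numerically. Below is a sketch on a made-up discrete model (the numbers are arbitrary; everything is small enough to enumerate), checking that L(Q) + KL(Q||P) equals ln P(D) for any valid Q:

import numpy as np

# Toy model: hidden H in {0, 1, 2}, one fixed observation D.
prior = np.array([0.5, 0.3, 0.2])        # P(H)
lik = np.array([0.10, 0.60, 0.30])       # P(D | H) evaluated at the observed D
joint = prior * lik                      # P(H, D)
evidence = joint.sum()                   # P(D)
posterior = joint / evidence             # P(H | D)

q = np.array([0.2, 0.5, 0.3])            # any valid variational distribution Q(H)

L = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))   # L(Q)
KL = np.sum(q * np.log(q / posterior))                  # KL(Q || P(H|D))

print(L + KL)             # equals ln P(D)
print(np.log(evidence))   # ln P(D)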

Minimising the KL divergence

[Figure: bar diagram of ln P(D) (fixed) decomposed into L(Q) and KL(Q || P); maximising L(Q) shrinks KL(Q || P)]

Factorised Approximation

• Assume Q factorises: Q(H) = \prod_i Q_i(H_i)

• Optimal solution for one factor given by:

Q_j^*(H_j) = \frac{1}{Z} \exp\left( \sum_{H_{i \ne j}} \Big[ \prod_{i \ne j} Q_i(H_i) \Big] \ln P(H, D) \right)

• Given the form of Q, find the best Q in the KL sense
• Choose conjugate priors P(H) to give the form of Q
• Iterate over each Qi(Hi)

Derivation

Idea: use the factoring of Q to isolate Q_j and maximise L with respect to Q_j.

L(Q) = \sum_{H} Q(H) \ln P(H, D) - \sum_{H} Q(H) \ln Q(H)

Substitution, Q(H) = \prod_i Q_i(H_i):

L(Q) = \sum_{H} \prod_i Q_i(H_i) \ln P(H, D) - \sum_{H} \prod_i Q_i(H_i) \sum_j \ln Q_j(H_j)

Log property:

L(Q) = \sum_{H} \prod_i Q_i(H_i) \ln P(H, D) - \sum_i \sum_{H_i} Q_i(H_i) \ln Q_i(H_i)

Factor out one term Q_j:

L(Q) = \sum_{H_j} Q_j(H_j) \sum_{H_{i \ne j}} \Big[ \prod_{i \ne j} Q_i(H_i) \Big] \ln P(H, D) - \sum_{H_j} Q_j(H_j) \ln Q_j(H_j) - \sum_{i \ne j} \sum_{H_i} Q_i(H_i) \ln Q_i(H_i)

The last term is not a function of Q_j. Using the definition of Q_j^*:

L(Q) = -KL(Q_j \| Q_j^*) + \log Z + \text{(terms not involving } Q_j\text{)}

so L is maximised with respect to Q_j by setting Q_j = Q_j^*.
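The optimal-factor update can be exercised directly on a tiny discrete model. The sketch below (my own toy joint table, not from the slides) has two hidden variables with Q(H1, H2) = Q1(H1) Q2(H2); each factor is set to its Q* in turn, and the bound L(Q) never decreases:

import numpy as np

rng = np.random.default_rng(0)

# Unnormalised joint P(H1, H2, D) for the observed D, as a 3x4 table.
joint = rng.random((3, 4)) + 0.1
log_joint = np.log(joint)

# Factorised approximation, initialised uniformly.
q1 = np.full(3, 1.0 / 3)
q2 = np.full(4, 1.0 / 4)

def bound(q1, q2):
    # L(Q) = E_Q[ln P(H, D)] + entropy of Q
    q = np.outer(q1, q2)
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

for it in range(10):
    # Q1*(h1) proportional to exp( sum_h2 Q2(h2) ln P(h1, h2, D) )
    q1 = np.exp(log_joint @ q2)
    q1 /= q1.sum()
    # Q2*(h2) proportional to exp( sum_h1 Q1(h1) ln P(h1, h2, D) )
    q2 = np.exp(q1 @ log_joint)
    q2 /= q2.sum()
    print(it, bound(q1, q2))   # monotonically non-decreasing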

Example: Univariate Gaussian

• Normal distribution
• Find P(μ, γ | x)
• Conjugate priors
• Factorized variational distribution
• Q distribution same form as prior distributions
• Inference involves updating these hidden parameters

Example: Univariate Gaussian

• Use Q* to derive the updates for Q(μ) and Q(γ)
• Where <·> is the expectation over the Q distribution
• Iteratively solve

Example: Univariate Gaussian

• An estimate of the log evidence can be found by calculating L(Q)
• Where <·> are expectations with respect to Q(·)

Example

Take four data samples from a Gaussian (thick line) to find the posterior; the dashed lines show the variational approximation.

[Figure: variational and true posterior for a Gaussian given four samples, with priors P(μ) = N(0, 1000) and P(γ) = Gamma(0.001, 0.001)]
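The iteration the example describes can be written out concretely. The sketch below uses the same setup, a prior P(μ) = N(0, 1000) on the mean and P(γ) = Gamma(0.001, 0.001) on the precision, with a factorised Q(μ)Q(γ); the update formulas are the standard ones for this conditionally conjugate model, reproduced from memory rather than copied from the slides:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=4)   # four samples from a Gaussian
N = len(x)

# Priors: mu ~ N(m0, 1/beta0) with variance 1000, gamma ~ Gamma(a0, b0).
m0, beta0 = 0.0, 1.0 / 1000.0
a0, b0 = 1e-3, 1e-3

# Variational posteriors Q(mu) = N(m, 1/beta) and Q(gamma) = Gamma(a, b).
m, beta = 0.0, 1.0
E_gamma = 1.0                                # <gamma> under Q

for it in range(50):
    # Update Q(mu): precision-weighted combination of prior and data.
    beta = beta0 + N * E_gamma
    m = (beta0 * m0 + E_gamma * x.sum()) / beta
    E_mu, E_mu2 = m, m * m + 1.0 / beta      # <mu> and <mu^2>

    # Update Q(gamma): a = a0 + N/2, b = b0 + 0.5 * sum <(x_i - mu)^2>.
    a = a0 + 0.5 * N
    b = b0 + 0.5 * (np.sum(x ** 2) - 2.0 * E_mu * x.sum() + N * E_mu2)
    E_gamma = a / b                          # <gamma>

print("E[mu]    =", m)
print("E[gamma] =", E_gamma, "(sample precision:", 1.0 / x.var(), ")")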

VB with Image Segmentation

[Figure: RGB histograms of two pixel locations in the image]

“VB at the pixel level will give better results.”

Feature vector (x,y,Vx,Vy,r,g,b) - will have issues with data association.

VB with GMM will be complex – doing this in real time will be execrable.

Lower Bound for GMM – Ugly

Variational Equations for GMM – Ugly

Brings Up VMP – Efficient Computation

[Figure: the Bayesian network from earlier – nodes C, L, S, I with conditionals P(C), P(L), P(S|C), P(I|L,S)]
