Approximate Inference: Variational Inference
CMSC 678, UMBC
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Recap from last time…
Graphical Models

Directed Models (Bayesian networks):
p(x_1, x_2, x_3, \dots, x_N) = \prod_i p(x_i \mid \mathrm{pa}(x_i))

Undirected Models (Markov random fields):
p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)
Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents.

p(x_i \mid x_{\neg i}) = \frac{p(x_1, \dots, x_N)}{\int p(x_1, \dots, x_N) \, dx_i}

= \frac{\prod_j p(x_j \mid \mathrm{pa}(x_j))}{\int \prod_j p(x_j \mid \mathrm{pa}(x_j)) \, dx_i}    (factorization of the graph)

= \frac{\prod_{j:\, j = i \text{ or } x_i \in \mathrm{pa}(x_j)} p(x_j \mid \mathrm{pa}(x_j))}{\int \prod_{j:\, j = i \text{ or } x_i \in \mathrm{pa}(x_j)} p(x_j \mid \mathrm{pa}(x_j)) \, dx_i}    (factor out terms not dependent on x_i)

The Markov blanket is exactly the set of nodes needed to form the complete conditional for a variable x_i.
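As a sanity check on this claim, here is a minimal sketch (my own, not from the slides) on a hypothetical binary chain x1 → x2 → x3 → x4 with made-up conditional probability tables. The blanket of x2 is {x1, x3}, so additionally conditioning on x4 should not change the conditional:

```python
# Verify that conditioning on the Markov blanket suffices, by brute force.
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Random CPTs: p(x1), p(x2|x1), p(x3|x2), p(x4|x3), each over {0, 1}.
p1 = rng.dirichlet(np.ones(2))
p2 = rng.dirichlet(np.ones(2), size=2)  # p2[x1] is a distribution over x2
p3 = rng.dirichlet(np.ones(2), size=2)
p4 = rng.dirichlet(np.ones(2), size=2)

def joint(x1, x2, x3, x4):
    return p1[x1] * p2[x1][x2] * p3[x2][x3] * p4[x3][x4]

def conditional_x2(given):
    # p(x2 | given), computed by enumerating all consistent assignments.
    probs = np.zeros(2)
    for x2 in (0, 1):
        for x1, x3, x4 in itertools.product((0, 1), repeat=3):
            assign = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}
            if all(assign[k] == v for k, v in given.items()):
                probs[x2] += joint(x1, x2, x3, x4)
    return probs / probs.sum()

blanket = conditional_x2({"x1": 1, "x3": 0})
full = conditional_x2({"x1": 1, "x3": 0, "x4": 1})
print(blanket, full)  # identical: x4 is irrelevant given the blanket
```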
Markov Random Fields with Factor Graph Notation

x: original pixel/state; y: observed (noisy) pixel/state.

Factor nodes are added according to maximal cliques. Factor graphs are bipartite: variable nodes connect only to factor nodes, with unary factors touching a single variable and binary factors touching two.
Two Problems for Undirected Models

Finding the normalizer:
Z = \sum_x \prod_m \psi_m(x_m)

Computing the marginals:
p_n(v) = \sum_{x:\, x_n = v} \prod_m \psi_m(x_m)

Q: Why are these difficult?
A: Many different variable combinations to sum over.
Sum over all variable combinations, with the x_n coordinate fixed. Example with 3 variables, fixing the 2nd dimension:

p_2(v) = \sum_{x_1} \sum_{x_3} \prod_m \psi_m(x = (x_1, v, x_3))

Belief propagation algorithms:
⢠sum-product (forward-backward in HMMs)
⢠max-product/max-sum (Viterbi)
Sum-Product

From variables to factors:
\mu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)

From factors to variables:
\mu_{m \to n}(x_n) = \sum_{x_m \setminus x_n} \psi_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})

Here N(m) is the set of variables that the m-th factor depends on, and M(n) is the set of factors in which variable n participates. The factor-to-variable message sums over configurations of the m-th factor's variables with variable n fixed. An empty product takes the default value of 1.
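On the same made-up chain as above, the messages give the exact marginal in linear time. A minimal sketch of one sum-product pass (my own illustration under the notation above, not the course's code):

```python
# Sum-product on the chain x1 - psi12 - x2 - psi23 - x3, binary variables.
# Leaf-to-root messages in both directions give exact marginals on a tree.
import numpy as np

psi12 = np.array([[2.0, 1.0], [1.0, 3.0]])  # factor over (x1, x2)
psi23 = np.array([[1.0, 4.0], [2.0, 1.0]])  # factor over (x2, x3)

# Variable-to-factor messages from the leaves: empty product, so all ones.
mu_x1_to_f12 = np.ones(2)
mu_x3_to_f23 = np.ones(2)

# Factor-to-variable: sum out the other variable, weighted by its message.
mu_f12_to_x2 = psi12.T @ mu_x1_to_f12   # sums over x1
mu_f23_to_x2 = psi23 @ mu_x3_to_f23     # sums over x3

# Marginal of x2: product of incoming messages, then normalize.
p2 = mu_f12_to_x2 * mu_f23_to_x2
p2 /= p2.sum()
print(p2)  # [15/27, 12/27], matching the brute-force result above
```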
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference

Hyperparameters α; unknown parameters Θ; observed data.

Likelihood model: p(data | Θ). Goal: the posterior p_α(Θ | data).

We're going to be Bayesian (perform Bayesian inference).
Posterior Classification vs. Posterior Inference

"Frequentist" methods: a prior over labels (maybe), not weights — p_{α,w}(y | data).
Bayesian methods: Θ includes the weight parameters — p_α(Θ | data).
(Some) Learning Techniques

• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Exponential Family Form

p(x \mid \eta) = h(x) \exp\big( \eta^\top \phi(x) - A(\eta) \big)

• h(x): support function — formally necessary, in practice irrelevant
• η: distribution parameters — natural parameters, i.e., feature weights
• φ(x): feature function(s) — the sufficient statistics
• A(η): the log-normalizer
Why? Capture Common Distributions

• Discrete (finite distributions)
• Gaussian (illustration: https://kanbanize.com/blog/wp-content/uploads/2014/07/Standard_deviation_diagram.png)
• Dirichlet (distributions over (finite) distributions)
• Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …
Why? "Easy" Gradients

The gradient of the log-likelihood is the difference between observed feature counts (counts w.r.t. the empirical distribution) and expected feature counts (counts w.r.t. the current model parameters) — we've already seen this with maxent models.

Why? "Easy" Expectations

The expectation of the sufficient statistics is the gradient of the log-normalizer:
\mathbb{E}[\phi(x)] = \nabla_\eta A(\eta)
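A small numeric check of this fact (my own, not from the slides), using the Bernoulli: in exponential family form φ(x) = x and A(η) = log(1 + e^η), so the expected sufficient statistic E[x] should equal A′(η) = sigmoid(η):

```python
# Expected sufficient statistic vs. gradient of the log-normalizer (Bernoulli).
import numpy as np

eta = 0.7                              # arbitrary natural parameter
A = lambda e: np.log1p(np.exp(e))      # log-normalizer A(eta) = log(1 + e^eta)

# Finite-difference gradient of A at eta.
eps = 1e-6
grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)

# Direct expectation: p(x = 1) = sigmoid(eta), phi(x) = x.
p1 = 1 / (1 + np.exp(-eta))
expected_phi = 0 * (1 - p1) + 1 * p1

print(grad_A, expected_phi)  # both equal sigmoid(0.7) ~ 0.668
```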
Why? "Easy" Posterior Inference

• p is the conjugate prior for q
• The posterior p has the same form as the prior p
• All exponential family models have a conjugate prior (in theory)

Posterior        | Likelihood           | Prior
Dirichlet (Beta) | Discrete (Bernoulli) | Dirichlet (Beta)
Normal           | Normal (fixed var.)  | Normal
Gamma            | Exponential          | Gamma
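A conjugacy sketch for the first row of the table (standard Beta–Bernoulli math; the data and pseudo-counts below are made up): with prior Beta(a, b) and n flips with k heads, the posterior is Beta(a + k, b + n − k), the same family as the prior.

```python
# Closed-form conjugate update: Beta prior, Bernoulli likelihood.
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                         # Beta prior pseudo-counts
flips = rng.binomial(1, 0.3, size=50)   # Bernoulli data, true bias 0.3
k, n = flips.sum(), len(flips)

post_a, post_b = a + k, b + (n - k)     # posterior is again a Beta
print(post_a, post_b, post_a / (post_a + post_b))  # posterior mean ~ 0.3
```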
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference

Hyperparameters α; unknown parameters Θ; observed data. Likelihood model: p(data | Θ). Goal: the posterior p_α(Θ | data).
(Some) Learning Techniques

• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Variational Inference

The posterior is difficult to compute. Introduce q(θ), an easy(ier)-to-compute distribution controlled by parameters λ, and minimize the "difference" between q and the posterior by changing λ.
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t.
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
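Concretely, the loop looks like this. The objective F below is a made-up one-dimensional stand-in, not a real variational bound; the point is only the structure of the five steps:

```python
# Generic gradient loop mirroring the pseudocode above.
import numpy as np

def F(lam):                  # hypothetical objective over variational params
    return -(lam - 3.0) ** 2

def F_grad(lam):
    return -2.0 * (lam - 3.0)

lam, t = 0.0, 0
while True:
    y = F(lam)               # 1. value
    g = F_grad(lam)          # 2. gradient
    rho = 0.1                # 3. scaling factor (fixed step size here)
    lam = lam + rho * g      # 4. update
    t += 1                   # 5. increment
    if abs(g) < 1e-8 or t > 1000:
        break
print(t, lam)                # converges to lam = 3
```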
Variational Inference: The Function to Optimize

Find the best distribution q (calculus of variations): q is any easy-to-compute distribution, with variational parameters λ for θ; p is the posterior of the desired model, with that model's own parameters. The "difference" to minimize is the KL-divergence (an expectation):

D_{\mathrm{KL}}\big( q(\theta) \,\|\, p(\theta \mid x) \big) = \mathbb{E}_{q(\theta)}\left[ \log \frac{q(\theta)}{p(\theta \mid x)} \right]
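A quick illustration of the KL-divergence as an expectation under q (mine, not the slides'): for two small made-up discrete distributions, the exact sum and a Monte Carlo estimate of E_q[log q/p] agree:

```python
# Exact vs. Monte Carlo KL-divergence between two discrete distributions.
import numpy as np

rng = np.random.default_rng(2)
q = np.array([0.6, 0.3, 0.1])   # "easy" distribution
p = np.array([0.4, 0.4, 0.2])   # stand-in for the posterior

kl_exact = np.sum(q * np.log(q / p))

samples = rng.choice(3, size=100_000, p=q)        # draw from q
kl_mc = np.mean(np.log(q[samples] / p[samples]))  # E_q[log q/p]
print(kl_exact, kl_mc)  # the two agree up to sampling noise
```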
Variational Inference

Find the best distribution: q carries the variational parameters for θ; p carries the desired model's parameters.

Exponential family recap: expectations are "easy" (the expectation of the sufficient statistics is the gradient of the log-normalizer), and posterior inference is "easy" (p is the conjugate prior for q). So when p and q have the same exponential family form, the variational update for q(θ) is (often) computable in closed form.
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t; let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
Variational Inference: Maximization or Minimization?

Evidence Lower Bound (ELBO):

\log p(x) = \log \int p(x, \theta) \, d\theta

= \log \int p(x, \theta) \frac{q(\theta)}{q(\theta)} \, d\theta

= \log \mathbb{E}_{q}\left[ \frac{p(x, \theta)}{q(\theta)} \right]

\geq \mathbb{E}_{q}[\log p(x, \theta)] - \mathbb{E}_{q}[\log q(\theta)] = \mathrm{ELBO}(q)

The inequality is Jensen's. Since KL[q || p(θ | x)] and the ELBO sum to the constant log p(x), minimizing the KL-divergence is the same as maximizing the ELBO.
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Bag-of-Items Models

"Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …"

p(document) reduces to unigram counts: Three: 1, people: 2, attack: 2, …

In the full model, global (corpus-level) parameters interact with local (document-level) parameters.
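A one-liner illustration of the bag-of-items representation (mine, using a fragment of the example document): the document reduces to its unigram counts, discarding word order.

```python
# Bag-of-words: a document becomes its unigram count vector.
from collections import Counter

doc = ("Three people have been fatally shot , and five people , including "
       "a mayor , were seriously wounded as a result of a Shining Path "
       "attack today against a community ...")
counts = Counter(doc.lower().split())
print(counts["people"], counts["a"])  # 2, 4 -- order is gone, counts remain
```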
Latent Dirichlet Allocation (Blei et al., 2003)

• Per-document (unigram) word counts: entry (i, j) is the count of word j in document i
• Per-document (latent) topic usage, over K topics
• Per-topic word usage
• Words ~ Multinomial; topic usage ~ Dirichlet; topic words ~ Dirichlet (regularize/place priors)
Variational Inference: LDirA (Latent Dirichlet Allocation)

Topic usage; per-document (unigram) word counts; topic words.

p: True model
\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha); \quad z_{(d,n)} \sim \mathrm{Discrete}(\theta^{(d)})
\phi_k \sim \mathrm{Dirichlet}(\beta); \quad w_{(d,n)} \sim \mathrm{Discrete}(\phi_{z_{(d,n)}})

q: Mean-field approximation
\theta^{(d)} \sim \mathrm{Dirichlet}(\gamma_d); \quad z_{(d,n)} \sim \mathrm{Discrete}(\phi^{(d,n)})
\phi_k \sim \mathrm{Dirichlet}(\lambda_k)
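A generative sketch of the "true model" p above (standard LDA; the sizes and hyperparameters below are made up for illustration):

```python
# Sample documents from the LDA generative story.
import numpy as np

rng = np.random.default_rng(3)
K, V, D, N = 3, 20, 5, 30        # topics, vocab size, documents, words/doc
alpha, beta = 0.5, 0.1           # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(beta * np.ones(V), size=K)   # phi_k: per-topic word usage
docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha * np.ones(K))  # theta_d: topic usage
    z = rng.choice(K, size=N, p=theta_d)         # z_{d,n}: topic per word slot
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_{d,n}: the words
    docs.append(w)
print(np.bincount(docs[0], minlength=V))  # per-document unigram word counts
```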
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t; let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
Variational Inference: LDirA

(p: true model; q: mean-field approximation, as above.)

One term of the objective: \mathbb{E}_{q(\theta^{(d)})}\big[ \log p(\theta^{(d)} \mid \alpha) \big]

Exponential family form of the Dirichlet:
p(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}

params = (\alpha_k - 1)_k; \quad suff. stats. = (\log \theta_k)_k
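A numeric check (my own) of the "easy expectations" fact for the Dirichlet: since the log θ_k are the sufficient statistics, E[log θ_k] is the gradient of the log-normalizer, which works out to digamma(α_k) − digamma(Σ_k α_k):

```python
# E[log theta] under a Dirichlet: digamma formula vs. Monte Carlo.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(4)
alpha = np.array([2.0, 1.5, 3.0])   # made-up Dirichlet parameters

analytic = digamma(alpha) - digamma(alpha.sum())
mc = np.log(rng.dirichlet(alpha, size=100_000)).mean(axis=0)
print(analytic, mc)  # agree up to Monte Carlo noise
```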
Under q, \theta^{(d)} is also Dirichlet, with params (\gamma_k - 1)_k and the same sufficient statistics (\log \theta_k)_k. Since the expectation of the sufficient statistics is the gradient of the log-normalizer:

\mathbb{E}_{q(\theta^{(d)})}\big[ \log p(\theta^{(d)} \mid \alpha) \big]
= \mathbb{E}_{q(\theta^{(d)})}\big[ (\alpha - 1)^\top \log \theta^{(d)} + c \big]
= (\alpha - 1)^\top \mathbb{E}_{q(\theta^{(d)})}\big[ \log \theta^{(d)} \big] + c
= (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + c

Adding the entropy of q gives this piece of the objective:

\mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + H(\gamma_d)

(there's more math to do!)
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t; let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
Variational Inference: LDirA

(p: true model; q: mean-field approximation, as above.)

\mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + H(\gamma_d)

\nabla_{\gamma_d} \mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla^2_{\gamma_d} A(\gamma_d - 1) + \nabla_{\gamma_d} H(\gamma_d)
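A sketch of the first gradient term (mine, under the Dirichlet facts above; α and γ values are made up): for Dirichlet natural parameters γ − 1, the log-normalizer has gradient ψ(γ_k) − ψ(Σ_k γ_k) and Hessian diag(ψ′(γ)) − ψ′(Σ_k γ_k)·J, so the slide's (α − 1)^⊤ ∇²A term is computable in closed form.

```python
# The (alpha - 1)^T Hessian-of-A piece of the gradient, via polygamma.
import numpy as np
from scipy.special import polygamma

alpha = np.array([0.5, 0.5, 0.5])   # model hyperparameters (made up)
gamma = np.array([1.2, 3.0, 0.7])   # current variational parameters (made up)

def grad_term(alpha, gamma):
    a = alpha - 1.0
    # Hessian of A: diag(trigamma(gamma_k)) - trigamma(sum gamma) everywhere.
    hess = np.diag(polygamma(1, gamma)) - polygamma(1, gamma.sum())
    return a @ hess                  # (alpha - 1)^T grad^2 A(gamma - 1)

print(grad_term(alpha, gamma))       # one piece of grad_gamma L(gamma_d)
```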