Approximate Inference: Variational Inference
CMSC 678, UMBC
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Recap from last time…
Graphical Models

Directed Models (Bayesian networks):
p(x_1, x_2, x_3, \dots, x_N) = \prod_i p(x_i \mid \mathrm{pa}(x_i))

Undirected Models (Markov random fields):
p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)
Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents.

p(x_i \mid x_{\neg i}) = \frac{p(x_1, \dots, x_N)}{\int p(x_1, \dots, x_N) \, dx_i}

= \frac{\prod_j p(x_j \mid \mathrm{pa}(x_j))}{\int \prod_j p(x_j \mid \mathrm{pa}(x_j)) \, dx_i}    (factorization of the graph)

= \frac{\prod_{j:\, j = i \text{ or } x_i \in \mathrm{pa}(x_j)} p(x_j \mid \mathrm{pa}(x_j))}{\int \prod_{j:\, j = i \text{ or } x_i \in \mathrm{pa}(x_j)} p(x_j \mid \mathrm{pa}(x_j)) \, dx_i}    (factor out terms not dependent on x_i)

The Markov blanket is exactly the set of nodes needed to form the complete conditional for a variable x_i.
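As a sanity check on this claim, here is a minimal sketch (my own, not from the slides) on a hypothetical binary chain x1 → x2 → x3 → x4 with made-up conditional probability tables. The blanket of x2 is {x1, x3}, so additionally conditioning on x4 should not change the conditional:

```python
# Verify that conditioning on the Markov blanket suffices, by brute force.
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Random CPTs: p(x1), p(x2|x1), p(x3|x2), p(x4|x3), each over {0, 1}.
p1 = rng.dirichlet(np.ones(2))
p2 = rng.dirichlet(np.ones(2), size=2)  # p2[x1] is a distribution over x2
p3 = rng.dirichlet(np.ones(2), size=2)
p4 = rng.dirichlet(np.ones(2), size=2)

def joint(x1, x2, x3, x4):
    return p1[x1] * p2[x1][x2] * p3[x2][x3] * p4[x3][x4]

def conditional_x2(given):
    # p(x2 | given), computed by enumerating all consistent assignments.
    probs = np.zeros(2)
    for x2 in (0, 1):
        for x1, x3, x4 in itertools.product((0, 1), repeat=3):
            assign = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}
            if all(assign[k] == v for k, v in given.items()):
                probs[x2] += joint(x1, x2, x3, x4)
    return probs / probs.sum()

blanket = conditional_x2({"x1": 1, "x3": 0})
full = conditional_x2({"x1": 1, "x3": 0, "x4": 1})
print(blanket, full)  # identical: x4 is irrelevant given the blanket
```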
Markov Random Fields with Factor Graph Notation

x: original pixel/state; y: observed (noisy) pixel/state.

Factor nodes are added according to maximal cliques. Factor graphs are bipartite: variable nodes connect only to factor nodes, with unary factors touching a single variable and binary factors touching two.
Two Problems for Undirected Models

Finding the normalizer:
Z = \sum_x \prod_m \psi_m(x_m)

Computing the marginals:
p_n(v) = \sum_{x:\, x_n = v} \prod_m \psi_m(x_m)

Q: Why are these difficult?
A: Many different variable combinations to sum over.
Sum over all variable combinations, with the x_n coordinate fixed. Example with 3 variables, fixing the 2nd dimension:

p_2(v) = \sum_{x_1} \sum_{x_3} \prod_m \psi_m(x = (x_1, v, x_3))

Belief propagation algorithms:
⢠sum-product (forward-backward in HMMs)
⢠max-product/max-sum (Viterbi)
Sum-Product

From variables to factors:
\mu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)

From factors to variables:
\mu_{m \to n}(x_n) = \sum_{x_m \setminus x_n} \psi_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})

Here N(m) is the set of variables that the m-th factor depends on, and M(n) is the set of factors in which variable n participates. The factor-to-variable message sums over configurations of the m-th factor's variables with variable n fixed. An empty product takes the default value of 1.
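On the same made-up chain as above, the messages give the exact marginal in linear time. A minimal sketch of one sum-product pass (my own illustration under the notation above, not the course's code):

```python
# Sum-product on the chain x1 - psi12 - x2 - psi23 - x3, binary variables.
# Leaf-to-root messages in both directions give exact marginals on a tree.
import numpy as np

psi12 = np.array([[2.0, 1.0], [1.0, 3.0]])  # factor over (x1, x2)
psi23 = np.array([[1.0, 4.0], [2.0, 1.0]])  # factor over (x2, x3)

# Variable-to-factor messages from the leaves: empty product, so all ones.
mu_x1_to_f12 = np.ones(2)
mu_x3_to_f23 = np.ones(2)

# Factor-to-variable: sum out the other variable, weighted by its message.
mu_f12_to_x2 = psi12.T @ mu_x1_to_f12   # sums over x1
mu_f23_to_x2 = psi23 @ mu_x3_to_f23     # sums over x3

# Marginal of x2: product of incoming messages, then normalize.
p2 = mu_f12_to_x2 * mu_f23_to_x2
p2 /= p2.sum()
print(p2)  # [15/27, 12/27], matching the brute-force result above
```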
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference

Hyperparameters α; unknown parameters Θ; observed data.

Likelihood model: p(data | Θ). Goal: the posterior p_α(Θ | data).

We're going to be Bayesian (perform Bayesian inference).
Posterior Classification vs. Posterior Inference

"Frequentist" methods: a prior over labels (maybe), not weights — p_{α,w}(y | data).
Bayesian methods: Θ includes the weight parameters — p_α(Θ | data).
(Some) Learning Techniques

• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Exponential Family Form

p(x \mid \eta) = h(x) \exp\big( \eta^\top \phi(x) - A(\eta) \big)

• h(x): support function — formally necessary, in practice irrelevant
• η: distribution parameters — natural parameters, i.e., feature weights
• φ(x): feature function(s) — the sufficient statistics
• A(η): the log-normalizer
Why? Capture Common Distributions

• Discrete (finite distributions)
• Gaussian (illustration: https://kanbanize.com/blog/wp-content/uploads/2014/07/Standard_deviation_diagram.png)
• Dirichlet (distributions over (finite) distributions)
• Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …
Why? "Easy" Gradients

The gradient of the log-likelihood is the difference between observed feature counts (counts w.r.t. the empirical distribution) and expected feature counts (counts w.r.t. the current model parameters) — we've already seen this with maxent models.

Why? "Easy" Expectations

The expectation of the sufficient statistics is the gradient of the log-normalizer:
\mathbb{E}[\phi(x)] = \nabla_\eta A(\eta)
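A small numeric check of this fact (my own, not from the slides), using the Bernoulli: in exponential family form φ(x) = x and A(η) = log(1 + e^η), so the expected sufficient statistic E[x] should equal A′(η) = sigmoid(η):

```python
# Expected sufficient statistic vs. gradient of the log-normalizer (Bernoulli).
import numpy as np

eta = 0.7                              # arbitrary natural parameter
A = lambda e: np.log1p(np.exp(e))      # log-normalizer A(eta) = log(1 + e^eta)

# Finite-difference gradient of A at eta.
eps = 1e-6
grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)

# Direct expectation: p(x = 1) = sigmoid(eta), phi(x) = x.
p1 = 1 / (1 + np.exp(-eta))
expected_phi = 0 * (1 - p1) + 1 * p1

print(grad_A, expected_phi)  # both equal sigmoid(0.7) ~ 0.668
```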
Why? "Easy" Posterior Inference

• p is the conjugate prior for q
• The posterior p has the same form as the prior p
• All exponential family models have a conjugate prior (in theory)

Posterior        | Likelihood           | Prior
Dirichlet (Beta) | Discrete (Bernoulli) | Dirichlet (Beta)
Normal           | Normal (fixed var.)  | Normal
Gamma            | Exponential          | Gamma
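A conjugacy sketch for the first row of the table (standard Beta–Bernoulli math; the data and pseudo-counts below are made up): with prior Beta(a, b) and n flips with k heads, the posterior is Beta(a + k, b + n − k), the same family as the prior.

```python
# Closed-form conjugate update: Beta prior, Bernoulli likelihood.
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                         # Beta prior pseudo-counts
flips = rng.binomial(1, 0.3, size=50)   # Bernoulli data, true bias 0.3
k, n = flips.sum(), len(flips)

post_a, post_b = a + k, b + (n - k)     # posterior is again a Beta
print(post_a, post_b, post_a / (post_a + post_b))  # posterior mean ~ 0.3
```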
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference

Hyperparameters α; unknown parameters Θ; observed data. Likelihood model: p(data | Θ). Goal: the posterior p_α(Θ | data).
(Some) Learning Techniques

• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Variational Inference

The posterior is difficult to compute. Introduce q(θ), an easy(ier)-to-compute distribution controlled by parameters λ, and minimize the "difference" between q and the posterior by changing λ.
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t.
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
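Concretely, the loop looks like this. The objective F below is a made-up one-dimensional stand-in, not a real variational bound; the point is only the structure of the five steps:

```python
# Generic gradient loop mirroring the pseudocode above.
import numpy as np

def F(lam):                  # hypothetical objective over variational params
    return -(lam - 3.0) ** 2

def F_grad(lam):
    return -2.0 * (lam - 3.0)

lam, t = 0.0, 0
while True:
    y = F(lam)               # 1. value
    g = F_grad(lam)          # 2. gradient
    rho = 0.1                # 3. scaling factor (fixed step size here)
    lam = lam + rho * g      # 4. update
    t += 1                   # 5. increment
    if abs(g) < 1e-8 or t > 1000:
        break
print(t, lam)                # converges to lam = 3
```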
Variational Inference: The Function to Optimize

Find the best distribution q (calculus of variations): q is any easy-to-compute distribution, with variational parameters λ for θ; p is the posterior of the desired model, with that model's own parameters. The "difference" to minimize is the KL-divergence (an expectation):

D_{\mathrm{KL}}\big( q(\theta) \,\|\, p(\theta \mid x) \big) = \mathbb{E}_{q(\theta)}\left[ \log \frac{q(\theta)}{p(\theta \mid x)} \right]
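A quick illustration of the KL-divergence as an expectation under q (mine, not the slides'): for two small made-up discrete distributions, the exact sum and a Monte Carlo estimate of E_q[log q/p] agree:

```python
# Exact vs. Monte Carlo KL-divergence between two discrete distributions.
import numpy as np

rng = np.random.default_rng(2)
q = np.array([0.6, 0.3, 0.1])   # "easy" distribution
p = np.array([0.4, 0.4, 0.2])   # stand-in for the posterior

kl_exact = np.sum(q * np.log(q / p))

samples = rng.choice(3, size=100_000, p=q)        # draw from q
kl_mc = np.mean(np.log(q[samples] / p[samples]))  # E_q[log q/p]
print(kl_exact, kl_mc)  # the two agree up to sampling noise
```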
Variational Inference

Find the best distribution: q carries the variational parameters for θ; p carries the desired model's parameters.

Exponential family recap: expectations are "easy" (the expectation of the sufficient statistics is the gradient of the log-normalizer), and posterior inference is "easy" (p is the conjugate prior for q). So when p and q have the same exponential family form, the variational update for q(θ) is (often) computable in closed form.
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t; let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
Variational Inference: Maximization or Minimization?

Evidence Lower Bound (ELBO):

\log p(x) = \log \int p(x, \theta) \, d\theta

= \log \int p(x, \theta) \frac{q(\theta)}{q(\theta)} \, d\theta

= \log \mathbb{E}_{q}\left[ \frac{p(x, \theta)}{q(\theta)} \right]

\geq \mathbb{E}_{q}[\log p(x, \theta)] - \mathbb{E}_{q}[\log q(\theta)] = \mathrm{ELBO}(q)

The inequality is Jensen's. Since KL[q || p(θ | x)] and the ELBO sum to the constant log p(x), minimizing the KL-divergence is the same as maximizing the ELBO.
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Bag-of-Items Models

"Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …"

p(document) reduces to unigram counts: Three: 1, people: 2, attack: 2, …

In the full model, global (corpus-level) parameters interact with local (document-level) parameters.
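A one-liner illustration of the bag-of-items representation (mine, using a fragment of the example document): the document reduces to its unigram counts, discarding word order.

```python
# Bag-of-words: a document becomes its unigram count vector.
from collections import Counter

doc = ("Three people have been fatally shot , and five people , including "
       "a mayor , were seriously wounded as a result of a Shining Path "
       "attack today against a community ...")
counts = Counter(doc.lower().split())
print(counts["people"], counts["a"])  # 2, 4 -- order is gone, counts remain
```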
Latent Dirichlet Allocation (Blei et al., 2003)

• Per-document (unigram) word counts: entry (i, j) is the count of word j in document i
• Per-document (latent) topic usage, over K topics
• Per-topic word usage
• Words ~ Multinomial; topic usage ~ Dirichlet; topic words ~ Dirichlet (regularize/place priors)
Variational Inference: LDirA (Latent Dirichlet Allocation)

Topic usage; per-document (unigram) word counts; topic words.

p: True model
\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha); \quad z_{(d,n)} \sim \mathrm{Discrete}(\theta^{(d)})
\phi_k \sim \mathrm{Dirichlet}(\beta); \quad w_{(d,n)} \sim \mathrm{Discrete}(\phi_{z_{(d,n)}})

q: Mean-field approximation
\theta^{(d)} \sim \mathrm{Dirichlet}(\gamma_d); \quad z_{(d,n)} \sim \mathrm{Discrete}(\phi^{(d,n)})
\phi_k \sim \mathrm{Dirichlet}(\lambda_k)
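A generative sketch of the "true model" p above (standard LDA; the sizes and hyperparameters below are made up for illustration):

```python
# Sample documents from the LDA generative story.
import numpy as np

rng = np.random.default_rng(3)
K, V, D, N = 3, 20, 5, 30        # topics, vocab size, documents, words/doc
alpha, beta = 0.5, 0.1           # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(beta * np.ones(V), size=K)   # phi_k: per-topic word usage
docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha * np.ones(K))  # theta_d: topic usage
    z = rng.choice(K, size=N, p=theta_d)         # z_{d,n}: topic per word slot
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w_{d,n}: the words
    docs.append(w)
print(np.bincount(docs[0], minlength=V))  # per-document unigram word counts
```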
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t; let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
Variational Inference: LDirA

(p: true model; q: mean-field approximation, as above.)

One term of the objective: \mathbb{E}_{q(\theta^{(d)})}\big[ \log p(\theta^{(d)} \mid \alpha) \big]

Exponential family form of the Dirichlet:
p(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}

params = (\alpha_k - 1)_k; \quad suff. stats. = (\log \theta_k)_k
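A numeric check (my own) of the "easy expectations" fact for the Dirichlet: since the log θ_k are the sufficient statistics, E[log θ_k] is the gradient of the log-normalizer, which works out to digamma(α_k) − digamma(Σ_k α_k):

```python
# E[log theta] under a Dirichlet: digamma formula vs. Monte Carlo.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(4)
alpha = np.array([2.0, 1.5, 3.0])   # made-up Dirichlet parameters

analytic = digamma(alpha) - digamma(alpha.sum())
mc = np.log(rng.dirichlet(alpha, size=100_000)).mean(axis=0)
print(analytic, mc)  # agree up to Monte Carlo noise
```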
Under q, \theta^{(d)} is also Dirichlet, with params (\gamma_k - 1)_k and the same sufficient statistics (\log \theta_k)_k. Since the expectation of the sufficient statistics is the gradient of the log-normalizer:

\mathbb{E}_{q(\theta^{(d)})}\big[ \log p(\theta^{(d)} \mid \alpha) \big]
= \mathbb{E}_{q(\theta^{(d)})}\big[ (\alpha - 1)^\top \log \theta^{(d)} + c \big]
= (\alpha - 1)^\top \mathbb{E}_{q(\theta^{(d)})}\big[ \log \theta^{(d)} \big] + c
= (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + c

Adding the entropy of q gives this piece of the objective:

\mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + H(\gamma_d)

(there's more math to do!)
Variational Inference: A Gradient-Based Optimization Technique

Set t = 0; pick a starting value λ_t; let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].
Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1
Variational Inference: LDirA

(p: true model; q: mean-field approximation, as above.)

\mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + H(\gamma_d)

\nabla_{\gamma_d} \mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla^2_{\gamma_d} A(\gamma_d - 1) + \nabla_{\gamma_d} H(\gamma_d)
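A sketch of the first gradient term (mine, under the Dirichlet facts above; α and γ values are made up): for Dirichlet natural parameters γ − 1, the log-normalizer has gradient ψ(γ_k) − ψ(Σ_k γ_k) and Hessian diag(ψ′(γ)) − ψ′(Σ_k γ_k)·J, so the slide's (α − 1)^⊤ ∇²A term is computable in closed form.

```python
# The (alpha - 1)^T Hessian-of-A piece of the gradient, via polygamma.
import numpy as np
from scipy.special import polygamma

alpha = np.array([0.5, 0.5, 0.5])   # model hyperparameters (made up)
gamma = np.array([1.2, 3.0, 0.7])   # current variational parameters (made up)

def grad_term(alpha, gamma):
    a = alpha - 1.0
    # Hessian of A: diag(trigamma(gamma_k)) - trigamma(sum gamma) everywhere.
    hess = np.diag(polygamma(1, gamma)) - polygamma(1, gamma.sum())
    return a @ hess                  # (alpha - 1)^T grad^2 A(gamma - 1)

print(grad_term(alpha, gamma))       # one piece of grad_gamma L(gamma_d)
```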