
Approximate Inference: Variational Inference

CMSC 678, UMBC

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference: Basic Technique; Example: Topic Models

Recap from last time


Graphical Models

Directed Models (Bayesian networks):

$p(x_1, x_2, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

Undirected Models (Markov random fields):

$p(x_1, x_2, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents.

$$p(x_i \mid x_{j \ne i}) = \frac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i}$$

$$= \frac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i} \qquad \text{(factorization of the graph)}$$

$$= \frac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i} \qquad \text{(factor out terms not dependent on } x_i\text{)}$$

the set of nodes needed to form the complete conditional for a variable $x_i$
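To make the blanket concrete, here is a minimal sketch (not from the slides; the chain and its CPTs are hypothetical) showing that the complete conditional of $x_2$ in the chain $x_1 \to x_2 \to x_3$ uses only the factors that touch its Markov blanket:

```python
import numpy as np

# Hypothetical CPTs for a binary chain x1 -> x2 -> x3
p_x2_given_x1 = np.array([[0.9, 0.1],   # rows index x1, columns index x2
                          [0.3, 0.7]])
p_x3_given_x2 = np.array([[0.8, 0.2],   # rows index x2, columns index x3
                          [0.4, 0.6]])

def complete_conditional_x2(x1, x3):
    # p(x2 | x1, x3) is proportional to p(x2 | x1) * p(x3 | x2): the p(x1)
    # term cancels because it does not depend on x2 (the "factor out" step)
    unnorm = p_x2_given_x1[x1, :] * p_x3_given_x2[:, x3]
    return unnorm / unnorm.sum()

print(complete_conditional_x2(x1=0, x3=1))   # a distribution over x2
```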

Markov Random Fields with Factor Graph Notation

[Figure: image-denoising Markov random field in factor graph notation. x: original pixel/state; y: observed (noisy) pixel/state. Factor nodes are added according to maximal cliques: unary factors attach to single variables, binary factors to pairs. Factor graphs are bipartite.]

Two Problems for Undirected Models

Finding the normalizer

$Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals

$Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$

Q: Why are these difficult?

A: The sums range over exponentially many variable combinations.

Sum over all variable combinations, with the $x_n$ coordinate fixed.

$Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c(x = (x_1, v, x_3))$

Example: 3 variables, fix the 2nd dimension

Belief propagation algorithms

• sum-product (forward-backward in HMMs)

• max-product/max-sum (Viterbi)
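Before the message-passing details, a minimal brute-force sketch of the two quantities from the previous slide (assuming a hypothetical 3-variable binary chain with pairwise potentials $\psi_{12}, \psi_{23}$); enumerating every assignment is exactly the exponential cost these algorithms avoid:

```python
import itertools
import numpy as np

# Hypothetical pairwise potentials for a binary chain x1 - x2 - x3
psi12 = np.array([[1.0, 0.5], [0.5, 2.0]])
psi23 = np.array([[2.0, 1.0], [1.0, 0.5]])

def weight(x1, x2, x3):
    return psi12[x1, x2] * psi23[x2, x3]

# Normalizer: sum over all 2^3 assignments
Z = sum(weight(*x) for x in itertools.product([0, 1], repeat=3))

# Z_2(v): the same sum with the 2nd coordinate fixed to v
def Z2(v):
    return sum(weight(x1, v, x3) for x1 in [0, 1] for x3 in [0, 1])

print(Z, Z2(0), Z2(1))        # marginal: p(x2 = v) = Z_2(v) / Z
print(Z2(0) / Z + Z2(1) / Z)  # the marginal sums to 1
```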

Sum-Product

From variables to factors

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$

From factors to variables

$r_{m \to n}(x_n) = \sum_{\boldsymbol{w}_m \setminus n} f_m(\boldsymbol{w}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$


$N(m)$: the set of variables that the $m$th factor depends on

$M(n)$: the set of factors in which variable $n$ participates

$\sum_{\boldsymbol{w}_m \setminus n}$: sum over configurations of the variables for the $m$th factor, with variable $n$ fixed

Products default to a value of 1 if empty.
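A minimal sketch of both message types on the smallest interesting factor graph (two binary variables with hypothetical unary factors $g_1, g_2$ and one binary factor $f$):

```python
import numpy as np

g1 = np.array([0.6, 0.4])      # unary factor on x1
g2 = np.array([0.3, 0.7])      # unary factor on x2
f  = np.array([[1.0, 0.5],     # binary factor f(x1, x2)
               [0.5, 1.0]])

# Variable-to-factor messages q: product of incoming factor messages;
# each variable's only other neighbor is its unary factor
q1_to_f = g1
q2_to_f = g2

# Factor-to-variable messages r: multiply in the incoming q messages,
# then sum out every variable except the recipient
r_f_to_1 = f @ q2_to_f         # r(x1) = sum_{x2} f(x1, x2) q_{2->f}(x2)
r_f_to_2 = f.T @ q1_to_f       # r(x2) = sum_{x1} f(x1, x2) q_{1->f}(x1)

# Belief at x1: product of all incoming factor messages, normalized
b1 = g1 * r_f_to_1
print(b1 / b1.sum())           # the exact marginal p(x1)
```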

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference: Basic Technique; Example: Topic Models

Goal: Posterior Inference

Hyperparameters: α. Unknown parameters: Θ. Data: x.

Likelihood model: p(x | Θ)

Goal: the posterior p_α(Θ | x)

we’re going to be Bayesian (perform Bayesian inference)

Posterior Classification vs. Posterior Inference

“Frequentist” methods

prior over labels (maybe), not weights

Bayesian methods

Θ includes weight parameters

p_α(Θ | x) (Bayesian) vs. p_{α,w}(y | x) (frequentist)

(Some) Learning Techniques

MAP/MLE: point estimation, basic EM (what we've already covered)

Variational Inference: functional optimization (today)

Sampling/Monte Carlo (next class)

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference: Basic Technique; Example: Topic Models

Exponential Family Form

$p(x \mid \eta) = h(x)\, \exp\!\left(\eta^\top \phi(x) - A(\eta)\right)$

• $h(x)$: support function (formally necessary, in practice irrelevant)

• $\eta$: distribution parameters (natural parameters; feature weights)

• $\phi(x)$: feature function(s) (sufficient statistics)

• $A(\eta)$: log-normalizer

Why? Capture Common Distributions

• Discrete (finite distributions)

• Dirichlet (distributions over (finite) distributions)

• Gaussian

• Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, 


Why? “Easy” Gradients

Observed feature counts: counts w.r.t. the empirical distribution

Expected feature counts: counts w.r.t. the current model parameters

(we’ve already seen this with maxent models)
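Spelling out the identity the slide refers to (a standard exponential-family fact, written here for a single observation $x$; for a dataset, sum over observations):

```latex
\nabla_\eta \log p(x \mid \eta)
  = \phi(x) - \nabla_\eta A(\eta)
  = \underbrace{\phi(x)}_{\text{observed counts}}
  - \underbrace{\mathbb{E}_{\eta}\left[\phi(X)\right]}_{\text{expected counts}}
```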

Why? “Easy” Expectations

the expectation of the sufficient statistics is the gradient of the log-normalizer: $\mathbb{E}[\phi(x)] = \nabla_\eta A(\eta)$
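A minimal numerical check of this identity for the Bernoulli (natural parameter $\eta$, sufficient statistic $\phi(x) = x$, log-normalizer $A(\eta) = \log(1 + e^\eta)$); the parameter value is hypothetical:

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))       # Bernoulli log-normalizer

eta = 1.3                                   # an arbitrary natural parameter
eps = 1e-6
grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)   # numerical dA/d(eta)
expected_x = 1.0 / (1.0 + np.exp(-eta))              # E[x] = sigmoid(eta)

print(grad_A, expected_x)                   # agree to ~1e-10
```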

Why? “Easy” Posterior Inference

• p is the conjugate prior for q

• the posterior p has the same form as the prior p

• all exponential family models have a conjugate prior (in theory)

Posterior           Likelihood             Prior
Dirichlet (Beta)    Discrete (Bernoulli)   Dirichlet (Beta)
Normal              Normal (fixed var.)    Normal
Gamma               Exponential            Gamma
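A minimal sketch of the table's first row in action (the Beta case, i.e. the 2-dimensional Dirichlet): with a Bernoulli likelihood, the Beta posterior is just a count update, shown on hypothetical data:

```python
a, b = 2.0, 2.0                 # Beta(a, b) prior (hypothetical values)
data = [1, 0, 1, 1, 0, 1]       # Bernoulli observations
heads = sum(data)

a_post = a + heads              # posterior is Beta(a + #ones, b + #zeros):
b_post = b + len(data) - heads  # the same family as the prior
print(a_post, b_post)           # Beta(6.0, 4.0)
```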

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference: Basic Technique; Example: Topic Models

Goal: Posterior Inference

Hyperparameters: α. Unknown parameters: Θ. Data: x.

Likelihood model: p(x | Θ)

Goal: the posterior p_α(Θ | x)

(Some) Learning Techniques

MAP/MLE: point estimation, basic EM (what we've already covered)

Variational Inference: functional optimization (today)

Sampling/Monte Carlo (next class)

Variational Inference

The posterior p(θ | x): difficult to compute.

An approximating distribution q(θ), controlled by parameters λ: easy(ier) to compute.

Minimize the "difference" between them by changing λ.

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0. Pick a starting value λ_t.

Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1

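A minimal sketch of this loop, with a hypothetical concave objective F standing in for the real variational objective and a fixed scaling factor ρ_t:

```python
import numpy as np

target = np.array([2.0, -1.0])                 # hypothetical optimum
F      = lambda lam: -np.sum((lam - target) ** 2)
grad_F = lambda lam: -2.0 * (lam - target)

lam = np.zeros(2)                              # starting value lambda_0
for t in range(1000):
    y = F(lam)                                 # step 1: value (for monitoring)
    g = grad_F(lam)                            # step 2: gradient
    rho = 0.1                                  # step 3: scaling factor
    lam = lam + rho * g                        # step 4: gradient step
    if np.linalg.norm(g) < 1e-8:               # "until converged"
        break
print(lam)                                     # ~ target
```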

Variational Inference: The Function to Optimize

• the posterior of the desired model, p(θ | x), with the parameters for the desired model

• any easy-to-compute distribution q(θ; λ), with variational parameters λ for θ

Find the best distribution q (calculus of variations).

KL-Divergence (an expectation)

$D_{\mathrm{KL}}\!\left(q(\theta)\,\|\,p(\theta \mid x)\right) = \mathbb{E}_{q(\theta)}\!\left[\log \frac{q(\theta)}{p(\theta \mid x)}\right]$
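A minimal sketch evaluating this expectation for two hypothetical discrete distributions (q plays the role of the approximation):

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])         # approximating distribution
p = np.array([0.4, 0.4, 0.2])         # target distribution

kl = np.sum(q * np.log(q / p))        # E_q[log q - log p]
print(kl)                             # >= 0, and 0 iff q == p everywhere
```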

Variational Inference

Find the best distribution: variational parameters λ for θ, given the parameters for the desired model.

Exponential Family Recap: “Easy” Expectations

Exponential Family Recap: “Easy” Posterior Inference

p is the conjugate prior for q

Variational Inference

Find the best distribution

When p and q are the same exponential family form, the variational update q(θ) is (often) computable in closed form.

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0. Pick a starting value λ_t. Let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].

Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1

Variational Inference: Maximization or Minimization?

Evidence Lower Bound (ELBO)

$$\log p(x) = \log \int p(x, \theta)\, d\theta$$

$$= \log \int p(x, \theta)\, \frac{q(\theta)}{q(\theta)}\, d\theta$$

$$= \log \mathbb{E}_{q(\theta)}\!\left[\frac{p(x, \theta)}{q(\theta)}\right]$$

$$\ge \mathbb{E}_{q(\theta)}\!\left[\log p(x, \theta)\right] - \mathbb{E}_{q(\theta)}\!\left[\log q(\theta)\right] = \mathcal{L}(q)$$

where the last step is Jensen's inequality.
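A minimal Monte Carlo sketch of the bound for a hypothetical conjugate toy model, $p(x, \theta) = \mathcal{N}(\theta; 0, 1)\,\mathcal{N}(x; \theta, 1)$ with $q = \mathcal{N}(m, s^2)$; the ELBO estimate stays below the exact $\log p(x)$:

```python
import numpy as np
from scipy.stats import norm

x, m, s = 1.5, 1.0, 0.8                       # observation and q's parameters
theta = np.random.default_rng(0).normal(m, s, size=100_000)  # theta ~ q

log_p = norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)  # log p(x, theta)
log_q = norm.logpdf(theta, m, s)

elbo = np.mean(log_p - log_q)                 # E_q[log p] - E_q[log q]
print(elbo, norm.logpdf(x, 0, np.sqrt(2)))    # ELBO <= exact log p(x)
```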

Outline

Recap of graphical models & belief propagation

Posterior inference (Bayesian perspective)

Math: exponential family distributions

Variational Inference: Basic Technique; Example: Topic Models

Bag-of-Items Models

"Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region 
"

p(doc) = unigram counts: Three: 1, people: 2, attack: 2, 


p_{φ,ω}(doc) = unigram counts, where global (corpus-level) parameters interact with local (document-level) parameters

Latent Dirichlet Allocation (Blei et al., 2003)

[Figure: the matrix of per-document (unigram) word counts, with entry (i, j) the count of word j in document i, decomposes into per-document (latent) topic usage and per-topic word usage, with K topics. Word counts ~ Multinomial; topic usage ~ Dirichlet; word usage ~ Dirichlet (regularize/place priors).]


Variational Inference: LDirA

[Figure: topic usage, per-document (unigram) word counts, topic words.]

p: True model

$\phi_k \sim \mathrm{Dirichlet}(\beta) \qquad w^{(d,n)} \sim \mathrm{Discrete}\!\left(\phi_{z^{(d,n)}}\right)$

$\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha) \qquad z^{(d,n)} \sim \mathrm{Discrete}\!\left(\theta^{(d)}\right)$
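A minimal sketch of this generative story for a single document (the sizes K, V, N and the hyperparameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20                   # topics, vocabulary size, tokens

beta  = np.full(V, 0.1)               # hyperparameter beta
alpha = np.full(K, 0.5)               # hyperparameter alpha

phi   = rng.dirichlet(beta, size=K)   # phi_k ~ Dirichlet(beta), one per topic
theta = rng.dirichlet(alpha)          # theta^(d) ~ Dirichlet(alpha)
z = rng.choice(K, size=N, p=theta)    # z^(d,n) ~ Discrete(theta^(d))
w = np.array([rng.choice(V, p=phi[k]) for k in z])  # w^(d,n) ~ Discrete(phi_z)
print(w)                              # one sampled document, as word ids
```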

Variational Inference: LDirA

p: True model

$\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha) \qquad z^{(d,n)} \sim \mathrm{Discrete}\!\left(\theta^{(d)}\right) \qquad \phi_k \sim \mathrm{Dirichlet}(\beta) \qquad w^{(d,n)} \sim \mathrm{Discrete}\!\left(\phi_{z^{(d,n)}}\right)$

q: Mean-field approximation

$\theta^{(d)} \sim \mathrm{Dirichlet}(\gamma_d) \qquad z^{(d,n)} \sim \mathrm{Discrete}\!\left(\psi^{(d,n)}\right) \qquad \phi_k \sim \mathrm{Dirichlet}(\lambda_k)$

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0. Pick a starting value λ_t. Let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].

Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1

Variational Inference: LDirA

p: $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$, $z^{(d,n)} \sim \mathrm{Discrete}(\theta^{(d)})$ 
 q: $\theta^{(d)} \sim \mathrm{Dirichlet}(\gamma_d)$, $z^{(d,n)} \sim \mathrm{Discrete}(\psi^{(d,n)})$

Consider one term of the objective, the expected log prior of the topic proportions:

$$\mathbb{E}_{q(\theta^{(d)})}\!\left[\log p(\theta^{(d)} \mid \alpha)\right] = \mathbb{E}_{q(\theta^{(d)})}\!\left[(\alpha - 1)^\top \log \theta^{(d)}\right] + C$$

using the exponential family form of the Dirichlet:

$$p(\theta) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}, \qquad \text{params} = (\alpha_k - 1)_k, \quad \text{suff. stats} = (\log \theta_k)_k$$
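The expected sufficient statistics of a Dirichlet have a standard closed form, $\mathbb{E}[\log \theta_k] = \psi(\gamma_k) - \psi(\sum_j \gamma_j)$ with $\psi$ the digamma function; a minimal sanity check with hypothetical parameters:

```python
import numpy as np
from scipy.special import digamma

gamma = np.array([2.0, 3.0, 5.0])     # hypothetical Dirichlet parameters

analytic = digamma(gamma) - digamma(gamma.sum())   # E[log theta_k]
samples = np.random.default_rng(0).dirichlet(gamma, size=200_000)

print(analytic)
print(np.log(samples).mean(axis=0))   # matches to ~3 decimal places
```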

Variational Inference: LDirA

Since $q(\theta^{(d)})$ is $\mathrm{Dirichlet}(\gamma_d)$, with params $(\gamma_k - 1)_k$ and suff. stats $(\log \theta_k)_k$, the expectation above is an expectation of q's sufficient statistics. Pulling the constants out:

$$\mathbb{E}_{q(\theta^{(d)})}\!\left[(\alpha - 1)^\top \log \theta^{(d)}\right] + C = (\alpha - 1)^\top\, \mathbb{E}_{q(\theta^{(d)})}\!\left[\log \theta^{(d)}\right] + C$$

and since the expectation of the sufficient statistics is the gradient of the log-normalizer:

$$= (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + C$$

Putting this term back into the objective:

$$\mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + M(\gamma_d)$$

where $M(\gamma_d)$ collects the remaining terms. There's more math to do!

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0. Pick a starting value λ_t. Let F(q(·; λ_t)) = KL[q(·; λ_t) || p(·)].

Until converged:
1. Get value y_t = F(q(·; λ_t))
2. Get gradient g_t = F′(q(·; λ_t))
3. Get scaling factor ρ_t
4. Set λ_{t+1} = λ_t + ρ_t · g_t
5. Set t += 1

Variational Inference: LDirA

$$\mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla_{\gamma_d} A(\gamma_d - 1) + M(\gamma_d)$$

$$\nabla_{\gamma_d} \mathcal{L}(\gamma_d) = (\alpha - 1)^\top \nabla^2_{\gamma_d} A(\gamma_d - 1) + \nabla_{\gamma_d} M(\gamma_d)$$
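For context on where the remaining math lands: carrying it through and setting this gradient to zero yields the closed-form coordinate update for $\gamma_d$ from Blei et al. (2003), $\gamma_d = \alpha + \sum_n \psi^{(d,n)}$. A minimal sketch with hypothetical values:

```python
import numpy as np

alpha = np.array([0.5, 0.5, 0.5])     # Dirichlet hyperparameter (K = 3)
psi = np.array([[0.7, 0.2, 0.1],      # psi^(d,n): per-token topic
                [0.1, 0.8, 0.1],      # responsibilities (each row sums to 1)
                [0.3, 0.3, 0.4]])

gamma_d = alpha + psi.sum(axis=0)     # closed-form coordinate update
print(gamma_d)                        # [1.6, 1.8, 1.1]
```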
