
Page 1:

Computer Vision: Models, Learning and Inference –

Mixture Models, Part 3

Oren Freifeld and Ron Shapira-Weber

Computer Science, Ben-Gurion University

April 29, 2019

www.cs.bgu.ac.il/~cv192/

Page 2:

1. Bayesian GMM: Modeling
   - The Categorical Distribution
   - The Dirichlet Distribution
   - The Normal-Inverse Wishart Distribution

2. Bayesian GMM: Inference
   - Hard Problem with Easy Conditional Subproblems
   - Gibbs Sampling in Bayesian GMM

3. Demos: Gibbs-sampling Inference in Bayesian GMM
   - Categorical Data Likelihood and a Dirichlet Prior
   - Normal Data Likelihood and an NIW Prior
   - Gibbs Sampling

4. Conjugate Priors

5. Beyond iid: Bayesian Connectivity-constrained superpixels

Page 3:

Bayesian GMM: Modeling

Bayesian GMM

$$
\theta = \big((\mu_j)_{j=1}^K,\ (\Sigma_j)_{j=1}^K,\ (\pi_j)_{j=1}^K\big), \qquad
\theta \sim p(\theta), \qquad
x_i \overset{\text{iid}}{\sim} p(x_i \mid \theta), \qquad
D \triangleq (x_i)_{i=1}^N
$$

The joint pdf:

$$
\begin{aligned}
p(D, \theta) &= p(D \mid \theta)\, p(\theta)
= p(\theta) \prod_{i=1}^N p(x_i \mid \theta)
= p(\theta) \prod_{i=1}^N \sum_{z_i=1}^K p(x_i, z_i \mid \theta) \\
&= p(\theta) \prod_{i=1}^N \sum_{j=1}^K p(x_i, z_i = j \mid \theta)
= p(\theta) \prod_{i=1}^N \sum_{j=1}^K p(x_i \mid z_i = j, \theta)\, p(z_i = j \mid \theta) \\
&= p(\theta) \prod_{i=1}^N \sum_{j=1}^K p(x_i \mid z_i = j, \mu_j, \Sigma_j)\, \pi_j
= p(\theta) \prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
\end{aligned}
$$

Page 4:

Bayesian GMM: Modeling

Bayesian GMM

Thus:

$$
\theta = \big((\mu_j)_{j=1}^K,\ (\Sigma_j)_{j=1}^K,\ (\pi_j)_{j=1}^K\big), \qquad
p(D, \theta) = p(\theta) \prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
$$

Remark

Some authors prefer using $\theta = (\theta_j)_{j=1}^K = \big((\mu_j, \Sigma_j)\big)_{j=1}^K$ and $\pi = (\pi_j)_{j=1}^K$; in which case:

$$
p(D, \theta, \pi) = p(\theta, \pi) \prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
$$

Page 5:

Bayesian GMM: Modeling

Bayesian GMM

Typically:

$$
p(\theta) = p\big((\mu_j)_{j=1}^K,\ (\Sigma_j)_{j=1}^K,\ (\pi_j)_{j=1}^K\big)
\overset{\text{assumption}}{=}
\Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)
$$

where $\pi = (\pi_j)_{j=1}^K$. In effect,

$$
(\mu_1, \Sigma_1) \perp\!\!\!\perp (\mu_2, \Sigma_2) \perp\!\!\!\perp \dots \perp\!\!\!\perp (\mu_K, \Sigma_K) \perp\!\!\!\perp \pi
$$

Page 6:

Bayesian GMM: Modeling

Bayesian GMM

The posterior:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)
$$

$$
p(\theta \mid D) \propto p(\theta)\,
\underbrace{\prod_{i=1}^N \overbrace{\sum_{j=1}^K p(x_i \mid z_i = j, \mu_j, \Sigma_j)\, \pi_j}^{p(x_i \mid \theta)}}_{p(D \mid \theta)}
= \Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)
\prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
$$

Page 7:

Bayesian GMM: Modeling

Bayesian GMM

$$
p(\theta \mid D) \propto
\underbrace{\Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)}_{\text{prior}}\,
\underbrace{\prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j}_{\text{GMM data likelihood}}
$$

Need to specify p(π) and p(µj ,Σj).

Typically p(µj ,Σj) does not depend on j; i.e.,

$$(\mu_j, \Sigma_j) \overset{\text{iid}}{\sim} p(\mu, \Sigma)\,.$$

We will go with standard choices that simplify computations.

These choices are related to the concepts of conjugate priors and exponential families.

Page 8:

Bayesian GMM: Modeling The Categorical Distribution

Definition (categorical distribution)

A K-dimensional categorical distribution is a discrete distribution over a finite alphabet; WLOG, let this alphabet be {1, . . . , K}; in particular, z is said to be drawn from a K-dimensional categorical distribution parametrized by π = (π1, . . . , πK) if

$$
z \sim \mathrm{Cat}(z; \pi) \triangleq
\begin{cases}
\pi_z, & z \in \{1, \dots, K\} \\
0, & \text{otherwise}
\end{cases}
\qquad \text{i.e.,}\quad p(z = j) = \pi_j
$$

If $z = (z_i)_{i=1}^N$ and $z_i \overset{\text{iid}}{\sim} \mathrm{Cat}(z; \pi)$, then

$$
p(z \mid \pi) = \prod_{i=1}^N p(z_i \mid \pi)
= \prod_{i=1}^N \pi_{z_i}
= \prod_{i=1}^N \prod_{j=1}^K \pi_j^{1_{z_i = j}}
= \prod_{j=1}^K \prod_{i=1}^N \pi_j^{1_{z_i = j}}
= \prod_{j=1}^K \pi_j^{N_j}
$$

where $N_j = \sum_{i=1}^N 1_{z_i = j} = |\{i : z_i = j\}|$.

Page 9:

Bayesian GMM: Modeling The Categorical Distribution

Example

Suppose K = 5. Thus, π = (π1, π2, π3, π4, π5). Suppose N = 6 and we happened to observe

z = (z1, z2, z3, z4, z5, z6) = (1, 2, 1, 5, 3, 1)

Thus, N1 = 3, N2 = 1, N3 = 1, N4 = 0, N5 = 1, and

$$
p(z \mid \pi) = p(z_1, z_2, z_3, z_4, z_5, z_6 \mid \pi) = \prod_{j=1}^5 \pi_j^{N_j}
= \pi_1^3\, \pi_2^1\, \pi_3^1\, \pi_4^0\, \pi_5^1
= \pi_1^3\, \pi_2\, \pi_3\, \pi_5
$$
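A minimal NumPy check of this example (not part of the original slides); the value of pi below is an arbitrary illustrative point on the simplex.

# counts and categorical likelihood for the example above
import numpy as np

K = 5
pi = np.array([0.5, 0.2, 0.1, 0.1, 0.1])          # arbitrary point on the simplex
z = np.array([1, 2, 1, 5, 3, 1])                  # the observed labels

N_counts = np.bincount(z, minlength=K + 1)[1:]    # N_j = #{i : z_i = j}, j = 1..K
print(N_counts)                                   # -> [3 1 1 0 1]

likelihood = np.prod(pi ** N_counts)              # p(z | pi) = prod_j pi_j^{N_j}
print(np.isclose(likelihood, pi[0]**3 * pi[1] * pi[2] * pi[4]))   # True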

Page 10:

Bayesian GMM: Modeling The Dirichlet Distribution

Definition (simplex)

The (K − 1)-dimensional (probability/standard) simplex, a subset of $\mathbb{R}^K$, is

$$
\Big\{\pi = (\pi_1, \dots, \pi_K) : \sum_{j=1}^K \pi_j = 1 \ \text{ and }\ \pi_j \ge 0\ \ \forall j \in \{1, \dots, K\}\Big\}
$$

(in the figure: θ = (θ1, θ2, θ3) is a point on the 2-dimensional simplex)

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 11:

Bayesian GMM: Modeling The Dirichlet Distribution

Definition (Dirichlet distribution)

A K-dimensional Dirichlet distribution is a distribution over the (K − 1)-dimensional simplex such that

$$
\pi \sim \mathrm{Dir}(\pi; \alpha_1, \dots, \alpha_K) \triangleq
\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \prod_{j=1}^K \frac{\pi_j^{\alpha_j - 1}}{\Gamma(\alpha_j)}
= \underbrace{\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \Bigg(\prod_{j=1}^K \frac{1}{\Gamma(\alpha_j)}\Bigg)}_{\text{const w.r.t. } \pi}
\Bigg(\prod_{j=1}^K \pi_j^{\alpha_j - 1}\Bigg)
$$

where all αj ’s are positive real numbers (see next slide for definition of Γ).

A Dirichlet distribution serves as a prior over categorical distributions.

Page 12:

Bayesian GMM: Modeling The Dirichlet Distribution

Note the similarity between the form of the categorical data likelihood,

$$p(z \mid \pi) = \prod_{j=1}^K \pi_j^{N_j}$$

and the form of the Dirichlet-distribution prior

$$\pi \sim \mathrm{Dir}(\pi; \alpha_1, \dots, \alpha_K) \propto \prod_{j=1}^K \pi_j^{\alpha_j - 1}$$

Page 13:

Bayesian GMM: Modeling The Dirichlet Distribution

Definition (the Γ function)

$$\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$$

where t is real but not a non-positive integer.

If t is a positive integer, then Γ(t) = (t − 1)!

Page 14:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

$$
\pi \sim \mathrm{Dir}(\pi; \alpha_1, \dots, \alpha_K) \triangleq
\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \Bigg(\prod_{j=1}^K \frac{1}{\Gamma(\alpha_j)}\Bigg) \Bigg(\prod_{j=1}^K \pi_j^{\alpha_j - 1}\Bigg)
$$

Mode (assuming $\alpha_j > 1\ \forall j$):

$$\frac{\alpha_j - 1}{\big(\sum_{k=1}^K \alpha_k\big) - K} \qquad \forall j \in \{1, \dots, K\}$$

Mean:

$$\frac{\alpha_j}{\sum_{k=1}^K \alpha_k} \qquad \forall j \in \{1, \dots, K\}$$

If the $\alpha_j$'s are all equal (to some $\alpha$), the Dirichlet distribution is called symmetric and we write

$$
\pi \sim \mathrm{Dir}(\pi; \alpha) = \frac{\Gamma\big(\sum_{j=1}^K \alpha\big)}{\Gamma(\alpha)^K} \prod_{j=1}^K \pi_j^{\alpha - 1}
$$

Example

$$
\mathrm{Dir}(\pi; \alpha = 1) = \frac{\Gamma\big(\sum_{j=1}^K 1\big)}{\Gamma(1)^K} \prod_{j=1}^K 1 \quad (= \text{const w.r.t. } \pi)
$$

is the uniform distribution over the simplex.
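A quick numerical sanity check (ours, not from the slides) of the mean formula and of the effect of the concentration parameters; the specific α values are illustrative.

# sample symmetric/asymmetric Dirichlets and compare empirical vs. analytic means
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([2.0, 2.0, 2.0], [20.0, 2.0, 2.0], [0.1, 0.1, 0.1]):
    alpha = np.array(alpha)
    samples = rng.dirichlet(alpha, size=100_000)   # each row is a point on the simplex
    analytic_mean = alpha / alpha.sum()            # alpha_j / sum_k alpha_k
    print(alpha, samples.mean(axis=0).round(3), analytic_mean.round(3))
# Dir(pi; 1) is uniform over the simplex; small alpha (e.g., 0.1) yields mostly
# near-sparse draws, while large equal alpha concentrates draws near uniform pi,
# matching the plots on the next slides.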

Page 15:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

K = 3, Dir(π; 2, 2, 2) = Dir(π; 2)

Note it is peaked around the uniform distribution, π = (1/3, 1/3, 1/3). Most of the samples from Dir(π; 2, 2, 2) will be somewhat close to uniform.

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 16:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

K = 3, Dir(π; 20, 2, 2)

Note it is skewed toward π = (1, 0, 0). In effect, most of the samples from Dir(π; 20, 2, 2) are distributions that place most of their mass on state 1.

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 17:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

K = 3, Dir(π; 0.1, 0.1, 0.1) = Dir(π; 0.1)

[Figure: density plot of Dir(π; 0.1, 0.1, 0.1) over the 2-dimensional simplex.]

Note it peaks around the three sparse distributions, π = (1, 0, 0), π = (0, 1, 0), and π = (0, 0, 1). Most draws from Dir(π; 0.1, 0.1, 0.1) will center most of their mass on one state, but in every such draw, each of the three states has an equal chance to be the most probable state.

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 18:

Bayesian GMM: Modeling The Dirichlet Distribution

Categorical Likelihood with a Dirichlet Prior

Let $z = (z_i)_{i=1}^N$ with $z_i \overset{\text{iid}}{\sim} \mathrm{Cat}(z; \pi)$. Let $N_j = \sum_{i=1}^N 1_{z_i = j}$.

$$
\begin{aligned}
p(\pi \mid z) &\propto p(\pi) \prod_{i=1}^N p(z_i \mid \pi) \\
&= \underbrace{\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \prod_{j=1}^K \frac{\pi_j^{\alpha_j - 1}}{\Gamma(\alpha_j)}}_{\text{Dirichlet prior}}\,
\underbrace{\Bigg(\prod_{j=1}^K \pi_j^{N_j}\Bigg)}_{\text{categorical data likelihood}} \\
&= \Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \prod_{j=1}^K \frac{\pi_j^{\alpha_j + N_j - 1}}{\Gamma(\alpha_j)}
\propto \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)
\end{aligned}
$$

In effect, p(π|z) = Dir(π;α1 +N1, . . . , αK +NK)
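A hedged sketch (ours, not the course demo) of this conjugate update: add the counts N_j to the prior hyperparameters and, if desired, draw π from the resulting Dirichlet.

# Dirichlet-categorical conjugate update
import numpy as np

rng = np.random.default_rng(0)
K = 5
alpha = np.full(K, 2.0)                          # prior hyperparameters
z = np.array([1, 2, 1, 5, 3, 1])                 # observed labels in {1, ..., K}

N_counts = np.bincount(z, minlength=K + 1)[1:]   # N_j
alpha_post = alpha + N_counts                    # alpha_j* = alpha_j + N_j

pi_draw = rng.dirichlet(alpha_post)              # one posterior draw of pi
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean.round(3))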

Page 19:

Bayesian GMM: Modeling The Dirichlet Distribution

Categorical Likelihood with a Dirichlet Prior

In general, the parameters of the prior or posterior are called hyperparameters.

We started with a Dirichlet prior, with some hyperparameters $(\alpha_j)_{j=1}^K$, and, once we took the categorical data likelihood into account, ended up with a posterior such that:
1) the posterior is yet another Dirichlet distribution;
2) the posterior hyperparameters are given in closed form:

$$\alpha_j^* = \alpha_j + N_j, \qquad N_j = \sum_{i=1}^N 1_{z_i = j} \qquad \forall j \in \{1, \dots, K\}$$

where we used $*$ to denote the hyperparameters of the posterior.

$(\alpha_j^*)_{j=1}^K$ depend on both the data, $z$, and $(\alpha_j)_{j=1}^K$.

All this is a particular example of what is called conjugacy.

Page 20:

Bayesian GMM: Modeling The Dirichlet Distribution

Categorical Likelihood with a Dirichlet Prior

We view the $\alpha_j$'s as pseudo-counts. In other words, it is as if we had additional data,

$$
\underbrace{1, 1, \dots, 1}_{\alpha_1 \text{ times}},\ \underbrace{2, 2, \dots, 2}_{\alpha_2 \text{ times}},\ \dots,\ \underbrace{K, K, \dots, K}_{\alpha_K \text{ times}}
$$

The higher the value of the (sums of the) $\alpha_j$'s, the higher the weight of the Dirichlet prior is. For example, Dir(π; 100) is much more peaked around the uniform distribution than Dir(π; 2) is.

When N becomes very large, the likelihood dominates the prior.

Page 21:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

Definition (Normal Inverse-Wishart distribution)

Let $\mu \in \mathbb{R}^n$ and let $\Sigma \in \mathbb{R}^{n \times n}$ be SPD. $\mu$ and $\Sigma$ are Normal-Inverse-Wishart distributed if

$$
p(\mu, \Sigma; \kappa, m, \nu, \Psi) = \mathrm{NIW}(\mu, \Sigma; \kappa, m, \nu, \Psi)
\triangleq
\underbrace{\mathcal{N}\big(\mu;\, m,\, \tfrac{1}{\kappa}\Sigma\big)}_{p(\mu \mid \Sigma; \kappa, m)}\,
\underbrace{\mathcal{W}^{-1}(\Sigma; \nu, \Psi)}_{p(\Sigma; \nu, \Psi)}
$$

$\mathcal{W}^{-1}(\Sigma; \nu, \Psi)$ is the Inverse-Wishart distribution, a distribution over $n \times n$ SPD matrices. Think of it as a sort of a “Gaussian” over covariance matrices (we can't quite have a Gaussian over a nonlinear space).

The generative process here is: 1) draw $\Sigma$ from the IW; 2) given $\Sigma$, sample $\mu$ from an $m$-mean Gaussian whose covariance is $\frac{1}{\kappa}\Sigma$.

Page 22:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

Definition (Inverse-Wishart distribution)

$\mathbb{R}^{n \times n} \ni \Sigma \sim \mathcal{W}^{-1}(\nu, \Psi)$, where $\nu > n - 1$ and $\Psi \in \mathbb{R}^{n \times n}$ is SPD, $\iff$

$$
\mathcal{W}^{-1}(\Sigma; \nu, \Psi) = \frac{|\nu\Psi|^{\nu/2}}{2^{\nu n/2}\, \Gamma_n(\nu/2)}\,
|\Sigma|^{-\frac{\nu + n + 1}{2}}\, e^{-\frac{1}{2}\mathrm{tr}(\nu\Psi\Sigma^{-1})}
$$

Its mode and mean have closed forms and are close to each other:

$$
E(\Sigma) = \frac{\nu\Psi}{\nu - n - 1} \ \text{ for } \nu > n + 1,
\qquad
\arg\max_{\Sigma}\, \mathcal{W}^{-1}(\Sigma; \nu, \Psi) = \frac{\nu\Psi}{\nu + n + 1}
$$

When $\nu$ is large relative to $n$, the difference between the two is small. When $\nu \to \infty$ they both converge to $\Psi$.

Some (software-package) authors parametrize the IW with $\Delta = \nu\Psi$:

$$
\mathcal{W}^{-1}(\Sigma; \nu, \Delta) = \frac{|\Delta|^{\nu/2}}{2^{\nu n/2}\, \Gamma_n(\nu/2)}\,
|\Sigma|^{-\frac{\nu + n + 1}{2}}\, e^{-\frac{1}{2}\mathrm{tr}(\Delta\Sigma^{-1})}
$$
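A small sketch (ours, not from the slides) of drawing SPD matrices from an inverse-Wishart with SciPy. Note the parametrization caveat above: to my understanding scipy.stats.invwishart takes the scale matrix directly (the Δ-style convention), so matching the slides' (ν, Ψ) convention means passing scale = νΨ; the specific ν and Ψ values here are illustrative.

# inverse-Wishart sampling with SciPy, checking the mean nu*Psi/(nu - n - 1)
import numpy as np
from scipy.stats import invwishart

n = 2
nu = 10.0                          # degrees of freedom
Psi = np.eye(n)                    # SPD parameter

iw = invwishart(df=nu, scale=nu * Psi)
Sigma = iw.rvs(random_state=0)                       # one SPD draw
empirical_mean = iw.rvs(size=20_000, random_state=1).mean(axis=0)
print(Sigma)
print(empirical_mean.round(2), (nu * Psi / (nu - n - 1)).round(2))   # should be close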

Page 23:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

Fact (Normal likelihood with an NIW prior)

If $x_i \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \Sigma)$ and $p(\mu, \Sigma) = \mathrm{NIW}(\mu, \Sigma; \kappa, m, \nu, \Psi)$, then

$$
p(\mu, \Sigma \mid x_1, \dots, x_N) = \mathrm{NIW}(\mu, \Sigma; \kappa^*, m^*, \nu^*, \Psi^*)
$$

(i.e., the posterior is also NIW) and the dependency of the closed-form updates on the data is only through $\sum_i x_i$ and $\sum_i x_i x_i^T$:

$$
\kappa^* = \kappa + N, \qquad m^* = \mathrm{func}\Big(\kappa, \kappa^*, m, \sum_{i=1}^N x_i\Big)
$$
$$
\nu^* = \nu + N, \qquad \Psi^* = \mathrm{func}\Big(\nu, \nu^*, \kappa, \kappa^*, m, m^*, \Psi, \sum_{i=1}^N x_i x_i^T\Big)
$$

$(\kappa^*, m^*, \nu^*, \Psi^*)$ depend on both the data, $(x_i)_{i=1}^N$, and $(\kappa, m, \nu, \Psi)$.

Again, all this is a particular example of what is called conjugacy.

Page 24:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

The Details

$$
\kappa^* = \kappa + N, \qquad
m^* = \mathrm{func}\Big(\kappa, \kappa^*, m, \sum_{i=1}^N x_i\Big) = \frac{1}{\kappa^*}\Bigg[\kappa m + \sum_{i=1}^N x_i\Bigg]
$$

$$
\nu^* = \nu + N
$$

$$
\Psi^* = \mathrm{func}\Big(\nu, \nu^*, \kappa, \kappa^*, m, m^*, \Psi, \sum_{i=1}^N x_i x_i^T\Big)
= \frac{1}{\nu^*}\Bigg[\nu\Psi + \kappa m m^T + \Bigg(\sum_{i=1}^N x_i x_i^T\Bigg) - \kappa^* m^* (m^*)^T\Bigg]
$$

Result taken from Jason Chang’s PhD thesis.
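A minimal NumPy sketch of the closed-form update above (ours, not taken verbatim from the thesis); the helper name niw_posterior and the synthetic data are ours.

# closed-form NIW posterior update (kappa*, m*, nu*, Psi*)
import numpy as np

def niw_posterior(kappa, m, nu, Psi, X):
    """X is an (N, n) array of observations assigned to this Gaussian."""
    N, n = X.shape
    sum_x = X.sum(axis=0)                    # sum_i x_i
    sum_xxT = X.T @ X                        # sum_i x_i x_i^T
    kappa_s = kappa + N
    nu_s = nu + N
    m_s = (kappa * m + sum_x) / kappa_s
    Psi_s = (nu * Psi + kappa * np.outer(m, m) + sum_xxT
             - kappa_s * np.outer(m_s, m_s)) / nu_s
    return kappa_s, m_s, nu_s, Psi_s

# example usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(niw_posterior(kappa=1.0, m=np.zeros(2), nu=5.0, Psi=np.eye(2), X=X))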

Page 25:

Bayesian GMM: Inference Hard Problem with Easy Conditional Subproblems

Bayesian GMM Inference

$$
\begin{aligned}
p\big(\theta \mid (x_i)_{i=1}^N\big) &\propto \Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)\,
\prod_{i=1}^N \sum_{j=1}^K p(x_i \mid z_i = j, \mu_j, \Sigma_j)\, \pi_j \\
&= \Bigg(\prod_{j=1}^K \mathrm{NIW}(\mu_j, \Sigma_j; \kappa, m, \nu, \Psi)\Bigg)\,
\mathrm{Dir}\big(\pi; (\alpha_j)_{j=1}^K\big)\,
\prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
\end{aligned}
$$

$p(\theta \mid (x_i)_{i=1}^N)$ is nasty, so think in terms of $p(\theta, z \mid (x_i)_{i=1}^N)$ instead, and alternate between

$$p\big(\theta \mid z, (x_i)_{i=1}^N\big) \quad \text{and} \quad p\big(z \mid \theta, (x_i)_{i=1}^N\big)$$

EM (for MAP estimates) or hard-assignment EM is one way to go about it, but we can also do Gibbs sampling.

Page 26:

Bayesian GMM: Inference Hard Problem with Easy Conditional Subproblems

Bayesian GMM Inference

Given the labels, z = (zi)Ni=1, we get

$$
p\big(\theta \mid z, (x_i)_{i=1}^N\big) = p(\pi \mid z) \prod_{j=1}^K p\big(\mu_j, \Sigma_j \mid (x_i)_{i: z_i = j}\big)
$$

where

$$p(\pi \mid z) = \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)$$

and

$$p\big(\mu_j, \Sigma_j \mid (x_i)_{i: z_i = j}\big) = \mathrm{NIW}(\mu_j, \Sigma_j; \kappa_j^*, m_j^*, \nu_j^*, \Psi_j^*)$$

where the posterior hyperparameters of Gaussian $j$ are updated using $N_j = \sum_{i=1}^N 1_{z_i = j}$ (instead of $N$), $\sum_{i: z_i = j} x_i$ (instead of $\sum_{i=1}^N x_i$), and $\sum_{i: z_i = j} x_i x_i^T$ (instead of $\sum_{i=1}^N x_i x_i^T$).

We can shoot for any one of:
1) $\arg\max_\theta p\big(\theta \mid z, (x_i)_{i=1}^N\big)$;
2) $E\big(\theta \mid z, (x_i)_{i=1}^N\big)$;
3) $\theta \sim p\big(\theta \mid z, (x_i)_{i=1}^N\big)$

Page 27:

Bayesian GMM: Inference Hard Problem with Easy Conditional Subproblems

Remark (If sampling from the joint is easier than sampling from the marginal)

In general, suppose x and y are some two jointly-distributed RVs and we want to sample x ∼ p(x) but it is hard. If, however, we can sample x, y ∼ p(x, y), we can just discard y to obtain x ∼ p(x).

In our case here, we don’t know how to sample from $p(\theta \mid (x_i)_{i=1}^N)$. Suppose we manage (e.g., through Gibbs sampling) to sample $\theta, z \sim p(\theta, z \mid (x_i)_{i=1}^N)$. Discarding $z$, we get $\theta \sim p(\theta \mid (x_i)_{i=1}^N)$.

Page 28:

Bayesian GMM: Inference Gibbs Sampling in Bayesian GMM

Gibbs Sampling for Bayesian GMM Inference

Here Gibbs sampling targets sampling $(\theta, z) \sim p(\theta, z \mid D)$ by alternating between

$$\theta \sim p(\theta \mid z, D) \quad \text{and} \quad z \sim p(z \mid \theta, D)$$

Page 29:

Bayesian GMM: Inference Gibbs Sampling in Bayesian GMM

Gibbs Sampling for Bayesian GMM Inference

$\theta \sim p(\theta \mid z, D) = p\big(\pi \mid z, (x_i)_{i=1}^N\big) \prod_{j=1}^K p\big(\mu_j, \Sigma_j \mid z, (x_i)_{i=1}^N\big)$.

In effect, this conditional sampling of θ has two independent parts:

$$\text{(1)} \quad \pi \sim p\big(\pi \mid z, (x_i)_{i=1}^N\big) = p(\pi \mid z) = \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)$$

$$\text{(2)} \quad \mu_j, \Sigma_j \sim p\big(\mu_j, \Sigma_j \mid z, (x_i)_{i=1}^N\big) = p\big(\mu_j, \Sigma_j \mid (x_i)_{i: z_i = j}\big) = \mathrm{NIW}(\mu_j, \Sigma_j; \kappa_j^*, m_j^*, \nu_j^*, \Psi_j^*) \quad \forall j \in \{1, \dots, K\}$$

(the second part may also be done in parallel over the K components)
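A hedged sketch of this θ-step (ours, not the course demo). It reuses the niw_posterior helper sketched after the NIW-update slide, and it assumes labels z take values in {1, ..., K}; whether SciPy's random_state accepts a NumPy Generator depends on the installed versions.

# theta-step: sample pi from its posterior Dirichlet and each (mu_j, Sigma_j) from its posterior NIW
import numpy as np
from scipy.stats import invwishart

def sample_theta(X, z, K, alpha, kappa, m, nu, Psi, rng):
    # (1) pi | z  ~  Dir(alpha_1 + N_1, ..., alpha_K + N_K)
    N_counts = np.bincount(z, minlength=K + 1)[1:]
    pi = rng.dirichlet(alpha + N_counts)
    # (2) (mu_j, Sigma_j) | z, X  ~  NIW(kappa_j*, m_j*, nu_j*, Psi_j*), per component
    mus, Sigmas = [], []
    for j in range(1, K + 1):
        Xj = X[z == j]
        if len(Xj):
            k_s, m_s, nu_s, Psi_s = niw_posterior(kappa, m, nu, Psi, Xj)
        else:
            k_s, m_s, nu_s, Psi_s = kappa, m, nu, Psi   # empty component falls back to the prior
        Sigma_j = invwishart.rvs(df=nu_s, scale=nu_s * Psi_s, random_state=rng)
        mu_j = rng.multivariate_normal(m_s, Sigma_j / k_s)
        mus.append(mu_j)
        Sigmas.append(Sigma_j)
    return pi, np.array(mus), np.array(Sigmas)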

Page 30:

Bayesian GMM: Inference Gibbs Sampling in Bayesian GMM

Gibbs Sampling for Bayesian GMM Inference

$z \sim p(z \mid \theta, D) = \prod_{i=1}^N p(z_i \mid \theta, D) = \prod_{i=1}^N p(z_i \mid \theta, x_i)$:

$$
z_i \sim p(z_i \mid \theta, x_i) \propto \sum_{j=1}^K \pi_j\, p(x_i \mid \mu_j, \Sigma_j)\, 1_{z_i = j}
= \sum_{j=1}^K \pi_j\, \mathcal{N}(x_i; \mu_j, \Sigma_j)\, 1_{z_i = j}
$$

In effect:

$$p(z_i = j \mid \theta, D) \propto \pi_j\, \mathcal{N}(x_i; \mu_j, \Sigma_j)$$

All labels can be sampled in parallel.

Don’t forget the numerics:
- the log-sum-exp trick
- slogdet
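A hedged sketch (ours) of the z-step in log space: slogdet for the Gaussian log-density and a log-sum-exp style normalization (here via scipy.special.softmax) before sampling each label; it consumes the (pi, mus, Sigmas) produced by the θ-step sketch above.

# z-step: sample all labels from their categorical conditionals, computed in log space
import numpy as np
from scipy.special import softmax

def log_gaussian(X, mu, Sigma):
    # log N(x_i; mu, Sigma) for every row x_i of X
    n = X.shape[1]
    sign, logdet = np.linalg.slogdet(Sigma)
    diff = X - mu
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + maha)

def sample_z(X, pi, mus, Sigmas, rng):
    K = len(pi)
    # unnormalized log p(z_i = j | theta, x_i) = log pi_j + log N(x_i; mu_j, Sigma_j)
    log_w = np.column_stack([np.log(pi[j]) + log_gaussian(X, mus[j], Sigmas[j])
                             for j in range(K)])
    probs = softmax(log_w, axis=1)                    # log-sum-exp normalization, row-wise
    return np.array([rng.choice(K, p=probs[i]) + 1    # labels in {1, ..., K}
                     for i in range(X.shape[0])])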

Page 31:

Demos: Gibbs-sampling Inference in Bayesian GMM

See demos

Page 32:

Conjugate Priors

Bayesian Parameter Estimation

Forget about mixture models for a second, and consider the following:

$D = (x_i)_{i=1}^N$, $x_i \overset{\text{iid}}{\sim} p(x \mid \theta)$ where $\theta \sim p(\theta)$ is also an RV.

$p(x_i \mid \theta) \triangleq f_x(x_i; \theta)$, where $f_x(\cdot\,; \theta)$ is parametrized by $\theta$.

Likelihood:

$$p(D \mid \theta) = \prod_{i=1}^N f_x(x_i; \theta)$$

$p(\theta) = f_\theta(\theta; \lambda)$ where $f_\theta(\cdot\,; \lambda)$ is parametrized by $\lambda$.

Posterior:

$$
p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{p(D)}
\propto p(\theta)\, p(D \mid \theta) = f_\theta(\theta; \lambda) \prod_{i=1}^N f_x(x_i; \theta)
$$

Page 33:

Conjugate Priors

Bayesian Parameter Estimation

Usually, p(θ|x) does not have the same parametric form as the prior, fθ.

For a certain class of prior-likelihood pairs, the posterior stays within the same family of functions as the prior.

This class of paired distributions is typically referred to as the class of conjugate distributions, and the prior distribution that pairs with a particular data likelihood is called the conjugate prior (to that particular likelihood).

Page 34:

Conjugate Priors

Definition (conjugate priors)

Let p(xi|θ) = fx(xi; θ) and let p(θ) = fθ(θ;λ). If

p(θ|D) = fθ(θ;λ∗)

then $f_\theta$ is said to be conjugate to $f_x$. The posterior hyper-parameters, $\lambda^*$, usually depend on both the observations, D, and the prior hyper-parameters, $\lambda$; i.e.,

λ∗ = λ∗(D, λ)

Page 35:

Conjugate Priors

Definition (exponential family)

Let F be a set of parametrized distributions. F is called an exponential family if all its members have the form

$$p(x_i \mid \theta) = f(x_i)\, g(\theta) \exp\big(\eta(\theta)^T t(x_i)\big)$$

where $\eta(\theta)$ and $t(x_i)$ are vectors of the same dimension as $\theta$.

If $\theta \in \mathbb{R}^d$, then

$$\eta(\theta)^T t(x_i) = \sum_{j=1}^d \eta_j(\theta)\, t_j(x_i)$$

η is called the natural parameter(s).

Page 36:

Conjugate Priors

Exponential Families

The likelihood for iid samples, $(x_i)_{i=1}^N$, is

$$
p\big((x_i)_{i=1}^N \mid \theta\big) = \Bigg(\prod_{i=1}^N f(x_i)\Bigg) g(\theta)^N
\exp\Big(\eta(\theta)^T \underbrace{T\big((x_i)_{i=1}^N\big)}_{\text{suff.\ stats}}\Big)
\propto g(\theta)^N \exp\Big(\eta(\theta)^T T\big((x_i)_{i=1}^N\big)\Big)
$$

where $T\big((x_i)_{i=1}^N\big) \triangleq \sum_{i=1}^N t(x_i)$ are called the sufficient statistics (i.e., sufficient for estimating $\eta$).

$$
\eta(\theta)^T T\big((x_i)_{i=1}^N\big) = \sum_{j=1}^d \eta_j(\theta)\, T_j\big((x_i)_{i=1}^N\big)
= \sum_{j=1}^d \eta_j(\theta) \sum_{i=1}^N t_j(x_i)
$$

Page 37:

Conjugate Priors

Exponential Families

Likelihood:

$$p\big((x_i)_{i=1}^N \mid \theta\big) \propto g(\theta)^N \exp\Big(\eta(\theta)^T T\big((x_i)_{i=1}^N\big)\Big)$$

If the prior is

$$p(\theta) \propto g(\theta)^m \exp\big(\eta(\theta)^T v\big)$$

(for some vector $v$) then the posterior is:

$$p\big(\theta \mid (x_i)_{i=1}^N\big) \propto g(\theta)^{m+N} \exp\Big(\eta(\theta)^T \big(v + T\big((x_i)_{i=1}^N\big)\big)\Big)$$

Thus, this prior is conjugate to this exponential family. Here, $\lambda = (m, v)$ and $\lambda^* = \big(m + N,\ v + T\big((x_i)_{i=1}^N\big)\big)$.
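As a worked check (not on the original slide), the categorical/Dirichlet pair from the earlier slides is an instance of this template. Writing

$$p(z_i \mid \pi) = \exp\Big(\sum_{j=1}^K 1_{z_i = j} \log \pi_j\Big)$$

gives $f(z_i) = 1$, $g(\pi) = 1$, $\eta_j(\pi) = \log \pi_j$, and $t_j(z_i) = 1_{z_i = j}$, so $T_j\big((z_i)_{i=1}^N\big) = N_j$. A prior of the template form with $v_j = \alpha_j - 1$ is $p(\pi) \propto \exp\big(\eta(\pi)^T v\big) = \prod_{j=1}^K \pi_j^{\alpha_j - 1}$, i.e., $\mathrm{Dir}(\pi; \alpha)$ up to normalization, and the template update $v_j^* = v_j + N_j$ reproduces $p(\pi \mid z) = \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)$, matching the earlier slide.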

Page 38:

Conjugate Priors

Exponential Families

The special structure of exponential families makes both maximum-likelihood estimation and Bayesian inference more tractable.

This is due to the fact that finite-dimensional exponential families admit finite-dimensional sufficient statistics (i.e., the dimensionality of the sufficient statistics does not depend on N) and the fact that (in the Bayesian case) they admit conjugate priors.

Page 39:

Conjugate Priors

Exponential Families: Examples We Already Saw

Example

n-dimensional Gaussians (look up the “natural parametrization” of a multivariate Gaussian). The sufficient statistics are:

$$
T\big((x_i)_{i=1}^N\big) = \Bigg(\sum_{i=1}^N x_i,\ \sum_{i=1}^N x_i x_i^T\Bigg)
= \sum_{i=1}^N \underbrace{\big(x_i,\ x_i x_i^T\big)}_{t_i}
$$

where $\mathrm{vectorize}\big(\sum_{i=1}^N x_i x_i^T\big)$ is tied to a vector of length $\frac{1}{2}(n^2 + n)$.
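A tiny sketch (ours, not from the slides) that accumulates these sufficient statistics on synthetic data and checks the length of the symmetric-part vectorization.

# Gaussian sufficient statistics: sum of x_i and sum of x_i x_i^T
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 1000
X = rng.normal(size=(N, n))

T1 = X.sum(axis=0)                 # sum_i x_i            (length n)
T2 = X.T @ X                       # sum_i x_i x_i^T      (n x n, symmetric)
iu = np.triu_indices(n)            # (n^2 + n)/2 free entries of the symmetric matrix
print(T1.shape, T2[iu].shape)      # (3,) (6,)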

Page 40:

Conjugate Priors

Exponential Families: Examples We Already Saw

Example

K-dimensional categorical distributions. The sufficient statistics are:

$$
T\big((z_i)_{i=1}^N\big) = \Bigg(\sum_{i=1}^N 1_{z_i = 1},\ \dots,\ \sum_{i=1}^N 1_{z_i = K}\Bigg)
= \sum_{i=1}^N \underbrace{\big(1_{z_i = 1}, \dots, 1_{z_i = K}\big)}_{t_i}
$$

($t_i$ is a binary vector with a single non-zero entry)

Page 41:

Conjugate Priors

Exponential Families: Examples We Already Saw

Example

Gibbs distributions (MRFs). This is useful, e.g., when we want to
- estimate the temperature parameter in the Ising model
- “learn” (estimate, really) the parameters in the Field-of-Experts model.

Page 42:

Conjugate Priors

Exponential Families

Remark

There is a connection between natural parameters in an exponential family and Differential Geometry through a field called Information Geometry. There, every finite-dimensional exponential family is viewed as a finite-dimensional manifold. So, e.g., we can talk about the geodesic distance – the length of the shortest path on the manifold – between two Gaussians.

Page 43:

Beyond iid: Bayesian Connectivity-constrained superpixels

Digression: Bayesian Connectivity-constrained superpixels

In short: fast inference in a spatio-intensity GMM where the labels are not iid and where a flexible Inverse-Wishart prior does not discourage elongated superpixels too strongly.

See slides/paper from [Freifeld, Li, Fisher, ICIP 2015] at https://www.cs.bgu.ac.il/~orenfr/papers.htm

Page 44:

Beyond iid: Bayesian Connectivity-constrained superpixels

Version Log

28/4/2019, ver 1.00.
