
Page 1:

Computer Vision: Models, Learning and Inference –

Mixture Models, Part 3

Oren Freifeld and Ron Shapira-Weber

Computer Science, Ben-Gurion University

April 29, 2019

www.cs.bgu.ac.il/~cv192/

Page 2:

1. Bayesian GMM: Modeling
   - The Categorical Distribution
   - The Dirichlet Distribution
   - The Normal-Inverse Wishart Distribution

2. Bayesian GMM: Inference
   - Hard Problem with Easy Conditional Subproblems
   - Gibbs Sampling in Bayesian GMM

3. Demos: Gibbs-sampling Inference in Bayesian GMM
   - Categorical Data Likelihood and a Dirichlet Prior
   - Normal Data Likelihood and an NIW Prior
   - Gibbs Sampling

4. Conjugate Priors

5. Beyond iid: Bayesian Connectivity-constrained superpixels

Page 3:

Bayesian GMM: Modeling

Bayesian GMM

$$
\theta = \big((\mu_j)_{j=1}^K,\ (\Sigma_j)_{j=1}^K,\ (\pi_j)_{j=1}^K\big), \qquad
\theta \sim p(\theta), \qquad
x_i \overset{\text{iid}}{\sim} p(x_i \mid \theta), \qquad
D \triangleq (x_i)_{i=1}^N
$$

The joint pdf:

$$
\begin{aligned}
p(D, \theta) &= p(D \mid \theta)\, p(\theta)
= p(\theta) \prod_{i=1}^N p(x_i \mid \theta)
= p(\theta) \prod_{i=1}^N \sum_{z_i=1}^K p(x_i, z_i \mid \theta) \\
&= p(\theta) \prod_{i=1}^N \sum_{j=1}^K p(x_i, z_i = j \mid \theta)
= p(\theta) \prod_{i=1}^N \sum_{j=1}^K p(x_i \mid z_i = j, \theta)\, p(z_i = j \mid \theta) \\
&= p(\theta) \prod_{i=1}^N \sum_{j=1}^K p(x_i \mid z_i = j, \mu_j, \Sigma_j)\, \pi_j
= p(\theta) \prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
\end{aligned}
$$

Page 4:

Bayesian GMM: Modeling

Bayesian GMM

Thus:

$$
\theta = \big((\mu_j)_{j=1}^K,\ (\Sigma_j)_{j=1}^K,\ (\pi_j)_{j=1}^K\big), \qquad
p(D, \theta) = p(\theta) \prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
$$

Remark

Some authors prefer using $\theta = (\theta_j)_{j=1}^K = \big((\mu_j, \Sigma_j)\big)_{j=1}^K$ and $\pi = (\pi_j)_{j=1}^K$; in which case:

$$
p(D, \theta, \pi) = p(\theta, \pi) \prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
$$

Page 5:

Bayesian GMM: Modeling

Bayesian GMM

Typically:

$$
p(\theta) = p\big((\mu_j)_{j=1}^K,\ (\Sigma_j)_{j=1}^K,\ (\pi_j)_{j=1}^K\big)
\overset{\text{assumption}}{=}
\Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)
$$

where $\pi = (\pi_j)_{j=1}^K$. In effect,

$$
(\mu_1, \Sigma_1) \perp\!\!\!\perp (\mu_2, \Sigma_2) \perp\!\!\!\perp \dots \perp\!\!\!\perp (\mu_K, \Sigma_K) \perp\!\!\!\perp \pi
$$

Page 6:

Bayesian GMM: Modeling

Bayesian GMM

The posterior:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)
$$

$$
p(\theta \mid D) \propto p(\theta)\,
\underbrace{\prod_{i=1}^N \overbrace{\sum_{j=1}^K p(x_i \mid z_i = j, \mu_j, \Sigma_j)\, \pi_j}^{p(x_i \mid \theta)}}_{p(D \mid \theta)}
= \Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)
\prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
$$

Page 7:

Bayesian GMM: Modeling

Bayesian GMM

$$
p(\theta \mid D) \propto
\underbrace{\Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)}_{\text{prior}}\,
\underbrace{\prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j}_{\text{GMM data likelihood}}
$$

Need to specify p(π) and p(µj ,Σj).

Typically p(µj ,Σj) does not depend on j; i.e.,

$$(\mu_j, \Sigma_j) \overset{\text{iid}}{\sim} p(\mu, \Sigma)\,.$$

We will go with standard choices that simplify computations.

These choices are related to the concepts of conjugate priors and exponential families.

Page 8:

Bayesian GMM: Modeling The Categorical Distribution

Definition (categorical distribution)

A K-dimensional categorical distribution is a discrete distribution over a finite alphabet; WLOG, let this alphabet be {1, . . . , K}; in particular, z is said to be drawn from a K-dimensional categorical distribution parametrized by π = (π1, . . . , πK) if

$$
z \sim \mathrm{Cat}(z; \pi) \triangleq
\begin{cases}
\pi_z, & z \in \{1, \dots, K\} \\
0, & \text{otherwise}
\end{cases}
\qquad \text{i.e.,}\quad p(z = j) = \pi_j
$$

If $z = (z_i)_{i=1}^N$ and $z_i \overset{\text{iid}}{\sim} \mathrm{Cat}(z; \pi)$, then

$$
p(z \mid \pi) = \prod_{i=1}^N p(z_i \mid \pi)
= \prod_{i=1}^N \pi_{z_i}
= \prod_{i=1}^N \prod_{j=1}^K \pi_j^{1_{z_i = j}}
= \prod_{j=1}^K \prod_{i=1}^N \pi_j^{1_{z_i = j}}
= \prod_{j=1}^K \pi_j^{N_j}
$$

where $N_j = \sum_{i=1}^N 1_{z_i = j} = |\{i : z_i = j\}|$.

Page 9:

Bayesian GMM: Modeling The Categorical Distribution

Example

Suppose K = 5. Thus, π = (π1, π2, π3, π4, π5). Suppose N = 6 and we happened to observe

z = (z1, z2, z3, z4, z5, z6) = (1, 2, 1, 5, 3, 1)

Thus, N1 = 3, N2 = 1, N3 = 1, N4 = 0, N5 = 1, and

$$
p(z \mid \pi) = p(z_1, z_2, z_3, z_4, z_5, z_6 \mid \pi) = \prod_{j=1}^5 \pi_j^{N_j}
= \pi_1^3\, \pi_2^1\, \pi_3^1\, \pi_4^0\, \pi_5^1
= \pi_1^3\, \pi_2\, \pi_3\, \pi_5
$$
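A minimal NumPy check of this example (not part of the original slides); the value of pi below is an arbitrary illustrative point on the simplex.

# counts and categorical likelihood for the example above
import numpy as np

K = 5
pi = np.array([0.5, 0.2, 0.1, 0.1, 0.1])          # arbitrary point on the simplex
z = np.array([1, 2, 1, 5, 3, 1])                  # the observed labels

N_counts = np.bincount(z, minlength=K + 1)[1:]    # N_j = #{i : z_i = j}, j = 1..K
print(N_counts)                                   # -> [3 1 1 0 1]

likelihood = np.prod(pi ** N_counts)              # p(z | pi) = prod_j pi_j^{N_j}
print(np.isclose(likelihood, pi[0]**3 * pi[1] * pi[2] * pi[4]))   # True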

Page 10:

Bayesian GMM: Modeling The Dirichlet Distribution

Definition (simplex)

The (K − 1)-dimensional (probability/standard) simplex, a subset of $\mathbb{R}^K$, is

$$
\Big\{\pi = (\pi_1, \dots, \pi_K) : \sum_{j=1}^K \pi_j = 1 \ \text{ and }\ \pi_j \ge 0\ \ \forall j \in \{1, \dots, K\}\Big\}
$$

(in the figure: θ = (θ1, θ2, θ3) is a point on the 2-dimensional simplex)

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 11:

Bayesian GMM: Modeling The Dirichlet Distribution

Definition (Dirichlet distribution)

A K-dimensional Dirichlet distribution is a distribution over the (K − 1)-dimensional simplex such that

$$
\pi \sim \mathrm{Dir}(\pi; \alpha_1, \dots, \alpha_K) \triangleq
\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \prod_{j=1}^K \frac{\pi_j^{\alpha_j - 1}}{\Gamma(\alpha_j)}
= \underbrace{\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \Bigg(\prod_{j=1}^K \frac{1}{\Gamma(\alpha_j)}\Bigg)}_{\text{const w.r.t. } \pi}
\Bigg(\prod_{j=1}^K \pi_j^{\alpha_j - 1}\Bigg)
$$

where all αj ’s are positive real numbers (see next slide for definition of Γ).

A Dirichlet distribution serves as a prior over categorical distributions.

Page 12:

Bayesian GMM: Modeling The Dirichlet Distribution

Note the similarity between the form of the categorical data likelihood,

$$p(z \mid \pi) = \prod_{j=1}^K \pi_j^{N_j}$$

and the form of the Dirichlet-distribution prior

$$\pi \sim \mathrm{Dir}(\pi; \alpha_1, \dots, \alpha_K) \propto \prod_{j=1}^K \pi_j^{\alpha_j - 1}$$

Page 13:

Bayesian GMM: Modeling The Dirichlet Distribution

Definition (the Γ function)

$$\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$$

where t is real but not a non-positive integer.

If t is a positive integer, then Γ(t) = (t − 1)!

Page 14:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

$$
\pi \sim \mathrm{Dir}(\pi; \alpha_1, \dots, \alpha_K) \triangleq
\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \Bigg(\prod_{j=1}^K \frac{1}{\Gamma(\alpha_j)}\Bigg) \Bigg(\prod_{j=1}^K \pi_j^{\alpha_j - 1}\Bigg)
$$

Mode (assuming $\alpha_j > 1\ \forall j$):

$$\frac{\alpha_j - 1}{\big(\sum_{k=1}^K \alpha_k\big) - K} \qquad \forall j \in \{1, \dots, K\}$$

Mean:

$$\frac{\alpha_j}{\sum_{k=1}^K \alpha_k} \qquad \forall j \in \{1, \dots, K\}$$

If the $\alpha_j$'s are all equal (to some $\alpha$), the Dirichlet distribution is called symmetric and we write

$$
\pi \sim \mathrm{Dir}(\pi; \alpha) = \frac{\Gamma\big(\sum_{j=1}^K \alpha\big)}{\Gamma(\alpha)^K} \prod_{j=1}^K \pi_j^{\alpha - 1}
$$

Example

$$
\mathrm{Dir}(\pi; \alpha = 1) = \frac{\Gamma\big(\sum_{j=1}^K 1\big)}{\Gamma(1)^K} \prod_{j=1}^K 1 \quad (= \text{const w.r.t. } \pi)
$$

is the uniform distribution over the simplex.
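A quick numerical sanity check (ours, not from the slides) of the mean formula and of the effect of the concentration parameters; the specific α values are illustrative.

# sample symmetric/asymmetric Dirichlets and compare empirical vs. analytic means
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([2.0, 2.0, 2.0], [20.0, 2.0, 2.0], [0.1, 0.1, 0.1]):
    alpha = np.array(alpha)
    samples = rng.dirichlet(alpha, size=100_000)   # each row is a point on the simplex
    analytic_mean = alpha / alpha.sum()            # alpha_j / sum_k alpha_k
    print(alpha, samples.mean(axis=0).round(3), analytic_mean.round(3))
# Dir(pi; 1) is uniform over the simplex; small alpha (e.g., 0.1) yields mostly
# near-sparse draws, while large equal alpha concentrates draws near uniform pi,
# matching the plots on the next slides.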

Page 15:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

K = 3, Dir(π; 2, 2, 2) = Dir(π; 2)

Note it is peaked around the uniform distribution, π = (1/3, 1/3, 1/3). Most of the samples from Dir(π; 2, 2, 2) will be somewhat close to uniform.

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 16:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

K = 3, Dir(π; 20, 2, 2)

Note it is skewed toward π = (1, 0, 0). In effect, most of the samples from Dir(π; 20, 2, 2) are distributions that place most of their mass on state 1.

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 17:

Bayesian GMM: Modeling The Dirichlet Distribution

Dirichlet Distributions

K = 3, Dir(π; 0.1, 0.1, 0.1) = Dir(π; 0.1)

[Figure: density plot of Dir(π; 0.1, 0.1, 0.1) over the 2-dimensional simplex.]

Note it peaks around the three sparse distributions, π = (1, 0, 0), π = (0, 1, 0), and π = (0, 0, 1). Most draws from Dir(π; 0.1, 0.1, 0.1) will center most of their mass on one state, but in every such draw, each of the three states has an equal chance to be the most probable state.

Figure from Kevin Murphy’s Machine-Learning book, 2012.

Page 18:

Bayesian GMM: Modeling The Dirichlet Distribution

Categorical Likelihood with a Dirichlet Prior

Let $z = (z_i)_{i=1}^N$ with $z_i \overset{\text{iid}}{\sim} \mathrm{Cat}(z; \pi)$. Let $N_j = \sum_{i=1}^N 1_{z_i = j}$.

$$
\begin{aligned}
p(\pi \mid z) &\propto p(\pi) \prod_{i=1}^N p(z_i \mid \pi) \\
&= \underbrace{\Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \prod_{j=1}^K \frac{\pi_j^{\alpha_j - 1}}{\Gamma(\alpha_j)}}_{\text{Dirichlet prior}}\,
\underbrace{\Bigg(\prod_{j=1}^K \pi_j^{N_j}\Bigg)}_{\text{categorical data likelihood}} \\
&= \Gamma\Big(\sum_{j=1}^K \alpha_j\Big) \prod_{j=1}^K \frac{\pi_j^{\alpha_j + N_j - 1}}{\Gamma(\alpha_j)}
\propto \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)
\end{aligned}
$$

In effect, p(π|z) = Dir(π;α1 +N1, . . . , αK +NK)
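A hedged sketch (ours, not the course demo) of this conjugate update: add the counts N_j to the prior hyperparameters and, if desired, draw π from the resulting Dirichlet.

# Dirichlet-categorical conjugate update
import numpy as np

rng = np.random.default_rng(0)
K = 5
alpha = np.full(K, 2.0)                          # prior hyperparameters
z = np.array([1, 2, 1, 5, 3, 1])                 # observed labels in {1, ..., K}

N_counts = np.bincount(z, minlength=K + 1)[1:]   # N_j
alpha_post = alpha + N_counts                    # alpha_j* = alpha_j + N_j

pi_draw = rng.dirichlet(alpha_post)              # one posterior draw of pi
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean.round(3))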

Page 19:

Bayesian GMM: Modeling The Dirichlet Distribution

Categorical Likelihood with a Dirichlet Prior

In general, the parameters of the prior or posterior are called hyperparameters.

We started with a Dirichlet prior, with some hyperparameters $(\alpha_j)_{j=1}^K$, and, once we took the categorical data likelihood into account, ended up with a posterior such that:
1) the posterior is yet another Dirichlet distribution;
2) the posterior hyperparameters are given in closed form:

$$\alpha_j^* = \alpha_j + N_j, \qquad N_j = \sum_{i=1}^N 1_{z_i = j} \qquad \forall j \in \{1, \dots, K\}$$

where we used $*$ to denote the hyperparameters of the posterior.

$(\alpha_j^*)_{j=1}^K$ depend on both the data, $z$, and $(\alpha_j)_{j=1}^K$.

All this is a particular example of what is called conjugacy.

Page 20:

Bayesian GMM: Modeling The Dirichlet Distribution

Categorical Likelihood with a Dirichlet Prior

We view the $\alpha_j$'s as pseudo-counts. In other words, it is as if we had additional data,

$$
\underbrace{1, 1, \dots, 1}_{\alpha_1 \text{ times}},\ \underbrace{2, 2, \dots, 2}_{\alpha_2 \text{ times}},\ \dots,\ \underbrace{K, K, \dots, K}_{\alpha_K \text{ times}}
$$

The higher the value of the (sums of the) $\alpha_j$'s, the higher the weight of the Dirichlet prior is. For example, Dir(π; 100) is much more peaked around the uniform distribution than Dir(π; 2) is.

When N becomes very large, the likelihood dominates the prior.

Page 21:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

Definition (Normal Inverse-Wishart distribution)

Let $\mu \in \mathbb{R}^n$ and let $\Sigma \in \mathbb{R}^{n \times n}$ be SPD. $\mu$ and $\Sigma$ are Normal-Inverse-Wishart distributed if

$$
p(\mu, \Sigma; \kappa, m, \nu, \Psi) = \mathrm{NIW}(\mu, \Sigma; \kappa, m, \nu, \Psi)
\triangleq
\underbrace{\mathcal{N}\big(\mu;\, m,\, \tfrac{1}{\kappa}\Sigma\big)}_{p(\mu \mid \Sigma; \kappa, m)}\,
\underbrace{\mathcal{W}^{-1}(\Sigma; \nu, \Psi)}_{p(\Sigma; \nu, \Psi)}
$$

$\mathcal{W}^{-1}(\Sigma; \nu, \Psi)$ is the Inverse-Wishart distribution, a distribution over $n \times n$ SPD matrices. Think of it as a sort of a “Gaussian” over covariance matrices (we can't quite have a Gaussian over a nonlinear space).

The generative process here is: 1) draw $\Sigma$ from the IW; 2) given $\Sigma$, sample $\mu$ from an $m$-mean Gaussian whose covariance is $\frac{1}{\kappa}\Sigma$.

Page 22:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

Definition (Inverse-Wishart distribution)

$\mathbb{R}^{n \times n} \ni \Sigma \sim \mathcal{W}^{-1}(\nu, \Psi)$, where $\nu > n - 1$ and $\Psi \in \mathbb{R}^{n \times n}$ is SPD, $\iff$

$$
\mathcal{W}^{-1}(\Sigma; \nu, \Psi) = \frac{|\nu\Psi|^{\nu/2}}{2^{\nu n/2}\, \Gamma_n(\nu/2)}\,
|\Sigma|^{-\frac{\nu + n + 1}{2}}\, e^{-\frac{1}{2}\mathrm{tr}(\nu\Psi\Sigma^{-1})}
$$

Its mode and mean have closed forms and are close to each other:

$$
E(\Sigma) = \frac{\nu\Psi}{\nu - n - 1} \ \text{ for } \nu > n + 1,
\qquad
\arg\max_{\Sigma}\, \mathcal{W}^{-1}(\Sigma; \nu, \Psi) = \frac{\nu\Psi}{\nu + n + 1}
$$

When $\nu$ is large relative to $n$, the difference between the two is small. When $\nu \to \infty$ they both converge to $\Psi$.

Some (software-package) authors parametrize the IW with $\Delta = \nu\Psi$:

$$
\mathcal{W}^{-1}(\Sigma; \nu, \Delta) = \frac{|\Delta|^{\nu/2}}{2^{\nu n/2}\, \Gamma_n(\nu/2)}\,
|\Sigma|^{-\frac{\nu + n + 1}{2}}\, e^{-\frac{1}{2}\mathrm{tr}(\Delta\Sigma^{-1})}
$$
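A small sketch (ours, not from the slides) of drawing SPD matrices from an inverse-Wishart with SciPy. Note the parametrization caveat above: to my understanding scipy.stats.invwishart takes the scale matrix directly (the Δ-style convention), so matching the slides' (ν, Ψ) convention means passing scale = νΨ; the specific ν and Ψ values here are illustrative.

# inverse-Wishart sampling with SciPy, checking the mean nu*Psi/(nu - n - 1)
import numpy as np
from scipy.stats import invwishart

n = 2
nu = 10.0                          # degrees of freedom
Psi = np.eye(n)                    # SPD parameter

iw = invwishart(df=nu, scale=nu * Psi)
Sigma = iw.rvs(random_state=0)                       # one SPD draw
empirical_mean = iw.rvs(size=20_000, random_state=1).mean(axis=0)
print(Sigma)
print(empirical_mean.round(2), (nu * Psi / (nu - n - 1)).round(2))   # should be close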

Page 23:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

Fact (Normal likelihood with an NIW prior)

If $x_i \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \Sigma)$ and $p(\mu, \Sigma) = \mathrm{NIW}(\mu, \Sigma; \kappa, m, \nu, \Psi)$, then

$$
p(\mu, \Sigma \mid x_1, \dots, x_N) = \mathrm{NIW}(\mu, \Sigma; \kappa^*, m^*, \nu^*, \Psi^*)
$$

(i.e., the posterior is also NIW) and the dependency of the closed-form updates on the data is only through $\sum_i x_i$ and $\sum_i x_i x_i^T$:

$$
\kappa^* = \kappa + N, \qquad m^* = \mathrm{func}\Big(\kappa, \kappa^*, m, \sum_{i=1}^N x_i\Big)
$$
$$
\nu^* = \nu + N, \qquad \Psi^* = \mathrm{func}\Big(\nu, \nu^*, \kappa, \kappa^*, m, m^*, \Psi, \sum_{i=1}^N x_i x_i^T\Big)
$$

$(\kappa^*, m^*, \nu^*, \Psi^*)$ depend on both the data, $(x_i)_{i=1}^N$, and $(\kappa, m, \nu, \Psi)$.

Again, all this is a particular example of what is called conjugacy.

Page 24:

Bayesian GMM: Modeling The Normal-Inverse Wishart Distribution

The Details

$$
\kappa^* = \kappa + N, \qquad
m^* = \mathrm{func}\Big(\kappa, \kappa^*, m, \sum_{i=1}^N x_i\Big) = \frac{1}{\kappa^*}\Bigg[\kappa m + \sum_{i=1}^N x_i\Bigg]
$$

$$
\nu^* = \nu + N
$$

$$
\Psi^* = \mathrm{func}\Big(\nu, \nu^*, \kappa, \kappa^*, m, m^*, \Psi, \sum_{i=1}^N x_i x_i^T\Big)
= \frac{1}{\nu^*}\Bigg[\nu\Psi + \kappa m m^T + \Bigg(\sum_{i=1}^N x_i x_i^T\Bigg) - \kappa^* m^* (m^*)^T\Bigg]
$$

Result taken from Jason Chang’s PhD thesis.
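A minimal NumPy sketch of the closed-form update above (ours, not taken verbatim from the thesis); the helper name niw_posterior and the synthetic data are ours.

# closed-form NIW posterior update (kappa*, m*, nu*, Psi*)
import numpy as np

def niw_posterior(kappa, m, nu, Psi, X):
    """X is an (N, n) array of observations assigned to this Gaussian."""
    N, n = X.shape
    sum_x = X.sum(axis=0)                    # sum_i x_i
    sum_xxT = X.T @ X                        # sum_i x_i x_i^T
    kappa_s = kappa + N
    nu_s = nu + N
    m_s = (kappa * m + sum_x) / kappa_s
    Psi_s = (nu * Psi + kappa * np.outer(m, m) + sum_xxT
             - kappa_s * np.outer(m_s, m_s)) / nu_s
    return kappa_s, m_s, nu_s, Psi_s

# example usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(niw_posterior(kappa=1.0, m=np.zeros(2), nu=5.0, Psi=np.eye(2), X=X))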

Page 25:

Bayesian GMM: Inference Hard Problem with Easy Conditional Subproblems

Bayesian GMM Inference

$$
\begin{aligned}
p\big(\theta \mid (x_i)_{i=1}^N\big) &\propto \Bigg(\prod_{j=1}^K p(\mu_j, \Sigma_j)\Bigg)\, p(\pi)\,
\prod_{i=1}^N \sum_{j=1}^K p(x_i \mid z_i = j, \mu_j, \Sigma_j)\, \pi_j \\
&= \Bigg(\prod_{j=1}^K \mathrm{NIW}(\mu_j, \Sigma_j; \kappa, m, \nu, \Psi)\Bigg)\,
\mathrm{Dir}\big(\pi; (\alpha_j)_{j=1}^K\big)\,
\prod_{i=1}^N \sum_{j=1}^K \mathcal{N}(x_i; \mu_j, \Sigma_j)\, \pi_j
\end{aligned}
$$

$p(\theta \mid (x_i)_{i=1}^N)$ is nasty, so think in terms of $p(\theta, z \mid (x_i)_{i=1}^N)$ instead, and alternate between

$$p\big(\theta \mid z, (x_i)_{i=1}^N\big) \quad \text{and} \quad p\big(z \mid \theta, (x_i)_{i=1}^N\big)$$

EM (for MAP estimates) or hard-assignment EM is one way to go about it, but we can also do Gibbs sampling.

Page 26:

Bayesian GMM: Inference Hard Problem with Easy Conditional Subproblems

Bayesian GMM Inference

Given the labels, z = (zi)Ni=1, we get

$$
p\big(\theta \mid z, (x_i)_{i=1}^N\big) = p(\pi \mid z) \prod_{j=1}^K p\big(\mu_j, \Sigma_j \mid (x_i)_{i: z_i = j}\big)
$$

where

$$p(\pi \mid z) = \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)$$

and

$$p\big(\mu_j, \Sigma_j \mid (x_i)_{i: z_i = j}\big) = \mathrm{NIW}(\mu_j, \Sigma_j; \kappa_j^*, m_j^*, \nu_j^*, \Psi_j^*)$$

where the posterior hyperparameters of Gaussian $j$ are updated using $N_j = \sum_{i=1}^N 1_{z_i = j}$ (instead of $N$), $\sum_{i: z_i = j} x_i$ (instead of $\sum_{i=1}^N x_i$), and $\sum_{i: z_i = j} x_i x_i^T$ (instead of $\sum_{i=1}^N x_i x_i^T$).

We can shoot for any one of:
1) $\arg\max_\theta p\big(\theta \mid z, (x_i)_{i=1}^N\big)$;
2) $E\big(\theta \mid z, (x_i)_{i=1}^N\big)$;
3) $\theta \sim p\big(\theta \mid z, (x_i)_{i=1}^N\big)$

Page 27:

Bayesian GMM: Inference Hard Problem with Easy Conditional Subproblems

Remark (If sampling from the joint is easier than sampling from the marginal)

In general, suppose x and y are some two jointly-distributed RVs and we want to sample x ∼ p(x) but it is hard. If, however, we can sample x, y ∼ p(x, y), we can just discard y to obtain x ∼ p(x).

In our case here, we don’t know how to sample from $p(\theta \mid (x_i)_{i=1}^N)$. Suppose we manage (e.g., through Gibbs sampling) to sample $\theta, z \sim p(\theta, z \mid (x_i)_{i=1}^N)$. Discarding $z$, we get $\theta \sim p(\theta \mid (x_i)_{i=1}^N)$.

Page 28:

Bayesian GMM: Inference Gibbs Sampling in Bayesian GMM

Gibbs Sampling for Bayesian GMM Inference

Here Gibbs sampling targets sampling $(\theta, z) \sim p(\theta, z \mid D)$ by alternating between

$$\theta \sim p(\theta \mid z, D) \quad \text{and} \quad z \sim p(z \mid \theta, D)$$

Page 29:

Bayesian GMM: Inference Gibbs Sampling in Bayesian GMM

Gibbs Sampling for Bayesian GMM Inference

$\theta \sim p(\theta \mid z, D) = p\big(\pi \mid z, (x_i)_{i=1}^N\big) \prod_{j=1}^K p\big(\mu_j, \Sigma_j \mid z, (x_i)_{i=1}^N\big)$.

In effect, this conditional sampling of θ has two independent parts:

$$\text{(1)} \quad \pi \sim p\big(\pi \mid z, (x_i)_{i=1}^N\big) = p(\pi \mid z) = \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)$$

$$\text{(2)} \quad \mu_j, \Sigma_j \sim p\big(\mu_j, \Sigma_j \mid z, (x_i)_{i=1}^N\big) = p\big(\mu_j, \Sigma_j \mid (x_i)_{i: z_i = j}\big) = \mathrm{NIW}(\mu_j, \Sigma_j; \kappa_j^*, m_j^*, \nu_j^*, \Psi_j^*) \quad \forall j \in \{1, \dots, K\}$$

(the second part may also be done in parallel over the K components)
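A hedged sketch of this θ-step (ours, not the course demo). It reuses the niw_posterior helper sketched after the NIW-update slide, and it assumes labels z take values in {1, ..., K}; whether SciPy's random_state accepts a NumPy Generator depends on the installed versions.

# theta-step: sample pi from its posterior Dirichlet and each (mu_j, Sigma_j) from its posterior NIW
import numpy as np
from scipy.stats import invwishart

def sample_theta(X, z, K, alpha, kappa, m, nu, Psi, rng):
    # (1) pi | z  ~  Dir(alpha_1 + N_1, ..., alpha_K + N_K)
    N_counts = np.bincount(z, minlength=K + 1)[1:]
    pi = rng.dirichlet(alpha + N_counts)
    # (2) (mu_j, Sigma_j) | z, X  ~  NIW(kappa_j*, m_j*, nu_j*, Psi_j*), per component
    mus, Sigmas = [], []
    for j in range(1, K + 1):
        Xj = X[z == j]
        if len(Xj):
            k_s, m_s, nu_s, Psi_s = niw_posterior(kappa, m, nu, Psi, Xj)
        else:
            k_s, m_s, nu_s, Psi_s = kappa, m, nu, Psi   # empty component falls back to the prior
        Sigma_j = invwishart.rvs(df=nu_s, scale=nu_s * Psi_s, random_state=rng)
        mu_j = rng.multivariate_normal(m_s, Sigma_j / k_s)
        mus.append(mu_j)
        Sigmas.append(Sigma_j)
    return pi, np.array(mus), np.array(Sigmas)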

Page 30:

Bayesian GMM: Inference Gibbs Sampling in Bayesian GMM

Gibbs Sampling for Bayesian GMM Inference

$z \sim p(z \mid \theta, D) = \prod_{i=1}^N p(z_i \mid \theta, D) = \prod_{i=1}^N p(z_i \mid \theta, x_i)$:

$$
z_i \sim p(z_i \mid \theta, x_i) \propto \sum_{j=1}^K \pi_j\, p(x_i \mid \mu_j, \Sigma_j)\, 1_{z_i = j}
= \sum_{j=1}^K \pi_j\, \mathcal{N}(x_i; \mu_j, \Sigma_j)\, 1_{z_i = j}
$$

In effect:

$$p(z_i = j \mid \theta, D) \propto \pi_j\, \mathcal{N}(x_i; \mu_j, \Sigma_j)$$

All labels can be sampled in parallel.

Don’t forget the numerics:
- the log-sum-exp trick
- slogdet
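A hedged sketch (ours) of the z-step in log space: slogdet for the Gaussian log-density and a log-sum-exp style normalization (here via scipy.special.softmax) before sampling each label; it consumes the (pi, mus, Sigmas) produced by the θ-step sketch above.

# z-step: sample all labels from their categorical conditionals, computed in log space
import numpy as np
from scipy.special import softmax

def log_gaussian(X, mu, Sigma):
    # log N(x_i; mu, Sigma) for every row x_i of X
    n = X.shape[1]
    sign, logdet = np.linalg.slogdet(Sigma)
    diff = X - mu
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + maha)

def sample_z(X, pi, mus, Sigmas, rng):
    K = len(pi)
    # unnormalized log p(z_i = j | theta, x_i) = log pi_j + log N(x_i; mu_j, Sigma_j)
    log_w = np.column_stack([np.log(pi[j]) + log_gaussian(X, mus[j], Sigmas[j])
                             for j in range(K)])
    probs = softmax(log_w, axis=1)                    # log-sum-exp normalization, row-wise
    return np.array([rng.choice(K, p=probs[i]) + 1    # labels in {1, ..., K}
                     for i in range(X.shape[0])])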

Page 31:

Demos: Gibbs-sampling Inference in Bayesian GMM

See demos

Page 32:

Conjugate Priors

Bayesian Parameter Estimation

Forget about mixture models for a second, and consider the following:

$D = (x_i)_{i=1}^N$, $x_i \overset{\text{iid}}{\sim} p(x \mid \theta)$ where $\theta \sim p(\theta)$ is also an RV.

$p(x_i \mid \theta) \triangleq f_x(x_i; \theta)$, where $f_x(\cdot\,; \theta)$ is parametrized by $\theta$.

Likelihood:

$$p(D \mid \theta) = \prod_{i=1}^N f_x(x_i; \theta)$$

$p(\theta) = f_\theta(\theta; \lambda)$ where $f_\theta(\cdot\,; \lambda)$ is parametrized by $\lambda$.

Posterior:

$$
p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{p(D)}
\propto p(\theta)\, p(D \mid \theta) = f_\theta(\theta; \lambda) \prod_{i=1}^N f_x(x_i; \theta)
$$

Page 33:

Conjugate Priors

Bayesian Parameter Estimation

Usually, p(θ|x) does not have the same parametric form as the prior, fθ.

For a certain class of prior-likelihood pairs, the posterior stays within the same family of functions as the prior.

This class of paired distributions is typically referred to as the class of conjugate distributions, and the prior distribution that pairs with a particular data likelihood is called the conjugate prior (to that particular likelihood).

Page 34:

Conjugate Priors

Definition (conjugate priors)

Let p(xi|θ) = fx(xi; θ) and let p(θ) = fθ(θ;λ). If

p(θ|D) = fθ(θ;λ∗)

then $f_\theta$ is said to be conjugate to $f_x$. The posterior hyper-parameters, $\lambda^*$, usually depend on both the observations, D, and the prior hyper-parameters, $\lambda$; i.e.,

λ∗ = λ∗(D, λ)

Page 35:

Conjugate Priors

Definition (exponential family)

Let F be a set of parametrized distributions. F is called an exponential family if all its members have the form

$$p(x_i \mid \theta) = f(x_i)\, g(\theta) \exp\big(\eta(\theta)^T t(x_i)\big)$$

where $\eta(\theta)$ and $t(x_i)$ are vectors of the same dimension as $\theta$.

If $\theta \in \mathbb{R}^d$, then

$$\eta(\theta)^T t(x_i) = \sum_{j=1}^d \eta_j(\theta)\, t_j(x_i)$$

η is called the natural parameter(s).

Page 36:

Conjugate Priors

Exponential Families

The likelihood for iid samples, $(x_i)_{i=1}^N$, is

$$
p\big((x_i)_{i=1}^N \mid \theta\big) = \Bigg(\prod_{i=1}^N f(x_i)\Bigg) g(\theta)^N
\exp\Big(\eta(\theta)^T \underbrace{T\big((x_i)_{i=1}^N\big)}_{\text{suff.\ stats}}\Big)
\propto g(\theta)^N \exp\Big(\eta(\theta)^T T\big((x_i)_{i=1}^N\big)\Big)
$$

where $T\big((x_i)_{i=1}^N\big) \triangleq \sum_{i=1}^N t(x_i)$ are called the sufficient statistics (i.e., sufficient for estimating $\eta$).

$$
\eta(\theta)^T T\big((x_i)_{i=1}^N\big) = \sum_{j=1}^d \eta_j(\theta)\, T_j\big((x_i)_{i=1}^N\big)
= \sum_{j=1}^d \eta_j(\theta) \sum_{i=1}^N t_j(x_i)
$$

Page 37:

Conjugate Priors

Exponential Families

Likelihood:

$$p\big((x_i)_{i=1}^N \mid \theta\big) \propto g(\theta)^N \exp\Big(\eta(\theta)^T T\big((x_i)_{i=1}^N\big)\Big)$$

If the prior is

$$p(\theta) \propto g(\theta)^m \exp\big(\eta(\theta)^T v\big)$$

(for some vector $v$) then the posterior is:

$$p\big(\theta \mid (x_i)_{i=1}^N\big) \propto g(\theta)^{m+N} \exp\Big(\eta(\theta)^T \big(v + T\big((x_i)_{i=1}^N\big)\big)\Big)$$

Thus, this prior is conjugate to this exponential family. Here, $\lambda = (m, v)$ and $\lambda^* = \big(m + N,\ v + T\big((x_i)_{i=1}^N\big)\big)$.
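As a worked check (not on the original slide), the categorical/Dirichlet pair from the earlier slides is an instance of this template. Writing

$$p(z_i \mid \pi) = \exp\Big(\sum_{j=1}^K 1_{z_i = j} \log \pi_j\Big)$$

gives $f(z_i) = 1$, $g(\pi) = 1$, $\eta_j(\pi) = \log \pi_j$, and $t_j(z_i) = 1_{z_i = j}$, so $T_j\big((z_i)_{i=1}^N\big) = N_j$. A prior of the template form with $v_j = \alpha_j - 1$ is $p(\pi) \propto \exp\big(\eta(\pi)^T v\big) = \prod_{j=1}^K \pi_j^{\alpha_j - 1}$, i.e., $\mathrm{Dir}(\pi; \alpha)$ up to normalization, and the template update $v_j^* = v_j + N_j$ reproduces $p(\pi \mid z) = \mathrm{Dir}(\pi; \alpha_1 + N_1, \dots, \alpha_K + N_K)$, matching the earlier slide.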

Page 38:

Conjugate Priors

Exponential Families

The special structure of exponential families makes both maximum-likelihood estimation and Bayesian inference more tractable.

This is due to the fact that finite-dimensional exponential families admit finite-dimensional sufficient statistics (i.e., the dimensionality of the sufficient statistics does not depend on N) and the fact that (in the Bayesian case) they admit conjugate priors.

Page 39:

Conjugate Priors

Exponential Families: Examples We Already Saw

Example

n-dimensional Gaussians (look up the “natural parametrization” of a multivariate Gaussian). The sufficient statistics are:

$$
T\big((x_i)_{i=1}^N\big) = \Bigg(\sum_{i=1}^N x_i,\ \sum_{i=1}^N x_i x_i^T\Bigg)
= \sum_{i=1}^N \underbrace{\big(x_i,\ x_i x_i^T\big)}_{t_i}
$$

where $\mathrm{vectorize}\big(\sum_{i=1}^N x_i x_i^T\big)$ is tied to a vector of length $\frac{1}{2}(n^2 + n)$.
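A tiny sketch (ours, not from the slides) that accumulates these sufficient statistics on synthetic data and checks the length of the symmetric-part vectorization.

# Gaussian sufficient statistics: sum of x_i and sum of x_i x_i^T
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 1000
X = rng.normal(size=(N, n))

T1 = X.sum(axis=0)                 # sum_i x_i            (length n)
T2 = X.T @ X                       # sum_i x_i x_i^T      (n x n, symmetric)
iu = np.triu_indices(n)            # (n^2 + n)/2 free entries of the symmetric matrix
print(T1.shape, T2[iu].shape)      # (3,) (6,)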

Page 40:

Conjugate Priors

Exponential Families: Examples We Already Saw

Example

K-dimensional categorical distributions. The sufficient statistics are:

$$
T\big((z_i)_{i=1}^N\big) = \Bigg(\sum_{i=1}^N 1_{z_i = 1},\ \dots,\ \sum_{i=1}^N 1_{z_i = K}\Bigg)
= \sum_{i=1}^N \underbrace{\big(1_{z_i = 1}, \dots, 1_{z_i = K}\big)}_{t_i}
$$

($t_i$ is a binary vector with a single non-zero entry)

Page 41:

Conjugate Priors

Exponential Families: Examples We Already Saw

Example

Gibbs distributions (MRFs). This is useful, e.g., when we want to
- estimate the temperature parameter in the Ising model
- “learn” (estimate, really) the parameters in the Field-of-Experts model.

Page 42:

Conjugate Priors

Exponential Families

Remark

There is a connection between natural parameters in an exponential family and Differential Geometry through a field called Information Geometry. There, every finite-dimensional exponential family is viewed as a finite-dimensional manifold. So, e.g., we can talk about the geodesic distance – the length of the shortest path on the manifold – between two Gaussians.

Page 43:

Beyond iid: Bayesian Connectivity-constrained superpixels

Digression: Bayesian Connectivity-constrained superpixels

In short: fast inference in a spatio-intensity GMM where the labels are not iid and where a flexible Inverse-Wishart prior does not discourage elongated superpixels too strongly.

See slides/paper from [Freifeld, Li, Fisher, ICIP 2015] at https://www.cs.bgu.ac.il/~orenfr/papers.htm

Page 44:

Beyond iid: Bayesian Connectivity-constrained superpixels

Version Log

28/4/2019, ver 1.00.
