1 dirichlet process mixtures a gentle tutorial graphical models – 10708 khalid el-arini carnegie...
![Page 1: 1 Dirichlet Process Mixtures A gentle tutorial Graphical Models – 10708 Khalid El-Arini Carnegie Mellon University November 6 th, 2006 TexPoint fonts used](https://reader033.vdocuments.us/reader033/viewer/2022051000/56649f4a5503460f94c6be34/html5/thumbnails/1.jpg)
Dirichlet Process Mixtures A gentle tutorial
Graphical Models – 10708
Khalid El-Arini
Carnegie Mellon University
November 6th, 2006
Motivation
We are given a data set, and are told that it was generated from a mixture of Gaussians.
Unfortunately, no one has any idea how many Gaussians produced the data.
What to do?
We can guess the number of clusters, do EM for Gaussian Mixture Models, look at the results, and then try again…
We can do hierarchical agglomerative clustering, and cut the tree at a visually appealing level…
We want to cluster the data in a statistically principled manner, without resorting to hacks.
Review: Dirichlet Distribution
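Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_m)$ with $\theta_k \ge 0$ and $\sum_{k=1}^{m} \theta_k = 1$.
We write: $\Theta \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_m)$, with density
$$P(\theta_1, \ldots, \theta_m) = \frac{\Gamma\left(\sum_{k} \alpha_k\right)}{\prod_{k} \Gamma(\alpha_k)} \prod_{k=1}^{m} \theta_k^{\alpha_k - 1}$$
It is a distribution over possible parameter vectors for a multinomial distribution, and is in fact the conjugate prior for the multinomial.
The Beta distribution is the special case of a Dirichlet for 2 dimensions.
Samples from the distribution lie in the (m-1)-dimensional simplex.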
Thus, it is in fact a “distribution over distributions.”
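As a small illustration of the "distribution over distributions" idea, the sketch below draws one Dirichlet sample by normalizing independent Gamma draws, a standard construction. The function name `sample_dirichlet` and the concentration values are assumptions chosen for the example, not part of the original slides.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma draws."""
    # If g_k ~ Gamma(alpha_k, 1) independently, then g / sum(g) ~ Dirichlet(alphas).
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
theta = sample_dirichlet([2.0, 3.0, 5.0], rng)
# theta is a valid multinomial parameter vector:
# non-negative entries that sum to 1 (up to floating-point error).
print(theta, sum(theta))
```

Each call returns a point on the simplex, so every sample can itself be used as the parameter vector of a multinomial.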
Dirichlet Process
A Dirichlet Process is also a distribution over distributions.
We write:
G ~ DP(α, G0)
G0 is a base distribution
α is a positive scaling parameter
G has the same support as G0
Dirichlet Process
Consider Gaussian G0
G ~ DP(α, G0)
Dirichlet Process
G ~ DP(α, G0)
G0 is continuous, so the probability that any two samples are equal is precisely zero.
However, G is a discrete distribution, made up of a countably infinite number of point masses [Blackwell].
Therefore, there is always a non-zero probability of two samples colliding.
Dirichlet Process
G ~ DP(α1, G0)
G ~ DP(α2, G0)
The value of α determines how close G is to G0: the larger α is, the closer G tends to be to G0.
Sampling from a DP
G ~ DP(α, G0)
Xn | G ~ G for n = 1, …, N (iid given G)
Marginalizing out G introduces dependencies between the Xn variables.
[Graphical model: G → Xn, with a plate over the N variables Xn]
Sampling from a DP
Assume we view these variables in a specific order, and are interested in the behavior of Xn given the previous n - 1 observations.
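Let there be K unique values for the variables: $X_k^*$ for $k \in \{1, \ldots, K\}$. Then [Blackwell]
$$P(X_n \mid X_1, \ldots, X_{n-1}) = \frac{\alpha}{n-1+\alpha}\, G_0(X_n) + \sum_{k=1}^{K} \frac{\mathrm{num}_{n-1}(X_k^*)}{n-1+\alpha}\, \delta(X_n = X_k^*)$$
where $\mathrm{num}_{n-1}(X_k^*)$ is the number of the first n - 1 variables equal to $X_k^*$.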
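The sequential view can be simulated directly with the Blackwell-MacQueen urn: the n-th variable is a fresh draw from G0 with probability α / (n - 1 + α), and otherwise repeats one of the earlier values. The following is a minimal sketch; the function name `polya_urn_draws`, the standard normal choice for G0, and α = 2 are illustrative assumptions.

```python
import random

def polya_urn_draws(n_draws, alpha, base_sampler, rng):
    """Sequentially draw X_1..X_N with G marginalized out."""
    draws = []
    for n in range(1, n_draws + 1):
        # With probability alpha / (n - 1 + alpha), draw a fresh value from G0.
        if rng.random() < alpha / (n - 1 + alpha):
            draws.append(base_sampler(rng))
        else:
            # Otherwise copy a uniformly chosen earlier draw, which selects each
            # unique value with probability proportional to its current count.
            draws.append(rng.choice(draws))
    return draws

# Assumed base distribution G0: a standard normal.
draws = polya_urn_draws(100, 2.0, lambda r: r.gauss(0.0, 1.0), random.Random(0))
print(len(draws), "draws,", len(set(draws)), "unique values")
```

Although G0 is continuous, repeated values appear among the draws, which is the dependency introduced by marginalizing out G.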
Sampling from a DP
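Chain rule:
$$P(X_1, \ldots, X_N) = \prod_{n=1}^{N} P(X_n \mid X_1, \ldots, X_{n-1}) = \underbrace{\frac{\alpha^{K} \prod_{k=1}^{K} (N_k - 1)!}{\prod_{n=1}^{N} (n - 1 + \alpha)}}_{P(\text{partition})} \; \underbrace{\prod_{k=1}^{K} G_0(X_k^*)}_{P(\text{draws})}$$
where $N_k$ is the number of variables equal to the k-th unique value $X_k^*$.
Notice that the above formulation of the joint does not depend on the order in which we consider the variables. We can arrive at a mixture model by assuming exchangeability and applying de Finetti’s Theorem (1935).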
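The order-independence claim can be checked numerically for the partition factor. The sketch below multiplies the sequential (chain rule) probabilities for every ordering of a small label sequence and compares them with the closed form α^K ∏(N_k - 1)! / ∏(n - 1 + α). The helper names and the example partition of sizes 3 and 2 are assumptions made for illustration; the P(draws) factor is left out because it does not depend on order either.

```python
import math
from itertools import permutations

def crp_sequence_prob(labels, alpha):
    """Chain-rule probability of a sequence of cluster labels."""
    counts = {}
    prob = 1.0
    for n, label in enumerate(labels):  # n = number of earlier variables
        if label in counts:
            prob *= counts[label] / (n + alpha)   # repeat an existing value
        else:
            prob *= alpha / (n + alpha)           # introduce a new value
        counts[label] = counts.get(label, 0) + 1
    return prob

def partition_prob(sizes, alpha):
    """Closed form: alpha^K * prod (N_k - 1)! / prod_{n=1..N} (n - 1 + alpha)."""
    n_total = sum(sizes)
    numer = alpha ** len(sizes) * math.prod(math.factorial(s - 1) for s in sizes)
    denom = math.prod(n + alpha for n in range(n_total))
    return numer / denom

orderings = set(permutations("aaabb"))
probs = [crp_sequence_prob(o, 1.5) for o in orderings]
# All orderings give the same value, matching the closed form
# up to floating-point error.
print(max(probs) - min(probs))
print(partition_prob([3, 2], 1.5))
```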
Chinese Restaurant Process
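Let there be K unique values for the variables: $X_k^*$ for $k \in \{1, \ldots, K\}$, with $\mathrm{num}_{n-1}(X_k^*)$ the number of the first n - 1 variables equal to $X_k^*$.
Can rewrite as:
$$X_n \mid X_1, \ldots, X_{n-1} = \begin{cases} X_k^* & \text{with probability } \dfrac{\mathrm{num}_{n-1}(X_k^*)}{n-1+\alpha} \\[2ex] \text{a new draw from } G_0 & \text{with probability } \dfrac{\alpha}{n-1+\alpha} \end{cases}$$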
Chinese Restaurant Process
Consider a restaurant with infinitely many tables, where the Xn’s represent the patrons of the restaurant. From the above conditional probability distribution, we can see that a customer is more likely to sit at a table if there are already many people sitting there. However, with probability proportional to α, the customer will sit at a new table.
This is also known as the “clustering effect,” and can be seen in the setting of social clubs. [Aldous]
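The restaurant metaphor translates almost line for line into a simulation: each arriving customer picks an occupied table with weight equal to its current size, or a new table with weight α. This is a sketch only; the function name `chinese_restaurant_process` and the parameter values are assumptions for the example.

```python
import random

def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one at a time; return table index per customer and table sizes."""
    tables = []   # tables[k] = number of customers currently at table k
    seating = []  # seating[n] = table chosen by customer n
    for _ in range(n_customers):
        # Occupied table k has weight tables[k]; a new table has weight alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # join an existing table
        seating.append(k)
    return seating, tables

seating, tables = chinese_restaurant_process(200, 1.0, random.Random(0))
print("table sizes:", sorted(tables, reverse=True))
```

A typical run shows a few large tables and many small ones, which is the rich-get-richer behavior described above.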
Dirichlet Process Mixture
[Graphical model: α and G0 → G → ηn → yn, with a plate over the N pairs (ηn, yn)]
G: countably infinite number of point masses
ηn: draw N times from G to get parameters for different mixture components
If ηn were drawn from e.g. a Gaussian, no two values would be the same, but since they are drawn from a distribution drawn from a Dirichlet Process, we expect a clustering of the ηn
Number of unique values for ηn = number of mixture components
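To make the generative story concrete, the following sketch samples a data set from a Dirichlet process mixture with G marginalized out. It assumes, purely for illustration, that G0 is a zero-mean Gaussian over component means and that each yn is Gaussian around its ηn; the function name and the `base_std` and `noise_std` parameters are hypothetical.

```python
import random

def sample_dp_mixture(n_points, alpha, rng, base_std=5.0, noise_std=1.0):
    """Generate y_1..y_N from a DP mixture of 1-D Gaussians (G marginalized out)."""
    etas = []         # unique component parameters (here: Gaussian means)
    counts = []       # number of points assigned to each component
    data = []
    assignments = []
    for _ in range(n_points):
        # Reuse an existing eta with weight equal to its count,
        # or create a new one with weight alpha.
        weights = counts + [alpha]
        c = rng.choices(range(len(weights)), weights=weights)[0]
        if c == len(etas):
            etas.append(rng.gauss(0.0, base_std))   # new eta drawn from assumed G0
            counts.append(0)
        counts[c] += 1
        assignments.append(c)
        data.append(rng.gauss(etas[c], noise_std))  # y_n drawn given eta_n
    return data, assignments, etas

data, assignments, etas = sample_dp_mixture(100, 1.0, random.Random(0))
print(len(etas), "mixture components generated", len(data), "points")
```

The number of components is not fixed in advance: it is simply the number of unique η values that happened to be drawn.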
CRP Mixture
Stick Breaking
So far, we’ve just mentioned properties of a distribution G drawn from a Dirichlet Process
In 1994, Sethuraman developed a constructive way of forming G, known as “stick breaking”
Stick Breaking
1. Draw η1* from G0
2. Draw v1 from Beta(1, α)
3. π1 = v1
4. Draw η2* from G0
5. Draw v2 from Beta(1, α)
6. π2 = v2(1 – v1)
…
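The stick-breaking steps continue indefinitely, so any implementation has to stop somewhere. The sketch below truncates after a fixed number of atoms and reports the length of stick left unassigned; the truncation level, the function name `stick_breaking`, and the standard normal G0 are assumptions made for the example.

```python
import random

def stick_breaking(alpha, n_atoms, base_sampler, rng):
    """Truncated stick-breaking: the first n_atoms weights and atoms of G."""
    weights, atoms = [], []
    remaining = 1.0  # length of stick not yet broken off
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)      # v_k ~ Beta(1, alpha)
        weights.append(v * remaining)        # pi_k = v_k * prod_{j<k} (1 - v_j)
        atoms.append(base_sampler(rng))      # eta_k* ~ G0
        remaining *= 1.0 - v
    return weights, atoms, remaining

weights, atoms, leftover = stick_breaking(
    2.0, 50, lambda r: r.gauss(0.0, 1.0), random.Random(0))
# The weights plus the leftover stick always sum to 1.
print(sum(weights) + leftover, "leftover mass:", leftover)
```

The returned weights and atoms describe G as a weighted collection of point masses, with weight πk placed at ηk*.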
Formal Definition
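Let α be a positive, real-valued scalar.
Let G0 be a non-atomic probability distribution over support set A.
We say G ~ DP(α, G0) if, for all natural numbers k and k-partitions {A1, …, Ak} of A,
$$(G(A_1), \ldots, G(A_k)) \sim \mathrm{Dirichlet}(\alpha G_0(A_1), \ldots, \alpha G_0(A_k))$$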
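As a worked special case, take k = 2 and partition the support into a set and its complement. Writing that set as A_1 (a symbol chosen here only for the example), the Dirichlet reduces to a Beta, and the standard Beta moments give:

```latex
\begin{align*}
G(A_1) &\sim \mathrm{Beta}\bigl(\alpha\, G_0(A_1),\; \alpha\,(1 - G_0(A_1))\bigr) \\
\mathbb{E}[G(A_1)] &= G_0(A_1) \\
\mathrm{Var}[G(A_1)] &= \frac{G_0(A_1)\,\bigl(1 - G_0(A_1)\bigr)}{\alpha + 1}
\end{align*}
```

So G agrees with G0 on average, and the variance shrinks as α grows, which is the sense in which α controls how close G is to G0.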
Inference in a DPM
EM is generally used for inference in a mixture model, but G is nonparametric, making EM difficult
Markov Chain Monte Carlo techniques [Neal 2000]
Variational Inference [Blei and Jordan 2006]
[Graphical model: α and G0 → G → ηn → yn, with a plate over the N pairs (ηn, yn)]
Gibbs Sampling [Neal 2000]
Algorithm 1:
Define Hi to be the single-observation posterior.
We marginalize out G from our model, and sample each ηn given everything else.
[Graphical model: α and G0 → G → ηn → yn, with a plate over the N pairs (ηn, yn)]
SLOW TO CONVERGE!
Gibbs Sampling [WAS 22-DAL 19]
Gibbs Sampling [Neal 2000]
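Algorithm 2:
[Graphical models: the DPM (α and G0 → G → ηn → yn, with a plate over N), shown next to a second representation with nodes α, cn and yn (plate over N), G0, and ηc (plate over ∞)] [Grenager 2005]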
Gibbs Sampling [Neal 2000]
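Algorithm 2 (cont.):
We sample from the distribution over an individual cluster assignment cn given yn and all the other cluster assignments.
1. Initialize cluster assignments c1, …, cN
2. For i=1,…,N, draw ci from:
$$P(c_i = c \mid c_{-i}, y_i, \eta) \propto \begin{cases} \dfrac{n_{-i,c}}{N-1+\alpha}\, F(y_i \mid \eta_c) & \text{if } c = c_j \text{ for some } j \neq i \\[2ex] \dfrac{\alpha}{N-1+\alpha} \displaystyle\int F(y_i \mid \eta)\, dG_0(\eta) & \text{otherwise} \end{cases}$$
where $n_{-i,c}$ is the number of $j \neq i$ with $c_j = c$, and $F(y \mid \eta)$ is the likelihood of an observation under component parameter η [Neal 2000].
3. For all c, draw ηc | yi (for all i such that ci = c)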
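The three steps can be sketched for a simple conjugate case. The code below assumes 1-D data with yn ~ N(ηc, sigma2) and G0 = N(0, tau2), so that both the marginal likelihood of a point under G0 and the posterior of ηc are Gaussian and available in closed form. These modeling choices, the function names, and the default values are assumptions for illustration; this is not the implementation from [Neal 2000].

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior_draw(ys, rng, sigma2, tau2):
    """Draw eta | ys for y ~ N(eta, sigma2) with prior eta ~ N(0, tau2)."""
    post_var = 1.0 / (1.0 / tau2 + len(ys) / sigma2)
    post_mean = post_var * sum(ys) / sigma2
    return rng.gauss(post_mean, math.sqrt(post_var))

def gibbs_dpm(y, alpha, n_sweeps, rng, sigma2=1.0, tau2=100.0):
    n = len(y)
    c = [0] * n                                    # step 1: start in one cluster
    eta = {0: posterior_draw(y, rng, sigma2, tau2)}
    for _ in range(n_sweeps):
        for i in range(n):                         # step 2: resample each c_i
            others = c[:i] + c[i + 1:]
            labels = sorted(set(others))
            # Existing cluster: (points in it, excluding i) times likelihood of y_i.
            weights = [others.count(k) * normal_pdf(y[i], eta[k], sigma2)
                       for k in labels]
            # New cluster: alpha times the marginal likelihood of y_i under G0.
            weights.append(alpha * normal_pdf(y[i], 0.0, sigma2 + tau2))
            pick = rng.choices(range(len(weights)), weights=weights)[0]
            if pick < len(labels):
                c[i] = labels[pick]
            else:
                c[i] = max(eta) + 1                # fresh label for the new cluster
                eta[c[i]] = posterior_draw([y[i]], rng, sigma2, tau2)
        # Step 3: resample eta_c for every occupied cluster.
        eta = {k: posterior_draw([y[j] for j in range(n) if c[j] == k],
                                 rng, sigma2, tau2) for k in set(c)}
    return c, eta

data_rng = random.Random(0)
y = ([data_rng.gauss(-10.0, 0.5) for _ in range(15)]
     + [data_rng.gauss(10.0, 0.5) for _ in range(15)])
c, eta = gibbs_dpm(y, 1.0, 20, random.Random(1))
print(len(set(c)), "clusters in the final state")
```

The common factor 1 / (N - 1 + α) is dropped because the weights are normalized when sampling. The returned state is a single sample from the chain; in practice one would collect many sweeps after a burn-in period.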
Conclusion
We now have a statistically principled mechanism for solving our original problem.
This was intended as a general and fairly shallow overview of Dirichlet Processes.
Acknowledgments
Many thanks go to David Blei.
Some material for this presentation was inspired by slides from Teg Grenager and Zoubin Ghahramani.
References
Blei, David M. and Michael I. Jordan. “Variational inference for Dirichlet process mixtures.” Bayesian Analysis 1(1), 2006.
Neal, Radford M. “Markov chain sampling methods for Dirichlet process mixture models.” Journal of Computational and Graphical Statistics 9, 2000, 249-265.
Ghahramani, Zoubin. “Non-parametric Bayesian Methods.” UAI Tutorial, July 2005.
Grenager, Teg. “Chinese Restaurants and Stick Breaking: An Introduction to the Dirichlet Process.”
Blackwell, David and James B. MacQueen. “Ferguson Distributions via Polya Urn Schemes.” The Annals of Statistics 1(2), 1973, 353-355.
Ferguson, Thomas S. “A Bayesian Analysis of Some Nonparametric Problems.” The Annals of Statistics 1(2), 1973, 209-230.