Bayesian Nonparametrics: Random Partitions
Yee Whye Teh, Gatsby Computational Neuroscience Unit, UCL
Machine Learning Summer School Singapore, June 2011
Previous Tutorials and Reviews
• Mike Jordan’s tutorial at NIPS 2005.
• Zoubin Ghahramani’s tutorial at UAI 2005.
• Peter Orbanz’ tutorial at MLSS 2009 (videolectures)
• My own tutorials at MLSS 2007, 2009 (videolectures) and elsewhere.
• Introduction to Dirichlet process [Teh 2010], nonparametric Bayes [Orbanz & Teh 2010, Gershman & Blei 2011], hierarchical Bayesian nonparametric models [Teh & Jordan 2010].
• Bayesian nonparametrics book [Hjort et al 2010].
• This tutorial: focus on the role of random partitions.
Bayesian Modelling
Probabilistic Machine Learning
• Machine Learning is all about data.
• Stochastic process
• Noisily observed
• Partially observed
• Probability theory is a language to express all these notions.
• Probabilistic models
• Allow us to reason coherently about such data.
• Give us powerful computational tools to implement such reasoning on computers.
Probabilistic Models
• Data: x1,x2,....,xn.
• Latent variables: y1,y2,...,yn.
• Parameter: θ.
• A probabilistic model is a parametrized joint distribution over variables.
• Inference, of latent variables given observed data:
P (y1, . . . , yn | x1, . . . , xn, θ) = P (x1, . . . , xn, y1, . . . , yn | θ) / P (x1, . . . , xn | θ)
Probabilistic Models
• Learning, typically by maximum likelihood:
θML = argmaxθ P (x1, . . . , xn | θ)
• Prediction:
P (xn+1, yn+1 | x1, . . . , xn, θ)
• Classification:
argmaxc P (xn+1 | θ^c)
• Visualization, interpretation, summarization.
Graphical Models
[Figures: two example networks: the burglar alarm network (Earthquake, Burglar → Alarm) and the chest-clinic "Asia" network (Visit to Asia → Tuberculosis; Smoking → Lung Cancer, Bronchitis; Tuberculosis or Lung Cancer → Dyspnea, Abnormal X-ray)]
• Nodes = variables
• Edges = dependencies
• Lack of edges = conditional independencies
Bayesian Machine Learning
• Prior distribution: P (θ)
• Posterior distribution (inference and learning):
P (y1, . . . , yn, θ | x1, . . . , xn) = P (x1, . . . , xn, y1, . . . , yn | θ) P (θ) / P (x1, . . . , xn)
• Prediction:
P (xn+1 | x1, . . . , xn) = ∫ P (xn+1 | θ) P (θ | x1, . . . , xn) dθ
• Classification:
P (xn+1 | x^c_1, . . . , x^c_n) = ∫ P (xn+1 | θ^c) P (θ^c | x^c_1, . . . , x^c_n) dθ^c
Bayesian Models
• Latent variables and parameters are not distinguished.
• Important operations are marginalization and posterior computation.
[Figures: plate diagrams for a finite mixture model, a hidden Markov model, and a latent Dirichlet allocation topic model]
Blind Deconvolution
[Fergus et al 2006, Levin et al 2009]
Pros and Cons
• Maximum likelihood: have to guard against overfitting.
• Bayesian methods do not fit any parameters (no overfitting).
• Bayesian posterior distribution captures more information from data.
• Bayesian learning is coherent (no Dutch book).
• Prior distributions and model structures can be a useful way to introduce prior knowledge.
• Bayesian inference is often more complex.
• Bayesian inference is often more computationally intensive.
• Powerful computational tools have been developed for Bayesian inference.
• Often easier to think about the “ideal” and “right way” to perform learning and inference before considering how to do it efficiently.
Bayesian Nonparametrics
[Hjort et al 2010]
Model Selection
• In non-Bayesian methods model selection is needed to prevent overfitting and underfitting.
• In Bayesian contexts model selection is also useful as a way of determining the appropriateness of various models given data.
• Marginal likelihood:
P (x | Mk) = ∫ P (x | θk, Mk) P (θk | Mk) dθk
• Model selection:
argmaxMk P (x | Mk)
• Model averaging:
P (Mk | x) ∝ P (x | Mk) P (Mk)
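For conjugate models the marginal likelihood integral has a closed form, which makes the model-selection rule above easy to try out. A minimal sketch, assuming a Bernoulli model with a Beta(a, b) prior; the example, function name, and parameter values are mine, not from the slides:

```python
import math

def log_marginal_beta_bernoulli(x, a=1.0, b=1.0):
    """Closed-form log P(x | M) for a Bernoulli model with a Beta(a, b)
    prior on its parameter: the integral of P(x|theta)P(theta) over theta
    is a ratio of Beta functions."""
    n1 = sum(x)          # number of ones
    n0 = len(x) - n1     # number of zeros
    # log B(a + n1, b + n0) - log B(a, b)
    return (math.lgamma(a + n1) + math.lgamma(b + n0) - math.lgamma(a + b + len(x))
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

# Compare two models on the same data: a uniform Beta(1,1) prior versus a
# prior concentrated near 0.5, Beta(50,50).
x = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]   # mostly ones
m1 = log_marginal_beta_bernoulli(x, 1, 1)
m2 = log_marginal_beta_bernoulli(x, 50, 50)
print(m1, m2)  # on this skewed data the flexible prior has the higher marginal likelihood
```

Model selection then amounts to picking the model with the larger log marginal likelihood, exactly the argmax rule above.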
Model Selection
• Model selection is often very computationally intensive.
• But reasonable and proper Bayesian methods should not overfit anyway.
• Idea: use one large model, and be Bayesian so will not overfit.
• Bayesian nonparametric idea: using a very large Bayesian model avoids both overfitting and underfitting.
Large Function Spaces
• Large function spaces.
• More straightforward to infer the infinite-dimensional objects themselves.
Structural Learning
• Learning structures.
• Bayesian prior over combinatorial structures.
• Nonparametric priors sometimes end up simpler than parametric priors.
[Figure: a learned taxonomy over everyday objects (animals, vegetables, fruits, tools, vehicles). Adams et al AISTATS 2010, Blundell et al UAI 2010]
Novel Models with Useful Properties
• Many Bayesian nonparametric models have interesting and useful properties:
• Projectivity, exchangeability.
• Zipf, Heaps and other power laws (Pitman-Yor, 3-parameter IBP).
• Flexible ways of building complex models (hierarchical nonparametric models, dependent Dirichlet processes).
• Techniques developed for Bayesian nonparametric models are applicable to many stochastic processes not traditionally considered as Bayesian nonparametric models.
Are Nonparametric Models Nonparametric?
• Nonparametric just means not parametric: cannot be described by a fixed set of parameters.
• Nonparametric models still have parameters, they just have an infinite number of them.
• No free lunch: cannot learn from data unless you make assumptions.
• Nonparametric models still make modelling assumptions, they are just less constrained than the typical parametric models.
• Models can be nonparametric in one sense and parametric in another: semiparametric models.
Random Partitions in Bayesian Nonparametrics
Overview
• Random partitions, through their relationship to Dirichlet and Pitman-Yor processes, play very important roles in Bayesian nonparametrics.
• Introduce Chinese restaurant processes via finite mixture models.
• Projectivity, exchangeability, de Finetti’s Theorem, and Kingman’s paintbox construction.
• Two parameter extension and the Pitman-Yor process.
• Infinite mixture models and inference in them.
Random Partitions
Partitions
• A partition ϱ of a set S is:
• A family of disjoint, non-empty subsets of S whose union is S.
• S = {Alice, Bob, Charles, David, Emma, Florence}.
• ϱ = { {Alice, David}, {Bob, Charles, Emma}, {Florence} }.
• Denote the set of all partitions of S as PS.
Partitions in Model-based Clustering
• Given a dataset S, partition it into clusters of similar items.
• Each cluster c ∈ ϱ is described by a model F (x | θc) parameterized by θc.
• Bayesian approach: introduce priors over ϱ and θc; compute the posterior over both.
• To do so we will need to work with random partitions.
From Finite Mixture Models to Chinese Restaurant Processes
Finite Mixture Models
• Explicitly allow only K clusters in partition:
• Each cluster c has parameter θc.
• Each data item i is assigned to cluster c with mixing probability πc.
• Gives a random partition with at most K clusters.
• Priors on the other parameters:
π | α ∼ Dirichlet(α)
θc | H ∼ H, for c = 1, . . . , K
Finite Mixture Models
• Dirichlet distribution on the K-dimensional probability simplex { π | Σc πc = 1 }:
P (π | α) = [Γ(Σc αc) / Πc Γ(αc)] Π_{c=1..K} πc^(αc−1)
• with Γ(α) = ∫0^∞ x^(α−1) e^(−x) dx and αc > 0.
• Standard distribution on probability vectors, due to conjugacy with the multinomial.
• Full model specification:
π | α ∼ Dirichlet(α)
θc | H ∼ H
zi | π ∼ Multinomial(π)
xi | zi, θzi ∼ F (· | θzi)
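The generative process just listed can be sampled line by line. A sketch assuming Gaussian choices for H and F (my choice, not specified on the slides), with the Dirichlet draw built from normalized Gamma variates:

```python
import random

def sample_finite_mixture(n, K, alpha=1.0, seed=0):
    """Sample from the finite mixture generative process:
    pi ~ Dirichlet(alpha/K, ..., alpha/K), theta_c ~ H, z_i ~ Multinomial(pi),
    x_i ~ F(.|theta_{z_i}).  Here H = N(0, 3^2) and F = N(theta, 1) are
    assumed illustrative choices."""
    rng = random.Random(seed)
    # Dirichlet via normalized Gamma draws
    g = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
    total = sum(g)
    pi = [v / total for v in g]
    theta = [rng.gauss(0.0, 3.0) for _ in range(K)]      # theta_c ~ H
    z, x = [], []
    for _ in range(n):
        zi = rng.choices(range(K), weights=pi)[0]        # z_i ~ Multinomial(pi)
        z.append(zi)
        x.append(rng.gauss(theta[zi], 1.0))              # x_i ~ F(.|theta_{z_i})
    return pi, theta, z, x

pi, theta, z, x = sample_finite_mixture(n=100, K=5)
print(len(set(z)))  # number of occupied clusters, at most K
```

The assignments z define a random partition of the n items with at most K clusters, as the slide states.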
Dirichlet Distribution
[Figure: Dirichlet densities P (π | α) on the 3-simplex for α = (1, 1, 1), (2, 2, 2), (2, 2, 5), (5, 5, 5), (2, 5, 5), (0.7, 0.7, 0.7)]
Dirichlet-Multinomial Conjugacy
• Joint distribution over z and π:
P (π | α) × Π_{i=1..n} P (zi | π) = [Γ(Σc αc) / Πc Γ(αc)] Π_{c=1..K} πc^(αc−1) × Π_{c=1..K} πc^(nc)
• where nc = #{i : zi = c}.
• Posterior distribution:
P (π | z, α) = [Γ(n + Σc αc) / Πc Γ(nc + αc)] Π_{c=1..K} πc^(nc+αc−1)
• Marginal distribution:
P (z | α) = [Γ(Σc αc) / Πc Γ(αc)] × [Πc Γ(nc + αc) / Γ(n + Σc αc)]
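The marginal formula can be sanity-checked numerically: averaging Π_c π_c^{n_c} over draws π ~ Dirichlet(α) should recover P(z | α). A sketch (the check itself is mine, not from the slides; the α and count values are illustrative):

```python
import math
import random

def log_marginal_z(counts, alpha):
    """Closed-form log P(z | alpha) from the slide: ratio of Gamma
    functions obtained by integrating pi out of the Dirichlet-multinomial."""
    n, A = sum(counts), sum(alpha)
    out = math.lgamma(A) - sum(math.lgamma(a) for a in alpha)
    out += sum(math.lgamma(nc + ac) for nc, ac in zip(counts, alpha))
    out -= math.lgamma(n + A)
    return out

# Monte Carlo: E_pi[ prod_c pi_c^{n_c} ] under pi ~ Dirichlet(alpha)
rng = random.Random(1)
alpha, counts = [2.0, 1.0], [3, 1]      # K = 2, n = 4, cluster sizes (3, 1)
trials, acc = 200_000, 0.0
for _ in range(trials):
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    p0, p1 = g[0] / total, g[1] / total
    acc += p0 ** counts[0] * p1 ** counts[1]
mc = acc / trials
exact = math.exp(log_marginal_z(counts, alpha))
print(mc, exact)  # the two numbers should agree closely
```

For these values the exact marginal is 1/15, and the Monte Carlo average converges to it.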
Induced Distribution over Partitions
• P(z|α) describes a partition of the data set into clusters, together with a labelling of each cluster by a mixture component index:
P (z | α) = [Γ(Σc αc) / Πc Γ(αc)] × [Πc Γ(nc + αc) / Γ(n + Σc αc)]
• It induces a distribution over partitions of the data set (without labelling).
• Start by supposing αc = α/K.
• The partition ϱ has |ϱ| = k ≤ K clusters, each of which can be assigned one of K labels (without replacement), so:
P (z | α) = [1 / (K(K − 1) · · · (K − k + 1))] P (ϱ | α)
• After some algebra:
P (ϱ | α) = K(K − 1) · · · (K − k + 1) × [Γ(α) / Γ(n + α)] Π_{c∈ϱ} [Γ(|c| + α/K) / Γ(α/K)]
Chinese Restaurant Process
• Taking K → ∞, we get a proper distribution over partitions without a limit on the number of clusters:
P (ϱ | α) = K(K − 1) · · · (K − k + 1) [Γ(α) / Γ(n + α)] Π_{c∈ϱ} [Γ(|c| + α/K) / Γ(α/K)] → α^|ϱ| [Γ(α) / Γ(n + α)] Π_{c∈ϱ} Γ(|c|)
• (using Γ(α/K) · α/K = Γ(1 + α/K)).
• This is the Chinese restaurant process (CRP).
• We write ϱ ~ CRP([n], α) if ϱ ∈ P[n] is CRP distributed.
Chinese Restaurant Processes
[Aldous 1985, Pitman 2006]
Chinese Restaurant Process
• Each customer comes into the restaurant and sits at a table:
P (sit at table c) = nc / (α + Σ_{c∈ϱ} nc)
P (sit at new table) = α / (α + Σ_{c∈ϱ} nc)
• Customers correspond to elements of the set S, and tables to clusters in the partition ϱ.
[Figure: tables seating customers {1, 3, 6}, {2, 7}, {4, 5, 8}, {9}]
• Multiplying all the terms together, we get the overall probability of ϱ:
P (ϱ | α) = α^|ϱ| [Γ(α) / Γ(n + α)] Π_{c∈ϱ} Γ(|c|)
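The seating rule translates directly into a simulator. A minimal sketch (function name and parameter values are mine, chosen for illustration):

```python
import random

def crp(n, alpha, seed=0):
    """Simulate a Chinese restaurant process partition of [n] using the
    seating rule: customer i+1 joins table c with probability proportional
    to its size n_c, or opens a new table with probability proportional to
    alpha."""
    rng = random.Random(seed)
    tables = []                        # tables[c] = list of customers at table c
    for i in range(n):
        weights = [len(t) for t in tables] + [alpha]
        c = rng.choices(range(len(tables) + 1), weights=weights)[0]
        if c == len(tables):
            tables.append([i])         # open a new table
        else:
            tables[c].append(i)
    return tables

partition = crp(1000, alpha=2.0, seed=42)
print(len(partition))  # the number of tables grows like alpha * log(n)
```

Running this repeatedly gives samples from CRP([n], α); the table-size profile (a few large tables, many small ones) is the "rich get richer" behaviour implied by the seating probabilities.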
Exchangeability
• A distribution over partitions PS is exchangeable if it is invariant to permutations of S. For example:
• P(ϱ = {{1,3,6},{2,7},{4,5,8},{9}}) =
• P(ϱ = {{σ(1), σ(3), σ(6)},{σ(2), σ(7)},{σ(4), σ(5), σ(8)},{σ(9)}})
• where S = [9] = {1,...,9}, and σ is a permutation of [9].
• The Chinese restaurant process satisfies exchangeability:
• The finite mixture model is exchangeable (iid given parameters).
• The probability of ϱ under the CRP does not depend on the identities of elements of S.
• An exchangeable model is one that does not depend on the (arbitrary) way data items are indexed.
Consistency and Projectivity
• Let ϱ be a partition of S, and let S’ ⊂ S be a subset. The projection of ϱ onto S’ is the partition of S’ induced by ϱ:
• PROJ(ϱ, S’) = { c ∩ S’ | c ∩ S’ ≠ ∅, c ∈ ϱ }
• A sequence of distributions P1, P2, ... over P[1], P[2], ... is projective, or consistent, if the distribution on P[n] induced by Pm for m > n is Pn:
• Pm({ϱm : PROJ(ϱm, [n]) = ϱn}) = Pn(ϱn)
• The Chinese restaurant process is projective, since the finite mixture model is projective and the CRP is defined sequentially.
• A projective model is one that does not change when more data items are introduced (and can be learned sequentially in a self-consistent manner).
Projective and Exchangeable Partitions
• Projective and exchangeable random partitions over [n] can be extended to distributions over partitions of N in a unique manner.
• Can we characterize all random exchangeable projective partitions of N?
Projective and Exchangeable Partitions
• Let G be a distribution, with possibly both atomic and smooth components:
G = G0 + Σ_{c=1..∞} πc δθc
• Let ϕi ~ G for each i independently, and define an exchangeable random partition ϱ based on the unique values of {ϕi}.
• If G is random, the distribution over ϱ is given by:
P (ϱ) = ∫ P (ϱ | G) P (G) dG
• All exchangeable random partitions admit such a representation.
• Note that the integral is written as if both ϱ and G have densities; this is not (typically) true.
Kingman’s Paint-box Construction
• A paint-box consists of a number of colours, with colour c picked with probability πc. There is also a special colour 0 with probability π0 which looks different each time it is picked.
• For each i ∈ N pick a colour from the paint-box.
• This defines a partition ϱ of N according to the different colours picked.
• Given a distribution over the probabilities {πc} we get an exchangeable random ϱ.
[Kingman JRSSB 1975]
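The paint-box can be simulated for any fixed colour probabilities. A sketch (the function and the particular probabilities are mine, for illustration); the leftover mass plays the role of the special colour 0, which yields a fresh singleton on every draw:

```python
import random

def paintbox_partition(n, pi, seed=0):
    """Kingman paint-box: colour c is picked with probability pi[c]; the
    remaining mass pi0 = 1 - sum(pi) is the special colour that looks
    different each time, so every such draw forms its own singleton."""
    rng = random.Random(seed)
    clusters = {}                     # colour -> members
    fresh = 0                         # counter for singleton (colour-0) draws
    for i in range(n):
        u, acc, colour = rng.random(), 0.0, None
        for c, p in enumerate(pi):
            acc += p
            if u < acc:
                colour = c
                break
        if colour is None:            # landed in the leftover mass pi0
            colour = ('fresh', fresh)
            fresh += 1
        clusters.setdefault(colour, []).append(i)
    return list(clusters.values())

part = paintbox_partition(20, pi=[0.5, 0.3], seed=1)
print(sorted(len(c) for c in part))
```

Putting a prior over the probabilities {πc} and averaging over it gives an exchangeable random partition, which is exactly Kingman's characterization.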
De Finetti’s Theorem
• Let x1, x2, x3, . . . be an infinitely exchangeable (i.e. projective and exchangeable) sequence of random variables:
P (x1, . . . , xn) = P (xσ(1), . . . , xσ(n))
• for all n and permutations σ of [n].
• Then there is a latent variable G such that:
P (x1, . . . , xn) = ∫ P (G) Π_{i=1..n} P (xi | G) dG
[De Finetti 1931, Kallenberg 2005]
Dirichlet Process
• Since the Chinese restaurant process is exchangeable, we can define an infinitely exchangeable sequence as follows:
• Sample ϱ ~ CRP(N, α).
• For each c ∈ ϱ, sample yc ~ H.
• For i = 1, 2, ..., set xi = yc where i ∈ c.
[Figure: cluster parameters y1, . . . , y4 assigned to data items x1, . . . , x9]
• The resulting de Finetti measure is the Dirichlet process with parameters α and H, written DP(α, H).
[Ferguson AoS 1973, Blackwell & MacQueen AoS 1973]
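The construction above can be run sequentially: grow a CRP partition and draw one parameter per new table. A sketch assuming H = N(0, 1) (an illustrative choice of base distribution, not from the slides):

```python
import random

def dp_sequence(n, alpha, seed=0):
    """Generate x_1, ..., x_n by the slide's construction: a CRP partition
    grown one customer at a time, with a fresh draw y_c ~ H for each new
    table and x_i set to the y_c of customer i's table.  H = N(0, 1) is an
    assumed choice."""
    rng = random.Random(seed)
    values, counts = [], []            # y_c per table, table sizes
    x = []
    for i in range(n):
        weights = counts + [alpha]
        c = rng.choices(range(len(counts) + 1), weights=weights)[0]
        if c == len(counts):           # new table: draw its parameter
            counts.append(1)
            values.append(rng.gauss(0.0, 1.0))     # y_c ~ H
        else:
            counts[c] += 1
        x.append(values[c])
    return x

x = dp_sequence(500, alpha=5.0, seed=7)
print(len(set(x)))  # number of distinct values = number of CRP tables
```

The resulting sequence has repeated values with CRP-distributed multiplicities; marginalizing the construction is what yields the DP(α, H) de Finetti measure.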
Two-parameter Chinese Restaurant Processes
• The two-parameter Chinese restaurant process CRP([n],d,α) is a distribution over P[n]: (0 ≤ d < 1, α > −d)
[Figure: tables with customers {1,3,6}, {2,7}, {4,5,8}, {9}.]
P(sit at table c) = (nc − d) / (α + Σ_{c′∈ϱ} nc′)
P(sit at new table) = (α + d|ϱ|) / (α + Σ_{c′∈ϱ} nc′)
P(ϱ) = ( [α + d]_d^{|ϱ|−1} / [α + 1]_1^{n−1} ) ∏_{c∈ϱ} [1 − d]_1^{|c|−1},  where [z]_b^m = z(z + b) ⋯ (z + (m − 1)b)
[Perman et al 1992, Pitman & Yor AoP 1997, Goldwater et al NIPS 2006, Teh ACL 2006]
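The seating rule can be simulated directly; a sketch (the discount d > 0 gives the power-law growth in the number of tables that distinguishes the two-parameter process from the one-parameter CRP):

```python
import random

def sample_crp2(n, d, alpha, rng):
    """Seat n customers by the two-parameter CRP([n], d, alpha) rule:
    join table c with prob (n_c - d) / (alpha + sum_c n_c),
    open a new table with prob (alpha + d * #tables) / (alpha + sum_c n_c).
    Returns the list of table sizes."""
    assert 0.0 <= d < 1.0 and alpha > -d
    counts = []
    for i in range(n):          # current total number seated is i
        r = rng.random() * (alpha + i)
        for c, nc in enumerate(counts):
            r -= nc - d
            if r < 0:
                counts[c] += 1
                break
        else:
            # leftover mass is exactly alpha + d * len(counts)
            counts.append(1)
    return counts

rng = random.Random(0)
print(len(sample_crp2(1000, d=0.0, alpha=1.0, rng=rng)))   # O(log n) tables
print(len(sample_crp2(1000, d=0.9, alpha=1.0, rng=rng)))   # O(n^0.9) tables
```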
• These are also projective and exchangeable distributions.
• The de Finetti measure is the Pitman-Yor process, a generalization of the Dirichlet process.
Indian Buffet Processes
• Mixture models fundamentally use very simple representations of data:
• Each data item belongs to just one cluster.
• Better representation if we allow multiple clusters per data item.
• Indian buffet processes (IBPs).
[Figure: a customers × dishes binary matrix (IBP) alongside a customers × tables assignment (CRP).]
[Griffiths & Ghahramani 2006]
Infinite Mixture Models
[Neal JCGS 2000, Rasmussen NIPS 2000, Ishwaran & Zarepour 2002]
Infinite Mixture Models
• Derived CRPs from finite mixture models.
• Taking K → ∞ gives infinite mixture models.
• Expressed using CRPs:
[Figure: graphical models of the finite mixture (π, zi, θc for c = 1…K) and the CRP representation (ϱ, θc for c = 1…∞).]
ϱ|α ∼ CRP([n], α)
θc|H ∼ H for c ∈ ϱ
xi|θc ∼ F(·|θc) for c ∋ i
Number of Clusters
• Only a small number of mixture components used to model data.
• Prior over the number of clusters important in understanding the effect of the CRP prior (and through the infinite limit, the Dirichlet prior).
• Prior expectation and variance are:
E[k|n] = Σ_{i=1}^n α/(α + i − 1) = α(ψ(α + n) − ψ(α)) ∈ O(α log(1 + n/α))
V[k|n] = α(ψ(α + n) − ψ(α)) + α²(ψ′(α + n) − ψ′(α)) ∈ O(α log(1 + n/α))
where ψ(α) = (∂/∂α) log Γ(α)
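The expectation is easy to evaluate numerically without special functions, since customer i starts a new table independently with probability α/(α + i − 1); a quick sketch:

```python
def expected_num_clusters(alpha, n):
    """E[k|n] = sum_{i=1}^n alpha / (alpha + i - 1), which equals
    alpha * (psi(alpha + n) - psi(alpha)) and grows like alpha * log(1 + n/alpha)."""
    return sum(alpha / (alpha + i - 1.0) for i in range(1, n + 1))

# For alpha = 1 this is the harmonic number H_n: roughly log(n) clusters.
print(expected_num_clusters(1.0, 100))   # ≈ 5.19
print(expected_num_clusters(10.0, 100))
```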
Number of Clusters
[Figure: prior distribution of the number of clusters k for n = 100, shown for α = 1, 10, 100.]
• The prior number of clusters depends strongly on α and has small variance, so it is in fact quite informative.
• If uncertain about the number of clusters, place a prior on α and marginalize it out.
Gibbs Sampling
• Iteratively resample clustering of each data item.
• Probability of cluster of data item i is:
• Complex and expensive to compute.
ϱi = cluster of i
P(ϱ) = ∏_{i=1}^n P(ϱi | ϱ1:i−1)
P(ϱi | ϱ\i, x, θ) ∝ P(ϱi | ϱ1:i−1) ∏_{j=i+1}^n P(ϱj | ϱ1:j−1) × P(xi | ϱi, x\i, θϱi)
Gibbs Sampling
• Iteratively resample clustering of each data item.
• Make use of exchangeability of ϱ to treat data item i as the last customer entering restaurant.
• Simpler conditionals.
P(ϱi | ϱ\i, x, θ) ∝ P(ϱi | ϱ\i) P(xi | ϱi, x\i, θϱi)
P(ϱi | ϱ\i) = |c| / (n − 1 + α) if ϱi = c (an existing cluster),  α / (n − 1 + α) if ϱi = new cluster
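A sketch of one such Gibbs update; `log_pred` is a hypothetical hook returning log P(xi | cluster members), to be supplied by whatever likelihood model is in use (integer cluster labels are assumed):

```python
import math
import random

def resample_cluster(i, z, alpha, log_pred, rng):
    """One Gibbs update for item i. By exchangeability, item i is treated
    as the last customer: P(join c) is proportional to |c|, P(new cluster)
    to alpha, each weighted by the predictive likelihood of x_i."""
    clusters = {}
    for j in range(len(z)):
        if j != i:
            clusters.setdefault(z[j], []).append(j)
    labels, logw = [], []
    for label, members in clusters.items():
        labels.append(label)
        logw.append(math.log(len(members)) + log_pred(i, members))
    labels.append(max(clusters, default=-1) + 1)   # fresh integer label
    logw.append(math.log(alpha) + log_pred(i, []))
    m = max(logw)                                  # normalize stably in log space
    w = [math.exp(v - m) for v in logw]
    r = rng.random() * sum(w)
    for label, wv in zip(labels, w):
        r -= wv
        if r < 0:
            z[i] = label
            return z
    z[i] = labels[-1]
    return z
```

With a constant `log_pred` this reduces to sampling from the CRP prior conditional above.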
Parameter Updates
• Conditional distribution for α is:
• Various updates: auxiliary variables Gibbs, Metropolis-Hastings, slice sampling.
• Typical prior for α is gamma distribution.
P(α) ∝ α^{a−1} e^{−bα}
Γ(α)/Γ(n + α) = (1/Γ(n)) ∫₀¹ t^{α−1} (1 − t)^{n−1} dt
P(α|ϱ) ∝ P(α) P(ϱ|α) = P(α) α^{|ϱ|} (Γ(α)/Γ(n + α)) ∏_{c∈ϱ} Γ(|c|)
Parameter Updates
• Conditional distributions for the θc are:
• If H is conjugate to P(x|θ), θc’s can be marginalized out.
• This is called a collapsed Gibbs sampler.
P(θc) ∝ H(θc) ∏_{i∈c} F(xi|θc)
P(xi | ϱi = c, x\i) = ∫ P(xi|θc) P(θc|xc) dθc
  = ∫ F(xi|θc) ∏_{j∈c} F(xj|θc) H(θc) dθc / ∫ ∏_{j∈c} F(xj|θc) H(θc) dθc
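As an illustration of the conjugate marginalization, a minimal sketch with F Bernoulli and H = Beta(a, b) — an assumed conjugate pair, not one fixed by the slides — where the ratio of integrals collapses to a counting rule:

```python
import math

def log_pred_bernoulli(x_new, x_members, a=1.0, b=1.0):
    """Collapsed predictive P(x_new | x_members) for one cluster, with
    F(x|theta) = Bernoulli(theta) and H = Beta(a, b). Integrating theta_c
    out gives P(x = 1 | data) = (a + #ones) / (a + b + #observations)."""
    p1 = (a + sum(x_members)) / (a + b + len(x_members))
    return math.log(p1 if x_new == 1 else 1.0 - p1)

# A cluster that has seen [1, 1, 1] makes x = 1 likely: (1 + 3) / (2 + 3)
print(math.exp(log_pred_bernoulli(1, [1, 1, 1])))   # ≈ 0.8
```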
Random Trees in Bayesian Nonparametrics
Overview
• Bayesian nonparametric learning of trees and hierarchical partitions.
• View of rooted trees as sequences of partitions.
• Fragmentations and coagulations.
• Unifying view of various Bayesian nonparametric models for random trees.
From Random Partitions to Random Trees
Trees
[Figure: a taxonomy tree over everyday concepts, with subtrees for animals (duck, chicken, seal, dolphin, mouse, …, tiger, lion), vegetables and fruit (lettuce, cucumber, carrot, …, apple, grape, strawberry, …), tools (drill, clamp, pliers, …, hammer, shovel, rake), and vehicles (yacht, ship, submarine, …, bike, tricycle, jeep).]
Bayesian Inference for Trees
• Computational and statistical methods for constructing trees:
• Algorithmic, not model-based.
• Maximum likelihood
• Maximum parsimony
• Bayesian inference: introduce a prior over trees and compute the posterior:
P(T|x) ∝ P(T)P(x|T)
• Projectivity and exchangeability lead to Bayesian nonparametric models for P(T).
Trees as Sequences of Partitions
Fragmenting Partitions
• Sequence of finer and finer partitions.
• Each cluster fragments until all clusters contain only 1 data item.
• Can define a distribution over trees using a Markov chain of fragmenting partitions, with absorbing state 0S (partition where all data items are in their own clusters).
[Figure: a tree over items 1–9 drawn as a chain of partitions ϱ1, …, ϱ5, each refining the last: {1,…,9} splits into {1,2,5,6}, {3,7,8,9}, {4}, and so on down to singletons.]
Coagulating Partitions
• Sequence of coarser and coarser partitions.
• Each cluster formed by coagulating smaller clusters until only 1 left.
• Can define a distribution over trees by using a Markov chain of coagulating partitions, with absorbing state 1S (partition where all data items are in one cluster).
[Figure: the same tree over items 1–9, now read bottom-up as a chain of coarsening partitions ϱ5, …, ϱ1.]
Random Fragmentations and Random Coagulations
[Bertoin 2006]
Coagulation and Fragmentation Operators
[Figure: a partition ϱ1 of {1,…,9} with clusters A = {1,3,6}, B = {2,7}, C = {4,5,8}, D = {9}. Coagulating ϱ1 by a partition ϱ2 of its clusters (here merging A and B) yields the coarser partition {1,2,3,6,7}, {4,5,8}, {9}; fragmenting that partition by per-cluster partitions F1, F2, F3 yields a finer one.]
Random Coagulations
• Let ϱ1 ∈ P[n] and ϱ2 ∈ Pϱ1.
• Denote coagulation of ϱ1 by ϱ2 as coag(ϱ1, ϱ2).
• Write C | ϱ1 ~ COAG(ϱ1,d,α) if C = coag(ϱ1, ϱ2) with
• ϱ2 | ϱ1 ~ CRP(ϱ1,d,α).
Random Fragmentations
• Let C ∈ P[n] and for each c ∈ C let Fc ∈ Pc.
• Denote fragmentation of C by {Fc} as frag(C,{Fc}).
• Write ϱ1 | C ~ FRAG(C,d,α) if ϱ1 = frag(C,{Fc}) with
• Fc ~ CRP(c,d,α) iid.
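The two operators are simple to state on explicit partitions (lists of clusters); a sketch matching the example figure, where ϱ2 and the Fc are given as index partitions:

```python
def coag(rho1, rho2):
    """coag(rho1, rho2): rho2 partitions the clusters of rho1 (blocks of
    indices into rho1); each block of rho2 merges into one cluster."""
    return [sorted(x for k in block for x in rho1[k]) for block in rho2]

def frag(C, Fs):
    """frag(C, Fs): Fs[k] partitions cluster C[k] (blocks of positions
    within C[k]); each cluster is replaced by its fragments."""
    return [sorted(C[k][p] for p in block)
            for k in range(len(C)) for block in Fs[k]]

# The running example: rho1 has clusters A, B, C, D; coagulating {A,B}
# and then fragmenting the merged cluster recovers rho1.
rho1 = [[1, 3, 6], [2, 7], [4, 5, 8], [9]]          # A, B, C, D
C = coag(rho1, [[0, 1], [2], [3]])                  # merge A and B
print(C)                                            # [[1, 2, 3, 6, 7], [4, 5, 8], [9]]
F = frag(C, [[[0, 2, 3], [1, 4]], [[0, 1, 2]], [[0]]])
print(sorted(F) == sorted(rho1))                    # True
```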
Random Trees and Random Hierarchical Partitions
Nested Chinese Restaurant Processes
[Blei et al NIPS 2004, JACM 2010]
Nested Chinese Restaurant Processes
• Start with the null partition ϱ0 = {[n]}.
• For each level l = 1,2,...,L:
• ϱl = FRAG(ϱl-1,0,αl)
• Fragmentations in different clusters (branches of the hierarchical partition) operate independently.
• Nested Chinese restaurant processes (nCRP) define a Markov chain of partitions, each of which is exchangeable.
• Can be used to define an infinitely exchangeable sequence, with de Finetti measure being the nested Dirichlet process (nDP).
[Figure: chain of successively finer partitions ϱ0 → ϱ1 → ϱ2 → ⋯ → ϱL.]
[Rodriguez et al JASA 2008]
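A sketch of the nCRP as repeated CRP fragmentation (one-parameter CRPs, i.e. discount 0, are assumed here):

```python
import random

def crp_partition(items, alpha, rng):
    """Partition `items` with a one-parameter CRP(alpha) seating process."""
    blocks = []
    for k, it in enumerate(items):
        r = rng.random() * (k + alpha)
        for b in blocks:
            r -= len(b)
            if r < 0:
                b.append(it)
                break
        else:
            blocks.append([it])
    return blocks

def nested_crp(n, alphas, rng):
    """nCRP: start from rho_0 = {[n]} and, at each level l, fragment every
    cluster independently with CRP(alpha_l), i.e. rho_l = FRAG(rho_{l-1}, 0, alpha_l)."""
    levels = [[list(range(n))]]
    for alpha in alphas:
        levels.append([blk for cluster in levels[-1]
                       for blk in crp_partition(cluster, alpha, rng)])
    return levels

rng = random.Random(0)
hierarchy = nested_crp(30, alphas=[1.0, 1.0, 1.0], rng=rng)
```

Each `levels[l]` refines `levels[l-1]` by construction, so the chain of partitions is the tree.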
Chinese Restaurant Franchise
[Teh et al JASA 2006]
Chinese Restaurant Franchise
• For a simple linear hierarchy (restaurants linearly chained together), the Chinese restaurant franchise (CRF) is a sequence of coagulations:
• At the lowest level L+1, we start with the trivial partition ϱL+1 = {{1},{2},...,{n}}.
• For each level l = L,L-1,...,1:
• ϱl = COAG(ϱl+1,0,αl)
• This is also a Markov chain of partitions.
[Figure: chain of coarsening partitions ϱL+1 → ϱL → ⋯ → ϱ1, drawn alongside the hierarchy of measures G0, G1, G2, …, GL.]
Hierarchical Dirichlet/Pitman-Yor Processes
• Each partition in the Chinese restaurant franchise is again exchangeable.
• The corresponding de Finetti measure is a Hierarchical Dirichlet process (HDP).
• Generalizable to tree-structured hierarchies and hierarchical Pitman-Yor processes.
• The CRF has rarely been used as a model of hierarchical partitions; typically it serves as a convenient representation for inference in the HDP and HPYP.
[Figure: hierarchy of measures G0, G1, …, GL with associated partitions ϱ1, …, ϱL+1.]
Gl | Gl−1 ∼ DP(αl, Gl−1)
[Teh et al JASA 2006, Goldwater et al NIPS 2006, Teh ACL 2006]
Continuum Limit of Partition-valued Markov Chains
Trees with Infinitely Many Levels
• Random trees described so far all consist of a finite number of levels L.
• We can be “nonparametric” about the number of levels of random trees.
• Allow a finite amount of change even with an infinite number of levels, by decreasing the change per level.
[Figure: chain ϱ0 → ϱ1 → ⋯ → ϱL with an amount of change Δ/L per level.]
Dirichlet Diffusion Trees
[Neal BA 2003]
Dirichlet Diffusion Trees
• The Dirichlet diffusion tree (DDT) hierarchical partitioning structure can be derived from the continuum limit of a nCRP:
• Start with the null partition ϱ0 = {[n]}.
• For each time t, define
• ϱt+dt = FRAG(ϱt,0,a(t)dt)
• The continuum limit of the Markov chain of partitions becomes a continuous time partition-valued Markov process: a fragmentation process.
Kingman’s Coalescent
• Taking the continuum limit of the one-parameter (Markov chain) CRF leads to another partition-valued Markov process: Kingman’s coalescent.
• Start with the trivial partition ϱ0 = {{1},{2},...,{n}}.
• For each time t < 0:
• ϱt-dt = COAG(ϱt,0,a(t)/dt)
• This is the simplest example of a coalescence or coagulation process.
[Kingman SP&A 1982, JoAP 1982]
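The n-coalescent is easy to simulate directly: with k blocks, wait an Exp(k(k−1)/2) time and merge a uniformly chosen pair. A minimal sketch:

```python
import random

def kingman_coalescent(n, rng):
    """Simulate Kingman's n-coalescent, returning the merge events as
    (time, merged_block) pairs; after n - 1 merges one block remains."""
    blocks = [frozenset([i]) for i in range(n)]
    t, events = 0.0, []
    while len(blocks) > 1:
        k = len(blocks)
        t += rng.expovariate(k * (k - 1) / 2.0)   # total pair-merge rate
        i, j = rng.sample(range(k), 2)            # uniformly chosen pair
        merged = blocks[i] | blocks[j]
        blocks = [b for m, b in enumerate(blocks) if m not in (i, j)]
        blocks.append(merged)
        events.append((t, merged))
    return events

rng = random.Random(0)
events = kingman_coalescent(10, rng)
# E[depth of the tree] = sum_{k=2}^{n} 2 / (k (k - 1)) = 2 (1 - 1/n)
```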
Kingman’s Coalescent
• Derived from the Wright-Fisher model of population genetics.
• Model of the genealogies of n haploid individuals among a size N population.
• Gives a tree-structured genealogy because each individual assumed to have one parent.
[Figure: a Wright-Fisher population of size N; each individual at generation −t/N picks one parent from the previous generation, so lineages coalesce backwards in time into a tree.]
Fragmentation-Coagulation Processes
[Berestycki 2004]
Overview
• Duality between Pitman-Yor coagulations and fragmentations.
• Using duality to construct a stationary reversible Markov chain over partitions.
Duality of Coagulation and Fragmentation
• The following statements are equivalent:
(I) ϱ2 ∼ CRP([n], d2, αd2) and ϱ1 | ϱ2 ∼ CRP(ϱ2, d1, α)
(II) C ∼ CRP([n], d1d2, αd2) and Fc | C ∼ CRP(c, d2, −d1d2) for all c ∈ C
[Figure: the coagulation/fragmentation picture relating ϱ1, ϱ2, the coagulated partition C and the fragments F1, F2, F3.]
[Pitman 1999]
Markov Chain over Partitions
• Defines a Markov chain over partitions ϱ0 → ωε → ϱε → ω2ε → ϱ2ε → ⋯
• Each transition is a fragmentation followed by a coagulation:
ωε = FRAG(ϱ0, 0, Rε),  ϱε = COAG(ωε, µ/Rε, 0),  ω2ε = FRAG(ϱε, 0, Rε),  ϱ2ε = COAG(ω2ε, µ/Rε, 0),  …
with marginals ϱtε ∼ CRP([n], µ, 0) and ωtε ∼ CRP([n], µ, Rε).
Stationary Distribution
• Stationary distribution is a CRP with parameters μ and 0.
Exchangeability and Projectivity
• Each ϱₜ is exchangeable, so the whole Markov chain is an exchangeable process.
• Projectivity of the Chinese restaurant process extends to the Markov chain as well.

[Figure: the same chain of alternating fragmentations and coagulations]
![Page 113: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/113.jpg)
Reversibility of Markov Chain
• The Markov chain is reversible.
• Coagulation and fragmentation are duals of each other.

[Figure: the same chain run backwards, with the COAG(ωₜ, µ/Rε, 0) and FRAG(ϱₜ, 0, Rε) steps swapping roles]
![Page 114: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/114.jpg)
Continuum Limit
• Taking ε → 0 yields a continuous-time Markov process over partitions, an exchangeable fragmentation-coalescence process [Berestycki 2004].
• At each time, at most one coagulation (merging two blocks) or one fragmentation (splitting a block into two) occurs.

[Figure: the discrete chain whose continuum limit is taken]
![Page 115: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/115.jpg)
Conditional Distribution of a Trajectory
[Figure: conditional distribution of customer 9's trajectory given the others on {1,…,8}: customer 9 joins an existing block a with probability |a|/(n+µ) and starts a new block with probability µ/(n+µ); at a fragmentation it follows each fragment with probability proportional to its size, e.g. |{2,5,8}|/|{2,4,5,7,8}| and |{4,7}|/|{2,4,5,7,8}|; new events occur at rates |Π₍[n],t₎|R/µ and R|a|]
• This process is reversible.
![Page 116: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/116.jpg)
Dirichlet Diffusion Trees and Coalescents
• The rate of fragmentation is the same as for Dirichlet diffusion trees with constant fragmentation rate.
• The rate of coagulation is the same as for Kingman's coalescent.
• Reversibility means that the Dirichlet diffusion tree is precisely the time reversal of Kingman's coalescent.
![Page 117: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/117.jpg)
Sequence Memoizer
[Teh ACL 2006, Wood & Teh ICML 2009, Wood et al CACM 2011]
![Page 118: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/118.jpg)
Overview
• Application of hierarchical Pitman-Yor processes to n-gram language models:
• Hierarchical Bayesian modelling allows for sharing of statistical strength and improved parameter estimation.
• Pitman-Yor processes have power-law properties more suitable for modelling linguistic data.
• Generalization to ∞-gram (non-Markov) language models.
• Use of the fragmentation-coagulation duality to reduce computational costs.
![Page 119: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/119.jpg)
Hierarchical Dirichlet/Pitman-Yor Processes

• Each partition in the Chinese restaurant franchise is again exchangeable.
• The corresponding de Finetti measure is the hierarchical Dirichlet process (HDP).
• Generalizable to tree-structured hierarchies and hierarchical Pitman-Yor processes.
• The CRF has rarely been used as a model of hierarchical partitions; typically it is only used as a convenient representation for inference in the HDP and HPYP.

[Figure: chain of random measures G₀ → G₁ → G₂ → … → G_L with induced partitions ϱ₁, …, ϱ_{L+1}]

G_l | G_{l−1} ∼ DP(α_l, G_{l−1})
[Teh et al JASA 2006, Goldwater et al NIPS 2006, Teh ACL 2006]
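As a concrete (hypothetical) illustration of how partitions compose up a chain-structured franchise, the blocks at each level can be treated as the customers of the next CRP up; all function names and parameters below are mine, not from the tutorial:

```python
import random

def crp_partition(items, alpha, rng):
    """Partition `items` by the one-parameter Chinese restaurant process:
    join table c w.p. |c|/(n+alpha), open a new table w.p. alpha/(n+alpha)."""
    tables = []
    for x in items:
        n = sum(len(t) for t in tables)
        r = rng.random() * (n + alpha)
        acc = 0.0
        for t in tables:
            acc += len(t)
            if r < acc:
                t.append(x)
                break
        else:
            tables.append([x])
    return tables

def crf_chain(n_customers, alphas, rng):
    """Sample nested partitions of a chain-structured Chinese restaurant
    franchise: each level's blocks become single customers one level up."""
    partitions = []
    customers = list(range(n_customers))
    for alpha in alphas:
        rho = crp_partition(customers, alpha, rng)
        partitions.append(rho)
        customers = list(range(len(rho)))  # one customer per block, next level up
    return partitions

rng = random.Random(0)
parts = crf_chain(20, alphas=[1.0, 1.0, 1.0], rng=rng)
sizes = [len(p) for p in parts]
# The number of blocks can only shrink (weakly) as we move up the hierarchy.
assert all(a >= b for a, b in zip(sizes, sizes[1:]))
```

Marginalizing the composed partitions corresponds to the coagulation view used later in the tutorial.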
![Page 120: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/120.jpg)
n-gram Language Models
![Page 121: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/121.jpg)
Sequence Models for Language and Text
• Probabilistic models for sequences of words and characters, e.g.
south, parks, road
s, o, u, t, h, _, p, a, r, k, s, _, r, o, a, d
• n-gram language models are high-order Markov models of such discrete sequences:

P(sentence) = ∏_i P(word_i | word_{i−N+1} … word_{i−1})
![Page 122: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/122.jpg)
n-gram Language Models

• High-order Markov models:

P(sentence) = ∏_i P(word_i | word_{i−N+1} … word_{i−1})

• Large vocabulary size means naïvely estimating the parameters of this model from data counts is problematic for N > 2:

P_ML(word_i | word_{i−N+1} … word_{i−1}) = C(word_{i−N+1} … word_i) / C(word_{i−N+1} … word_{i−1})

• Naïve priors/regularization fail as well: most parameters have no associated data.
• Solutions: smoothing; hierarchical Bayesian models.
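The maximum-likelihood estimator above is just a ratio of counts, which makes the sparsity problem easy to see in code. A minimal sketch (names mine):

```python
from collections import Counter

def ml_ngram(corpus, N):
    """Maximum-likelihood N-gram model: P(w | context) = C(context, w) / C(context)."""
    ngrams, contexts = Counter(), Counter()
    for i in range(len(corpus) - N + 1):
        ngrams[tuple(corpus[i:i + N])] += 1
        contexts[tuple(corpus[i:i + N - 1])] += 1
    def prob(word, context):
        c = contexts[tuple(context)]
        return ngrams[tuple(context) + (word,)] / c if c else 0.0
    return prob

corpus = "south parks road is south of south parks".split()
p = ml_ngram(corpus, N=2)
assert p("parks", ["south"]) == 2 / 3  # "south" has 3 successors, 2 of them "parks"
assert p("road", ["university"]) == 0.0  # unseen context: no data at all
```

Any context unseen in training gets probability zero for every word, which is exactly the problem smoothing addresses next.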
![Page 123: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/123.jpg)
Smoothing in Language Models
• Smoothing is a way of dealing with data sparsity by combining large and small models together.
• Combines the expressive power of large models with the better estimation of small models (cf. the bias-variance trade-off).

P_smooth(word_i | word_{i−N+1}^{i−1}) = Σ_{n=1}^{N} λ(n) Q_n(word_i | word_{i−n+1}^{i−1})

P_smooth(road | south parks) = λ(3) Q_3(road | south parks) + λ(2) Q_2(road | parks) + λ(1) Q_1(road | ∅)
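A sketch of the interpolated estimator above with fixed weights λ(n); real smoothers such as Kneser-Ney use context-dependent weights, so this is only illustrative (all names are mine):

```python
from collections import Counter

def smoothed_lm(corpus, N, lambdas):
    """Interpolate ML estimates of orders 1..N; lambdas[n-1] is the weight
    lambda(n) of the order-n model Q_n."""
    assert len(lambdas) == N and abs(sum(lambdas) - 1.0) < 1e-9
    grams = [Counter() for _ in range(N + 1)]  # grams[k]: k-gram counts; grams[0][()] = corpus size
    for i in range(len(corpus)):
        for k in range(N + 1):
            if i + k <= len(corpus):
                grams[k][tuple(corpus[i:i + k])] += 1
    vocab = len(grams[1])
    def prob(word, context):
        total = 0.0
        for n in range(1, N + 1):
            ctx = tuple(context[len(context) - n + 1:]) if n > 1 else ()
            denom = grams[n - 1][ctx]
            # back off to uniform if this context was never seen
            q = grams[n][ctx + (word,)] / denom if denom else 1.0 / vocab
            total += lambdas[n - 1] * q
        return total
    return prob

corpus = "south parks road to parks road".split()
p = smoothed_lm(corpus, N=2, lambdas=[0.3, 0.7])
# 0.3 * Q1(road) + 0.7 * Q2(road | parks) = 0.3 * (2/6) + 0.7 * 1.0 = 0.8
assert abs(p("road", ["parks"]) - 0.8) < 1e-9
```

Even a word never seen after a given context gets non-zero probability from the lower-order terms, which is the point of the interpolation.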
![Page 124: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/124.jpg)
Smoothing in Language Models
• Interpolated and modified Kneser-Ney are best.
[Chen & Goodman 1999]
![Page 125: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/125.jpg)
Hierarchical Pitman-Yor Language Models
![Page 126: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/126.jpg)
Hierarchical Bayesian Models

• Hierarchical Bayesian modelling is an important overarching theme in modern statistics [Gelman et al 1995, James & Stein 1961].
• In machine learning, hierarchical Bayesian models have been used for multitask learning, transfer learning, learning-to-learn and domain adaptation.

[Figure: graphical model with a shared parameter φ₀ and group parameters φ₁, φ₂, φ₃, each generating observations x_{ji}, i = 1…n_j]
![Page 127: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/127.jpg)
Context Tree

• The contexts of the conditional probabilities are naturally organized using a tree.
• Smoothing makes the conditional probabilities of neighbouring contexts more similar.
• Later words in the context are more important in predicting the next word.

[Figure: context tree with root ∅, child "parks", its children "south parks", "to parks", "university parks", and leaves "along south parks", "at south parks"]

P_smooth(road | south parks) = λ(3) Q_3(road | south parks) + λ(2) Q_2(road | parks) + λ(1) Q_1(road | ∅)
![Page 128: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/128.jpg)
Hierarchical Bayes on Context Tree

• Parametrize the conditional probabilities of the Markov model:

P(word_i = w | word_{i−N+1}^{i−1} = u) = G_u(w),  G_u = [G_u(w)]_{w∈vocabulary}

• G_u is a probability vector associated with context u.
• [MacKay and Peto 1994].

[Figure: context tree over the vectors G_∅, G_parks, G_south parks, G_to parks, G_university parks, G_along south parks, G_at south parks]
![Page 129: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/129.jpg)
Hierarchical Dirichlet Language Models

• What is P(G_u | G_pa(u))? [MacKay and Peto 1994] proposed using the standard Dirichlet distribution over probability vectors.
• We will use Pitman-Yor processes instead [Perman, Pitman and Yor 1992], [Pitman and Yor 1997], [Ishwaran and James 2001].

| T | N−1 | IKN | MKN | HDLM |
|---|-----|-----|-----|------|
| 2×10⁶ | 2 | 148.8 | 144.1 | 191.2 |
| 4×10⁶ | 2 | 137.1 | 132.7 | 172.7 |
| 6×10⁶ | 2 | 130.6 | 126.7 | 162.3 |
| 8×10⁶ | 2 | 125.9 | 122.3 | 154.7 |
| 10×10⁶ | 2 | 122.0 | 118.6 | 148.7 |
| 12×10⁶ | 2 | 119.0 | 115.8 | 144.0 |
| 14×10⁶ | 2 | 116.7 | 113.6 | 140.5 |
| 14×10⁶ | 1 | 169.9 | 169.2 | 180.6 |
| 14×10⁶ | 3 | 106.1 | 102.4 | 136.6 |
![Page 130: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/130.jpg)
Two-parameter Chinese Restaurant Processes

• The two-parameter Chinese restaurant process CRP([n], d, α) is a distribution over 𝒫_[n] (0 ≤ d < 1, α > −d):

[Figure: restaurant with tables seating customers {1,3,6}, {2,7}, {4,5,8}, {9}]

P(sit at table c) = (n_c − d) / (α + Σ_{c′∈ϱ} n_{c′})

P(sit at new table) = (α + d|ϱ|) / (α + Σ_{c′∈ϱ} n_{c′})

P(ϱ) = ([α + d]_d^{|ϱ|−1} / [α + 1]_1^{n−1}) ∏_{c∈ϱ} [1 − d]_1^{|c|−1},  where [z]_b^m = z(z + b)⋯(z + (m−1)b)

[Perman et al 1992, Pitman & Yor AoP 1997]
![Page 131: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/131.jpg)
Two-parameter Chinese Restaurant Processes

• The two-parameter Chinese restaurant process CRP([n], d, α) is a distribution over 𝒫_[n] (0 ≤ d < 1, α > −d).
• These are also projective and exchangeable distributions.

[Perman et al 1992, Pitman & Yor AoP 1997]
![Page 132: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/132.jpg)
Two-parameter Chinese Restaurant Processes

• The two-parameter Chinese restaurant process CRP([n], d, α) is a distribution over 𝒫_[n] (0 ≤ d < 1, α > −d).
• These are also projective and exchangeable distributions.
• The de Finetti measure is the Pitman-Yor process, which is a generalization of the Dirichlet process.

[Perman et al 1992, Pitman & Yor AoP 1997]
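The seating rule is easy to simulate, and doing so shows the qualitative difference between d = 0 (the Dirichlet process, with logarithmically many tables) and d > 0 (power-law growth in the number of tables). A minimal sketch, with all names mine:

```python
import random

def crp2(n, d, alpha, rng):
    """Sample table sizes of a partition of [n] from the two-parameter
    CRP([n], d, alpha)."""
    assert 0 <= d < 1 and alpha > -d
    tables = []  # tables[c] = n_c, the number of customers at table c
    for i in range(n):
        total = i + alpha            # alpha + sum_c n_c, since i customers are seated
        r = rng.random() * total
        acc = 0.0
        new_table = True
        for c, nc in enumerate(tables):
            acc += nc - d            # P(sit at table c) ∝ n_c − d
            if r < acc:
                tables[c] += 1
                new_table = False
                break
        if new_table:                # P(sit at new table) ∝ alpha + d·|tables|
            tables.append(1)
    return tables

rng = random.Random(1)
t_dp = crp2(5000, d=0.0, alpha=1.0, rng=rng)  # Dirichlet process case
t_py = crp2(5000, d=0.5, alpha=1.0, rng=rng)  # Pitman-Yor case
assert sum(t_dp) == sum(t_py) == 5000
assert len(t_py) > len(t_dp)  # many more tables with a positive discount
```

Sorting the table sizes of the d = 0.5 run gives the characteristic pattern of the next slide: a few very large tables and a long tail of singletons.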
![Page 133: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/133.jpg)
Power Law Properties
• Chinese restaurant process:

p(sit at table c) ∝ n_c − d
p(sit at new table) ∝ α + d|ϱ|

• Small number of large clusters; large number of small clusters.
• Customers = word instances, tables = word types.
• This is more suitable for languages than Dirichlet distributions.

[Pitman 2006, Goldwater et al NIPS 2006, Teh ACL 2006]
![Page 134: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/134.jpg)
Power Law Properties
![Page 135: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/135.jpg)
Power Law Properties
![Page 136: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/136.jpg)
Hierarchical Pitman-Yor Language Models

• Parametrize the conditional probabilities of the Markov model:

P(word_i = w | word_{i−N+1}^{i−1} = u) = G_u(w),  G_u = [G_u(w)]_{w∈vocabulary}

• G_u is a probability vector associated with context u.
• Place a Pitman-Yor process prior on each G_u.

[Figure: context tree over G_∅, G_parks, G_south parks, G_to parks, G_university parks, G_along south parks, G_at south parks]
![Page 137: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/137.jpg)
Hierarchical Pitman-Yor Language Models

• Significantly improved on the hierarchical Dirichlet language model.
• Results are better than Kneser-Ney smoothing, the state-of-the-art language models.
• The similarity of perplexities is not a surprise: Kneser-Ney can be derived as a particular approximate inference method in this model.

| T | N−1 | IKN | MKN | HDLM | HPYLM |
|---|-----|-----|-----|------|-------|
| 2×10⁶ | 2 | 148.8 | 144.1 | 191.2 | 144.3 |
| 4×10⁶ | 2 | 137.1 | 132.7 | 172.7 | 132.7 |
| 6×10⁶ | 2 | 130.6 | 126.7 | 162.3 | 126.4 |
| 8×10⁶ | 2 | 125.9 | 122.3 | 154.7 | 121.9 |
| 10×10⁶ | 2 | 122.0 | 118.6 | 148.7 | 118.2 |
| 12×10⁶ | 2 | 119.0 | 115.8 | 144.0 | 115.4 |
| 14×10⁶ | 2 | 116.7 | 113.6 | 140.5 | 113.2 |
| 14×10⁶ | 1 | 169.9 | 169.2 | 180.6 | 169.3 |
| 14×10⁶ | 3 | 106.1 | 102.4 | 136.6 | 101.9 |
![Page 138: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/138.jpg)
Non-Markov Language Models
![Page 139: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/139.jpg)
Markov Models for Language and Text
• Usually makes a Markov assumption to simplify the model:

P(south parks road) ≈ P(south) · P(parks | south) · P(road | south parks)

• Language models: usually Markov models of order 2-4 (3-5-grams).
• How do we determine the order of our Markov models?
• Is the Markov assumption a reasonable assumption?
• Be nonparametric about the Markov order...
![Page 140: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/140.jpg)
Non-Markov Models for Language and Text

• Model the conditional probabilities of each possible word occurring after each possible context (of unbounded length).
• Use a hierarchical Pitman-Yor process prior to share information across all contexts.
• The hierarchy is infinitely deep.
• Sequence memoizer.

[Figure: infinitely deep context tree with nodes G_∅, G_parks, G_south parks, G_to parks, G_university parks, G_along south parks, G_at south parks, G_meet at south parks, …]
![Page 141: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/141.jpg)
Model Size: Infinite → O(T²)

• The sequence memoizer model is very large (actually, infinite).
• Given a training sequence (e.g. o,a,c,a,c), most of the model can be ignored (integrated out), leaving a finite number of nodes in the context tree.
• But there are still O(T²) nodes in the context tree.

[Figure: context tree for oacac with base distribution H and nodes G[ ], G[o], G[a], G[c], G[oa], G[ac], G[ca], G[oac], G[aca], G[cac], G[oaca], G[acac], G[oacac]]
![Page 142: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/142.jpg)
Model Size: Infinite → O(T²) → 2T

• The sequence memoizer model is very large (actually, infinite).
• Given a training sequence (e.g. o,a,c,a,c), most of the model can be ignored (integrated out), leaving a finite number of nodes in the context tree.
• But there are still O(T²) nodes in the context tree.
• Integrating out non-branching, non-leaf nodes leaves O(T) nodes.
• The conditional distributions are still Pitman-Yor due to the closure property.

[Figure: compressed context tree with nodes G[ ], G[o], G[a], G[oa], G[ac], G[oac], G[oaca], G[oacac], with collapsed edges labelled by strings such as ac and oac]
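The node counts on this slide can be checked mechanically: the contexts correspond (after reversal) to suffixes of the training string, the distinct substrings give the O(T²) trie, and collapsing non-branching internal nodes gives the compressed tree. A sketch (mine, not the actual sequence memoizer implementation):

```python
def suffix_trie_nodes(s):
    """Trie over all suffixes of s; its nodes are exactly the distinct
    substrings of s (plus the empty root), i.e. O(T^2) of them."""
    nodes = {""}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            nodes.add(s[i:j])
    return nodes

def compressed_count(nodes):
    """Nodes kept after collapsing non-branching internal nodes:
    the root, plus every node that is a leaf or has >= 2 children."""
    children = {n: set() for n in nodes}
    for n in nodes:
        if n:
            children[n[:-1]].add(n)   # parent of a trie node is its prefix
    return sum(1 for n in nodes if n == "" or len(children[n]) != 1)

s = "oacac"[::-1]  # reverse: memoizer contexts grow backwards from the current symbol
trie = suffix_trie_nodes(s)
assert len(trie) == 13                     # the 13 G[...] nodes on the slide
assert compressed_count(trie) <= 2 * len(s)  # the 2T bound after compression
```

Path compression here is the same idea as in suffix trees, which is where the 2T bound on the number of retained nodes comes from.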
![Page 143: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/143.jpg)
Duality of Coagulation and Fragmentation

• The following statements are equivalent:

[Figure: as before, ϱ₁ coagulated into ϱ₂ via C, and recovered by the fragments F₁, F₂, F₃]

(I) ϱ₂ ∼ CRP([n], d₂, αd₂) and ϱ₁ | ϱ₂ ∼ CRP(ϱ₂, d₁, α)

(II) C ∼ CRP([n], d₁d₂, αd₂) and F_c | C ∼ CRP(c, d₂, −d₁d₂) for all c ∈ C
![Page 144: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/144.jpg)
Closure under Marginalization

• Marginalizing out internal Pitman-Yor processes is equivalent to coagulating the corresponding Chinese restaurant processes.
• The fragmentation-coagulation duality means that the coagulated partition is also Chinese restaurant process distributed.
• The corresponding Pitman-Yor process is the resulting marginal distribution of G[aca].

[Figure: G[ca] | G[a] ∼ PY(θ₂, d₂, G[a]) and G[aca] | G[ca] ∼ PY(θ₂d₃, d₃, G[ca]); marginalizing out G[ca] gives G[aca] | G[a] ∼ PY(θ₂d₃, d₂d₃, G[a])]

[Gasthaus & Teh NIPS 2010]
![Page 145: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/145.jpg)
Comparison to Finite Order HPYLM
![Page 146: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/146.jpg)
Compression Results

Calgary corpus. SM inference: particle filter. PPM: Prediction by Partial Matching. CTW: Context Tree Weighting. Online inference, entropic coding.

| Model | Average bits/byte |
|-------|-------------------|
| gzip | 2.61 |
| bzip2 | 2.11 |
| CTW | 1.99 |
| PPM | 1.93 |
| Sequence Memoizer | 1.89 |
![Page 147: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/147.jpg)
A Few Final Words
![Page 148: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/148.jpg)
Summary
• Introduction to Bayesian learning and Bayesian nonparametrics.
• Chinese restaurant processes.
• Fragmentations and coagulations.
• Unifying view of random tree models as Markov chains of fragmenting and coagulating partitions.
• Fragmentation-coagulation processes.
• Hierarchical Dirichlet and Pitman-Yor processes, sequence memoizer.
![Page 149: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/149.jpg)
What Was Not Covered Here
• Bayesian nonparametrics in computational linguistics (Mark Johnson).
• Gaussian processes, Indian buffet processes (Zoubin Ghahramani).
• Nonparametric Hidden Markov Models [see Ghahramani CoNLL 2010].
• Dependent and hierarchical processes [see Dunson BNP 2010, Teh & Jordan BNP 2010].
• Foundational issues, convergence and asymptotics.
• Combinatorial stochastic processes and their relationship to data structures and programming languages.
• Relational models, topic models etc.
![Page 150: Bayesian Nonparametrics: Random Partitionsteh/teaching/npbayes/mlss2011S.pdf · Bayesian Nonparametrics: Random Partitions Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL](https://reader035.vdocuments.us/reader035/viewer/2022062910/5b9a97be09d3f2dc408ba16f/html5/thumbnails/150.jpg)
Future of Bayesian Nonparametrics
• Augmenting the standard modelling toolbox of machine learning.
• Development of better inference algorithms and software toolkits.
• Exploration of novel stochastic processes.
• More applications in machine learning and beyond.