Randomized Algorithms for Bayesian Hierarchical Clustering
Katherine A. Heller
Zoubin Ghahramani
Gatsby Unit, University College London
Hierarchies are natural outcomes of certain generative processes, and intuitive representations for certain kinds of data.
Examples: biological organisms; newsgroups, emails, actions; …
Traditional Hierarchical Clustering
As in Duda and Hart (1973):
* Many distance metrics are possible
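The procedure can be sketched in a few lines: start with one cluster per point and repeatedly merge the two closest clusters. The Euclidean single-linkage metric below is one arbitrary choice among the many possible.

```python
# Illustrative sketch of traditional agglomerative clustering.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2):
    # Cluster distance = distance between the two closest members.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points):
    clusters = [[p] for p in points]
    merges = []                      # record of merge order
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = agglomerate([(0.0,), (0.1,), (5.0,), (5.2,)])
```

With these toy points, the two tight pairs merge first, then the two resulting groups.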
Limitations of Traditional Hierarchical Clustering Algorithms
* How many clusters should there be?
* It is hard to choose a distance metric
* They do not define a probabilistic model of the data, so they cannot:
  - predict the probability or cluster assignment of new data points
  - be compared to or combined with other probabilistic models
Our Goal: To overcome these limitations by defining a novel statistical approach to hierarchical clustering
Bayesian Hierarchical Clustering
Our algorithm can be understood from two different perspectives:
A Bayesian way to do hierarchical clustering where marginal likelihoods are used to decide which merges are advantageous
A novel fast bottom-up way of doing approximate inference in a Dirichlet Process mixture model (e.g. an infinite mixture of Gaussians)
Outline
Background
* Traditional Hierarchical Clustering and its Limitations
* Marginal Likelihoods
* Dirichlet Process Mixtures (infinite mixture models)
Bayesian Hierarchical Clustering (BHC) algorithm
* Theoretical Results
* Experimental Results
* Randomized BHC algorithms
Conclusions
Dirichlet Process Mixtures (a.k.a. infinite mixture models)
Consider a mixture model with K components (e.g. Gaussians). How do we choose K? We could infer K from the data, but this would require that we really believe the data came from a mixture of some finite number of components, which is highly implausible. Instead, a DPM has K = countably infinite components.
A DPM can be derived by taking the limit K → ∞ of a finite mixture model with a Dirichlet prior on the mixing proportions. The prior on partitions of data points into clusters in a DPM is called the Chinese Restaurant Process.
The key to avoiding overfitting in DPMs is Bayesian inference: you can integrate out all infinitely many parameters and sample from assignments of data points to clusters.
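The Chinese Restaurant Process prior on partitions is easy to sample from: point i joins an existing cluster c with probability |c| / (i + α) and starts a new cluster with probability α / (i + α). A minimal sketch:

```python
import random

def crp_partition(n, alpha, rng=None):
    """Sample a partition of n points from a Chinese Restaurant Process."""
    rng = rng or random.Random(0)
    clusters = []
    for i in range(n):
        r = rng.random() * (i + alpha)   # total unnormalized weight is i + alpha
        acc = 0.0
        chosen = None
        for c in clusters:
            acc += len(c)                # existing cluster weight = its size
            if r < acc:
                chosen = c
                break
        if chosen is None:
            clusters.append([i])         # open a new cluster with weight alpha
        else:
            chosen.append(i)
    return clusters

partition = crp_partition(20, alpha=1.0)
```

Larger α yields more clusters on average; every point lands in exactly one cluster.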
Bayesian Hierarchical Clustering: Building the Tree
The algorithm is virtually identical to traditional hierarchical clustering, except that instead of a distance it uses marginal likelihood to decide on merges.
For each potential merge $D_k = D_i \cup D_j$ it compares two hypotheses:
$H_1$: all data in $D_k$ came from one cluster
$H_2$: data in $D_k$ came from some other clustering consistent with the subtrees $T_i, T_j$
Prior: $\pi_k = P(H_1)$
Posterior probability of the merged hypothesis:
$$r_k = P(H_1 \mid D_k) = \frac{\pi_k P(D_k \mid H_1)}{\pi_k P(D_k \mid H_1) + (1 - \pi_k) P(D_i \mid T_i) P(D_j \mid T_j)}$$
Probability of the data given the tree $T_k$:
$$P(D_k \mid T_k) = \pi_k P(D_k \mid H_1) + (1 - \pi_k) P(D_i \mid T_i) P(D_j \mid T_j)$$
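One merge decision translates directly into code. Here `marg_lik` is a stand-in for a conjugate model's marginal likelihood P(D | H1), and the demo numbers are purely illustrative:

```python
# Sketch of a single BHC merge decision.

def merge_score(pi_k, marg_lik, d_i, d_j, p_di_ti, p_dj_tj):
    """Return (P(D_k|T_k), r_k) for merging subtrees with data d_i, d_j."""
    p_h1 = marg_lik(d_i + d_j)                      # all of D_k in one cluster
    p_tree = pi_k * p_h1 + (1 - pi_k) * p_di_ti * p_dj_tj
    r_k = pi_k * p_h1 / p_tree                      # posterior of merged hypothesis
    return p_tree, r_k

# Toy numbers, for illustration only:
p_tree, r_k = merge_score(0.5, lambda d: 0.2, ["a"], ["b"], 0.5, 0.5)
```

A greedy BHC builder would evaluate this score for every candidate pair and merge the pair with the highest r_k.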
Building the Tree
The algorithm compares two hypotheses for each potential merge $D_k = D_i \cup D_j$:
$H_1$: all data in $D_k$ in one cluster
$H_2$: all other clusterings consistent with the subtrees $T_i, T_j$
Comparison
Traditional Hierarchical Clustering
Bayesian Hierarchical Clustering
Computing the Single-Cluster Marginal Likelihood
The marginal likelihood for the hypothesis that all data points in $D_k$ belong to one cluster is
$$P(D_k \mid H_1) = \int P(D_k \mid \theta) P(\theta) \, d\theta$$
If we use models which have conjugate priors, this integral is tractable and is a simple function of the sufficient statistics of $D_k$.
Examples:
* For continuous Gaussian data we can use Normal-Inverse-Wishart priors
* For discrete Multinomial data we can use Dirichlet priors
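As a concrete instance of the conjugate case: for a single Bernoulli feature with a Beta(a, b) prior, the integral has the closed form B(a + heads, b + tails) / B(a, b), a function only of the head/tail counts. (A multivariate binary model would take a product of such terms, one per feature.)

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bernoulli_marg_lik(data, a=1.0, b=1.0):
    """P(D | H1) for 0/1 data under a conjugate Beta(a, b) prior."""
    heads = sum(data)
    tails = len(data) - heads
    return exp(log_beta(a + heads, b + tails) - log_beta(a, b))
```

With a uniform Beta(1, 1) prior, any single observation has marginal likelihood 1/2, and the pair [1, 0] has 1/6.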
Theoretical Results
The BHC algorithm can be thought of as a new approximate inference method for Dirichlet Process mixtures.
Using dynamic programming, for any tree it sums over exponentially many tree-consistent partitions in O(n) time, whereas exactly summing over all partitions would be O(n^n).
BHC provides a new lower bound on the marginal likelihood of DPMs.
Tree-Consistent Partitions
Consider the above tree and all 15 possible partitions of {1,2,3,4}:
(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)
(1 2)(3)(4) and (1 2 3)(4) are tree-consistent partitions; (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.
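Tree-consistent partitions can be enumerated mechanically: each node either keeps its entire subtree in one cluster or splits independently into its children's partitions. The cascading tree (((1 2) 3) 4) below is an assumption inferred from the stated examples, since the original tree figure is not reproduced here.

```python
def leaves(tree):
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def tree_consistent(tree):
    """All tree-consistent partitions; each partition is a list of clusters (tuples)."""
    if not isinstance(tree, tuple):
        return [[(tree,)]]
    parts = [[tuple(leaves(tree))]]          # whole subtree as one cluster
    left, right = tree
    for pl in tree_consistent(left):
        for pr in tree_consistent(right):
            parts.append(pl + pr)            # children split independently
    return parts

parts = tree_consistent((((1, 2), 3), 4))
```

For this tree there are only 4 tree-consistent partitions out of the 15 possible ones.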
Simulations
Toy Example 1 – continuous data Toy Example 2 – binary data Toy Example 3 – digits data
Results: a Toy Example
Predicting New Data Points
Toy Examples
Binary Digits Example
4 Newsgroups Results
800 examples, 50 attributes: rec.sport.baseball, rec.sport.hockey, rec.autos, sci.space
Results: Average Linkage HC
Results: Bayesian HC
Results: Purity Scores
Purity is a measure of how well the hierarchical tree structure is correlated with the labels of the known classes.
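One common way to make this precise (a sketch under that definition; the tiny trees in the demo are made up): for each pair of leaves sharing a class label, take the smallest subtree containing both and measure the fraction of its leaves carrying that label, then average over all such pairs.

```python
from itertools import combinations

def leaves(tree):
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def smallest_subtree(tree, a, b):
    # Smallest subtree (least common ancestor) containing leaves a and b.
    if isinstance(tree, tuple):
        for child in tree:
            ls = leaves(child)
            if a in ls and b in ls:
                return smallest_subtree(child, a, b)
    return tree

def dendrogram_purity(tree, label):
    total, pairs = 0.0, 0
    for a, b in combinations(leaves(tree), 2):
        if label[a] != label[b]:
            continue
        under = leaves(smallest_subtree(tree, a, b))
        total += sum(label[x] == label[a] for x in under) / len(under)
        pairs += 1
    return total / pairs
```

A tree that keeps classes in separate subtrees scores 1.0; interleaving the classes drives the score down.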
Limitations
* Greedy algorithm: may not find the globally optimal tree
* No tree uncertainty: the algorithm finds a single tree, rather than a distribution over plausible trees
* O(n²) complexity for building the tree: fast, but not for very large datasets; this can be improved
Randomized BHC
Randomized BHC
The algorithm is O(n log n): each level of the tree takes O(n) operations.
Assumptions:
* The top-level clustering built from a subset of m data points will be a good approximation to the true top-level clustering.
* The BHC algorithm tends to produce roughly balanced binary trees.
We can stop after any desired number of levels, before nodes containing only one data point are reached.
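The recursion under those assumptions can be sketched structurally. Here `top_split` stands in for running full BHC on the m-point subsample and reading off its top-level split, and `side_of` stands in for filtering a remaining point to the subtree under which it is more probable; both are hypothetical placeholders, and the 1-D mean-threshold versions below are for illustration only.

```python
def randomized_tree(points, m, top_split, side_of):
    # Each level does O(n) work; with roughly balanced splits the whole
    # recursion is O(n log n).
    if len(points) == 1:
        return points[0]
    if len(points) <= m:
        left, right = top_split(points)          # small enough: split directly
    else:
        left, right = top_split(points[:m])      # top split from an m-point subset
        for p in points[m:]:                     # filter the rest through it
            (left if side_of(p, left, right) == 0 else right).append(p)
    return (randomized_tree(left, m, top_split, side_of),
            randomized_tree(right, m, top_split, side_of))

# Illustration-only placeholders for 1-D data:
def mean(xs):
    return sum(xs) / len(xs)

def top_split(xs):
    mu = mean(xs)
    return [x for x in xs if x <= mu], [x for x in xs if x > mu]

def side_of(p, left, right):
    return 0 if abs(p - mean(left)) <= abs(p - mean(right)) else 1

tree = randomized_tree([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], 4, top_split, side_of)
```

On this well-separated toy data the top split recovers the two groups even though it is built from only the first four points.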
Randomized BHC – An Alternative Based on EM
This randomized algorithm is O(n).
Approximation Methods for Marginal Likelihoods of Mixture Models
* Bayesian Information Criterion (BIC)
* Laplace Approximation
* Variational Bayes (VB)
* Expectation Propagation (EP)
* Markov chain Monte Carlo (MCMC)
* Bayesian Hierarchical Clustering (new!)
BHC Conclusions
We have shown a Bayesian Hierarchical Clustering algorithm which:
* is simple, deterministic and fast (no MCMC; one pass through the data)
* can take as input any simple probabilistic model p(x|θ) and gives as output a mixture of these models
* suggests where to cut the tree and how many clusters there are in the data
* gives more reasonable results than traditional hierarchical clustering algorithms
This algorithm recursively computes an approximation to the marginal likelihood of a Dirichlet Process Mixture, which can easily be turned into a new lower bound.
Future Work
* Try other real hierarchical clustering data sets: gene expression data, more text data, spam/email clustering
* Generalize to other models p(x|θ), including more complex models which will require approximate inference
* Compare to other marginal likelihood approximations (Variational Bayes, EP, MCMC)
* Hyperparameter optimization using an EM-like algorithm
* Test the randomized algorithms
Appendix: Additional Slides
Computing the Prior for Merging
Where do we get $\pi_k$ from? It can be computed bottom-up as the tree is built: $\pi_k$ is the relative mass of the partition where all points are in one cluster versus all other partitions consistent with the subtrees, in a Dirichlet process mixture model with hyperparameter $\alpha$.
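The bottom-up computation referred to above can be written out; this is the recursion from the Heller & Ghahramani (2005) BHC paper, where $n_k$ is the number of data points under node $k$:

```latex
% Initialise each leaf i:
d_i = \alpha, \qquad \pi_i = 1
% For each internal node k with children \mathrm{left}_k and \mathrm{right}_k:
d_k = \alpha\,\Gamma(n_k) + d_{\mathrm{left}_k}\, d_{\mathrm{right}_k},
\qquad
\pi_k = \frac{\alpha\,\Gamma(n_k)}{d_k}
```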
Bayesian Occam’s Razor
Model Structure: polynomials
Bayesian Model Comparison
Nonparametric Bayesian Methods
DPM - III
Dendrogram Purity
Marginal Likelihoods
The marginal likelihood (a.k.a. evidence) is one of the key concepts in Bayesian statistics
We will review the concept of a marginal likelihood using the example of a Gaussian mixture model
Theoretical Results
Learning Hyperparameters
For any given setting of the hyperparameters $\beta$, the top node of the tree approximates the marginal likelihood $P(D \mid \beta)$.
Model comparison between $\beta$ and $\beta'$: compare $P(D \mid \beta)$ and $P(D \mid \beta')$.
For a fixed tree it should be possible to compute gradients with respect to $\beta$, suggesting an EM-like algorithm.
Making Predictions: The Predictive Distribution
How do we compute the probability $P(x \mid D)$ of a new test point $x$? Recurse down the tree, with two alternatives at each node:
$H_1$: $x$ is in the one cluster $D_k$
$H_2$: $x$ is in one of the other clusters consistent with the tree structure (subtrees $T_i, T_j$)
Compute the predictive distribution by summing over all clusters, weighted by their posterior probability. Example: for the Gaussian model this gives a mixture of multivariate t distributions.
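The recursion can be sketched as follows. One simplifying assumption is made for this illustration: at each node, mass r_k goes to the node's own cluster and the remaining (1 − r_k) is split between the subtrees in proportion to their sizes, so the cluster weights sum to one. The `pred` functions stand in for each cluster's posterior predictive p(x | D_k).

```python
# Sketch of the predictive recursion P(x|D). Each node is a tuple
# (r, pred, n, left, right): r is the merge posterior r_k, pred(x) the
# cluster's posterior predictive, n the number of points under the node.

def predictive(node, x):
    r, pred, n, left, right = node
    if left is None:
        return pred(x)                        # leaf: single-point cluster
    below = (left[2] * predictive(left, x)
             + right[2] * predictive(right, x)) / n
    return r * pred(x) + (1 - r) * below      # merged cluster vs. subtrees

# Tiny illustrative tree with constant stand-in "densities":
leaf_a = (1.0, lambda x: 0.2, 1, None, None)
leaf_b = (1.0, lambda x: 0.4, 1, None, None)
root = (0.5, lambda x: 0.3, 2, leaf_a, leaf_b)
p = predictive(root, 0.0)
```

If every cluster's predictive integrates to one, the mixture does too, which is easy to check by plugging in constant predictives of 1.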