Source: ml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/2-Clustering.pdf
Unsupervised Learning Clustering
[70240413 Statistical Machine Learning, Spring, 2015]
Jun Zhu [email protected]
http://bigml.cs.tsinghua.edu.cn/~jun
State Key Lab of Intelligent Technology & Systems
Tsinghua University
March 10, 2015
Unsupervised Learning
Task: learn an explanatory function
Aka "learning without a teacher"
No training/test split
Example: the feature space is the words in documents; the function learned is a word distribution (the probability of each word)
Unsupervised Learning – density estimation
Example: the feature space is the geographical information of a location; the function learned is a density function
Unsupervised Learning – clustering
Example: the feature space is the attributes (e.g., pixels & text) of images; the function learned is a cluster assignment function
http://search.carrot2.org/stable/search
Unsupervised Learning – dimensionality reduction
Example: the feature space is the pixels of images; the function learned is a coordinate function in 2D space
Images have thousands or millions of pixels. Can we give each image a coordinate, such that similar images are near each other?
Clustering
(K-Means, Gaussian Mixtures)
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
High intra-class similarity
Low inter-class similarity
A common and important task with many applications in science, engineering, information science, etc.
Group genes that perform the same function
Group individuals that have similar political views
Categorize documents with similar topics
Identify similar objects in pictures
…
The clustering problem
Input: training data D = {x_1, …, x_N}, where x_n ∈ R^d, and an integer K (the number of clusters)
Output: a set of K clusters (an assignment of each data point to a cluster)
Example: documents represented in a word vector space over a vocabulary
Issues for clustering
What is a natural grouping among these objects?
Definition of “groupness”
What makes objects “related”?
Definition of “similarity/distance”
Representation for objects
Vector space? Normalization?
How many clusters?
Fixed a priori?
Completely data driven?
Clustering algorithms
Partitional algorithms
Hierarchical algorithms
Formal foundation and convergence
What is a natural grouping among objects?
Females Males Simpson’s Family School Employees
Clustering is subjective
What is similarity?
The real meaning of similarity is a philosophical question.
It depends on the representation and the algorithm. For many representations/algorithms, it is easier to think in terms of the distance (rather than the similarity) between vectors
Desirable distance measure properties
d(A,B) = d(B,A) (symmetry). Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex"
d(A,A) = 0 (constancy of self-similarity). Otherwise you could claim "Alex looks more like Bob than Bob does"
d(A,B) = 0 iff A = B (positivity/separation). Otherwise there would be objects that are different, but you could not tell them apart
d(A,B) ≤ d(A,C) + d(C,B) (triangle inequality). Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl"
Minkowski Distance
d(x, y) = (Σ_i |x_i − y_i|^r)^(1/r)
Common Minkowski distances
Euclidean distance (r = 2): d(x, y) = (Σ_i (x_i − y_i)²)^(1/2)
Manhattan distance (r = 1): d(x, y) = Σ_i |x_i − y_i|
"Sup" distance (r → ∞): d(x, y) = max_i |x_i − y_i|
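The three distances above can be sketched as one function of r; a minimal illustration in plain Python (the example points are illustrative, not from the slides):

```python
import math

def minkowski(x, y, r):
    """Minkowski distance d(x, y) = (sum_i |x_i - y_i|^r)^(1/r)."""
    if r == math.inf:  # "sup" distance: the limit as r -> infinity
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))         # Manhattan: 7.0
print(minkowski(x, y, 2))         # Euclidean: 5.0
print(minkowski(x, y, math.inf))  # Sup: 4.0
```

Note that for the same pair of points the distance shrinks as r grows, since large r weights the largest coordinate difference most heavily.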
Hamming distance
Manhattan distance is called Hamming distance when all features are binary
E.g., gene expression levels under 17 conditions (1 = high; 0 = low)
Hamming distance: #(0 → 1) + #(1 → 0) = 4 + 1 = 5
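A minimal sketch of the count above; the two 17-condition profiles are made up (the slide's actual vectors are not recoverable), but they are constructed so that four positions flip 0 → 1 and one flips 1 → 0:

```python
def hamming(x, y):
    """Hamming distance between equal-length binary vectors:
    the number of positions where they differ (#(0 -> 1) + #(1 -> 0))."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

# Hypothetical expression profiles under 17 conditions (1 = high, 0 = low)
g1 = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
g2 = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(hamming(g1, g2))  # 5
```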
Correlation coefficient
Pearson correlation coefficient: ρ(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / ( (Σ_i (x_i − x̄)²)^(1/2) (Σ_i (y_i − ȳ)²)^(1/2) )
Cosine similarity: cos(x, y) = xᵀy / (‖x‖ ‖y‖)
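The two measures are closely related: Pearson correlation is cosine similarity applied to the mean-centered vectors. A small sketch (the example vectors are illustrative):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = <x, y> / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pearson(x, y):
    """Pearson correlation = cosine similarity of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # y = 2x: perfectly correlated
print(round(pearson(x, y), 6))            # 1.0
print(round(cosine_similarity(x, y), 6))  # 1.0
```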
Edit Distance
To measure the similarity between two objects, transform one into the other, and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Marge and Selma:
Change dress color, 1 point; add earrings, 1 point; decrease height, 1 point; take up smoking, 1 point; lose weight, 1 point
D(Marge, Selma) = 5
(Pictured: Selma, Patty, Marge)
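The same idea applied to strings is the Levenshtein distance: count the minimum number of single-character edits needed to turn one string into the other. A minimal dynamic-programming sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from s
                           dp[i][j - 1] + 1,         # insert into s
                           dp[i - 1][j - 1] + cost)  # substitute / match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```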
Clustering algorithms
Partitional algorithms
Usually start with a random (partial)
partitioning
Refine it iteratively
K-means
Mixture-Model based clustering
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive
K-means Algorithm
1. Initialize the centroids μ_1, …, μ_K
K-means Algorithm
2. for each k, C_k = { n : k = argmin_j ‖x_n − μ_j‖² } (assign each point to its nearest centroid)
K-means Algorithm
3. for each k, μ_k = (1/|C_k|) Σ_{n∈C_k} x_n (sample mean)
K-means Algorithm
Repeat until no further change in cluster assignment
Summary of K-means Algorithm
1. Initialize centroids μ_1, …, μ_K
2. Repeat until no change of cluster assignment
(1) for each k: C_k = { n : k = argmin_j ‖x_n − μ_j‖² }
(2) for each k: μ_k = (1/|C_k|) Σ_{n∈C_k} x_n
Note: each iteration requires O(NKd) operations
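The two repeated steps above can be sketched in a few lines of plain Python; this is a minimal illustration (the dataset and seed are made up), not a production implementation:

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(data, k, iters=100, seed=0):
    """Plain K-means on a list of d-dimensional points (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]  # 1. initialize centroids
    assign = None
    for _ in range(iters):
        # (1) assignment step: each point joins the cluster of its nearest centroid
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in data]
        if new_assign == assign:
            break  # no change of cluster assignment: converged
        assign = new_assign
        # (2) update step: each centroid moves to the mean of its cluster
        for j in range(k):
            members = [p for p, c in zip(data, assign) if c == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

# Two well-separated 1-D blobs; K-means recovers their means.
data = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
centroids, assign = kmeans(data, 2)
print(sorted(c[0] for c in centroids))  # ≈ [0.1, 10.1]
```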
K-means Questions
What is it trying to optimize?
Are we sure it will terminate?
Are we sure it will find an optimal clustering?
How should we start it?
How could we automatically choose the number of centers?
Theory: K-Means as an Opt. Problem
The opt. problem: min over {C_k}, {μ_k} of J = Σ_k Σ_{n∈C_k} ‖x_n − μ_k‖²
Theorem: each K-means iteration does not increase the objective, until a local minimum is reached
Proof ideas:
Each operation (re-assignment or mean update) does not increase the objective
The objective is bounded below and the number of possible cluster assignments is finite
K-means as gradient descent
Find K prototypes to minimize the quantization error (i.e., the average distance between a data point and its closest prototype): E = (1/N) Σ_n min_k ‖x_n − μ_k‖²
First-order gradient descent applies
Newton's method leads to the same update rule
See [Bottou & Bengio, NIPS'95] for more details
Trying to find a good optimum
Idea 1: Be careful about where you start
Idea 2: Do many runs of k-means, each from a different
random start configuration
Many other ideas floating around.
Note: K-means is often used to initialize other clustering
methods
Mixture of Gaussians and EM algorithm
Basics of Probability & MLE
Basics of Probabilities
Independence
Independent random variables: P(X, Y) = P(X) P(Y), or equivalently P(X | Y) = P(X)
Y and X don't contain information about each other
Observing Y doesn't help predicting X
Observing X doesn't help predicting Y
Examples:
Independent: winning on roulette this week and next week
Dependent: Russian roulette
(Graphical model: nodes X and Y with no edge between them)
Dependent / Independent?
Conditional Independence
Conditionally independent: P(X, Y | Z) = P(X | Z) P(Y | Z)
knowing Z makes X and Y independent
Example:
London taxi drivers: a survey pointed out a positive and significant correlation between the number of accidents and wearing coats. It concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.
Finally another study pointed out that people wear coats when it rains…
(Graphical model: Z with arrows to X and Y)
Conditional Independence
Conditionally independent: P(X, Y | Z) = P(X | Z) P(Y | Z)
knowing Z makes X and Y independent
Equivalent to: ∀(x, y, z): P(X = x | Y = y, Z = z) = P(X = x | Z = z)
E.g.: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
(Graphical model: Z with arrows to X and Y)
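The taxi-driver story can be checked numerically: build a hypothetical joint distribution in which coats (X) and accidents (Y) are independent given rain (Z), and observe that they are still strongly correlated marginally. All probabilities below are invented for illustration:

```python
from itertools import product

# Hypothetical binary variables: Z = "it rains", X = "wearing a coat",
# Y = "accident". Built to be conditionally independent given Z.
pz = {0: 0.5, 1: 0.5}       # P(Z = z)
px1_z = {0: 0.1, 1: 0.9}    # P(X = 1 | Z = z)
py1_z = {0: 0.1, 1: 0.9}    # P(Y = 1 | Z = z)

def p_xyz(x, y, z):
    """P(X=x, Y=y, Z=z) = P(Z) P(X|Z) P(Y|Z): X and Y independent given Z."""
    px = px1_z[z] if x else 1 - px1_z[z]
    py = py1_z[z] if y else 1 - py1_z[z]
    return pz[z] * px * py

# Marginally, X and Y are strongly dependent: rain drives both.
p_x1y1 = sum(p_xyz(1, 1, z) for z in (0, 1))
p_x1 = sum(p_xyz(1, y, z) for y, z in product((0, 1), repeat=2))
p_y1 = sum(p_xyz(x, 1, z) for x, z in product((0, 1), repeat=2))
print(round(p_x1y1, 3))       # 0.41
print(round(p_x1 * p_y1, 3))  # 0.25 -- far from 0.41, so X and Y are dependent
```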
Maximum Likelihood Estimation (MLE)
Flipping a Coin
What’s the probability that a coin will fall with a head up (if
flipped)?
Let us flip it a few times to estimate the probability
The estimated probability is: 3/5 “frequency of heads”
Questions:
Why frequency of heads?
How good is this estimation?
Why is this a machine learning problem?
The estimated probability is: 3/5 “frequency of heads”
Question (1)
Why frequency of heads?
Frequency of heads is exactly the Maximum Likelihood
Estimator for this problem
MLE has nice properties
(interpretation, statistical guarantees, simple)
MLE for Bernoulli Distribution
P(Head) = θ, P(Tail) = 1 − θ
Flips are i.i.d.:
Independent events that are identically distributed according to the Bernoulli distribution
MLE: choose θ that maximizes the probability of the observed data
Maximum Likelihood Estimation (MLE)
MLE: choose θ that maximizes the probability of the observed data: θ̂ = argmax_θ P(D | θ)
Independent draws: P(D | θ) = Π_i P(x_i | θ)
Identically distributed: P(D | θ) = θ^{n_H} (1 − θ)^{n_T}
Maximum Likelihood Estimation (MLE)
MLE: choose θ that maximizes the probability of the observed data: θ̂ = argmax_θ θ^{n_H} (1 − θ)^{n_T}
Solution: set the derivative of the log-likelihood to zero, giving θ̂ = n_H / (n_H + n_T)
Exactly the "frequency of heads"
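The claim can be checked numerically: for 3 heads and 2 tails, scan θ over a fine grid and confirm the log-likelihood peaks at the frequency of heads. A minimal sketch:

```python
import math

def log_likelihood(theta, n_heads, n_tails):
    """Bernoulli log-likelihood: n_H log(theta) + n_T log(1 - theta)."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)

# 3 heads, 2 tails: scan theta over a fine grid and pick the maximizer
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, 3, 2))
print(best)  # 0.6 = 3 / (3 + 2), the frequency of heads
```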
Question (2)
How good is the MLE estimation?
Is it biased?
How many flips do I need?
I flipped the coins 5 times: 3 heads, 2 tails
What if I flipped 30 heads and 20 tails?
Which estimator should we trust more?
A Simple Bound
Let θ* be the true parameter. For n data points, let θ̂ = n_H / n
Then, for any ε > 0, we have Hoeffding's inequality: P(|θ̂ − θ*| ≥ ε) ≤ 2e^{−2nε²}
Probably Approximately Correct (PAC) Learning
I want to know the coin parameter θ, within ε = 0.1 error, with probability at least 1 − δ (e.g., 0.95)
How many flips do I need?
Setting 2e^{−2nε²} ≤ δ gives the sample complexity: n ≥ ln(2/δ) / (2ε²)
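Plugging the numbers into the sample-complexity bound is a one-liner; for ε = 0.1 and δ = 0.05 it gives 185 flips:

```python
import math

def sample_complexity(eps, delta):
    """Flips needed so that P(|theta_hat - theta*| >= eps) <= delta,
    from the Hoeffding bound 2 exp(-2 n eps^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(sample_complexity(0.1, 0.05))  # 185
```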
Question (3)
Why is this a machine learning problem?
A learner improves its performance (the accuracy of the estimated probability)
at some task (estimating the probability of heads)
with experience (the more coin flips, the better we are)
How about continuous features?
Gaussian Distributions
Univariate Gaussian distribution: N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
Given parameters, we can draw samples and plot distributions
Carl F. Gauss (1777 – 1855)
Maximum Likelihood Estimation
Given a data set D = {x_1, …, x_N}, the likelihood is p(D | μ, σ²) = Π_n N(x_n | μ, σ²)
MLE estimates the parameters as
μ̂ = (1/N) Σ_n x_n (sample mean)
σ̂² = (1/N) Σ_n (x_n − μ̂)² (sample variance)
Note: MLE for the variance of a Gaussian is biased
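The two estimators, and the source of the bias (dividing by N rather than N − 1), can be seen on a tiny made-up dataset:

```python
def gaussian_mle(data):
    """MLE for a univariate Gaussian: sample mean and (biased) sample variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n   # divides by n, not n - 1
    return mu, var

data = [1.0, 2.0, 3.0, 4.0]
mu, var = gaussian_mle(data)
print(mu)   # 2.5
print(var)  # 1.25 (the unbiased estimate divides by n - 1: 5/3 ~ 1.667)
```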
Gaussian Distributions
d-dimensional multivariate Gaussian: N(x | μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
Given parameters, we can draw samples and plot distributions
(Covariance structures: isotropic, diagonal, general)
Maximum Likelihood Estimation
Given a data set D = {x_1, …, x_N}, the likelihood is p(D | μ, Σ) = Π_n N(x_n | μ, Σ)
MLE estimates the parameters as
μ̂ = (1/N) Σ_n x_n (sample mean)
Σ̂ = (1/N) Σ_n (x_n − μ̂)(x_n − μ̂)ᵀ (sample covariance)
Other Nice Analytic Properties
Marginal is Gaussian
Conditional is Gaussian
Limitations of Single Gaussians
A single Gaussian is unimodal
… it cannot fit multimodal data well, which is more realistic!
Mixture of Gaussians
A simple family of multimodal distributions: p(x) = Σ_k π_k N(x | μ_k, Σ_k)
treat unimodal Gaussians as basis (or component) distributions
superpose multiple Gaussians via linear combination
What conditions should the mixing coefficients π_k satisfy? (For p(x) to be a valid distribution: π_k ≥ 0 and Σ_k π_k = 1.)
MLE for Mixture of Gaussians
Log-likelihood: ℓ(Θ) = Σ_n log Σ_k π_k N(x_n | μ_k, Σ_k)
this is complicated (the log of a sum has no closed-form maximizer) …
… but we know the MLE for a single Gaussian is easy
A heuristic procedure (can we iterate?)
allocate data into different components
estimate each component Gaussian analytically
Optimal Conditions
Setting ∂ℓ/∂μ_k = 0 gives μ_k = Σ_n γ(z_nk) x_n / Σ_n γ(z_nk), where γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j) is the responsibility of component k for x_n
A weighted sample mean!
Optimal Conditions
Setting ∂ℓ/∂Σ_k = 0 gives Σ_k = Σ_n γ(z_nk)(x_n − μ_k)(x_n − μ_k)ᵀ / Σ_n γ(z_nk)
A weighted sample covariance!
Optimal Conditions
Maximizing over the mixing coefficients gives π_k = (1/N) Σ_n γ(z_nk)
The fraction of data assigned to component k!
Note: constraints exist for the mixing coefficients (π_k ≥ 0, Σ_k π_k = 1)!
Optimal Conditions – summary
The set of coupled conditions for μ_k, Σ_k, and π_k
The key factor that gets them coupled: the responsibilities γ(z_nk)
If we know γ(z_nk), each component Gaussian is easy to estimate!
The EM Algorithm
E-step: estimate the responsibilities γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)
M-step: re-estimate the parameters μ_k, Σ_k, π_k using the current responsibilities
Initialization plays a key role in success!
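The E- and M-steps above can be sketched for a 1-D mixture in plain Python. This is a minimal illustration: the data, the fixed initial means, and the variance floor are all choices made for the example, not part of the lecture:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, init_mus, iters=50):
    """EM for a 1-D mixture of K Gaussians; init_mus fixes the initial means."""
    k = len(init_mus)
    mus, vars_, pis = list(init_mus), [1.0] * k, [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities gamma[n][j] proportional to pi_j N(x_n | mu_j, var_j)
        gammas = []
        for x in data:
            w = [pis[j] * normal_pdf(x, mus[j], vars_[j]) for j in range(k)]
            s = sum(w)
            gammas.append([wj / s for wj in w])
        # M-step: weighted mean, weighted variance, fraction of data per component
        for j in range(k):
            nk = sum(g[j] for g in gammas)
            mus[j] = sum(g[j] * x for g, x in zip(gammas, data)) / nk
            vars_[j] = max(sum(g[j] * (x - mus[j]) ** 2
                               for g, x in zip(gammas, data)) / nk, 1e-6)
            pis[j] = nk / len(data)
    return mus, vars_, pis

# Two well-separated clumps; start one mean near each (initialization matters).
data = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
mus, vars_, pis = em_gmm_1d(data, init_mus=[0.0, 10.0])
print([round(m, 2) for m in mus])  # ≈ [0.2, 9.8]
```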
A Running Example
The data and a mixture of two isotropic Gaussians
A Running Example
Initial E-step
A Running Example
Initial M-step
A Running Example
The 2nd M-step
A Running Example
The 5th M-step
A Running Example
The 20th M-step
Theory
Let's take the latent variable view of mixture of Gaussians
Indicator (selecting) variable: z ∈ {0, 1}^K with Σ_k z_k = 1, and p(z_k = 1) = π_k
Then p(x) = Σ_z p(z) p(x | z) = Σ_k π_k N(x | μ_k, Σ_k)
Note: the idea of data augmentation is influential in statistics and machine learning!
Theory
Re-visit the log-likelihood: ℓ(Θ) = Σ_n log p(x_n | Θ) = Σ_n log Σ_{z_n} q(z_n) [ p(x_n, z_n | Θ) / q(z_n) ]
Jensen's inequality: for a concave function f (such as log), f(E[X]) ≥ E[f(X)]
Theory
Re-visit the log-likelihood
Jensen’s inequality
![Page 71: Infinite SVM: a DP Mixture of Large-margin Kernel Machinesml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/2-Clustering.pdfAttributes (e.g., pixels & text) of images ... The distance](https://reader030.vdocuments.us/reader030/viewer/2022040620/5f3395b47b1a2e33a65e8b4a/html5/thumbnails/71.jpg)
Theory
Re-visit the log-likelihood
Jensen’s inequality
How to apply?
![Page 72: Infinite SVM: a DP Mixture of Large-margin Kernel Machinesml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/2-Clustering.pdfAttributes (e.g., pixels & text) of images ... The distance](https://reader030.vdocuments.us/reader030/viewer/2022040620/5f3395b47b1a2e33a65e8b4a/html5/thumbnails/72.jpg)
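For a single data point $x$ and any distribution $q(z)$ over the latent indicator, the Jensen step can be written out (a standard derivation, not copied from the slide):

```latex
\log p(x \mid \theta)
  = \log \sum_{z} q(z)\, \frac{p(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}
  \;=:\; \mathcal{L}(q, \theta)
```

since $\log$ is concave; summing over the data points gives a lower bound on the full log-likelihood $\log p(X \mid \Theta)$.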
Theory
What we have is a lower bound
What's the GAP?
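Writing $\mathcal{L}(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}$ for the Jensen lower bound, the gap has a closed form (again a standard identity):

```latex
\log p(x \mid \theta) - \mathcal{L}(q, \theta)
  = \sum_{z} q(z) \log \frac{q(z)}{p(z \mid x, \theta)}
  = \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big) \;\ge\; 0
```

so the bound is tight exactly when $q(z) = p(z \mid x, \theta)$, the true posterior over the indicator.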
EM-algorithm
Maximize the lower bound or minimize the gap:
Maximize over q(Z) => E-step
Maximize over Θ => M-step
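The two alternating steps can be sketched for a 1-D Gaussian mixture. This is a minimal illustration under my own naming and initialization choices (`em_gmm`, quantile-spread initial means), not code from the lecture:

```python
import numpy as np

def em_gmm(x, K, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture (a sketch, for illustration)."""
    n = len(x)
    pi = np.full(K, 1.0 / K)                        # mixing weights
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)   # spread initial means over the data
    var = np.full(K, np.var(x))
    for _ in range(n_iter):
        # E-step: set q(Z) to the posterior -> responsibilities gamma[i, k]
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        gamma = np.exp(logp - logp.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood over Theta
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = gamma.T @ x / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```

Each iteration first closes the gap in $q(Z)$ (E-step), then raises the bound in the parameters (M-step), so the log-likelihood never decreases.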
Convergence of EM
Local optimum is guaranteed under mild conditions (Dempster et al., 1977)
alternating minimization for a bi-convex problem
Some special cases with global optimum (Wu, 1983)
First-order gradient ascent for the log-likelihood
for a comparison with other gradient ascent methods, see (Xu & Jordan, 1995)
Relation between GMM and K-Means
Small variance asymptotics: with shared spherical covariances $\sigma^2 I$, the EM algorithm for GMM reduces to K-Means as $\sigma^2 \to 0$:
E-step: responsibilities harden into nearest-center assignments
M-step: each mean becomes the average of its assigned points
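Concretely, assuming equal mixing weights and a shared covariance $\sigma^2 I$ (the usual setting for this limit), the responsibility of component $k$ for point $x_n$ is

```latex
\gamma_{nk}
  = \frac{\exp\!\big(-\|x_n - \mu_k\|^2 / 2\sigma^2\big)}
         {\sum_{j} \exp\!\big(-\|x_n - \mu_j\|^2 / 2\sigma^2\big)}
  \;\xrightarrow{\;\sigma^2 \to 0\;}\;
  \begin{cases} 1, & k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0, & \text{otherwise} \end{cases}
```

so the soft E-step turns into K-Means' hard nearest-center assignment, and the weighted M-step mean becomes the plain cluster average.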
Single Linkage Hierarchical Clustering
Start with "every point is its own cluster"
Find the "most similar" pair of clusters
Merge them into a parent cluster
Repeat
Key Question: How do we define similarity between clusters?
=> minimum, maximum, or average distance between points in clusters
[Slide courtesy: Andrew Moore]
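The loop above can be written as a short O(n³) sketch. `single_linkage` is a hypothetical helper name, and stopping when `k` clusters remain is one common choice (cutting the dendrogram at a height is another):

```python
def single_linkage(points, k):
    """Agglomerative clustering with single linkage; merge until k clusters remain."""
    # start with "every point is its own cluster"
    clusters = [[p] for p in points]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # single linkage: cluster similarity = MINIMUM inter-point distance
    def linkage(c1, c2):
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        # find the "most similar" pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```

Swapping `min` for `max` or a mean inside `linkage` gives complete or average linkage, the other two similarity definitions the slide mentions.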
How many components are good?
Can we let the data speak for themselves?
let data determine model complexity (e.g., the number of components in mixture models)
allow model complexity to grow as more data are observed
we will talk about Dirichlet Process (DP) Mixtures and nonparametric Bayesian models
Summary
Gaussian Mixtures and K-means are effective tools to discover clustering structures
EM algorithms can be applied to do MLE for GMMs
Relationships between GMMs and K-means were discussed
Unresolved issues
How to determine the number of components for mixture models?
How to determine the number of clusters for K-means?
Materials to Read
Chap. 9 of Bishop’s PRML book
Bottou, L. & Bengio, Y. Convergence Properties of the K-Means Algorithms, NIPS 1995.