Clustering Techniques: K-Means & EM
DESCRIPTION
Clustering Techniques: K-Means & EM. Mario Haddad. What is clustering? A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is unsupervised classification: there are no predefined classes. Typical applications follow.
TRANSCRIPT
![Page 1: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/1.jpg)
Clustering Techniques: K-Means & EM
Mario Haddad
![Page 2: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/2.jpg)
What is Clustering?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
As a stand-alone tool to get insight into the data distribution
As a preprocessing step for other algorithms
![Page 3: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/3.jpg)
K-Means
![Page 4: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/4.jpg)
K-Means
Groups the data into K clusters, assigning each data point so as to minimize the sum of squared distances to the cluster means.
The algorithm iterates between two stages until the assignments converge.
![Page 5: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/5.jpg)
Problem Formulation
Given a data set {x1, …, xN} consisting of N instances of a random D-dimensional Euclidean variable x.
Introduce a set of K prototype vectors µk, k = 1, …, K, where µk corresponds to the mean of the kth cluster.
![Page 6: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/6.jpg)
Problem Formulation
The goal is to find an assignment of data points to prototype vectors that minimizes the sum of squared distances from each data point to its assigned prototype.
![Page 7: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/7.jpg)
Problem Formulation (cont.)
This can be formalized by introducing an indicator variable for each data point: r_{nk} \in \{0, 1\}, for k = 1, …, K.
Our objective function becomes:

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2
![Page 8: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/8.jpg)
How K-Means works
The algorithm initializes the K prototype vectors to K distinct random data points.
It cycles between two stages until convergence is reached.
Convergence is guaranteed: there is only a finite set of possible assignments, and neither stage ever increases J.
![Page 9: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/9.jpg)
How K-Means works
1. For each data point, determine r_{nk}:

r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}

2. Update \mu_k:

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
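The two stages above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code; the initialization and convergence test follow the slides (K distinct random data points, stop when assignments no longer change):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-Means sketch: alternate the assignment and update stages."""
    rng = np.random.default_rng(seed)
    # Initialize the K prototype vectors to K distinct random data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Stage 1: r_nk -- assign each point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = d2.argmin(axis=1)
        # Stage 2: move each prototype to the mean of its assigned points
        # (keeping the old prototype if a cluster ends up empty).
        new_mu = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # finitely many assignments => convergence
            break
        mu = new_mu
    return mu, labels
```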
![Page 10: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/10.jpg)
How K-Means works (cont)
If k (the number of clusters) and d (the dimension of the variables) are fixed, the clustering can be solved exactly in polynomial time:

O(N^{dk+1} \log N)

In the worst case, however, the iteration count can be exponential, even in the plane.
![Page 11: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/11.jpg)
K-Means Initialization example
Pick K cluster centers (an unfortunate choice)
![Page 12: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/12.jpg)
K-Means Initialization example
Pick K cluster centers (random choice)
![Page 13: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/13.jpg)
K-Means++
K-Means with smart initial seeding:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat steps 2 and 3 until k centers have been chosen.
Once the initial centers have been chosen, proceed using standard K-Means.
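The seeding steps can be sketched as follows (an illustrative NumPy version, not the canonical implementation):

```python
import numpy as np

def kmeanspp_seed(X, K, seed=0):
    """K-Means++ seeding sketch: D(x)^2-weighted choice of initial centers."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                      # step 1: uniform pick
    while len(centers) < K:
        # step 2: D(x)^2 -- squared distance to the nearest chosen center
        diff = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        # step 3: sample the next center with probability proportional to D(x)^2
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centers.append(X[idx])
    return np.asarray(centers)
```

Note that a point already chosen as a center has D(x) = 0, so it can never be picked again.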
![Page 14: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/14.jpg)
K-Means++
This seeding method yields considerable improvement in the final error of K-Means.
It takes more time to initialize, but once initialized, K-Means converges quickly.
In reported experiments it is usually faster than standard K-Means and up to 1000 times less prone to error.
![Page 15: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/15.jpg)
K-Means Example
Cluster black and white intensities.
Intensities: 1 3 8 11
Centers: c1 = 7, c2 = 10
Assign 1, 3, 8 to c1 and 11 to c2
Update c1 = (1+3+8)/3 = 4, c2 = 11
Assign 1, 3 to c1 and 8, 11 to c2
Update c1 = 2, c2 = 9.5
Converged
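The slide's arithmetic can be checked with a few lines of NumPy (illustrative only):

```python
import numpy as np

x = np.array([1.0, 3.0, 8.0, 11.0])   # the four intensities
c = np.array([7.0, 10.0])             # the starting centers

while True:
    labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)       # assign
    new_c = np.array([x[labels == k].mean() for k in range(2)])   # update
    if np.allclose(new_c, c):
        break
    c = new_c
# c converges to centers 2 and 9.5, matching the slide
```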
![Page 16: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/16.jpg)
Computer Vision: A Modern Approach, Segmentation slides by D.A. Forsyth
Image Clusters on intensity
K-Means
![Page 17: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/17.jpg)
K-Means
Original:
After Intensity Clustering:
![Page 18: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/18.jpg)
Computer Vision: A Modern Approach, Segmentation slides by D.A. Forsyth
K-means using color alone, 11 segments.
![Page 19: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/19.jpg)
Computer Vision: A Modern Approach, Segmentation slides by D.A. Forsyth
K-means using color and position, 20 segments.
![Page 20: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/20.jpg)
Pros and Cons of K-Means
Convergence: J may converge to a local minimum rather than the global minimum, so the algorithm may have to be repeated multiple times.
With a large data set, the Euclidean distance calculations can be slow.
K is an input parameter; an inappropriate choice of K may yield poor results.
![Page 21: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/21.jpg)
Local Minima
K-Means might not find the best possible assignments and centers.
Consider points 0, 20, 32. K-Means can converge to centers at 10 and 32, or to centers at 0 and 26.
Heuristic solution: start from many random starting points and pick the best solution.
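The restart heuristic on the 0/20/32 example can be sketched directly (a small illustration, not the presenter's code; here every pair of data points is tried as a start):

```python
from itertools import combinations
import numpy as np

def kmeans_1d(x, c, n_iters=50):
    """Run K-Means from starting centers c and return (centers, J)."""
    c = np.array(c, dtype=float)
    for _ in range(n_iters):
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([x[labels == k].mean() for k in range(len(c))])
    labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
    return c, float(((x - c[labels]) ** 2).sum())

x = np.array([0.0, 20.0, 32.0])
# Try every pair of data points as starting centers; keep the lowest-J run.
runs = [kmeans_1d(x, pair) for pair in combinations(x, 2)]
best_c, best_J = min(runs, key=lambda r: r[1])
# Starting from (20, 32) gets stuck at centers (10, 32) with J = 200;
# the best run reaches centers (0, 26) with J = 72.
```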
![Page 22: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/22.jpg)
EM: Expectation Maximization
![Page 23: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/23.jpg)
Soft Clustering
Clustering typically assumes that each instance is given a “hard” assignment to exactly one cluster.
This does not allow for uncertainty in class membership, or for an instance to belong to more than one cluster.
Soft clustering gives probabilities that an instance belongs to each of a set of clusters.
Each instance is assigned a probability distribution across a set of discovered categories (probabilities of all categories must sum to 1).
![Page 24: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/24.jpg)
EM
Tends to work better than K-Means.
Soft assignments: a point is partially assigned to all clusters.
Uses a probabilistic formulation.
![Page 25: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/25.jpg)
Mixture of Gaussians
g(x; m, σ)
The probability density of a point x under a Gaussian distribution with mean m and standard deviation σ
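In code, g can be written directly from the normal density (here σ is taken as the standard deviation):

```python
import numpy as np

def g(x, m, sigma):
    """Gaussian density at x with mean m and standard deviation sigma."""
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
```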
![Page 26: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/26.jpg)
Intuition
![Page 27: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/27.jpg)
A mixture of K Gaussians
A distribution generated by randomly selecting one of K Gaussians, then randomly drawing a point from that Gaussian.
Gaussian k is selected with probability p_k.
Goal: find the p_k, σ_k, m_k that maximize the probability of our data points.
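The generative process just described can be sketched in a few lines (an illustrative sampler, not from the slides):

```python
import numpy as np

def sample_mixture(p, m, sigma, n, seed=0):
    """Generate n points: pick Gaussian k with probability p[k], then draw from it."""
    rng = np.random.default_rng(seed)
    p, m, sigma = map(np.asarray, (p, m, sigma))
    ks = rng.choice(len(p), size=n, p=p)      # which Gaussian generated each point
    return rng.normal(m[ks], sigma[ks]), ks   # the points and their (hidden) labels
```

In the clustering setting the component labels ks are hidden; EM's job is to recover p, m, and σ from the points alone.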
![Page 28: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/28.jpg)
Back to EM
Iterative algorithm. Goal: group some primitives together.
Chicken-and-egg problem:
Items in the group => description of the group
Description of the group => items in the group
![Page 29: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/29.jpg)
Brace Yourselves..
![Page 30: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/30.jpg)
EM
Iterative Algorithm: E Step and M Step
E Step: Compute the probability that point n is generated by distribution k:

p^{(i)}(k \mid n) = \frac{p_k^{(i)} \, g(x_n; m_k^{(i)}, \sigma_k^{(i)})}{\sum_{j=1}^{K} p_j^{(i)} \, g(x_n; m_j^{(i)}, \sigma_j^{(i)})}
![Page 31: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/31.jpg)
EM
M Step:

m_k^{(i+1)} = \frac{\sum_{n=1}^{N} p^{(i)}(k \mid n) \, x_n}{\sum_{n=1}^{N} p^{(i)}(k \mid n)}

\sigma_k^{(i+1)} = \sqrt{\frac{1}{D} \cdot \frac{\sum_{n=1}^{N} p^{(i)}(k \mid n) \, \lVert x_n - m_k^{(i+1)} \rVert^2}{\sum_{n=1}^{N} p^{(i)}(k \mid n)}}

p_k^{(i+1)} = \frac{1}{N} \sum_{n=1}^{N} p^{(i)}(k \mid n)
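Both steps together, for 1-D data (so D = 1), as a minimal NumPy sketch; the quantile-based initialization is an illustrative choice, not from the slides:

```python
import numpy as np

def g(x, m, sigma):
    """Gaussian density with mean m and standard deviation sigma."""
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def em_gmm_1d(x, K, n_iters=200):
    # Illustrative initialization: spread the means over the data quantiles.
    m = np.quantile(x, (np.arange(K) + 0.5) / K)
    sigma = np.full(K, x.std())
    p = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibility p(k|n) of component k for point n
        w = p[None, :] * g(x[:, None], m[None, :], sigma[None, :])  # (N, K)
        r = w / w.sum(axis=1, keepdims=True)
        # M step: re-estimate means, standard deviations, and mixing weights
        Nk = r.sum(axis=0)
        m = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - m[None, :]) ** 2).sum(axis=0) / Nk) + 1e-12
        p = Nk / len(x)
    return p, m, sigma
```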
![Page 32: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/32.jpg)
EM
Converges to a locally optimal solution
Each step increases (or at least never decreases) the likelihood of the points given the distributions.
Can get stuck in local optima (though less often than K-Means).
![Page 33: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/33.jpg)
EM vs K-Means local optima
1D points at 0, 20, 32. Centers at 10 and 32: a local minimum for K-Means.
EM: 20 is almost evenly shared between the two centers. The center at 32 moves closer to 20 and takes over, and the first center shifts to the left.
![Page 34: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/34.jpg)
EM and K-means
Notice the similarity between EM for Normal mixtures and K-Means. The expectation step is the assignment; the maximization step is the update of centers.
K-Means is a simplified EM: K-Means makes a hard decision while EM makes a soft decision when updating the parameters of the model.
![Page 35: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/35.jpg)
EM and K-Means
EM:

p^{(i)}(k \mid n) = \frac{p_k^{(i)} \, g(x_n; m_k^{(i)}, \sigma_k^{(i)})}{\sum_{j=1}^{K} p_j^{(i)} \, g(x_n; m_j^{(i)}, \sigma_j^{(i)})}

K-Means:

r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}
![Page 36: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/36.jpg)
EM and K-Means
EM:

m_k^{(i+1)} = \frac{\sum_{n=1}^{N} p^{(i)}(k \mid n) \, x_n}{\sum_{n=1}^{N} p^{(i)}(k \mid n)}

K-Means:

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
![Page 37: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/37.jpg)
Fast Image Segmentation Based on K-Means Clustering with Histograms in HSV Color Space
![Page 38: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/38.jpg)
HSV: Hue-Saturation-Value
![Page 39: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/39.jpg)
Overview
![Page 40: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/40.jpg)
Histogram Generation
![Page 41: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/41.jpg)
Motivation
Gray and color histograms in HSV color space for K-Means clustering
Cluster number automatically set by “Maximin” initialization
Fast and efficient at extracting regions with different colors in images.
Segmentation results are close to human perceptions.
![Page 42: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/42.jpg)
Maximin Initialization and Parameter Estimation
Use Maximin to initialize the number of clusters and the centroid positions:
Step A: From the color histogram bins and gray histogram bins, find the bin which has the maximum number of pixels to be the first centroid.
![Page 43: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/43.jpg)
Maximin Initialization and Parameter Estimation
Step B: For each remaining histogram bin, calculate the min distance, i.e. the distance between it and its nearest centroid. The bin with the maximum value of min distance is chosen as the next centroid.
Step C: Repeat the process until the number of centroids equals KMax, or the maximum distance in Step B is smaller than a predefined threshold ThM.
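Steps A-C can be sketched over histogram bins as follows (plain Euclidean distance in bin space; the hue circularity handled in the paper is omitted from this illustration):

```python
import numpy as np

def maximin_centroids(bins, counts, k_max=10, th_m=25.0):
    """Maximin seeding sketch (bins: (M, d) coordinates, counts: (M,) pixel counts)."""
    # Step A: the bin with the most pixels becomes the first centroid.
    centroids = [bins[np.argmax(counts)]]
    while len(centroids) < k_max:
        # Step B: each bin's distance to its nearest already-chosen centroid...
        diff = bins[:, None, :] - np.asarray(centroids)[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
        i = int(np.argmax(d))           # ...and the farthest such bin is next.
        if d[i] < th_m:                 # Step C: stop below the threshold ThM
            break
        centroids.append(bins[i])
    return np.asarray(centroids)
```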
![Page 44: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/44.jpg)
Maximin Initialization and Parameter Estimation
KMax is set to 10: there should be no more than 10 dominant colors in one image for high-level image segmentation.
ThM is set to 25, according to human perception of different colors in HSV color space.
![Page 45: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/45.jpg)
K-Means Clustering in HSV Color Space
Step 1: Estimate the parameters of K-Means.
Step 2: Two kinds of histogram bins are clustered together in this step. For color histogram bins, since the hue dimension is circular (e.g. 0° = 360°), the numerical boundary must be considered in the distance measurement and in the centroid calculations.
![Page 46: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/46.jpg)
K-Means Clustering in HSV Color Space
For gray histogram bins, there is no hue information, which means that the saturation values of gray histogram bins are all considered zero, and the hue values can be arbitrary.
![Page 47: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/47.jpg)
K-Means Clustering in HSV Color Space
![Page 48: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/48.jpg)
K-Means Clustering in HSV Color Space
Step 3: Recalculate and update the K cluster centroids. Again, since the hue dimension is circular, the indices in the hue dimension should be considered not absolutely but relatively.
![Page 49: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/49.jpg)
K-Means Clustering in HSV Color Space
Step 4: Check whether the clustering process has converged according to the total distortion measurement, which is the sum of distances between each histogram bin and its nearest cluster centroid.
When the change in total distortion is smaller than a predefined threshold, or the maximum number of iterations is reached, terminate; otherwise, go to Step 2.
![Page 50: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/50.jpg)
K-Means Clustering in HSV Color Space
Step 4 intuition:
G(v) represents the number of pixels in the gray histogram bin with parameter v.
B(h, s, v) represents the number of pixels in the color histogram bin with parameters (h, s, v).
![Page 51: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/51.jpg)
K-Means Clustering in HSV Color Space
Step 5: Image pixels are labeled with the index of the nearest centroid of their corresponding histogram bins. A labeled image is obtained in this step, and K-Means clustering is finished.
![Page 52: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/52.jpg)
K-Means Clustering in HSV Color Space
Eliminate noise and unnecessary details of the labeled images with a statistical filter that slides a window over the pixels. The purpose of this filter is to replace each pixel in the labeled image with the most frequent label in its window.
Areas smaller than a certain threshold are merged with the biggest neighboring region to avoid over-segmentation.
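The statistical filter can be sketched as a majority (mode) filter over a sliding window; the window size here is an illustrative choice, not taken from the paper:

```python
import numpy as np

def majority_filter(labels, w=1):
    """Replace each label with the most frequent label in its (2w+1)x(2w+1) window."""
    H, W = labels.shape
    out = labels.copy()
    for i in range(H):
        for j in range(W):
            # Window clipped at the image borders.
            win = labels[max(0, i - w):i + w + 1, max(0, j - w):j + w + 1]
            out[i, j] = np.bincount(win.ravel()).argmax()
    return out
```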
![Page 53: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/53.jpg)
K-Means Clustering in HSV Color Space
![Page 54: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/54.jpg)
Summary
Clustering
K-Means
K-Means++ initialization
EM
EM as a generalization of K-Means
Fast Image Segmentation Based on K-Means Clustering with Histograms in HSV Color Space
![Page 55: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/55.jpg)
Thank You For Listening
![Page 56: Clustering Techniques K-Means & EM](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816648550346895dd9bfb2/html5/thumbnails/56.jpg)
References
Tse-Wei Chen, Yi-Ling Chen, Shao-Yi Chien. Fast Image Segmentation Based on K-Means Clustering with Histograms in HSV Color Space.
David Jacobs. K-Means and EM.
D. Forsyth. Computer Vision: A Modern Approach (Segmentation slides).
Daozheng Chen. Expectation-Maximization Algorithm and Image Segmentation.