Unsupervised Learning
Clustering using the k-means algorithm
Avi Libster

Post on 19-Dec-2015

TRANSCRIPT
Clustering

• Used when we have a very large data set with very high dimensionality and lots of complex structure.
• Basic assumption: attributes of the data are independent.
Cluster Analysis

• Given a collection of data points in a space, which might be high dimensional, the goal is to find structure in the data: organize the data into sensible groups, so that each group contains points that are near each other in some sense.
• We want points in the same cluster to have high intra-cluster similarity and low similarity to points from different clusters.
Taxonomy of Clustering
What is K-means?

• An unsupervised learning algorithm.
• Used for partitioning datasets.
• Simple to use.
• Based on the minimization of the squared error.
Basic K-means algorithm

Begin
  Initialize n, k, m1 … mk
  do
    classify the n samples according to the nearest mi
    recompute each mi
  until no change in mi
  return m1 … mk
End

The goal is to minimize E = ∑i=1..k ∑v∈Ci d(μi, v)
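The pseudocode above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the function and variable names (`kmeans`, `dist2`) are my own:

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)              # initialize m1 .. mk from the data
    clusters = []
    for _ in range(max_iter):
        # classify the n samples according to the nearest mi
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, means[j]))
            clusters[i].append(p)
        # recompute each mi as the centroid of its cluster
        new_means = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else means[j]
            for j, cl in enumerate(clusters)
        ]
        if new_means == means:                 # no change in mi -> converged
            break
        means = new_means
    return means, clusters
```

On two well-separated groups of points this recovers the groups; the result still depends on the random initialization, as the local-optimum slides later show.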
And now for something completely different…

Pictures adapted from http://www.cs.ucr.edu/~eamonn/teaching/cs170materials/MachineLearning3.ppt
K-means Clustering: Step 1

[Figure: n points in the plane; 3 centers k1, k2, k3 chosen at random]
K-means Clustering: Step 2

[Figure: notice that the 3 centers k1, k2, k3 divide the space into 3 parts]
K-means Clustering: Step 3

[Figure: new centers k1, k2, k3 are calculated from the points assigned to each center]
K-means Clustering: Step 4

[Figure: each point is classified to the nearest newly calculated center]
K-means Clustering: Step 5

[Figure: after classifying the points to the previous centers, new centers k1, k2, k3 are calculated; axes are expression in condition 1 vs. expression in condition 2]
Classic K-Means Strengths

First and foremost: VERY easy to implement and understand. Nice results for a simple algorithm.
Classic K-Means Strengths continued…

K-means can be viewed as a stochastic hill-climbing procedure: we are looking for a local optimum rather than a global optimum (as opposed to genetic or deterministic-annealing algorithms, which look for a global optimum).
Why hill climbing?

Actually, "hill climbing" can be a misleading term in this context. The hill climbing is not done over the dataset points but over the mean values. While the k-means algorithm runs, we change the values of the k means, and the changes of the means are somewhat dependent on each other.
Hill climbing continued…

• The algorithm is said to converge when the mean values no longer change.
• This happens when dmi/dt = 0: in the phase plane created from the mean values we have reached the top of the hill (a stable point or a saddle point).
K-means Strengths continued…

K-means complexity is easily derived from the algorithm: O(n·d·k·T), where
n – number of samples
d – number of features (usually the dimension of the samples)
k – number of centers
T – number of iterations

When the datasets are not too large and of low dimension, the average running time is not high.
Using K-means strengths

The following are real-life situations in which K-means is used as the key clustering algorithm.

While presenting the samples, I would like to emphasize some important points concerning the implementation of k-means.
Sample: Understanding gene regulation using expression-array cluster analysis
Scheme of gene expression research

Gene Expression Data
  → (pairwise measures) → Distance/Similarity Matrix
  → (clustering) → Gene Clusters
  → (motif searching / network construction; integrated analysis: NMR/SNP/Clinic/…) → Regulatory Elements / Gene Functions
Semantics of clusters: from co-expressed to co-regulated

[Figure: normalized expression data → gene expression clustering; co-expressed clusters are linked to protein/protein complexes and DNA regulatory elements]
The meaning of gene clusters

• Genes that are consistently either up- or down-regulated in a given set of conditions. Down- or up-regulation may shed light on the causes of biological processes.
• Patterns of gene expression and grouping genes into expression classes might provide much greater insight into their biological function and relevance.
Why should clusters emerge?

Genes contained in a particular pathway, or that respond to some environmental change, should be co-regulated and consequently should show similar patterns of expression.

From that fact one can see that the main goal is to identify genes which show similar patterns of expression. Examples:

1. If gene a's expression is rising, gene b's expression is rising too (perhaps because gene a encodes a protein which regulates the expression of b).
2. Gene a is always expressed together with gene b (perhaps because both are co-regulated by the same protein).
Classic Microarray Experiment

[Figure: control and treated samples → extract mRNA → reverse-transcribe (RT) and label with fluorescent dyes → cDNA; each spot is a DNA probe (known cDNA or oligo); mix and hybridize the target to the microarray]
Measured microarray raw data

Measure the amounts of green and red fluorescence. For each well, 4 results are possible:

1. No color – the gene wasn’t expressed
2. Red – the gene was expressed only by the control group
3. Green – the gene was expressed only by the treated group
4. Yellow – the gene was expressed by both groups

Important conclusion: don’t let color-blind people perform the test.
Example of microarray image
Data extraction process

• Adjust fluorescent intensities using standards (as necessary)
• Calculate the ratio of red to green fluorescence
• Convert to log2 and round to an integer
• Values range from saturated green = −2, through black = 0, to saturated red = 2
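The extraction steps above can be sketched as a small helper. The name `expression_value` is hypothetical, and clipping to [−2, 2] follows the scale assumed on this slide:

```python
import math

def expression_value(red, green):
    # log2 of the red/green fluorescence ratio, rounded to an integer
    ratio = round(math.log2(red / green))
    # clip to the slide's scale: saturated green = -2, black = 0, saturated red = 2
    return max(-2, min(2, ratio))
```

For example, equal red and green intensities (yellow) map to 0, while a strongly green well maps to −2 regardless of how extreme the ratio is.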
Input data for clustering

• Genes in rows, conditions in columns. A condition can be exposure to a specific environment, a time point, etc. Each column is one microarray test.

[Table excerpt — yeast genes vs. three cell-cycle alpha-factor conditions:]

YORF      NAME   Function                                  Alpha-Factor 1  Alpha-Factor 2  Alpha-Factor 3
YHR051W   COX6   cytochrome-c oxidase subunit VI               0.03            0.30            0.37
YKL181W   PRS1   phosphoribosylpyrophosphate synthetase        0.33           -0.20           -0.12
YHR124W   NDT80  meiosis transcription factor                  0.36            0.08            0.06
YOR202W   HIS3   imidazoleglycerol-phosphate dehydratase       0.10            0.48            0.86
YCR005C   CIT2   peroxisomal citrate synthase                  0.34            1.46            1.23
…
Why is the data extraction process relevant and important?

• It creates an easy-to-work-with scale (−2 ≤ x ≤ 2).
• More importantly, k-means is sensitive to the measurement units we choose (more precisely, to linear transformations). Let’s demonstrate that:
What happens if we change the measurement unit to ft?

Person  Age (yr)  Height (cm)
A       35        190
B       40        190
C       35        160
D       40        160

[Figure: age vs. height in cm; A and B cluster at the top, C and D at the bottom]
When the measurement units were changed, a very different clustering structure emerged.

Person  Age (yr)  Height (ft)
A       35        6.2
B       40        6.2
C       35        5.2
D       40        5.2

[Figure: age vs. height in ft; now A and C cluster on the left, B and D on the right]
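A quick check of the tables above: with Euclidean distance, person A's nearest neighbour flips from B (same height) to C (same age) when height switches from cm to ft. A small sketch, with values taken from the slides:

```python
import math

def dist(p, q):
    # Euclidean distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (age in years, height) for persons A, B, C
cm = {"A": (35, 190), "B": (40, 190), "C": (35, 160)}
ft = {"A": (35, 6.2), "B": (40, 6.2), "C": (35, 5.2)}

print(dist(cm["A"], cm["B"]), dist(cm["A"], cm["C"]))  # 5.0 30.0 -> A is nearest to B
print(dist(ft["A"], ft["B"]), dist(ft["A"], ft["C"]))  # 5.0 and about 1.0 -> A is nearest to C
```

In cm the height axis dominates the distance; in ft the age axis does, so the same four people cluster completely differently.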
How to overcome measurement unit problems?

• It’s clear that if the k-means algorithm is to be used, the data should be normalized and standardized.

Let’s just have a brief look at the dataset structure…
Dataset structure

We are dealing with a multivariate dataset composed of p variables (p microarray tests done) for n independent observations (genes). We represent it using an n × p matrix M consisting of vectors X1 through Xn, each of length p.
• Normalization:
  – Calculate the mean value of every variable (column) j:
      mj = (1/n)(x1j + x2j + … + xnj)
  – Measure how well the j-th variable is spread over the data.

• Mean absolute deviation:
      sj = (1/n)(|x1j − mj| + |x2j − mj| + … + |xnj − mj|)
  (compare with the standard deviation:
      stdj = sqrt((1/(n − 1))((x1j − mj)² + (x2j − mj)² + … + (xnj − mj)²)) )

Note: if this is done we become sensitive to outliers. More on that in the pitfalls-of-k-means section.
• z-scores standardize the measurements:

      zij = (xij − mj) / sj

  giving again an n × p matrix (objects in rows, variables in columns):

      Z =  | z11 … z1p |
           | …       … |
           | zn1 … znp |
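The normalization and z-score formulas above can be sketched for one column, using the mean absolute deviation sj exactly as the slides define it:

```python
def z_scores(column):
    n = len(column)
    m = sum(column) / n                          # mean m_j of the column
    s = sum(abs(x - m) for x in column) / n      # mean absolute deviation s_j
    return [(x - m) / s for x in column]         # z_ij = (x_ij - m_j) / s_j
```

The resulting column always has mean 0, and each value is expressed in units of the column's typical spread.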
Calculating the covariance matrix will come in handy later. The covariance matrix is a p × p matrix:

      cov(j, k) = (1/(n − 1)) ∑i=1..n (M(i, j) − mj)(M(i, k) − mk)
Before running k-means on the data it’s also a good idea to do mean centering. Mean centering reduces the effect of the variables with the largest values (columns), which can obscure other important differences.

But when applying the previous steps to the data you should be cautious. Standardizing the data may damage the cluster structure, because the effect of variables with a big contribution is reduced when they are divided by a large sj.
Back to the microarray world
[Figure: genes a and b measured in sample 1 and sample 2; the distance between the expression vectors (a1, a2) and (b1, b2)]
Distance

The way distance is measured is of the highest importance to the k-means algorithm: the distance function is what we use to classify points to the different centers.

A distance should be a function with the following properties:
1. d(A, A) = 0
2. d(A, B) = d(B, A)
3. d(A, B) > 0 for A ≠ B
Distances example 1

Below are distances which are good to use when we are looking for similarities:

• Euclidean distance:
      d(i, j) = sqrt((xi1 − xj1)² + (xi2 − xj2)² + … + (xip − xjp)²)
• Manhattan distance:
      d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
• Minkowski distance:
      d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q)^(1/q)
• Weighted Euclidean distance:
      d(i, j) = sqrt(w1(xi1 − xj1)² + w2(xi2 − xj2)² + … + wp(xip − xjp)²)
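The four distances above collapse into one Minkowski helper: q = 2 gives Euclidean, q = 1 Manhattan, and optional weights give the weighted Euclidean variant. A sketch (the signature is my own):

```python
def minkowski(x, y, q=2, w=None):
    # Minkowski distance of order q, with optional per-coordinate weights
    w = w if w is not None else [1.0] * len(x)
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1 / q)
```

Setting a weight to 0 removes that coordinate from the distance entirely, which is the extreme case of the unit-sensitivity problem shown earlier.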
Distances example 2

The following examples are measures of dissimilarity:

• Mahalanobis distance:
      r(i, j) = (xi − xj)ᵀ C⁻¹ (xi − xj)
  where C is the covariance matrix (calculated before).
• Tanimoto distance:
      d(i, j) = (xiᵀxi + xjᵀxj − 2 xiᵀxj) / (xiᵀxi + xjᵀxj − xiᵀxj)
  For binary attribute vectors, xiᵀxj denotes the number of common attributes between i and j.
Pearson correlation coefficient
The most common distance measurement used in microarrays.
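The slide gives no formula, so here is the standard Pearson correlation turned into a dissimilarity (1 − r), which is how it is typically applied to expression profiles; a sketch, with `pearson_distance` being my own name:

```python
import math

def pearson_distance(x, y):
    # 1 - Pearson correlation coefficient between two expression profiles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)
```

Two genes whose profiles rise and fall together get a distance near 0 even if their absolute expression levels differ, which is exactly why this measure is popular for microarrays.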
Effects of choosing the distance function

The Euclidean distances from the mean of points A and B are equal, but point B is clearly “more different” from the population than point A (it lies on the border of the ellipse).

[Figure: points A and B against an ellipse showing the 50% contour of a hypothetical population; the ellipse was created using a distance function based on the covariance matrix]
Results from microarray data analysis
Results: clustering for patterns
Problems with K-means type algorithms
• Clusters are approximately spherical
• A local optimum may be incorrect and is influenced by the choice of the initial k mean values
• High dimensionality is a problem
• The value of K is an input parameter
• Sensitive to outliers
Clusters are approximately spherical

• What happens if a cluster is not spherical?

Also, because k-means assumes the data to be spherical, it becomes sensitive to coordinate changes (i.e. weighting changes).
• What happens with non-conventional structures?

[Figure: minimum distance, average distance, and maximum distance between clusters]
Local optimum problem

• If we begin with b, e as the initial means we end with {a, b, c} and {e, d, f} as clusters. If we begin with e, f as the initial means we end with {c, f} and {a, b, d, e} as clusters.
Outliers problem
• What happens in the next situation?
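One answer to the question above: because a k-means center is an arithmetic mean, a single far-away point drags it a long way. A tiny sketch:

```python
def centroid(points):
    # the mean of a set of points (what k-means uses as a cluster center)
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

tight = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(centroid(tight))                   # (0.5, 0.5)
print(centroid(tight + [(100, 100)]))    # (20.4, 20.4): one outlier pulls the center far outside the cluster
```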
High dimensionality

High dimensionality poses a computational problem: the higher the dimension, the more resources are required. Running time grows linearly with the dimension.
Dimensionality reduction

Using PCA (principal component analysis) we can identify the dimensions which are most important to us (those along which most of the variation in the data occurs). After finding the proper dimensions we reduce the dataset’s dimensionality accordingly.
The value of k is an input parameter

[Figure: left, an insufficient number of centers; right, too many centers]
K-means variations

• Instead of a fixed number of centers, the number of centers changes as the algorithm runs.
• G-means is an example of such an algorithm.
G-means algorithm
Ideas behind G-means

• Each cluster adheres to a unimodal distribution, such as a Gaussian.
• Doesn’t presume prior or domain-specific knowledge.
• The number of centers is incremented only if we have a cluster without a Gaussian distribution.
• A statistical test is used to determine whether clusters have Gaussian distributions or not.
Testing clusters for Gaussian fit

We need a test to detect whether the data assigned to a center is sampled from a Gaussian. The hypotheses are:

• H0 – the data around the center is sampled from a Gaussian
• H1 – the data around the center is not sampled from a Gaussian

Accepting the null hypothesis means we believe that one center is sufficient and we shouldn’t split the cluster. If we reject the null hypothesis, then we want to split the cluster.
Anderson–Darling statistic

• A one-dimensional test.
• A normality test based on the empirical cumulative distribution function (ECDF).

Equation (with z(i) the sorted, standardized values and F the standard normal CDF):

      A² = −n − (1/n) ∑i=1..n (2i − 1)[ln F(z(i)) + ln(1 − F(z(n+1−i)))]

If the mean and variance are estimated from the data, the corrected statistic is

      A*² = A² (1 + 4/n − 25/n²)
Hypothesis test

Hypothesis test for the subset of data X in d dimensions that belongs to center c:

• Choose the significance level for the test.
• Initialize two centers, called the children of c.
• Run k-means on these two centers in X.
• Let v be the vector that connects c1 and c2; v is the direction that k-means believes to be important for clustering. Project X onto v to obtain X′, a one-dimensional representation of the data. Transform X′ to mean 0 and variance 1.
• Let zi = F(x′(i)) and calculate the Anderson–Darling statistic from these values. Determine the critical value according to the significance level. If the statistic is in the non-critical range, accept the null hypothesis and discard the children; otherwise keep the children.
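The projection step in the test above can be sketched as follows; `project_and_standardize` is my own name for the step that produces the one-dimensional sample X′ fed to the Anderson–Darling test:

```python
import math

def project_and_standardize(X, c1, c2):
    # v connects the two child centers: the direction k-means finds important
    v = [a - b for a, b in zip(c1, c2)]
    norm2 = sum(vi * vi for vi in v)
    # project every point onto v -> one-dimensional X'
    xp = [sum(xi * vi for xi, vi in zip(x, v)) / norm2 for x in X]
    # transform X' to mean 0 and variance 1
    n = len(xp)
    m = sum(xp) / n
    sd = math.sqrt(sum((t - m) ** 2 for t in xp) / n)
    return [(t - m) / sd for t in xp]
```

If the cluster really is one Gaussian, this 1-D sample looks normal; if it is two merged clusters, the projection along v is bimodal and the Anderson–Darling statistic becomes large.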
Two clusters composed of 1000 points each, with alpha = 0.0001; the critical value for the Anderson–Darling test at this confidence level is 1.8692. We start with one center.
After one iteration of G-means we have two centers. The Anderson–Darling statistic comes out at 38.103, much larger than the critical value, so we reject the null hypothesis and accept this split.
On the next iteration we split each new center and repeat the statistical test. The values for the two splits are 0.386 and 0.496, both below the critical value. The null hypothesis is accepted for both tests and the splits are discarded. Thus G-means’ final answer is k = 2.
Another strength of the G-means algorithm is its ability to handle non-spherical data, even when the number of points in each cluster is small.

One should note that experiments show a tendency toward type II error (not splitting when it should) when the number of points in each cluster is small.