assessment. schedule graph may be of help for selecting the best solution best solution corresponds...
TRANSCRIPT
![Page 1: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/1.jpg)
Assessment
![Page 2: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/2.jpg)
The schedule graph may help in selecting the best solution
The best solution corresponds to a plateau before a high jump
Solutions with very small or even singleton clusters are rather suspicious
![Page 3: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/3.jpg)
Standardization
K-means
Initial centroids
Validation example
Cluster merit index
Cluster validation, three approaches
Relative criteria
Validity index
Dunn index
Davies-Bouldin (DB) index
Combination of different distances/diameter methods
![Page 4: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/4.jpg)
Standardization issue
The need for any standardization must be questioned
If the interesting clusters are based on the original features, then any standardization method may distort those clusters
Only when there are grounds to search for clusters in a transformed space should some standardization rule be used
![Page 5: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/5.jpg)
There is no methodological way to choose a standardization except by “trial and error”
$$y_i = \frac{x_i - \mu}{\sigma}$$
$$y_i = \frac{x_i}{\max(x_i) - \min(x_i)}$$
$$y_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$
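The three standardization rules on this slide can be tried on a single feature with a minimal sketch like the one below. The function names are illustrative, and the use of the population standard deviation is an assumption, since the slide does not say which estimator is meant. A constant feature (zero spread) is not handled.

```python
from statistics import mean, pstdev

def zscore(values):
    """y_i = (x_i - mu) / sigma, with sigma taken as the population std (assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

def range_scale(values):
    """y_i = x_i / (max(x) - min(x))."""
    span = max(values) - min(values)
    return [x / span for x in values]

def min_max(values):
    """y_i = (x_i - min(x)) / (max(x) - min(x)), mapping the feature onto [0, 1]."""
    lo = min(values)
    span = max(values) - lo
    return [(x - lo) / span for x in values]

feature = [2.0, 4.0, 6.0, 8.0]
print(zscore(feature))
print(range_scale(feature))
print(min_max(feature))
```

Comparing the three outputs on the same feature is one way to carry out the trial-and-error comparison by hand before clustering.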
![Page 6: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/6.jpg)
An easy standardization method that frequently achieves good results is simple division or multiplication by a scale factor a
The factor a should be chosen so that all feature values occupy a suitable interval
$$y_i = \frac{x_i}{a}$$
![Page 7: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/7.jpg)
k-means Clustering
Cluster centers $c_1, c_2, \ldots, c_k$ with clusters $C_1, C_2, \ldots, C_k$
$$d_2(\vec{x}, \vec{z}) = \left( \sum_{i=1}^{d} (x_i - z_i)^2 \right)^{1/2}$$
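As a small illustration of the Euclidean distance and of how it ties patterns to cluster centers, the sketch below assigns each pattern to its nearest center. It covers only the assignment step of k-means, not the full algorithm, and the function names and toy data are assumptions made for the example.

```python
import math

def euclidean(x, z):
    """d_2(x, z): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def assign_to_nearest(patterns, centers):
    """Return, for each pattern, the index j of the closest center c_j."""
    return [
        min(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
        for x in patterns
    ]

patterns = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centers = [(0.0, 0.5), (5.5, 5.0)]
labels = assign_to_nearest(patterns, centers)
print(labels)
```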
![Page 8: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/8.jpg)
Initial centroids
Specify which patterns are used as initial centroids
• Random initialization
• Tree clustering on a reduced number of patterns may be performed for this purpose
• Choose the first k patterns as initial centroids
• Sort distances between all patterns and choose patterns at constant intervals of these distances as initial centroids
• Adaptive initialization (according to a chosen radius)
![Page 9: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/9.jpg)
k-means example (Sá 2001)
$$E = \sum_{j=1}^{k} \sum_{x \in C_j} d_2(x, c_j)^2$$
Cluster merit index $R_i$ (n patterns in k clusters)
$$E_i = \sum_{j=1}^{k} \sum_{x \in C_j} (x_i - c_{ij})^2$$
$$R_i^{(k+1)} = \left( \frac{E_i^{(k)}}{E_i^{(k+1)}} - 1 \right) (n - k - 1)$$
![Page 10: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/10.jpg)
The cluster merit index measures the decrease in overall within-cluster distance when passing from a solution with k clusters to one with k+1 clusters
A high value of the merit index indicates a substantial decrease in overall within-cluster distance
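The merit index can be checked on a toy case with the sketch below. It assumes that the per-feature error is the sum of squared deviations of feature i from the center of each pattern's cluster, and that the index is (E_i(k) / E_i(k+1) - 1)(n - k - 1), as read from the slide. The function names, the one-dimensional data and the hand-picked centers are illustrative only.

```python
def feature_error(patterns, labels, centers, i):
    """E_i: squared deviation of feature i from the center of each pattern's cluster."""
    return sum((x[i] - centers[j][i]) ** 2 for x, j in zip(patterns, labels))

def merit_index(e_k, e_k1, n, k):
    """R_i(k+1) = (E_i(k) / E_i(k+1) - 1) * (n - k - 1)."""
    return (e_k / e_k1 - 1.0) * (n - k - 1)

# Four one-dimensional patterns forming two obvious groups.
patterns = [(0.0,), (1.0,), (10.0,), (11.0,)]

# Solution with k = 1: a single center at the overall mean.
e_1 = feature_error(patterns, [0, 0, 0, 0], [(5.5,)], i=0)

# Solution with k + 1 = 2: one center per group.
e_2 = feature_error(patterns, [0, 0, 1, 1], [(0.5,), (10.5,)], i=0)

r_2 = merit_index(e_1, e_2, n=len(patterns), k=1)
print(e_1, e_2, r_2)
```

Going from one cluster to two removes almost all within-cluster error here, so the index is large, which matches the reading that a high value marks a worthwhile extra cluster.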
![Page 11: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/11.jpg)
![Page 12: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/12.jpg)
Cluster merit index
Factor 1 has the most important contribution
The values k=3,5,8 are sensible choices
k=3 is attractive
[Plot: cluster merit index R versus (k+1), for (k+1) from 1 to 7; R axis from -500 to 2500]
![Page 13: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/13.jpg)
Cluster validation
The procedure of evaluating the results of a clustering algorithm is known as cluster validity
In general terms, there are three approaches to investigating cluster validity
![Page 14: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/14.jpg)
The first is based on external criteria
This implies that we evaluate the results of a clustering algorithm based on a pre-specified structure, which is imposed on a data set and reflects our intuition about the clustering structure of the data set
![Page 15: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/15.jpg)
Error classification rate (ECR): a smaller value indicates a good representation
Data partition according to known classes $L_i$:
$$L = \{L_1, L_2, \ldots, L_G\}$$
$$\varphi_L(C_j) := \max_{i=1 \ldots G} |C_j \cap L_i|$$
$$ECR := \frac{1}{k} \sum_{j=1}^{k} \left( |C_j| - \varphi_L(C_j) \right)$$
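The error classification rate can be computed directly from set intersections. The sketch below assumes that clusters and known classes are both given as sets of pattern identifiers; the function name and the toy partitions are illustrative.

```python
def error_classification_rate(clusters, classes):
    """ECR = (1/k) * sum over clusters of (|C_j| - max_i |C_j & L_i|)."""
    k = len(clusters)
    misplaced = 0
    for cluster in clusters:
        # phi_L(C_j): size of the largest overlap with any known class.
        phi = max(len(cluster & known) for known in classes)
        misplaced += len(cluster) - phi
    return misplaced / k

classes = [{1, 2, 4}, {3, 5, 6}]

perfect = error_classification_rate([{1, 2, 4}, {3, 5, 6}], classes)
mixed = error_classification_rate([{1, 2, 3}, {4, 5, 6}], classes)
print(perfect, mixed)
```

In the second call each cluster contains one pattern from the wrong class, so the average number of misplaced patterns per cluster is 1.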
![Page 16: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/16.jpg)
The second approach is based on internal criteria
We may evaluate the results of a clustering algorithm in terms of quantities that involve the vectors of the data set themselves (e.g. proximity matrix)
![Page 17: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/17.jpg)
Proximity matrix
Dissimilarity matrix
$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
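A dissimilarity matrix of this form can be built from pairwise distances. The sketch below assumes Euclidean distance (the slide leaves d unspecified) and fills both triangles of the matrix for convenience, although only the lower triangle is shown above. The function name and data are illustrative.

```python
import math

def dissimilarity_matrix(patterns):
    """Return the n x n matrix of pairwise Euclidean distances d(a, b)."""
    n = len(patterns)
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a):
            # d(a, b) = d(b, a), and the diagonal stays 0.
            d[a][b] = d[b][a] = math.dist(patterns[a], patterns[b])
    return d

matrix = dissimilarity_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
for row in matrix:
    print(row)
```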
![Page 18: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/18.jpg)
The basis of the validation methods described above is often statistical testing
A major drawback of techniques based on internal or external criteria and statistical testing is their high computational demand
![Page 19: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/19.jpg)
The third approach to cluster validity is based on relative criteria
Here the basic idea is to evaluate a clustering structure by comparing it to other clustering schemes produced by the same algorithm but with different parameter values
![Page 20: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/20.jpg)
There are two criteria proposed for clustering evaluation and selection of an optimal clustering scheme (Berry and Linoff, 1996)
Compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized
Separation: the clusters themselves should be widely spaced
![Page 21: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/21.jpg)
Distance between two clusters
There are three common approaches to measuring the distance between two different clusters
Single linkage: It measures the distance between the closest members of the clusters
Complete linkage: It measures the distance between the most distant members
Comparison of centroids: It measures the distance between the centers of the clusters
![Page 22: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/22.jpg)
Relative criteria
The approach based on relative criteria does not involve statistical tests
The fundamental idea of this approach is to choose the best clustering scheme of a set of defined schemes according to a pre-specified criterion
![Page 23: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/23.jpg)
Among the clustering schemes Ci, i=1, ..., k, defined by a specific algorithm for different values of its parameters, choose the one that best fits the data set
The procedure of identifying the best clustering scheme is based on a validity index q
![Page 24: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/24.jpg)
Having selected a suitable performance index q, we proceed with the following steps
We run the clustering algorithm for all values of k between a minimum kmin and a maximum kmax
• The minimum and maximum values have been defined a priori by the user
For each value of k, we run the algorithm r times, using a different set of values for the other parameters of the algorithm (e.g. different initial conditions)
We plot the best value of the index q obtained for each k as a function of k
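The steps above can be sketched as a small loop over k. Everything named here is hypothetical: `run_clustering` stands for whatever clustering algorithm is being tuned, `validity_index` for the chosen index q, and the two `toy_` functions are placeholders that exist only so the sketch runs; they carry no clustering meaning.

```python
def best_index_per_k(data, k_min, k_max, r, run_clustering, validity_index,
                     maximize=True):
    """For each k in [k_min, k_max], run the algorithm r times and keep the best q."""
    pick = max if maximize else min
    best = {}
    for k in range(k_min, k_max + 1):
        scores = []
        for run in range(r):
            # The run number doubles as a seed, i.e. a different initial condition.
            labels = run_clustering(data, k, seed=run)
            scores.append(validity_index(data, labels))
        best[k] = pick(scores)
    return best

def toy_clustering(data, k, seed):
    """Placeholder: deals patterns round-robin into k groups, shifted by the seed."""
    return [(i + seed) % k for i in range(len(data))]

def toy_index(data, labels):
    """Placeholder index with no statistical meaning."""
    return len(set(labels)) + labels[0]

data = list(range(6))
curve = best_index_per_k(data, 2, 3, 3, toy_clustering, toy_index)
print(curve)
```

The returned dictionary maps each k to the best q found, which is the curve that would be plotted against k.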
![Page 25: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/25.jpg)
Based on this plot we may identify the best clustering scheme
There are two approaches for defining the best clustering depending on the behavior of q with respect to k
If the validity index does not exhibit an increasing or decreasing trend as k increases we seek the maximum (minimum) of the plot
![Page 26: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/26.jpg)
For indices that increase (or decrease) as the number of clusters increases, we search for the values of k at which a significant local change in the value of the index occurs
This change appears as a “knee” in the plot and is an indication of the number of clusters underlying the data set
The absence of a knee may be an indication that the data set possesses no clustering structure
![Page 27: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/27.jpg)
Validity index
Dunn index, a cluster validity index for k-means clustering proposed in Dunn (1974)
Attempts to identify “compact and well separated clusters”
![Page 28: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/28.jpg)
Dunn index
$$d(C_i, C_j) = \min_{\vec{x} \in C_i,\ \vec{y} \in C_j} d(\vec{x}, \vec{y})$$
$$diam(C_i) = \max_{\vec{x}, \vec{y} \in C_i} d(\vec{x}, \vec{y})$$
$$D_k = \min_{1 \le i \le k} \left\{ \min_{\substack{1 \le j \le k \\ i \ne j}} \left\{ \frac{d(C_i, C_j)}{\max_{1 \le l \le k} diam(C_l)} \right\} \right\}$$
![Page 29: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/29.jpg)
If the dataset contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small
Large values of the index indicate the presence of compact and well-separated clusters
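A minimal sketch of the Dunn index follows, assuming Euclidean distance between patterns, the closest-member distance between clusters and the largest pairwise distance as diameter, as defined on the previous slide. Because the denominator is the same for every pair of clusters, the index reduces to the smallest separation divided by the largest diameter. The function names and toy clusters are illustrative, and the case where every cluster is a singleton (zero diameter) is not handled.

```python
import math
from itertools import combinations

def set_distance(ci, cj):
    """d(C_i, C_j): distance between the closest members of two clusters."""
    return min(math.dist(x, y) for x in ci for y in cj)

def diameter(c):
    """diam(C): largest distance between two members (0 for a singleton)."""
    return max((math.dist(x, y) for x, y in combinations(c, 2)), default=0.0)

def dunn_index(clusters):
    """D_k = min_{i != j} d(C_i, C_j) / max_l diam(C_l)."""
    largest_diameter = max(diameter(c) for c in clusters)
    smallest_separation = min(
        set_distance(ci, cj) for ci, cj in combinations(clusters, 2)
    )
    return smallest_separation / largest_diameter

compact = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(dunn_index(compact), dunn_index(stretched))
```

The same four points give a large index when grouped into two tight pairs and a small one when each cluster spans the whole data range.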
![Page 30: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/30.jpg)
The index Dk does not exhibit any trend with respect to number of clusters
Thus, the maximum in the plot of Dk versus the number of clusters k can be an indication of the number of clusters that fits the data
![Page 31: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/31.jpg)
The drawbacks of the Dunn index are:
Considerable amount of time required for its computation
Sensitive to the presence of noise in datasets, since noise is likely to increase the values of diam(C)
![Page 32: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/32.jpg)
The Davies-Bouldin (DB) index (1979)
$$d(C_i, C_j) = \min_{\vec{x} \in C_i,\ \vec{y} \in C_j} d(\vec{x}, \vec{y})$$
$$diam(C_i) = \max_{\vec{x}, \vec{y} \in C_i} d(\vec{x}, \vec{y})$$
$$DB_k = \frac{1}{k} \sum_{i=1}^{k} \max_{i \ne j} \left\{ \frac{diam(C_i) + diam(C_j)}{d(C_i, C_j)} \right\}$$
![Page 33: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/33.jpg)
Small index values correspond to good clusters: the clusters are compact and their centers are far apart
The DBk index exhibits no trend with respect to the number of clusters, and thus we seek the minimum value of DBk in its plot versus the number of clusters
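The DB index as written on these slides can be sketched as follows. This follows the slide's formulation, with the closest-member distance between clusters and the largest pairwise distance as diameter; other formulations of the Davies-Bouldin index use centroid-based scatter and distance instead, so values are not comparable across definitions. Function names and toy data are illustrative, and at least two clusters are assumed.

```python
import math
from itertools import combinations

def set_distance(ci, cj):
    """d(C_i, C_j): distance between the closest members of two clusters."""
    return min(math.dist(x, y) for x in ci for y in cj)

def diameter(c):
    """diam(C): largest distance between two members (0 for a singleton)."""
    return max((math.dist(x, y) for x, y in combinations(c, 2)), default=0.0)

def db_index(clusters):
    """DB_k = (1/k) * sum_i max_{j != i} (diam(C_i) + diam(C_j)) / d(C_i, C_j)."""
    k = len(clusters)
    total = 0.0
    for i, ci in enumerate(clusters):
        total += max(
            (diameter(ci) + diameter(cj)) / set_distance(ci, cj)
            for j, cj in enumerate(clusters)
            if j != i
        )
    return total / k

compact = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(db_index(compact), db_index(stretched))
```

Here the tight grouping yields the smaller value, in line with seeking the minimum of the index.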
![Page 34: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/34.jpg)
Different methods may be used to calculate distance between clusters
• Single linkage
• Complete linkage
• Comparison of centroids
• Average linkage
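Complete linkage:
$$d_1(C_i, C_j) = \max_{\vec{x} \in C_i,\ \vec{y} \in C_j} d(\vec{x}, \vec{y})$$
Single linkage:
$$d_2(C_i, C_j) = \min_{\vec{x} \in C_i,\ \vec{y} \in C_j} d(\vec{x}, \vec{y})$$
Comparison of centroids:
$$d_3(C_i, C_j) = d(c_i, c_j)$$
Average linkage:
$$d_4(C_i, C_j) = \frac{1}{|C_i| \, |C_j|} \sum_{\vec{x} \in C_i} \sum_{\vec{y} \in C_j} d(\vec{x}, \vec{y})$$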
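The four inter-cluster distances can be compared on a toy pair of clusters with the sketch below. It assumes Euclidean distance between patterns and the coordinate-wise mean as cluster center; the function names follow the numbering d1 to d4 used on the slide but are otherwise illustrative.

```python
import math

def centroid(c):
    """Coordinate-wise mean of the patterns in a cluster."""
    n = len(c)
    return tuple(sum(x[a] for x in c) / n for a in range(len(c[0])))

def d1_complete(ci, cj):
    """Distance between the most distant members."""
    return max(math.dist(x, y) for x in ci for y in cj)

def d2_single(ci, cj):
    """Distance between the closest members."""
    return min(math.dist(x, y) for x in ci for y in cj)

def d3_centroid(ci, cj):
    """Distance between the cluster centers."""
    return math.dist(centroid(ci), centroid(cj))

def d4_average(ci, cj):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

ci = [(0.0, 0.0), (0.0, 2.0)]
cj = [(4.0, 0.0), (4.0, 2.0)]
for name, fn in [("complete", d1_complete), ("single", d2_single),
                 ("centroid", d3_centroid), ("average", d4_average)]:
    print(name, fn(ci, cj))
```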
![Page 35: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/35.jpg)
Different methods to calculate the diameter of a cluster
Max:
$$diam_1(C_i) = \max_{\vec{x}, \vec{y} \in C_i} d(\vec{x}, \vec{y})$$
Radius:
$$diam_2(C_i) = \max_{\vec{x} \in C_i} d(\vec{x}, \vec{c}_i)$$
Average distance:
$$diam_3(C_i) = \frac{\sum_{l < m} d(\vec{x}_l, \vec{x}_m)}{\dfrac{(|C_i| - 1) \, |C_i|}{2}} \quad \text{with } (\vec{x}_l, \vec{x}_m \in C_i) \wedge (l < m)$$
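The three diameter definitions can be sketched in the same style. Euclidean distance and the coordinate-wise mean as cluster center are assumptions, the function names are illustrative, and clusters with fewer than two patterns are not handled.

```python
import math
from itertools import combinations

def centroid(c):
    """Coordinate-wise mean of the patterns in a cluster."""
    n = len(c)
    return tuple(sum(x[a] for x in c) / n for a in range(len(c[0])))

def diam1_max(c):
    """Largest distance between two members."""
    return max(math.dist(x, y) for x, y in combinations(c, 2))

def diam2_radius(c):
    """Largest distance from a member to the cluster center."""
    center = centroid(c)
    return max(math.dist(x, center) for x in c)

def diam3_average(c):
    """Mean distance over the |C|(|C| - 1)/2 distinct pairs of members."""
    n = len(c)
    total = sum(math.dist(x, y) for x, y in combinations(c, 2))
    return total / (n * (n - 1) / 2)

cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
print(diam1_max(cluster), diam2_radius(cluster), diam3_average(cluster))
```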
![Page 36: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/36.jpg)
A fully connected graph with s nodes has $\frac{(s-1)s}{2}$ edges, which is the number of distinct pairs of patterns in a cluster of size s
![Page 37: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/37.jpg)
Combination of different distances/diameter methods
It has been shown that using different distances/diameter methods may produce indices with different scale ranges (Azuaje and Bolshakova 2002)
![Page 38: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/38.jpg)
Normalization
i selects the distance method, i ∈ {1,2,3,4}
j selects the diameter method, j ∈ {1,2,3}
σ(D^ij) or σ(DB^ij): standard deviation of D_k^ij or DB_k^ij across different values of k
![Page 39: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/39.jpg)
Normalized indexes
$$\hat{D}_k^{ij} = \frac{D_k^{ij} - \bar{D}^{ij}}{\sigma(D^{ij})}$$
$$\bar{D}^{ij} = \frac{1}{k} \sum_{l=1}^{k} D_l^{ij}$$
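The normalization can be sketched for one choice of distance method i and diameter method j. The sketch assumes that the index values are stored per number of clusters, that the mean and standard deviation are taken over the values of k that were actually tried, and that the population standard deviation is meant; the slide does not settle these points. The function name and the sample values are made up for illustration.

```python
from statistics import mean, pstdev

def normalize_index(values_by_k):
    """Subtract the mean and divide by the standard deviation across all tried k."""
    values = list(values_by_k.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std, an assumption
    return {k: (v - mu) / sigma for k, v in values_by_k.items()}

# Made-up index values for k = 2, 3, 4 under one distance/diameter combination.
raw = {2: 1.0, 3: 3.0, 4: 2.0}
normalized = normalize_index(raw)
print(normalized)
```

After normalization, curves obtained with different distance/diameter combinations share a common scale, while the position of the best k within each curve is unchanged.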
![Page 40: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/40.jpg)
Literature
![Page 41: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/41.jpg)
See also
J.P. Marques de Sá, Pattern Recognition, Springer, 2001
https://www.cs.tcd.ie/publications/tech-reports/
TCD-CS-2002-34.pdf TCD-CS-2005-25.pdf
![Page 42: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/42.jpg)
Standardization
K-means
Initial centroids
Validation example
Cluster merit index
Cluster validation, three approaches
Relative criteria
Validity index
Dunn index
Davies-Bouldin (DB) index
Combination of different distances/diameter methods
![Page 43: Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649cfd5503460f949cda04/html5/thumbnails/43.jpg)
Next lecture
KNN, LVQ, SOM