> Automated Variable Weighting in k-Means Type Clustering (Huang, J.Z.; Ng, M.K.; Hongqiang Rong; Zichen Li; 2005)
Presentation
28 September 2010
NB: this presentation was originally written in French. I have translated it quickly, so please don't get too upset if you encounter English mistakes. Instead, blame Gtranslate ;-) and drop me an email. Thanks! Franck.
31/12/2011 [email protected]
Table of Contents
1.Clustering
2.k-means clustering
3.Automated Variable Weighting
4.Experiments and results
5.Limitations of the study
1. Clustering
Examples of clustering applications
- Marketing: find groups of similar customers
- Biology: classify plants according to their characteristics
- Evolutionary algorithms: improve crossover diversity
- …
2. k-means clustering
K-means: an unsupervised method (MacQueen, 1967)
1) k initial centers are randomly selected among all data points (here k=3).
2) k groups are created by assigning each individual to the nearest center.
3) The centroid of each group becomes the new center.
4) Steps 2 and 3 are repeated as long as the centers are not stabilized.
The iteration ends when the centers are stabilized.
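The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation behind the presentation; the function name `kmeans` and the use of squared Euclidean distance are my own choices, and empty clusters simply keep their previous center:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, m) data matrix; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # 1) k initial centers are randomly selected among the data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 2) assign each individual to the nearest center (squared Euclidean)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3) the centroid of each group becomes the new center
        new_centers = centers.copy()
        for l in range(k):
            pts = X[labels == l]
            if len(pts):
                new_centers[l] = pts.mean(axis=0)
        # 4) repeat steps 2 and 3 until the centers are stabilized
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For real use, a library implementation (e.g. scikit-learn's `KMeans`) additionally handles empty clusters and restarts.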
2. k-means clustering
Limitations of the K-means algorithm:
Requires a metric
Needs the number of classes to be guessed a priori
The choice of initial centers influences the results
Sensitive to noise
2. k-means clustering
3D example: x1 and x2 are "clusterable", x3 is noise
[Figure: pairwise scatter plots of (x1, x2), (x1, x3) and (x2, x3), with the clustering result]
3. Automated Variable Weighting
(Automated Variable Weighting in k-Means Type Clustering – 2005)
Idea: weight each variable so as to give less weight to the variables affected by significant noise.
State of the art: Modha and Spangler already had this idea, but they compute the weights once, at the beginning of the algorithm.
Here, the weights are recomputed dynamically at each iteration of the K-means algorithm.
3. Automated Variable Weighting
1) k initial centers are randomly selected among all data points (here k=3).
2) k groups are created by assigning each individual to the nearest center.
3) The centroid of each group becomes the new center.
4) Computing weights.
5) Steps 2, 3 and 4 are repeated as long as the centers are not stabilized.
The iteration ends when the centers are stabilized.
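Once weights exist, the assignment step uses the weighted distance rather than the plain one. A minimal sketch of that step; the name `weighted_assign` and the per-variable squared Euclidean distance are my assumptions, not necessarily the paper's exact choices:

```python
import numpy as np

def weighted_assign(X, centers, w, beta):
    """Assign each individual to the center minimizing the weighted
    distance sum_j w_j**beta * (x_ij - z_lj)**2."""
    sq = (X[:, None, :] - centers[None, :, :]) ** 2       # shape (n, k, m)
    dists = (sq * (w ** beta)[None, None, :]).sum(axis=2)  # weight each variable
    return dists.argmin(axis=1)
```

A variable with weight 0 is simply ignored by the assignment, which is how noisy dimensions stop distorting the clusters.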
3. Automated Variable Weighting
Computing weights - Theorem

$$\hat{w}_j = \begin{cases} 0 & \text{if } D_j = 0 \\[4pt] \dfrac{1}{\sum_{t=1}^{h} \left[ D_j / D_t \right]^{1/(\beta-1)}} & \text{if } D_j \neq 0 \end{cases}$$

with

$$D_j = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{i,l} \, d(x_{i,j}, z_{l,j})$$

where:
- $u_{i,l} = 1$ means that object $i$ is assigned to class $l$
- $d(x_{i,j}, z_{l,j})$ is the distance between objects $x$ and $z$ on variable $j$
- $h$ is the number of variables $D_j$ such that $D_j \neq 0$
- $z_{l,j}$ is the value of variable $j$ of the centroid of cluster $l$
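The theorem translates almost directly into code. A hedged sketch: `update_weights` is my own name, and I assume d is the squared Euclidean distance per variable, as is usual for k-means:

```python
import numpy as np

def update_weights(X, centers, labels, beta):
    """Weight update from the theorem: D_j sums the within-cluster
    distances along variable j; then w_j = 1 / sum_t (D_j/D_t)**(1/(beta-1))
    over the h variables with D_t != 0, and w_j = 0 when D_j == 0."""
    n, m = X.shape
    D = np.zeros(m)
    for l in range(len(centers)):
        pts = X[labels == l]
        D += ((pts - centers[l]) ** 2).sum(axis=0)  # per-variable dispersion
    w = np.zeros(m)
    nz = D > 0                                      # the h non-zero variables
    for j in np.where(nz)[0]:
        w[j] = 1.0 / ((D[j] / D[nz]) ** (1.0 / (beta - 1))).sum()
    return w
```

Note that with β = 2 this reduces to w_j ∝ 1/D_j: the noisier the variable, the smaller its weight.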
Idea: give a low weight to the variables whose values are, on average, far from the centroids.
3. Automated Variable Weighting
Computing weights

Function to be minimized:

$$P(U, Z, W) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} u_{i,l} \, w_j^{\beta} \, d(x_{i,j}, z_{l,j})$$

Constraint:

$$\sum_{j=1}^{m} w_j = 1, \qquad w_j \geq 0$$

→ Lagrange multipliers!
3. Automated Variable Weighting
Computing weights

The Lagrangian:

$$\Phi(W, \alpha) = \sum_{j=1}^{m} w_j^{\beta} D_j + \alpha \left( \sum_{j=1}^{m} w_j - 1 \right)$$

Setting the partial derivatives to zero, we get:

$$\frac{\partial \Phi}{\partial w_j} = \beta \, w_j^{\beta-1} D_j + \alpha = 0 \;\Rightarrow\; w_j = \left( \frac{-\alpha}{\beta D_j} \right)^{1/(\beta-1)}, \qquad \frac{\partial \Phi}{\partial \alpha} = \sum_{j=1}^{m} w_j - 1 = 0$$
3. Automated Variable Weighting
Computing weights
We can see (substituting $w_j$ into the constraint $\sum_j w_j = 1$):

$$\left( \frac{-\alpha}{\beta} \right)^{1/(\beta-1)} = \frac{1}{\sum_{t=1}^{h} D_t^{-1/(\beta-1)}}$$

Hence:

$$\hat{w}_j = \frac{D_j^{-1/(\beta-1)}}{\sum_{t=1}^{h} D_t^{-1/(\beta-1)}} = \frac{1}{\sum_{t=1}^{h} \left[ D_j / D_t \right]^{1/(\beta-1)}}$$

QED!
Moreover:
4. Experiments and results
Experiment 1: with a synthetic data set
5 variables, 300 individuals:
X1, X2, X3: data forming 3 clear classes
X4, X5: noise
As we know the 3 classes, we can compare the results obtained by the standard K-means algorithm with those of K-means with dynamic weighting.
To make this comparison, we use the Rand index and the Clustering Accuracy, which assess the performance of a classification against the desired classification.
4. Experiments and results
Rand index:

$$R = \frac{a + b}{\binom{N}{2}}$$

where $a$ is the number of pairs of points placed in the same class in both partitions, and $b$ the number of pairs placed in different classes in both partitions.

Clustering accuracy:

$$r = \frac{\sum_{i=1}^{k} a_i}{N}$$

- a_i is the number of points assigned to the correct class
- N is the total number of points
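Both indices are easy to compute directly. A small sketch; the brute-force cluster-to-class matching used in `clustering_accuracy` to find the a_i is my own choice, not necessarily the procedure used in the paper:

```python
from itertools import combinations, permutations

def rand_index(truth, pred):
    """Fraction of point pairs on which the two labelings agree:
    same cluster in both, or different clusters in both."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)

def clustering_accuracy(truth, pred):
    """sum_i a_i / N, with the cluster-to-class mapping chosen to maximize
    the number of correctly assigned points (brute force over permutations,
    fine for a handful of classes)."""
    classes = sorted(set(truth))
    clusters = sorted(set(pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(truth, pred)))
    return best / len(truth)
```

The two indices can disagree noticeably, which matters for the discussion in section 5.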
4. Experiments and results
Experiment:
1) Data generation
2) Random weights
3) Classical K-means algorithm
4) Weights computation
5) Back to step 3.
After 10 iterations, break the loop.
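Steps 2-5 can be sketched as a single loop alternating the classical K-means updates with the weight recomputation. A minimal illustration under the same assumptions as before (per-variable squared Euclidean distance; all names are mine):

```python
import numpy as np

def wkmeans(X, k, beta=2.0, n_loops=10, seed=0):
    """Sketch of the experiment loop: random weights, then K-means
    assignment/centroid steps alternated with the weight update,
    repeated n_loops times (the loop breaks after 10 iterations)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    w = rng.random(m)
    w /= w.sum()                                  # 2) random initial weights
    for _ in range(n_loops):
        # 3) assignment with the weighted distance, then centroid update
        sq = (X[:, None, :] - centers[None, :, :]) ** 2
        labels = (sq * (w ** beta)).sum(axis=2).argmin(axis=1)
        for l in range(k):
            pts = X[labels == l]
            if len(pts):
                centers[l] = pts.mean(axis=0)
        # 4) weight computation (w_j = 0 when D_j = 0)
        D = np.array([((X[labels == l] - centers[l]) ** 2).sum(axis=0)
                      for l in range(k)]).sum(axis=0)
        w = np.zeros(m)
        nz = D > 0
        for j in np.where(nz)[0]:
            w[j] = 1.0 / ((D[j] / D[nz]) ** (1.0 / (beta - 1))).sum()
    return labels, centers, w
```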
4. Experiments and results
Results: [Figure: classical K-means algorithm vs. K-means algorithm with dynamic weighting]
4. Experiments and results
Common issue: the classification depends on the initial centroids and weights.
4. Experiments and results
Common solution: run the algorithms several times and take the best result.
The weights converge similarly.
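The "run several times, take the best result" strategy amounts to restarting from different initializations and keeping the run with the lowest within-cluster cost. A sketch for plain K-means (the names and the inertia-based selection criterion are my choices):

```python
import numpy as np

def kmeans_once(X, k, seed, max_iter=100):
    """One k-means run from a random initialization; returns labels and
    the within-cluster cost (sum of squared distances to the centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new = centers.copy()
        for l in range(k):
            pts = X[labels == l]
            if len(pts):
                new[l] = pts.mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    cost = ((X - centers[labels]) ** 2).sum()
    return labels, cost

def best_of(X, k, n_restarts=10):
    """Run the algorithm several times and keep the lowest-cost result."""
    runs = [kmeans_once(X, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda r: r[1])
```

This is exactly what scikit-learn's `n_init` parameter does for its `KMeans` estimator.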
4. Experiments and results
Experiment 2: With two real-world datasets
Australian Credit Card data : 690 individuals, 5 quantitative variables, 8 qualitative variables.
Heart Diseases : 270 individuals, 6 quantitative variables, 9 qualitative variables.
Objectives: 1) Assess the impact of β, the parameter used in the weight-computation formula.
2) Compare the results with previous studies performed on these data sets.
4. Experiments and results
Results
Australian Credit Card data
+0.02 increase in prediction accuracy compared to previous studies!
4. Experiments and results
Erratum: 0.85 is also reached when β = 8
4. Experiments and results
Results
Heart Diseases
+0.02 increase in prediction accuracy compared to previous studies!
4. Experiments and results
Results after removal of the least significant variables
Australian Credit Card data
4. Experiments and results
Australian Credit Card data: +0.03 increase in prediction accuracy compared to previous studies!
Results after removal of the least significant variables
4. Experiments and results
Heart Diseases
Results after removal of the least significant variables
4. Experiments and results
+0.01 increase in prediction accuracy compared to previous studies!
Heart Diseases
Results after removal of the least significant variables
4. Experiments and results
Results
+0.01 increase in prediction accuracy compared to previous studies!
Heart Diseases
Erratum: the results here are worse than before the removal of the variables
Results after removal of the least significant variables
5. Limitations of the study
1) The choice of β seems to be made empirically.
The study finds that the classification results vary widely depending on the value of β, and that the best result beats the results obtained with other K-means algorithms.
This is an observation, but no interpretation is given.
5. Limitations of the study
Heart Diseases
Results after removing the low-weight variables
5. Limitations of the study
2) Algorithm complexity analysis?
The article states that the complexity is O(tmnk), where:
• k is the number of classes;
• m is the number of variables;
• n is the number of individuals;
• t is the number of iterations of the algorithm.
t should not be included in the O(·): once it is, the complexity of the algorithm can no longer be meaningfully assessed.
5. Limitations of the study
2) Algorithm complexity analysis?
Example with two sorting algorithms
Bubble sort is O(n²), where n is the number of items to sort. Using a t indicating the number of iterations, the complexity of bubble sort could also be denoted O(t) (since an iteration is O(1)).
Quicksort is O(n log n), where n is the number of items to sort. Using a t indicating the number of iterations, the complexity of quicksort could also be denoted O(t) (since an iteration is O(1)).
Conclusion: using a t indicating the number of iterations makes complexities incomparable, even if it is only used to evaluate the complexity of a single iteration.
5. Limitations of the study
3) What about the quality of the quality indices?
This study uses the Rand index and Clustering Accuracy.
We saw that the differences between the two indices are sometimes very large.
5. Limitations of the study
What about the other quality indices?
5. Limitations of the study
Conclusion
1) A very interesting study improving a classical clustering algorithm (K-means).
2) A dynamic-weighting approach that seems well founded and meets an existing need.
3) The results look promising but deserve to be explored further.