Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Post on 14-Apr-2022

Data Mining and Data Warehousing

Henryk Maciejewski

Data Mining – Clustering


Clustering Algorithms – Contents

• K-means

• Hierarchical algorithms

• Linkage functions

• Vector quantization

• SOM


Clustering – Formulation

[Figure: data table of Objects by Attributes; Model]

Find groups of similar points (observations) in multidimensional space

No target variable (unsupervised learning)


Methods of Clustering - Overview

• Variety of methods:
– Hierarchical clustering – create a hierarchy of clusters (one cluster entirely contained within another cluster)
– Non-hierarchical methods – create disjoint clusters
– Overlapping clusters (objects can belong to more than one cluster simultaneously)
– Fuzzy clusters (defined by the probability (grade) of membership of each object in each cluster)

• Useful data preprocessing prior to clustering:
– PCA (Principal Component Analysis) – to reduce the dimensionality of data
– Data standardization (transform data to reduce the large influence of variables with larger variance on the results of clustering)


Introductory Example

• 97 countries described by 3 attributes: Birth, Death, InfantDeath rate (given as number per 1000, data from year 1995)


Example – cont.

Analysis I

• Clustering raw data

• K-means algorithm

• Result: 3 clusters (no. of obs. in each cluster: 13, 32, 52)


Example – Profiles of Clusters


Example – Profiles of Clusters

• Notice: data clustered based on InfantDeath Rate only!


Example – Standardization of Data

Analysis II

• Data standardized prior to clustering (variables divided by their standard deviation)

• Result: 3 clusters (with 35, 46, 16 obs.)

• Data clustered based on InfantDeath and Death

• Observe that the variables with the largest variance have the largest influence on the results of clustering

[Figures: Analysis I, Analysis II]
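The effect described on this slide can be checked numerically. The sketch below uses made-up rates (not the 97-country data from the slides) and two hypothetical helpers, `scale_by_std` and `squared_distance_shares`, to show how much of the squared Euclidean distance between two observations comes from each variable before and after dividing by the standard deviation.

```python
import statistics

def scale_by_std(rows):
    """Divide every column by its sample standard deviation."""
    columns = list(zip(*rows))
    sds = [statistics.stdev(col) for col in columns]
    return [[v / s for v, s in zip(row, sds)] for row in rows]

def squared_distance_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each variable."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [s / total for s in squares]

# Hypothetical (Birth, Death, InfantDeath) rates per 1000; illustration only.
rates = [
    [14.0, 9.0, 8.0],
    [30.0, 8.0, 60.0],
    [45.0, 18.0, 120.0],
    [20.0, 7.0, 25.0],
]
scaled = scale_by_std(rates)

raw_share = squared_distance_shares(rates[0], rates[2])
scaled_share = squared_distance_shares(scaled[0], scaled[2])
print("raw shares:   ", [round(s, 3) for s in raw_share])
print("scaled shares:", [round(s, 3) for s in scaled_share])
```

On the raw values the third variable accounts for most of the distance, because its spread is the largest; after scaling the three variables contribute on a comparable footing.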


Example – Profiles of Clusters

• Analysis II: profiles of clusters


Methods of Clustering

• Non-hierarchical methods – K-means clustering
– Non-deterministic
– O(n), n – number of observations

• Hierarchical methods
– Agglomerative (join small clusters)
– Divisive (split big clusters)
– Deterministic methods
– O(n²) to O(n³), depending on the clustering method (i.e. the definition of inter-cluster distance)


Methods of Clustering - Remarks

• Clustering large datasets
– K-means
– If results of hierarchical clustering needed – first use K-means yielding e.g. 50 clusters, followed by hierarchical clustering on results of K-means

• Consensus clustering
– Discover real clusters in data – analyze stability of results with noise injected


K-means Algorithm

• K-means clustering
1. Select k points (centroids of initial clusters; select randomly)
2. Assign each observation to the nearest centroid (nearest cluster)
3. For each cluster find the new centroid
4. Repeat steps 2 and 3 until no change occurs in cluster assignments
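The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy points are assumptions made here, and Euclidean distance is assumed as on the later slides.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means following the four steps listed on the slide."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each observation to the nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # step 3: recompute the centroid of every cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an emptied cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Toy data: two well-separated groups (illustration only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the initial centroids are drawn at random, different seeds may label the clusters differently, which is the non-determinism noted earlier in the deck.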


K-means Algorithm

• Result: k separate clusters

• Algorithm requires that the correct number of clusters k is specified in advance (difficult problem: how to know the real number of clusters in data…)


Hierarchical Clustering

• Notation
– $x_i$ – observations, i=1..n
– $C_K$ – clusters
– G – current number of clusters
– $D_{KL}$ – distance between clusters $C_K$ and $C_L$

• Between-cluster distance $D_{KL}$ – linkage function (various definitions available, results of clustering depend on $D_{KL}$)

[Figure: clusters $C_K$ and $C_L$ and the distance $D_{KL}$]


Hierarchical Clustering

• Algorithm (agglomerative hierarchical clustering)
1. $C_k = \{x_k\}$, k=1..n, G=n
2. Find K, L such that $D_{KL} = \min D_{IJ}$, $1 \le I < J \le G$
3. Replace clusters $C_K$ and $C_L$ by cluster $C_K \cup C_L$, G=G-1
4. Repeat steps 2 and 3 while G>1

• Result: hierarchy of clusters (dendrogram)

[Figure: clusters $C_K$ and $C_L$ and the distance $D_{KL}$]
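A naive sketch of the agglomerative loop follows. The names `agglomerate` and `single_link` are illustrative; single linkage is used only as a default, and any function taking two clusters and returning a distance could be passed as the linkage. The brute-force search over all pairs is for clarity, not efficiency.

```python
import math

def single_link(a, b):
    """Example linkage: smallest pairwise distance between two clusters."""
    return min(math.dist(x, y) for x in a for y in b)

def agglomerate(points, linkage=single_link):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]           # step 1: every observation is a cluster
    merges = []
    while len(clusters) > 1:                   # step 4: repeat while G > 1
        # step 2: find the pair (K, L) with the smallest linkage distance
        d, k, l = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((sorted(clusters[k]), sorted(clusters[l]), d))
        # step 3: replace C_K and C_L by their union
        merged = clusters[k] + clusters[l]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (k, l)] + [merged]
    return merges

merges = agglomerate([(0.0,), (1.0,), (5.0,), (5.5,)])
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

The recorded merge distances are the heights at which branches would join in a dendrogram.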


Hierarchy of Clusters - Dendrogram


Definitions of Distance Between Clusters

• Different definitions of distance between clusters
– Average linkage
– Single linkage
– Complete linkage
– Density linkage
– Ward’s minimum variance method
– …

(SAS CLUSTER procedure accepts 11 different definitions of inter-cluster distance)


Average Linkage

• Notation
– $x_i$ – observations, i=1..n
– d(x,y) – distance between observations (Euclidean distance assumed from now on)
– $C_K$ – clusters
– $N_K$ – number of observations in cluster $C_K$
– $D_{KL}$ – distance between clusters $C_K$ and $C_L$
– $\mathrm{mean}_{C_K}$ – mean observation in cluster $C_K$
– $W_K = \sum_{x_i \in C_K} \| x_i - \mathrm{mean}_{C_K} \|^2$ – variance in cluster

• Average linkage
– Tends to join clusters with small variance
– Resulting clusters tend to have similar variance
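The slide's own formula for average linkage is not present in this transcript. A sketch under the usual textbook definition, in which $D_{KL}$ is the mean of d(x,y) over all pairs with x in $C_K$ and y in $C_L$, might look as follows; the function name and the sample clusters are illustrative.

```python
import math

def average_linkage(ck, cl):
    """Mean pairwise Euclidean distance between observations of two clusters."""
    total = sum(math.dist(x, y) for x in ck for y in cl)
    return total / (len(ck) * len(cl))

ck = [(0.0, 0.0), (0.0, 2.0)]
cl = [(3.0, 0.0), (3.0, 2.0)]
d_kl = average_linkage(ck, cl)
print(d_kl)
```

Some software averages squared distances instead, so results may differ from a given package unless the same convention is chosen.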


Complete Linkage

• Notation
– $x_i$ – observations, i=1..n
– d(x,y) – distance between observations
– $C_K$ – clusters
– $N_K$ – number of observations in cluster $C_K$
– $D_{KL}$ – distance between clusters $C_K$ and $C_L$
– $\mathrm{mean}_{C_K}$ – mean observation in cluster $C_K$
– $W_K = \sum_{x_i \in C_K} \| x_i - \mathrm{mean}_{C_K} \|^2$ – variance in cluster

• Complete linkage
– Resulting clusters tend to have similar diameter


Single Linkage

• Notation
– $x_i$ – observations, i=1..n
– d(x,y) – distance between observations
– $C_K$ – clusters
– $N_K$ – number of observations in cluster $C_K$
– $D_{KL}$ – distance between clusters $C_K$ and $C_L$
– $\mathrm{mean}_{C_K}$ – mean observation in cluster $C_K$
– $W_K = \sum_{x_i \in C_K} \| x_i - \mathrm{mean}_{C_K} \|^2$ – variance in cluster

• Single linkage
– Tends to produce elongated clusters, irregular in shape
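The formulas for single and complete linkage are also missing from the transcript. Assuming the standard definitions (single linkage takes the minimum pairwise distance between two clusters, complete linkage the maximum), the contrast between the two can be sketched on a small chain of points; all names and data are illustrative.

```python
import math

def single_linkage(ck, cl):
    """Smallest distance between any observation of C_K and any of C_L."""
    return min(math.dist(x, y) for x in ck for y in cl)

def complete_linkage(ck, cl):
    """Largest distance between any observation of C_K and any of C_L."""
    return max(math.dist(x, y) for x in ck for y in cl)

# A chain of points along a line, and one further point at its end.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
next_point = [(4.0, 0.0)]

d_single = single_linkage(chain, next_point)
d_complete = complete_linkage(chain, next_point)
print(d_single, d_complete)
```

Under single linkage the chain stays close to the next point however long it grows, which is one way to see why elongated clusters emerge; complete linkage penalizes the growing diameter instead.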


Ward’s Minimum Variance Method

• Notation
– $x_i$ – observations, i=1..n
– d(x,y) – distance between observations
– $C_K$ – clusters
– $N_K$ – number of observations in cluster $C_K$
– $D_{KL}$ – distance between clusters $C_K$ and $C_L$
– $\mathrm{mean}_{C_K}$ – mean observation in cluster $C_K$
– $W_K = \sum_{x_i \in C_K} \| x_i - \mathrm{mean}_{C_K} \|^2$ – variance in cluster
– $B_{KL} = W_M - W_K - W_L$ where $C_M = C_K \cup C_L$

• Ward’s minimum variance method
– Tends to join small clusters
– Tends to produce clusters with similar number of observations
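The quantity $B_{KL}$ defined in the notation can be computed directly from its definition. The sketch below assumes that Ward's method uses $B_{KL}$ as the between-cluster distance (the slide's formula for $D_{KL}$ is not in the transcript); the helper names and the sample clusters are illustrative.

```python
def cluster_mean(cluster):
    """Mean observation of a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def within_ss(cluster):
    """W_K: sum of squared distances of observations to the cluster mean."""
    m = cluster_mean(cluster)
    return sum(sum((v - mv) ** 2 for v, mv in zip(x, m)) for x in cluster)

def ward_distance(ck, cl):
    """B_KL = W_M - W_K - W_L, where C_M is the union of C_K and C_L."""
    return within_ss(ck + cl) - within_ss(ck) - within_ss(cl)

ck = [(0.0, 0.0), (2.0, 0.0)]
cl = [(5.0, 0.0)]
b_kl = ward_distance(ck, cl)
print(b_kl)
```

$B_{KL}$ measures how much the within-cluster sum of squares grows when the two clusters are merged, so merging two small clusters is cheap compared with merging two large ones at the same separation.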


Density Linkage

• Notation
– $x_i$ – observations, i=1..n
– d(x,y) – distance between observations
– r – a fixed constant
– f(x) – proportion of observations within the sphere centered at x with radius r, divided by the volume of the sphere (measure of density of points near observation x)

• Density linkage
– Single linkage is performed using the distance measure d*
– Capable of discovering clusters of irregular shape
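The definition of d* did not survive in this transcript. The sketch below assumes the uniform-kernel form commonly documented for density linkage: d*(x,y) is the average of 1/f(x) and 1/f(y) when d(x,y) ≤ r, and infinite otherwise. It also assumes 2D data (so the "sphere" is a disc of area πr²) and that an observation counts itself when f is estimated. The function names and data are illustrative.

```python
import math

def density(x, points, r):
    """f(x): proportion of observations within radius r of x, per unit area (2D)."""
    inside = sum(1 for p in points if math.dist(x, p) <= r)
    return (inside / len(points)) / (math.pi * r ** 2)

def d_star(x, y, points, r):
    """Assumed density-based dissimilarity: small in dense regions, infinite beyond r."""
    if math.dist(x, y) > r:
        return math.inf
    return 0.5 * (1.0 / density(x, points, r) + 1.0 / density(y, points, r))

points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (6.0, 0.0)]
r = 1.0
near = d_star(points[0], points[1], points, r)
far = d_star(points[2], points[3], points, r)
print(near, far)
```

Feeding d* into a single-linkage routine joins observations through dense regions only, which is consistent with the slide's remark about clusters of irregular shape.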

Example – Average Linkage
Elongated clusters in data

Example – K-means
Elongated clusters in data

Example – Density Linkage
Elongated clusters in data

Example – K-means
Nonconvex clusters in data

Example – Centroid Linkage
Nonconvex clusters in data

Example – Density Linkage
Nonconvex clusters in data

Example – True Clusters
Clusters of unequal size

Example – K-means
Clusters of unequal size

Example – Ward’s Method
Clusters of unequal size

Example – Average Linkage
Method: average linkage

Example – Centroid Linkage
Clusters of unequal size

Example – Single Linkage
Clusters of unequal size

Example – Well Separated Data
Any method will work

Example – Poorly Separated Data
True clusters

Example – Poorly Separated Data
Method: K-means

Example – Poorly Separated Data
Ward’s method

Clustering Methods – Final Remarks

• Standardization of variables prior to clustering
– Often necessary, otherwise variables with large variance tend to have large influence on clustering
– Often the standardized measurement $z_{ij}$ is computed as the z-score:

$$z_{ij} = \frac{x_{ij} - \mu_j}{s_j}$$

where $x_{ij}$ – original measurement in observation i and variable j, $\mu_j$ – mean value of variable j, $s_j$ – mean absolute deviation of variable j (or its standard deviation)

– Other ideas: divide variable by its range, max value or standard deviation
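The z-score with either choice of the scale $s_j$ can be sketched for a single variable as below. The function name `z_scores`, the `use_mad` flag and the sample values are assumptions made for illustration.

```python
import statistics

def z_scores(values, use_mad=False):
    """Center by the mean, then divide by the standard deviation
    or, if use_mad is set, by the mean absolute deviation."""
    mu = statistics.fmean(values)
    if use_mad:
        s = statistics.fmean([abs(v - mu) for v in values])
    else:
        s = statistics.stdev(values)
    return [(v - mu) / s for v in values]

infant_death = [8.0, 25.0, 60.0, 120.0]   # hypothetical values, not the slide data
z_sd = z_scores(infant_death)
z_mad = z_scores(infant_death, use_mad=True)
print([round(z, 3) for z in z_sd])
print([round(z, 3) for z in z_mad])
```

The mean absolute deviation is less sensitive to extreme values than the standard deviation, which may matter when a variable contains outliers.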


Clustering Methods – Final Remarks

• The number of clusters
– No satisfactory theory to determine the right number of clusters in data
– Various criteria can be observed to help determine the right number of clusters, e.g. criteria based on variance accounted for by clusters
  • $R^2 = 1 - P_G / T$
  • or semipartial $R^2 = B_{KL} / T$
  where T – total variance of observations; $P_G = \sum W_K$ over G clusters; $B_{KL} = W_M - W_K - W_L$ where $C_M = C_K \cup C_L$
– Cubic Clustering Criterion (CCC)
– Often data visualization useful for determining the number of clusters
  • Scatterplot for 2-3 dimensional data
  • In high dimensions apply PCA transformation (or similar) and visualize data in the 2-3 dimensional space of the first principal components
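Both variance-based criteria follow directly from the within-cluster sums of squares. The sketch below assumes T is computed as the total sum of squared deviations from the overall mean, in the same units as $W_K$; function names and the toy partition are illustrative.

```python
def within_ss(cluster):
    """Sum of squared distances of observations to the cluster mean (W_K)."""
    mean = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum(sum((v - m) ** 2 for v, m in zip(x, mean)) for x in cluster)

def r_squared(clusters):
    """R^2 = 1 - P_G / T for a given partition into G clusters."""
    everything = [x for c in clusters for x in c]
    t = within_ss(everything)
    p_g = sum(within_ss(c) for c in clusters)
    return 1.0 - p_g / t

def semipartial_r_squared(ck, cl, everything):
    """B_KL / T: drop in R^2 caused by merging clusters C_K and C_L."""
    b_kl = within_ss(ck + cl) - within_ss(ck) - within_ss(cl)
    return b_kl / within_ss(everything)

clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
everything = [x for c in clusters for x in c]
r2 = r_squared(clusters)
sp_r2 = semipartial_r_squared(clusters[0], clusters[1], everything)
print(r2, sp_r2)
```

A large semipartial R² at a merge step signals that two dissimilar clusters were joined, which suggests stopping just before that step.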


Example – R², Semi-partial R²


Example – Number of Clusters – Useful Checks

• PST2: 3 or 6 or 9 (one before peak in value)
• PSF: 9 (peak in value)
• CCC: 18 (CCC around 3)


Kohonen VQ (Vector Quantization)

• Algorithm similar to k-means

• Idea of VQ algorithm:
1. Select k points (initial cluster centroids)
2. For observation $x_i$ find nearest centroid (winning seed) – denoted by $S_n$
3. Modify $S_n$ according to the formula:

$$S_n = S_n (1 - L) + x_i L$$

where L – learning constant (decreasing during learning process)

4. Repeat steps 2 and 3 over all training observations
5. Repeat steps 2-4 for a given number of iterations
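A sketch of the five steps is given below. It assumes the seed update $S_n \leftarrow S_n (1 - L) + x_i L$, which is the SOM update from the later slide restricted to the winning seed only. The function name, the initial seeds and the schedule of learning constants are illustrative choices.

```python
import math

def vq_train(data, seeds, learning_rates):
    """Online vector quantization: only the winning seed moves toward each observation."""
    seeds = [list(s) for s in seeds]           # step 1: initial centroids (copied)
    for rate in learning_rates:                # step 5: one pass per learning constant
        for x in data:                         # step 4: loop over all observations
            # step 2: index n of the nearest (winning) seed
            n = min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
            # step 3: S_n <- S_n * (1 - L) + x * L
            seeds[n] = [s * (1 - rate) + xi * rate for s, xi in zip(seeds[n], x)]
    return seeds

data = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
seeds = vq_train(data, seeds=[(1.0, 1.0), (4.0, 4.0)], learning_rates=[0.5, 0.25, 0.1])
print(seeds)
```

Unlike k-means, which recomputes centroids after a full assignment pass, this update moves a seed immediately after each observation is seen.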


VQ – MacQueen Method

• For L=const the VQ algorithm does not converge

• MacQueen method:

Learning constant L is the reciprocal of the number of observations $N_n$ in the cluster associated with the winning seed $S_n$ ($L = 1/N_n$)

• This algorithm converges
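With L = 1/N_n the winning seed becomes the running mean of the observations assigned to it, which gives some intuition for the convergence claim. The sketch below assumes that $N_n$ counts the observations assigned so far, including the current one, and that the update has the form $S_n \leftarrow S_n (1 - L) + x_i L$; the function name is illustrative.

```python
def macqueen_update(seed, count, x):
    """One update of the winning seed with learning constant L = 1 / N_n."""
    count += 1                       # N_n now includes the current observation
    rate = 1.0 / count
    new_seed = [s * (1 - rate) + xi * rate for s, xi in zip(seed, x)]
    return new_seed, count

seed, count = [0.0], 0
for x in ([2.0], [4.0], [9.0]):
    seed, count = macqueen_update(seed, count, x)
print(seed, count)
```

After the three updates the seed equals the mean of the three observations, regardless of its starting value, because the first update uses L = 1.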


Kohonen SOM (Self Organizing Maps)

1. Select k initial points (cluster centroids), represent them on a 2D map

2. For observation $x_i$ find winning seed $S_n$

3. Modify all centroids: $S_j = S_j (1 - K(j,n) L) + x_i K(j,n) L$, where
L – learning constant (decreasing during training)
K(j,n) – function decreasing with increasing distance on the 2D map between centroids $S_j$ and $S_n$ (K(j,j)=1)

4. Repeat steps 2 and 3 over all training observations
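One update step of the map can be sketched as follows. The slide only requires K(j,n) to decrease with map distance and to equal 1 for j = n; the Gaussian form used here, the `radius` parameter, and the one-row grid are assumptions made for illustration.

```python
import math

def som_step(seeds, grid, x, rate, radius=1.0):
    """One SOM update: every seed moves toward x, weighted by its map distance to the winner."""
    # step 2: winning seed = nearest centroid in data space
    n = min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
    updated = []
    for j, seed in enumerate(seeds):
        # K(j, n): assumed Gaussian neighbourhood on the 2D map, K(n, n) = 1
        k = math.exp(-math.dist(grid[j], grid[n]) ** 2 / (2 * radius ** 2))
        # step 3: S_j <- S_j * (1 - K*L) + x * K*L
        updated.append([s * (1 - k * rate) + xi * k * rate for s, xi in zip(seed, x)])
    return updated, n

seeds = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # centroids in data space
grid = [(0, 0), (1, 0), (2, 0)]                # their positions on the 2D map
updated, winner = som_step(seeds, grid, x=(0.0, 0.4), rate=0.5)
print(winner, updated)
```

Seeds that are neighbours of the winner on the map are pulled along with it, which is what makes nearby map units end up representing similar observations.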


Example

• SOM-based clustering of wine data (R language, dataset wines, package kohonen)


• R system implementation of the SOM algorithm: function som() (package kohonen)

• Results: structure wine.som

important members:

wine.som$codes # codebook vectors

wine.som$unit.classif # winning units for all data points


• Codebook vectors represent clusters created at each 2D grid element (attributes of codebook vectors are mean values of the respective attributes of cluster elements)


• Results: assignment of observations (individual wines) to 2D grid

• Grouping seeds (codebook vectors) – e.g. with hierarchical clustering (hclust function):

Example – SOM in R

[Figures: three slides]