Data Mining and Data Warehousing

Henryk Maciejewski

Data Mining – Clustering

Clustering Algorithms – Contents

• K-means

• Hierarchical algorithms

• Linkage functions

• Vector quantization

• SOM

Clustering – Formulation

[Figure: data table with objects as rows and attributes as columns, used as input to a model]

Find groups of similar points (observations) in multidimensional space

No target variable (unsupervised learning)

Methods of Clustering - Overview

• Variety of methods:

– Hierarchical clustering – create a hierarchy of clusters (one cluster entirely contained within another cluster)

– Non-hierarchical methods – create disjoint clusters

– Overlapping clusters (objects can belong to more than one cluster simultaneously)

– Fuzzy clusters (defined by the probability (grade) of membership of each object in each cluster)

• Useful data preprocessing prior to clustering:

– PCA (Principal Components Analysis) – to reduce the dimensionality of the data

– Data standardization (transform the data to reduce the large influence of variables with larger variance on the results of clustering)

Introductory Example

• 97 countries described by 3 attributes: Birth, Death, InfantDeath rate (given as number per 1000, data from year 1995)

Example – continued

Analysis I

• Clustering raw data

• K-means algorithm

• Result: 3 clusters (no. of obs. in each cluster: 13, 32, 52)

Example – Profiles of Clusters


• Notice: data clustered based on InfantDeath Rate only!

Example – Standardization of Data

Analysis II

• Data standardized prior to clustering (variables divided by their standard deviation)

• Result: 3 clusters (with 35, 46, 16 obs.)

• Data clustered based on InfantDeath and Death

• Observe that the variables with the largest variance have the largest influence on the results of clustering

[Figure panels: Analysis I, Analysis II]

• Analysis II: profiles of clusters
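The effect of dividing each variable by its standard deviation can be illustrated with a minimal Python sketch. The numbers below are made up for illustration (they are not the 1995 country data), and the helper names `scale_by_std` and `squared_distance_share` are hypothetical. The sketch reports how much of the squared Euclidean distance between two observations comes from each variable, before and after scaling.

```python
import statistics

# Hypothetical toy data (NOT the 1995 country data): rows are observations,
# columns are three rates measured on very different scales.
data = [
    [40.0, 10.0, 120.0],
    [15.0, 9.0, 10.0],
    [30.0, 12.0, 70.0],
    [12.0, 11.0, 8.0],
]

def scale_by_std(rows):
    """Divide every column by its sample standard deviation."""
    columns = list(zip(*rows))
    stds = [statistics.stdev(col) for col in columns]
    return [[value / s for value, s in zip(row, stds)] for row in rows]

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each variable."""
    parts = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(parts)
    return [p / total for p in parts]

# Share of each variable in the distance between the first two observations.
raw_share = squared_distance_share(data[0], data[1])
scaled = scale_by_std(data)
scaled_share = squared_distance_share(scaled[0], scaled[1])
```

On the raw toy data the third variable accounts for almost all of the distance. After scaling, the contributions are much more balanced, which mirrors the difference between Analysis I and Analysis II.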


Methods of Clustering

• Non-hierarchical methods – K-means clustering

– Non-deterministic (the result depends on the random initial centroids)

– O(n) per iteration, n – number of observations

• Hierarchical methods

– Agglomerative (join small clusters)

– Divisive (split big clusters)

– Deterministic methods

– O(n^2) to O(n^3), depending on the clustering method (i.e. the definition of inter-cluster distance)

Methods of Clustering - Remarks

• Clustering large datasets

– K-means

– If the results of hierarchical clustering are needed: first use K-means to obtain e.g. 50 clusters, then run hierarchical clustering on the K-means results

• Consensus clustering

– Discover real clusters in the data: analyze the stability of the results with noise injected

K-means Algorithm

• K-means clustering

1. Select k points (centroids of the initial clusters; select randomly)

2. Assign each observation to the nearest centroid (nearest cluster)

3. For each cluster, find the new centroid

4. Repeat steps 2 and 3 until no change occurs in the cluster assignments
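The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to keep the old centroid when a cluster becomes empty are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # assumption: keep old centroid if empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Because the initial centroids are drawn at random, different values of `seed` may give different partitions on data that is less clearly separated.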

K-means Algorithm

• Result: k separate clusters

• Algorithm requires that the correct number of clusters k is specified in advance (difficult problem: how to know the real number of clusters in data…)

Hierarchical Clustering

• Notation

– $x_i$ – observations, $i = 1..n$

– $C_k$ – clusters

– $G$ – current number of clusters

– $D_{KL}$ – distance between clusters $C_K$ and $C_L$

• Between-cluster distance $D_{KL}$ – linkage function (various definitions are available; the results of clustering depend on $D_{KL}$)

[Figure: two clusters $C_K$ and $C_L$ with the distance $D_{KL}$ between them]

Hierarchical Clustering

• Algorithm (agglomerative hierarchical clustering)

1. $C_k = \{x_k\}$, $k = 1..n$, $G = n$

2. Find $K$, $L$ such that $D_{KL} = \min D_{IJ}$, $1 \le I, J \le G$, $I \ne J$

3. Replace clusters $C_K$ and $C_L$ by the cluster $C_K \cup C_L$, $G = G - 1$

4. Repeat steps 2 and 3 while $G > 1$

• Result: hierarchy of clusters (dendrogram)
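A naive Python sketch of the agglomerative loop is given below. It recomputes all pairwise cluster distances at every step, so it is meant only to show the logic, not to be efficient. The names `agglomerate` and `single_linkage` are hypothetical, and single linkage is used as the default only because it is the shortest linkage function to write; any function taking two clusters and returning a distance can be passed instead.

```python
import math

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(x, y) for x in a for y in b)

def agglomerate(points, linkage=single_linkage):
    """Return the merge history [(cluster_K, cluster_L, D_KL), ...]."""
    clusters = [[p] for p in points]          # step 1: one cluster per observation
    merges = []
    while len(clusters) > 1:                  # step 4: repeat while G > 1
        # Step 2: find the pair (K, L) with the smallest between-cluster distance.
        k, l = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        d = linkage(clusters[k], clusters[l])
        merges.append((clusters[k], clusters[l], d))
        # Step 3: replace C_K and C_L by their union, so G decreases by 1.
        merged = clusters[k] + clusters[l]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (k, l)] + [merged]
    return merges
```

The returned merge history, read from first to last entry, is the information a dendrogram displays: which clusters were joined and at what distance.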


Hierarchy of Clusters - Dendrogram

Definitions of Distance Between Clusters

• Different definitions of distance between clusters

– Average linkage

– Single linkage

– Complete linkage

– Density linkage

– Ward’s minimum variance method

– …

(SAS CLUSTER procedure accepts 11 different definitions of inter-cluster distance)

Average Linkage

• Notation

– $x_i$ – observations, $i = 1..n$

– $d(x, y)$ – distance between observations (Euclidean distance assumed from now on)

– $C_k$ – clusters

– $N_K$ – number of observations in cluster $C_K$

– $D_{KL}$ – distance between clusters $C_K$ and $C_L$

– $\mathrm{mean}_{C_K}$ – mean observation in cluster $C_K$

– $W_K = \sum_{x_i \in C_K} \lVert x_i - \mathrm{mean}_{C_K} \rVert^2$ – variance in cluster $C_K$

• Average linkage

– Tends to join clusters with small variance

– Resulting clusters tend to have similar variance
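The formula for this linkage did not survive on the slide. The usual definition of average linkage, written in the notation above, is the mean distance over all pairs of observations taken one from each cluster. It is given here as an assumption about what the slide showed; some implementations use squared distances instead.

```latex
D_{KL} = \frac{1}{N_K N_L} \sum_{x_i \in C_K} \sum_{x_j \in C_L} d(x_i, x_j)
```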

Complete Linkage

• Notation

– $d(x, y)$ – distance between observations

– $C_k$ – clusters

• Complete linkage

– Resulting clusters tend to have similar diameter
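The formula is missing from the extracted slide. Assuming the standard definition, complete linkage takes the distance between the two most distant observations, one from each cluster:

```latex
D_{KL} = \max_{x_i \in C_K} \; \max_{x_j \in C_L} d(x_i, x_j)
```

Since a merge is only cheap when every pair of points is close, this limits the diameter of the merged cluster.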

Single Linkage

• Notation

– $C_k$ – clusters

• Single linkage

– Tends to produce elongated clusters, irregular in shape
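The formula is missing from the extracted slide. Assuming the standard definition, single linkage takes the distance between the two closest observations, one from each cluster:

```latex
D_{KL} = \min_{x_i \in C_K} \; \min_{x_j \in C_L} d(x_i, x_j)
```

One close pair is enough to join two clusters, so clusters can grow as chains of neighbouring points.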

Ward’s Minimum Variance Method

• Notation

– $C_k$ – clusters

– $B_{KL} = W_M - W_K - W_L$, where $C_M = C_K \cup C_L$

• Ward’s minimum variance method

– Tends to join small clusters

– Tends to produce clusters with similar number of observations
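In Ward's method the between-cluster distance is usually taken to be B_KL itself, the increase in within-cluster sum of squares caused by the merge. The sketch below, with hypothetical helper names, computes B_KL directly from the definition and compares it with the well-known equivalent form ||mean_K - mean_L||^2 / (1/N_K + 1/N_L). W_K is treated here as a sum of squared deviations from the cluster mean.

```python
def mean(cluster):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def within_ss(cluster):
    """W_K: sum of squared Euclidean distances from the cluster mean."""
    m = mean(cluster)
    return sum(sum((x - c) ** 2 for x, c in zip(p, m)) for p in cluster)

def ward_distance(ck, cl):
    """B_KL = W_M - W_K - W_L for the merged cluster C_M (union of C_K and C_L)."""
    return within_ss(ck + cl) - within_ss(ck) - within_ss(cl)

def ward_closed_form(ck, cl):
    """Equivalent form: ||mean_K - mean_L||^2 / (1/N_K + 1/N_L)."""
    mk, ml = mean(ck), mean(cl)
    gap = sum((a - b) ** 2 for a, b in zip(mk, ml))
    return gap / (1 / len(ck) + 1 / len(cl))
```

The denominator 1/N_K + 1/N_L is large when either cluster is small, which makes merges involving small clusters cheap. This is one way to see why the method tends to join small clusters.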

Density Linkage

• Notation

– $r$ – a fixed constant

– $f(x)$ – proportion of observations within the sphere centered at $x$ with radius $r$, divided by the volume of the sphere (a measure of the density of points near observation $x$)

• Density linkage

– Single linkage is applied using the dissimilarity measure $d^*$

– Capable of discovering clusters of irregular shape
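The definition of d* is missing from the extracted slide. One common definition, the uniform-kernel form described for density linkage in the SAS CLUSTER procedure, is shown below as an assumption about what the slide contained:

```latex
d^*(x_i, x_j) =
\begin{cases}
\frac{1}{2}\left(\frac{1}{f(x_i)} + \frac{1}{f(x_j)}\right) & \text{if } d(x_i, x_j) \le r \\
\infty & \text{otherwise}
\end{cases}
```

Under this definition two observations are close when they are within radius r of each other and both lie in dense regions.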

Example – Average Linkage: Elongated clusters in data

Example – K-means: Elongated clusters in data

Example – Density Linkage: Elongated clusters in data

Example – K-means: Nonconvex clusters in data

Example – Centroid Linkage: Nonconvex clusters in data

Example – Density Linkage: Nonconvex clusters in data

Example – True Clusters: Clusters of unequal size

Example – K-means: Clusters of unequal size

Example – Ward’s Method: Clusters of unequal size

Example – Average Linkage (method: average linkage)

Example – Centroid Linkage: Clusters of unequal size

Example – Single Linkage: Clusters of unequal size

Example – Well Separated Data: Any method will work

Example – Poorly Separated Data: True clusters

Example – Poorly Separated Data: K-means

Example – Poorly Separated Data: Ward’s method

Clustering Methods – Final Remarks

• Standardization of variables prior to clustering

– Often necessary, otherwise variables with large variance tend to have a large influence on clustering

– Often the standardized measurement $z_{ij}$ is computed as the z-score:

$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$

where $x_{ij}$ – original measurement in observation $i$ and variable $j$, $\bar{x}_j$ – mean value of variable $j$, $s_j$ – mean absolute deviation of variable $j$ (or its standard deviation)

– Other ideas: divide the variable by its range, max value or standard deviation

Clustering Methods – Final Remarks

• The number of clusters

– No satisfactory theory to determine the right number of clusters in data

– Various criteria can be observed to help determine the right number of clusters, e.g. criteria based on the variance accounted for by the clusters

• $R^2 = 1 - P_G / T$

• or semipartial $R^2 = B_{KL} / T$

where $T$ – total variance of observations; $P_G = \sum W_K$ over the $G$ clusters; $B_{KL} = W_M - W_K - W_L$, where $C_M = C_K \cup C_L$

– Cubic Clustering Criterion (CCC)

– Often data visualization is useful for determining the number of clusters

• Scatterplot for 2-3 dimensional data

• In high dimensions, apply the PCA transformation (or similar) to visualize the data in the 2-3 dimensional space of the first principal components
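The two variance-based criteria can be computed with a few lines of Python. This is a sketch under one assumption: T, P_G and W_K are taken to be sums of squared Euclidean deviations from the relevant mean (no division by the number of observations). The function names are hypothetical.

```python
def sum_sq(points):
    """Sum of squared Euclidean distances from the mean of `points`."""
    m = [sum(col) / len(points) for col in zip(*points)]
    return sum(sum((x - c) ** 2 for x, c in zip(p, m)) for p in points)

def r_squared(clusters):
    """R^2 = 1 - P_G / T for a partition given as a list of clusters."""
    everything = [p for c in clusters for p in c]
    total = sum_sq(everything)                  # T
    pooled = sum(sum_sq(c) for c in clusters)   # P_G: sum of W_K over the clusters
    return 1 - pooled / total

def semipartial_r_squared(ck, cl, everything):
    """B_KL / T: the drop in R^2 caused by merging C_K and C_L."""
    b_kl = sum_sq(ck + cl) - sum_sq(ck) - sum_sq(cl)
    return b_kl / sum_sq(everything)
```

Tracking R^2 while the number of clusters decreases, and looking for the merge where the semipartial R^2 jumps, gives one practical hint about where to cut the hierarchy.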

Example – $R^2$, Semi-partial $R^2$

• PST2: 3 or 6 or 9 (one before peak in value)

• PSF: 9 (peak in value)

• CCC: 18 (CCC around 3)

Example – Number of Clusters – Useful Checks

Kohonen VQ (Vector Quantization)

• Algorithm similar to k-means

• Idea of the VQ algorithm:

1. Select k points (initial cluster centroids)

2. For observation $x_i$, find the nearest centroid (winning seed), denoted by $S_n$

3. Modify $S_n$ according to the formula:

$S_n = S_n (1 - L) + x_i L$

where $L$ – learning constant (decreasing during the learning process)

4. Repeat steps 2 and 3 over all training observations

5. Repeat steps 2-4 for a given number of iterations
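A minimal Python sketch of these steps follows. The update rule used in step 3 is the special case of the SOM update given later in the slides with K = 1 for the winning seed, i.e. S_n = S_n(1 - L) + x_i L. The geometric decay of L and the parameter names (`learning_rate`, `decay`, `iterations`) are assumptions chosen for the example, since the slides only say that L decreases.

```python
import math

def vq_train(data, seeds, learning_rate=0.5, decay=0.9, iterations=20):
    """Kohonen VQ sketch: move only the winning seed toward each observation."""
    seeds = [list(s) for s in seeds]           # step 1: initial centroids (copied)
    rate = learning_rate
    for _ in range(iterations):                # step 5: fixed number of passes
        for x in data:                         # step 4: loop over all observations
            # Step 2: the winning seed is the centroid nearest to x.
            n = min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
            # Step 3: S_n <- S_n * (1 - L) + x * L
            seeds[n] = [s * (1 - rate) + xi * rate for s, xi in zip(seeds[n], x)]
        rate *= decay                          # assumed schedule: geometric decay of L
    return seeds
```

Unlike k-means, the centroids are updated after every single observation rather than once per pass over the data.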


VQ – MacQueen Method

• For $L = \text{const}$ the VQ algorithm does not converge

• MacQueen method: the learning constant $L$ is the reciprocal of the number of observations $N_n$ in the cluster associated with the winning seed $S_n$, i.e. $L = 1 / N_n$

• This algorithm converges

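With L = 1/N_n the update turns each seed into a running mean of the observations it has won. The sketch below shows one pass over the data. The counting convention is an assumption: N_n starts at zero and includes the current observation, so the initial seed position is discarded on the first hit. Other conventions (for example counting the initial seed as one observation) are possible.

```python
import math

def macqueen_vq(data, seeds):
    """One pass of VQ with L = 1 / N_n, where N_n counts the wins of seed n."""
    seeds = [list(s) for s in seeds]
    counts = [0] * len(seeds)
    for x in data:
        # Winning seed: the centroid nearest to x.
        n = min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
        counts[n] += 1
        rate = 1.0 / counts[n]                 # L = 1 / N_n
        # S_n <- S_n * (1 - L) + x * L, i.e. the running mean of the won observations
        seeds[n] = [s * (1 - rate) + xi * rate for s, xi in zip(seeds[n], x)]
    return seeds, counts
```

Because the step size shrinks as a cluster collects more observations, later observations move the seed less and less, which is the intuition behind the convergence claim.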

Kohonen SOM (Self Organizing Maps)

1. Select k initial points (cluster centroids), represent them on a 2D map

2. For observation $x_i$, find the winning seed $S_n$

3. Modify all centroids: $S_j = S_j (1 - K(j,n) L) + x_i K(j,n) L$, where

$L$ – learning constant (decreasing during training)

$K(j,n)$ – function decreasing with increasing distance on the 2D map between the centroids $S_j$ and $S_n$ ($K(j,j) = 1$)

4. Repeat steps 2 and 3 over all training observations
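The steps above can be sketched in Python as follows. The slides do not specify the neighbourhood function K, so a Gaussian of the map distance is assumed here (it equals 1 for the winning seed and decreases with distance). The `radius`, `decay` and `iterations` parameters and the geometric decay of L are likewise assumptions for illustration.

```python
import math

def som_train(data, grid, seeds, learning_rate=0.5, decay=0.9, radius=1.0, iterations=20):
    """SOM sketch. grid[j] is the 2D map position of seed j."""
    seeds = [list(s) for s in seeds]           # step 1: initial centroids (copied)
    rate = learning_rate
    for _ in range(iterations):
        for x in data:                         # step 4: loop over all observations
            # Step 2: winning seed = centroid nearest to x in data space.
            n = min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
            for j in range(len(seeds)):
                # K(j, n): assumed Gaussian in map distance, so K(n, n) = 1.
                k = math.exp(-math.dist(grid[j], grid[n]) ** 2 / (2 * radius ** 2))
                # Step 3: S_j <- S_j * (1 - K L) + x * K L
                seeds[j] = [s * (1 - k * rate) + xi * k * rate for s, xi in zip(seeds[j], x)]
        rate *= decay                          # assumed schedule: geometric decay of L
    return seeds
```

The only difference from VQ is that the neighbours of the winning seed on the 2D map are also pulled toward the observation, which is what makes nearby map cells end up representing similar data.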


Example

• SOM-based clustering of wine data (R language, dataset wines, package kohonen)


• R system implementation of the SOM algorithm: function som() (package kohonen)

• Results: structure wine.som

important members:

wine.som$codes # codebook vectors

wine.som$unit.classif # winning units for all data points


• Codebook vectors represent the clusters created at each 2D grid element (attributes of codebook vectors are mean values of the respective attributes of the cluster elements)


• Results: assignment of observations (individual wines) to 2D grid

• Grouping seeds (codebook vectors) – e.g. with hierarchical clustering (hclust function):


Example – SOM in R
