Post on 21-Dec-2015

Page 1

Data mining and statistical learning - lecture 14

Clustering methods

Partitional clustering in which clusters are represented by their centroids (proc FASTCLUS)

Agglomerative hierarchical clustering in which the closest clusters are repeatedly merged (proc CLUSTER)

Density-based clustering in which core points and associated border points are clustered (proc MODECLUS)

Page 2

Proc FASTCLUS

Select k initial centroids

Repeat the following until the clusters remain unchanged:

Form k clusters by assigning each point to its nearest centroid

Update the centroid of each cluster
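The steps above can be sketched in Python with NumPy. This is a minimal illustration of plain k-means, not a reproduction of proc FASTCLUS: the random choice of starting centroids, the function name `k_means` and the toy data are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Select k initial centroids: here k distinct data points chosen at random
    # (an assumption; proc FASTCLUS has its own seed-selection rules).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Form k clusters by assigning each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the clusters remain unchanged.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update the centroid of each cluster (an empty cluster keeps its old centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two well-separated groups in the plane.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(toy, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other, whichever cluster happens to be numbered 0.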

Page 3

Identification of water samples with incorrect total nitrogen levels

[Figure: scatter plot of total nitrogen (Kjeldahl), mg/l, against total nitrogen (persulfate), mg/l]

Page 4

Identification of water samples with incorrect total nitrogen levels

- 2-means clustering

[Figure: scatter plot of total nitrogen (Kjeldahl) against total nitrogen (persulfate digestion); legend: Cluster 1, Cluster 2]

Initialization problems?
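K-means only converges to a local optimum, so the result can depend on the starting centroids. A common safeguard, sketched below under assumed synthetic data, is to run the algorithm from several random starts and keep the run with the smallest sum of squared errors (SSE). The helper `run_k_means` and all parameter values are illustrative, not part of the lecture.

```python
import numpy as np

def run_k_means(points, init_centroids, max_iter=100):
    """k-means from the given starting centroids; returns (labels, sse)."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(len(centroids)):
            if np.any(labels == j):
                updated[j] = points[labels == j].mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment and sum of squared distances to the assigned centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = float((dists.min(axis=1) ** 2).sum())
    return labels, sse

rng = np.random.default_rng(0)
# Synthetic data (illustrative only): three groups of 30 points each.
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])

sse_per_start = []
for _ in range(10):
    # Each restart uses three distinct data points as starting centroids.
    start = points[rng.choice(len(points), size=3, replace=False)]
    _, sse = run_k_means(points, start)
    sse_per_start.append(sse)

best_sse = min(sse_per_start)
```

Comparing the values in `sse_per_start` shows whether some starts got stuck in a poorer solution than others.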

Page 5

Limitations of K-means clustering

1. Difficult to detect clusters with non-spherical shapes

2. Difficult to detect clusters of widely different sizes

3. Difficult to detect clusters of different densities

Page 6

Proc MODECLUS

Use a smoother to estimate the (local) density of the given dataset

A cluster is loosely defined as a region surrounding a local maximum of the probability density function
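A simplified sketch of this idea follows. It is not the proc MODECLUS algorithm: it assumes a uniform kernel (counting neighbours within radius R) as the smoother and a simple hill-climbing rule in which every point is linked to the densest point in its neighbourhood. The function name `mode_clusters` and the toy data are hypothetical.

```python
import numpy as np

def mode_clusters(points, radius):
    """Return (neighbour counts, label of the local density maximum reached)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = dists <= radius
    # The neighbour count is proportional to a uniform-kernel density estimate.
    counts = neighbours.sum(axis=1)
    # Strict ordering of points: higher density first, ties broken by lower index.
    order = np.lexsort((np.arange(n), -counts))
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)
    # Link every point to the best-ranked point within its neighbourhood.
    parent = np.array([np.flatnonzero(neighbours[i])[rank[neighbours[i]].argmin()]
                       for i in range(n)])
    # Follow the links uphill until a point that links to itself (a mode).
    labels = parent.copy()
    while True:
        nxt = parent[labels]
        if np.array_equal(nxt, labels):
            break
        labels = nxt
    return counts, labels

toy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
_, labels_small_r = mode_clusters(toy, radius=1.5)
_, labels_large_r = mode_clusters(toy, radius=20.0)
```

With the small radius the toy data give two modes and hence two clusters; with the large radius the density estimate is so smooth that only one mode remains.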

Page 7

Identification of water samples with incorrect total nitrogen levels

- proc MODECLUS, R = 1000

[Figure: scatter plot of Tot_N (ps), mg/l, against Tot_N (Kj), mg/l; smoothing parameter R = 1000; legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5, Other clusters]

What will happen if R is increased?

Page 8

Identification of water samples with incorrect total nitrogen levels

- proc MODECLUS, R = 4000

[Figure: scatter plot of Tot_N (ps), mg/l, against Tot_N (Kj), mg/l; smoothing parameter R = 4000; legend: Cluster 1, Cluster 2]

Page 9

Identification of water samples with incorrect total nitrogen levels

- proc MODECLUS, method 6

[Figure: scatter plot of total nitrogen (Kjeldahl) against total nitrogen (persulfate digestion); legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5, Clusters 6 - 18, No cluster assigned]

Why did the clustering fail?

Page 10

Limitations of density-based clustering

1. Difficult to control (requires repeated runs)

2. Collapses in high dimensions
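One way to see why a fixed-radius density estimate degrades in high dimensions is to look at how little of the space a neighbourhood covers. The sketch below is an illustration chosen for this note, not taken from the lecture: for points spread uniformly over a unit cube, it computes the share of the cube lying within distance 0.5 of the centre.

```python
import math

def ball_fraction(dim, radius=0.5):
    """Volume of a dim-dimensional ball; for radius <= 0.5 this is the
    fraction of the unit cube within that distance of the cube's centre."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

fractions = {dim: ball_fraction(dim) for dim in (1, 2, 5, 10, 20)}
for dim, frac in fractions.items():
    print(dim, frac)
```

The fraction is about 0.79 in two dimensions and about 0.0025 in ten, so a neighbourhood of fixed radius soon contains almost no observations unless the sample is very large.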

Page 11

Strength of density-based clustering

Given a sufficiently large sample, nonparametric density-based clustering methods are capable of detecting clusters of unequal size and dispersion and with highly irregular shapes

Page 12

Identification of water samples with incorrect total nitrogen levels

- transformed data

[Figure: scatter plot of Total N (ps) - Total N (Kj) against Total N (Kj)]
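Judging from the axis labels, the transformed data keep Total N (Kj) and replace Total N (ps) by the difference Total N (ps) - Total N (Kj), so samples on which the two methods agree lie close to zero on the vertical axis. A minimal sketch of that linear transformation is given below; the measurement values are hypothetical and are not the lecture's data.

```python
import numpy as np

# Hypothetical measurements in mg/l (not the lecture's data).
tot_n_kj = np.array([1000.0, 5000.0, 12000.0, 20000.0])
tot_n_ps = np.array([1100.0, 4800.0, 12500.0, 9000.0])

# Linear transformation: first output is Kj, second output is ps - Kj.
transform = np.array([[1.0, 0.0],
                      [-1.0, 1.0]])
original = np.column_stack([tot_n_kj, tot_n_ps])
transformed = original @ transform.T

print(transformed[:, 1])  # differences ps - Kj for each sample
```

In this made-up example the last sample stands out with a difference of -11000, while the others stay within a few hundred mg/l of zero.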

Page 13

Identification of water samples with incorrect total nitrogen levels

- proc MODECLUS, R = 2000, transformed data

[Figure: scatter plot of Total N (ps) - Total N (Kj) against Total N (Kj); legend: Cluster 1, Cluster 2, Cluster 3-6]

Page 14

Preprocessing

1. Standardization

2. Linear transformation

3. Dimension reduction
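Two of these steps can be sketched with NumPy: standardization to zero mean and unit standard deviation, and dimension reduction by projecting onto the leading principal components. Principal components are only one possible reduction method and are an assumption here, as are the function names and the synthetic data.

```python
import numpy as np

def standardize(data):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (data - mean) / std

def principal_components(data, n_components):
    """Project the centred data onto its first n_components principal axes."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Synthetic data with columns on very different scales (illustrative only).
raw = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
z = standardize(raw)
scores = principal_components(z, n_components=2)
```

Without the standardization step, distance-based methods such as k-means would be dominated by the column with the largest scale.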

Page 15

Postprocessing

1. Split a cluster
• Usually, the cluster with the largest SSE (sum of squared errors) is split

2. Introduce a new cluster centroid
• Often the point that is farthest from any cluster center is chosen

3. Disperse a cluster
• Remove one centroid and reassign the points to other clusters

4. Merge two clusters
• Typically, the clusters with the closest centroids are chosen
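The quantities that drive these choices are easy to compute once a clustering is available. The sketch below, on hypothetical toy data with an assumed labelling, finds the cluster with the largest SSE (a candidate for splitting), the pair of closest centroids (a candidate for merging) and the point farthest from any centroid (a candidate new centroid). The function names are illustrative.

```python
import numpy as np

def cluster_sse(points, labels, centroids):
    """Sum of squared distances to the centroid, one value per cluster."""
    return np.array([((points[labels == j] - c) ** 2).sum()
                     for j, c in enumerate(centroids)])

def closest_centroid_pair(centroids):
    """Indices of the two centroids that are nearest to each other."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # ignore the distance of a centroid to itself
    i, j = np.unravel_index(d.argmin(), d.shape)
    return int(i), int(j)

def farthest_point(points, centroids):
    """Index of the point whose nearest centroid is farthest away."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return int(d.min(axis=1).argmax())

# Toy clustering (illustrative only): eight points in three clusters.
points = np.array([[0.0, 0.0], [0.0, 2.0],
                   [5.0, 0.0], [5.0, 1.0], [6.0, 0.0],
                   [20.0, 0.0], [21.0, 0.0], [31.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])
centroids = np.array([points[labels == j].mean(axis=0) for j in range(3)])

sse = cluster_sse(points, labels, centroids)
split_candidate = int(sse.argmax())
merge_candidate = closest_centroid_pair(centroids)
new_centroid_candidate = farthest_point(points, centroids)
```

Here cluster 2 has by far the largest SSE, clusters 0 and 1 have the closest centroids, and the last point is the one farthest from any centroid.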

Page 16

Profiling website visitors

1. A total of 296 pages at a Microsoft website are grouped into 13 homogeneous categories
• Initial
• Support
• Entertainment
• Office
• Windows
• Othersoft
• Download
• …

2. For each of 32711 visitors we have recorded how many times they have visited the different categories of pages

3. We would like to make a behavioural segmentation of the users (a cluster analysis) that can be used in future marketing decisions
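The visit counts described in point 2 can be built from a raw visit log by mapping each page to its category and counting per visitor. The sketch below uses a hypothetical page-to-category lookup and a made-up log; none of the page names or counts come from the lecture's dataset.

```python
from collections import Counter

# Hypothetical lookup from page to category (illustrative only).
page_category = {
    "/support/faq": "support",
    "/support/contact": "support",
    "/office/word": "office",
    "/windows/update": "windows",
}
# Hypothetical visit log: (visitor code, page visited).
visits = [
    (10001, "/support/faq"),
    (10001, "/support/contact"),
    (10001, "/office/word"),
    (10002, "/windows/update"),
    (10002, "/windows/update"),
]

categories = sorted(set(page_category.values()))

# Count visits per visitor and category.
counts = {}
for visitor, page in visits:
    counts.setdefault(visitor, Counter())[page_category[page]] += 1

# One row per visitor, one column per category, ready for clustering.
rows = {visitor: [c[cat] for cat in categories] for visitor, c in counts.items()}
```

Each visitor is then described by one count per category rather than one count per page, which keeps the number of clustering variables small.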

Page 17

Profiling website visitors

- the dataset

client_code  initial  help  entertainment  office  windows  othersft  download  otherint  development  hardware  business  information  area
10001        1        1     1              0       0        0         0         0         0            0         0         0            0
10002        1        1     0              0       0        0         0         0         0            0         0         0            0
10003        2        1     0              0       0        0         0         0         0            0         0         0            0
10004        0        0     0              0       0        0         0         0         0            0         0         0            1
10005        0        0     0              0       0        0         0         1         0            0         0         0            0
10006        2        0     0              0       0        0         0         0         0            0         0         0            0
10007        0        0     0              1       0        0         0         0         0            0         0         0            0
10008        1        0     0              0       0        0         0         0         0            0         0         0            0
10009        0        0     0              0       1        0         1         0         0            0         0         0            0
10010        1        1     0              1       0        1         0         0         2            0         0         0            0
10011        2        0     0              3       0        0         0         0         0            0         0         0            0
10012        0        0     0              0       0        1         0         0         1            0         0         0            0
10013        0        0     0              0       0        0         1         0         0            0         0         0            0
10014        0        0     0              0       0        0         0         0         0            0         0         0            1
10015        0        0     0              0       0        0         0         0         0            1         0         0            0
10016        0        0     0              0       0        0         0         1         1            0         0         0            0
10017        1        0     0              0       0        0         0         0         3            0         0         0            0
10018        1        0     0              0       0        0         0         0         0            0         0         0            0
10019        4        0     2              1       0        0         2         0         0            1         1         0            0
10020        0        1     1              1       0        0         1         0         0            0         0         0            0
10021        3        1     1              1       2        1         1         3         2            0         1         1            0

Why is it necessary to group the pages into categories?

Page 18

Profiling website visitors

- 10-means clustering

Page 19

Profiling website visitors

- cluster proximities

Page 20

Profiling website visitors

- profiles

Page 21

Profiling website visitors

- Kohonen Map of cluster frequencies

Page 22

Profiling website visitors

- Kohonen Maps of means by variable and grid cell

Page 23

Characteristics of Kohonen maps

The centroids vary smoothly over the map
• The set of clusters having unusually large (or small) values of a given variable tends to form connected spatial patterns

Clusters with similar centroids need not be close to each other in a Kohonen map

The sizes of the clusters in Kohonen maps tend to be less variable than those obtained by K-means clustering
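The smoothness of the centroids comes from the way a Kohonen map (self-organizing map) is trained: each observation pulls not only its best-matching cell but also that cell's grid neighbours towards itself. The sketch below is a minimal online version under assumed settings (Gaussian neighbourhood, linearly decaying learning rate and neighbourhood width, synthetic data); it is not the software used for the maps in the lecture.

```python
import numpy as np

def train_som(data, rows, cols, n_epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols Kohonen map; returns centroids of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of every cell, used to measure distance on the map.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Start each cell's centroid at a randomly chosen observation.
    weights = data[rng.choice(len(data), size=rows * cols, replace=True)].astype(float)
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood shrinks over time
            # Best-matching cell: the centroid closest to the observation.
            bmu = np.linalg.norm(weights - x, axis=1).argmin()
            # Cells near the best-matching cell on the grid are updated more strongly.
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights.reshape(rows, cols, -1)

def assign_cells(data, weights):
    """Index of the closest map cell for every observation."""
    flat = weights.reshape(-1, weights.shape[-1])
    return np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2).argmin(axis=1)

rng = np.random.default_rng(1)
data = rng.normal(size=(300, 2))  # synthetic data, illustrative only
som = train_som(data, rows=3, cols=3)
cell_sizes = np.bincount(assign_cells(data, som), minlength=9)
```

The array `cell_sizes` gives the number of observations per grid cell, which is the quantity shown in a map of cluster frequencies.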