K-Means Algorithm — each cluster is represented by the mean value of the objects in the cluster


Upload: barnaby-harvey

Post on 17-Jan-2016


TRANSCRIPT

Page 1: K-Means Algorithm. Each cluster is represented by the mean value of the objects in the cluster. Input: set of objects (n), number of clusters (k). Output: set of k clusters.
Page 2:
Page 3:

K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: set of objects (n), number of clusters (k)
Output: set of k clusters
Algorithm:
Randomly select k samples and mark them as the initial clusters.
Repeat: assign/reassign each sample to the cluster to which it is most similar, based on the mean of the cluster; update each cluster's mean; until no change.

Page 4:

K-Means (graph)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between the centroids and each object. Use the Euclidean distance to determine the minimum distance:

d(A,B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Step 3: Assign objects to the k clusters based on minimum distance.
Step 4: Calculate the centroid of each cluster using

C = ((x1 + x2 + ... + xn)/n, (y1 + y2 + ... + yn)/n)

Go to Step 2. Repeat until there is no change in the centroids.
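Steps 1–4 above can be sketched as follows (a minimal illustration, not code from the slides; as an assumption, the initial centroids are taken to be k randomly chosen sample points):

```python
import math
import random

def euclidean(a, b):
    # d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2), generalized to any dimension
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def kmeans(points, k, max_iter=100, seed=0):
    # Step 1: form k centroids randomly (here: k random sample points)
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Steps 2-3: assign each object to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[i].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change in centroids -> stop
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that the result depends on the random initialization, which is why Step 1 matters in practice.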

Page 5:

K-Medoids (PAM)
Also called Partitioning Around Medoids.
Step 1: Choose k medoids.
Step 2: Assign all points to the closest medoid.
Step 3: Form the distance matrix for each cluster and choose the next best medoid, i.e., the point closest to all other points in the cluster. Go to Step 2.
Repeat until there is no change in any medoid.
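The three steps can be sketched like this (a simplified illustration of the idea, not the full PAM swap procedure; as an assumption, the first k points serve as initial medoids):

```python
import math

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def k_medoids(points, k, max_iter=100):
    # Step 1: choose k medoids (here: simply the first k points)
    medoids = points[:k]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda m: dist(p, medoids[m]))
            clusters[i].append(p)
        # Step 3: within each cluster, pick the point closest to all others
        new_medoids = [
            min(cl, key=lambda c: sum(dist(c, o) for o in cl)) if cl else medoids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_medoids == medoids:  # no medoid changed -> done
            break
        medoids = new_medoids
    return medoids, clusters
```

Because medoids are actual data points rather than means, this method is less sensitive to outliers than k-means.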

Page 6:

What are Hierarchical Methods?
Group data objects into a tree of clusters.
Classified as:
Agglomerative (bottom-up)
Divisive (top-down)
Once a merge or split decision is made, it cannot be backtracked.

Page 7:

Types of hierarchical clustering
Agglomerative (bottom-up), e.g. AGNES:
Places each object into its own cluster and merges these atomic clusters into larger and larger clusters.
Variants differ in their definition of inter-cluster similarity.
Divisive (top-down), e.g. DIANA:
All objects are initially in one cluster; the cluster is subdivided into smaller and smaller pieces until each object forms a cluster of its own or some termination condition is satisfied.
In both of the above methods, the termination condition is the number of clusters.

Page 8:

Dendrogram
[Figure: a dendrogram showing merges/splits across levels 0 through 4]

Page 9:

Measures of Distance
Minimum distance (nearest neighbor): single linkage; related to the minimum spanning tree.
Maximum distance (farthest neighbor): complete linkage.
Mean distance: avoids the outlier-sensitivity problem.
Average distance: can handle categorical as well as numeric data.
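The four measures can be written down directly for two clusters of points (a minimal sketch assuming numeric data and Euclidean distance between items):

```python
import math

def d(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def single_link(c1, c2):
    # minimum distance (nearest neighbor)
    return min(d(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # maximum distance (farthest neighbor)
    return max(d(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    # average of all pairwise distances
    return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def mean_distance(c1, c2):
    # distance between the cluster means (centroids)
    m1 = tuple(sum(x) / len(c1) for x in zip(*c1))
    m2 = tuple(sum(x) / len(c2) for x in zip(*c2))
    return d(m1, m2)
```
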

Page 10:

Euclidean Distance

Page 11:

Agglomerative Algorithm
Step 1: Make each object a cluster.
Step 2: Calculate the Euclidean distance from every point to every other point, i.e., construct a distance matrix.
Step 3: Identify the two clusters with the shortest distance and merge them. Go to Step 2.
Repeat until all objects are in one cluster.
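The loop above can be sketched directly (a minimal single-link version; the quadratic pairwise scan is for clarity, not efficiency):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def agglomerative(points):
    # Step 1: make each object its own cluster
    clusters = [[p] for p in points]
    merges = []  # (distance, merged cluster) recorded at each step
    while len(clusters) > 1:
        # Step 2: distance between clusters = single-link (min pairwise) distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = min(euclid(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        # Step 3: merge the two closest clusters, then repeat
        dij, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        merges.append((dij, list(clusters[i])))
    return merges
```

The returned merge history is exactly what a dendrogram visualizes: which clusters merged, and at what distance.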

Page 12:

Agglomerative Algorithm Approaches
Single link: quite simple, but not very efficient, and suffers from the chain effect.
Complete link: produces clusters that are more compact than those found using the single-link technique.
Average link.

Page 13:

Simple Example

Item  E  A  C  B  D
E     0  1  2  2  3
A     1  0  2  5  3
C     2  2  0  1  6
B     2  5  1  0  3
D     3  3  6  3  0
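Running single-link agglomerative clustering on this matrix can be traced in a few lines (a sketch working from the matrix itself rather than raw coordinates):

```python
labels = ["E", "A", "C", "B", "D"]
# distance matrix from the slide, rows/columns in the order E, A, C, B, D
D = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 5, 3],
    [2, 2, 0, 1, 6],
    [2, 5, 1, 0, 3],
    [3, 3, 6, 3, 0],
]

clusters = [{i} for i in range(len(labels))]
merges = []
while len(clusters) > 1:
    # single link: distance between clusters = smallest item-to-item entry
    dist, i, j = min(
        (min(D[a][b] for a in clusters[i] for b in clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    clusters[i] |= clusters[j]
    del clusters[j]
    merges.append((dist, "".join(sorted(labels[m] for m in clusters[i]))))
print(merges)
```

With this matrix, {E,A} and {C,B} merge first at distance 1, those two clusters join at distance 2, and D joins last at distance 3.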

Page 14:

Another Example
Use the single-link technique to find clusters in the given database.

Item  X     Y
1     0.4   0.53
2     0.22  0.38
3     0.35  0.32
4     0.26  0.19
5     0.08  0.41
6     0.45  0.3

Page 15:

Plot given data

Page 16:

Identify two nearest clusters

Page 17:

Repeat the process until all objects are in the same cluster

Page 18:

Average link: average distance matrix

Page 19:

Construct a distance matrix

      1     2     3     4     5     6
1     0
2     0.24  0
3     0.22  0.15  0
4     0.37  0.2   0.15  0
5     0.34  0.14  0.28  0.29  0
6     0.23  0.25  0.11  0.22  0.39  0
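The matrix can be recomputed from the Page 14 data to check it (a small sketch; the recomputed values agree with the slide's entries to within about 0.01 of rounding):

```python
import math

# the six (X, Y) points from the "Another Example" slide
points = [(0.4, 0.53), (0.22, 0.38), (0.35, 0.32),
          (0.26, 0.19), (0.08, 0.41), (0.45, 0.3)]

def euclid(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# lower-triangular distance matrix, rounded to two decimals
matrix = [[round(euclid(points[i], points[j]), 2) for j in range(i)]
          for i in range(len(points))]
for i, row in enumerate(matrix):
    print(i + 1, row)
```
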

Page 20:
Page 21:

Divisive Clustering
All items are initially placed in one cluster.
The clusters are repeatedly split in two until all items are in their own clusters.

[Figure: dendrogram over items A, B, C, D, E with the successive split distances marked]

Page 22:

Difficulties in Hierarchical Clustering
Difficulty in selecting merge or split points. This decision is critical, because further merge or split decisions are based on the newly formed clusters.
The method does not scale well, so hierarchical methods are integrated with other clustering techniques to form multiple-phase clustering.

Page 23:

Types of hierarchical clustering techniques
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
ROCK: robust clustering with links; explores the concept of links.
CHAMELEON: a hierarchical clustering algorithm using dynamic modeling.

Page 24:

Outlier Analysis
Outliers are data objects that are different from, or inconsistent with, the remaining set of data.
Outliers can be caused by:
Measurement or execution error
Inherent data variability
Outlier detection can be used in fraud detection. Outlier detection and analysis is referred to as outlier mining.

Page 25:

Applications of outlier mining
Fraud detection.
Customized marketing: identifying the spending behavior of customers with extremely low or high incomes.
Medical analysis: finding unusual responses to various medical treatments.

Page 26:

What is outlier mining?
Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are dissimilar, exceptional, or inconsistent with respect to the remaining data.
There are two subproblems:
Define what data can be considered inconsistent in a given data set.
Find a method to mine the outliers.

Page 27:

Methods of outlier detection
Statistical approach
Distance-based approach
Density-based local outlier approach
Deviation-based approach
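The distance-based approach, for example, can be sketched with the classic DB(pct, dmin)-outlier definition: an object is an outlier if at least a fraction pct of the other objects lie farther than dmin from it (a minimal illustration; the threshold values here are arbitrary assumptions):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def db_outliers(points, pct, dmin):
    # DB(pct, dmin)-outlier: at least pct of the other objects
    # lie farther than dmin away
    out = []
    n = len(points)
    for i, p in enumerate(points):
        far = sum(1 for j, q in enumerate(points)
                  if j != i and euclid(p, q) > dmin)
        if far / (n - 1) >= pct:
            out.append(p)
    return out
```

Unlike the statistical approach, this requires no assumption about the underlying distribution, only a distance function and the two parameters.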

Page 28:

Statistical Distribution Approach
Identifies outliers with respect to a discordancy test.
A discordancy test examines a working hypothesis and an alternative hypothesis. It verifies whether an object oi is significantly large (or small) in relation to the distribution F. This helps in accepting the working hypothesis or rejecting it in favor of an alternative distribution:
Inherent alternative distribution
Mixture alternative distribution
Slippage alternative distribution

Page 29:

Procedures for detecting outliers
Block procedures: all suspect objects are treated as outliers, or all of them are accepted as consistent.
Consecutive procedures: the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on.

Page 30:

Questions in Clustering