
K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects and the number of clusters, k.
Output: a set of k clusters.
Algorithm:
- Randomly select k samples and mark them as the initial cluster means.
- Repeat:
  - Assign or reassign each sample to the cluster to which it is most similar, based on the cluster means.
  - Update each cluster's mean.
- Until there is no change.

K-Means (graph)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between the centroids and each object, using the Euclidean distance to determine the minimum:

d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Step 3: Assign each object to the cluster whose centroid is at minimum distance.
Step 4: Calculate the centroid of each cluster:

C = ((x1 + x2 + ... + xn) / n, (y1 + y2 + ... + yn) / n)

Go to Step 2. Repeat until there is no change in the centroids.
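A minimal sketch of these steps in Python (the function and variable names are mine, not from the slides; the data in the usage line is the six points from the "Another Example" slide further down):

import math
import random

def euclidean(a, b):
    # d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def k_means(points, k, max_iter=100):
    # Step 1: pick k of the objects as the initial centroids, randomly.
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append((sum(p[0] for p in c) / len(c),
                                      sum(p[1] for p in c) / len(c)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # no change: done
            break
        centroids = new_centroids
    return clusters, centroids

points = [(0.4, 0.53), (0.22, 0.38), (0.35, 0.32), (0.26, 0.19), (0.08, 0.41), (0.45, 0.3)]
clusters, centroids = k_means(points, k=2)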

K-Medoids (PAM)
Also called Partitioning Around Medoids.
Step 1: Choose k medoids.
Step 2: Assign all points to the closest medoid.
Step 3: For each cluster, form a distance matrix and choose the next best medoid, i.e., the point closest to all other points in the cluster.
Go to Step 2. Repeat until there is no change in any medoid.
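A sketch of this loop, reusing euclidean() and the imports from the k-means sketch above (a simplified medoid-update variant; full PAM also evaluates the cost of every possible swap):

def k_medoids(points, k, max_iter=100):
    # Step 1: choose k medoids at random.
    medoids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, medoids[i]))
            clusters[nearest].append(p)
        # Step 3: in each cluster, the point with the smallest total
        # distance to all other points becomes the new medoid.
        new_medoids = []
        for i, c in enumerate(clusters):
            if c:
                new_medoids.append(min(c, key=lambda p: sum(euclidean(p, q) for q in c)))
            else:
                new_medoids.append(medoids[i])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters, medoids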

What Are Hierarchical Methods?
They group data objects into a tree of clusters.
Classified as:
- Agglomerative (bottom-up)
- Divisive (top-down)
Once a merge or split decision is made, it cannot be backtracked.

Types of Hierarchical Clustering
Agglomerative (bottom-up), e.g. AGNES:
- Places each object in its own cluster and merges these atomic clusters into larger and larger clusters.
- Agglomerative methods differ in their definition of inter-cluster similarity.
Divisive (top-down), e.g. DIANA:
- All objects are initially in one cluster.
- Subdivides the cluster into smaller and smaller pieces, until each object forms a cluster of its own or some termination condition is satisfied.
In both methods, the termination condition is typically the desired number of clusters.

Dendrogram
(Figure: a dendrogram showing the clusters at merge levels 0 through 4.)

Measures of Distance
- Minimum distance: nearest neighbor, i.e., single linkage; related to the minimum spanning tree.
- Maximum distance: farthest-neighbor clustering algorithm, i.e., complete linkage.
- Mean distance: avoids the outlier-sensitivity problem.
- Average distance: can handle categorical as well as numeric data.
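A sketch of these cluster-to-cluster measures in Python (function names are mine; clusters are lists of 2-D points and euclidean() is the helper defined earlier):

def single_link(c1, c2):
    # minimum distance: the nearest pair across the two clusters
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    # maximum distance: the farthest pair across the two clusters
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    # average of all pairwise distances between the clusters
    return sum(euclidean(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def mean_distance(c1, c2):
    # distance between the two cluster means
    def mean(c):
        return (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
    return euclidean(mean(c1), mean(c2))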

Euclidean Distance
d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Agglomerative Algorithm
Step 1: Make each object its own cluster.
Step 2: Calculate the Euclidean distance from every point to every other point, i.e., construct a distance matrix.
Step 3: Identify the two clusters with the shortest distance and merge them.
Go to Step 2. Repeat until all objects are in one cluster.
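A sketch of this loop, assuming the euclidean() and linkage helpers above (single linkage by default):

def agglomerative(points, linkage=single_link):
    # Step 1: each object starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the pair of clusters at the shortest distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        # Merge them and repeat until one cluster remains.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges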

Agglomerative Algorithm Approaches
Single link:
- Quite simple, but not very efficient.
- Suffers from the chain effect: separate clusters can be merged through a chain of pairwise-close points.
Complete link:
- Produces clusters that are more compact than those found using the single-link technique.
Average link:
- A compromise between the two.

Simple Example
Distance matrix:

Item  E  A  C  B  D
E     0  1  2  2  3
A     1  0  2  5  3
C     2  2  0  1  6
B     2  5  1  0  3
D     3  3  6  3  0
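Tracing the agglomerative algorithm with single link on this matrix: the smallest entries are d(E, A) = 1 and d(C, B) = 1, so {E, A} and {C, B} form first; the single-link distance between {E, A} and {C, B} is min(2, 2, 2, 5) = 2, so those two clusters merge next; finally D joins at distance min(3, 3, 6, 3) = 3.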

Another Example
Use the single-link technique to find clusters in the given database.

Item  X     Y
1     0.4   0.53
2     0.22  0.38
3     0.35  0.32
4     0.26  0.19
5     0.08  0.41
6     0.45  0.3

Plot the given data.
Identify the two nearest clusters.
Repeat the process until all objects are in the same cluster.

Average Link
Construct the distance matrix, then repeatedly merge the two clusters with the smallest average distance.

     1     2     3     4     5     6
1    0
2    0.24  0
3    0.22  0.15  0
4    0.37  0.2   0.15  0
5    0.34  0.14  0.28  0.29  0
6    0.23  0.25  0.11  0.22  0.39  0
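Tracing average link on this matrix (my own computation): {3, 6} merge first at 0.11, then {2, 5} at 0.14; 4 joins {3, 6} at (0.15 + 0.22) / 2 = 0.185; {3, 4, 6} and {2, 5} merge at (0.15 + 0.28 + 0.20 + 0.29 + 0.25 + 0.39) / 6 = 0.26; and 1 joins last at (0.24 + 0.22 + 0.37 + 0.34 + 0.23) / 5 = 0.28.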

Divisive Clustering
All items are initially placed in one cluster. The clusters are repeatedly split in two until all items are in their own cluster.

(Figure: a cluster containing A, B, C, D, and E is split step by step until each object stands alone.)
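A minimal sketch of the top-down idea, reusing euclidean() from above (a simplified splinter-group split in the spirit of DIANA, not the exact textbook procedure; it assumes the points are distinct):

def diameter(cluster):
    # largest pairwise distance within a cluster
    return max((euclidean(p, q) for p in cluster for q in cluster if p != q), default=0.0)

def divisive(points):
    clusters = [list(points)]
    while any(len(c) > 1 for c in clusters):
        # Split the cluster with the largest diameter.
        c = max(clusters, key=diameter)
        clusters.remove(c)
        # Seed a splinter group with the point farthest from the others.
        seed = max(c, key=lambda p: sum(euclidean(p, q) for q in c))
        splinter = [seed]
        rest = [p for p in c if p != seed]
        # Move over every point that is now closer to the splinter group.
        for p in list(rest):
            if len(rest) == 1:
                break  # keep at least one point on each side
            d_splinter = min(euclidean(p, q) for q in splinter)
            d_rest = min(euclidean(p, q) for q in rest if q != p)
            if d_splinter < d_rest:
                splinter.append(p)
                rest.remove(p)
        clusters.extend([splinter, rest])
    return clusters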

Difficulties in Hierarchical Clustering
- Difficulties regarding the selection of merge or split points. This decision is critical, because further merge or split decisions are based on the newly formed clusters.
- The method does not scale well.
So hierarchical methods are often integrated with other clustering techniques to form multiple-phase clustering.

Types of Hierarchical Clustering Techniques
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
- ROCK: Robust Clustering using Links; explores the concept of links.
- CHAMELEON: a hierarchical clustering algorithm using dynamic modeling.

Outlier Analysis
Outliers are data objects that are different from, or inconsistent with, the remaining set of data.
Outliers can be caused by:
- Measurement or execution error.
- Inherent data variability.
Outlier detection can be used in fraud detection. Outlier detection and analysis are together referred to as outlier mining.

Applications of Outlier Mining
- Fraud detection.
- Customized marketing: identifying the spending behavior of customers with extremely low or extremely high incomes.
- Medical analysis: finding unusual responses to various medical treatments.

What Is Outlier Mining?
Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are dissimilar, exceptional, or inconsistent with respect to the remaining data.
There are two subproblems:
- Define what data can be considered inconsistent in a given data set.
- Find a method to mine the outliers.
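One simple way to make "top k most dissimilar" concrete is to rank objects by their average distance to all other objects. This naive distance-based criterion is my own illustration (reusing euclidean() from above), not the only possible definition:

def top_k_outliers(points, k):
    # Rank each object by its average distance to every other object;
    # the k most remote objects are reported as outliers.
    def avg_dist(p):
        return sum(euclidean(p, q) for q in points if q != p) / (len(points) - 1)
    return sorted(points, key=avg_dist, reverse=True)[:k]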

Methods of Outlier Detection
- Statistical approach
- Distance-based approach
- Density-based local outlier approach
- Deviation-based approach

Statistical Distribution Approach
- Identifies outliers with respect to a discordancy test.
- A discordancy test examines a working hypothesis and an alternative hypothesis.
- It verifies whether an object o_i is significantly large in relation to the distribution F.
- This leads to accepting the working hypothesis or rejecting it in favor of an alternative distribution:
  - Inherent alternative distribution
  - Mixture alternative distribution
  - Slippage alternative distribution

Procedures for Detecting Outliers
- Block procedures: all suspect objects are treated as outliers, or all of them are accepted as consistent.
- Consecutive procedures: the object least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise the next most extreme object is tested, and so on.
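As a concrete, simplified example of the statistical approach, assuming the data follow a single normal distribution F, values more than t standard deviations from the mean can be flagged as discordant. This z-score sketch is my own illustration, not a formal discordancy test:

import statistics

def z_score_outliers(values, t=3.0):
    # Working hypothesis: all values come from the same normal
    # distribution F. A value whose z-score exceeds t is treated
    # as discordant with respect to F.
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > t]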

Questions in Clustering
