mixed numeric and categorical attribute clustering algorithm


Upload: asoka-korale

Post on 22-Jan-2018


MIXED NUMERIC AND CATEGORICAL ATTRIBUTE CLUSTERING ALGORITHM MODELING

DR. ASOKA KORALE, C.ENG. MIET & MIESL

ADVANTAGES TO NUMERIC AND CATEGORICAL ATTRIBUTE CLUSTERING

Slide | 2

• Improved Targeting in Campaigns & Insight in to Segments

• Currently clustering on numeric variables: Age, Net Stay, ARPU

• Primary attributes that can be included with mixed attribute type clustering: Account Type, Gender, Geo Location, ……

• Currently Fuzzy C-Means Algorithm used in Clustering

• Digital Advertising segmentations increasingly based on clustering

• Include other Categorical attributes depending on Interest segment to create “Micro Segments”

WIDENING POTENTIAL INSIGHTS THROUGH CATEGORICAL CLUSTERING

Slide | 3

• Improved Targeting in Campaigns & Insight in to Segments

• All Attributes Can be Clustered, leading to a very specific and wider array of segments

• Geographic attribute clustering to incorporate Income/ARPU hotspots at micro level

CONCEPT UNDERLYING THE MIXED K PROTOTYPES ALGORITHM [1]

Slide | 4

Point “d” and point “c” may switch sides depending on how similar their numeric and categorical parts are to the numeric and categorical parts of each centroid (prototype).

The influence, or contribution, of the numeric and categorical attributes of a data point can be controlled via a parameter “gamma”.

Point “a” may switch if its categorical part is closer to the categorical part of a centroid (prototype) than its numeric part is to the numeric part of that centroid.

The numeric and categorical parts of a data point can be considered separately; two sets of centroids act as attractors, one for each attribute type, in each cluster.

[Figure: scatter plot with axes Numeric Attribute1 and Numeric Attribute2; shapes represent two values of a single categorical variable.]
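The switching behaviour described on this slide can be illustrated with a small sketch. It assumes a squared Euclidean distance for the numeric part and a simple mismatch count for the categorical part, weighted by gamma; the function name, the two prototypes, and the point are hypothetical and chosen only for illustration.

```python
def mixed_distance(x_num, x_cat, proto_num, proto_cat, gamma):
    """Numeric part (squared Euclidean, an assumption) plus gamma times
    the number of categorical attributes that differ from the prototype."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical

# Two hypothetical prototypes: (numeric part, categorical part).
protos = [((0.0, 0.0), ("circle",)), ((2.0, 0.0), ("triangle",))]

# A hypothetical point: numerically nearer prototype 0,
# but sharing its category with prototype 1.
point_num, point_cat = (0.8, 0.0), ("triangle",)

def nearest(gamma):
    """Index of the prototype with the smallest mixed distance."""
    dists = [mixed_distance(point_num, point_cat, pn, pc, gamma)
             for pn, pc in protos]
    return dists.index(min(dists))

print(nearest(gamma=0.1))  # small gamma: numeric part dominates -> 0
print(nearest(gamma=2.0))  # large gamma: categorical part dominates -> 1
```

With a small gamma the point stays with the numerically closer prototype; raising gamma makes the categorical mismatch expensive enough for the point to switch sides.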

[1] Huang, CSIRO, Australia

MIXED K PROTOTYPES ALGORITHM [1]

Slide | 5

The distance to a prototype (center) has two parts: numeric and categorical

Numeric Attributes: Euclidean Distance

Categorical Attributes: Dissimilarity Measure

Centroid of Numeric Attributes: a simple average of the points in that cluster

Includes “Yij”, a fuzzy membership function, if we wish to go in that direction
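For reference, the two-part distance and the numeric centroid can be written as below. This follows the commonly cited k-prototypes formulation attributed to Huang [1]; the symbol names are assumptions and the squared Euclidean form is one common choice. Here x_i is a data point, q_l the prototype of cluster l, m_r and m_c the numbers of numeric and categorical attributes, and y_il the membership of point i in cluster l (the quantity the slide calls “Yij”).

```latex
d(x_i, q_l) \;=\; \sum_{j=1}^{m_r} \left( x_{ij}^{r} - q_{lj}^{r} \right)^{2}
\;+\; \gamma \sum_{j=1}^{m_c} \delta\!\left( x_{ij}^{c},\, q_{lj}^{c} \right),
\qquad
\delta(a, b) =
\begin{cases}
0, & a = b \\
1, & a \neq b
\end{cases}

q_{lj}^{r} \;=\; \frac{\sum_{i=1}^{n} y_{il}\, x_{ij}^{r}}{\sum_{i=1}^{n} y_{il}}
```

With crisp memberships (y_il equal to 0 or 1) the second expression reduces to the simple average of the points in the cluster.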

MIXED K PROTOTYPES ALGORITHM [1]

Slide | 6

Minimize the total cost “E”, which is the sum of the distances to the numeric and categorical parts of the centroid (prototype)

The centroid of the categorical attributes is determined by the highest-frequency attribute value in each cluster
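The prototype update and the cost “E” can be sketched as follows. This is a minimal illustration under stated assumptions: crisp (non-fuzzy) memberships, squared Euclidean distance on the numeric part, and a mismatch count on the categorical part. The function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import fmean

def update_prototypes(points, labels, k):
    """points: list of (numeric_tuple, categorical_tuple).
    Returns one (mean_tuple, mode_tuple) prototype per cluster.
    Assumes every cluster has at least one member."""
    protos = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        # Numeric centroid: column-wise average of the members.
        mean = tuple(fmean(col) for col in zip(*(num for num, _ in members)))
        # Categorical centroid: most frequent value in each column.
        mode = tuple(Counter(col).most_common(1)[0][0]
                     for col in zip(*(cat for _, cat in members)))
        protos.append((mean, mode))
    return protos

def total_cost(points, labels, protos, gamma):
    """E: sum over all points of numeric part + gamma * categorical part."""
    cost = 0.0
    for (num, cat), lab in zip(points, labels):
        p_num, p_cat = protos[lab]
        cost += sum((a - b) ** 2 for a, b in zip(num, p_num))
        cost += gamma * sum(a != b for a, b in zip(cat, p_cat))
    return cost

# Toy data (hypothetical): two numeric attributes, one categorical attribute.
points = [
    ((0.0, 0.0), ("pre",)), ((2.0, 0.0), ("pre",)),
    ((10.0, 10.0), ("post",)), ((12.0, 10.0), ("post",)), ((11.0, 13.0), ("pre",)),
]
labels = [0, 0, 1, 1, 1]

protos = update_prototypes(points, labels, k=2)
E = total_cost(points, labels, protos, gamma=1.0)
print(protos)
print(E)
```

In the toy data the second cluster takes “post” as its categorical centroid because it is the most frequent value there, and the single “pre” member contributes one unit of gamma to E.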

Slide | 7

CONVERGENCE PERFORMANCE

[Figure: three panels plotted against Iteration Number (0 to 40): “Total no of switches at each iteration”, “Total Distance at each iteration” (scale x 10^4) and “Total Categorical Distance at each iteration”; legend entries 1 to 8.]
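Convergence curves like these can be produced by recording, at every iteration, how many points changed cluster and the total distance. The sketch below is a minimal k-prototypes loop under assumptions not stated on the slides: crisp memberships, squared Euclidean numeric distance, mismatch-count categorical distance, and random data points as initial prototypes. All names and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from statistics import fmean

def distance(point, proto, gamma):
    """Squared Euclidean part plus gamma times the count of category mismatches."""
    (num, cat), (p_num, p_cat) = point, proto
    numeric = sum((a - b) ** 2 for a, b in zip(num, p_num))
    mismatches = sum(a != b for a, b in zip(cat, p_cat))
    return numeric + gamma * mismatches

def k_prototypes(points, k, gamma, max_iter=40, seed=0):
    """Returns (labels, switches per iteration, total cost per iteration)."""
    protos = random.Random(seed).sample(points, k)  # k data points as initial prototypes
    labels = [-1] * len(points)                     # -1 means "not yet assigned"
    switches, costs = [], []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest prototype.
        dists = [[distance(p, q, gamma) for q in protos] for p in points]
        new_labels = [row.index(min(row)) for row in dists]
        switches.append(sum(a != b for a, b in zip(new_labels, labels)))
        costs.append(sum(row[lab] for row, lab in zip(dists, new_labels)))
        labels = new_labels
        if switches[-1] == 0:                       # no point moved: converged
            break
        # Update step: mean for numeric columns, mode for categorical columns.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                             # an empty cluster keeps its old prototype
                mean = tuple(fmean(col) for col in zip(*(m[0] for m in members)))
                mode = tuple(Counter(col).most_common(1)[0][0]
                             for col in zip(*(m[1] for m in members)))
                protos[c] = (mean, mode)
    return labels, switches, costs

# Synthetic data (hypothetical): two numeric blobs, each with its own category.
rng = random.Random(1)
points = ([((rng.gauss(0, 1), rng.gauss(0, 1)), ("pre",)) for _ in range(50)]
          + [((rng.gauss(5, 1), rng.gauss(5, 1)), ("post",)) for _ in range(50)])

labels, switches, costs = k_prototypes(points, k=2, gamma=1.0)
print(switches)
print(costs)
```

The first entry of `switches` always equals the number of points, because every point starts unassigned. The recorded cost does not increase from one iteration to the next, since both the assignment step and the mean/mode update can only lower or preserve it.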

Slide | 8

CLUSTER & SEGMENT PROFILE

[Figure: four panels by cluster: “Number of Cx in each Cluster” against Cluster ID (1 to 8), and Age, Net Stay and ARPU (scale x 10^4) against Cluster/Segment ID (1 to 8).]

Slide | 9

VALIDATION WITH DISTRIBUTION ANALYSIS

Cluster ID  Cx in Cluster  Avg. Age  Spread Age  Avg. Net-Stay  Spread Net-Stay  Avg. ARPU  Spread ARPU  Post Paid  Pre Paid  Female  Male
1           913            27        5           28             26               1231       1427         90         823       913     0
2           930            28        5           19             16               1407       1699         159        771       0       930
3           407            53        8           46             35               1095       1303         34         373       407     0
4           409            54        8           34             24               967        919          66         343       0       409
5           556            36        11          82             43               2601       2399         546        10        556     0
6           542            32        5           95             27               1031       927          0          542       67      475
7           1116           36        9           96             44               2917       2669         1116       0         0       1116
8           348            57        7           131            33               1205       853          147        201       33      315

[Figure: two histograms, “Histogram Cx Age, Male” and “Histogram Cx Age, Female”; x-axis Age (years), y-axis Frequency.]

Because the Age histograms are somewhat bi-modal, the clustering is able to identify their modes
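A profile table of this kind can be computed directly from the cluster labels. The sketch below assumes that “Spread” means standard deviation, which the slides do not state; the record field names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import fmean, pstdev

def cluster_profile(records, labels):
    """records: list of dicts with a numeric 'age' and a categorical 'gender'
    (hypothetical field names). Returns size, mean, spread and category
    counts for each cluster label."""
    groups = defaultdict(list)
    for rec, lab in zip(records, labels):
        groups[lab].append(rec)
    profile = {}
    for lab, recs in sorted(groups.items()):
        ages = [r["age"] for r in recs]
        profile[lab] = {
            "size": len(recs),
            "avg_age": fmean(ages),
            "spread_age": pstdev(ages),  # assumption: spread = standard deviation
            "gender": Counter(r["gender"] for r in recs),
        }
    return profile

# Hypothetical records and labels, for illustration only.
records = [
    {"age": 26, "gender": "F"}, {"age": 28, "gender": "F"},
    {"age": 52, "gender": "M"}, {"age": 56, "gender": "M"},
]
labels = [1, 1, 2, 2]

profile = cluster_profile(records, labels)
for lab, row in profile.items():
    print(lab, row)
```

Comparing the per-cluster averages with the peaks of the attribute histograms is one way to check that the clusters line up with the modes of the distribution.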

Slide | 10

VALIDATION WITH DISTRIBUTION ANALYSIS

Cluster Segment Profile

Cluster ID  Datapoints in Cluster  Avg. Age  Spread Age  Avg. Net-Stay  Spread Net-Stay  Avg. ARPU  Spread ARPU  Number Post Paid  Number Pre Paid  Number Female  Number Male
1           913                    27        5           28             26               1231       1427         90                823              913            0
2           930                    28        5           19             16               1407       1699         159               771              0              930
3           407                    53        8           46             35               1095       1303         34                373              407            0
4           409                    54        8           34             24               967        919          66                343              0              409
5           556                    36        11          82             43               2601       2399         546               10               556            0
6           542                    32        5           95             27               1031       927          0                 542              67             475
7           1116                   36        9           96             44               2917       2669         1116              0                0              1116
8           348                    57        7           131            33               1205       853          147               201              33             315

[Figure: “Histogram Cx Network Stay”; x-axis Net Stay (months), 0 to 240; y-axis Frequency.]

No identifiable structure in Net Stay distribution

Slide | 11

CLUSTERING NUMERIC PART OF SEGMENTS IN 3D

[Figure: 3D scatter plot “Segmental Analysis: Age, Net Stay and ARPU”; axes Age (normalized), Net-Stay (normalized) and ARPU (normalized); legend: clusters 1 to 8.]
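The axes are labelled “normalized”, but the slides do not say which normalization was used. The sketch below shows z-score standardization as one common choice, purely as an assumption. Some rescaling matters here because ARPU values run into the thousands while Age values stay below 100, so an unscaled Euclidean distance would be dominated by ARPU. The function name and sample values are hypothetical.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Rescale a numeric attribute to zero mean and unit standard deviation.
    A constant attribute carries no information and maps to all zeros."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical raw values on very different scales.
age = [20.0, 30.0, 40.0]
arpu = [1000.0, 1500.0, 3500.0]

age_z, arpu_z = zscore(age), zscore(arpu)
print(age_z)
print(arpu_z)
```

After rescaling, each attribute contributes on a comparable scale to the numeric part of the distance, which also makes the choice of gamma easier to reason about.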

Slide | 12

NOTABLE POINTS

• Allows us to cluster most attributes (within reason)

• Particularly if the categorical attributes do not have many different component values

• Reasonable convergence performance, both in terms of run time and number of iterations

• Different dissimilarity measures and distance criteria will give differing results

• The influence of the categorical part via gamma may also need to change with the method used

• Algorithm somewhat sensitive to initial conditions: initialization of centroids

• Explore likelihood of falling in to a local minimum and getting trapped there, leading to a sub-optimal final solution

• To do…..

• Each drop can result in a non-unique final result but will not impact the underlying trends and insights in to each segment
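One common way to reduce the sensitivity to initial centroids is to run the algorithm from several random initializations and keep the run with the lowest total cost. The slides list this area as still to be explored, so the sketch below is only an illustration of that idea, not the author's method. It assumes crisp memberships, squared Euclidean numeric distance and a mismatch-count categorical distance; all names and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from statistics import fmean

def dist(p, q, gamma):
    """Mixed distance between a point and a prototype, both (numeric, categorical)."""
    return (sum((a - b) ** 2 for a, b in zip(p[0], q[0]))
            + gamma * sum(a != b for a, b in zip(p[1], q[1])))

def run_once(points, k, gamma, seed, max_iter=40):
    """One k-prototypes run from one random initialization; returns (cost, labels)."""
    protos = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: dist(p, protos[c], gamma)) for p in points]
        if new == labels:                 # assignments unchanged: converged
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                   # an empty cluster keeps its old prototype
                protos[c] = (
                    tuple(fmean(col) for col in zip(*(m[0] for m in members))),
                    tuple(Counter(col).most_common(1)[0][0]
                          for col in zip(*(m[1] for m in members))),
                )
    cost = sum(dist(p, protos[lab], gamma) for p, lab in zip(points, labels))
    return cost, labels

def best_of(points, k, gamma, n_starts=10):
    """Keep the lowest-cost result over n_starts different seeds."""
    return min((run_once(points, k, gamma, seed) for seed in range(n_starts)),
               key=lambda result: result[0])

# Synthetic data (hypothetical): three numeric blobs with a categorical tag.
rng = random.Random(7)
points = [((rng.gauss(m, 1.0), rng.gauss(m, 1.0)), (tag,))
          for m, tag in ((0, "pre"), (5, "post"), (10, "pre")) for _ in range(30)]

best_cost, best_labels = best_of(points, k=3, gamma=1.0, n_starts=5)
print(best_cost)
```

Different seeds may end in different local minima with different costs; picking the minimum over several starts does not guarantee the global optimum, but it makes a poor final solution less likely.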