Moving to real time segmentation: efficient computation of Geodemographic classification

Adnan, M., Singleton, A.D., Brunsdon, C., Longley, P.A.

Presentation Outline

• Need for real time Geodemographics
• What are real time Geodemographics?
• Computational challenges
• Clustering Algorithms
– K-means
– CLARA and GA (Genetic Algorithm)

• Comparison of Clustering Algorithms

• Web based clustering tool demo

Need for real time Geodemographics

• Current classifications are created using static data sources.

• Rate and scale of current population change is making large surveys (census) increasingly redundant.

• Significant hidden value in transactional data.

• Data is increasingly available in near real time, e.g. ONS NESS API.

• Application specific (bespoke) classifications have demonstrated utility.

What are real time Geodemographics?

Computational challenges

• Integration of large and possibly disparate databases.

• Data normalisation and optimization for fast transactions.

• Minimizing the computational time of clustering algorithms (very important).

Some Clustering algorithms

• K-Means
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)
• K-Means++
• Fuzzy Clustering Algorithms

This paper: K-means, CLARA, and GA.

K-means

• Attempts to find cluster centroids by minimising the within-cluster sum of squared distances.

• K-means is unstable because its result depends on the initial seed assignment.

• Creating a Geodemographic classification requires running the algorithm multiple times.

• This is computationally expensive in a real time environment.
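The multiple-run approach described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used for the results in this presentation: the function names (`kmeans_once`, `kmeans_restarts`), the default of 100 runs, and the rule of keeping the run with the lowest within-cluster sum of squares are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run started from k randomly chosen seed points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return wss, labels, centroids


def kmeans_restarts(points, k, n_runs=100, seed=0):
    """Repeat k-means with different seeds; keep the lowest-WSS run."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[0])
```

Every restart repeats the full assignment and update loop, so the cost grows linearly with the number of runs. This is why repeated k-means becomes a concern in a real time setting.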

K-means (100 runs of k-means on OAC data set for k=5)

An example of a bad clustering result

PAM, CLARA and Genetic Algorithm

• PAM (Partitioning Around Medoids) tries to minimize the sum of distances of the objects to their cluster medoids (the representative objects at the cluster centers).

• CLARA draws multiple samples of the dataset, applies PAM to each sample and returns the best result.

• GA (Genetic Algorithm) is inspired by models of biological evolution. It produces results through a breeding procedure.
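The sampling idea behind CLARA can be sketched as below. This is a simplified illustration under stated assumptions, not a reference implementation: `pam` here is a naive swap search, and the parameter names and defaults (`n_samples`, `sample_size`) are hypothetical. Each sample's medoids are scored against the full dataset, and the cheapest set is returned.

```python
import random


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, rng):
    """Naive PAM-style search: swap medoids while the cost keeps dropping."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    return medoids


def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on random samples; keep the medoids that fit all points best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # scored on the full dataset
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

Because the expensive swap search only ever sees `sample_size` points, its cost is largely independent of the full dataset size. The trade-off is that a good medoid may be missing from every sample.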

Comparing computational efficiency…

K-means, PAM (CLARA), and GA on the three geographic aggregations of a dataset covering London.

Figure 1: OA (Output Area) level results

Figure 2: LSOA (Lower Super Output Area) level results

Figure 3: Ward level results

[Plots: time (s) against number of clusters for K-Means, CLARA (PAM), and GA.]

Comparing classification optimisation efficiency

Figure 4: OA (Output Area) level results

Figure 5: LSOA (Lower Super Output Area) level results

Figure 6: Ward level results

[Plots: average silhouette width against number of clusters for K-Means, CLARA (PAM), and GA.]
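The quality measure on the vertical axis of these plots, average silhouette width, can be computed as in the sketch below. It is a plain Python illustration of the standard definition. The function name and the convention of scoring singleton clusters as 0 are assumptions for the example.

```python
def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_silhouette_width(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        largest = max(a, b)
        widths.append((b - a) / largest if largest > 0 else 0.0)
    return sum(widths) / len(widths)
```

Values near 1 indicate compact, well separated clusters, while values near 0 or below indicate overlapping or misassigned ones. The measure compares all pairs of points, so it is itself costly on large datasets.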

Algorithm Stability (w.r.t. computational time)

Figure 7: Running k-means on OA (Output Area) 120 times on each iteration

Figure 8: Running CLARA on OA (Output Area) 120 times on each iteration

Figure 9: Running GA on OA (Output Area) 120 times on each iteration

Some Outcomes

For larger datasets:
• Computational (time) efficiency => PAM
• Classification (better clustering) efficiency => Genetic Clustering

For smaller datasets:
• Computational (time) efficiency => K-Means
• Classification (better clustering) efficiency => PAM

K-means and Principal Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions.

(Ding, C., He, X., 2004)

Figure 10: K-means result for 41 “OAC variables”

Figure 11: K-means result for 26 “OAC Principal Components”


[Plot: time (s) against number of clusters for Kmeans and PCA_Kmeans.]
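Reducing dimensions before clustering, as on this slide, can be sketched as below. The example finds the leading principal components by power iteration with deflation, which is a technique chosen here only to keep the code dependency-free. In practice a library routine would normally be used. All function names and the iteration count are assumptions.

```python
import random


def mean_center(rows):
    """Subtract each column's mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centred):
    """Sample covariance matrix of mean-centred rows."""
    n, d = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, rng, n_iter=500):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    d = len(cov)
    v = [rng.gauss(0, 1) for _ in range(d)]  # random start vector
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # no variance left to explain
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                 for i in range(d))
    return v, eigval


def pca_reduce(rows, n_components, seed=0):
    """Project rows onto their first n_components principal components."""
    rng = random.Random(seed)
    centred = mean_center(rows)
    cov = covariance(centred)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v, lam = top_component(cov, rng)
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centred]
```

The reduced rows can then be passed to k-means in place of the original variables. Each distance calculation touches fewer columns, which is the source of the time saving.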

Conclusion and Future work

• CLARA and GA are plausible alternatives to k-means in a real time Geodemographic classification system.

• K-means might be combined with PCA for improved computational performance.

• In an online environment k-means is better for small data sets.

• In a real time geodemographic classification system, a clustering algorithm can be chosen at run time.
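Choosing an algorithm at run time could look like the sketch below. The mapping follows the "Some Outcomes" slide, but the function name, the `priority` argument, and especially the `large_threshold` cut-off are hypothetical. The presentation does not state where "small" ends and "large" begins.

```python
def choose_algorithm(n_areas, priority, large_threshold=10_000):
    """Pick a clustering algorithm name from dataset size and priority.

    large_threshold is a hypothetical cut-off, not a value from the slides.
    """
    if priority not in ("speed", "quality"):
        raise ValueError("priority must be 'speed' or 'quality'")
    is_large = n_areas >= large_threshold
    outcomes = {
        (True, "speed"): "CLARA (PAM)",
        (True, "quality"): "GA",
        (False, "speed"): "K-means",
        (False, "quality"): "PAM",
    }
    return outcomes[(is_large, priority)]
```

For example, `choose_algorithm(600, "speed")` returns `"K-means"` under the assumed threshold. A real system would need to calibrate the cut-off against measured run times.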

Web based clustering tool demo


Technology