Moving to real time segmentation: efficient computation of geodemographic classification
DESCRIPTION
Moving to real time segmentation: efficient computation of geodemographic classification, GISRUK 2009, University of Durham, Durham. M Adnan, A D Singleton, C Brunsdon and P A Longley
TRANSCRIPT
Moving to real time segmentation: efficient computation of Geodemographic classification
Adnan, M., Singleton, A.D., Brunsdon, C., Longley, P.A.
Presentation Outline
• Need for real time Geodemographics.
• What are real time Geodemographics?
• Computational challenges
• Clustering Algorithms
– K-means
– CLARA and GA (Genetic Algorithm)
• Comparison of Clustering Algorithms
• Web based clustering tool demo
Need for real time Geodemographics
• Current classifications are created using static data sources.
• Rate and scale of current population change are making large surveys (census) increasingly redundant.
• Significant hidden value in transactional data.
• Data is increasingly available in near real time, e.g. ONS NESS API.
• Application specific (bespoke) classifications have demonstrated utility.
What are real time Geodemographics?
Computational challenges
• Integration of large and possibly disparate databases.
• Data normalisation and optimization for fast transactions.
• Minimizing the computational time of clustering algorithms (very important!).
Some Clustering algorithms
• K-Means
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)
• K-Means++
• Fuzzy Clustering Algorithms
This paper: K-means, CLARA, and GA.
K-means
• Attempts to find cluster centroids that minimise the within-cluster sum of squared distances.
• K-means is unstable because its result depends on the initial seed assignment.
• Creating a Geodemographic classification therefore requires running the algorithm multiple times.
• This is computationally expensive in a real time environment.
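The sensitivity to initial seeds can be illustrated with a minimal sketch. It assumes plain Lloyd's k-means written in NumPy and a synthetic stand-in dataset (not the OAC data); the function names `kmeans` and `best_of_runs` are hypothetical. Each run starts from different random seeds, and the run with the lowest within-cluster sum of squares is kept.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=100):
    """Plain Lloyd's k-means; returns (labels, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    # Initial seeds: k distinct rows chosen at random (the source of instability).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and objective value for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = d2[np.arange(len(X)), labels].sum()
    return labels, wss

def best_of_runs(X, k, n_runs=100):
    """Run k-means n_runs times with different seeds and keep the best run."""
    results = [kmeans(X, k, seed) for seed in range(n_runs)]
    wss_values = np.array([w for _, w in results])
    best = int(wss_values.argmin())
    return results[best][0], wss_values

# Synthetic stand-in data: five Gaussian blobs in two dimensions (an assumption).
rng = np.random.default_rng(0)
centres = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in centres])

labels, wss_values = best_of_runs(X, k=5, n_runs=20)
print("lowest and highest objective over the runs:", wss_values.min(), wss_values.max())
```

The spread between the lowest and highest objective values shows how much a single run can differ from the best one, which is why repeated runs are needed and why the cost grows with the number of runs.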
K-means (100 runs of k-means on OAC data set for k=5)
An example of bad clustering result
PAM, CLARA and Genetic Algorithm
• PAM (Partitioning Around Medoids) tries to minimise the sum of distances from the objects to their cluster centres (medoids).
• CLARA draws multiple samples of the dataset, applies PAM to each sample and returns the best result.
• GA (Genetic Algorithm) is inspired by models of biological evolution and produces results through a breeding procedure.
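The relationship between PAM and CLARA described above can be sketched as follows. This is a simplified illustration, not the implementation used for the results: the PAM step starts from random medoids rather than the BUILD phase and accepts any improving swap, and the sample count, sample size, function names and synthetic data are all assumptions.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def pam(X, k, rng):
    """Swap-based k-medoids on a small dataset; returns medoid row indices.

    Assumption: random initial medoids, and any improving swap is accepted.
    """
    n = len(X)
    D = pairwise_dist(X, X)
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                # Try replacing medoid i with non-medoid h.
                trial = medoids.copy()
                trial[i] = h
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost - 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
    return np.array(medoids)

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    """Apply PAM to several random samples; keep the medoids that score best on all of X."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoid_rows = idx[pam(X[idx], k, rng)]  # map sample indices back to X
        # Score the sample's medoids against the full dataset.
        cost = pairwise_dist(X, X[medoid_rows]).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoid_rows
    labels = pairwise_dist(X, X[best_medoids]).argmin(axis=1)
    return best_medoids, labels, best_cost

# Synthetic stand-in data: three Gaussian blobs (an assumption, not the London data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(100, 2)) for c in [(0, 0), (5, 0), (0, 5)]])
medoids, labels, cost = clara(X, k=3)
print("medoid rows:", medoids, "total distance to medoids:", cost)
```

Because PAM only ever sees a small sample, its pairwise distance matrix stays small however large the full dataset is; only the scoring step touches every row.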
Comparing computational efficiency…
K-means, PAM, and GA on the three geographic aggregations of a dataset covering London.
Figure 1: OA (Output Area) level results
[Chart: time (s) against no. of clusters; series: K-Means, PAM, GA]
Figure 2: LSOA (Lower Super Output Area) level results
Figure 3: Ward level results
[Charts: time (s) against no. of clusters; series: K-Means, CLARA (PAM), GA]
Comparing classification optimisation efficiency
Figure 4: OA (Output Area) level results
Figure 5: LSOA (Lower Super Output Area) level results
Figure 6: Ward level results
[Charts: average silhouette width against no. of clusters; series: K-Means, CLARA (PAM), GA]
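The quality measure plotted in Figures 4 to 6 is the average silhouette width. As a reminder of how it is computed, here is a minimal NumPy sketch assuming Euclidean distance and the usual convention that a point in a singleton cluster scores 0; the function name and the four-point example are illustrative only. For each point, a is its mean distance to the other members of its own cluster, b is the lowest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b).

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points, Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: s(i) stays 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the distance from the point to itself is 0, so it drops out of the sum).
        a = D[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Four points forming two tight, well separated pairs.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = average_silhouette_width(points, [0, 0, 1, 1])  # pairs kept together
bad = average_silhouette_width(points, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

Values near 1 indicate compact, well separated clusters, and negative values indicate points that sit closer to another cluster than to their own.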
Algorithm Stability (w.r.t. Computational time)
Figure 7: Running k-means on OA (Output Area) 120 times on each iteration
Figure 8: Running CLARA on OA (Output Area) 120 times on each iteration
Figure 9: Running GA on OA (Output Area) 120 times on each iteration
Some Outcomes
For Larger datasets:
• Computational (Time) Efficiency => PAM
• Classification (Better Clustering) Efficiency => Genetic Clustering
For Smaller datasets:
• Computational (Time) Efficiency => K-Means
• Classification (Better Clustering) Efficiency => PAM
K-means and Principal Component Analysis
• PCA can be used to facilitate K-means clustering by reducing dimensions. (Ding, C., He, X., 2004)
Figure 10: K-means result for 41 “OAC variables”
Figure 11: K-means result for 26 “OAC Principal Components”
[Chart: time (s) against no. of clusters; series: Kmeans, PCA_Kmeans]
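The idea of clustering on principal components instead of the raw variables can be sketched as follows. This is an illustration under stated assumptions: PCA is done with an SVD of the standardised data, the k-means step is a compact Lloyd's iteration, the data are synthetic with 41 columns to mirror the variable count above, and the function names are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project standardised X onto its first n_components principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()
    return Z @ Vt[:n_components].T, explained[:n_components].sum()

def kmeans_labels(X, k, seed=0, n_iter=50):
    """Compact Lloyd's k-means returning cluster labels only."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # Final assignment against the last set of centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Synthetic stand-in: 500 areas, 41 correlated variables (an assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
X = latent @ rng.normal(size=(8, 41)) + 0.3 * rng.normal(size=(500, 41))

scores, share = pca_scores(X, n_components=26)
labels = kmeans_labels(scores, k=5)
print("reduced shape:", scores.shape, "variance share retained:", share)
```

Each k-means distance computation now runs over 26 columns rather than 41, which is where any time saving would come from; the one-off cost of the SVD has to be weighed against it.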
Conclusion and Future work
• CLARA and GA are plausible alternatives to k-means in a real time Geodemographic classification system.
• K-means might be combined with PCA to reduce computation time.
• In an online environment k-means is better for small data sets.
• In a real time geodemographic classification system, a clustering algorithm can be chosen at run time.
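Choosing the algorithm at run time could be as simple as a lookup driven by dataset size and by whether speed or cluster quality matters more. The sketch below encodes the pairings from the "Some Outcomes" slide; the row threshold separating "smaller" from "larger" datasets, the `Priority` enum and the function name are hypothetical and would need calibrating against real timings.

```python
from enum import Enum

class Priority(Enum):
    SPEED = "speed"      # computational (time) efficiency
    QUALITY = "quality"  # classification (better clustering) efficiency

# Hypothetical cut-off between "smaller" and "larger" datasets; not from the slides.
LARGE_DATASET_ROWS = 10_000

def choose_algorithm(n_rows, priority):
    """Return the algorithm name suggested by the outcomes for this dataset size."""
    if n_rows >= LARGE_DATASET_ROWS:
        return "PAM" if priority is Priority.SPEED else "GA"
    return "K-Means" if priority is Priority.SPEED else "PAM"

print(choose_algorithm(500, Priority.SPEED))
print(choose_algorithm(50_000, Priority.QUALITY))
```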
Web based clustering tool demo