mining regional knowledge in spatial dataset
Data Mining & Machine Learning Group CS@UHADMA09
Rachsuda Jiamthapthaksin, Christoph F. Eick and Ricardo Vilalta
University of Houston, Texas, USA
A Framework for Multi-objective Clustering and Its Application to Co-location Mining
Beijing, China August 17, 2009
Data Mining & Machine Learning Group CS@UHADMA, Beijing 09
Jiamthapthaksin, Eick, Vilalta
Talk Outline
1. What is unique about this work with respect to clustering?
2. Multi-objective Clustering (MOC): Objectives and an Architecture
3. Clustering with Plug-in Fitness Functions
4. Filling the Repository with Clusters
5. Creating Final Clusterings
6. Related Work
7. Co-location Mining Case Study
8. Conclusion and Future Work
1. What is unique about this work with respect to clustering?
Clustering algorithms that support plug-in fitness functions are used.
Clustering algorithms are run multiple times to create clusters.
Clusters are stored in a repository that is updated on the fly; cluster generation is separated from creating the final clustering.
The final clustering is created from the clusters in the repository based on user preferences.
Our approach needs to search for alternative, overlapping clusters.
2. Multi-Objective Clustering (MOC)
The particular problem investigated in this work:
Input: a spatial dataset and a set of objectives.
Task: find sets of clusters that are good with respect to two or more objectives.
Dataset: (longitude, latitude, <concentrations>+)
[Figure: multi-objective clustering applied to a Texas dataset.]
Survey MOC Approach
Clustering algorithms are run multiple times, maximizing different subsets of objectives that are captured in compound fitness functions.
A repository is used to store promising candidates.
Only clusters that satisfy two or more objectives are considered as candidates.
After a sufficient number of clusters has been created, final clusterings are generated based on user preferences.
An Architecture for MOC
[Figure: MOC architecture with four components (Goal-driven Fitness Function Generator, Clustering Algorithm, Storage Unit, Cluster Summarization Unit) operating on a spatial dataset; data labels shown: Q', X, M, M'.]
Steps in multi-run clustering:
S1: Generate a compound fitness function.
S2: Run a clustering algorithm.
S3: Update the cluster repository M.
S4: Summarize the discovered clusters into M'.
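The four steps S1 to S4 can be sketched as a loop. This is a minimal Python sketch, not the authors' implementation: `run_clustering`, `update_repository` and `summarize` are hypothetical callables standing in for the clustering algorithm, the storage unit and the summarization unit, and the compound fitness is simplified to a plain sum over the chosen subset of objectives (no penalty term).

```python
from itertools import combinations

def multi_run_clustering(dataset, objectives, run_clustering,
                         update_repository, summarize, subset_size=2):
    """Sketch of steps S1-S4; all callables are hypothetical placeholders."""
    repository = []  # M: clusters kept across runs
    for subset in combinations(objectives, subset_size):
        # S1: build a compound fitness function for the subset Q'
        # (assumption: a plain sum of the single-objective rewards)
        def compound_fitness(cluster, subset=subset):
            return sum(q(cluster) for q in subset)
        # S2: one run of a plug-in clustering algorithm yields clusters X
        new_clusters = run_clustering(dataset, compound_fitness)
        # S3: merge X into the repository M
        repository = update_repository(repository, new_clusters)
    # S4: summarize M into M'
    return summarize(repository)
```

Passing the three units as callables keeps cluster generation separate from the creation of the final clustering, which mirrors the separation described in the slides.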
3. Clustering with Plug-in Fitness Functions
Motivation:
Finding subgroups in geo-referenced datasets has many applications.
However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation.
Domain or task knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup.
Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions that allow domain experts to express desirable characteristics of subgroups they are looking for.
Current Suite of Spatial Clustering Algorithms
Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
Grid-based: SCMRG
Agglomerative: MOSAIC
Density-based: SCDE, DCONTOUR (not really plug-in but some fitness functions can be simulated)
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.
4. Filling the Repository with Clusters
Plug-in reward functions $\mathrm{Reward}_q(x)$ are used to assess to which extent an objective q is satisfied for a cluster x.
User-defined thresholds $\theta_q$ are used to determine whether an objective q is satisfied by a cluster x ($\mathrm{Reward}_q(x) > \theta_q$).
Only clusters that satisfy two or more objectives are stored in the repository.
Only non-dominated clusters are stored in the repository.
Dominance relations only apply to pairs of clusters that have a certain degree of agreement (overlap) $\theta_{sim}$.
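The admission rules above can be illustrated with a short Python sketch. Several parts are assumptions, since the transcript does not give the formal definitions: dominance is taken to be the usual Pareto-style relation (no worse on every objective, strictly better on at least one), overlap is measured with Jaccard similarity, and all function names are hypothetical.

```python
def satisfied_objectives(cluster, rewards, thresholds):
    """Names of objectives q with Reward_q(cluster) > theta_q."""
    return {q for q, reward in rewards.items()
            if reward(cluster) > thresholds[q]}

def is_candidate(cluster, rewards, thresholds):
    """A cluster qualifies for the repository if it satisfies >= 2 objectives."""
    return len(satisfied_objectives(cluster, rewards, thresholds)) >= 2

def jaccard(x, y):
    """Overlap of two clusters given as sets of object ids (assumed measure)."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def dominates(x, y, rewards, theta_sim):
    """Assumed Pareto-style dominance, applied only to sufficiently similar clusters."""
    if jaccard(x, y) < theta_sim:
        return False  # dissimilar clusters are never compared
    rx = [reward(x) for reward in rewards.values()]
    ry = [reward(y) for reward in rewards.values()]
    no_worse = all(a >= b for a, b in zip(rx, ry))
    strictly_better = any(a > b for a, b in zip(rx, ry))
    return no_worse and strictly_better
```

Restricting dominance to overlapping clusters is what lets the repository keep alternative clusters in different parts of the dataset even when one has lower rewards than the other.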
Dominance and Multi-Objective Clusters
Dominance between clusters x and y with respect to multiple objectives Q.
Dominance Constraint with Respect to the Repository
Compound Fitness Functions
The goal-driven fitness function generator selects a subset $Q' \subseteq Q$ of the objectives Q and creates a compound fitness function $q_{Q'}$ relying on a penalty function approach [Baeck et al. 2000].
$$\mathrm{CmpReward}(x) = \Big(\sum_{q \in Q'} \mathrm{Reward}_q(x)\Big) \cdot \mathrm{Penalty}(Q', x)$$
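A compound reward of this shape can be sketched in Python as follows. The slides do not define the penalty function, so the form used here is an assumption: the penalty is 1 when the cluster satisfies every objective in Q' and a fixed factor below 1 otherwise. The name `penalty_factor` and its default are hypothetical.

```python
def compound_reward(cluster, rewards, thresholds, penalty_factor=0.5):
    """Sum of single-objective rewards over Q', scaled by an assumed penalty.

    rewards:    dict mapping objective name -> reward function (the subset Q')
    thresholds: dict mapping objective name -> threshold theta_q
    """
    total = sum(reward(cluster) for reward in rewards.values())
    # Assumed penalty: no reduction if all objectives in Q' are satisfied
    all_met = all(rewards[q](cluster) > thresholds[q] for q in rewards)
    penalty = 1.0 if all_met else penalty_factor
    return total * penalty
```

With this form, a cluster that is strong on one objective but misses another scores lower than one that clears every threshold, which steers the search toward multi-objective clusters.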
Updating the Cluster Repository
M := clusters in the repository
X := "new" clusters generated by a single run of the clustering algorithm
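The update procedure itself is not captured in this transcript. The following Python sketch shows one plausible way to merge X into M while keeping only non-dominated candidates; the control flow is an assumption, and `is_candidate` and `dominates` are hypothetical callables supplied by the caller.

```python
def update_repository(M, X, is_candidate, dominates):
    """Return a new repository after merging the new clusters X into M.

    is_candidate(x): True if x satisfies enough objectives to be stored
    dominates(a, b): True if cluster a dominates cluster b
    """
    repo = list(M)
    for x in X:
        if not is_candidate(x):
            continue  # too few objectives satisfied
        if any(dominates(m, x) for m in repo):
            continue  # x is dominated by a stored cluster
        # x enters the repository and evicts every cluster it dominates
        repo = [m for m in repo if not dominates(x, m)]
        repo.append(x)
    return repo
```

Because the function returns a new list, it can be called after every clustering run, which matches the idea of a repository that is updated on the fly.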
5. Creating a Final Clustering
Final clusterings are subsets of the clusters in the repository M.
Inputs: The user provides her own individual objective function $\mathrm{Reward}_U$, a reward threshold $\theta_U$, and a cluster similarity threshold $\theta_{rem}$ that indicates how much cluster overlap she is willing to tolerate.
Goal: Find $X \subseteq M$ that maximizes
$$q(X) = \sum_{c \in X} \mathrm{Reward}_U(c)$$
subject to:
1. $\forall x \in X\ \forall x' \in X\ (x \neq x' \Rightarrow \mathrm{Similarity}(x,x') < \theta_{rem})$
2. $\forall x \in X\ (\mathrm{Reward}_U(x) > \theta_U)$
Our paper introduces the MO-Dominance-guided Cluster Reduction algorithm (MO-DCR) to create the final clustering.
MO-Dominance-guided Cluster Reduction (MO-DCR) Algorithm
The algorithm loops over the following 2 steps until M is empty:
1. Include dominant clusters D, which are the highest-reward clusters in M'.
2. Remove D and their dominated clusters in the $\theta_{rem}$-proximity from M.
[Figure: dominance graphs over clusters A to F in M', distinguishing dominant from dominated clusters; labels shown: sim(A,B)=0.8, 0.7, 0.6, $\theta_{rem}$=0.5.]
Remark: $A \rightarrow B \Leftrightarrow \mathrm{Reward}_U(A) > \mathrm{Reward}_U(B) \wedge \mathrm{Similarity}(A,B) > \theta_{rem}$
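The two-step loop can be sketched in Python as a greedy reduction. This is a simplified sketch under assumptions, not the published MO-DCR code: it picks one highest-reward cluster per iteration rather than a set D, it removes every remaining cluster whose similarity to the pick exceeds the removal threshold (including ties in reward), and the parameter names are hypothetical.

```python
def mo_dcr(M, reward_u, similarity, theta_u, theta_rem):
    """Greedy sketch of dominance-guided cluster reduction.

    M:          clusters in the repository
    reward_u:   user objective, cluster -> float
    similarity: (cluster, cluster) -> float in [0, 1]
    """
    # Keep only clusters that clear the user's reward threshold
    remaining = [c for c in M if reward_u(c) > theta_u]
    final = []
    while remaining:
        # Step 1: the highest-reward cluster is dominant
        best = max(remaining, key=reward_u)
        final.append(best)
        # Step 2: drop it and everything within its removal proximity
        remaining = [c for c in remaining
                     if c != best and similarity(best, c) <= theta_rem]
    return final
```

As a usage example with invented rewards and similarities for six clusters A to F, the call below keeps A and E:

```python
rewards = {"A": 10, "B": 8, "C": 6, "D": 3, "E": 7, "F": 5}
sims = {frozenset("AB"): 0.8, frozenset("AC"): 0.7,
        frozenset("AD"): 0.6, frozenset("EF"): 0.9}
result = mo_dcr(list(rewards), rewards.get,
                lambda a, b: sims.get(frozenset((a, b)), 0.0),
                theta_u=0, theta_rem=0.5)
```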
6. Related Work
Multi-objective clustering based on evolutionary algorithms (MOEA): VIENNA [Handl and Knowles 2004], MOCLE [Faceli et al. 2007].
In comparison, MOC relies on clustering algorithms with plug-in fitness functions and multi-run clustering that explores different combinations of fitness objectives.
Moreover, MOC relies on cluster repositories that store individual clusters rather than clusterings, and on summarization algorithms to create the final clustering.
7. Case Study: Co-location Mining
Goal: Finding regional co-location patterns where high concentrations of Arsenic are co-located with a lot of other factors in Texas.
Remark: Each binary co-location is treated as a single objective.
Dataset:
TWDB has monitored water quality and collected data for 105,814 wells in Texas over the last 25 years.
We use a subset of the Arsenic_10_avg data set: longitude and latitude, Arsenic (As), Molybdenum (Mo), Vanadium (V), Boron (B), Fluoride (F-), Chloride (Cl-), Sulfate (SO4^2-) and Total Dissolved Solids (TDS).
Objective Functions Used
$$\mathrm{Reward}_B(x) = \varphi(B,x) \cdot |x|^{\beta}$$
$Q = \{q_{\{As,Mo\}}, q_{\{As,V\}}, q_{\{As,B\}}, q_{\{As,F^-\}}, q_{\{As,Cl^-\}}, q_{\{As,SO_4^{2-}\}}, q_{\{As,TDS\}}\}$
$Q' \subseteq Q$
Steps of the Experiment
[Figure: experiment workflow. A spatial dataset and fitness functions (Q) feed MOC Steps 1-3, which produce regions (M); user queries drive MOC Step 4, which returns regions M' ($\subseteq$ M) with associated co-location patterns.]
Step 1-3: use CLEVER with all pairs of the 7 different objective functions: $q_{\{As,Mo\}}$, $q_{\{As,V\}}$, $q_{\{As,B\}}$, $q_{\{As,F^-\}}$, $q_{\{As,Cl^-\}}$, $q_{\{As,SO_4^{2-}\}}$, $q_{\{As,TDS\}}$.
Step 4: query the clusters in the repository separately with each of the given single-objective functions, using the removal threshold $\theta_{rem} = 0.1$ and the following user-defined reward thresholds (7 final clusterings): $\theta_{q_{\{As,Mo\}}} = 13$, $\theta_{q_{\{As,V\}}} = 15$, $\theta_{q_{\{As,B\}}} = 10$, $\theta_{q_{\{As,F^-\}}} = 25$, $\theta_{q_{\{As,Cl^-\}}} = 7$, $\theta_{q_{\{As,SO_4^{2-}\}}} = 6$, $\theta_{q_{\{As,TDS\}}} = 8$.
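To make the experimental setup concrete, the Python sketch below enumerates the objective pairs used in Steps 1-3 and records the Step 4 reward thresholds. The thresholds are the values listed above; representing each objective as an `("As", element)` tuple is only an illustrative choice.

```python
from itertools import combinations

# The seven binary co-location objectives, each pairing Arsenic with one element
elements = ["Mo", "V", "B", "F-", "Cl-", "SO4^2-", "TDS"]
objectives = [("As", element) for element in elements]

# Steps 1-3: one clustering run per unordered pair of objectives
objective_pairs = list(combinations(objectives, 2))

# Step 4: user-defined reward thresholds, one final clustering per objective
reward_thresholds = {
    ("As", "Mo"): 13, ("As", "V"): 15, ("As", "B"): 10, ("As", "F-"): 25,
    ("As", "Cl-"): 7, ("As", "SO4^2-"): 6, ("As", "TDS"): 8,
}
removal_threshold = 0.1

print(len(objective_pairs))  # number of pairs among the 7 objectives
```

Seven objectives give 7 * 6 / 2 = 21 unordered pairs, so covering all pairs once corresponds to 21 compound fitness functions.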
Experimental Results
MOC is able to identify:
Multi-objective clusters
Alternative clusters, e.g. Rank1 regions of (a) and Rank2 regions of (b)
Nested clusters, e.g. in (b) the Rank3-5 regions are sub-regions of the Rank1 region.
In particular, MOC discriminates among companion elements such as Vanadium (Rank3 region), or Chloride, Sulfate and Total Dissolved Solids (Rank4 region).
Fig. 7.6 The top 5 regions and patterns with respect to two queries, query1={As,Mo} and query2={As,B}, shown in panels (a) and (b), respectively.
8. Conclusion and Future Work
Building blocks for future multi-objective clustering systems were provided in this work; namely:
A dominance relation for problems in which only a subset of the objectives can be satisfied was introduced.
Clustering algorithms with plug-in fitness functions and the capability to create compound fitness functions are extensively used in our approach.
Initially, a repository of potentially useful clusters is generated based on a large set of objectives. Individualized, specific clusterings are then generated based on user preferences.
The approach is highly generic and incorporates specific domain needs in the form of single-objective fitness functions.
The approach was evaluated in a case study and turned out to be more suitable than a single-objective clustering approach that was used for the same application in a previous paper [ACM-GIS 2008].
Challenges in Multi-objective Clustering (MOC)
1. Find clusters that are individually good with respect to multiple objectives in an automated fashion.
2. Provide search-engine-style capabilities to summarize the final clusterings obtained from multiple runs of clustering algorithms.
Traditional Clustering Algorithms & Fitness Functions
1. Traditional clustering algorithms consider only domain independent and task independent characteristics to form a solution.
2. Different domain tasks require different fitness functions.
[Figure: taxonomy of clustering algorithms by type of fitness function (no fitness function, fixed fitness function, implicit fitness function, provides plug-in fitness function); algorithms shown: DBSCAN, Hierarchical Clustering, K-Means, PAM, CHAMELEON and Our Work, with a bracket labeled Traditional Clustering Algorithms.]
Code MO-DCR Algorithm
Challenges Cluster Summarization
[Figure: three panels (Original Clusters, Typical Output, DCR Output) over clusters A, B and C; eliminated clusters are marked with X.]
Interestingness of a Pattern
Interestingness of a pattern B (e.g. B = {C, D, E}) for an object o:
$$i(B,o) = \prod_{p \in B} z(p,o)$$
Interestingness of a pattern B for a region c:
$$\varphi(B,c) = \frac{\sum_{o \in c} i(B,o)}{|c|} \cdot \mathrm{purity}(B,c)^{\tau}$$
Remark: Purity (i(B,o)>0) measures the percentage of objects that exhibit pattern B in region c.
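The two interestingness measures can be sketched in Python. This sketch rests on assumptions that the transcript does not spell out: the object-level interestingness is taken to be the product of non-negative per-attribute scores z(p,o), the region-level value is the mean object interestingness scaled by purity raised to a power `tau`, and all function and parameter names are hypothetical.

```python
from math import prod

def object_interestingness(pattern, z_values):
    """i(B,o): product of z(p,o) over the attributes p in pattern B.

    z_values maps attribute name -> non-negative score for one object o.
    """
    return prod(z_values[p] for p in pattern)

def purity(pattern, region):
    """Fraction of objects in the region with i(B,o) > 0."""
    hits = sum(1 for o in region if object_interestingness(pattern, o) > 0)
    return hits / len(region)

def region_interestingness(pattern, region, tau=1.0):
    """Mean object interestingness scaled by purity**tau (assumed form)."""
    mean_i = sum(object_interestingness(pattern, o) for o in region) / len(region)
    return mean_i * purity(pattern, region) ** tau

# Made-up scores for a region with four wells
region = [
    {"As": 2.0, "Mo": 1.5},
    {"As": 0.0, "Mo": 3.0},
    {"As": 1.0, "Mo": 1.0},
    {"As": 0.0, "Mo": 0.0},
]
print(region_interestingness(("As", "Mo"), region))  # 0.5
```

In the example, two of the four wells exhibit the pattern, so purity is 0.5 and the mean object interestingness of 1.0 is halved.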
Characteristics of the Top 5 Regions

Table 7.7 Top 5 Regions Ranked by Reward of the Query {As,Mo}
Rank  Region Id  Size  Reward    Interestingness
1     98         184   3,741.03  1.49
2     93         162   423.62    0.20
3     30         165   41.89     0.01
4     220        8     27.99     1.23
5     74         122   20.19     0.01

Table 7.8 Top 5 Regions Ranked by Reward of the Query {As,B}
Rank  Region Id  Size  Reward    Interestingness
1     27         147   1,828.2   1.03
2     122        179   350.95    0.15
3     25         11    51.09     1.40
4     138        5     40.02     3.58
5     178        6     10.88     0.74
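As a sanity check on how the Reward and Interestingness columns relate, the Python sketch below tests whether reward is approximately interestingness times size raised to a fixed power. The exponent 1.5 is an inference from the tabulated numbers, not a value stated in the slides, and rows whose interestingness is rounded to 0.01 are skipped because rounding dominates there.

```python
# (size, reward, interestingness) rows copied from Tables 7.7 and 7.8
rows = [
    (184, 3741.03, 1.49), (162, 423.62, 0.20), (165, 41.89, 0.01),
    (8, 27.99, 1.23), (122, 20.19, 0.01),
    (147, 1828.2, 1.03), (179, 350.95, 0.15), (11, 51.09, 1.40),
    (5, 40.02, 3.58), (6, 10.88, 0.74),
]

BETA = 1.5  # assumed exponent, inferred from the table values

def relative_error(size, reward, interestingness, beta=BETA):
    """Relative gap between the listed reward and interestingness * size**beta."""
    predicted = interestingness * size ** beta
    return abs(predicted - reward) / reward

# Only rows with enough significant digits in the interestingness column
errors = [relative_error(*row) for row in rows if row[2] >= 0.1]
print(max(errors))
```

For the eight rows with interestingness of at least 0.1, the listed rewards agree with this relation to within a few percent, which is consistent with rounding in the Interestingness column.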
Representative-based Clustering
[Figure: objects plotted by Attribute1 and Attribute2, grouped into four clusters labeled 1 to 4.]
Objective: Find a set of objects $O_R$ such that the clustering X obtained by using the objects in $O_R$ as representatives minimizes q(X).
Properties: Cluster shapes are convex polygons.
Popular algorithms: K-means, K-medoids.
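How a set of representatives induces a clustering can be shown with a short Python sketch: each object joins the cluster of its nearest representative. Euclidean distance and the function name are assumptions made for illustration.

```python
from math import dist

def assign_to_representatives(objects, representatives):
    """Map each representative index to the objects closest to it."""
    clusters = {i: [] for i in range(len(representatives))}
    for obj in objects:
        # Index of the nearest representative under Euclidean distance
        nearest = min(range(len(representatives)),
                      key=lambda i: dist(obj, representatives[i]))
        clusters[nearest].append(obj)
    return clusters

objects = [(0, 0), (1, 0), (10, 10), (11, 10)]
representatives = [(0, 0), (10, 10)]
print(assign_to_representatives(objects, representatives))
# {0: [(0, 0), (1, 0)], 1: [(10, 10), (11, 10)]}
```

Nearest-representative assignment partitions the space into Voronoi cells, which is why the resulting cluster shapes are convex polygons.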
5. CLEVER (ClustEring using representatiVEs and Randomized hill climbing)
Is a representative-based (sometimes called prototype-based) clustering algorithm.
Uses a variable number of clusters and larger neighborhood sizes to battle premature termination, and randomized hill climbing and adaptive sampling to reduce complexity.
Searches for the optimal number of clusters.
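The general idea of randomized hill climbing over sets of representatives can be sketched in Python. This is a simplified illustration, not CLEVER itself: the insert, delete and replace moves, the fixed sample count and the example fitness function are all assumptions, and adaptive sampling and larger neighborhood sizes are not modeled.

```python
import random
from math import dist

def cluster_by_reps(points, reps):
    """Assign each point to its nearest representative; drop empty clusters."""
    clusters = [[] for _ in reps]
    for p in points:
        idx = min(range(len(reps)), key=lambda i: dist(p, reps[i]))
        clusters[idx].append(p)
    return [c for c in clusters if c]

def neighbor(points, reps, rng):
    """Randomly insert, delete or replace one representative."""
    reps = list(reps)
    op = rng.choice(["insert", "delete", "replace"])
    outside = [p for p in points if p not in reps]
    if op == "insert" and outside:
        reps.append(rng.choice(outside))
    elif op == "delete" and len(reps) > 1:
        reps.remove(rng.choice(reps))
    elif outside:
        reps[rng.randrange(len(reps))] = rng.choice(outside)
    return reps

def hill_climb(points, fitness, k0=2, samples=20, seed=0):
    """Maximize a plug-in fitness by sampling neighbors until none improves."""
    rng = random.Random(seed)
    reps = rng.sample(points, k0)
    best = fitness(cluster_by_reps(points, reps))
    improved = True
    while improved:
        improved = False
        for _ in range(samples):
            cand = neighbor(points, reps, rng)
            score = fitness(cluster_by_reps(points, cand))
            if score > best:  # accept the first strictly better neighbor
                reps, best, improved = cand, score, True
                break
    return reps, best

def example_fitness(clusters):
    """Hypothetical plug-in fitness: negative squared error minus 1 per cluster."""
    total = 0.0
    for c in clusters:
        cx = sum(p[0] for p in c) / len(c)
        cy = sum(p[1] for p in c) / len(c)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in c)
    return -total - len(clusters)
```

Because insert and delete moves change the number of representatives, the search also explores the number of clusters, which a fixed-k method such as K-means does not.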