TRANSCRIPT
MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm
using MapReduce
Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, Jianping Fan
INTRODUCTION
This paper mainly focuses on “Parallel Density-based Data Clustering” in a shared-nothing cluster environment.
Data clustering is an essential data mining technique for revealing macroscopic patterns in data. Due to the growing size of datasets, there is a need to develop parallel data clustering algorithms.
In this paper, the authors propose a parallel density-based clustering algorithm and implement it in a 4-stage MapReduce paradigm
• Adopt a quick partitioning strategy for large-scale non-indexed data
• Study the metric of merge among bordering partitions and optimizations
• Evaluate on real large-scale datasets (approx. 1.9 billion GPS log records)
Introduction
Clustering techniques
Pros of DBSCAN
• Divide data into clusters with arbitrary shapes
• Does not require the number of the clusters a priori
• Insensitive to the order of the points in the dataset
Cons of DBSCAN
• The sizes of the datasets are growing so that they cannot be held on a single machine
• Much higher computation complexity compared with K-means
=> PARALLELIZE using MapReduce!! (such a simple idea..)
Background : DBScan
DBSCAN (Martin Ester et al., KDD, 1996)
The key idea of density-based clustering is that for each point of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of points (MinPts)
• Directly density-reachable (DDR): o is DDR from p if o ∈ NEps(p) and Card(NEps(p)) ≥ MinPts.
• Density-reachable (DR): if there is a chain of points {pi | i = 0, ..., n} such that each pi+1 is DDR from pi, then every pt with t ∈ {i + 1, ..., n} is DR from pi. (canonical extension)
• Density-connected (DC): if p is DR from o and q is DR from o, then p is DC with q. (symmetric version)
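These reachability definitions can be sketched directly in code; the following is a minimal brute-force illustration (the helper names `neighborhood` and `is_core` are mine, not from the paper):

```python
import math

def neighborhood(points, p, eps):
    """N_Eps(p): all points within distance Eps of p (including p itself)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """p is a core point if its Eps-neighborhood holds at least MinPts points."""
    return len(neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, o, p, eps, min_pts):
    """o is DDR from p: o lies in N_Eps(p) and p is a core point."""
    return o in neighborhood(points, p, eps) and is_core(points, p, eps, min_pts)
```

Density-reachability is then the transitive closure of DDR, and density-connectedness requires a common point from which both endpoints are density-reachable.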
Background : DBScan
Class of point:
- Unclassified
- Core
- Border
- Noise
Background : MapReduce
Borrows from functional programming. Users should implement two primary methods:
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(k3, v3)
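As a toy illustration of these two signatures (an in-memory simulation, not Hadoop and not the paper's code), a classic word count:

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map: (k1, v1) -> list(k2, v2); here k2 is a word, v2 a count of 1.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: (k2, list(v2)) -> list(k3, v3); here: sum the partial counts.
    return [(word, sum(counts))]

def run_mapreduce(records, map_fn, reduce_fn):
    # The "shuffle" phase: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reduce_fn(k2, values))
    return out
```

The framework handles the shuffle/group-by step; users only write the two functions.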
Design And Implementation
Problem Statement
Given a set of d-dimensional points DB = {p1, p2, ..., pn}, a minimal density of clusters defined by Eps and MinPts, and a set of computers CP = {C1, C2, ..., Cn} managed by a MapReduce platform, find the density-based clusters with respect to the given Eps and MinPts values.
Overall Framework
Stage 1 : Preprocessing
Summarize the spatial distribution, and then generate a grid-based partition.
Main challenges for a partitioning strategy:
1) Load balancing
2) Minimized communication
One of the possible solutions is to build an efficient spatial index
However, the authors do not use well-known indexing methods such as R-Tree or KD-Tree, because recursively iterating to build a hierarchical structure is not practical in the MapReduce paradigm.
The authors use a partition algorithm on MapReduce adapted from the grid file.
Stage 1 : Preprocessing
Steps (the slide figure shows a table of Bucket ID vs. Count):
1. Raw data
2. Bucket counting (in the example, 10 buckets created with interval 0.1)
3. Compute the spatial distribution for each dimension
4. Partitioning (proposed metrics: avg, m…)
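The bucket-counting and average-load partitioning steps can be sketched as follows (a 1-D sketch with hypothetical function names; the paper applies the idea per dimension of a grid):

```python
def bucket_counts(values, n_buckets=10):
    """Histogram of 1-D values in [0, 1) into equal-width buckets
    (10 buckets of interval 0.1, as in the slide's example)."""
    counts = [0] * n_buckets
    for v in values:
        counts[min(int(v * n_buckets), n_buckets - 1)] += 1
    return counts

def partition_by_avg(counts, n_partitions):
    """Greedy split on the histogram: close a partition once its load
    reaches the average load per partition (the 'avg' metric)."""
    target = sum(counts) / n_partitions
    partitions, current, load = [], [], 0
    for bucket_id, c in enumerate(counts):
        current.append(bucket_id)
        load += c
        if load >= target and len(partitions) < n_partitions - 1:
            partitions.append(current)
            current, load = [], 0
    if current:
        partitions.append(current)
    return partitions
```

Counting is a single MapReduce histogram pass, which is why it scales to non-indexed data.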
Stage 1 : Preprocessing
Shape of the partition: the necessity of access to remote data
• For given Eps and MinPts = 5, if there is no support for accessing remote data, then the neighborhood of object p1 would contain only 3 points, which is less than MinPts, and therefore p1 would not be a core point.
• Therefore, to obtain correct clustering results, a “view” over the border of partitions is necessary.
So, the shape of the partition is S + halo
(figure: bordering partitions Si and Si+1, each with an inner halo and an outer halo of width Eps)
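A minimal sketch of the S + halo idea, assuming axis-aligned rectangular partitions (the representation and function names are mine, not the paper's):

```python
def expand(rect, eps):
    """Grow partition S by Eps on every side: the outer halo region
    whose points the node must also load from neighboring partitions."""
    (x_lo, y_lo), (x_hi, y_hi) = rect
    return ((x_lo - eps, y_lo - eps), (x_hi + eps, y_hi + eps))

def points_for_partition(points, rect, eps):
    """All points a node needs for correct local DBSCAN on S:
    the points of S itself plus its Eps-wide outer halo."""
    (x_lo, y_lo), (x_hi, y_hi) = expand(rect, eps)
    return [p for p in points if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]
```

With the halo replicated locally, every neighborhood query for a point inside S is complete, so core/border decisions inside S are exact.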
Stage 2 : Local DBSCAN
The algorithm in Local DBSCAN is very similar to DBSCAN.
The difference: for a non-noise point q in the outer halo, at this point we do not know whether q is a core point or a border point
• (because the computing nodes are in a shared-nothing environment)
Those points are classified with “Onqueue” status and put into the merge candidates set (MC)
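The Onqueue/MC bookkeeping might look like this (a sketch; the label representation is an assumption, not the paper's data structure):

```python
def collect_merge_candidates(labels, halo_ids):
    """After the local DBSCAN pass, non-noise points in the outer halo
    cannot be resolved locally (core vs. border): mark them 'Onqueue'
    and collect them into the merge candidates set MC."""
    mc = set()
    for pid in halo_ids:
        if labels.get(pid) != "Noise":
            labels[pid] = "Onqueue"
            mc.add(pid)
    return mc
```

The MC sets are the only state exchanged between partitions in the later merging stages, which keeps communication proportional to the halo, not the partition.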
Stage 3 : Find Merging Mapping
Characteristics of the MC set:
- The composition of the MC set
- The completeness of the MC set
(figure conditions: q is not in the halo; q is a core point and more than one neighbor is in the halo; o is a core point or border point in the halo)
Stage 3 : Find Merging Mapping
Decide whether merging clusters of adjacent spaces is needed or not
Stage 3 : Find Merging Mapping
Theorem 1: Let MC1(C1, S1) = AP1 ∪ BP1, where AP1 is the set of core points and BP1 is the set of border points w.r.t. space constraint S1, and MC2(C2, S2) = AP2 ∪ BP2, where AP2 is the set of core points and BP2 is the set of border points w.r.t. space constraint S2. If S1 and S2 are bordering …
Stage 4 : Merge
Build Global Mapping -> Merge and Relabel
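The merge-and-relabel step is naturally expressed with union-find over local cluster labels; a sketch (the cluster-id format and function names are assumptions, not the paper's code):

```python
def build_global_mapping(local_clusters, merge_pairs):
    """Union-find: collapse local cluster ids that Stage 3 found mergeable
    into one global representative label each."""
    parent = {c: c for c in local_clusters}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps trees shallow
            c = parent[c]
        return c

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    # Final mapping: local cluster id -> global cluster id.
    return {c: find(c) for c in local_clusters}
```

A final MapReduce pass then rewrites each point's local label through this mapping.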
Evaluation
Experiment environment:
- 13-node cluster
- Each node: 3.0GHz i7 950 (quad-core), 8GB RAM, 2TB HDD
- Ubuntu 10.10
- Hadoop 0.20.2
- Block size: 64MB
Data set: Shanghai taxi GPS logs
Evaluation
Each location point is normalized into the range [0, 1). Two DBSCAN configurations:
WL-1: Eps = 0.002, MinPts = 1,000
WL-2: Eps = 0.0002, MinPts = 100
Evaluation
(figure: WL-1 speedup, SPD = 120, datasets ds1–ds4 on 2 to 12 nodes)
Conclusions
In this paper, we implement an efficient parallel DBSCAN algorithm in a 4-stage MapReduce paradigm. We analyze and propose a practical data partition strategy for large-scale non-indexed spatial data. We apply our work to a real-world spatial dataset, which contains over 1.9 billion raw GPS records, and run our experiments on a lab-size 13-node cluster. The experimental results show that the speedup and scale-up performance are very efficient.

We observe that roadmap-based spatial data is highly skewed along the road network. If a main road happens to lie in the replication area after partitioning, computation and data replication increase dramatically. One of the future works is to improve the partitioning strategy to be aware of this observation and to minimize the size of the MC sets. The challenge is that performance is still highly restricted by the distribution of the raw spatial data.