
Paradyn Project

Paradyn / Dyninst Week
Madison, Wisconsin

April 29-May 3, 2013

Mr. Scan: Efficient Clustering with MRNet and GPUs

Evan Samanas and Ben Welton

Density-based clustering
o Discovers the number of clusters
o Finds oddly-shaped clusters


Goal: Find regions that meet minimum density and spatial distance characteristics

The two parameters that determine whether a point is in a cluster are Epsilon (Eps) and MinPts. If the number of points within Eps of a point is > MinPts, the point is a core point. For every discovered point, this same calculation is performed until the cluster is fully expanded.
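The core-point rule above can be sketched in a few lines of Python. This is only an illustration, not Mr. Scan's code: the function names are hypothetical, the comparison follows the slide's "> MinPts" wording, and it assumes that a point counts as a member of its own Eps neighborhood.

```python
import math

def neighbors_within_eps(points, index, eps):
    """Return indices of all points within distance eps of points[index] (itself included)."""
    px, py = points[index]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(qx - px, qy - py) <= eps]

def is_core_point(points, index, eps, min_pts):
    # Follows the slide's wording: strictly more than MinPts points within Eps.
    # Assumption: the point itself is counted in its own neighborhood.
    return len(neighbors_within_eps(points, index, eps)) > min_pts

# Four points packed together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
core_flags = [is_core_point(points, i, eps=0.5, min_pts=3) for i in range(len(points))]
```

With Eps = 0.5 and MinPts = 3, the four packed points each see four neighbors and are core points; the isolated point is not.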

Clustering Example (DBSCAN[1])

[Figure: example clustering showing the Eps radius, with MinPts: 3]

[1] M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise (1996)

Scaling DBSCAN
o PDBSCAN (1999) [2]
  o Quality equivalent to single-node DBSCAN
  o Linear speedup up to 8 nodes
o DBDC (2004) [3]
  o Sacrifices quality
  o ~30x speedup on 15 nodes
o PDSDBSCAN (2012) [4]
  o Quality equivalent to single-node DBSCAN
  o 5675x speedup on 8192 nodes (72 million points)
o 2 Map/Reduce attempts (2011, 2012)
  o Quality equivalent to single-node DBSCAN
  o 6x speedup on 12 nodes

[2] X. Xu et al., A fast parallel clustering algorithm for large spatial databases (1999)
[3] E. Januzaj et al., DBDC: Density based distributed clustering (2004)
[4] M. Patwary et al., A new scalable parallel DBSCAN algorithm using the disjoint-set data structure (2012)

Challenges of scaling DBSCAN
o Data distribution
  o How do we effectively take an input file and create partitions that can be clustered by DBSCAN?
  o Distributed 2-D partitioner reading from a distributed file system
o Load balancing
  o How to keep variance in clustering times across nodes to a minimum?
  o Dense Box
o Merge
  o How do we reduce the amount of data needed for the merge while keeping accuracy high?
  o Representative points

MRNet – Multicast / Reduction Network
o General-purpose TBON API
o Network: user-defined topology
o Stream: logical data channel
  o to a set of back-ends
  o multicast, gather, and custom reduction
o Packet: collection of data
o Filter: stream data operator
  o synchronization
  o transformation
o Widely adopted by HPC tools
  o CEPBA toolkit
  o Cray ATP & CCDB
  o Open|SpeedShop & CBTF
  o STAT
  o TAU

[Figure: MRNet tree with application back-ends (BE), communication processes (CP), and a filter F(x1,…,xn)]

TBON Computation

Ideal Characteristics:
o Filter output size constant or decreasing
o Computation rate similar across levels
o Adjustable for load balance
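The first characteristic can be illustrated with a toy level-by-level reduction. The sketch below does not use the MRNet API; `tbon_reduce` and `max_filter` are hypothetical names, and the input list is assumed to be non-empty. The filter returns a single value however many packets it receives, so the data volume shrinks at every level of the tree.

```python
def tbon_reduce(leaf_packets, fanout, filter_fn):
    """Reduce packets level by level, as a tree with the given fan-out would."""
    level = list(leaf_packets)
    levels_used = 0
    while len(level) > 1:
        # Each parent applies the filter to the packets of up to `fanout` children.
        level = [filter_fn(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        levels_used += 1
    return level[0], levels_used

def max_filter(packets):
    # Output size is constant (one value) regardless of the number of inputs.
    return max(packets)

backend_values = [3, 9, 4, 7, 1, 8, 2, 6]   # one value per back-end
result, depth = tbon_reduce(backend_values, fanout=2, filter_fn=max_filter)
```

Eight back-ends with a fan-out of 2 need three filter levels before the front-end receives the single result.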

[Figure: TBON computation timing example. Labels: Data Size: 10 MB per BE; Packet Size: ≤10 MB; per-level times ~10 sec, ~40 sec, ~10 sec; Total Time: ~30 sec and ~60 sec]

Intro to Mr. Scan

Mr. Scan Phases
o Partition: Distributed
o DBSCAN: GPU (@ BE)
o Merge: CPU (x #levels)
o Sweep: CPU (x #levels)

[Figure: Mr. Scan tree of FE and BE nodes, with DBSCAN running at the BEs]

Mr. Scan Architecture

Time: 0 to Time: 18.2 Min

Partitioner

Merge &

Clustering 6.5 Billion Points

o FS Read: 224 Secs
o FS Write: 489 Secs
o MRNet Startup: 130 Secs
o FS Read: 24 Secs
o DBSCAN: 168 Secs
o Merge Time: 6 Secs
o Sweep Time: 4 Secs
o Write Output: 19 Secs

Partition Phase
o Goal: Partitions computationally equivalent to DBSCAN
o Algorithm:
  o Form initial partitions
  o Add shadow regions
  o Rebalance
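A rough sketch of the partitioning idea follows. It is deliberately simplified and is not the Mr. Scan partitioner: it cuts along x only, whereas the slides describe a 2-D partitioner; it balances load by splitting into equal-count strips instead of running a separate rebalance pass; and it assumes a shadow region consists of points owned by another partition that lie within Eps of this partition's extent. All names are hypothetical, and the input is assumed to be non-empty.

```python
def partition_with_shadows(points, num_parts, eps):
    """Split points into x-ordered strips of near-equal size, then add shadow points."""
    ordered = sorted(points)                  # sort by x, then y
    size = -(-len(ordered) // num_parts)      # ceiling division
    partitions = []
    # Equal-count strips stand in for both forming partitions and rebalancing.
    for start in range(0, len(ordered), size):
        end = min(start + size, len(ordered))
        lo, hi = ordered[start][0], ordered[end - 1][0]
        # Shadow region: points owned elsewhere whose x lies within eps of this strip.
        shadow = [p for k, p in enumerate(ordered)
                  if not (start <= k < end) and lo - eps <= p[0] <= hi + eps]
        partitions.append({"owned": ordered[start:end], "shadow": shadow})
    return partitions

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
parts = partition_with_shadows(points, num_parts=2, eps=1.0)
```

Each strip owns three points and additionally sees the one neighbor point just across the cut, so a cluster that spans the boundary is visible from both sides.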

Distributed Partitioner

GPU DBSCAN Filter

DBSCAN is performed in two distinct steps

Step 1: Detect Core Points

[Figure: GPU blocks Block 1, Block 2, ..., Block 900, each also shown with thread T1]

Step 2: Expand core points and color
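The two steps can be sketched on the CPU as follows. This is a plain-Python illustration, not the GPU filter itself: the function name is hypothetical, neighbor lists are computed by brute force, the "> MinPts" test follows the earlier slide, and a point is assumed to count as its own neighbor. Step 1 tests every point independently, which is the property that lets it be spread across GPU threads.

```python
import math
from collections import deque

def dbscan_two_step(points, eps, min_pts):
    n = len(points)
    # Brute-force neighbor lists (each point is its own neighbor too).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    # Step 1: detect core points; each test is independent of the others.
    core = [len(nbrs[i]) > min_pts for i in range(n)]
    # Step 2: expand from each uncolored core point and color what it reaches.
    color = [-1] * n              # -1 means noise or not yet colored
    next_color = 0
    for seed in range(n):
        if not core[seed] or color[seed] != -1:
            continue
        color[seed] = next_color
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in nbrs[i]:
                if color[j] == -1:
                    color[j] = next_color
                    if core[j]:   # only core points keep expanding
                        queue.append(j)
        next_color += 1
    return core, color

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # first group
          (10, 10), (10, 11), (11, 10), (11, 11),  # second group
          (5, 5)]                                  # isolated point
core, color = dbscan_two_step(points, eps=1.5, min_pts=3)
```

The two groups receive colors 0 and 1, and the isolated point keeps the noise marker -1.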

Dense Box

• One significant scalability issue is dealing with dense regions of data

• Density increases the computation cost of DBSCAN

R2 requires more comparison operations

• We reduce the computation cost of high density regions by pre-clustering these regions

KD-Tree

Examine each leaf bounding box of the KD-tree, looking for boxes with point count > MinPts and size < 0.35 * Eps

DBSCAN no longer needs to expand these regions
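A minimal sketch of the dense-box test, under stated assumptions: "size" is read as the longer side of the box, the box is taken to be the tight bounding box of the leaf's points instead of the KD-tree's own leaf bounds, and the data is 2-D. The function names are hypothetical. Under that reading, a box with side < 0.35 * Eps has a diagonal below 0.35 * sqrt(2) * Eps (about 0.495 * Eps), so every pair of points inside it is within Eps of each other.

```python
import math

def is_dense_box(box_points, eps, min_pts, size_factor=0.35):
    """True if the leaf holds more than min_pts points in a box smaller than 0.35 * eps."""
    if not box_points:
        return False
    xs = [p[0] for p in box_points]
    ys = [p[1] for p in box_points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    # Assumption: "size" means the longer side of the leaf's bounding box.
    small_enough = max(width, height) < size_factor * eps
    return len(box_points) > min_pts and small_enough

def max_pairwise_distance(box_points):
    """Largest distance between any two points in the box (brute force)."""
    return max(math.dist(p, q) for p in box_points for q in box_points)

eps, min_pts = 1.0, 3
leaf = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3), (0.1, 0.2)]
dense = is_dense_box(leaf, eps, min_pts)
spread = max_pairwise_distance(leaf)
```

Here the leaf qualifies as a dense box, and its widest pair of points is still well inside Eps, so all of its points can be treated as one pre-clustered unit.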

Merge Algorithm
o Merge overlapping clusters found on different nodes.
o Two steps in the merge operation
  1. Select representative points (BE)
  2. Merge operation

Representative Points
o These are points that represent the core points in the dataset.
o Create a boundary that must contain at least one core point shared between overlapping clusters.

Representative points are the points closest to the corners and to the middle of each side of the Eps box

These points create a boundary (shaded region) in which a point must fall for overlapping clusters to be merged
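Selecting representatives can be sketched as a nearest-point search against eight anchor positions. This is an illustration only: the function name and the `(x0, y0, x1, y1)` box layout are hypothetical, and the slides do not say how the Eps box itself is constructed, so it is simply passed in.

```python
import math

def representative_points(core_points, box):
    """Pick the core point nearest each corner and each side midpoint of `box`."""
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    anchors = [(x0, y0), (x1, y0), (x0, y1), (x1, y1),   # four corners
               (xm, y0), (xm, y1), (x0, ym), (x1, ym)]   # four side midpoints
    return [min(core_points, key=lambda p: math.dist(p, a)) for a in anchors]

# A 3x3 grid of core points; the center point is never nearest to an anchor.
core_points = [(float(x), float(y)) for x in range(3) for y in range(3)]
reps = representative_points(core_points, box=(0.0, 0.0, 2.0, 2.0))
```

Only eight points per box are kept, whatever the number of core points, which is how the merge can work without the full dataset.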

Merge Algorithm

• Merge algorithm is responsible for merging overlapping clusters detected on different DBSCAN nodes.

• Need to handle the merge with low overhead and without the full dataset

[Figure: clusters on Node 1 and Node 2, with core points and non-core points marked]

1. Core/Core overlap
   Core point in common. 64 operations to detect.

2. Non-core/Core overlap
   Core point seen as non-core by one node. MinPts * 2 operations required to detect.
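The core/core case can be sketched as a pairwise comparison of representative points followed by a union-find merge. This covers case 1 only and is a guess at the mechanics, not the Mr. Scan implementation: all names are hypothetical, and reading the "64 operations" as 8 representatives compared against 8 representatives is an assumption that the slides do not state explicitly.

```python
def clusters_overlap(reps_a, reps_b):
    """Return (overlap found?, comparisons made) over every pair of representatives."""
    comparisons = 0
    found = False
    for a in reps_a:
        for b in reps_b:
            comparisons += 1
            if a == b:            # a shared core point links the two clusters
                found = True
    return found, comparisons

def merge_clusters(cluster_reps):
    """cluster_reps maps a (node, local_id) key to that cluster's representatives."""
    parent = {key: key for key in cluster_reps}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]   # path halving
            k = parent[k]
        return k

    keys = list(cluster_reps)
    for i, ka in enumerate(keys):
        for kb in keys[i + 1:]:
            # Only clusters found on different nodes need to be compared.
            if ka[0] != kb[0] and clusters_overlap(cluster_reps[ka], cluster_reps[kb])[0]:
                parent[find(ka)] = find(kb)
    return {k: find(k) for k in keys}

reps1 = [(float(i), 0.0) for i in range(8)]
reps2 = [(7.0, 0.0)] + [(float(i), 5.0) for i in range(7)]   # shares (7.0, 0.0)
reps3 = [(float(i), 9.0) for i in range(8)]                  # shares nothing
found, comparisons = clusters_overlap(reps1, reps2)
merged = merge_clusters({("node1", 0): reps1, ("node2", 0): reps2, ("node2", 1): reps3})
```

The two clusters that share a core point end up with the same root, while the third stays separate.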

Sweep Step
o Get cluster identifiers and file offsets down to BEs to write final clusters.
o FE gives each cluster a unique ID and a file offset.
o This data is passed back down to the BE that holds the data in the cluster.
o Data is written out to disk by the BE.
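One way the FE could assign IDs and offsets is an exclusive prefix sum over cluster sizes, sketched below. The function name, the dictionary layout, and the fixed `record_bytes` per point are all assumptions for illustration; the slides do not describe the output file format.

```python
from itertools import accumulate

def assign_ids_and_offsets(cluster_sizes, record_bytes):
    """Give each cluster a unique ID and the byte offset where its points start."""
    # Exclusive prefix sum of the sizes gives each cluster's starting record.
    starts = [0] + list(accumulate(cluster_sizes))[:-1]
    return [{"id": cid, "offset": start * record_bytes, "count": size}
            for cid, (start, size) in enumerate(zip(starts, cluster_sizes))]

# Three clusters of 3, 5, and 2 points, assuming 16 bytes per point record.
plan = assign_ids_and_offsets([3, 5, 2], record_bytes=16)
```

Because the offsets never overlap, each BE can write its clusters to the shared output file independently once it receives its entries.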

Experiment Setup
o Dataset: Generated data with distribution from real Twitter data
o Measuring:
  o Weak scaling up to 8192 GPUs
  o Strong scaling
  o Quality compared to single-threaded DBSCAN

Results

Weak Scaling: 4096x data/compute increase, 18.48x-31.68x time increase

Results Breakdown – Partition Phase
o @ 6.5 Billion Points: 65.9% of Mr. Scan’s time
o 94.6% I/O time

Results Breakdown – GPU Cluster Time

Strong Scaling

Quality

Future Work
o Remove partitioner’s I/O bottleneck
o Multiple dimensions

Conclusion
o Clustered 6.5 billion points with DBSCAN in 18.2 minutes
o Controlled computational variance of DBSCAN
o Partitioner I/O = scaling enemy

Questions?


Summary of previous Mr. Scan implementation

Algorithm Steps
o SpatialDecomp: CPU (@ FE)
o DBSCAN: CPU or GPU (@ BE)
o DrawBoundBox: CPU or GPU
o MergeCluster: CPU (x #levels)
