High-level Interfaces for Scalable Data Mining
Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
Motivation
- Languages, compilers, and runtime systems for high-end computing typically focus on scientific applications
- Can commercial applications benefit?
- A majority of top 500 parallel configurations are used as database servers
- Is there a role for parallel systems research?
  - Parallel relational databases: probably not
  - Data mining, OLAP, decision support: quite likely
Data Mining
- Extracting useful models or patterns from large datasets
- Includes a variety of tasks: mining associations, sequences, clustering data, building decision trees, predictive models
  - Several algorithms proposed for each
- Both compute and data intensive
- Algorithms are well suited for parallel execution
- High-level interfaces can be useful for application development
Project Overview
Project Components
- A middleware system called FREERIDE (Framework for Rapid Implementation of Datamining Engines) (SDM 01, SDM 02)
- Performance modeling and prediction, for parallelization strategy selection (SIGMETRICS 2002)
- Data parallel compilation (under submission)
- Translation from mining operators (not yet)
- Focus on the design and evaluation of the interface for shared memory parallelization in this paper
Outline
- Key observation from mining algorithms
- Parallelization challenges, techniques, and trade-offs
- Programming interface
- Experimental results: k-means and apriori
- Summary and future work
Common Processing Structure

Structure of common data mining algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

This structure applies to major association mining, clustering, and decision tree construction algorithms. How do we parallelize it on a shared memory machine?
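As an illustrative instance of this structure (a sketch, not FREERIDE's own code), consider a sequential C++ reduction loop where a hypothetical process(e) maps each element to a histogram bucket. Note that the index i is only known after processing e, which is what makes parallelization hard:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// process(e) decides, per element, which reduction element to update --
// here, a histogram bucket -- and the value to fold in.
std::pair<int, int> process(int e) {
    return {e % 4, 1};  // (index i into Reduc, value val)
}

// The reduction loop of the common structure, with op = +.
std::vector<int> reduction_loop(const std::vector<int>& data) {
    std::vector<int> Reduc(4, 0);
    for (int e : data) {                // {* Reduction Loop *}
        auto [i, val] = process(e);
        Reduc[i] = Reduc[i] + val;      // Reduc(i) = Reduc(i) op val
    }
    return Reduc;
}
```

The outer sequential loop is omitted here; in k-means, for example, it would repeat the reduction loop until the centers converge.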
Challenges in Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied: which part of the reduction object an element updates is unknown until the element is processed
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object
Parallelization Techniques
- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put each element and its corresponding lock on the same cache block
- Fixed Locking: use a fixed number of locks
- Cache-Sensitive Locking: one lock for all elements in a cache block
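A minimal sketch of the first technique, full replication, assuming a cyclic partition of the input and a plus-reduction; the thread management and merge phase here are illustrative, not the middleware's actual implementation:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Full replication: each thread updates its own private copy of the
// reduction object, so no locks are needed; copies are merged at the end.
std::vector<int> replicated_sum(const std::vector<int>& data, int nthreads) {
    std::vector<std::vector<int>> copies(nthreads, std::vector<int>(4, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; t++) {
        workers.emplace_back([&, t] {
            // Each thread processes a cyclic slice of the data.
            for (std::size_t e = t; e < data.size(); e += nthreads)
                copies[t][data[e] % 4] += 1;   // race-free: private copy
        });
    }
    for (auto& w : workers) w.join();
    std::vector<int> merged(4, 0);             // final merge phase
    for (const auto& c : copies)
        for (int i = 0; i < 4; i++) merged[i] += c[i];
    return merged;
}
```

The trade-off the slides describe is visible here: no synchronization during the loop, but one copy of the reduction object per thread plus a merge at the end.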
Memory Layout for Various Locking Schemes

[Figure: memory layouts for Full Locking, Fixed Locking, Optimized Full Locking, and Cache-Sensitive Locking, showing how locks and reduction elements are interleaved]
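The two cache-conscious layouts can be sketched as padded C++ structs; the 64-byte cache block size and the field choices are assumptions for illustration, not the middleware's actual definitions:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t CACHE_BLOCK = 64;  // assumed cache-line size

// Optimized full locking: each reduction element shares a cache block
// with its own lock, so one memory fetch brings in both.
struct alignas(CACHE_BLOCK) LockedElement {
    int lock;      // stand-in for a spinlock word
    double value;  // the reduction element itself
};

// Cache-sensitive locking: a single lock guards every element in the
// block, cutting the lock-memory overhead by the block's element count.
struct alignas(CACHE_BLOCK) CacheSensitiveBlock {
    int lock;             // one lock for all elements below
    double elems[7] = {}; // 7 elements fill the rest of the 64-byte block
};
```

Both structs occupy exactly one cache block, which is the point of the layouts: a lock acquisition never touches a block that its protected data does not also occupy.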
Programming Interface: k-means example

Initialization Function:

    void Kmeans::initialize() {
        for (int i = 0; i < k; i++) {
            clusterID[i] = reducobject->alloc(ndim + 2);
        }
        {* Initialize Centers *}
    }
k-means example (contd.)

Local Reduction Function:

    void Kmeans::reduction(void *point) {
        for (int i = 0; i < k; i++) {
            dis = distance(point, i);
            if (dis < min) {
                min = dis;
                min_index = i;
            }
        }
        objectID = clusterID[min_index];
        for (int j = 0; j < ndim; j++)
            reducobject->Add(objectID, j, point[j]);
        reducobject->Add(objectID, ndim, 1);
        reducobject->Add(objectID, ndim + 1, min);
    }
Implementation from the Common Specification

    template <class T>
    inline void Reducible<T>::Reduc(int objectID, int Offset,
                                    void (*func)(void *, void *), int *param) {
        T *group_address = reducgroup[objectID];
        switch (TECHNIQUE) {
        case FULL_REPLICATION:
            func(&group_address[Offset], param);
            break;
        case FULL_LOCKING:
            offset = abs_offset(objectID, Offset);
            S_LOCK(&locks[offset]);
            func(&group_address[Offset], param);
            S_UNLOCK(&locks[offset]);
            break;
        case OPTIMIZED_FULL_LOCKS:
            S_LOCK(&group_address[Offset * 2]);
            func(&group_address[Offset * 2 + 1], param);
            S_UNLOCK(&group_address[Offset * 2]);
            break;
        }
    }
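The switch above does not show the cache-sensitive locking case. A self-contained sketch of that scheme, using std::mutex in place of the middleware's S_LOCK/S_UNLOCK and assuming seven elements plus one lock per cache block (the block size and helper names are assumptions, not FREERIDE's actual code):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

constexpr int ELEMS_PER_BLOCK = 7;  // assumed: 7 elements + 1 lock per block

// One cache block: a single lock guards all reduction elements in it.
struct Block {
    std::mutex lock;
    double elems[ELEMS_PER_BLOCK] = {};
};

// Cache-sensitive update: map the global offset to a block, take that
// block's lock, and update the element within the block.
void add(std::vector<Block>& groups, int offset, double val) {
    Block& b = groups[offset / ELEMS_PER_BLOCK];
    std::lock_guard<std::mutex> g(b.lock);     // S_LOCK on the block's lock
    b.elems[offset % ELEMS_PER_BLOCK] += val;  // fine-grained update
}
```

Compared with full locking, this uses one lock per seven elements instead of one per element, at the cost of false contention between elements that share a block.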
Experimental Platform
- Small SMP machine: Sun Ultra Enterprise 450, 4 x 250 MHz Ultra-II processors, 1 GB of 4-way interleaved main memory
- Large SMP machine: Sun Fire 6800, 24 x 900 MHz Sun UltraSparc III processors, a 96 KB L1 cache and a 64 MB L2 cache per processor, 24 GB main memory
Results

[Figure: execution time (s) vs. number of threads (1-4), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man]

Scalability and Middleware Overhead for Apriori: 4 Processor SMP Machine
Results

[Figure: execution time (s) vs. number of threads (1-16), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man]

Scalability and Middleware Overhead for Apriori: Large SMP Machine
Results

Scalability and Middleware Overhead for K-means: 4 Processor SMP Machine

[Figure: execution time (s) vs. number of threads (1-4), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man; 200 MB dataset, k = 1000]
Results

Scalability and Middleware Overhead for K-means: Large SMP Machine

[Figure: execution time (s) vs. number of threads (1-16), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man]
Compiler Support
- Use a data parallel dialect of Java
  - Well suited for expressing common mining algorithms
  - Main computational loops are data parallel
- Use the notion of a reduction interface to implement reduction objects
- Our compiler generates middleware code
Experimental Evaluation
- Currently limited to distributed memory parallelization