High-level Interfaces for Scalable Data Mining Ruoming Jin Gagan Agrawal Department of Computer and Information Sciences Ohio State University


Post on 17-Jan-2018



High-level Interfaces for Scalable Data Mining

Ruoming Jin, Gagan Agrawal

Department of Computer and Information Sciences, Ohio State University


Motivation

- Languages, compilers, and runtime systems for high-end computing
  - Typically focus on scientific applications
- Can commercial applications benefit?
  - A majority of top 500 parallel configurations are used as database servers
- Is there a role for parallel systems research?
  - Parallel relational databases: probably not
  - Data mining, OLAP, decision support: quite likely


Data Mining

- Extracting useful models or patterns from large datasets
- Includes a variety of tasks (mining associations, sequences, clustering data, building decision trees, predictive models), with several algorithms proposed for each
- Both compute and data intensive
- Algorithms are well suited for parallel execution
- High-level interfaces can be useful for application development


Project Overview


Project Components

- A middleware system called FREERIDE (Framework for Rapid Implementation of Datamining Engines) (SDM 01, SDM 02)
- Performance modeling and prediction (for parallelization strategy selection) (SIGMETRICS 2002)
- Data parallel compilation (under submission)
- Translation from mining operators (not yet)
- Focus on design and evaluation of the interface for shared memory parallelization in this paper


Outline

- Key observation from mining algorithms
- Parallelization challenge, techniques and trade-offs
- Programming Interface
- Experimental Results
  - K-means
  - Apriori
- Summary and future work


Common Processing Structure

Structure of Common Data Mining Algorithms

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

Applies to major association mining, clustering and decision tree construction algorithms

How to parallelize it on a shared memory machine?
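As a sequential baseline for the structure above, here is a minimal C++ sketch. It assumes a toy workload in which every element is an item ID that adds 1 to its own slot of the reduction object; the names `process` and `mine` and the fixed pass count are illustrative assumptions, not part of the middleware.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical process(): maps an element to an (index, value) pair.
// In this toy workload each element is an item ID that contributes 1.
std::pair<std::size_t, long> process(int element) {
    return std::make_pair(static_cast<std::size_t>(element), 1L);
}

// Runs `passes` iterations of the outer sequential loop and returns the
// reduction object produced by the final pass.
std::vector<long> mine(const std::vector<int>& elements,
                       std::size_t num_slots, int passes) {
    std::vector<long> reduc;
    for (int pass = 0; pass < passes; ++pass) {   // outer sequential loop
        reduc.assign(num_slots, 0);               // fresh reduction object
        for (int e : elements) {                  // reduction loop
            const std::pair<std::size_t, long> r = process(e);
            reduc[r.first] += r.second;           // Reduc(i) = Reduc(i) op val
        }
    }
    return reduc;
}
```

The slot to update is only known after `process` has run on the element, which is the property that makes the loop hard to partition statically.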


Challenges in Parallelization

- Statically partitioning the reduction object to avoid race conditions is generally impossible.
- Runtime preprocessing or scheduling also cannot be applied
  - Can't tell what you need to update without processing the element
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object.


Parallelization Techniques

- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put the element and corresponding lock on the same cache block
- Fixed Locking: use a fixed number of locks
- Cache Sensitive Locking: one lock for all elements in a cache block
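To make the first two techniques concrete, the following sketch contrasts full replication and full locking using standard C++ threads. It is an illustration under assumptions, not FREERIDE code: the workload (each data element is a slot index that adds 1), the helper `run_threads`, and both function names are hypothetical, and `nthreads` must be at least 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Splits [0, n) into contiguous chunks and calls update(thread_id, index)
// for every index, one chunk per thread.
template <class Update>
void run_threads(std::size_t n, unsigned nthreads, Update update) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([=] {
            for (std::size_t k = begin; k < end; ++k) update(t, k);
        });
    }
    for (std::thread& w : workers) w.join();
}

// Full replication: every thread updates a private copy without locks;
// the copies are merged after all threads have finished.
std::vector<long> count_full_replication(const std::vector<int>& data,
                                         std::size_t slots, unsigned nthreads) {
    std::vector<std::vector<long>> copies(nthreads, std::vector<long>(slots, 0));
    run_threads(data.size(), nthreads, [&](unsigned t, std::size_t k) {
        copies[t][static_cast<std::size_t>(data[k])] += 1;
    });
    std::vector<long> result(slots, 0);
    for (const std::vector<long>& c : copies)
        for (std::size_t i = 0; i < slots; ++i) result[i] += c[i];
    return result;
}

// Full locking: a single shared copy with one lock per element.
std::vector<long> count_full_locking(const std::vector<int>& data,
                                     std::size_t slots, unsigned nthreads) {
    std::vector<long> reduc(slots, 0);
    std::vector<std::mutex> locks(slots);
    run_threads(data.size(), nthreads, [&](unsigned, std::size_t k) {
        const std::size_t slot = static_cast<std::size_t>(data[k]);
        std::lock_guard<std::mutex> guard(locks[slot]);
        reduc[slot] += 1;
    });
    return reduc;
}
```

Both functions return the same counts. Replication pays in memory (one copy per thread) and in the merge step, while locking pays on every update. The sketch needs thread support at build time (for example `-pthread` with GCC or Clang).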


Memory Layout for Various Locking Schemes

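[Figure: four panels showing how locks and reduction elements are laid out in memory under Full Locking, Fixed Locking, Optimized Full Locking, and Cache-Sensitive Locking. Legend: Lock, Reduction Element]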
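The layouts can also be described as index arithmetic over an array of equally sized slots. The sketch below is one possible mapping, under several assumptions that do not come from the slides: a 64-byte cache block, 8-byte locks and elements, the lock placed first in each cache-sensitive block, and a modulo mapping for fixed locking. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed sizes: 64-byte cache block, 8-byte slots for locks and elements.
constexpr std::size_t kCacheBlockBytes = 64;
constexpr std::size_t kSlotBytes = 8;
constexpr std::size_t kSlotsPerBlock = kCacheBlockBytes / kSlotBytes;  // 8
constexpr std::size_t kElemsPerBlock = kSlotsPerBlock - 1;             // 7

// Optimized full locking: lock and element interleaved in one array,
// so element i and its lock are adjacent and share a cache block.
constexpr std::size_t ofl_lock_slot(std::size_t i) { return 2 * i; }
constexpr std::size_t ofl_elem_slot(std::size_t i) { return 2 * i + 1; }

// Cache-sensitive locking: each block holds one lock (assumed to come
// first) followed by the elements it guards.
constexpr std::size_t csl_lock_slot(std::size_t i) {
    return (i / kElemsPerBlock) * kSlotsPerBlock;
}
constexpr std::size_t csl_elem_slot(std::size_t i) {
    return (i / kElemsPerBlock) * kSlotsPerBlock + 1 + i % kElemsPerBlock;
}

// Fixed locking: a fixed pool of locks in a separate array; a modulo
// mapping from element to lock is assumed here.
constexpr std::size_t fixed_lock_index(std::size_t i, std::size_t num_locks) {
    return i % num_locks;
}

// Cache block that a given slot falls into.
constexpr std::size_t block_of(std::size_t slot) { return slot / kSlotsPerBlock; }
```

Under these assumptions, optimized full locking doubles the memory of the reduction object, while cache-sensitive locking spends one slot in eight on locks, at the cost of coarser locking.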


Programming Interface: k-means example

Initialization Function

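void Kmeans::initialize() {
    // One reduction group per cluster: ndim coordinate sums,
    // a point count, and an accumulated distance.
    for (int i = 0; i < k; i++) {
        clusterID[i] = reductionobject->alloc(ndim + 2);
    }
    // Initialize centers (not shown on the slide)
}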


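k-means example (contd.)

Local Reduction Function

void Kmeans::reduction(void *point) {
    // The coordinate type is not shown on the slide; double is assumed.
    double *coords = static_cast<double *>(point);
    // Find the nearest center (min is assumed to be reset for each point).
    for (int i = 0; i < k; i++) {
        dis = distance(point, i);
        if (dis < min) {
            min = dis;
            min_index = i;
        }
    }
    // Accumulate the point into the group of its nearest cluster.
    objectID = clusterID[min_index];
    for (int j = 0; j < ndim; j++)
        reductionobject->Add(objectID, j, coords[j]);
    reductionobject->Add(objectID, ndim, 1);
    // Add the distance to the nearest center, not the last one computed.
    reductionobject->Add(objectID, ndim + 1, min);
}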
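To see how the two k-means functions fit together, the sketch below pairs them with a small sequential stand-in for the reduction object. Everything beyond the `alloc` and `Add` calls shown on the slides is an assumption for illustration: the `ReductionObject` class and its `get` accessor, the `double` coordinate type, the squared Euclidean distance, and the constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sequential stand-in for the middleware's reduction object:
// groups of doubles addressed by (group ID, offset). No locking is done.
class ReductionObject {
public:
    int alloc(int size) {
        groups_.push_back(std::vector<double>(static_cast<std::size_t>(size), 0.0));
        return static_cast<int>(groups_.size()) - 1;
    }
    void Add(int id, int offset, double value) { groups_[id][offset] += value; }
    double get(int id, int offset) const { return groups_[id][offset]; }
private:
    std::vector<std::vector<double>> groups_;
};

class Kmeans {
public:
    Kmeans(const std::vector<std::vector<double>>& centers, ReductionObject* ro)
        : k(static_cast<int>(centers.size())),
          ndim(static_cast<int>(centers[0].size())),
          centers_(centers), reductionobject(ro) {}

    // One group per cluster: ndim sums, a point count, a distance sum.
    void initialize() {
        clusterID.resize(k);
        for (int i = 0; i < k; i++) clusterID[i] = reductionobject->alloc(ndim + 2);
    }

    // Local reduction: assign the point to its nearest center and accumulate.
    void reduction(const double* point) {
        double min = std::numeric_limits<double>::max();
        int min_index = 0;
        for (int i = 0; i < k; i++) {
            const double dis = distance(point, i);
            if (dis < min) { min = dis; min_index = i; }
        }
        const int objectID = clusterID[min_index];
        for (int j = 0; j < ndim; j++) reductionobject->Add(objectID, j, point[j]);
        reductionobject->Add(objectID, ndim, 1);
        reductionobject->Add(objectID, ndim + 1, min);
    }

    std::vector<int> clusterID;

private:
    // Squared Euclidean distance (an assumed metric).
    double distance(const double* point, int c) const {
        double sum = 0.0;
        for (int j = 0; j < ndim; j++) {
            const double d = point[j] - centers_[c][j];
            sum += d * d;
        }
        return sum;
    }
    int k, ndim;
    std::vector<std::vector<double>> centers_;
    ReductionObject* reductionobject;
};
```

After all points have been reduced, dividing each group's coordinate sums by its count would give the new center for the next outer iteration. The application code never mentions locks or copies, which is what lets a middleware swap the parallelization technique underneath.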


Implementation from the Common Specification

template <class T>
inline void Reducible<T>::Reduc(int objectID, int offset,
                                void (*func)(void *, void *), int *param) {
    T *group_address = reducgroup[objectID];
    switch (TECHNIQUE) {
    case FULL_REPLICATION:
        // Each thread works on its own copy, so no lock is needed.
        func(&group_address[offset], param);
        break;
    case FULL_LOCKING: {
        // One lock per element, kept in a separate lock array.
        int lock_index = abs_offset(objectID, offset);
        S_LOCK(&locks[lock_index]);
        func(&group_address[offset], param);
        S_UNLOCK(&locks[lock_index]);
        break;
    }
    case OPTIMIZED_FULL_LOCKS:
        // Lock and element are interleaved: lock at 2*offset, element next to it.
        S_LOCK(&group_address[offset * 2]);
        func(&group_address[offset * 2 + 1], param);
        S_UNLOCK(&group_address[offset * 2]);
        break;
    }
}


Experimental Platform

- Small SMP machine
  - Sun Ultra Enterprise 450
  - 4 x 250 MHz Ultra-II processors
  - 1 GB of 4-way interleaved main memory
- Large SMP machine
  - Sun Fire 6800
  - 24 x 900 MHz Sun UltraSparc III
  - A 96 KB L1 cache and a 64 MB L2 cache per processor
  - 24 GB main memory


Results

Scalability and Middleware Overhead for Apriori: 4 Processor SMP Machine

[Figure: Time (s), axis range 0 to 5000, versus number of threads (1, 2, 3, 4). Series: fr-int, fr-man, ofl-int, ofl-man, csl-int, csl-man]


Results

Scalability and Middleware Overhead for Apriori: Large SMP Machine

[Figure: Time (s), axis range 0 to 30000, versus number of threads (1, 2, 4, 8, 12, 16). Series: fr-int, fr-man, ofl-int, ofl-man, csl-int, csl-man]


Results

Scalability and Middleware Overhead for K-means: 4 Processor SMP Machine

[Figure: Time (s), axis range 0 to 6000, versus number of threads (1, 2, 3, 4). Series: fr-int, fr-man, ofl-int, ofl-man, csl-int, csl-man. 200 MB dataset, k=1000]


Results

Scalability and Middleware Overhead for K-means: Large SMP Machine

[Figure: Time (s), axis range 0 to 2000, versus number of threads (1, 2, 4, 8, 12, 16). Series: fr-int, fr-man, ofl-int, ofl-man, csl-int, csl-man]


Compiler Support

- Use a data parallel dialect of Java
  - Well suited for expressing common mining algorithms
  - Main computational loops are data parallel
  - Use the notion of reduction interface to implement reduction objects
- Our compiler generates middleware code


Experimental Evaluation

- Currently limited to distributed memory parallelization