High-level Interfaces for Scalable Data Mining
Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
Motivation
- Languages, compilers, and runtime systems for high-end computing typically focus on scientific applications
- Can commercial applications benefit?
- A majority of top 500 parallel configurations are used as database servers
- Is there a role for parallel systems research?
  - Parallel relational databases: probably not
  - Data mining, OLAP, decision support: quite likely
Data Mining
- Extracting useful models or patterns from large datasets
- Includes a variety of tasks: mining associations, sequences, clustering data, building decision trees, predictive models
  - Several algorithms proposed for each
- Both compute and data intensive
- Algorithms are well suited for parallel execution
- High-level interfaces can be useful for application development
Project Overview
Project Components
- A middleware system called FREERIDE (Framework for Rapid Implementation of Datamining Engines) (SDM 01, SDM 02)
- Performance modeling and prediction, for parallelization strategy selection (SIGMETRICS 2002)
- Data parallel compilation (under submission)
- Translation from mining operators (not yet)
- Focus on the design and evaluation of the interface for shared memory parallelization in this paper
Outline
- Key observation from mining algorithms
- Parallelization challenges, techniques, and trade-offs
- Programming interface
- Experimental results: k-means and apriori
- Summary and future work
Common Processing Structure

Structure of common data mining algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

This structure applies to major association mining, clustering, and decision tree construction algorithms. How do we parallelize it on a shared memory machine?
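As an illustrative instance of this structure (a sketch, not FREERIDE's own code), consider a sequential C++ reduction loop where a hypothetical process(e) maps each element to a histogram bucket. Note that the index i is only known after processing e, which is what makes parallelization hard:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// process(e) decides, per element, which reduction element to update --
// here, a histogram bucket -- and the value to fold in.
std::pair<int, int> process(int e) {
    return {e % 4, 1};  // (index i into Reduc, value val)
}

// The reduction loop of the common structure, with op = +.
std::vector<int> reduction_loop(const std::vector<int>& data) {
    std::vector<int> Reduc(4, 0);
    for (int e : data) {                // {* Reduction Loop *}
        auto [i, val] = process(e);
        Reduc[i] = Reduc[i] + val;      // Reduc(i) = Reduc(i) op val
    }
    return Reduc;
}
```

The outer sequential loop is omitted here; in k-means, for example, it would repeat the reduction loop until the centers converge.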
Challenges in Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied: which part of the reduction object an element updates is unknown until the element is processed
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object
Parallelization Techniques
- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put each element and its corresponding lock on the same cache block
- Fixed Locking: use a fixed number of locks
- Cache-Sensitive Locking: one lock for all elements in a cache block
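A minimal sketch of the first technique, full replication, assuming a cyclic partition of the input and a plus-reduction; the thread management and merge phase here are illustrative, not the middleware's actual implementation:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Full replication: each thread updates its own private copy of the
// reduction object, so no locks are needed; copies are merged at the end.
std::vector<int> replicated_sum(const std::vector<int>& data, int nthreads) {
    std::vector<std::vector<int>> copies(nthreads, std::vector<int>(4, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; t++) {
        workers.emplace_back([&, t] {
            // Each thread processes a cyclic slice of the data.
            for (std::size_t e = t; e < data.size(); e += nthreads)
                copies[t][data[e] % 4] += 1;   // race-free: private copy
        });
    }
    for (auto& w : workers) w.join();
    std::vector<int> merged(4, 0);             // final merge phase
    for (const auto& c : copies)
        for (int i = 0; i < 4; i++) merged[i] += c[i];
    return merged;
}
```

The trade-off the slides describe is visible here: no synchronization during the loop, but one copy of the reduction object per thread plus a merge at the end.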
Memory Layout for Various Locking Schemes

[Figure: memory layouts for Full Locking, Fixed Locking, Optimized Full Locking, and Cache-Sensitive Locking, showing how locks and reduction elements are interleaved]
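The two cache-conscious layouts can be sketched as padded C++ structs; the 64-byte cache block size and the field choices are assumptions for illustration, not the middleware's actual definitions:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t CACHE_BLOCK = 64;  // assumed cache-line size

// Optimized full locking: each reduction element shares a cache block
// with its own lock, so one memory fetch brings in both.
struct alignas(CACHE_BLOCK) LockedElement {
    int lock;      // stand-in for a spinlock word
    double value;  // the reduction element itself
};

// Cache-sensitive locking: a single lock guards every element in the
// block, cutting the lock-memory overhead by the block's element count.
struct alignas(CACHE_BLOCK) CacheSensitiveBlock {
    int lock;             // one lock for all elements below
    double elems[7] = {}; // 7 elements fill the rest of the 64-byte block
};
```

Both structs occupy exactly one cache block, which is the point of the layouts: a lock acquisition never touches a block that its protected data does not also occupy.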
Programming Interface: k-means example

Initialization Function:

    void Kmeans::initialize() {
        for (int i = 0; i < k; i++) {
            clusterID[i] = reducobject->alloc(ndim + 2);
        }
        {* Initialize Centers *}
    }
k-means example (contd.)

Local Reduction Function:

    void Kmeans::reduction(void *point) {
        for (int i = 0; i < k; i++) {
            dis = distance(point, i);
            if (dis < min) {
                min = dis;
                min_index = i;
            }
        }
        objectID = clusterID[min_index];
        for (int j = 0; j < ndim; j++)
            reducobject->Add(objectID, j, point[j]);
        reducobject->Add(objectID, ndim, 1);
        reducobject->Add(objectID, ndim + 1, min);
    }
Implementation from the Common Specification

    template <class T>
    inline void Reducible<T>::Reduc(int objectID, int Offset,
                                    void (*func)(void *, void *), int *param) {
        T *group_address = reducgroup[objectID];
        switch (TECHNIQUE) {
        case FULL_REPLICATION:
            func(&group_address[Offset], param);
            break;
        case FULL_LOCKING:
            offset = abs_offset(objectID, Offset);
            S_LOCK(&locks[offset]);
            func(&group_address[Offset], param);
            S_UNLOCK(&locks[offset]);
            break;
        case OPTIMIZED_FULL_LOCKS:
            S_LOCK(&group_address[Offset * 2]);
            func(&group_address[Offset * 2 + 1], param);
            S_UNLOCK(&group_address[Offset * 2]);
            break;
        }
    }
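The switch above does not show the cache-sensitive locking case. A self-contained sketch of that scheme, using std::mutex in place of the middleware's S_LOCK/S_UNLOCK and assuming seven elements plus one lock per cache block (the block size and helper names are assumptions, not FREERIDE's actual code):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

constexpr int ELEMS_PER_BLOCK = 7;  // assumed: 7 elements + 1 lock per block

// One cache block: a single lock guards all reduction elements in it.
struct Block {
    std::mutex lock;
    double elems[ELEMS_PER_BLOCK] = {};
};

// Cache-sensitive update: map the global offset to a block, take that
// block's lock, and update the element within the block.
void add(std::vector<Block>& groups, int offset, double val) {
    Block& b = groups[offset / ELEMS_PER_BLOCK];
    std::lock_guard<std::mutex> g(b.lock);     // S_LOCK on the block's lock
    b.elems[offset % ELEMS_PER_BLOCK] += val;  // fine-grained update
}
```

Compared with full locking, this uses one lock per seven elements instead of one per element, at the cost of false contention between elements that share a block.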
Experimental Platform
- Small SMP machine: Sun Ultra Enterprise 450, 4 x 250 MHz Ultra-II processors, 1 GB of 4-way interleaved main memory
- Large SMP machine: Sun Fire 6800, 24 x 900 MHz Sun UltraSparc III processors, a 96 KB L1 cache and a 64 MB L2 cache per processor, 24 GB main memory
Results

[Figure: execution time (s) vs. number of threads (1-4), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man]

Scalability and Middleware Overhead for Apriori: 4 Processor SMP Machine
Results

[Figure: execution time (s) vs. number of threads (1-16), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man]

Scalability and Middleware Overhead for Apriori: Large SMP Machine
Results

Scalability and Middleware Overhead for K-means: 4 Processor SMP Machine

[Figure: execution time (s) vs. number of threads (1-4), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man; 200 MB dataset, k = 1000]
Results

Scalability and Middleware Overhead for K-means: Large SMP Machine

[Figure: execution time (s) vs. number of threads (1-16), for fr-int, fr-man, ofl-int, ofl-man, csl-int, and csl-man]
Compiler Support
- Use a data parallel dialect of Java
  - Well suited for expressing common mining algorithms
  - Main computational loops are data parallel
- Use the notion of a reduction interface to implement reduction objects
- Our compiler generates middleware code
Experimental Evaluation
- Currently limited to distributed memory parallelization