![Page 1: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/1.jpg)
Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation
Xiaogang Li, Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences
Ohio State University
![Page 2: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/2.jpg)
Motivation
Languages, compilers, and runtime systems for high-end computing
- Typically focus on scientific applications
Can commercial applications benefit?
- A majority of top 500 parallel configurations are used as database servers
Is there a role for parallel systems research?
- Parallel relational databases: probably not
- Data mining, OLAP, decision support: quite likely
![Page 3: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/3.jpg)
Data Mining
Extracting useful models or patterns from large datasets
Includes a variety of tasks: mining associations and sequences, clustering data, building decision trees and predictive models, with several algorithms proposed for each
Both compute and data intensive
Algorithms are well suited for parallel execution
High-level interfaces can be useful for application development
![Page 4: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/4.jpg)
Project Overview
![Page 5: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/5.jpg)
Project Components
A middleware system called FREERIDE (Framework for Rapid Implementation of Datamining Engines) (SDM 01, SDM 02)
Performance modeling and prediction (for parallelization strategy selection) SIGMETRICS 2002
Runtime and compiler support for shared memory parallelization (LCPC 02)
Translation from mining operators (not yet)
Focus on language and compiler support for distributed memory parallelization in this talk
![Page 6: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/6.jpg)
Common Processing Structure
Structure of Common Data Mining Algorithms

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
Applies to major association mining, clustering and decision tree construction algorithms
Parallelization approach
- Compute local copy of reduction objects
- Perform global reduction
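The two-step parallelization approach can be sketched in plain Java as follows. This is only an illustration under assumed names: `ReductionSketch`, `localReduce`, `globalReduce`, the bucket-counting `process` step, and the sample data are hypothetical and do not come from the slides. Each simulated node fills a private copy of the reduction object, and the copies are then combined with the same associative and commutative operator.

```java
// Hypothetical illustration; class, method names, and data are assumptions.
final class ReductionSketch {
    // Local reduction: one node folds its own chunk into a private copy.
    static long[] localReduce(int[] chunk, int buckets) {
        long[] reduc = new long[buckets];
        for (int e : chunk) {
            int i = Math.floorMod(e, buckets); // process(e) yields the index i
            reduc[i] += 1;                     // Reduc(i) = Reduc(i) op val, with op = +
        }
        return reduc;
    }

    // Global reduction: combine the per-node copies element-wise with the same op.
    static long[] globalReduce(long[][] locals) {
        long[] result = new long[locals[0].length];
        for (long[] local : locals) {
            for (int i = 0; i < result.length; i++) {
                result[i] += local[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] chunks = { {0, 1, 2, 3}, {4, 5, 6}, {7, 8} }; // one chunk per simulated node
        long[][] locals = new long[chunks.length][];
        for (int n = 0; n < chunks.length; n++) {
            locals[n] = localReduce(chunks[n], 3);
        }
        System.out.println(java.util.Arrays.toString(globalReduce(locals)));
    }
}
```

Because the operator is associative and commutative, the combined result does not depend on how the elements were split across nodes.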
![Page 7: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/7.jpg)
Middleware Support for Distributed Memory Parallelization
Interface requires:
- Specification of an iterator and termination condition
- Local reduction for each parallel loop
- Global reduction for each loop
Functionality:
- Fetch data elements chunk by chunk, apply local reduction
- Broadcast the reduction object after finishing one pass on data
- Perform global reduction, broadcast the results
- Check termination condition, move to next iteration
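To make the division of labor concrete, here is a minimal single-machine sketch of such an interface. It is not FREERIDE's actual API: `ReductionTask`, `MiddlewareSketch`, `KMeans1D`, and all method names are hypothetical, and the "nodes" are simulated by iterating over arrays rather than by real communication. The user supplies the local reduction, global reduction, and termination condition; the driver handles the pass structure.

```java
// Hypothetical interface; a sketch of the pieces listed above, not FREERIDE's API.
interface ReductionTask {
    double[] newReductionObject();
    void localReduce(double[] reduc, double element); // fold one element into a node-local copy
    void globalReduce(double[] into, double[] from);  // combine two reduction objects
    boolean finished(double[] result, int pass);      // termination condition
}

final class MiddlewareSketch {
    // Each inner array of nodeData plays the role of one node's data.
    static double[] run(ReductionTask task, double[][] nodeData, int maxPasses) {
        double[] result = task.newReductionObject();
        for (int pass = 0; pass < maxPasses; pass++) {
            double[] combined = task.newReductionObject();
            for (double[] chunk : nodeData) {
                double[] local = task.newReductionObject();
                for (double e : chunk) {
                    task.localReduce(local, e);
                }
                task.globalReduce(combined, local); // stands in for broadcast plus global reduction
            }
            result = combined;
            if (task.finished(result, pass)) {
                break;
            }
        }
        return result;
    }
}

// Example task (assumption): one-dimensional k-means with two centers.
final class KMeans1D implements ReductionTask {
    final double[] centers = {0.0, 10.0};

    public double[] newReductionObject() {
        return new double[4]; // layout: sum0, count0, sum1, count1
    }

    public void localReduce(double[] r, double e) {
        int c = Math.abs(e - centers[0]) <= Math.abs(e - centers[1]) ? 0 : 1;
        r[2 * c] += e;
        r[2 * c + 1] += 1;
    }

    public void globalReduce(double[] into, double[] from) {
        for (int i = 0; i < into.length; i++) {
            into[i] += from[i];
        }
    }

    // Recomputes the centers and stops once no center changes.
    public boolean finished(double[] r, int pass) {
        boolean unchanged = true;
        for (int c = 0; c < 2; c++) {
            if (r[2 * c + 1] > 0) {
                double updated = r[2 * c] / r[2 * c + 1];
                if (updated != centers[c]) {
                    unchanged = false;
                }
                centers[c] = updated;
            }
        }
        return unchanged;
    }
}
```

A call such as `MiddlewareSketch.run(new KMeans1D(), data, 10)` repeats passes over the data until the termination condition holds or the pass limit is reached.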
![Page 8: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/8.jpg)
Compilation Approach
Support a general high-level language
Use middleware functionality in compilation
Exploit the domain-specific common structure
- Reduction loop with associative and commutative operations
- Disk-resident input datasets, smaller output
![Page 9: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/9.jpg)
Language Support
A data parallel dialect of Java, to give the compiler information about independent collections of objects, parallel loops, and reduction operations
- domain & rectdomain
- foreach loop
- reduction variables:
  - can only be updated inside a foreach loop by operations that are associative & commutative
  - intermediate value of the reduction variables may not be used within the loop, except for self-updates
![Page 10: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/10.jpg)
Example code
    public class kNN {
        static buffer kbuffer;
        public static void main(String[] args) {
            double dis;
            Point<3> lowend = …
            Point<3> hiend = …
            Point<3> p;
            RectDomain<3> InputDomain = [lowend:hiend];
            kPoint[3d] Input = new kPoint[InputDomain];
            foreach (p in InputDomain) {
                if (Input[p].inRange(R)) {
                    dis = Input[p].distance(W);
                    kbuffer.insert(Input[p], dis);
                }
            }
        }
    }
![Page 11: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/11.jpg)
Compilation Task
Extract local reduction function
- Simple from body of data parallel loop
Extract an iterator and termination condition
- Simple from the overall code
Extract a global reduction function
- Can be quite challenging in the presence of complex control flow and data structures
- A new algorithm developed
![Page 12: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/12.jpg)
Extracting Global Reduction from Local Reduction : Motivating Example
    I = k - 1;
    while ((newdis < distance) && I >= 0) {
        if (I > 0) {
            x1[I] = x1[I-1];
            x2[I] = x2[I-1];
            …
        }
        I = I - 1;
    }
    if (I < k-1) {
        x1[I+1] = kpoint.x1;
        x2[I+1] = kpoint.x2;
        …
    }

    I = k - 1;
    while ((kpoint.dis < distance) && I >= 0) {
        if (I > 0) {
            x1[I] = x1[I-1];
            x2[I] = x2[I-1];
            …
        }
        I = I - 1;
    }
    if (I < k-1) {
        x1[I+1] = kpoint.x1;
        x2[I+1] = kpoint.x2;
        …
    }

    for (j = 0; j < k; j++) {
        I = k - 1;
        while ((buf.dis[j] < distance) && I >= 0) {
            if (I > 0) {
                x1[I] = x1[I-1];
                x2[I] = x2[I-1];
                …
            }
            I = I - 1;
        }
        if (I < k-1) {
            x1[I+1] = buf.x1[j];
            x2[I+1] = buf.x2[j];
            …
        }
    }
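The three fragments above appear to show the same insertion logic at successive stages: first written against a new distance, then against a field of the input point, and finally replayed for every entry of another buffer. A runnable Java sketch of that idea follows. The class `KNearestBuffer`, its fields, and the choice to compare against `dis[i]` and to pad empty slots with positive infinity are assumptions made for illustration, not details taken from the slides.

```java
// Hypothetical sketch of a k-nearest buffer; names and layout are assumptions.
final class KNearestBuffer {
    final int k;
    final double[] dis; // distances, sorted ascending; unused slots hold +infinity
    final double[] x1;  // one coordinate of each stored point

    KNearestBuffer(int k) {
        this.k = k;
        this.dis = new double[k];
        this.x1 = new double[k];
        java.util.Arrays.fill(dis, Double.POSITIVE_INFINITY);
    }

    // Local reduction: insert one input element, shifting farther entries down.
    void insert(double newDis, double newX1) {
        int i = k - 1;
        while (i >= 0 && newDis < dis[i]) {
            if (i > 0) {
                dis[i] = dis[i - 1];
                x1[i] = x1[i - 1];
            }
            i = i - 1;
        }
        if (i < k - 1) {
            dis[i + 1] = newDis;
            x1[i + 1] = newX1;
        }
    }

    // Global reduction: replay the same insertion for each entry of another buffer.
    void merge(KNearestBuffer buf) {
        for (int j = 0; j < buf.k; j++) {
            insert(buf.dis[j], buf.x1[j]);
        }
    }
}
```

Merging a buffer holding distances 2, 3, 7 into one holding 1, 5, 9 with k = 3 leaves the three smallest distances overall, which is what a single sequential pass over all points would have produced.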
![Page 13: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/13.jpg)
Overall Approach
Classify each assignment to a data member of the reduction object into one of the following types:
- O.x = g(e), where e is the input element
- O.x = O.x op g(e), where op is an associative and commutative operator
- Expression involving loop constants and other members of the reduction object
Classify control dependence on any of the above assignment statements as:
- Loop constant
- Non-loop constant
![Page 14: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/14.jpg)
Code Generation: Handling Different Types of Assignment Statements
Three types of assignment statements:
- O.x = g(e) (Type a)
  - If x can represent many fields, iterate over all of them
- O.x = O.x op g(e) (Type b)
  - Replace by O.x = O.x op O1.x
  - If x can represent many fields, iterate over all of them
- Expression involving loop constants and other data members (Type c)
  - Keep as it is
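As a small worked instance of the Type (b) and Type (c) rules, consider a reduction object that tracks a running sum, a count, and a mean derived from them. The class `Stats` and its members are hypothetical and chosen only for illustration. The global reduction is obtained by rewriting each statement of the local reduction according to its type.

```java
// Hypothetical reduction object; names are assumptions for illustration.
final class Stats {
    double sum;  // updated by a Type (b) assignment
    long count;  // updated by a Type (b) assignment
    double mean; // computed from other members: Type (c)

    // Local reduction, as it might appear in the body of the parallel loop.
    void localReduce(double e) {
        sum = sum + e;      // Type (b): O.x = O.x op g(e)
        count = count + 1;  // Type (b), with g(e) = 1
        mean = sum / count; // Type (c): depends only on other members
    }

    // Global reduction derived statement by statement.
    void globalReduce(Stats o1) {
        sum = sum + o1.sum;       // Type (b): g(e) replaced by O1.x
        count = count + o1.count; // Type (b)
        mean = sum / count;       // Type (c): kept as it is
    }
}
```

Reducing 1, 2, 3 on one node and 4, 5 on another, then combining the two objects, gives the same sum, count, and mean as reducing all five values on a single node.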
![Page 15: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/15.jpg)
Handling Control Flow
Control predicates for Type (b) assignments:
- Remove non-loop constant control predicates
- Keep loop constant control predicates
Control predicates for Type (a) and Type (c) statements:
- Keep loop constant control predicates
- Classify non-loop constant predicates into two types:
  - Predicate involves a value that is assigned to a data member: replace that value by the data member
  - Other predicates: simply remove
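The predicate rules can be illustrated with a reduction object that counts elements above a threshold and tracks a minimum. The class `RangeSummary`, its fields, and the `trackMin` flag are hypothetical. The sketch also assumes, by analogy with the motivating k-nearest-neighbor example, that a Type (a) assignment in the global reduction reads the corresponding member of the other object.

```java
// Hypothetical reduction object; names and fields are assumptions.
final class RangeSummary {
    final double threshold; // loop constant
    final boolean trackMin; // loop constant
    long above;             // updated by a Type (b) assignment
    double min = Double.POSITIVE_INFINITY; // updated by a Type (a) assignment

    RangeSummary(double threshold, boolean trackMin) {
        this.threshold = threshold;
        this.trackMin = trackMin;
    }

    // Local reduction over one input element e.
    void localReduce(double e) {
        if (e > threshold) { // non-loop constant predicate on a Type (b) update
            above = above + 1;
        }
        if (trackMin) {      // loop constant predicate
            if (e < min) {   // tests e, the value assigned to min
                min = e;     // Type (a): O.x = g(e)
            }
        }
    }

    // Global reduction derived from the local reduction.
    void globalReduce(RangeSummary o1) {
        above = above + o1.above; // non-loop constant predicate removed
        if (trackMin) {           // loop constant predicate kept
            if (o1.min < min) {   // e replaced by the data member it was assigned to
                min = o1.min;
            }
        }
    }
}
```

Dropping the threshold test in the global step is safe because each node's count already reflects it; only the combination with the operator remains.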
![Page 16: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/16.jpg)
Experimental Platform
Cluster of workstations
- Sun Ultra Enterprise 450
- 250 MHz Ultra-II processors
- 1 GB of 4-way interleaved main memory
- Myrinet as the interconnect
![Page 17: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/17.jpg)
Results from k-means clustering
[Figure: chart over 1, 2, 4, and 8 nodes; vertical axis from 0 to 160; series: comp, inline, comp + inline]
1 GB dataset with 3-dimensional points, k = 3
![Page 18: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/18.jpg)
Results from Apriori Association Mining
[Figure: chart over 1, 2, 4, and 8 nodes; vertical axis from 0 to 12000; series: comp, manual]
3 GB dataset
![Page 19: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/19.jpg)
Results from k-nearest neighbors
[Figure: chart over 1, 2, 4, and 8 nodes; vertical axis from 0 to 160; series: comp, manual]
1 GB dataset with 3-dimensional points, k = 100
![Page 20: Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information](https://reader036.vdocuments.us/reader036/viewer/2022072011/56649e305503460f94b21684/html5/thumbnails/20.jpg)
Summary
Focus on a new class of applications
Exploit the common structure within the class
Develop a runtime system supporting this structure
Use it as a compiler target
Very simple compiler implementation (< 1000 lines of code)
A new algorithm for synthesizing global reduction functions
Performance of compiler generated code is very competitive