TRANSCRIPT
StreamX10: A Stream Programming Framework on X10
Haitao Wei, 2012-06-14
School of Computer Science at Huazhong University of Science & Technology (HUST)
Outline
1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work
Background and Motivation
Stream Programming
- A high-level programming model that has been applied productively
- Usually depends on a specific architecture, which makes stream programs difficult to port between platforms
X10
- A productive parallel programming environment
- Isolates the details of different architectures
- Provides a flexible parallel programming abstraction layer for stream programming
StreamX10: an attempt to make stream programs portable by building on X10
COStream Language
stream: a FIFO queue connecting operators
operator: the basic functional unit, an actor node in the stream graph
- Multiple inputs and multiple outputs
- Window with pop, peek, and push operations
- init and work functions
composite: a subgraph of connected operators; a stream program is composed of composites
COStream and Stream Graph
composite Main {
  graph
    stream<int i> S = Source() {
      state: { int x; }
      init:  { x = 0; }
      work:  {
        S[0].i = x;
        x++;
      }
      window S: tumbling, count(1);
    }
    stream<int j> P = MyOp(S) {
      param pn: N
    }
    () as SinkOp = Sink(P) {
      state: { int r; }
      work:  {
        r = P[0].j;
        println(r);
      }
      window P: tumbling, count(1);
    }
}

composite MyOp(output Out; input In) {
  param attribute: pn
  graph
    stream<int j> Out = Averager(In) {
      work: {
        int sum = 0, i;
        for (i = 0; i < pn; i++)
          sum += In[i].j;
        Out[0].j = sum / pn;
      }
      window In: sliding, count(10), count(1);
             Out: tumbling, count(1);
    }
}
[Stream graph: Source --S--> Averager --P--> Sink; Source: push=1; Averager: peek=10, pop=1, push=1; Sink: pop=1; the figure labels the stream, operator, and composite levels]
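The window semantics above can be modeled outside COStream as well. The following plain-Java sketch (illustrative only, not generated compiler code) mimics the Averager's sliding window: each firing peeks 10 tokens, pushes one average, and pops one token to slide the window by one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Plain-Java model of the Averager operator's window semantics
// from the slides: peek=10, pop=1, push=1. Names are illustrative.
public class AveragerDemo {
    static final int PN = 10; // window size (the pn parameter)

    public static void main(String[] args) {
        Deque<Integer> in = new ArrayDeque<>();  // stream S
        Deque<Integer> out = new ArrayDeque<>(); // stream P
        int x = 0;

        // Source: push one token per firing (push=1).
        for (int i = 0; i < PN + 4; i++) in.addLast(x++);

        // Averager: fire while at least PN tokens are visible.
        while (in.size() >= PN) {
            int sum = 0, k = 0;
            for (int v : in) {          // peek the first PN tokens
                if (k++ == PN) break;
                sum += v;
            }
            out.addLast(sum / PN);      // push=1: one average per firing
            in.removeFirst();           // pop=1: slide the window by one
        }

        // Sink: print the averages (pop=1 per firing).
        System.out.println(out);
    }
}
```

With 14 input tokens the operator fires 5 times, averaging windows [0..9], [1..10], and so on.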
Compilation flow of StreamX10
Phase               | Function
Front-end           | Translates COStream syntax into an abstract syntax tree.
Instantiation       | Instantiates the composites hierarchically into static flattened operators.
Static Stream Graph | Constructs the static stream graph from the flattened operators.
Scheduling          | Calculates initialization and steady-state execution orderings of the operators.
Partitioning        | Performs partitioning based on the X10 parallelism model for load balance.
Code Generation     | Generates X10 code for COStream programs.
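The scheduling phase above computes how many times each operator fires per steady-state iteration; for synchronous dataflow this follows from the standard balance equations. A minimal Java sketch for a chain of operators (the rates and the chain itself are illustrative, not taken from the slides):

```java
// Minimal sketch of steady-state scheduling for a chain of operators,
// using the synchronous-dataflow balance equations:
//   reps[i] * push[i] == reps[i+1] * pop[i]  for each edge i.
// The compiler derives the rates from each operator's window declaration;
// the rates below are made up for illustration.
public class SteadySchedule {
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

    // push[i]: tokens produced per firing onto edge i;
    // pop[i]:  tokens consumed per firing from edge i.
    static long[] repetitions(long[] push, long[] pop) {
        int n = push.length + 1;            // number of operators
        long[] reps = new long[n];
        reps[0] = 1;
        for (int i = 0; i < push.length; i++) {
            long lhs = reps[i] * push[i];
            long scale = pop[i] / gcd(lhs, pop[i]);
            if (scale > 1)                  // rescale earlier operators so
                for (int j = 0; j <= i; j++) reps[j] *= scale; // reps stay integral
            reps[i + 1] = reps[i] * push[i] / pop[i];
        }
        return reps;
    }

    public static void main(String[] args) {
        // Source --push 2--> Filter(pop 3, push 1) --> Sink(pop 1)
        long[] push = {2, 1};
        long[] pop  = {3, 1};
        System.out.println(java.util.Arrays.toString(repetitions(push, pop)));
    }
}
```

Here the Source must fire 3 times and the Filter and Sink twice each per steady-state iteration, so every FIFO returns to its initial occupancy.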
The Execution Framework
[Figure: Places 0-2, each running activities over a thread pool; local buffer objects carry intra-place data flow, a global buffer object carries inter-place data flow]
- The nodes are partitioned among the places
- Each node is mapped to an activity
- The nodes run in a pipelined fashion to exploit parallelism
- Local and global FIFO buffers are used for communication
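The execution model can be approximated in plain Java (not X10): each node becomes its own "activity" (a thread here), connected by FIFO buffers, so the three nodes overlap in pipeline fashion. Node names and the doubling operator are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Java analogue of the execution framework: one thread per stream node,
// blocking queues as the FIFO buffers between them.
public class PipelineDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> s = new ArrayBlockingQueue<>(16); // stream S
        BlockingQueue<Integer> p = new ArrayBlockingQueue<>(16); // stream P
        final int N = 5;

        Thread source = new Thread(() -> {          // Source node
            try { for (int x = 0; x < N; x++) s.put(x); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread op = new Thread(() -> {              // middle node: double each token
            try { for (int i = 0; i < N; i++) p.put(s.take() * 2); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread sink = new Thread(() -> {            // Sink node: print results
            try { for (int i = 0; i < N; i++) System.out.println(p.take()); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start(); op.start(); sink.start();
        source.join(); op.join(); sink.join();
    }
}
```

In StreamX10 the same shape is spawned with `async at (p)` per place instead of threads, and inter-place buffers replace the shared queues.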
Work Partition Inter-Place
Objective: minimize communication and balance load (using METIS)
[Figure: a weighted stream graph partitioned across three places; each partition has computation work = 10]
Speedup: 30/10 = 3; Communication: 2
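The two quantities the METIS-based partitioner trades off can be computed directly: per-place load (for balance and the speedup bound) and the weight of edges crossing places (communication). The sketch below is illustrative; the node weights, edges, and assignment are made up, not the compiler's actual data.

```java
import java.util.Arrays;

// Given node work weights, weighted edges, and a node -> place assignment,
// compute per-place load and the inter-place communication cut.
// All numbers are invented for illustration.
public class PartitionCost {
    public static void main(String[] args) {
        int[] work = {10, 5, 5, 5, 5};             // computation per node
        int[][] edges = {{0,1,2},{0,2,2},{1,3,1},{2,3,1},{3,4,2}}; // {u,v,weight}
        int[] place = {0, 1, 1, 2, 2};             // node -> place

        int[] load = new int[3];                   // load per place
        for (int n = 0; n < work.length; n++) load[place[n]] += work[n];

        int cut = 0;                               // inter-place communication
        for (int[] e : edges)
            if (place[e[0]] != place[e[1]]) cut += e[2];

        int total = Arrays.stream(work).sum();
        int max = Arrays.stream(load).max().getAsInt();
        System.out.println("loads=" + Arrays.toString(load)
                + " cut=" + cut + " speedup=" + total + "/" + max);
    }
}
```

A perfectly balanced assignment gives speedup total/max = 30/10 = 3, matching the slide's example; the partitioner then prefers, among balanced assignments, the one with the smallest cut.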
Global FIFO implementation
[Figure: producer at Place 0 and consumer at Place 1, each with a local array, connected through a DistArray; push on the producer side, peek/pop on the consumer side, with copies between the local arrays and the DistArray]
- Each producer/consumer has its own local buffer
- The producer uses the push operation to store data into its local buffer
- The consumer uses peek/pop operations to fetch data from its local buffer
- When the local buffer becomes full (producer side) or empty (consumer side), the data is copied automatically to or from the global buffer
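The batching idea behind the global FIFO can be sketched in plain Java: each side works on a small local buffer, and whole buffers are flushed to or refilled from a shared "global" queue (standing in for the DistArray copy between places). Class and field names are made up for the sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative batched FIFO: local buffers amortize access to the
// shared global queue, mirroring local-array <-> DistArray copies.
public class BatchedFifo {
    static final int BATCH = 4;
    final BlockingQueue<int[]> global = new ArrayBlockingQueue<>(8);
    int[] prodBuf = new int[BATCH]; int prodN = 0;   // producer-local buffer
    int[] consBuf = new int[0];     int consI = 0;   // consumer-local buffer

    void push(int v) throws InterruptedException {   // producer side
        prodBuf[prodN++] = v;
        if (prodN == BATCH) {                        // local buffer full:
            global.put(prodBuf);                     // copy batch to global
            prodBuf = new int[BATCH]; prodN = 0;
        }
    }

    int pop() throws InterruptedException {          // consumer side
        if (consI == consBuf.length) {               // local buffer empty:
            consBuf = global.take();                 // refill from global
            consI = 0;
        }
        return consBuf[consI++];
    }

    public static void main(String[] args) throws InterruptedException {
        BatchedFifo f = new BatchedFifo();
        for (int i = 0; i < 8; i++) f.push(i);       // two full batches
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) sb.append(f.pop()).append(' ');
        System.out.println(sb.toString().trim());
    }
}
```

Copying a batch at a time matters across places, where each global transfer is a remote operation that is far more expensive than a local array access.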
X10 code in the Back-end
- Spawn an activity for each node at its place according to the partition
- Call the work function following the initialization and steady-state schedules
- Define the work function
//main.x10 control code
public static def main() {
  ...
  finish for (p in Place.places()) async at (p) {
    switch (p.id) {
      case 0:
        val a_0 = new Source_0(rc); a_0.run(); break;
      case 1:
        val a_2 = new MovingAver_2(rc); a_2.run(); break;
      case 2:
        val a_1 = new Sink_1(rc); a_1.run(); break;
      default: break;
    }
  }
  ...
}

//Source.x10 code
...
def work() {
  ...
  push_Source_0_Sink_1(0).x = x;
  x += 1.0;
  pushTokens();
  popTokens();
}
public def run() {
  initWork(); // init
  // initSchedule
  for (var j:Int = 0; j < Source_0_init; j++) work();
  // steadySchedule
  for (var i:Int = 0; i < RepeatCount; i++)
    for (var j:Int = 0; j < Source_0_steady; j++) work();
  flush();
}
...
Experimental Platform and Benchmarks
Platform
- Intel Xeon processor (8 cores) at 2.4 GHz with 4 GB memory
- Red Hat EL5 with Linux 2.6.18
- X10 compiler and runtime version 2.2.0
Benchmarks
- 11 benchmarks rewritten from StreamIt
The throughputs comparison
[Chart: throughput of 4 configurations (NPLACES * NTHREADS = 8), normalized to 1 place with 8 threads; y-axis: normalized throughput from 0 to 10; legend: NPLACES=1 NTHREADS=8; NPLACES=2 NTHREADS=4; NPLACES=4 NTHREADS=2; NPLACES=8 NTHREADS=1]
- For most benchmarks, CPU utilization increases from 24% to 89% as the number of places varies from 1 to 4, except for benchmarks with a low computation/communication ratio
- The benefit is small or negative when the number of places increases from 4 to 8
Observation and Analysis
- Throughput goes up as the number of places increases, because multiple places raise CPU utilization
- Multiple places expose parallelism but also bring more communication overhead
- Benchmarks with heavier computation workloads, like DES and Serpent_full, can still benefit as the number of places keeps increasing
Conclusion
- We proposed and implemented StreamX10, a stream programming language and compilation system on X10
- An initial partitioning optimization is proposed to exploit parallelism based on the X10 execution model
- A preliminary experiment was conducted to study the performance
Future Work
How to choose the best configuration (# of places and # of threads) automatically for each benchmark
How to decrease the thread switching overhead by mapping multiple nodes to the single activity
Acknowledgment
X10 Innovation Award founding support
QiMing Teng, Haibo Lin and David P. Grove at IBM for their help on this research