![Page 1: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/1.jpg)
Profile Guided Deployment of Stream Programs on Multicores
S. M. Farhad
The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz
![Page 2: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/2.jpg)
Outline
Motivation
  Multicore trend
  Stream programming
Research Questions
  How to profile communication overhead on multicores?
  How to deploy stream programs?
Related works
![Page 3: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/3.jpg)
Motivation
[Figure: number of cores per chip (1 to 512) plotted against year (1975 to 2010). Legend: Unicore, Homogeneous Multicore, Heterogeneous Multicore. Chips shown include 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, CoreDuo, Core2Duo, Core2Quad, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, and NVIDIA G80. Courtesy: Scott'08]
Programming models shown: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming
![Page 4: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/4.jpg)
Stream Programming Paradigm
Programs are expressed as stream graphs
Streams: infinite sequences of data elements
Actors: functions applied to streams
[Figure: an actor with an incoming stream and an outgoing stream]
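As a rough illustration of the paradigm, the sketch below models streams as Python generators and actors as functions that consume one stream and produce another. The actor names (`source`, `scale`, `moving_sum`) are hypothetical and are not taken from the talk.

```python
from itertools import count, islice

def source():
    """An infinite stream of data elements: 0, 1, 2, ..."""
    yield from count(0)

def scale(stream, factor):
    """Actor: multiplies every element of its input stream."""
    for x in stream:
        yield x * factor

def moving_sum(stream, window):
    """Actor: emits the sum of the last `window` elements."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == window:
            yield sum(buf)
            buf.pop(0)

# A small stream graph: source -> scale -> moving_sum.
graph = moving_sum(scale(source(), 2), 3)

# The stream is infinite, so only a prefix is taken.
first = list(islice(graph, 4))
print(first)
```

Each actor only sees its input and output streams, which is what makes the producer/consumer dependencies explicit.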
![Page 5: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/5.jpg)
Properties of Stream Programs
Regular and repeating computation
Independent actors with explicit communication
Producer/consumer dependencies
[Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF1, LPF2, LPF3, HPF1, HPF2, HPF3, Joiner, Adder, Speaker]
![Page 6: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/6.jpg)
StreamIt Language
An implementation of stream programming
Hierarchical structure
Each construct has a single input stream and a single output stream
Constructs: filter, pipeline, splitjoin (splitter, parallel computation, joiner), feedback loop (joiner, splitter)
Each component may be any StreamIt language construct
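The hierarchical composition can be mimicked in a few lines of Python. This is only a sketch of the idea, not StreamIt syntax: `filter_`, `pipeline`, and `splitjoin` are hypothetical helper names, the splitter is assumed to duplicate its input, the joiner is assumed to be round-robin, and the feedback loop is left out.

```python
from itertools import tee

def filter_(fn):
    """Leaf construct: applies fn to every element of the stream."""
    def run(stream):
        return (fn(x) for x in stream)
    return run

def pipeline(*stages):
    """Connects constructs in sequence."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

def splitjoin(*branches):
    """Duplicate splitter, parallel branches, round-robin joiner."""
    def run(stream):
        copies = tee(stream, len(branches))
        outputs = [branch(copy) for branch, copy in zip(branches, copies)]
        for items in zip(*outputs):
            yield from items
    return run

# Every construct maps one input stream to one output stream,
# so constructs can be nested freely.
program = pipeline(
    filter_(lambda x: x + 1),
    splitjoin(filter_(lambda x: x * 10), filter_(lambda x: -x)),
    filter_(lambda x: x * 2),
)

result = list(program(iter([1, 2, 3])))
print(result)
```

Because each construct has one input and one output, a `splitjoin` can sit inside a `pipeline` and vice versa without special cases.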
![Page 7: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/7.jpg)
Outline
Motivation
  Multicore trend
  Stream programming
Research Questions
  How to profile communication overhead on multicores?
  How to deploy stream programs?
Related works
![Page 8: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/8.jpg)
How to Estimate the Communication Overhead on Multicores?
![Page 9: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/9.jpg)
Problems in Measuring Communication Overhead on Multicores
Reasons:
  Multicores are not communication-exposed architectures
  Complex cache hierarchy
  Cache coherence protocols
Consequences:
  The communication cost cannot be measured directly
  Estimate the communication cost by measuring the execution time of actors
![Page 10: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/10.jpg)
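Measuring the Communication Overhead of an Edge
No communication cost: actors i and k both run on Processor 1, with execution times $t_i$ and $t_k$
With communication cost: actors i and k run on Processor 1 and Processor 2, with execution times $t'_i$ and $t'_k$
$$C_{(i,k)} = (t'_i - t_i) + (t'_k - t_k)$$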
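A minimal sketch of this estimate, assuming the cost of an edge is the increase in the two actors' execution times when they are moved from the same processor to different processors. The function name and the timing values are hypothetical.

```python
def edge_communication_cost(t_same, t_split, i, k):
    """Estimate the communication cost of edge (i, k).

    t_same[x]:  execution time of actor x when i and k share a processor
    t_split[x]: execution time of actor x when i and k are on different
                processors (assumed to include the communication overhead)
    """
    return (t_split[i] - t_same[i]) + (t_split[k] - t_same[k])

# Hypothetical measurements in microseconds.
t_same = {"i": 10.0, "k": 20.0}
t_split = {"i": 12.5, "k": 23.0}

cost = edge_communication_cost(t_same, t_split, "i", "k")
print(cost)
```

The estimate relies only on actor execution times, which can be measured even though the communication itself is hidden by the cache hierarchy.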
![Page 11: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/11.jpg)
How to Minimize the Required Number of Experiments
Pipeline A, B, C with edges 1 and 2: graph coloring requires 2 + 1 steps
[Figure: a longer pipeline (actors A to F, edges 1 to 5) split across Processor 1 and Processor 2 in two ways: once with the even edges across the partition, once with the odd edges across the partition]
![Page 12: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/12.jpg)
Obs. 1: There is no loop of three actors in a stream graph
[Figure: actors i, k, and l placed on Processor 1 and Processor 2]
![Page 13: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/13.jpg)
Obs. 2: There is no interference of adjacent nodes between edges
[Figure: stream graph with actors A to F; for the blue edges, the actors are placed on processors P-1 to P-4]
![Page 14: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/14.jpg)
Remove Interference
Convert to a line graph
Add interference edges
Use a vertex coloring algorithm
[Figure: stream graph with actors A to F and its line graph with nodes AB, BC, BD, CE, DE, EF]
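The three steps can be sketched as follows. The slide does not define the interference edges, so they are passed in as an optional, hypothetical `extra_interference` argument, and a simple greedy heuristic stands in for whatever vertex coloring algorithm the authors use. Edges that receive the same color can be profiled in the same step.

```python
from itertools import combinations

def line_graph(edges, extra_interference=()):
    """Nodes are stream-graph edges; two nodes are adjacent when the
    edges share an actor or are listed as interfering."""
    adj = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adj[e].add(f)
            adj[f].add(e)
    for e, f in extra_interference:
        adj[e].add(f)
        adj[f].add(e)
    return adj

def greedy_coloring(adj):
    """Give each node the smallest color unused by its neighbours."""
    color = {}
    for node in adj:
        used = {color[n] for n in adj[node] if n in color}
        color[node] = next(c for c in range(len(adj)) if c not in used)
    return color

# Stream graph from the slide: edges AB, BC, BD, CE, DE, EF.
edges = [("A", "B"), ("B", "C"), ("B", "D"),
         ("C", "E"), ("D", "E"), ("E", "F")]
adj = line_graph(edges)
colors = greedy_coloring(adj)
for edge, c in colors.items():
    print("".join(edge), "-> profiling step", c)
```

Greedy coloring does not guarantee the minimum number of colors in general, and the color classes it finds need not match the ones drawn on the slides.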
![Page 15: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/15.jpg)
Processor Leveling Graph
[Figure: for the blue edges of the stream graph (actors A to F), the processor leveling graph has the nodes A; B, C, D, E; and F]
![Page 16: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/16.jpg)
Coloring the Processor Leveling Graph
[Figure: the nodes A; B, C, D, E; and F of the processor leveling graph are colored with Processor 1 and Processor 2]
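One plausible reading of these two slides, sketched below under stated assumptions: actors joined by edges that are not being profiled in the current step are merged into one node, and the resulting graph is then two-colored so that every profiled edge crosses from Processor 1 to Processor 2. The function names and the union-find and breadth-first details are assumptions, not the authors' implementation.

```python
from collections import defaultdict, deque

def leveling_groups(actors, edges, profiled):
    """Merge actors connected by non-profiled edges (union-find)."""
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for u, v in edges:
        if (u, v) not in profiled:
            parent[find(u)] = find(v)
    return {a: find(a) for a in actors}

def assign_processors(actors, edges, profiled):
    """Two-color the merged graph so each profiled edge crosses processors."""
    group = leveling_groups(actors, edges, profiled)
    adj = defaultdict(set)
    for u, v in profiled:
        if group[u] == group[v]:
            raise ValueError("profiled edge ends up inside one group")
        adj[group[u]].add(group[v])
        adj[group[v]].add(group[u])
    proc = {}
    for actor in actors:
        start = group[actor]
        if start in proc:
            continue
        proc[start] = 1
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in adj[g]:
                if h not in proc:
                    proc[h] = 3 - proc[g]  # alternate between 1 and 2
                    queue.append(h)
                elif proc[h] == proc[g]:
                    raise ValueError("merged graph is not two-colorable")
    return {a: proc[group[a]] for a in actors}

actors = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("B", "D"),
         ("C", "E"), ("D", "E"), ("E", "F")]
blue = {("A", "B"), ("E", "F")}
mapping = assign_processors(actors, edges, blue)
print(mapping)
```

For the blue edges AB and EF this yields the three nodes A; B, C, D, E; and F, with A and F on one processor and the middle group on the other.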
![Page 17: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/17.jpg)
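Measuring the Communication Cost
[Figure: stream graph with actors A to F and its processor leveling graph (A; B, C, D, E; F) mapped to Processor 1 and Processor 2, for the blue edges]
$$C_{(A,B)} = (t'_A - t_A) + (t'_B - t_B)$$
$$C_{(E,F)} = (t'_E - t_E) + (t'_F - t_F)$$
Here $t_X$ is the execution time of actor X without communication and $t'_X$ its execution time when the blue edge crosses processors.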
![Page 18: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/18.jpg)
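Profiling Performance

| Benchmark | Total Edges | Prof. Steps | Steps/Edge (%) | Err (%) |
|---|---|---|---|---|
| SAR | 44 | 3 | 7 | 10 |
| MatrixMult | 88 | 21 | 24 | 17 |
| MergeSort | 37 | 4 | 11 | 31 |
| FMRadio | 21 | 3 | 14 | 24 |
| DCT | 28 | 9 | 32 | 14 |
| RadixSort | 12 | 2 | 17 | 5 |
| FFT | 26 | 3 | 12 | 27 |
| MPEG | 56 | 17 | 30 | 15 |
| Channel | 22 | 6 | 27 | 11 |
| BeamFormer | 39 | 5 | 13 | 13 |
| GM | | | 17 | 15 |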
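The derived columns can be checked with a short script. It assumes that Steps/Edge is the number of profiling steps divided by the total number of edges, and that GM denotes the geometric mean over the ten benchmarks; the data are copied from the table.

```python
from statistics import geometric_mean

# benchmark: (total edges, profiling steps, error in %)
rows = {
    "SAR": (44, 3, 10),
    "MatrixMult": (88, 21, 17),
    "MergeSort": (37, 4, 31),
    "FMRadio": (21, 3, 24),
    "DCT": (28, 9, 14),
    "RadixSort": (12, 2, 5),
    "FFT": (26, 3, 27),
    "MPEG": (56, 17, 15),
    "Channel": (22, 6, 11),
    "BeamFormer": (39, 5, 13),
}

# Profiling steps as a percentage of the number of edges.
steps_per_edge = {name: 100 * steps / edges
                  for name, (edges, steps, _) in rows.items()}

gm_steps = geometric_mean(list(steps_per_edge.values()))
gm_err = geometric_mean([err for _, _, err in rows.values()])

for name, pct in steps_per_edge.items():
    print(f"{name:12s} {pct:5.1f}%")
print(f"GM steps/edge: {gm_steps:.1f}%  GM error: {gm_err:.1f}%")
```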
![Page 19: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/19.jpg)
Outline
Motivation
  Multicore trend
  Stream programming
Research Questions
  How to profile communication overhead?
  How to deploy stream programs?
Related works
![Page 20: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/20.jpg)
Deployment of Stream Programs
[Figure: actors A (5), B (40), C (40), D (5) mapped to Processor 1 and Processor 2; the edges are labeled with communication costs of 25 and 5]
Load = (5 + 40) + 5 = 50
Load = (40 + 5) + 5 = 50
Makespan = 50, Speedup = 90/50 = 1.8
![Page 21: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/21.jpg)
Deploying Stream Programs without Considering Communication
[Figure: the same actors A (5), B (40), C (40), D (5), now with A and C on one processor and B and D on the other; the edges are labeled with communication costs of 25 and 5]
Load = (5 + 40) + (25 + 5 + 25) = 100
Load = (40 + 5) + (25 + 5 + 25) = 100
Makespan = 100, Speedup = 90/100 = 0.9
Increase in makespan over the communication-aware deployment = (100 − 50) × 100% / 50 = 100%
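The two deployments can be reproduced with a small cost model. The sketch assumes a pipeline A, B, C, D with edge costs 25, 5, and 25, and assumes that the cost of an edge crossing processors is charged to both processors; the slides do not state this explicitly, but it matches their load arithmetic. The exhaustive search is only a stand-in for illustration and is not the ILP formulation mentioned in the talk.

```python
from itertools import product

compute = {"A": 5, "B": 40, "C": 40, "D": 5}
# Assumed topology and costs (inferred from the loads on the slides).
comm = {("A", "B"): 25, ("B", "C"): 5, ("C", "D"): 25}

def loads(mapping, processors=2):
    """Per-processor load: computation plus cost of crossing edges."""
    load = [0] * processors
    for actor, p in mapping.items():
        load[p] += compute[actor]
    for (u, v), cost in comm.items():
        if mapping[u] != mapping[v]:
            load[mapping[u]] += cost
            load[mapping[v]] += cost
    return load

def makespan(mapping):
    return max(loads(mapping))

sequential = sum(compute.values())  # 90 on a single processor

aware = {"A": 0, "B": 0, "C": 1, "D": 1}
unaware = {"A": 0, "C": 0, "B": 1, "D": 1}
for name, m in (("aware", aware), ("unaware", unaware)):
    print(name, loads(m), "speedup", sequential / makespan(m))

# Exhaustive search over all two-processor mappings.
best = min((dict(zip(compute, assignment))
            for assignment in product(range(2), repeat=len(compute))),
           key=makespan)
print("best mapping", best, "makespan", makespan(best))
```

Balancing computation alone gives both mappings 45 units per processor, so only the communication term separates them.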
![Page 22: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/22.jpg)
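Deployment Performance
Here m is read as the makespan of the communication-aware deployment and m′ as the makespan when communication is ignored, following the comparison on the previous slide.

| Benchmark | m (µs) | m′ (µs) | (m′ − m)/m (%) |
|---|---|---|---|
| SAR | 45.54 | 45.54 | 0 |
| MatrixMult | 67.80 | 111.14 | 64 |
| MergeSort | 1.63 | 6.99 | 329 |
| FMRadio | 1.57 | 7.00 | 346 |
| DCT | 4.64 | 7.68 | 66 |
| RadixSort | 1.49 | 3.08 | 107 |
| FFT | 18.28 | 34.15 | 87 |
| MPEG | 37.26 | 37.26 | 0 |
| Channel | 89.00 | 91.20 | 2 |
| BeamFormer | 7.29 | 7.29 | 0 |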
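The last column can be recomputed from the two makespan columns. The sketch assumes the first column is the smaller, communication-aware makespan and the second the makespan obtained when communication is ignored; the variable names are hypothetical.

```python
# benchmark: (first makespan column, second makespan column), in microseconds
results = {
    "SAR": (45.54, 45.54),
    "MatrixMult": (67.80, 111.14),
    "MergeSort": (1.63, 6.99),
    "FMRadio": (1.57, 7.00),
    "DCT": (4.64, 7.68),
    "RadixSort": (1.49, 3.08),
    "FFT": (18.28, 34.15),
    "MPEG": (37.26, 37.26),
    "Channel": (89.00, 91.20),
    "BeamFormer": (7.29, 7.29),
}

def relative_increase(m, m_other):
    """Percentage by which m_other exceeds m, rounded to an integer."""
    return round(100 * (m_other - m) / m)

increase = {name: relative_increase(m, m_other)
            for name, (m, m_other) in results.items()}
for name, pct in increase.items():
    print(f"{name:12s} {pct:4d}%")
```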
![Page 23: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/23.jpg)
Speedups obtained for 2, 4 and 6 processors
![Page 24: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/24.jpg)
Summary
We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
We propose an ILP-based approach that minimizes the makespan
We conducted experiments:
  The number of profiling steps is on average only 17% of the number of edges
  The profiling scheme shows only 15% error on average in the random mapping test
  The deployment obtains a speedup of 3.11x on 4 processors and 4.02x on 6 processors
![Page 25: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/25.jpg)
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]
![Page 26: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/26.jpg)
Questions?
![Page 27: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/27.jpg)
Minimizing Errors in the Profiling Process
Errors are likely in any profiling process
We chose an architecture that has a uniform cache hierarchy
We pin the threads using the likwid-pin tool
![Page 28: Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz](https://reader036.vdocuments.us/reader036/viewer/2022062500/5697bfa01a28abf838c95444/html5/thumbnails/28.jpg)
Cache Topology of the Processor
Processor: 800MHz hexa-core AMD Phenom(tm) II X6 1090T
Cores #0 to #5, each with its own L1 cache (64kB) and its own L2 cache (512kB)
L3: 6MB, shared by all six cores