University of Michigan, Electrical Engineering and Computer Science
Flextream: Adaptive Compilation of Streaming
Applications for Heterogeneous Architectures
Amir Hormati1, Yoonseo Choi1, Manjunath Kudlur3, Rodric Rabbah2, Trevor Mudge1, and Scott Mahlke1
1 University of Michigan
2 IBM T.J. Watson Research Lab.
3 NVIDIA Corp.
Cores are the New Gates
[Figure: number of cores per chip (1 to 512) versus year (1975 to 2010), with chips grouped as unicore, homogeneous multicore, and heterogeneous multicore. Chips shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, CoreDuo, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80, Core, Core2Duo, Core2Quad. (Shekhar Borkar, Intel.) Courtesy: Gordon’06]
[Figure: programming models for these systems: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and stream programming. Courtesy: Gordon’06]
Streaming Computing Is Everywhere!
• Prevalent computing domain with applications in embedded systems, desktops and high-end servers
StreamIt
• Main constructs:
– Filter: encapsulates computation (stateful or stateless).
– Pipeline: expresses pipeline parallelism.
– Splitjoin: expresses task/data-level parallelism.
• Exposes different types of parallelism.
[Figure: example stream graph with the pipeline, filter, and splitjoin constructs labeled.]
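The three constructs compose hierarchically. As a rough illustration only (this is Python, not StreamIt syntax, and the class names, the round-robin splitter, and the sequential execution are all assumptions made for the sketch), a stateless filter, a pipeline, and a splitjoin might be modeled like this:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    """Stateless filter: applies fn to each item independently."""
    fn: Callable[[int], int]

    def run(self, items: List[int]) -> List[int]:
        return [self.fn(x) for x in items]


@dataclass
class Pipeline:
    """Runs its stages one after another (source of pipeline parallelism)."""
    stages: list

    def run(self, items: List[int]) -> List[int]:
        for stage in self.stages:
            items = stage.run(items)
        return items


@dataclass
class SplitJoin:
    """Round-robin splitter, independent branches, round-robin joiner."""
    branches: list

    def run(self, items: List[int]) -> List[int]:
        n = len(self.branches)
        # Branch i receives items i, i+n, i+2n, ...
        outputs = [b.run(items[i::n]) for i, b in enumerate(self.branches)]
        joined = []
        for i in range(max(len(o) for o in outputs)):
            joined.extend(o[i] for o in outputs if i < len(o))
        return joined


double = Filter(lambda x: 2 * x)
increment = Filter(lambda x: x + 1)
graph = Pipeline([double, SplitJoin([increment, increment])])
result = graph.run([1, 2, 3, 4])
```

Because `increment` is stateless, running two copies of it inside the splitjoin gives the same output as running it once, which is what makes replicating such filters safe. The sketch executes everything sequentially; it only shows the data flow, not the parallel execution.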
StreamIt Graph Tuning
• Parallelism can be tuned in streaming programs:
– Horizontal replication
– Horizontal fusion
– Vertical fusion
StreamIt Example
[Figure: three versions of the same stream graph, annotated with work estimates. Original: A (10), a splitjoin of B1 and B2 (43 each; splitter and joiner 6 each), C (246), D (326), E (566), F (10). Fused: A (10), B (86), CDE (1138), F (10). Replicated: B1 to B4 (21.5 each) in the splitjoin, with A, C, D, E, and F unchanged.]
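The work estimates in the example follow from simple arithmetic: fusion adds the work of the fused actors, and replication divides an actor's work evenly among its copies. A minimal sketch, assuming splitter and joiner costs are ignored (the function names are illustrative, not part of StreamIt):

```python
def fuse(work_estimates):
    """Fusing actors yields one actor whose work is the sum of the parts."""
    return sum(work_estimates)


def replicate(work, ways):
    """Replicating an actor splits its work evenly across the copies."""
    return [work / ways] * ways


b_fused = fuse([43, 43])            # horizontal fusion of B1 and B2
cde_fused = fuse([246, 326, 566])   # vertical fusion of C, D, and E
b_replicas = replicate(b_fused, 4)  # horizontal replication of B, four ways
```

Total work is conserved by both transformations; only its granularity changes.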
What Are We Solving?
[Figure: a stream graph (A, B1 to B4, C, D1 and D2, E, F, with splitters and joiners) to be mapped onto four cores, each with its own memory.]
• Graph modulo scheduling is performed on a stream graph statically.
• What happens when resources change dynamically?
Target Architecture
[Figure: a master processor and multiple slave processors, each slave with its own local store and DMA engine, connected to memory through an interconnect.]
• Master processor acts as a controller.
• Each slave processor has its own local store and DMA engine.
• An interconnect network connects all the components together.
Overview of Flextream
Goal: to perform adaptive stream graph modulo scheduling.
Input: streaming application. Output: MSL commands.
• Static phase: finds an optimal schedule for a virtualized member of a family of processors.
– Prepass Replication: adjusts the amount of parallelism for the target system by replicating actors.
– Work Partitioning: finds an optimal modulo schedule for a virtualized member of a family of processors.
• Dynamic phase: performs lightweight adaptation of the schedule for the current configuration of the target hardware.
– Partition Refinement: tunes the actor-processor mapping to the real configuration of the underlying hardware (load balance).
– Stage Assignment: specifies how actors execute in time in the new actor-processor mapping.
– Buffer Allocation: tries to efficiently allocate the storage requirements of the new schedule into the available memory units.
MSL : Multi-Core Streaming Layer
• Instruction set for heterogeneous multi-core systems
• A set of high-level commands for:
– Actor commands (loading/unloading)
– Buffer commands (allocating local/global buffers)
– Data transfer commands (managing DMAs)
• Flextream’s online layer uses these commands to adapt the static schedule
Overall Execution Flow
• Every application may see multiple iterations of:
– Resource change request
– Resource change granted
Prepass Replication [static 1]
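[Figure: replicating actors to balance load across eight processors, P0 to P7. Work estimates: A 10, B 86, C 246, D 326, E 566, F 10. With one actor per processor the loads are P0 10, P1 86, P2 246, P3 326, P4 566, P5 10, P6 0, P7 0. Splitting E into E0 and E1 gives 283 on P4 and P6; splitting D into D0 and D1 gives 163 on P3 and P7. The final graph replicates C four ways (61.5 each), D two ways (163 each), and E four ways (141.5 each), with splitters S0 to S2 and joiners J0 to J2 (6 each). Final loads: P0 151.5, P1 147.5, P2 184.5, P3 163, P4 141.5, P5 151.5, P6 141.5, P7 163.]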
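The final loads on this slide can be reproduced from the work estimates and replication factors. The sketch below is a check of that arithmetic, not the replication algorithm itself: the replication factors are taken from the slide, while the actor-to-processor `mapping` is an assumption, chosen as one assignment that is consistent with the loads shown (splitter and joiner costs are left out, as they appear to be in the slide's totals).

```python
work = {"A": 10, "B": 86, "C": 246, "D": 326, "E": 566, "F": 10}
factors = {"C": 4, "D": 2, "E": 4}  # replication factors shown on the slide


def replicate_all(work, factors):
    """Return per-replica work; unreplicated actors keep their name."""
    replicas = {}
    for actor, w in work.items():
        k = factors.get(actor, 1)
        if k == 1:
            replicas[actor] = w
        else:
            for i in range(k):
                replicas[f"{actor}{i}"] = w / k
    return replicas


def processor_loads(replicas, mapping):
    """Sum the work of the actors placed on each processor."""
    return [sum(replicas[a] for a in actors) for actors in mapping]


replicas = replicate_all(work, factors)

# Assumed placement (P0..P7); the slide gives only the resulting loads.
mapping = [["A", "E0"], ["B", "C0"], ["C1", "C2", "C3"], ["D0"],
           ["E1"], ["F", "E2"], ["E3"], ["D1"]]
loads = processor_loads(replicas, mapping)
```

Before replication the bottleneck is E at 566; afterwards the heaviest processor carries 184.5, so the same total work of 1244 is spread far more evenly.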
Work Partitioning [static 2]
• Finds the optimal actor-to-processor mapping considering:
– Actors’ work estimates
– Communication cost
– DMA cost
– Memory requirements
• At the end, each actor is assigned to exactly one processor.
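To make the objective concrete, here is a toy stand-in for the partitioner. The slides do not say how the optimal mapping is computed, so this sketch simply enumerates every mapping of a tiny graph and keeps the one with the smallest maximum processor load. The cost model (a flat `comm_cost` charged to both endpoints of every edge that crosses processors, with DMA cost and memory limits ignored) and the example graph are assumptions for illustration.

```python
from itertools import product


def best_mapping(work, edges, num_procs, comm_cost):
    """Exhaustively search for the mapping with the smallest maximum load."""
    actors = sorted(work)
    best_cost, best_assignment = None, None
    for choice in product(range(num_procs), repeat=len(actors)):
        proc_of = dict(zip(actors, choice))  # each actor on exactly one processor
        load = [0.0] * num_procs
        for actor in actors:
            load[proc_of[actor]] += work[actor]
        for src, dst in edges:
            if proc_of[src] != proc_of[dst]:
                # Crossing edge: both sides pay the assumed communication cost.
                load[proc_of[src]] += comm_cost
                load[proc_of[dst]] += comm_cost
        cost = max(load)
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, proc_of
    return best_cost, best_assignment


work = {"A": 10, "B": 86, "C": 61.5, "F": 10}
edges = [("A", "B"), ("B", "C"), ("C", "F")]
cost, proc_of = best_mapping(work, edges, num_procs=2, comm_cost=6)
```

In this example the best split isolates the heavy actor B on its own processor even though that cuts two edges. Exhaustive search grows exponentially with the number of actors, so it only serves to show what is being optimized.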
Partition Refinement [dynamic 1]
• The resources available at runtime can be more limited than those assumed by the static target architecture.
• Partition refinement tunes the actor-to-processor mapping for the active configuration.
• A greedy iterative algorithm is used to achieve this goal.
Partition Refinement Example
• Pick the processors with the most actors.
• Sort the actors.
• Find the processor with the maximum work.
• Assign the smallest actors until the threshold is reached.
[Figure: refinement example on eight processors. Initial loads: P0 184.5, P1 141.5, P2 171.5, P3 141.5, P4 151.5, P5 173, P6 140, P7 159.5. As actors are reassigned, the slide shows later loads of P5 183, P5 193, P5 270.5, P4 274.5, P1 283, and P3 289.]
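The greedy idea can be sketched as follows. This is one possible reading of the bullets above, not the exact algorithm from the slides: the processor holding the most actors is given up, its actors are sorted by work, and each is handed to whichever remaining processor currently has the least load. The work values and starting mapping are illustrative assumptions, and splitter and joiner costs are ignored.

```python
def refine(mapping, work, removed):
    """Move the actors of processor `removed` onto the remaining processors.

    mapping: dict processor id -> list of actor names
    work:    dict actor name -> work estimate
    """
    new_map = {p: list(actors) for p, actors in mapping.items() if p != removed}
    loads = {p: sum(work[a] for a in actors) for p, actors in new_map.items()}
    # Heaviest orphan first, each to the currently least-loaded processor.
    for actor in sorted(mapping[removed], key=lambda a: work[a], reverse=True):
        target = min(loads, key=loads.get)
        new_map[target].append(actor)
        loads[target] += work[actor]
    return new_map, loads


work = {"A": 10, "B": 86, "F": 10,
        "C0": 61.5, "C1": 61.5, "C2": 61.5, "C3": 61.5,
        "D0": 163, "D1": 163,
        "E0": 141.5, "E1": 141.5, "E2": 141.5, "E3": 141.5}
mapping = {0: ["A", "E0"], 1: ["B", "C0"], 2: ["C1", "C2", "C3"], 3: ["D0"],
           4: ["E1"], 5: ["F", "E2"], 6: ["E3"], 7: ["D1"]}

# Assumption: the processor with the most actors is the one taken away.
removed = max(mapping, key=lambda p: len(mapping[p]))
new_map, loads = refine(mapping, work, removed)
```

Going from eight to seven processors this way raises the maximum load from 184.5 to 209 without recomputing the whole partition, which is the point of doing the adaptation with a lightweight pass.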
Stage Assignment [dynamic 2]
• Processor assignment only specifies how actors are overlapped across processors.
• Stage assignment finds how actors are overlapped in time.
• Relative start time of the actors is based on stage numbers.
• DMA operations will have a separate stage.
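A minimal sketch of how stage numbers might be derived from a processor assignment. The rule used here is an assumption consistent with the bullets above: a consumer on the same processor as its producer may share the producer's stage, while a consumer on a different processor starts two stages later so that the DMA transfer gets the stage in between. The graph and the processor assignment are illustrative; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def assign_stages(edges, proc_of):
    """Return (actor stages, DMA stages) for a directed acyclic stream graph."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)

    stage, dma_stage = {}, {}
    # static_order() yields every actor after all of its predecessors.
    for actor in TopologicalSorter(preds).static_order():
        earliest = 0
        for producer in preds[actor]:
            if proc_of[producer] == proc_of[actor]:
                earliest = max(earliest, stage[producer])
            else:
                # Cross-processor edge: the DMA occupies its own stage.
                dma_stage[(producer, actor)] = stage[producer] + 1
                earliest = max(earliest, stage[producer] + 2)
        stage[actor] = earliest
    return stage, dma_stage


edges = [("A", "B"), ("B", "C"), ("C", "F")]
proc_of = {"A": 0, "B": 0, "C": 1, "F": 0}
stage, dma_stage = assign_stages(edges, proc_of)
```

Here A and B share stage 0, the B-to-C transfer takes stage 1, C runs in stage 2, the C-to-F transfer takes stage 3, and F runs in stage 4.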
Stage Assignment Example
[Figure: the replicated stream graph and its actor-to-processor mapping, annotated with even stage numbers from 0 to 18.]
Buffer Allocation [dynamic 3]
• Slave processors have limited local store.
• Local store is faster than main memory.
• Utilize local stores first and then spill to main memory
• In case of spilling, DMAs have to be adjusted
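The local-first policy can be sketched as a first-fit pass over the buffers. The buffer names, sizes, and the order in which buffers are considered are assumptions for illustration; the 128K local store size matches the evaluated system.

```python
KB = 1024


def allocate_buffers(buffers, local_capacity):
    """Place each buffer in its processor's local store if it fits, else spill.

    buffers:        list of (name, processor id, size in bytes)
    local_capacity: dict processor id -> local store size in bytes
    """
    free = dict(local_capacity)
    placement = {}
    for name, proc, size in buffers:
        if size <= free.get(proc, 0):
            free[proc] -= size
            placement[name] = "local"
        else:
            # Spilled buffers live in main memory; their DMAs must be adjusted.
            placement[name] = "main"
    return placement


buffers = [("a_to_b", 0, 100 * KB), ("b_to_c", 0, 64 * KB), ("c_to_f", 0, 16 * KB)]
placement = allocate_buffers(buffers, {0: 128 * KB})
spilled = [name for name, where in placement.items() if where == "main"]
```

The second buffer no longer fits after the first is placed and is spilled, while the smaller third buffer still fits in the remaining local space.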
Methodology
• StreamIt compiler
• Metis for graph partitioning
• 32-core heterogeneous distributed-memory multi-core system
• Each slave core has a DMA engine and 128K local store
• System simulator to simulate the interconnect traffic.
Performance Comparison (DES)
[Figure: relative speedup (0 to 35) versus number of cores for DES, comparing Full Static, Graph Partitioning, and Flextream.]
Performance Comparison
[Figure: slowdown (%, 0 to 25) of the graph partitioning approach and the Flextream approach for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average.]
Dynamic Approach Time Comparison
[Figure: time (ms, 0 to 12) of the Flextream refinement approach and the graph partitioner approach for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average.]
Overhead Comparison
[Figure: fraction of time allocated (axis from 0.90 to 1.00) to prepass replication, work refinement, stage assignment, and buffer allocation for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average. The bar labels could not be recovered.]
Conclusion
• Static scheduling approaches are promising but not enough.
• Dynamic adaptation is necessary for future systems.
• Flextream provides a hybrid static/dynamic approach to improve efficiency.
Effect of Buffer Allocation on Performance
[Figure: relative performance (0 to 1) for bitonic, dct, des, fft, filter bank, fm, matrix, mpeg2, serpent, tde, and the average at six memory sizes, from Min Mem to Max Mem in steps of (Max Mem - Min Mem)/5.]
Outline
• Streaming Background
• Flextream’s Approach
– Static phase
– Dynamic phase
• Evaluation
• Conclusion
Introduction
• Single-core performance has stopped scaling.
• Multi-core and many-core systems are everywhere.
• These systems have different configurations.
• Resource management is a challenging problem.
[Figures: Cell processor; Intel Larrabee]