Maven: A Data-Parallel Architecture for Par Lab
Yunsup Lee, Christopher Batten (now at Cornell), Rimas Avizienis, Christopher Celio, Alex Bishara, Richard Xia, Krste Asanovic
Par Lab Winter Retreat 2010
Motivation
Architectural Patterns
  - Multithreading (MT)
  - Traditional Vector (TVEC)
  - Single-Instruction Multi-Threading (SIMT)
  - Vector-Threading (VT)
Maven Single-Lane VT Core
Maven Evaluation
Conclusion
Architectural Design Patterns: Multithreading & Traditional Vector
[Figure: programmer's view (programming model) and machine implementation for the Multithreading and Traditional Vector patterns.]
Architectural Design Patterns: Single-Instruction Multi-Threading & Vector-Threading
[Figure: programmer's view (programming model) and machine implementation for the SIMT (GPU style) and Vector-Threading patterns.]
Motivation
Architectural Patterns
Maven Single-Lane VT Core
  - Maven Instruction Set Architecture
  - Maven µArchitecture
  - Maven Programming Methodology
Maven Evaluation
Conclusion
Maven Programming Methodology: Compiler Support
Goal: Minimum changes to standard scalar compiler to enable a high-level explicitly data-parallel programming methodology
1. Start with most recent GCC toolchain (4.4.1) with MIPS32 backend
2. Change MIPS32 backend to support unified integer and floating-point registers, add new multiply and divide instructions, and remove unsupported instructions
3. Modify SIMD extensions to support much longer vectors
4. Add support for vector registers with standard register allocator
5. Add intrinsics for vector commands
6. Add Maven pipeline model and ability to tune any function for either the control processor or a micro-thread
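To make the idea of an explicitly data-parallel loop over long vectors concrete, here is a minimal sketch of stripmining in plain C. It does not use Maven's intrinsics, which these slides do not show; `HW_VLEN`, `vec_add_chunk`, and `stripmined_add` are hypothetical names, and the inner scalar loop only stands in for a single vector operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware vector length. A real vector machine would report
   this at run time; 32 is an assumption for illustration only. */
static const size_t HW_VLEN = 32;

/* Stands in for one vector operation over `vl` elements (vl <= HW_VLEN). */
void vec_add_chunk(const int *a, const int *b, int *c, size_t vl) {
    assert(vl <= HW_VLEN);
    for (size_t i = 0; i < vl; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Stripmined loop: cover n elements in chunks of at most HW_VLEN.
   Returns the number of chunks issued. */
size_t stripmined_add(const int *a, const int *b, int *c, size_t n) {
    size_t chunks = 0;
    for (size_t i = 0; i < n; i += HW_VLEN) {
        size_t remaining = n - i;
        size_t vl = remaining < HW_VLEN ? remaining : HW_VLEN;
        vec_add_chunk(a + i, b + i, c + i, vl);
        chunks++;
    }
    return chunks;
}
```

With 70 elements and the assumed vector length of 32, the loop issues three chunks of 32, 32, and 6 elements.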
Caveats
• Half-baked: these are early-stage results
• Instruction fetch energy not considered
• Data access energy not considered
• Both omissions penalize the results for the vector machines
• SIMT (GPU Style) machines are approximated with VT machines
• Irregular data parallel microbenchmarks (masked filter, binary search) for the traditional vector machine are hand-coded assembly
• Other microbenchmarks are all compiled
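As an illustration of why binary search counts as irregular data parallelism, the sketch below searches for many keys independently. Each key takes a data-dependent number of loop iterations and branch directions, so the per-element control flow diverges. This is a plain scalar C sketch, not the microbenchmark source; `search_one` and `search_all` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of `key` in sorted[0..n), or -1 if it is absent. */
long search_one(const int *sorted, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {                      /* trip count depends on the data */
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == key) {
            return (long)mid;
        }
        if (sorted[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return -1;
}

/* Data-parallel outer loop: every key is an independent search. */
void search_all(const int *sorted, size_t n,
                const int *keys, long *out, size_t nkeys) {
    assert(n == 0 || sorted != NULL);
    for (size_t k = 0; k < nkeys; k++) {
        out[k] = search_one(sorted, n, keys[k]);
    }
}
```

The outer loop is trivially parallel, but the inner `while` loop is what makes the kernel awkward for a traditional vector machine.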
Data-Parallel Core Area Comparison
[Figure: bar chart of normalized area (y-axis, 0 to 15) for five groups of configurations, each labeled 32, 64, 128, and 256.]
Microbenchmark: Complex Multiply
[Figure: plots of normalized energy per iteration (0 to 3) against normalized iterations per second (0 to 16), and of energy per iteration in nJ (0 to 1.6). Labeled configurations include SIMT-32, SIMT-64, SIMT-128, SIMT-256, VT-32, VT-64, VT-128, and VT-256; the remaining groups are labeled 32, 64, 128, and 256.]
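For reference, a complex multiply kernel is a regular data-parallel loop: every element performs the same four multiplies and two add/subtract operations with no data-dependent control flow. The following is a minimal scalar sketch, assuming split real and imaginary arrays; the layout and the name `cmult` are assumptions, not taken from the microbenchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise complex multiply: c[i] = a[i] * b[i].
   Real and imaginary parts live in separate arrays (an assumed layout). */
void cmult(const float *a_re, const float *a_im,
           const float *b_re, const float *b_im,
           float *c_re, float *c_im, size_t n) {
    assert(n == 0 || (a_re && a_im && b_re && b_im && c_re && c_im));
    for (size_t i = 0; i < n; i++) {
        /* (x + yi)(u + vi) = (xu - yv) + (xv + yu)i */
        float re = a_re[i] * b_re[i] - a_im[i] * b_im[i];
        float im = a_re[i] * b_im[i] + a_im[i] * b_re[i];
        c_re[i] = re;
        c_im[i] = im;
    }
}
```

Because each iteration is independent and branch-free, the loop maps directly onto vector operations.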
Microbenchmark: Masked Filter
[Figure: plots of normalized energy per iteration (0 to 3) against normalized iterations per second (0 to 16), and of energy per iteration in nJ (0 to 1), comparing the SIMT and VT configurations.]
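The slides do not define the masked filter kernel, so the sketch below assumes one plausible form: a three-point averaging filter applied only where a mask is non-zero, with other elements copied through. The per-element branch on the mask is what makes the loop irregular. The 1D shape, the filter width, and the name `masked_filter` are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Applies a three-point average to in[i] where mask[i] is non-zero and
   both neighbors exist; all other elements are copied unchanged. */
void masked_filter(const int *in, const unsigned char *mask,
                   int *out, size_t n) {
    assert(n == 0 || (in && mask && out));
    for (size_t i = 0; i < n; i++) {
        int interior = (i > 0) && (i + 1 < n);
        if (mask[i] && interior) {         /* data-dependent branch */
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
        } else {
            out[i] = in[i];
        }
    }
}
```

On a traditional vector machine this branch would typically become a vector mask or flag operation, which fits the caveat that the irregular microbenchmarks were hand-coded for that machine.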
Conclusion
Performance (best to worst): TVEC and VT, then SIMT, then MT
Energy Efficiency (high to low): TVEC and VT, then SIMT, then MT
Application Space (wide to narrow): MT, then VT and SIMT, then TVEC
Programming Difficulty (easy to hard): MT and SIMT, then TVEC, then VT
Compiler Support (easy to hard): MT, then VT and SIMT, then TVEC
Future Work
Short Term
• Optimize vector control overhead
• Explore banked register file designs
• Evaluate the impact of density-time execution
• Experiment with application kernels
• Fab a test chip
Long Term
• Investigate tightly integrating general-purpose cores with vector-thread data-parallel cores
• SEJITS backend for Maven