Presented by
Programming the Cell Processor: Achieving High Performance and Efficiency
Jeremy S. Meredith
Future Technologies, Computer Science and Mathematics Division
Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research
2 Meredith_Cell_0611
Cell Broadband Engine processor: An overview
One POWER architecture processing element (PPE)
Eight synergistic processing elements (SPEs)
All connected through a high-bandwidth element interconnect bus (EIB)
Over 200 GFLOPS (single precision) on one chip
Cell Broadband Engine processor: Details
One 64-bit PPE
Dual-threaded
Vector instructions (VMX)
Eight SPEs
Dual-issue pipeline
Simple instruction set, heavily focused on single instruction, multiple data (SIMD)
Capability for double precision, but optimized for single
Uniform 128-bit, 128-entry register file
256-KB fixed-latency local store
Memory flow controller with direct memory access (DMA) engine to access main memory or other SPEs' local stores
Genetic algorithm, traveling salesman: Single vs. double precision performance
The Cell processor has a much higher latency for double precision results
The "if" test in the sorting predicate is highly penalized for double precision
[Chart: run time (sec), single vs. double precision, on a 2.8 GHz Pentium 4, the PPE only, 1 SPE, and 1 SPE avoiding the "if" test. Replacing this test with extra arithmetic results in a large speedup.]
Genetic algorithm, Ackley's function: Using SPE-optimized math libraries
Ackley's function involves cos, exp, sqrt
[Chart: run time (sec, log scale) for the original (0.681), fast cosine (0.248), fast exp/sqrt (0.064), and SIMD (0.047) versions. Switching to a math library optimized for SPEs results in a more than 10x improvement for single precision.]
Covariance matrix creation: DMA communication overhead
The Cell processor can overlap communication with computation
Covariance matrix creation has a low ratio of communication to computation
However, even with a SIMD-optimized implementation, the high bandwidth of the Cell's EIB makes this overhead negligible
[Chart: run time on 1 SPE (sec) for scalar and SIMD implementations, comparing synchronous DMA against overlapped DMA]
Stochastic Boolean SAT solver: Hiding latency in logic-intensive apps
PPE: attempting to hide latency manually works against the compiler
SPE: manual loop unrolling and instruction reordering achieved speedups, although no computational SIMD optimizations were possible
[Chart: run time (sec) for the original code at -O3, two stages of simplified array indexing, loop unrolling, and instruction reordering]
Support vector machine: Parallelism and concurrent bandwidth
As more simultaneous SPE threads are added, total run time decreases
Total DMA time also decreases, showing that the concurrent bandwidth to all SPEs is higher than to any one SPE
Thread launch time increases, but the latest Linux kernel for the Cell system reduces this overhead
[Chart: run time (sec) vs. number of SPE threads (1, 2, 4, 8, and 8 with the new kernel), showing full execution, launch + DMA, and thread launch times]
Molecular dynamics: SIMD intrinsics and PPE-to-SPE signaling
Using SIMD intrinsics easily achieved 2x speedups in total run time
Using PPE-SPE mailboxes, SPE threads can be reused across iterations; threads are launched only on the first call
Thus, thread launch overheads are completely amortized on longer runs
[Chart: run time (sec) for 1 SPE (original and SIMDized) and for 1, 2, 4, and 8 SPEs, showing full run time and thread launch overhead]
Conclusion
Be aware of arithmetic costs: use the optimized math libraries from the SDK if it helps; double precision requires different kinds of optimizations
The Cell has very high bandwidth to the SPEs: use asynchronous DMA to overlap communication and computation for applications that are still bandwidth-bound
Amortize expensive SPE thread launch overheads: launch once, and signal SPEs to start the next iteration
Use of SIMD intrinsics can result in large speedups: manual loop unrolling and instruction reordering can help even if no other SIMDization is possible
Contact
Jeremy Meredith
Computer Scientist, Computer Science and Mathematics Division
(865) [email protected]