Presented by
Programming the Cell Processor: Achieving High Performance and Efficiency
Jeremy S. Meredith
Future Technologies, Computer Science and Mathematics Division
Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research
2 Meredith_Cell_0611
Cell Broadband Engine processor: An overview
One POWER architecture processing element (PPE)
Eight synergistic processing elements (SPEs)
All connected through a high-bandwidth element interconnect bus (EIB)
Over 200 GFLOPS (single precision) on one chip
Cell Broadband Engine processor: Details
One 64-bit PPE
Dual-threaded
Vector instructions (VMX)
Eight SPEs
Dual-issue pipeline
Simple instruction set, heavily focused on single instruction, multiple data (SIMD)
Capability for double precision, but optimized for single
Uniform 128-bit, 128-entry register file
256-KB fixed-latency local store
Memory flow controller with direct memory access (DMA) engine to access main memory or other SPEs' local stores
Genetic algorithm, traveling salesman: Single vs. double precision performance
The Cell processor has a much higher latency for double precision results
The "if" test in the sorting predicate is highly penalized for double precision
[Chart: run time (sec), single vs. double precision, on a 2.8 GHz Pentium 4, the PPE only, 1 SPE, and 1 SPE avoiding the "if" test. Replacing this test with extra arithmetic results in a large speedup.]
Genetic algorithm, Ackley's function: Using SPE-optimized math libraries
Ackley's function involves cos, exp, sqrt
[Chart: run time (sec, log scale) for the original (0.681), fast cosine (0.248), fast exp/sqrt (0.064), and SIMD (0.047) versions. Switching to a math library optimized for SPEs results in a more than 10x improvement for single precision.]
Covariance matrix creation: DMA communication overhead
The Cell processor can overlap communication with computation
Covariance matrix creation has a low ratio of communication to computation
However, even with a SIMD-optimized implementation, the high bandwidth of the Cell's EIB makes this overhead negligible
[Chart: run time on 1 SPE (sec) for scalar and SIMD implementations, comparing synchronous DMA against overlapped DMA]
Stochastic Boolean SAT solver: Hiding latency in logic-intensive apps
PPE: attempting to hide latency manually works against the compiler
SPE: manual loop unrolling and instruction reordering achieved speedups, although no computational SIMD optimizations were possible
[Chart: run time (sec) for the original code at -O3, two stages of simplified array indexing, loop unrolling, and instruction reordering]
Support vector machine: Parallelism and concurrent bandwidth
As more simultaneous SPE threads are added, total run time decreases
Total DMA time also decreases, showing that the concurrent bandwidth to all SPEs is higher than to any one SPE
Thread launch time increases, but the latest Linux kernel for the Cell system reduces this overhead
[Chart: run time (sec) vs. number of SPE threads (1, 2, 4, 8, and 8 with the new kernel), showing full execution, launch + DMA, and thread launch times]
Molecular dynamics: SIMD intrinsics and PPE-to-SPE signaling
Using SIMD intrinsics easily achieved 2x speedups in total run time
Using PPE-SPE mailboxes, SPE threads can be reused across iterations; threads are launched only on the first call
Thus, thread launch overheads are completely amortized on longer runs
[Chart: run time (sec) for 1 SPE (original and SIMDized) and for 1, 2, 4, and 8 SPEs, showing full run time and thread launch overhead]
Conclusion
Be aware of arithmetic costs: use the optimized math libraries from the SDK if it helps; double precision requires different kinds of optimizations
The Cell has very high bandwidth to the SPEs: use asynchronous DMA to overlap communication and computation for applications that are still bandwidth-bound
Amortize expensive SPE thread launch overheads: launch once, and signal SPEs to start the next iteration
Use of SIMD intrinsics can result in large speedups: manual loop unrolling and instruction reordering can help even if no other SIMDization is possible
Contact
Jeremy Meredith
Computer Scientist, Computer Science and Mathematics Division
(865) [email protected]