
Page 1: Experiences Programming the Cell Across a Diverse Set of ...cavazos/cisc879-spring2008/papers/Cel… · Respawn every time step Launch only first time step. Molecular Dynamics: Thread

Jeremy Meredith
Future Technologies Group

Experiences Programming the Cell Across a Diverse Set of Applications

Page 2

Outline

Overview of the application kernels
– Scientific, imaging, cognitive algorithms

Optimization strategies
– “Asymmetric-Thread Runtime Model”
– Parallelism, overheads, latencies, etc.

Performance results
– 2.4GHz Cell
– 2.2GHz Opteron

Page 3

Application Kernels

Monte Carlo Light Integration

Molecular Dynamics

Covariance Matrix Creation

Boolean Satisfiability Solver

Genetic Algorithms

Page 4

Monte Carlo Light Propagation

Simulation of point-source heating in an infinite isotropic scattering medium
– (from Oregon Medical Laser Center)

Fixed number of photons (outer loop)

Variable number of steps per photon (inner loop)
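The loop structure described on this slide can be sketched as follows. This is a minimal illustration, not the Oregon Medical Laser Center code: the function name `simulate_photons`, the coefficients `mu_a` and `mu_s`, the radial shell binning, and the roulette threshold are all assumptions chosen only to show a fixed outer loop over photons wrapped around an inner loop whose trip count depends on random draws.

```python
import math
import random


def simulate_photons(n_photons, mu_a=2.0, mu_s=20.0, n_shells=50,
                     shell_width=0.01, seed=1):
    """Deposit photon energy into radial shells around a point source."""
    rng = random.Random(seed)
    mu_t = mu_a + mu_s                  # total interaction coefficient (assumed units)
    albedo = mu_s / mu_t                # fraction of weight that survives a step
    heat = [0.0] * n_shells
    steps_per_photon = []

    for _ in range(n_photons):          # outer loop: fixed trip count
        x = y = z = 0.0
        weight = 1.0
        steps = 0
        while weight > 0.0:             # inner loop: data-dependent trip count
            # Isotropic scattering: pick a uniformly random direction.
            cos_t = 2.0 * rng.random() - 1.0
            sin_t = math.sqrt(1.0 - cos_t * cos_t)
            phi = 2.0 * math.pi * rng.random()
            # Exponentially distributed free path length.
            step = -math.log(1.0 - rng.random()) / mu_t
            x += step * sin_t * math.cos(phi)
            y += step * sin_t * math.sin(phi)
            z += step * cos_t
            # Absorb part of the weight into the shell at this radius.
            absorbed = weight * (1.0 - albedo)
            radius = math.sqrt(x * x + y * y + z * z)
            shell = min(int(radius / shell_width), n_shells - 1)
            heat[shell] += absorbed
            weight -= absorbed
            steps += 1
            # Russian roulette ends low-weight photons at a random step.
            if weight < 0.001:
                weight = weight * 10.0 if rng.random() < 0.1 else 0.0
        steps_per_photon.append(steps)
    return heat, steps_per_photon


heat, steps = simulate_photons(200)
print(min(steps), max(steps))
```

Every iteration of the inner loop draws several random numbers and branches on the result, which is the pattern that leaves little room for SIMD.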

Page 5

Molecular Dynamics

Force evaluation and integration between atoms
– Lennard-Jones potential interaction model
– Velocity Verlet integration algorithm

Number of interacting atoms changes over time

N^2 search over atom pairs for interacting atoms
– Force evaluation over only those within cutoff limit
– Search over atom pairs is bottleneck
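A minimal sketch of this kernel, assuming reduced units (`eps`, `sigma`, and `mass` all 1) and a hypothetical cutoff of 2.5: `lj_forces` does the N^2 pair search and evaluates the Lennard-Jones force only for pairs inside the cutoff, and `velocity_verlet_step` advances the system by one time step. The function names and parameter values are illustrative and are not taken from the original code.

```python
def lj_forces(pos, cutoff=2.5, eps=1.0, sigma=1.0):
    """Lennard-Jones forces from an all-pairs search with a distance cutoff."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    cutoff2 = cutoff * cutoff
    for i in range(n):                      # N^2 search over atom pairs
        for j in range(i + 1, n):
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if r2 >= cutoff2:               # skip pairs outside the cutoff
                continue
            s6 = (sigma * sigma / r2) ** 3
            # F = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * d
            scale = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]
    return forces


def velocity_verlet_step(pos, vel, forces, dt, mass=1.0):
    """Advance positions and velocities in place; return the new forces."""
    half = 0.5 * dt / mass
    for p, v, f in zip(pos, vel, forces):
        for k in range(3):
            v[k] += half * f[k]             # half-step velocity update
            p[k] += dt * v[k]               # full-step position update
    new_forces = lj_forces(pos)
    for v, f in zip(vel, new_forces):
        for k in range(3):
            v[k] += half * f[k]             # second half-step with new forces
    return new_forces


pos = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]]
vel = [[0.0, 0.0, 0.0] for _ in pos]
forces = lj_forces(pos)
for _ in range(10):
    forces = velocity_verlet_step(pos, vel, forces, dt=0.001)
```

The cutoff test sits inside the innermost loop, so the set of pairs that reach the force evaluation changes as the atoms move.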

Page 6

Covariance Matrix Creation

For each <a,b> entry in the L×L matrix Cov,

$$\mathrm{Cov}_{a,b} = \sum_{i=1}^{N} \sum_{j=1}^{M} \mathrm{input}_{i,j,a} \times \mathrm{input}_{i,j,b}$$

Applications include hyperspectral imaging
– Can build concise model of background for subtraction from the HSI data cube

Known loop counts, heavy data streaming, straightforward computation
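The sum on this slide, for each band pair <a,b> the product of the two band values accumulated over every pixel of the N×M image, can be written directly as nested loops. The sketch below assumes the data cube is a nested list indexed as `cube[i][j][band]`; the name `covariance` and that layout are illustrative choices.

```python
def covariance(cube, n_bands):
    """Accumulate the L x L matrix of band products over all N x M pixels."""
    cov = [[0.0] * n_bands for _ in range(n_bands)]
    for row in cube:                    # i = 1..N
        for pixel in row:               # j = 1..M
            for a in range(n_bands):
                pa = pixel[a]
                for b in range(n_bands):
                    cov[a][b] += pa * pixel[b]
    return cov


# N = 1, M = 2, L = 2: two pixels with two bands each.
cube = [[[1.0, 2.0], [3.0, 4.0]]]
print(covariance(cube, 2))              # [[10.0, 14.0], [14.0, 20.0]]
```

All loop bounds are known in advance and each pixel is read once, so the work reduces to streaming pixels through a fixed block of multiply-adds.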

Page 7

Boolean Satisfiability (SAT) Solver

Is there an assignment to a set of variables in a Boolean expression to make the entire expression true?
– Many problems, like planning, can be reduced to SAT

Unit Propagation
– Stochastic solvers repeatedly change the value of a variable, updating the scores of clauses which refer to that variable
– Main loop in solvers like GSAT, WalkSAT, HSAT
– Inefficient to check all clauses; instead, update only clauses containing that variable

Essentially no computation involved
– lookup, read, modify, write of random memory locations
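The incremental update described above can be sketched as follows. This is an illustration and not code from GSAT, WalkSAT, or HSAT: clauses are assumed to be lists of signed integers (DIMACS-style literals), the score of a clause is taken to be its number of true literals, and the helper names are hypothetical.

```python
def build_occurrences(clauses):
    """Map each variable to the (clause index, literal sign) pairs that use it."""
    occ = {}
    for c, clause in enumerate(clauses):
        for lit in clause:
            occ.setdefault(abs(lit), []).append((c, lit > 0))
    return occ


def count_true(clauses, assign):
    """Full recomputation: number of true literals in every clause."""
    return [sum(1 for lit in clause if assign[abs(lit)] == (lit > 0))
            for clause in clauses]


def flip(var, assign, occ, true_count):
    """Flip one variable and touch only the clauses that contain it."""
    old = assign[var]
    for c, positive in occ.get(var, []):
        # The literal was true before the flip iff its sign matched the old value.
        true_count[c] += -1 if positive == old else 1
    assign[var] = not old


clauses = [[1, -2], [2, 3], [-1, -3]]
assign = {1: True, 2: False, 3: False}
occ = build_occurrences(clauses)
true_count = count_true(clauses, assign)    # [2, 0, 1]: clause 1 is unsatisfied
flip(2, assign, occ, true_count)
print(true_count)                           # [1, 1, 1]: every clause satisfied
```

Each flip is a short chain of indirect lookups followed by a read-modify-write, with almost no arithmetic in between.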

Page 8

Genetic Algorithm

Parallel optimization on a large population
Individuals selected for breeding by their fitness
– replication, mutation, combination of “chromosomes”

Fitness evaluation is typically bottleneck
Two functions from GENEsYs package:
– Ackley’s function (more computation)
– Traveling salesman (more logic, with sorting)
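To show why Ackley's function is the compute-heavy case, here is a sketch of its commonly published form, which uses a cosine per gene plus an exponential and a square root per individual. The constants `a`, `b`, and `c` are the usual textbook defaults and are an assumption here; the GENEsYs variant may differ.

```python
import math


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    """Ackley's function in its common form; the global minimum is 0 at the origin."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    mean_cos = sum(math.cos(c * v) for v in x) / n
    return (-a * math.exp(-b * math.sqrt(mean_sq))
            - math.exp(mean_cos) + a + math.e)


# Fitness evaluation over a tiny, hypothetical population (lower is better).
population = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]
fitness = [ackley(individual) for individual in population]
best = min(range(len(population)), key=fitness.__getitem__)
print(best)
```

Each individual is evaluated independently of the others, so the fitness loop can be split across processing elements without communication.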

Page 9

Optimization Strategies and Issues

Task-level parallelism
SPE thread launch overhead
SIMD optimizations
Concurrent DMA bandwidth
Overlapping communication and computation
Latency hiding and loop unrolling
SDK optimized math libraries
Double precision penalties

Page 10

Monte Carlo Light Propagation: Task-level Parallelism

[Figure: bar chart of run time (msec), axis from 0 to 40000, for PPE, 1 SPE, 2 SPE, 4 SPE, 8 SPE, and Overlap.]

Page 11

Molecular Dynamics: Thread Launch Overhead

[Figure: bar chart of run time (sec), axis from 0.0 to 1.6, for 1 SPE and 8 SPEs under two strategies: respawn every time step, and launch only on the first time step. Series: Total Runtime, SPE Launch Overhead.]

Page 12

Molecular Dynamics: Thread Launch Overhead (cont.)

[Figure: bar chart of run time (sec), axis from 0.0 to 1.6, for 1 SPE and 8 SPEs under two strategies: respawn every time step, and launch only on the first time step. Series: Total Runtime, SPE Launch Overhead.]
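The two launch strategies compared in this chart can be sketched with ordinary host threads. This is an analogy only: Python's `threading` module stands in for SPE thread creation, and the barrier-based signaling is an assumed mechanism, not the one used in the original code. The sketch counts how many launches each strategy pays for.

```python
import threading


def respawn_every_step(n_workers, n_steps, work):
    """Create and join a fresh set of threads on every time step."""
    launches = 0
    for step in range(n_steps):
        threads = [threading.Thread(target=work, args=(w, step))
                   for w in range(n_workers)]
        for t in threads:
            t.start()
            launches += 1
        for t in threads:
            t.join()
    return launches


def launch_once(n_workers, n_steps, work):
    """Create threads once, then signal them to start each time step."""
    start = threading.Barrier(n_workers + 1)    # workers plus the main thread
    done = threading.Barrier(n_workers + 1)

    def worker(w):
        for step in range(n_steps):
            start.wait()                        # wait for the "go" signal
            work(w, step)
            done.wait()                         # report completion

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for _ in range(n_steps):
        start.wait()                            # release all workers
        done.wait()                             # wait for the step to finish
    for t in threads:
        t.join()
    return len(threads)


counts = [0] * 8


def work(w, step):
    counts[w] += 1                              # each worker writes only its own slot


print(respawn_every_step(8, 10, work))          # 80 launches
print(launch_once(8, 10, work))                 # 8 launches
```

The work done per step is identical in both versions; only the number of launches differs, which is the cost the chart separates out as launch overhead.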

Page 13

Molecular Dynamics: SIMD Optimization

[Figure: bar chart of run time (sec), axis from 0.00 to 0.20, after each successive optimization: original; replace "if" with "copysign"; SIMD unit cell reflection; SIMD direction vector; SIMD length calculation; SIMD acceleration.]
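The slide does not show which branch was replaced, so the following is only an assumed example of the "replace if with copysign" idea: wrapping a displacement back into a periodic box of hypothetical width `box`, first with branches and then with a branch-free expression built on `math.copysign`.

```python
import math


def wrap_with_if(d, box):
    """Branching version: shift d by one box width if it is past the half-box."""
    half = 0.5 * box
    if d > half:
        d -= box
    elif d < -half:
        d += box
    return d


def wrap_with_copysign(d, box):
    """Branch-free version: the comparison yields 0 or 1 and scales the shift."""
    half = 0.5 * box
    return d - math.copysign(box, d) * (abs(d) > half)


for d in (-7.5, -4.9, 0.0, 3.0, 5.1, 9.9):
    assert wrap_with_if(d, 10.0) == wrap_with_copysign(d, 10.0)
```

The branch-free form does the same arithmetic for every input, which is what lets several values be processed in one SIMD operation.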

Page 14

Covariance Matrix Creation: Concurrent Bandwidth

[Figure: run time (sec), axis from 0.0 to 7.0, against number of SPE threads (1, 2, 4, 8). Series: Full Execution, Launch+DMA, Thread Launch.]

Page 15

Covariance Matrix Creation: Asynchronous DMA Communication

[Figure: run time using 1 SPE (seconds), axis from 0.0 to 6.0, for No Computation, Some Computation, and All Computation. Series: Synchronous DMA, Overlapping DMA.]
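The difference between the synchronous and overlapping series can be illustrated with double buffering. In this sketch a background thread from `concurrent.futures` stands in for the DMA engine and `fetch` is a hypothetical stand-in for a transfer into a local buffer; real SPE code would use the Cell SDK's DMA calls instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(data, start, size):
    """Stand-in for a DMA transfer from main memory into a local buffer."""
    return data[start:start + size]


def sum_squares_synchronous(data, chunk):
    total = 0
    for start in range(0, len(data), chunk):
        buf = fetch(data, start, chunk)             # wait for the transfer
        total += sum(v * v for v in buf)            # then compute
    return total


def sum_squares_overlapped(data, chunk):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, data, 0, chunk)  # prefetch the first chunk
        for start in range(0, len(data), chunk):
            buf = pending.result()                   # current chunk is ready
            nxt = start + chunk
            if nxt < len(data):
                # Start the next transfer before computing on this chunk.
                pending = dma.submit(fetch, data, nxt, chunk)
            total += sum(v * v for v in buf)         # compute while it runs
    return total


data = list(range(1000))
assert sum_squares_synchronous(data, 64) == sum_squares_overlapped(data, 64)
```

The overlapped version holds two buffers at once, which is the local-store cost to weigh against the time saved.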

Page 16

Boolean Satisfiability (SAT) Solver: Latency Hiding and Loop Unrolling

[Figure: bar chart of run time (sec), axis from 0.0 to 3.0, after each successive optimization: original; -O3; simplify array indexing; loop unrolling; instruction reordering. Series: PPE, SPE.]

Page 17

Genetic Algorithm, Ackley’s Function: SDK Optimized Math Libraries

[Figure: bar chart titled "SPE Optimizations", run time (sec) on a log scale from 0.01 to 1: Original 0.681 s; Fast cosine 0.248 s; Fast exp/sqrt 0.064 s; SIMD 0.047 s.]

Page 18

Genetic Algorithm, Traveling Salesman: Double Precision Penalties

[Figure: bar chart of run time (sec), axis from 0.0 to 2.5, for 1 SPE using 'if' test and 1 SPE using 'copysign'. Series: Double Precision, Single Precision.]

Page 19

Performance Results

Monte Carlo Light Integration

Molecular Dynamics

Covariance Matrix Creation

Boolean Satisfiability Solver

Genetic Algorithms

Page 20

Monte Carlo Light Propagation

# Photons         1000      10000      100000
Opteron           24 ms     232 ms     2384 ms
Cell, 8 SPEs      38 ms     357 ms     3112 ms
Cell, PPE only    288 ms    2843 ms    28384 ms

Good scaling across all 8 SPEs can’t help the variable length, short inner loop, almost no SIMD, and heavy reliance on random number generation

Page 21

Molecular Dynamics

# Atoms           512
Opteron           0.925 sec
Cell, 1 SPE       0.816 sec
Cell, 8 SPEs      0.181 sec
Cell, PPE only    4.701 sec

A fair amount of SIMDizable computation lets even a single SPE beat the Opteron.
All 8 SPEs are about 5x faster than the Opteron.

Page 22

Covariance Matrix Creation

Data Set Size     256×65k
Opteron           12.308 sec
Cell, 8 SPEs      0.662 sec
Cell, PPE only    88.290 sec

High concurrent bandwidth and straightforward computation allow efficient use of all 8 SPEs.
8 SPEs are almost 20x faster than the Opteron.

Page 23

Boolean Satisfiability (SAT) Solver

# Vars            800
# Flips           10M
Opteron           0.571 sec
Cell, 1 SPE       1.961 sec
Cell, PPE only    1.998 sec

No computation, indirect lookups let a single SPE only barely beat even the PPE
A single SPE is 3.4x slower than the Opteron
– However, multiple SPEs could search independent parts of the problem space

Page 24

Genetic Algorithm

Compute-intensive Ackley’s function: one SPE is 4x faster; eight SPEs are 21x faster

Population Size   262k         1.05M
Opteron           0.645 sec    2.514 sec
Cell, 1 SPE       0.165 sec    0.637 sec
Cell, 8 SPEs      0.060 sec    0.119 sec
Cell, PPE only    2.797 sec    11.146 sec

Logic-intensive traveling salesman: one SPE is 4x slower; eight SPEs are 2x faster

Population Size   131k         524k
Opteron           0.466 sec    1.876 sec
Cell, 1 SPE       1.697 sec    6.761 sec
Cell, 8 SPEs      0.248 sec    0.884 sec
Cell, PPE only    3.802 sec    15.209 sec
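The speedup figures quoted on the results slides can be checked against the timing tables with a few divisions. The sketch below copies the run times from the slides. Matching the first genetic-algorithm table to Ackley's function and the second to traveling salesman is an inference from the quoted ratios, and the helper name `speedup` is illustrative.

```python
def speedup(opteron_sec, cell_sec):
    """Ratio above 1 means the Cell configuration is faster than the Opteron."""
    return opteron_sec / cell_sec


results = {
    "Molecular dynamics, 8 SPEs": speedup(0.925, 0.181),
    "Covariance matrix, 8 SPEs": speedup(12.308, 0.662),
    "SAT solver, 1 SPE": speedup(0.571, 1.961),
    "Ackley 1.05M, 1 SPE": speedup(2.514, 0.637),
    "Ackley 1.05M, 8 SPEs": speedup(2.514, 0.119),
    "Traveling salesman 524k, 1 SPE": speedup(1.876, 6.761),
    "Traveling salesman 524k, 8 SPEs": speedup(1.876, 0.884),
}
for name, ratio in results.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio below 1 is a slowdown: the single-SPE SAT result of about 0.29x is the same statement as "3.4x slower".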

Page 25

Optimization Strategy Summary

Parallelization across the SPEs is critical
Be aware of arithmetic costs
– Use the optimized math libraries from the SDK if it helps
– Double precision requires different kinds of optimizations

The EIB has a very high bandwidth to the SPEs
– Use asynchronous DMA to overlap communication and computation for apps with heavy bandwidth needs
– But for many apps, it may simply waste space in the SPE LS

Amortize expensive SPE thread launch overheads
– Launch once, and signal SPEs to start the next iteration

Use of SIMD intrinsics can result in large speedups
– Manual loop unrolling and instruction reordering can help even if no other SIMDization is possible

Page 26

Acknowledgements and More Info

This research was sponsored by the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

http://www.csm.ornl.gov/ft
[email protected]