structured grid motif
TRANSCRIPT
238 Structured Grid Motif
Applying finite difference methods to PDEs on structured grids produces stencil operators that must be applied to all points in the discretized grid
Challenged by bandwidth, temporal reuse, efficient SIMD, etc., but trivial to (correctly) parallelize
Most optimizations can be independently implemented
– but they are not performance independent
Core (cache) blocking and cache bypass were clearly integral to performance
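The stencil operator described above can be sketched in a few lines. This is a minimal 5-point Jacobi sweep in NumPy, purely illustrative: it ignores all the blocking, cache-bypass, and SIMD optimizations the slide refers to.

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi sweep of a 5-point Laplacian stencil on a 2D grid.

    Interior points are updated from their four neighbors; boundary
    values are held fixed (Dirichlet conditions). Illustrative sketch,
    not a tuned implementation.
    """
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return new

u = np.zeros((6, 6))
u[0, :] = 1.0          # hot top boundary
u = jacobi_step(u)     # interior cells next to the boundary become 0.25
```

Because every point is updated independently from the previous iterate, the sweep is trivially parallel; the hard part, as the slide notes, is feeding it from memory efficiently.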
239 Adaptive Mesh Refinement
• Adaptive meshes
• Refinement done by estimating errors
• Refine the mesh where the estimated error is too large
• Parallelism
• Mostly between “patches” assigned to processors for load balance
• May exploit parallelism within a patch
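Refinement driven by error estimation can be sketched as follows. The second-difference estimator and the threshold below are illustrative stand-ins, not any particular AMR code's criterion.

```python
import numpy as np

def flag_for_refinement(u, tol):
    """Flag cells of a 1D grid whose estimated error is too large.

    The estimator is a simple second-difference (curvature) proxy --
    an illustrative stand-in for the error estimators used in real
    AMR codes. Flagged cells would be refined into finer patches.
    """
    err = np.zeros_like(u)
    err[1:-1] = np.abs(u[2:] - 2.0 * u[1:-1] + u[:-2])
    return err > tol

x = np.linspace(0.0, 1.0, 11)
u = np.where(x < 0.5, 0.0, 1.0)      # step profile: sharp feature at the jump
flags = flag_for_refinement(u, 0.5)  # only cells next to the jump are flagged
```

Only the two cells straddling the discontinuity are flagged, which is exactly the behavior one wants: refine where the solution changes rapidly, leave smooth regions coarse.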
240 Challenges of Irregular Meshes
How to generate them in the first place
– Start from a geometric description of the object
– Triangle – a 2D mesh generator
– 3D is significantly harder
How to partition them
– ParMetis – a parallel graph partitioner
How to design iterative solvers
– PETSc – a Portable Extensible Toolkit for Scientific Computing
– Prometheus – a multigrid solver for finite element problems on irregular meshes
How to design direct solvers
– SuperLU – parallel sparse Gaussian elimination
These are challenges to do sequentially; even more so in parallel
241 Lattice Boltzmann Methods
LBMHD simulates charged plasmas in a magnetic field (MHD) via the Lattice Boltzmann Method (LBM) applied to CFD and Maxwell’s equations
To monitor density, momentum, and magnetic field, it requires maintaining two “velocity” distributions
– 27 (scalar) element velocity distribution for momentum
– 15 (Cartesian) element velocity distribution for magnetic field
– 632 bytes / grid point / time step
Jacobi‐like time evolution requires ~1300 flops and ~1200 bytes of memory traffic
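The flop and byte counts above determine the kernel's arithmetic intensity, which is why LBMHD is typically bandwidth-bound. A quick back-of-the-envelope check (the 50 GB/s bandwidth figure is hypothetical, chosen only for illustration):

```python
# Arithmetic intensity of the LBMHD time step from the figures above:
# ~1300 flops and ~1200 bytes of memory traffic per lattice update.
flops_per_update = 1300
bytes_per_update = 1200
intensity = flops_per_update / bytes_per_update   # ~1.08 flops/byte

# On a machine with, say, 50 GB/s of memory bandwidth (hypothetical),
# a roofline-style bound caps performance at intensity * bandwidth:
bandwidth_gbs = 50
flops_cap_gflops = intensity * bandwidth_gbs      # ~54 Gflop/s
```

At roughly one flop per byte, the kernel sits well below the compute-to-bandwidth ratio of most modern processors, so memory traffic, not arithmetic, sets the performance ceiling.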
242 Lattice Boltzmann Methods – Parallelization & Performance
Distributed Memory & Hybrid
– MPI, MPI+pthreads, MPI+OpenMP (SPMD, SPMD2, SPMD+Fork/Join)
For this large problem, auto‐tuning flat MPI delivered a significant boost (2.5x)
Extending auto‐tuning to include the domain decomposition and the balance between threads and processes provided an extra 17%
2 processes with 2 threads each was best
– True for both Pthreads and OpenMP
243 Particle Method Motif – Particle-in-Cell
Rather than calculating O(N²) forces, calculate the impact of particles on the field and of the field on particles → O(N)
– Particle‐to‐grid interpolation (scatter‐add) – the most challenging step
– Poisson solver
– Grid‐to‐particle/push interpolation (gather) – embarrassingly parallel
Used in a number of simulations including Heart and Fusion
A trivial simplification would be a 2D histogram
These codes can be challenging to parallelize in shared memory
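The scatter-add step, in its "trivial 2D histogram" simplification, looks like this in NumPy. The collision on repeated indices (handled serially here by np.add.at) is exactly what makes the shared-memory parallel version hard: concurrent particles can target the same cell.

```python
import numpy as np

def deposit_charge(xs, ys, grid_shape):
    """Nearest-grid-point charge deposition -- the 2D-histogram
    simplification of the PIC scatter-add step.

    Each particle adds unit charge to the cell it falls in.
    np.add.at accumulates correctly even when several particles
    hit the same cell; a parallel version would need atomics or
    per-thread private grids to get the same answer.
    """
    rho = np.zeros(grid_shape)
    ix = xs.astype(int)
    iy = ys.astype(int)
    np.add.at(rho, (ix, iy), 1.0)   # unbuffered scatter-add
    return rho

xs = np.array([0.2, 0.7, 1.5, 1.9])
ys = np.array([0.1, 0.4, 2.2, 2.8])
rho = deposit_charge(xs, ys, (4, 4))   # two particles land in each of two cells
```

The gather (grid-to-particle) direction has no such conflicts, which is why the slide marks it embarrassingly parallel.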
245 Fast Multipole Method
Alternate (tree‐based) approach for calculating forces
Kernel Independent FMM (KIFMM) is challenged by 7 computational phases (kernels) including list computations and tree traversals
List computations vary from those requiring direct particle‐particle interactions to those based on many small FFTs
Different architectures (CPUs, GPUs…) may require different codes for each phase
FMM is parameterized by the number of particles per box in the octree
– More particles/box → more flops (direct calculations)
– Fewer particles/box → fewer flops (but more work in tree traversals)
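The particles-per-box tradeoff can be captured with a toy cost model. All coefficients below are hypothetical, chosen only so the minima land near the 250 and 4000 particles/box sweet spots reported for Nehalem and GPUs in this section; the real phase costs are far more complicated.

```python
import math

# Toy cost model for the particles-per-box parameter p in FMM:
# direct particle-particle work grows like N*p, while tree-traversal
# work grows like N/p (fewer particles per box -> more boxes to walk).
def fmm_cost(p, n, c_direct, c_tree):
    return c_direct * n * p + c_tree * n / p

def sweet_spot(c_direct, c_tree):
    # Minimizing c_direct*p + c_tree/p over p gives p = sqrt(c_tree/c_direct)
    return math.sqrt(c_tree / c_direct)

# A CPU with fast traversal (small c_tree) prefers small boxes; a GPU
# with slow traversal (large c_tree) pushes the optimum to big boxes,
# trading many extra flops for fewer tree operations.
cpu_p = sweet_spot(1.0, 62500.0)    # -> 250 particles/box
gpu_p = sweet_spot(1.0, 16.0e6)     # -> 4000 particles/box
```

The model makes the slide's point quantitative: when traversal is relatively expensive, the optimizer happily spends many times more flops on direct interactions to avoid it.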
246 Fast Multipole Method (2)
Different architectures showed speedups for different phases from conventional auto‐tuning
Tuning algorithmic parameters showed different architectures preferred different sweet spots:
– Nehalem’s sweet spot was around 250 particles/box
– GPUs required up to 4000 particles/box to attain similar performance: to cope with poor tree traversal performance, GPUs had to perform 16x as many flops
248 Motifs Summary
“Design spaces” for algorithms and implementations are large and growing
Finding the best algorithm/implementation by hand is hard and getting harder
Ideally, we would have a database of “techniques” that would grow over time
– Search automatically whenever a new input and/or machine comes along
Still lots of work to do…
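A minimal sketch of what such an automated search could look like: time each candidate implementation on the actual input and machine, and keep the winner. The variants below are hypothetical stand-ins for entries in a growing database of techniques.

```python
import time

def autotune(variants, run):
    """Pick the fastest implementation variant for this input/machine
    by empirical search -- the kind of lookup an auto-tuning database
    of techniques would automate. `variants` maps names to callables.
    """
    timings = {}
    for name, fn in variants.items():
        t0 = time.perf_counter()
        run(fn)
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings

# Hypothetical variants of the same computation (sum of squares):
data = list(range(10000))
variants = {
    "generator_sum": lambda xs: sum(x * x for x in xs),
    "map_sum": lambda xs: sum(map(lambda x: x * x, xs)),
}
best, timings = autotune(variants, lambda fn: fn(data))
```

Real auto-tuners also search over algorithmic parameters (like the FMM's particles/box above), cache the winner per input size and machine, and re-search when either changes.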
249 Rapid Prototyping Platforms
Gedae
– Brings multicore processing to the masses by automating the implementation of software for multiprocessor and multicore systems
– Heterogeneous architectures, FPGAs
RapidMind (Intel)
– Portable software development platform for multi-core and many-core processors
250
SW / HW System
Gedae SDK Component’s
Threaded
Application
Hardware
Model
Compiler
Implementation
Specification
Functional
Model
Developer
Analysis Tools
New
Language
Specification
Tools
Thread Manager
251 Guiding Principle for Evolution of Multicore SW Development Tools
[Diagram relating the functional model, implementation specification, architecture-specific details, libraries, the compiler, and implementation complexity, annotated with the guidelines below.]
– Let the compiler build parallelism
– Minimize the code size and inefficiency of conditionals
– Use a rule base to aid the developer
– One application – many targets
– Give the compiler flexibility to optimize
– Minimize the effort to port the compiler to new architectures
252 Language – Invariant Functionality
Functionality must be free of implementation policy
– For example, decomposition of data or processing kernels cannot be a part of the invariant functionality
Functionality must be easy to express
– Scientists and engineers want a thinking tool
Functional expressiveness must be complete
– Some algorithms are hard if the appropriate language feature is not available
253 Language Features for Expressiveness and Invariant Functionality
Stream data (time based data) *
Segmented streams with software reset on segment boundaries *
Persistent data – state* – extends to databases ‡
Algebraic equations (HLL most similar to Mathcad) ‡
Conditionals †
Iteration ‡
State behavior †
Procedural *
* These are mature language features
† These are currently directly supported in the language but will continue to evolve
‡ Support for directly expressing these behaviors, while possible to implement in the current product, will be added to the language and compiler in a release later this year
255
Language – Block Diagram
256 Language – Symbolic Expressions
out[i][j](t) = sar(in[i][j](t), Taylor[j], Azker[i*2]) {
  range i2 = 2*size(i)
  t1[i][j] = Taylor[j] * in[i][j]
  range[i] = fft(t1[i])
  cturn[j][i] = range[i][j]
  adjoin[j][i2](t) = i2 < R ? cturn[j][i2](t) : cturn[j][i2](t-1)
  t2[j] = ifft(adjoin[j])
  t3[j][i2] = Azker[i2] * t2[j][i2]
  azimuth[j] = fft(t3[j])
  out[j][i] = azimuth[j][i]
}
257 Language for Specifying Dynamic Behavior
Gedae’s data flow language features are ideal for automatic resource allocation
– Load balancing
– Fault tolerance control
Resource allocation can be controlled based on:
– Priority
– Latency requirements
– Temperature
– Power
– Balancing load
258 Implementation Tools – Summary of Implementation Tools
[Annotated screenshot: implementation tools for every aspect of the software]
– Select from 4 memory packers
– Select whether to prototype the application in the SDK or as separate executables
– View the hardware model
– Automate setting of queue sizes between dynamically related threads
– Select the location of the command program
– Choose the structure of the product
– View compiler status
259 Implementation Tools – Partitioning Tool
– Tabular listing of all functional components in the system
– Equation-based partitioning of sets of functional components
– Hierarchical partitioning
260 Implementation Tools – IPC Specification Tool
Hierarchical list of the data transfers between partitions required to implement the flow graph
– Gedae reports the source and destination logical processor #s
– Set the transfer type from those specified in the embedded configuration file
– Set the buffer size if there is a buffer associated with the transfer mechanism
– Set the number of send and receive buffers for multi-buffering
261 Analysis Tools – Execution Trace
One application – many targets
– Summary by processing core
– Details by software component
262 Analysis Tools – Event and Processor Statistics
Detailed events and event statistics are available
263 Analysis Tools – Interprocessor Communications Trace
Red and green are sends and receives for inter-processor and inter-memory transfers
264 Analysis Tools – Distributed Debugging
– Processors controlled individually or stopped globally
– Gedae instruments code with more or less detail
– User can add events
– Breakpoints can be added on any sensible event – like the 4th firing of the FFT on partition p2
– Probes can be added at any point
265 Analysis Tools – Memory Map
Every data structure in memory is preplanned
Black gaps are a result of memory alignment requirements
266 Hardware Model – Components that Affect Software Structure
Memory
– Hierarchical – local store or cache
– Distributed
Processors / Processing Cores
– Working definition is that each core has its own instruction stream
– Fits the decomposition of the problem: getting the data into and out of fast memory, then processing it efficiently while in fast memory
Interconnect between memories and processors
– Characterization of memory layout and buffer sizes required for efficiency
Optimized vector functions
– Characterization of memory layout required for efficiency
267 Hardware Model – Example Architecture
[Diagram: two processor sets – duplicate or heterogeneous subsystems – each containing multiple cores with local stores (LS), a bridge, and SYSMEM, connected to each other by IPC links.]
268
Compiler
A multithreading compiler
– Verification
– Buffer definition
– Thread definition
– Thread decomposition for distribution
– Add distribution infrastructure
– Concurrency control
– Deadlock avoidance
– Memory sharing among threads
– Memory optimization within threads
Product creation
– Function library
– Standalone executable
269
Verification
Gedae applications are correct by construction
Gedae constructs the distributed implementation so
that it is functionally equivalent to the single
processor implementation
Issues addressed include
– Ordering sends and receives so that deadlock is avoided
when using blocking transfers
– Ensuring the gains from interpolation/decimation are
equivalent when multiple arcs meet together
– Runtime queue resizing to detect blocking due to
dynamic and variable threshold queues
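The first verification issue – ordering blocking sends and receives so that deadlock is avoided – can be illustrated with a small sketch. The ordering rule below (lower rank sends first, higher rank receives first) is a standard convention for blocking transfers, not necessarily Gedae's exact scheme.

```python
def order_exchanges(rank, peer, msgs):
    """Deadlock-free ordering for blocking transfers between two
    partitions: the lower-ranked side sends first then receives; the
    higher-ranked side receives first then sends.

    With blocking (rendezvous) semantics, having both sides send
    first would deadlock -- each would wait forever for a matching
    receive. This ordering guarantees every send meets a receive.
    """
    ops = []
    for m in msgs:
        if rank < peer:
            ops += [("send", m), ("recv", m)]
        else:
            ops += [("recv", m), ("send", m)]
    return ops

a = order_exchanges(0, 1, ["x"])   # partition 0: send first
b = order_exchanges(1, 0, ["x"])   # partition 1: receive first
```

A compiler that constructs the distributed implementation can apply such an ordering globally, which is how the single-processor and distributed versions stay functionally equivalent.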
270 Automated Memory Planning and Packing
Gedae uses data dependency and locality to plan memory use and the primitive order of execution
All memory allocation is static
High reuse of buffers (packing)
[Diagram: four kernels (sources A, B, C and a sink) executing in order, with their buffers packed into a shared memory region.]
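Static allocation with buffer reuse can be sketched as a greedy interval-packing pass: once the order of execution is fixed, each buffer has a live range, and buffers whose live ranges do not overlap may share the same memory. This is an illustrative model, not Gedae's actual planner.

```python
def pack_buffers(buffers):
    """Greedy static packing of buffers into one memory region.

    `buffers` maps name -> (size, first_use, last_use), where the use
    indices come from a fixed execution order. Buffers with disjoint
    live ranges may occupy the same offset (reuse / packing).
    Returns name -> (offset, size).
    """
    placements = {}
    for name, (size, first, last) in sorted(
            buffers.items(), key=lambda kv: kv[1][1]):
        # Regions occupied by already-placed buffers that are live
        # at the same time as this one.
        busy = sorted(
            (placements[n][0], placements[n][0] + buffers[n][0])
            for n in placements
            if not (buffers[n][2] < first or buffers[n][1] > last))
        offset = 0
        for lo, hi in busy:          # first-fit below/between live regions
            if offset + size <= lo:
                break
            offset = max(offset, hi)
        placements[name] = (offset, size)
    return placements

# A is dead before C starts, so C reuses A's slot: 8 bytes, not 12.
bufs = {"A": (4, 0, 1), "B": (4, 1, 2), "C": (4, 2, 3)}
plan = pack_buffers(bufs)
```

Because every placement is decided at compile time, there is no runtime allocator; the "black gaps" mentioned on the memory-map slide correspond to alignment padding a real planner would also insert.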
271 Supported Platforms
Workstations
– Intel x86, SPARC, PowerPC
Multicores
– Cell/B.E. processor
– Intel Core 2 with SSE3
– Blue Gene
– Freescale (coming soon)
DSPs
– PowerPC AltiVec
– TigerSHARC
Custom hardware through the BSP Development Kit
FPGAs and GPUs (future)
272 Gedae Simulation
Gedae has a simulation tool that allows for hardware/software co-simulation
The hardware model is built using the same development environment as the software
– Model software running on hardware
– Model physical characteristics of hardware – temperature, power, etc. – and report them in the trace table
– Model external systems: people, planes, physics (e.g. a RADAR environmental simulator)
273 Gedae Simulation – Example
Software functionality and the hardware simulation run together in Gedae and connect just as software connects to target hardware
– Hardware, OS, and system software must all be modeled
– IPC hardware and software are modeled, including the feedback between sender and receiver
274 Gedae Simulation – Example
[Trace screenshot showing:]
– Hardware events like data transfers
– OS and system software events like initiating IPC
– Application software events like FFTs and vector multiplies
275
Gedae Roadmap
Symbolic Expression language
– Specification of programs as expressions
– Automated decomposition for data parallelism
– Equivalent to flow graph language, but entirely
textual
Data Analysis and Display (plots, images,
3-d, etc.)
276 Gedae Roadmap
Code overlays
– Increased program size
– Improved startup time (status shown in the table below – times in µs)
– Support very complex overlay schemes
– All the data movement tools in Gedae apply

Stage (8 SPEs)   Initial    Current   Future
Range            ~100,000   989       80
Corner Turn      ~100,000   628       80
Azimuth          ~100,000   702       80
277 Implementation Tools – Automatic Implementation
[Block diagram: Developer, Functional Model, Implementation Specification, Hardware Model with Characterization, Rule Based Engine, Software Characterization on HW Model, Analysis Tools++, Compiler, Threaded Application, Thread Manager, and the SW/HW System.]
278 Software Architecture Recommendations / Templates
The automatic implementation being introduced will contain knowledge about software architecture for various software/hardware systems
Gedae is introducing a software and hardware characterization tool
– Raw data already available in trace collection
– Full characterization of hardware
– Full characterization of software on hardware
279 Cell/B.E. Benchmark Results – Summary
Monte Carlo Black-Scholes simulation
– Matches the performance of hand-optimized code
Matrix multiply
– Block data layout
– 194 Gflop/s – 95% of theoretical max
SAR (synthetic aperture RADAR)
– End-to-end timing including a 0-flop corner turn
– Sustained 88 Gflop/s
– 87x the same algorithm on a 500 MHz quad-AltiVec board (normalized for clock speed – 13.6x)
280 RapidMind Platform
Portable software development platform for multi-core and many-core processors
Single-source solution for portable, high-performance parallel programming
Supports high productivity development
Safe and deterministic general-purpose structured programming model (SPMD stream)
Scalable to an arbitrary number of cores
Can be used to target both accelerators and multicore processors
Integrates with existing C++ compilers
281
RapidMind Platform (2)
282 High Productivity Parallel Programming
A good programming technology should:
1. Provide an accurate conceptual model of the hardware
2. Clearly expose the most important policy decisions and architectural elements of the hardware
3. Provide structure and modularity
4. Automate what can be automated, and not overload the programmer with trivia
5. Provide drill-down mechanisms for use when necessary
283 RapidMind Summary
Usage
– Include platform header
– Link to runtime library
Data
– Values
– Arrays
– Data abstraction
Programs
– Defined dynamically
– Execute on coprocessors
– Code abstraction

#include <rapidmind/platform.hpp>
using namespace rapidmind;
Value1f f = 2.0f;
Array<2,Value3f> a(512,512);
Array<2,Value3f> b(512,512);
Program prog = BEGIN {
  In<Value3f> r, s;
  Out<Value3f> q;
  q = (r + s) * f;
} END;
a = prog(a,b);
f = 3.0f;
stride(a,2,2) = prog(
  slice(a,0,255,0,255),
  slice(b,256,511,0,255));
284
RapidMind Examples
Fluid Simulation
Crowd Simulation
285 Prototyping Platforms – Conclusions
Multi-core/Many-core is a major disruption
– All computers will be massively parallel
– All programmers will have to write parallel programs
– The burden of scaling performance will fall on developers
– This is a software development challenge
Programming models are important!
– Threading is difficult, unsafe, and has poor scalability
– Structured parallelism has scalability and safety advantages
– Want the system to compose multiple deterministic patterns
Programming platforms:
– Not necessary to introduce completely new languages
– Can obtain similar performance and expressiveness within standard C++ and existing compilers