TRANSCRIPT
Design Strategies for Irregularly Adapting Parallel Applications
Leonid Oliker
Future Technologies Group
Computational Research Division
LBNL
www.nersc.gov/~oliker
Rupak Biswas and Hongzhang Shan
Overview
Success of parallel computing in solving large-scale practical applications relies on efficient mapping and execution on available architectures
Most real-life applications are complex, irregular, and dynamic
Generally believed that unstructured methods will constitute significant fraction of future high-end computing
Evaluate existing and emerging architectural platforms in the context of irregular and dynamic applications
Examine the complex interactions between high-level algorithms, programming paradigms, and architectural platforms.
2D Unstructured Mesh Adaptation
Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
Complicated logic and data structures
Difficult to parallelize efficiently
  Irregular data access patterns (pointer chasing; see the data-structure sketch below)
  Workload grows/shrinks at runtime (dynamic load balancing)
Three types of element subdivision
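The irregular access during adaptation comes from pointer-linked mesh records. Below is a minimal C sketch of such data structures; the type and field names (vertex_t, edge_t, triangle_t, child[]) are hypothetical illustrations, not the authors' actual code.

```c
/* Hypothetical mesh records for 2D adaptation; pointer chasing across these
 * linked structures is what makes the access pattern irregular. */
typedef struct vertex   vertex_t;
typedef struct edge     edge_t;
typedef struct triangle triangle_t;

struct vertex   { double x, y; };

struct edge {
    vertex_t   *v[2];      /* endpoints                                     */
    triangle_t *tri[2];    /* the (at most) two triangles sharing this edge */
    int         marked;    /* set during edge marking, read in refinement   */
};

struct triangle {
    edge_t     *e[3];      /* bounding edges                                */
    vertex_t   *v[3];      /* corner vertices                               */
    int         level;     /* refinement level (0 = original mesh)          */
    triangle_t *child[4];  /* filled in when the element is subdivided      */
};
```

Refinement walks triangle-to-edge-to-neighbor chains, and the child arrays grow the workload at runtime, which is why dynamic load balancing is needed.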
Parallel Code Development
Programming paradigms
  Message passing (MPI)
  Shared memory (OpenMP-style pragma compiler directives)
  Multithreading (Tera compiler directives)
Architectures
  Cray T3E
  SGI Origin2000
  Cray (Tera) MTA
Critical factors
  Runtime
  Scalability
  Programmability
  Portability
  Memory overhead
Test Problem
Computational mesh to simulate flow over airfoil
Mesh geometrically refined 5 levels in specific regions to better capture fine-scale phenomena
Initial mesh: 14,605 vertices, 28,404 triangles
Adapted mesh: 488,574 vertices, 1,291,834 triangles
Serial code: 6.4 secs on 250 MHz R10K
Distributed-Memory Implementation
512-node T3E (450 MHz DEC Alpha procs)
32-node Origin2000 (250 MHz dual MIPS R10K procs)
Code implemented in MPI within the PLUM framework
  Initial dual graph used for load balancing adapted meshes
  Parallel repartitioning of adapted meshes (ParMeTiS)
  Remapping algorithm assigns new partitions to processors
  Efficient data movement scheme (predictive & asynchronous)
Three major steps (refinement, repartitioning, remapping)
Overhead
  Programming (to maintain consistent data structures for shared objects)
  Memory (mostly for bulk communication buffers)
Overview of PLUM
[Flowchart: INITIALIZATION (initial mesh, partitioning, mapping) feeds the FLOW SOLVER; the MESH ADAPTOR performs edge marking, coarsening, and refinement; the LOAD BALANCER tests "Balanced?" and "Expensive?" (Y/N branches) and then performs repartitioning, reassignment, and remapping before returning to the flow solver]
Performance of MPI Code
More than 32 procs required to outperform the serial case
Reasonable scalability for refinement & remapping
A scalable repartitioner would improve performance
Data volumes differ due to different word sizes
                        Time (secs)                  Data Vol (MB)
System     P    Refine  Partition  Remap   Total     Max      Total
T3E        8     4.53     1.47     12.97   18.97     68.04   286.80
          64     0.78     1.49      1.81    4.08      6.88   280.30
         160     0.61     1.70      0.69    3.00      4.24   284.41
         512     0.14     4.70      0.25    5.09      0.99   310.40
O2K        2    13.12     1.30     24.89   39.31     50.11    60.64
           8     8.31     1.39     10.23   19.93     30.21   151.75
          64     1.41     2.30      1.69    5.40      4.17   132.34
Origin2000 (Hardware Cache Coherency)
[Diagram: node architecture shows two R12K processors, each with an L2 cache, connected through a Hub to local Memory, a Directory, and an extra Dir (>32P); communication architecture shows Hubs linked by Routers]
Shared-Memory Implementation
32-node Origin2000 (250 MHz dual MIPS R10K procs)
Complexities of partitioning & remapping absent
Parallel dynamic loop scheduling for load balance
GRAPH_COLOR strategy (significant overhead)
  Use SGI's native pragma directives to create IRIX threads
  Color triangles (new ones on the fly) to form independent sets
  All threads process each set to completion, then synchronize
NO_COLOR strategy (too fine grained; see the lock-based sketch below)
  Use low-level locks instead of graph coloring
  When a thread processes a triangle, lock its edges & vertices
  Processors idle while waiting for blocked objects
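A minimal sketch of the NO_COLOR idea, written with portable OpenMP locks rather than the SGI/IRIX primitives used in the original code; triangle_t is the hypothetical record from the earlier sketch, and edge_id(), sort3(), and subdivide() are hypothetical helpers (vertex locks omitted for brevity).

```c
#include <omp.h>

/* One lock per mesh edge, initialized elsewhere with omp_init_lock(). */
extern omp_lock_t *edge_lock;

void refine_no_color(triangle_t **marked, int ntri)
{
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < ntri; t++) {
        triangle_t *T = marked[t];

        /* Lock the three bounding edges in increasing id order so two
         * neighboring triangles can never deadlock on each other. */
        int id[3] = { edge_id(T->e[0]), edge_id(T->e[1]), edge_id(T->e[2]) };
        sort3(id);
        for (int k = 0; k < 3; k++) omp_set_lock(&edge_lock[id[k]]);

        subdivide(T);   /* safely update the shared edges/vertices */

        for (int k = 2; k >= 0; k--) omp_unset_lock(&edge_lock[id[k]]);
    }
}
```

The fine lock granularity is what makes threads idle while waiting for blocked objects, as noted above.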
Performance of Shared-Memory Code
            GRAPH_COLOR              NO_COLOR
  P    Refine   Color   Total         Total
  1     20.8    21.1    41.9           8.2
  4     17.5    24.0    41.5          21.1
  8     17.0    22.6    39.6          38.4
 16     17.8    22.0    39.8          56.8
 32     23.5    25.8    49.3         107.0
 64     42.9    29.6    72.5         160.9
(times in secs)
Poor performance due to flat memory assumption
System overloaded by false sharing
Page migration unable to remedy the problem
Need to consider data locality and cache effects to improve performance (requires partitioning & reordering)
For GRAPH_COLOR:
  Cache misses grow from 15 M (serial) to 85 M (P=1)
  TLB misses grow from 7.3 M (serial) to 53 M (P=1)
Cray (Tera) MTA Multithreaded Architecture
255 MHz MTA uses multithreading to hide latency (100-150 cycles per word) & keep processors saturated with work
no data cache
hashed memory mapping (data layout impossible); near-uniform data access from any processor to any memory location
Each processor has 128 hardware streams (32 registers & PC)
  context switch on each cycle, choosing the next instruction from the ready streams
  a stream can execute an instruction only once every 21 cycles, even if no instructions reference memory
Synchronization between threads accomplished using full / empty bits in memory, allowing fine-grained threads
No explicit load balancing required since dynamic scheduling of work to threads can keep processor saturated
No difference between uni- and multiprocessor parallelism
Multithreaded Implementation
8-processor 250 MHz Tera MTA
  128 streams/proc, flat hashed memory, full/empty bits for sync
  Executes a pipelined instruction from a different stream at each clock tick
Dynamically assigns triangles to threads
  Implicit load balancing
  Low-level synchronization variables ensure adjacent triangles do not update shared edges or vertices simultaneously
No partitioning, remapping, or graph coloring required
  Basically the NO_COLOR strategy
Minimal programming to create the multithreaded version
Performance of Multithreading Code
Sufficient instruction level parallelism exists to tolerate memory access overhead and lightweight synchronization
Number of streams changed via compiler directive
Time (secs) vs. streams per processor
  P        1      40     60     80    100
  1    150.1    3.82   2.72   2.22   2.04
  2      --     1.98   1.40   1.15   1.06
  4      --     1.01   0.74   0.64   0.59
  6      --     0.69   0.51   0.43   0.40
  8      --     0.55   0.41   0.37   0.35
Schematic of Different Paradigms
[Schematic: distributed-memory, shared-memory, and multithreading views of the mesh, before and after adaptation (P=2 for distributed memory)]
Mesh Adaptation: Comparison and Conclusions
Different programming paradigms require varying numbers of operations and overheads
Multithreaded systems offer tremendous potential for solving some of the most challenging real-life problems on parallel computers
Programming       System   Best      P   Code   Mem    Scala-   Porta-
Paradigm                   Time          Incr   Incr   bility   bility
Serial            R10000    6.4       1
MPI               T3E       3.0     160   100%   70%   Medium   High
MPI               O2K       5.4      64   100%   70%   Medium   High
Shared-mem        O2K      39.6       8    10%    5%   None     Medium
Multithreading    MTA      0.35       8     2%    7%   High*    Low
Comparison of Programming Models on Origin2000
[Diagram: three programming models between processors P0 and P1. CC-SAS: direct load/store on a shared array (A1 = A0). SHMEM: communication library with a one-sided Put or Get (not both). MPI: communication library with a matched Send-Receive pair]
We focus on adaptive applications (D-mesh, N-body)
Characteristics of the Models
Naming for remote data
  MPI: cannot
  SHMEM: by symmetric local variables + remote process id
  SAS: same as local variables
Data replication and coherence
  MPI: explicit, needs both source and destination processes
  SHMEM: explicit, needs source or destination process
  SAS: implicit
Transparency (for implementation): increasing from MPI to SHMEM to SAS
D-Mesh Algorithm
[Flowchart: INITIALIZATION (initial mesh, partitioning) -> FLOW SOLVER (matrix transform, iterative solver) -> MESH ADAPTOR (edge marking, refinement) -> LOAD BALANCER (Balanced? If N: partitioning and re-mapping) -> back to FLOW SOLVER]
Implementation of Flow Solver
[Diagram: SAS uses a logical partition of the shared mesh between P0 and P1; MPI uses a physical partition, with each processor holding its own piece]
Matrix Transform: easier in CC-SAS
Iterative Solver: Conjugate Gradient, SPMV
Performance of Solver
[Chart: solver speedups for MPI, SHMEM, and CC-SAS on 16, 32, and 64 processors, for the 156K and 1.3M meshes]
Most of the time is spent in the iterative solver
Data Re-mapping in D-Mesh
[Diagram: data re-mapping between P0 and P1 under CC-SAS vs. MPI/SHMEM]
SAS provides substantial ease of programming in conceptual and orchestration level , far beyond implicit load/store vs. explicit messages
[Diagram: logically partitioned data (CC-SAS) vs. physically partitioned data (MPI/SHMEM) across P0 and P1]
Implementation of Load Balancer
Performance of Adaptor
[Chart: adaptor speedups for MPI, SHMEM, and CC-SAS on 16, 32, and 64 processors, for the 156K and 1.3M meshes]
CC-SAS suffers from the poor spatial locality of shared data
Performance of Load Balancer
[Chart: load balancer time (ms) on 16, 32, and 64 processors, broken into Remap and ParMeTiS components]
Performance of D-Mesh
[Chart: D-Mesh speedups for MPI, SHMEM, and SAS on 16, 32, and 64 processors, for the 156K, 441K, 1M, and 1.3M meshes]
CC-SAS suffers from poor spatial locality for smaller data sets
CC-SAS benefits from the ease of programming for larger data sets
N-Body Simulation: Evolution of Two Plummer Bodies
Barnes-Hut N-body simulation arises in many areas of science and engineering, such as astrophysics, molecular dynamics, and graphics
N-Body Simulation (Barnes-Hut)
[Loop over time steps: build the oct-tree, compute forces on all bodies based on the tree, then update body positions and velocities]
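A skeleton of the Barnes-Hut time-step loop sketched above, in C; the types and functions (body_t, octnode_t, build_octree, compute_force, free_octree) are hypothetical placeholders, not the code evaluated here.

```c
typedef struct { double pos[3], vel[3], acc[3], mass; } body_t;
typedef struct octnode octnode_t;   /* tree cell: center of mass, children */

void simulate(body_t *bodies, int n, int nsteps, double dt, double theta)
{
    for (int step = 0; step < nsteps; step++) {
        octnode_t *root = build_octree(bodies, n);     /* insert all bodies      */

        for (int i = 0; i < n; i++)                    /* walk tree per body,    */
            compute_force(&bodies[i], root, theta);    /* opening "close" cells  */

        for (int i = 0; i < n; i++)                    /* update positions and   */
            for (int d = 0; d < 3; d++) {              /* velocities             */
                bodies[i].vel[d] += dt * bodies[i].acc[d];
                bodies[i].pos[d] += dt * bodies[i].vel[d];
            }

        free_octree(root);                             /* tree rebuilt each step */
    }
}
```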
N-Body: Tree Building Method
[Diagram: CC-SAS builds a single shared tree; MPI/SHMEM build a "locally essential" tree on each processor by distributing/collecting cells and bodies]
Performance of N-Body
For 16 processors, performance is similar
For 64 processors:
  CC-SAS is better for smaller data sets
  but worse for larger data sets
[Chart: N-Body speedups for MPI, SHMEM, and CC-SAS on 16, 32, and 64 processors, for 16K, 64K, 256K, and 1024K bodies]
N-Body: Time Breakdown for (64P, 16K)
[Chart: per-processor time (ms) breakdown into BUSY, LMEM, RMEM, and SYNC for MPI, SHMEM, and CC-SAS; x-axis is the processor identifier]
Less BUSY time for CC-SAS due to ease of programming
N-Body: Time Breakdown for (64P, 1024K)
[Chart: per-processor time (secs) breakdown into BUSY, LMEM, RMEM, and SYNC for MPI, SHMEM, and SAS; x-axis is the processor identifier]
High MEM time for CC-SAS
N-Body: Improved Implementation
SAS: shared tree, duplicate high-level cells
MPI/SHMEM: locally essential tree, duplicate high-level cells
N-Body: Improved Performance
[Chart: improved N-Body speedups for MPI, SHMEM, SAS, and SAS-NEW on 16, 32, and 64 processors]
SAS-NEW performance becomes close to message passing
N-Body and D-Mesh: Essential Source Code Lines
           D-Mesh                                                     N-Body
           Mesh Adaptor   Load Balancer   Flow Solver    Total        Total
MPI           5,337           4,615          6,603       16,015       1,371
SHMEM         5,579           4,100          5,906       15,585       1,322
CC-SAS        2,563           2,142          3,725        8,430       1,065
SHMEM line counts are similar to MPI, but much higher than CC-SAS
CC-SAS CONCLUSIONS
Possible to achieve MPI performance with CC-SAS for irregular and dynamic problems by carefully following the same high level strategies, through methods such as partitioning, remapping, and replication
CC-SAS advantages:
Substantial ease of programming, which can be translated directly into performance gain
Efficient hardware support
CC-SAS disadvantages:
  Suffers from the poor spatial locality of shared data
  Needs application restructuring
CC-SAS takes much less time to develop
Relevant Publications (www.nersc.gov/~oliker)
Summary Paper: “Parallel Computing Strategies for Irregular Algorithms” (Oliker, Biswas, Shan), Annual Review of Scalable Computing, 2003
Dynamic Remeshing: “Parallelization of a Dynamic Unstructured Application using Three Leading Paradigms” (Oliker, Biswas), Supercomputing 1999
Dynamic Remeshing and N-Body Simulation: “A Comparison of Three Programming Models for Adaptive Applications on the Origin2000” (Shan, Singh, Oliker, Biswas), Supercomputing 2000
Conjugate Gradient: “Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations” (Oliker, Biswas, Husbands, Li), SIAM Review, 2002
Algorithms, Architectures, and Programming Paradigms
Several parallel architectures with distinct programming methodologies and performance characteristics have emerged
Examined three irregular algorithms: N-Body Simulation, Dynamic Remeshing, Conjugate Gradient
Parallel architectures: Cray T3E, SGI Origin2000, IBM SP, Cray (Tera) MTA
Programming paradigms: Message-Passing, Shared Memory, Hybrid, Multithreading
Partitioning and/or ordering strategies to decompose the domain: multilevel (MeTiS), linearization (RCM, SAW), combination (MeTiS+SAW)
Sparse Conjugate Gradient
CG is the oldest and best-known Krylov subspace method for solving sparse linear systems (Ax = b)
  starts from an initial guess of x
  successively generates approximate solutions in the Krylov subspace & search directions to update the solution and residual
  slow convergence for ill-conditioned matrices (use a preconditioner)
Sparse matrix vector multiply (SPMV) usually accounts for most of the flops within a CG iteration, and is one of most heavily-used kernels in large-scale numerical simulations
if A is O(n) with nnz nonzeros, SPMV is O(nnz) but DOT is O(n) flops
To perform SPMV (y ← Ax), assume A is stored in compressed row storage (CRS) format and the dense vector x is stored sequentially in memory with unit stride (see the CRS sketch below)
Various numberings of mesh elements result in different nonzero patterns of A, causing different access patterns for x
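A minimal C sketch of SPMV over a CRS matrix, assuming the usual rowptr/col/val arrays; variable names are illustrative, not the Aztec interface.

```c
/* y <- A*x with A in CRS: row i's nonzeros are val[rowptr[i]..rowptr[i+1]-1],
 * with column indices in col[].  The gathers from x follow the column indices,
 * which is why the mesh/matrix ordering drives the cache behavior. */
void spmv_crs(int n, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];   /* irregular, ordering-dependent access to x */
        y[i] = sum;
    }
}
```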
Preconditioned Conjugate Gradient
PCG Algorithm:
  Compute r_0 = b - A x_0, p_0 = z_0 = M^{-1} r_0, for some initial guess x_0
  for j = 0, 1, ..., until convergence
    α_j = (r_j, z_j) / (A p_j, p_j)
    x_{j+1} = x_j + α_j p_j
    r_{j+1} = r_j - α_j A p_j
    z_{j+1} = M^{-1} r_{j+1}
    β_j = (r_{j+1}, z_{j+1}) / (r_j, z_j)
    p_{j+1} = z_{j+1} + β_j p_j
  end for
Each PCG iteration involves
  1 SPMV for A p_j
  1 solve with the preconditioner M (we consider ILU(0))
  3 vector updates (AXPY) for x_{j+1}, r_{j+1}, p_{j+1}
  3 inner products (DOT) for the update scalars α_j, β_j
For symmetric positive definite linear systems, these conditions minimize the distance between the approximate and true solutions
For most practical matrices, SPMV and the triangular solves dominate
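One PCG iteration written in C in terms of the kernels counted above; spmv_crs is the routine sketched earlier, while dot(), axpy(), and precond_solve() (which applies M^{-1} via the ILU(0) forward/backward triangular solves) are hypothetical helpers, not the Aztec/BlockSolve95 interfaces.

```c
void pcg_step(int n, const int *rowptr, const int *col, const double *val,
              double *x, double *r, double *z, double *p, double *Ap)
{
    double rz    = dot(n, r, z);                    /* DOT #1: (r_j, z_j)        */
    spmv_crs(n, rowptr, col, val, p, Ap);           /* the 1 SPMV: A p_j         */
    double alpha = rz / dot(n, Ap, p);              /* DOT #2: (A p_j, p_j)      */
    axpy(n,  alpha, p,  x);                         /* AXPY #1: x += alpha*p     */
    axpy(n, -alpha, Ap, r);                         /* AXPY #2: r -= alpha*A*p   */
    precond_solve(n, r, z);                         /* z = M^{-1} r (TriSolves)  */
    double beta  = dot(n, r, z) / rz;               /* DOT #3: (r_new, z_new)    */
    for (int i = 0; i < n; i++)                     /* AXPY #3: p update         */
        p[i] = z[i] + beta * p[i];
}
```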
Graph Partitioning Strategy: MeTiS
Most popular class of multilevel partitioners
Objectives
  balance computational workload
  minimize edge cut (interprocessor communication)
Algorithm
  collapses vertices & edges using a heavy-edge matching scheme
  applies a greedy partitioning algorithm to the coarsest graph
  uncoarsens it back using greedy graph growing + Kernighan-Lin
[Diagram: multilevel V-cycle showing the coarsening phase, the initial partitioning of the coarsest graph, and the refinement (uncoarsening) phase]
Linearization Strategy: Reverse Cuthill-McKee (RCM)
Matrix bandwidth (profile) has significant impact on efficiency of linear systems and eigensolvers
Graph-based algorithm that generates a permutation so that non-zero entries are close to the diagonal
Good preordering for LU or Cholesky factorization (reduces fill)
Improves cache performance (but does not explicitly reduce edge cut)
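A minimal sketch of RCM in C, assuming a connected graph given in CRS-style adjacency arrays (xadj, adj): start from a low-degree "peripheral" vertex, run a BFS that visits each vertex's unvisited neighbors in order of increasing degree, then reverse the order. Array and function names are illustrative.

```c
#include <stdlib.h>

static int degree(const int *xadj, int v) { return xadj[v+1] - xadj[v]; }

/* perm[k] = original index of the vertex placed k-th in the RCM ordering. */
void rcm(int n, const int *xadj, const int *adj, int *perm)
{
    int *order   = malloc(n * sizeof(int));
    int *visited = calloc(n, sizeof(int));
    int head = 0, tail = 0, start = 0;

    for (int v = 1; v < n; v++)                       /* pick a minimum-degree start */
        if (degree(xadj, v) < degree(xadj, start)) start = v;

    order[tail++] = start; visited[start] = 1;
    while (head < tail) {
        int v = order[head++], first = tail;
        for (int k = xadj[v]; k < xadj[v+1]; k++) {   /* enqueue unvisited neighbors */
            int u = adj[k];
            if (!visited[u]) { visited[u] = 1; order[tail++] = u; }
        }
        for (int i = first; i < tail; i++)            /* order new entries by degree */
            for (int j = i + 1; j < tail; j++)
                if (degree(xadj, order[j]) < degree(xadj, order[i])) {
                    int t = order[i]; order[i] = order[j]; order[j] = t;
                }
    }
    for (int i = 0; i < n; i++)                       /* reverse: Cuthill-McKee -> RCM */
        perm[i] = order[n - 1 - i];
    free(order); free(visited);
}
```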
Linearization Strategy: Self-Avoiding Walks (SAW)
Mesh-based technique similar to space-filling curves
Two consecutive triangles in the walk share an edge or vertex (no jumps)
Visits each triangle exactly once, entering/exiting over an edge/vertex
Improves parallel efficiency related to locality (cache reuse) and load balancing, but does not explicitly reduce edge cuts
Amenable to hierarchical coarsening & refinement
Heber, Biswas, Gao: Concurrency: Practice & Experience, 12 (2000) 85-109
MPI Distributed-Memory Implementation
Each processor has local memory that only it can directly access; message passing required to access memory of another processor
User decides data distribution and organizes comm structure
Allows efficient code design at the cost of higher complexity
CG uses the Aztec sparse linear library; PCG uses BlockSolve95 (Aztec does not have an ILU(0) routine)
Matrix A partitioned into blocks of rows; each block to processor
Associated component vectors (x, b) distributed accordingly
Communication needed to transfer some components of x
AXPY (local computation); DOT (local partial sum followed by a global reduction; see the MPI sketch below)
T3E (450 MHz Alpha processor, 900 Mflops peak, 245 MB main memory, 96 KB secondary cache, 3D torus interconnect)
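A minimal MPI sketch of the distributed DOT and AXPY under the block-row distribution described above; variable names are hypothetical and the actual code uses Aztec rather than hand-written kernels.

```c
#include <mpi.h>

/* Each rank owns nlocal consecutive rows of A and the matching vector pieces. */
double dot_dist(int nlocal, const double *u, const double *v, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; i++)
        local += u[i] * v[i];                         /* local partial sum      */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                  MPI_SUM, comm);                     /* global reduction       */
    return global;
}

/* AXPY needs no communication: y <- y + a*x on the locally owned components. */
void axpy_local(int nlocal, double a, const double *x, double *y)
{
    for (int i = 0; i < nlocal; i++) y[i] += a * x[i];
}
```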
SPMV Locality & Communication Statistics
Performance results using T3E hardware performance monitor
ORIG ordering has large edge cut (interprocessor communication) and poor data locality (high number of cache misses)
MeTiS minimizes edge cut; SAW minimizes cache misses
SPMV and CG Runtimes: Performance on T3E
Smart ordering / partitioning required to achieve good performance and high scalability
For this combination of apps & archs, improving cache reuse is more important than reducing interprocessor communication
Adaptivity will require repartitioning (reordering) and remapping
TriSolve and PCG Runtimes: Performance on T3E
Initial ordering / partitioning significant, even though matrix further reordered by BlockSolve95
TriSolve dominates, and sensitive to ordering
SAW has a slight advantage over RCM & MeTiS; an order of magnitude faster than ORIG
(OpenMP) Shared-Memory Implementation
Origin2000 (SMP of nodes, each with dual 250 MHz R10000 processor & 512 MB local memory)
hardware makes all memory equally accessible from the software perspective
non-uniform memory access time (depends on # hops)
each processor has a 4 MB secondary data cache
when a processor modifies a word, all other copies of that cache line are invalidated
OpenMP-style directives (require significantly less effort than MPI)
Two implementation approaches taken (identical kernels)
  FLATMEM: assume the Origin2000 has uniform shared memory (arrays not explicitly distributed, non-local data handled by cache coherence)
  CC-NUMA: consider the underlying architecture by explicit data distribution
Each processor assigned an equal # of matrix rows (block distribution)
No explicit synchronization required since there are no concurrent writes
Global reduction for the DOT operation
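An OpenMP-style sketch of the shared-memory SPMV and DOT just described: rows are split statically across threads, SPMV needs no synchronization since each thread writes only its own rows, and DOT uses a reduction. For the CC-NUMA variant the arrays would additionally be placed (e.g. by first touch or explicit distribution) so each processor's rows live in its local memory; names are illustrative.

```c
void spmv_omp(int n, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
    #pragma omp parallel for schedule(static)   /* block of rows per thread */
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;                              /* no concurrent writes    */
    }
}

double dot_omp(int n, const double *u, const double *v)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)      /* global reduction for DOT */
    for (int i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}
```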
CG Runtimes: Performance on Origin2000
CC-NUMA performs significantly better than FLATMEM
RCM & SAW reduce runtimes compared to ORIG
Little difference between RCM & SAW, probably due to large cache
CC-NUMA (with ordering) and MPI runtimes comparable, even though programming methodologies quite different
Adaptivity will require reordering and remapping
Hybrid (MPI+OpenMP) Implementation
The latest teraflop-scale system designs contain large numbers of SMP nodes
Mixed programming paradigm combines two layers of parallelism
OpenMP within each SMP node
MPI among SMP nodes
Allows codes to benefit from loop-level parallelism & shared-memory algorithms in addition to coarse-grained parallelism
Natural mapping to the underlying architecture
Currently unclear if hybrid performance gains compensate for increased programming complexity and potential loss of portability
Incrementally add OpenMP directives to Aztec, and some code reorganization (including temp variables for correctness)
IBM SP (222 MHz Power3, 8-way SMP, current switch limits 4 MPI tasks per node)
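A hybrid sketch of the layering described above: MPI ranks own block rows as in the pure-MPI version, and OpenMP parallelizes the row loop within each rank. The exchange_halo() routine is a hypothetical stand-in for the Aztec communication that gathers the off-rank components of x, and a hybrid run would initialize MPI with MPI_Init_thread.

```c
#include <mpi.h>

void exchange_halo(double *x_with_halo, MPI_Comm comm);  /* hypothetical halo exchange */

void spmv_hybrid(int nlocal, const int *rowptr, const int *col, const double *val,
                 double *x_with_halo, double *y, MPI_Comm comm)
{
    exchange_halo(x_with_halo, comm);         /* MPI: fill ghost entries of x        */

    #pragma omp parallel for schedule(static) /* OpenMP: loop-level parallelism      */
    for (int i = 0; i < nlocal; i++) {        /* inside the SMP node                 */
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x_with_halo[col[k]];
        y[i] = sum;
    }
}
```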
CG Runtimes: Performance on SP
Intelligent orderings important for good hybrid performance
MeTiS+SAW is the best strategy, but not dramatically better than the others
For a given processor count, varying the mix of MPI tasks & OpenMP threads has little effect
Hybrid implementation does not offer noticeable advantage
MTA Multithreaded Implementation
Straightforward implementation (only requires compiler directives)
Special assertions used to indicate no loop-carried dependencies
Compiler then able to parallelize loop segments
Load balancing by OS (dynamically assigns matrix rows to threads)
Other than reduction for DOT, no special synchronization constructs required for CG
Synchronization required however for PCG
No special ordering required to achieve good parallel performance
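A sketch of how the CG row loop can be exposed to the MTA compiler. The directive spelling below (#pragma mta assert nodep) is our best recollection of the Tera/Cray MTA assertion for "no loop-carried dependence"; treat the exact name as an assumption and check the compiler manual. No data distribution or ordering is applied; the hashed memory and 128 streams per processor hide the latency of the x[col[k]] gathers.

```c
void spmv_mta(int n, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
#pragma mta assert nodep   /* assumption: asserts no loop-carried dependence,
                              letting the compiler spread iterations over streams */
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}
```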
SPMV and CG Runtimes: Performance on MTA
Both SPMV and CG show high scalability (over 90%) with 60 streams per processor
sufficient TLP to tolerate high overhead of memory access
8-proc MTA faster than 32-proc Origin2000 & 16-proc T3E with no partitioning / ordering overhead; but will scaling continue beyond 8 procs?
Adaptivity does not require extra work to maintain performance
PCG Runtimes: Performance on MTA
Developed a multithreaded version of TriSolve
  matrix factorization times not included
  uses low-level locks to perform on-the-fly dependency analysis
TriSolve responsible for most of the computational overhead
Limited scalability due to insufficient TLP in our dynamic dependency scheme
CG Summary
Examined four different parallel implementations of CG and PCG using four leading programming paradigms and architectures
MPI most complicated
  compared graph partitioning & linearization strategies
  improving cache reuse more important than reducing communication
Smart ordering algorithms significantly improve overall performance
possible to achieve message passing performance using shared memory constructs through careful data ordering & distribution
Hybrid paradigm increases programming complexity with little performance gains
MTA easiest to program
  no partitioning / ordering required to obtain high efficiency & scalability
  no additional complexity for dynamic adaptation
  limited scalability for PCG due to lack of thread-level parallelism
Overall Observations
Examined parallel implementations of irregularly structured (PCG) and dynamically adapting codes (N-Body and D-Mesh) using various paradigms and architectures
Generally, MPI has best performance but most complex implementation
Showed linearization outperforms traditional graph partitioning as a data distribution method for our algorithm
Possible to achieve MPI performance with CC-SAS for irregular and dynamic problems by carefully following the same high level strategies, through methods such as partitioning, remapping, and replication
Hybrid code offers no significant performance advantage over MPI, but increases programming complexity and reduces portability
Multithreading offers tremendous potential for solving some of the most challenging real-life problems on parallel computers