TRANSCRIPT
Design Strategies for Irregularly Adapting Parallel Applications
Leonid Oliker
Future Technologies Group
Computational Research Division
LBNL
www.nersc.gov/~oliker
Rupak Biswas and Hongzhang Shan
Overview
Success of parallel computing in solving large-scale practical applications relies on efficient mapping and execution on available architectures
Most real-life applications are complex, irregular, and dynamic
Generally believed that unstructured methods will constitute significant fraction of future high-end computing
Evaluate existing and emerging architectural platforms in the context of irregular and dynamic applications
Examine the complex interactions between high-level algorithms, programming paradigms, and architectural platforms.
2D Unstructured Mesh Adaptation
Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
Complicated logic and data structures
Difficult to parallelize efficiently
  Irregular data access patterns (pointer chasing; see the data-structure sketch below)
  Workload grows/shrinks at runtime (dynamic load balancing)
Three types of element subdivision
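The irregular access during adaptation comes from pointer-linked mesh records. Below is a minimal C sketch of such data structures; the type and field names (vertex_t, edge_t, triangle_t, child[]) are hypothetical illustrations, not the authors' actual code.

```c
/* Hypothetical mesh records for 2D adaptation; pointer chasing across these
 * linked structures is what makes the access pattern irregular. */
typedef struct vertex   vertex_t;
typedef struct edge     edge_t;
typedef struct triangle triangle_t;

struct vertex   { double x, y; };

struct edge {
    vertex_t   *v[2];      /* endpoints                                     */
    triangle_t *tri[2];    /* the (at most) two triangles sharing this edge */
    int         marked;    /* set during edge marking, read in refinement   */
};

struct triangle {
    edge_t     *e[3];      /* bounding edges                                */
    vertex_t   *v[3];      /* corner vertices                               */
    int         level;     /* refinement level (0 = original mesh)          */
    triangle_t *child[4];  /* filled in when the element is subdivided      */
};
```

Refinement walks triangle-to-edge-to-neighbor chains, and the child arrays grow the workload at runtime, which is why dynamic load balancing is needed.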
Parallel Code Development
Programming paradigms
  Message passing (MPI)
  Shared memory (OpenMP-style pragma compiler directives)
  Multithreading (Tera compiler directives)
Architectures
  Cray T3E
  SGI Origin2000
  Cray (Tera) MTA
Critical factors
  Runtime
  Scalability
  Programmability
  Portability
  Memory overhead
Test Problem
Computational mesh to simulate flow over airfoil
Mesh geometrically refined 5 levels in specific regions to better capture fine-scale phenomena
Initial mesh: 14,605 vertices, 28,404 triangles
Adapted mesh: 488,574 vertices, 1,291,834 triangles
Serial code: 6.4 secs on 250 MHz R10K
Distributed-Memory Implementation
512-node T3E (450 MHz DEC Alpha procs)
32-node Origin2000 (250 MHz dual MIPS R10K procs)
Code implemented in MPI within the PLUM framework
  Initial dual graph used for load balancing adapted meshes
  Parallel repartitioning of adapted meshes (ParMeTiS)
  Remapping algorithm assigns new partitions to processors
  Efficient data movement scheme (predictive & asynchronous)
Three major steps (refinement, repartitioning, remapping)
Overhead
  Programming (to maintain consistent data structures for shared objects)
  Memory (mostly for bulk communication buffers)
Overview of PLUM
[Flowchart: INITIALIZATION (initial mesh, partitioning, mapping) feeds the FLOW SOLVER; the MESH ADAPTOR performs edge marking, coarsening, and refinement; the LOAD BALANCER tests "Balanced?" and "Expensive?" (Y/N branches) and then performs repartitioning, reassignment, and remapping before returning to the flow solver]
Performance of MPI Code
More than 32 procs required to outperform the serial case
Reasonable scalability for refinement & remapping
A scalable repartitioner would improve performance
Data volumes differ due to different word sizes
                        Time (secs)                  Data Vol (MB)
System     P    Refine  Partition  Remap   Total     Max      Total
T3E        8     4.53     1.47     12.97   18.97     68.04   286.80
          64     0.78     1.49      1.81    4.08      6.88   280.30
         160     0.61     1.70      0.69    3.00      4.24   284.41
         512     0.14     4.70      0.25    5.09      0.99   310.40
O2K        2    13.12     1.30     24.89   39.31     50.11    60.64
           8     8.31     1.39     10.23   19.93     30.21   151.75
          64     1.41     2.30      1.69    5.40      4.17   132.34
Origin2000 (Hardware Cache Coherency)
[Diagram: node architecture shows two R12K processors, each with an L2 cache, connected through a Hub to local Memory, a Directory, and an extra Dir (>32P); communication architecture shows Hubs linked by Routers]
Shared-Memory Implementation
32-node Origin2000 (250 MHz dual MIPS R10K procs)
Complexities of partitioning & remapping absent
Parallel dynamic loop scheduling for load balance
GRAPH_COLOR strategy (significant overhead)
  Use SGI's native pragma directives to create IRIX threads
  Color triangles (new ones on the fly) to form independent sets
  All threads process each set to completion, then synchronize
NO_COLOR strategy (too fine grained; see the lock-based sketch below)
  Use low-level locks instead of graph coloring
  When a thread processes a triangle, lock its edges & vertices
  Processors idle while waiting for blocked objects
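A minimal sketch of the NO_COLOR idea, written with portable OpenMP locks rather than the SGI/IRIX primitives used in the original code; triangle_t is the hypothetical record from the earlier sketch, and edge_id(), sort3(), and subdivide() are hypothetical helpers (vertex locks omitted for brevity).

```c
#include <omp.h>

/* One lock per mesh edge, initialized elsewhere with omp_init_lock(). */
extern omp_lock_t *edge_lock;

void refine_no_color(triangle_t **marked, int ntri)
{
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < ntri; t++) {
        triangle_t *T = marked[t];

        /* Lock the three bounding edges in increasing id order so two
         * neighboring triangles can never deadlock on each other. */
        int id[3] = { edge_id(T->e[0]), edge_id(T->e[1]), edge_id(T->e[2]) };
        sort3(id);
        for (int k = 0; k < 3; k++) omp_set_lock(&edge_lock[id[k]]);

        subdivide(T);   /* safely update the shared edges/vertices */

        for (int k = 2; k >= 0; k--) omp_unset_lock(&edge_lock[id[k]]);
    }
}
```

The fine lock granularity is what makes threads idle while waiting for blocked objects, as noted above.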
Performance of Shared-Memory Code
            GRAPH_COLOR              NO_COLOR
  P    Refine   Color   Total         Total
  1     20.8    21.1    41.9           8.2
  4     17.5    24.0    41.5          21.1
  8     17.0    22.6    39.6          38.4
 16     17.8    22.0    39.8          56.8
 32     23.5    25.8    49.3         107.0
 64     42.9    29.6    72.5         160.9
(times in secs)
Poor performance due to flat memory assumption
System overloaded by false sharing
Page migration unable to remedy the problem
Need to consider data locality and cache effects to improve performance (requires partitioning & reordering)
For GRAPH_COLOR:
  Cache misses grow from 15 M (serial) to 85 M (P=1)
  TLB misses grow from 7.3 M (serial) to 53 M (P=1)
Cray (Tera) MTA Multithreaded Architecture
255 MHz MTA uses multithreading to hide latency (100-150 cycles per word) & keep processors saturated with work
no data cache
hashed memory mapping (data layout impossible); near-uniform data access from any processor to any memory location
Each processor has 128 hardware streams (32 registers & PC)
  context switch on each cycle, choosing the next instruction from the ready streams
  a stream can execute an instruction only once every 21 cycles, even if no instructions reference memory
Synchronization between threads accomplished using full / empty bits in memory, allowing fine-grained threads
No explicit load balancing required since dynamic scheduling of work to threads can keep processor saturated
No difference between uni- and multiprocessor parallelism
Multithreaded Implementation
8-processor 250 MHz Tera MTA
  128 streams/proc, flat hashed memory, full/empty bits for sync
  Executes a pipelined instruction from a different stream at each clock tick
Dynamically assigns triangles to threads
  Implicit load balancing
  Low-level synchronization variables ensure adjacent triangles do not update shared edges or vertices simultaneously
No partitioning, remapping, or graph coloring required
  Basically the NO_COLOR strategy
Minimal programming to create the multithreaded version
Performance of Multithreading Code
Sufficient instruction level parallelism exists to tolerate memory access overhead and lightweight synchronization
Number of streams changed via compiler directive
Time (secs) vs. streams per processor
  P        1      40     60     80    100
  1    150.1    3.82   2.72   2.22   2.04
  2      --     1.98   1.40   1.15   1.06
  4      --     1.01   0.74   0.64   0.59
  6      --     0.69   0.51   0.43   0.40
  8      --     0.55   0.41   0.37   0.35
Schematic of Different Paradigms
[Schematic: distributed-memory, shared-memory, and multithreading views of the mesh, before and after adaptation (P=2 for distributed memory)]
Mesh Adaptation: Comparison and Conclusions
Different programming paradigms require varying numbers of operations and overheads
Multithreaded systems offer tremendous potential for solving some of the most challenging real-life problems on parallel computers
Programming       System   Best      P   Code   Mem    Scala-   Porta-
Paradigm                   Time          Incr   Incr   bility   bility
Serial            R10000    6.4       1
MPI               T3E       3.0     160   100%   70%   Medium   High
MPI               O2K       5.4      64   100%   70%   Medium   High
Shared-mem        O2K      39.6       8    10%    5%   None     Medium
Multithreading    MTA      0.35       8     2%    7%   High*    Low
Comparison of Programming Models on Origin2000
[Diagram: three programming models between processors P0 and P1. CC-SAS: direct load/store on a shared array (A1 = A0). SHMEM: communication library with a one-sided Put or Get (not both). MPI: communication library with a matched Send-Receive pair]
We focus on adaptive applications (D-mesh, N-body)
Characteristics of the Models
Naming for remote data
  MPI: cannot
  SHMEM: by symmetric local variables + remote process id
  SAS: same as local variables
Data replication and coherence
  MPI: explicit, needs both source and destination processes
  SHMEM: explicit, needs source or destination process
  SAS: implicit
Transparency (for implementation): increasing from MPI to SHMEM to SAS
D-Mesh Algorithm
[Flowchart: INITIALIZATION (initial mesh, partitioning) -> FLOW SOLVER (matrix transform, iterative solver) -> MESH ADAPTOR (edge marking, refinement) -> LOAD BALANCER (Balanced? If N: partitioning and re-mapping) -> back to FLOW SOLVER]
Implementation of Flow Solver
[Diagram: SAS uses a logical partition of the shared mesh between P0 and P1; MPI uses a physical partition, with each processor holding its own piece]
Matrix Transform: easier in CC-SAS
Iterative Solver: Conjugate Gradient, SPMV
Performance of Solver
[Chart: solver speedups for MPI, SHMEM, and CC-SAS on 16, 32, and 64 processors, for the 156K and 1.3M meshes]
Most of the time is spent in the iterative solver
Data Re-mapping in D-Mesh
[Diagram: data re-mapping between P0 and P1 under CC-SAS vs. MPI/SHMEM]
SAS provides substantial ease of programming in conceptual and orchestration level , far beyond implicit load/store vs. explicit messages
[Diagram: logically partitioned data (CC-SAS) vs. physically partitioned data (MPI/SHMEM) across P0 and P1]
Implementation of Load Balancer
Performance of Adaptor
[Chart: adaptor speedups for MPI, SHMEM, and CC-SAS on 16, 32, and 64 processors, for the 156K and 1.3M meshes]
CC-SAS suffers from the poor spatial locality of shared data
Performance of Load Balancer
[Chart: load balancer time (ms) on 16, 32, and 64 processors, broken into Remap and ParMeTiS components]
Performance of D-Mesh
[Chart: D-Mesh speedups for MPI, SHMEM, and SAS on 16, 32, and 64 processors, for the 156K, 441K, 1M, and 1.3M meshes]
CC-SAS suffers from poor spatial locality for smaller data sets
CC-SAS benefits from the ease of programming for larger data sets
N-Body Simulation: Evolution of Two Plummer Bodies
Barnes-Hut N-body simulation arises in many areas of science and engineering, such as astrophysics, molecular dynamics, and graphics
N-Body Simulation (Barnes-Hut)
[Loop over time steps: build the oct-tree, compute forces on all bodies based on the tree, then update body positions and velocities]
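A skeleton of the Barnes-Hut time-step loop sketched above, in C; the types and functions (body_t, octnode_t, build_octree, compute_force, free_octree) are hypothetical placeholders, not the code evaluated here.

```c
typedef struct { double pos[3], vel[3], acc[3], mass; } body_t;
typedef struct octnode octnode_t;   /* tree cell: center of mass, children */

void simulate(body_t *bodies, int n, int nsteps, double dt, double theta)
{
    for (int step = 0; step < nsteps; step++) {
        octnode_t *root = build_octree(bodies, n);     /* insert all bodies      */

        for (int i = 0; i < n; i++)                    /* walk tree per body,    */
            compute_force(&bodies[i], root, theta);    /* opening "close" cells  */

        for (int i = 0; i < n; i++)                    /* update positions and   */
            for (int d = 0; d < 3; d++) {              /* velocities             */
                bodies[i].vel[d] += dt * bodies[i].acc[d];
                bodies[i].pos[d] += dt * bodies[i].vel[d];
            }

        free_octree(root);                             /* tree rebuilt each step */
    }
}
```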
N-Body: Tree Building Method
[Diagram: CC-SAS builds a single shared tree; MPI/SHMEM build a "locally essential" tree on each processor by distributing/collecting cells and bodies]
Performance of N-Body
For 16 processors, performance is similar
For 64 processors:
  CC-SAS is better for smaller data sets
  but worse for larger data sets
[Chart: N-Body speedups for MPI, SHMEM, and CC-SAS on 16, 32, and 64 processors, for 16K, 64K, 256K, and 1024K bodies]
N-Body: Time Breakdown for (64P, 16K)
[Chart: per-processor time (ms) breakdown into BUSY, LMEM, RMEM, and SYNC for MPI, SHMEM, and CC-SAS; x-axis is the processor identifier]
Less BUSY time for CC-SAS due to ease of programming
N-Body: Time Breakdown for (64P, 1024K)
[Chart: per-processor time (secs) breakdown into BUSY, LMEM, RMEM, and SYNC for MPI, SHMEM, and SAS; x-axis is the processor identifier]
High MEM time for CC-SAS
N-Body: Improved Implementation
SAS: shared tree, duplicate high-level cells
MPI/SHMEM: locally essential tree, duplicate high-level cells
N-Body: Improved Performance
[Chart: improved N-Body speedups for MPI, SHMEM, SAS, and SAS-NEW on 16, 32, and 64 processors]
SAS-NEW performance becomes close to message passing
N-Body and D-Mesh: Essential Source Code Lines
           D-Mesh                                                     N-Body
           Mesh Adaptor   Load Balancer   Flow Solver    Total        Total
MPI           5,337           4,615          6,603       16,015       1,371
SHMEM         5,579           4,100          5,906       15,585       1,322
CC-SAS        2,563           2,142          3,725        8,430       1,065
SHMEM line counts are similar to MPI, but much higher than CC-SAS
CC-SAS CONCLUSIONS
Possible to achieve MPI performance with CC-SAS for irregular and dynamic problems by carefully following the same high level strategies, through methods such as partitioning, remapping, and replication
CC-SAS advantages:
Substantial ease of programming, which can be translated directly into performance gain
Efficient hardware support
CC-SAS disadvantages:
  Suffers from the poor spatial locality of shared data
  Needs application restructuring
CC-SAS takes much less time to develop
Relevant Publications (www.nersc.gov/~oliker)
Summary Paper: “Parallel Computing Strategies for Irregular Algorithms” (Oliker, Biswas, Shan), Annual Review of Scalable Computing, 2003
Dynamic Remeshing: “Parallelization of a Dynamic Unstructured Application using Three Leading Paradigms” (Oliker, Biswas), Supercomputing 1999
Dynamic Remeshing and N-Body Simulation: “A Comparison of Three Programming Models for Adaptive Applications on the Origin2000” (Shan, Singh, Oliker, Biswas), Supercomputing 2000
Conjugate Gradient: “Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations” (Oliker, Biswas, Husbands, Li), SIAM Review, 2002
Algorithms, Architectures, and Programming Paradigms
Several parallel architectures with distinct programming methodologies and performance characteristics have emerged
Examined three irregular algorithms: N-Body Simulation, Dynamic Remeshing, Conjugate Gradient
Parallel architectures: Cray T3E, SGI Origin2000, IBM SP, Cray (Tera) MTA
Programming paradigms: Message-Passing, Shared Memory, Hybrid, Multithreading
Partitioning and/or ordering strategies to decompose the domain: multilevel (MeTiS), linearization (RCM, SAW), combination (MeTiS+SAW)
Sparse Conjugate Gradient
CG is the oldest and best-known Krylov subspace method for solving sparse linear systems (Ax = b)
  starts from an initial guess of x
  successively generates approximate solutions in the Krylov subspace & search directions to update the solution and residual
  slow convergence for ill-conditioned matrices (use a preconditioner)
Sparse matrix vector multiply (SPMV) usually accounts for most of the flops within a CG iteration, and is one of most heavily-used kernels in large-scale numerical simulations
if A is O(n) with nnz nonzeros, SPMV is O(nnz) but DOT is O(n) flops
To perform SPMV (y ← Ax), assume A is stored in compressed row storage (CRS) format and the dense vector x is stored sequentially in memory with unit stride (see the CRS sketch below)
Various numberings of mesh elements result in different nonzero patterns of A, causing different access patterns for x
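A minimal C sketch of SPMV over a CRS matrix, assuming the usual rowptr/col/val arrays; variable names are illustrative, not the Aztec interface.

```c
/* y <- A*x with A in CRS: row i's nonzeros are val[rowptr[i]..rowptr[i+1]-1],
 * with column indices in col[].  The gathers from x follow the column indices,
 * which is why the mesh/matrix ordering drives the cache behavior. */
void spmv_crs(int n, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];   /* irregular, ordering-dependent access to x */
        y[i] = sum;
    }
}
```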
Preconditioned Conjugate Gradient
PCG Algorithm:
  Compute r_0 = b - A x_0, p_0 = z_0 = M^{-1} r_0, for some initial guess x_0
  for j = 0, 1, ..., until convergence
    α_j = (r_j, z_j) / (A p_j, p_j)
    x_{j+1} = x_j + α_j p_j
    r_{j+1} = r_j - α_j A p_j
    z_{j+1} = M^{-1} r_{j+1}
    β_j = (r_{j+1}, z_{j+1}) / (r_j, z_j)
    p_{j+1} = z_{j+1} + β_j p_j
  end for
Each PCG iteration involves
  1 SPMV for A p_j
  1 solve with the preconditioner M (we consider ILU(0))
  3 vector updates (AXPY) for x_{j+1}, r_{j+1}, p_{j+1}
  3 inner products (DOT) for the update scalars α_j, β_j
For symmetric positive definite linear systems, these conditions minimize the distance between the approximate and true solutions
For most practical matrices, SPMV and the triangular solves dominate
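One PCG iteration written in C in terms of the kernels counted above; spmv_crs is the routine sketched earlier, while dot(), axpy(), and precond_solve() (which applies M^{-1} via the ILU(0) forward/backward triangular solves) are hypothetical helpers, not the Aztec/BlockSolve95 interfaces.

```c
void pcg_step(int n, const int *rowptr, const int *col, const double *val,
              double *x, double *r, double *z, double *p, double *Ap)
{
    double rz    = dot(n, r, z);                    /* DOT #1: (r_j, z_j)        */
    spmv_crs(n, rowptr, col, val, p, Ap);           /* the 1 SPMV: A p_j         */
    double alpha = rz / dot(n, Ap, p);              /* DOT #2: (A p_j, p_j)      */
    axpy(n,  alpha, p,  x);                         /* AXPY #1: x += alpha*p     */
    axpy(n, -alpha, Ap, r);                         /* AXPY #2: r -= alpha*A*p   */
    precond_solve(n, r, z);                         /* z = M^{-1} r (TriSolves)  */
    double beta  = dot(n, r, z) / rz;               /* DOT #3: (r_new, z_new)    */
    for (int i = 0; i < n; i++)                     /* AXPY #3: p update         */
        p[i] = z[i] + beta * p[i];
}
```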
Graph Partitioning Strategy: MeTiS
Most popular class of multilevel partitioners
Objectives
  balance computational workload
  minimize edge cut (interprocessor communication)
Algorithm
  collapses vertices & edges using a heavy-edge matching scheme
  applies a greedy partitioning algorithm to the coarsest graph
  uncoarsens it back using greedy graph growing + Kernighan-Lin
[Diagram: multilevel V-cycle showing the coarsening phase, the initial partitioning of the coarsest graph, and the refinement (uncoarsening) phase]
Linearization Strategy: Reverse Cuthill-McKee (RCM)
Matrix bandwidth (profile) has significant impact on efficiency of linear systems and eigensolvers
Graph-based algorithm that generates a permutation so that non-zero entries are close to the diagonal
Good preordering for LU or Cholesky factorization (reduces fill)
Improves cache performance (but does not explicitly reduce edge cut)
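A minimal sketch of RCM in C, assuming a connected graph given in CRS-style adjacency arrays (xadj, adj): start from a low-degree "peripheral" vertex, run a BFS that visits each vertex's unvisited neighbors in order of increasing degree, then reverse the order. Array and function names are illustrative.

```c
#include <stdlib.h>

static int degree(const int *xadj, int v) { return xadj[v+1] - xadj[v]; }

/* perm[k] = original index of the vertex placed k-th in the RCM ordering. */
void rcm(int n, const int *xadj, const int *adj, int *perm)
{
    int *order   = malloc(n * sizeof(int));
    int *visited = calloc(n, sizeof(int));
    int head = 0, tail = 0, start = 0;

    for (int v = 1; v < n; v++)                       /* pick a minimum-degree start */
        if (degree(xadj, v) < degree(xadj, start)) start = v;

    order[tail++] = start; visited[start] = 1;
    while (head < tail) {
        int v = order[head++], first = tail;
        for (int k = xadj[v]; k < xadj[v+1]; k++) {   /* enqueue unvisited neighbors */
            int u = adj[k];
            if (!visited[u]) { visited[u] = 1; order[tail++] = u; }
        }
        for (int i = first; i < tail; i++)            /* order new entries by degree */
            for (int j = i + 1; j < tail; j++)
                if (degree(xadj, order[j]) < degree(xadj, order[i])) {
                    int t = order[i]; order[i] = order[j]; order[j] = t;
                }
    }
    for (int i = 0; i < n; i++)                       /* reverse: Cuthill-McKee -> RCM */
        perm[i] = order[n - 1 - i];
    free(order); free(visited);
}
```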
Linearization Strategy: Self-Avoiding Walks (SAW)
Mesh-based technique similar to space-filling curves
Two consecutive triangles in the walk share an edge or vertex (no jumps)
Visits each triangle exactly once, entering/exiting over an edge/vertex
Improves parallel efficiency related to locality (cache reuse) and load balancing, but does not explicitly reduce edge cuts
Amenable to hierarchical coarsening & refinement
Heber, Biswas, Gao: Concurrency: Practice & Experience, 12 (2000) 85-109
MPI Distributed-Memory Implementation
Each processor has local memory that only it can directly access; message passing required to access memory of another processor
User decides data distribution and organizes comm structure
Allows efficient code design at the cost of higher complexity
CG uses the Aztec sparse linear library; PCG uses BlockSolve95 (Aztec does not have an ILU(0) routine)
Matrix A partitioned into blocks of rows; each block to processor
Associated component vectors (x, b) distributed accordingly
Communication needed to transfer some components of x
AXPY (local computation); DOT (local partial sum followed by a global reduction; see the MPI sketch below)
T3E (450 MHz Alpha processor, 900 Mflops peak, 245 MB main memory, 96 KB secondary cache, 3D torus interconnect)
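A minimal MPI sketch of the distributed DOT and AXPY under the block-row distribution described above; variable names are hypothetical and the actual code uses Aztec rather than hand-written kernels.

```c
#include <mpi.h>

/* Each rank owns nlocal consecutive rows of A and the matching vector pieces. */
double dot_dist(int nlocal, const double *u, const double *v, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; i++)
        local += u[i] * v[i];                         /* local partial sum      */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                  MPI_SUM, comm);                     /* global reduction       */
    return global;
}

/* AXPY needs no communication: y <- y + a*x on the locally owned components. */
void axpy_local(int nlocal, double a, const double *x, double *y)
{
    for (int i = 0; i < nlocal; i++) y[i] += a * x[i];
}
```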
SPMV Locality & Communication Statistics
Performance results using T3E hardware performance monitor
ORIG ordering has large edge cut (interprocessor communication) and poor data locality (high number of cache misses)
MeTiS minimizes edge cut; SAW minimizes cache misses
SPMV and CG Runtimes: Performance on T3E
Smart ordering / partitioning required to achieve good performance and high scalability
For this combination of apps & archs, improving cache reuse is more important than reducing interprocessor communication
Adaptivity will require repartitioning (reordering) and remapping
TriSolve and PCG Runtimes: Performance on T3E
Initial ordering / partitioning significant, even though matrix further reordered by BlockSolve95
TriSolve dominates, and sensitive to ordering
SAW has a slight advantage over RCM & MeTiS; an order of magnitude faster than ORIG
(OpenMP) Shared-Memory Implementation
Origin2000 (SMP of nodes, each with dual 250 MHz R10000 processor & 512 MB local memory)
hardware makes all memory equally accessible from the software perspective
non-uniform memory access time (depends on # hops)
each processor has a 4 MB secondary data cache
when a processor modifies a word, all other copies of that cache line are invalidated
OpenMP-style directives (require significantly less effort than MPI)
Two implementation approaches taken (identical kernels)
  FLATMEM: assume the Origin2000 has uniform shared memory (arrays not explicitly distributed, non-local data handled by cache coherence)
  CC-NUMA: consider the underlying architecture by explicit data distribution
Each processor assigned an equal # of matrix rows (block distribution)
No explicit synchronization required since there are no concurrent writes
Global reduction for the DOT operation
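An OpenMP-style sketch of the shared-memory SPMV and DOT just described: rows are split statically across threads, SPMV needs no synchronization since each thread writes only its own rows, and DOT uses a reduction. For the CC-NUMA variant the arrays would additionally be placed (e.g. by first touch or explicit distribution) so each processor's rows live in its local memory; names are illustrative.

```c
void spmv_omp(int n, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
    #pragma omp parallel for schedule(static)   /* block of rows per thread */
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;                              /* no concurrent writes    */
    }
}

double dot_omp(int n, const double *u, const double *v)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)      /* global reduction for DOT */
    for (int i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}
```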
CG Runtimes: Performance on Origin2000
CC-NUMA performs significantly better than FLATMEM
RCM & SAW reduce runtimes compared to ORIG
Little difference between RCM & SAW, probably due to large cache
CC-NUMA (with ordering) and MPI runtimes comparable, even though programming methodologies quite different
Adaptivity will require reordering and remapping
Hybrid (MPI+OpenMP) Implementation
The latest teraflop-scale system designs contain large numbers of SMP nodes
Mixed programming paradigm combines two layers of parallelism
OpenMP within each SMP node
MPI among SMP nodes
Allows codes to benefit from loop-level parallelism & shared-memory algorithms in addition to coarse-grained parallelism
Natural mapping to the underlying architecture
Currently unclear if hybrid performance gains compensate for increased programming complexity and potential loss of portability
Incrementally add OpenMP directives to Aztec, and some code reorganization (including temp variables for correctness)
IBM SP (222 MHz Power3, 8-way SMP, current switch limits 4 MPI tasks per node)
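A hybrid sketch of the layering described above: MPI ranks own block rows as in the pure-MPI version, and OpenMP parallelizes the row loop within each rank. The exchange_halo() routine is a hypothetical stand-in for the Aztec communication that gathers the off-rank components of x, and a hybrid run would initialize MPI with MPI_Init_thread.

```c
#include <mpi.h>

void exchange_halo(double *x_with_halo, MPI_Comm comm);  /* hypothetical halo exchange */

void spmv_hybrid(int nlocal, const int *rowptr, const int *col, const double *val,
                 double *x_with_halo, double *y, MPI_Comm comm)
{
    exchange_halo(x_with_halo, comm);         /* MPI: fill ghost entries of x        */

    #pragma omp parallel for schedule(static) /* OpenMP: loop-level parallelism      */
    for (int i = 0; i < nlocal; i++) {        /* inside the SMP node                 */
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x_with_halo[col[k]];
        y[i] = sum;
    }
}
```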
CG Runtimes: Performance on SP
Intelligent orderings important for good hybrid performance
MeTiS+SAW is the best strategy, but not dramatically better than the others
For a given processor count, varying the mix of MPI tasks & OpenMP threads has little effect
Hybrid implementation does not offer noticeable advantage
MTA Multithreaded Implementation
Straightforward implementation (only requires compiler directives)
Special assertions used to indicate no loop-carried dependencies
Compiler then able to parallelize loop segments
Load balancing by OS (dynamically assigns matrix rows to threads)
Other than reduction for DOT, no special synchronization constructs required for CG
Synchronization required however for PCG
No special ordering required to achieve good parallel performance
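A sketch of how the CG row loop can be exposed to the MTA compiler. The directive spelling below (#pragma mta assert nodep) is our best recollection of the Tera/Cray MTA assertion for "no loop-carried dependence"; treat the exact name as an assumption and check the compiler manual. No data distribution or ordering is applied; the hashed memory and 128 streams per processor hide the latency of the x[col[k]] gathers.

```c
void spmv_mta(int n, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
#pragma mta assert nodep   /* assumption: asserts no loop-carried dependence,
                              letting the compiler spread iterations over streams */
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}
```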
SPMV and CG Runtimes: Performance on MTA
Both SPMV and CG show high scalability (over 90%) with 60 streams per processor
sufficient TLP to tolerate high overhead of memory access
8-proc MTA faster than 32-proc Origin2000 & 16-proc T3E with no partitioning / ordering overhead; but will scaling continue beyond 8 procs?
Adaptivity does not require extra work to maintain performance
PCG Runtimes: Performance on MTA
Developed a multithreaded version of TriSolve
  matrix factorization times not included
  uses low-level locks to perform on-the-fly dependency analysis
TriSolve responsible for most of the computational overhead
Limited scalability due to insufficient TLP in our dynamic dependency scheme
CG Summary
Examined four different parallel implementations of CG and PCG using four leading programming paradigms and architectures
MPI most complicated
  compared graph partitioning & linearization strategies
  improving cache reuse more important than reducing communication
Smart ordering algorithms significantly improve overall performance
possible to achieve message passing performance using shared memory constructs through careful data ordering & distribution
Hybrid paradigm increases programming complexity with little performance gains
MTA easiest to program
  no partitioning / ordering required to obtain high efficiency & scalability
  no additional complexity for dynamic adaptation
  limited scalability for PCG due to lack of thread-level parallelism
Overall Observations
Examined parallel implementations of irregularly structured (PCG) and dynamically adapting codes (N-Body and D-Mesh) using various paradigms and architectures
Generally, MPI has best performance but most complex implementation
Showed linearization outperforms traditional graph partitioning as a data distribution method for our algorithm
Possible to achieve MPI performance with CC-SAS for irregular and dynamic problems by carefully following the same high level strategies, through methods such as partitioning, remapping, and replication
Hybrid code offers no significant performance advantage over MPI, but increases programming complexity and reduces portability
Multithreading offers tremendous potential for solving some of the most challenging real-life problems on parallel computers