Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers
Abhinav Sarje (asarje@lbl.gov)
Cray Users Group Meeting (CUG), May 2016
Abhinav Sarje Lawrence Berkeley National Laboratory
Douglas Jacobsen Los Alamos National Laboratory
Samuel Williams Lawrence Berkeley National Laboratory
Todd Ringler Los Alamos National Laboratory
Leonid Oliker Lawrence Berkeley National Laboratory
Ocean Modeling with MPAS-Ocean
• MPAS = Model for Prediction Across Scales [LANL/NCAR]
• A multiscale method for simulating Earth's oceans.
• 2D Voronoi tessellation-based variable resolution mesh (SCVT).
• Structured vertical columns in the third dimension represent ocean depths.
Ocean Modeling with MPAS-Ocean
[Figures: variable resolution mesh; mesh elements]
Ocean Modeling with MPAS-Ocean: Why Unstructured Meshes?
• Variable resolutions with any given density function.
• Straightforward mapping to flat 2D with effectively no distortions.
• Quasi-uniform (locally homogeneous) coverage of spherical surfaces.
• Smooth resolution transition regions.
• Preserve symmetry/isotropic nature of a spherical surface.
• Naturally allows unstructured discontinuities.
Background & Motivations
• Unstructured meshes in a parallel environment.
• Challenging to achieve uniform partitioning, unlike structured grids.
• Load balance depends on partitioning quality and element distributions.
• Data exchange between processes through deep halo regions.
• Threading motivated by increasing on-node core counts, with limited available memory per core.
Background & Motivations
• Structured vertical dimension for ocean depth.
• Variable depth per column, but constant-sized buffers allocated to the maximum depth (sketched below)!
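To make that buffer layout concrete, here is a minimal C sketch of the data structure implied above. The names (ocean_block, maxLevelCell, MAX_LEVELS) are illustrative, not MPAS-Ocean's actual Fortran types.

```c
/* Illustrative sketch (not MPAS-Ocean's actual Fortran types): every cell's
 * vertical column is padded to a global maximum number of levels, so one
 * rectangular buffer covers columns of varying real depth. */
#define MAX_LEVELS 60   /* hypothetical global maximum number of vertical levels */

typedef struct {
    int     nCells;                      /* number of horizontal cells           */
    int    *maxLevelCell;                /* actual number of wet levels per cell */
    double (*temperature)[MAX_LEVELS];   /* nCells x MAX_LEVELS padded buffer    */
} ocean_block;

/* Kernels loop only to maxLevelCell[iCell]; entries beyond it are unused padding. */
```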
Developing a Hybrid MPI + OpenMP Implementation
Computational Environment
• Cray XC40: Cori Phase 1
  • 1,630 dual-socket compute nodes with the Cray Aries interconnect
  • 16-core Intel Haswell processors, 32 cores per node
  • 128 GB DRAM per node
• Meshes:
  1 60 km uniform resolution: 114,539 cells, maximum vertical depth of 40.
  2 60 km to 30 km variable resolution: 234,095 cells, maximum vertical depth of 60.
Performance Plots and Keys
[Sample plot key: performance metric vs. configuration (# MPI procs. × # OMP threads): MPI-only, 32×1, 16×2, 8×4, 4×8, 2×16]
Performance Plots and Keys
Configurations = MPI procs. per node × OMP threads per MPI process:
  1: 32 × 1    2: 16 × 2    3: 8 × 4    4: 4 × 8    5: 2 × 16    6: 1 × 32
Moving from configuration 1 to 6, the following decrease:
• Number of MPI processes per node
• Total halo region volume
• Communication (number and volume)
• Extra computations
• Memory requirements
• Inter-process load imbalance
while the following increase:
• Number of threads per process
• Amount of data sharing
• Threading overhead
• Inter-thread load imbalance
Threading Granularity: Block-Level
• Minimal implementation disruption to the code
• Simple intra-node halo exchanges across threads without need for explicit communication
• Possibly insufficient parallelism
• Intra-block halo regions consume memory and require additional computations
Threading Granularity: Element-Level
• Significant implementation disruption to the code
• Eliminates intra-process halo exchanges and halo regions
• Computation and communication efficient (see the sketch below)
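As a minimal illustration of element-level threading (a C/OpenMP sketch rather than the model's Fortran, reusing the hypothetical layout sketched earlier), the horizontal cell loop itself is split among the threads of one MPI process, so no intra-process halo regions or exchanges are needed:

```c
#include <omp.h>

/* Element-level threading: threads of one MPI process share a single block
 * and split the cell loop directly.  Field names are illustrative. */
void update_tendency(int nCells, int maxLevels, const int *maxLevelCell,
                     const double *thickness, double *tendency)
{
    #pragma omp parallel for schedule(static)
    for (int iCell = 0; iCell < nCells; ++iCell) {
        for (int k = 0; k < maxLevelCell[iCell]; ++k) {
            size_t idx = (size_t)iCell * maxLevels + k;
            tendency[idx] += 0.5 * thickness[idx];   /* stand-in for the real physics */
        }
    }
}
```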
Element Distributions across Threads
Goal: What is the best approach to distribute mesh elements among threads?
1 Runtime distribution with explicit OpenMP loop directives.
2 Precomputed distributions:
  • Naive/static.
  • Depth-aware.
Element Distributions across Threads
[Plot: time [s] vs. configuration (32×1, 16×2, 8×4, 4×8, 2×16) for OpenMP static scheduling, precomputed static, and precomputed depth-aware distributions]
Result: Explicit OpenMP loop directives.
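For contrast, here is one plausible reading of the precomputed depth-aware distribution, as a hedged C sketch rather than the code actually used: cells are handed out to threads so that each thread receives roughly the same total number of wet levels (a proxy for work), not the same number of cells.

```c
/* Greedy prefix partition: assign contiguous cell ranges to threads so that
 * the summed column depths are roughly equal per thread.  Names are illustrative. */
void depth_aware_ranges(int nCells, int nThreads, const int *maxLevelCell,
                        int *cellStart, int *cellEnd)
{
    long totalWork = 0;
    for (int i = 0; i < nCells; ++i) totalWork += maxLevelCell[i];
    long target = (totalWork + nThreads - 1) / nThreads;   /* work per thread */

    int cell = 0;
    for (int t = 0; t < nThreads; ++t) {
        cellStart[t] = cell;
        long work = 0;
        while (cell < nCells && (work < target || t == nThreads - 1)) {
            work += maxLevelCell[cell];
            ++cell;
        }
        cellEnd[t] = cell;   /* thread t owns cells [cellStart[t], cellEnd[t]) */
    }
}
```

As the plot above indicates, this extra bookkeeping did not pay off: plain explicit OpenMP loop directives performed as well or better, which the End Notes attribute to the highly irregular memory accesses.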
Thread Scaling Bottlenecks
[Plot: time [s] (log scale) vs. configuration (MPI-only, 32×1, 16×2, 8×4, 4×8, 2×16) for the btr_se_subcycle routines (vel_pred, acc, ssh_pred, vel_corr) and the overall btr_se_subcycle loop]
Memory Allocation and Initialization
Goal: Can the Amdahl bottlenecks be minimized?
1 Baseline code: a single thread performs memory allocation and initialization.
2 Optimized code: all available threads participate in initialization.
3 Plus, reordered allocation and initialization events (see the first-touch sketch below).
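A minimal sketch of what "all available threads for initialization" can look like in C/OpenMP, assuming the same illustrative layout as before: each thread touches the portion of the buffer it will later compute on, which parallelizes the zeroing and, on a dual-socket NUMA node such as Cori Phase 1's, also places pages near the threads that use them via first-touch.

```c
#include <stdlib.h>

/* Threaded allocation + initialization: instead of one thread zeroing the
 * whole array serially, every thread first-touches the part it will later
 * work on.  Names and layout are illustrative. */
double *alloc_and_init_field(int nCells, int maxLevels)
{
    double *field = malloc((size_t)nCells * maxLevels * sizeof *field);
    if (!field) return NULL;

    #pragma omp parallel for schedule(static)
    for (int iCell = 0; iCell < nCells; ++iCell)
        for (int k = 0; k < maxLevels; ++k)
            field[(size_t)iCell * maxLevels + k] = 0.0;

    return field;
}
```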
Memory Allocation and Initialization
[Plots: simulation time [s] and speedup over the original code vs. configuration (32×1, 16×2, 8×4, 4×8, 2×16), comparing the original code with threaded initialization]
Result: Multi-threaded memory management.
OpenMP Scheduling Policies
[Plots: simulation time, compute time, and MPI time [s] (log scale) vs. configuration (MPI-only, 32×1, 16×2, 8×4, 4×8, 2×16) for scheduling policies static; static,100; static,10; static,1; dynamic; dynamic,10; guided]
[Plots: simulation time [s] (log scale) vs. configuration (MPI-only, 32×1, 16×2, 8×4, 4×8, 2×16) for the same scheduling policies on 1, 2, 4, and 8 nodes]
Result: Select static or guided scheduling.
Avoid small chunk sizes.
Balance number of MPI processes and threads: 16 × 2 or 8 × 4.
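For reference, the policies compared above correspond to OpenMP schedule clauses such as the following (a C sketch; the production code is Fortran). The chunk size is the number of consecutive iterations a thread grabs at once, which is why a chunk size of 1 both raises scheduling overhead and destroys locality.

```c
/* OpenMP schedule clauses corresponding to the policies compared above.
 * Very small chunk sizes hand out iterations almost one at a time, which is
 * where the order-of-magnitude slowdown was observed. */
void apply_tendency(int nCells, const double *tendency, double *state)
{
    /* default static: one large contiguous chunk per thread, lowest overhead */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nCells; ++i) state[i] += tendency[i];

    /* guided: chunks start large and shrink, still coarse-grained */
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < nCells; ++i) state[i] += tendency[i];

    /* static with chunk size 1: round-robin single iterations -- avoid */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < nCells; ++i) state[i] += tendency[i];
}
```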
Vectorization?
Goal: Can we take advantage of the vector units?
1 None: -no-vec, -no-simd
2 Compiler auto-vectorization only: -vec, -no-simd
3 OpenMP SIMD only: -no-vec, -simd
4 Full: -vec, -simd
5 Full with aligned memory: -align, -vec, -simd (see the SIMD sketch below)
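To illustrate configurations 3-5, here is a hedged C sketch of an OpenMP SIMD directive on an innermost vertical-level loop (the real code is Fortran; names are illustrative). The aligned clause is only valid under the stated assumption that the field was allocated 64-byte aligned and maxLevels is padded to a multiple of 8 doubles.

```c
#include <stddef.h>

/* OpenMP SIMD on the innermost vertical loop.  The aligned clause assumes
 * the field was allocated 64-byte aligned and maxLevels is a multiple of
 * 8 doubles, so each column starts on a 64-byte boundary. */
void column_sums(int nCells, int maxLevels, const int *maxLevelCell,
                 const double *restrict thickness, double *restrict colSum)
{
    for (int iCell = 0; iCell < nCells; ++iCell) {
        const double *col = &thickness[(size_t)iCell * maxLevels];
        double sum = 0.0;
        #pragma omp simd reduction(+:sum) aligned(col: 64)
        for (int k = 0; k < maxLevelCell[iCell]; ++k)
            sum += col[k];
        colSum[iCell] = sum;
    }
}
```

Because the kernels are bandwidth-bound, such directives made little measurable difference in this study, as the result below notes.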
[Plot: speedup over no vectorization (32×1) for compiler auto-vectorization, OpenMP SIMD directives, compiler auto + OpenMP SIMD vectorization, and full vectorization + aligned memory]
Result: Highly memory-bound, making vectorization unnecessary.
Roofline Performance Model
[Roofline plot: GFLOP/s vs. arithmetic intensity (FLOPs/byte), showing L1 cache and DRAM bandwidth ceilings, the theoretical peak, the empirical peak, and the no-vectorization peak, with the MPAS-O simulation plotted against both its L1 and DRAM arithmetic intensities]
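For reference, the roofline model bounds attainable performance by the standard formulation (not numbers specific to this talk):

\[ P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \times B\bigr) \]

where \(I\) is the arithmetic intensity (FLOPs/byte) measured against a given memory level and \(B\) is that level's bandwidth. An intensity left of the ridge point \(P_{\text{peak}}/B\) means the kernel is bandwidth-bound, consistent with the memory-bound behavior reported above.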
Strong Scaling: Performance
Goal: What is the optimal concurrency?
[Plot: simulation time [s] (log scale) vs. number of cores (32 to 16,384) for MPI-only, 32×1, 16×2, 8×4, 4×8, and 2×16 configurations]
Result: 3× speedup for the best hybrid run (n = 64, 8 × 4) w.r.t. the best MPI-only runs (n = 16).
2× speedup for the best hybrid (8 × 4) w.r.t. the best MPI-only runs at n = 16.
Strong Scaling: Memory Footprint
[Plot: resident memory (VmRSS) per node [MB] (log scale) vs. number of cores for MPI-only, 32×1, 16×2, 8×4, 4×8, and 2×16 configurations]
Minimum memory footprint per process ≈ 250 MB
The working set could possibly fit entirely in the HBM on KNL (Knights Landing).
End Notes
• Threaded buffer initialization is necessary; up to 2.5× speedup observed on a single node.
• Thread scheduling with larger chunk sizes, such as the default static and guided policies, performs best with MPAS-Ocean. A single-element chunk size degrades performance by up to an order of magnitude.
• A well-balanced number of MPI processes and OpenMP threads (e.g. 8 × 4) showed the best performance.
• Highly irregular memory accesses make depth-awareness redundant.
• Also, being memory bound, vectorization has no benefit.
• The hybrid MPI+OpenMP implementation improves strong scaling at high concurrencies w.r.t. the MPI-only implementation; up to 3× improvement observed for the best runs.
• A minimum of 250 MB of memory per process is needed. The code can take advantage of any available high-bandwidth (HBW) memory in upcoming architectures, potentially providing significant performance improvement.
Acknowledgements
• Authors from LBNL were supported by the U.S. Department of Energy Office of Science's Advanced Scientific Computing Research program under contract number DE-AC02-05CH11231.
• Authors from LANL were supported by the U.S. Department of Energy Office of Science's Biological and Environmental Research program.
• Resources at NERSC were used, supported by the DOE Office of Science under Contract No. DE-AC02-05CH11231.
Thank you!