
Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers

Abhinav Sarje (asarje@lbl.gov)

Cray Users Group Meeting (CUG), May 2016


Abhinav Sarje Lawrence Berkeley National Laboratory

Douglas Jacobsen Los Alamos National Laboratory

Samuel Williams Lawrence Berkeley National Laboratory

Todd Ringler Los Alamos National Laboratory

Leonid Oliker Lawrence Berkeley National Laboratory

Outline: Introduction, Threading, Distributions, Initialization, Scheduling, Vectorization, Performance Model, Scaling, End Notes

Ocean Modeling with MPAS-Ocean

• MPAS = Model for Prediction Across Scales. [LANL/NCAR]

• A multiscale method for simulating Earth's oceans.

• 2D Voronoi tessellation-based variable-resolution mesh (SCVT).

• Structured vertical columns in the third dimension represent ocean depths.


[Figures: variable-resolution mesh; mesh elements]


Ocean Modeling with MPAS-Ocean: Why Unstructured Meshes?

• Variable resolutions with any given density function.

• Straightforward mapping to flat 2D with effectively no distortions.

• Quasi-uniform (locally homogeneous) coverage of spherical surfaces.

• Smooth resolution transition regions.

• Preserves the symmetry/isotropic nature of a spherical surface.

• Naturally allows unstructured discontinuities.


Background & Motivations

• Unstructured meshes in a parallel environment.

• Achieving uniform partitioning is challenging, unlike with structured grids.

• Load balance depends on partitioning quality and element distributions.

• Data exchange between processes happens through deep halo regions.

• Threading is motivated by increasing on-node core counts with limited available memory per core.


• Structured vertical dimension for ocean depth.

• Variable depth per cell, but constant-sized buffers allocated at the maximum depth (sketched below).
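To make that buffer layout concrete, here is a minimal sketch of a flat cells × levels array with a per-cell active depth. The struct and field names are illustrative, chosen only to echo MPAS naming (nVertLevels, maxLevelCell); this is not the actual MPAS-Ocean data structure.

```c
/* Illustrative sketch (not MPAS-Ocean source): every cell stores a full
 * column of nVertLevels entries, but only the first maxLevelCell[iCell]
 * levels hold active ocean; the rest is padding up to the maximum depth. */
#include <stdlib.h>

typedef struct {
    int     nCells;        /* number of horizontal mesh cells            */
    int     nVertLevels;   /* maximum number of vertical levels          */
    int    *maxLevelCell;  /* active depth of each cell (<= nVertLevels) */
    double *temperature;   /* flat [nCells * nVertLevels] buffer         */
} ColumnField;

/* Uniform indexing: every column has the same stride, even though the
 * deep levels of shallow cells are never used. */
static double *cell_column(ColumnField *f, int iCell) {
    return f->temperature + (size_t)iCell * (size_t)f->nVertLevels;
}
```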


Developing a Hybrid MPI + OpenMP Implementation
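Before the details, a minimal skeleton of the hybrid structure being developed: MPI is initialized with funneled thread support, threads share the per-rank compute, and MPI calls stay on the master thread. This is a generic C sketch with hypothetical routine names, not MPAS-Ocean source.

```c
/* Generic hybrid MPI + OpenMP skeleton (not MPAS-Ocean source). Each rank
 * owns a partition of the mesh; OpenMP threads share the work inside it. */
#include <mpi.h>
#include <omp.h>

/* Hypothetical stand-ins for the per-rank compute and halo exchange. */
static void compute_timestep_on_partition(int rank) { (void)rank; }
static void exchange_halos(int rank) { (void)rank; }

int main(int argc, char **argv) {
    int provided, rank;
    /* Funneled support: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; ++step) {
        #pragma omp parallel
        compute_timestep_on_partition(rank);   /* all threads enter the compute */

        exchange_halos(rank);                  /* MPI calls on the master only  */
    }

    MPI_Finalize();
    return 0;
}
```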


Computational Environment

• Cray XC40: Cori Phase 1
  • 1,630 dual-socket compute nodes with the Cray Aries interconnect
  • 16-core Intel Haswell processors, 32 cores per node
  • 128 GB DRAM per node

• Meshes:
  1. 60 km uniform resolution: 114,539 cells, maximum vertical depth of 40 levels.
  2. 60 km to 30 km variable resolution: 234,095 cells, maximum vertical depth of 60 levels.


Performance Plots and Keys

[Example plot: a performance metric plotted against configuration (# MPI procs. × # OMP threads): MPI-only, 32x1, 16x2, 8x4, 4x8, 2x16]


Configurations: MPI processes per node × OpenMP threads per MPI process

1. 32 × 1
2. 16 × 2
3. 8 × 4
4. 4 × 8
5. 2 × 16
6. 1 × 32

Moving down this list (fewer MPI processes, more threads), the following decrease:

• Number of MPI processes per node

• Total halo region volume

• Communication (number and volume)

• Extra computations

• Memory requirements

• Inter-process load imbalance

...while the following increase:

• Number of threads per process

• Amount of data sharing

• Threading overhead

• Inter-thread load imbalance


Threading Granularity: Block-Level

• Minimal implementation disruption to the code.

• Simple intra-node halo exchanges across threads without the need for explicit communication.

• Possibly insufficient parallelism.

• Intra-block halo regions consume memory and require additional computations.


Threading Granularity: Element-Level

• Significant implementation disruption to the code.

• Eliminates intra-process halo exchanges and halo regions.

• Computation and communication efficient (see the sketch below contrasting the two granularities).
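A schematic contrast of the two granularities, written in C with hypothetical types; real MPAS blocks and fields are considerably more elaborate than this.

```c
/* Schematic contrast of the two threading granularities (hypothetical types). */
typedef struct { int nCells; double *field; } Block;

/* Block-level: each thread owns whole blocks; intra-process block halos
 * are still needed so blocks can be updated independently. */
void update_block_level(Block *blocks, int nBlocks) {
    #pragma omp parallel for
    for (int b = 0; b < nBlocks; ++b)
        for (int i = 0; i < blocks[b].nCells; ++i)
            blocks[b].field[i] += 1.0;      /* stand-in for real physics */
}

/* Element-level: one block per process; threads split the cell loop
 * directly, so no intra-process halo regions or exchanges are needed. */
void update_element_level(Block *block) {
    #pragma omp parallel for
    for (int i = 0; i < block->nCells; ++i)
        block->field[i] += 1.0;             /* stand-in for real physics */
}
```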


Element Distributions across Threads

Goal: What is the best approach to distribute mesh elements among threads?

1. Runtime distribution with explicit OpenMP loop directives.

2. Precomputed distributions:
   • Naive/static.
   • Depth-aware.


[Plot: time (s) per configuration (32x1, 16x2, 8x4, 4x8, 2x16) for OpenMP static scheduling, pre-computed static distribution, and pre-computed depth-aware distribution]

Result: explicit OpenMP loop directives perform best.
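For reference, a sketch of the two approaches that were compared; the names and loop body are illustrative only. A depth-aware precomputed distribution would pick the per-thread cut points by balancing the sum of column depths rather than the cell count.

```c
#include <omp.h>

/* (1) Runtime distribution: the OpenMP runtime splits the cell loop. */
void relax_runtime(double *field, int nCells) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nCells; ++i)
        field[i] *= 0.5;                    /* stand-in for real work */
}

/* (2) Precomputed distribution: each thread is handed a fixed [lo,hi)
 * range computed once up front. A depth-aware variant would choose the
 * cut points so that the summed column depths, not the number of cells,
 * are balanced across threads. */
void relax_precomputed(double *field, const int *threadLo, const int *threadHi) {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        for (int i = threadLo[t]; i < threadHi[t]; ++i)
            field[i] *= 0.5;                /* stand-in for real work */
    }
}
```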


Thread Scaling Bottlenecks

[Plot: time (s, log scale) for MPI-only and hybrid configurations (32x1, 16x2, 8x4, 4x8, 2x16), broken down by barotropic subcycle routines: btr_se_subcycle_vel_pred, btr_se_subcycle_acc, btr_se_subcycle_ssh_pred, btr_se_subcycle_vel_corr, btr_se_subcycle_loop]


Memory Allocation and Initialization

Goal: Can the Amdahl bottlenecks be minimized?

1. Baseline code: a single thread performs memory allocation and initialization.

2. Optimized code: all available threads participate in initialization (sketched below).

3. Plus, reordered allocation and initialization events.
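A minimal sketch of the difference between the baseline and optimized variants: zeroing a large buffer from one thread versus from all threads, which parallelizes the work and, under a first-touch policy, also spreads the pages across the threads. Function names are illustrative.

```c
#include <stdlib.h>

/* Baseline: a single thread allocates and zeroes the buffer; the work is
 * serial and, under first-touch, every page lands near that one thread. */
double *init_serial(long n) {
    double *buf = malloc((size_t)n * sizeof *buf);
    for (long i = 0; i < n; ++i) buf[i] = 0.0;
    return buf;
}

/* Optimized: all threads zero their own share of the buffer, which
 * parallelizes the initialization and distributes the pages across threads. */
double *init_threaded(long n) {
    double *buf = malloc((size_t)n * sizeof *buf);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) buf[i] = 0.0;
    return buf;
}
```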


[Plots: simulation time (s) and speedup over the original code for configurations 32x1, 16x2, 8x4, 4x8, 2x16, comparing the original (serial initialization) and threaded-initialization versions]

Result: Multi-threaded memory management.


OpenMP Scheduling Policies

[Plot: simulation time (s, log scale) for MPI-only and hybrid configurations under scheduling policies static, static,100, static,10, static,1, dynamic, dynamic,10, and guided]


[Plot: compute time (s, log scale) per configuration for the same scheduling policies]


[Plot: MPI time (s, log scale) per configuration for the same scheduling policies]


[Plots: simulation time (s, log scale) for MPI-only and hybrid configurations under the same scheduling policies, shown separately for 1, 2, 4, and 8 nodes]

Result: select static or guided scheduling; avoid small chunk sizes; balance the number of MPI processes and threads (16 × 2 or 8 × 4).
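The policies above map directly onto the OpenMP schedule clause; a hedged sketch with a stand-in loop body:

```c
/* The scheduling policies compared above, expressed as OpenMP schedule
 * clauses; the loop body is a stand-in for the real per-cell work. */
void tendency_static(double *f, int n) {
    #pragma omp parallel for schedule(static)    /* default: large contiguous chunks */
    for (int i = 0; i < n; ++i) f[i] += 1.0;
}

void tendency_static_1(double *f, int n) {
    /* Single-element chunks: the case that degraded MPAS-Ocean performance
     * by up to an order of magnitude. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; ++i) f[i] += 1.0;
}

void tendency_guided(double *f, int n) {
    #pragma omp parallel for schedule(guided)    /* chunks shrink over time */
    for (int i = 0; i < n; ++i) f[i] += 1.0;
}
```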


Vectorization?

Goal: Can we take advantage of the vector units?

1. None: -no-vec, -no-simd

2. Compiler auto-vectorization only: -vec, -no-simd

3. OpenMP SIMD only: -no-vec, -simd

4. Full: -vec, -simd

5. Full with aligned memory: -align, -vec, -simd


[Plot: speedup over no vectorization for the 32x1 configuration, comparing compiler auto-vectorization, OpenMP SIMD directives, combined compiler auto + OpenMP SIMD, and full vectorization with aligned memory]

Result: the code is highly memory-bound, making vectorization unnecessary.
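For completeness, option 3 above corresponds to explicit OpenMP SIMD annotations such as the sketch below of an inner vertical-level loop (names illustrative); given the memory-bound behaviour, such annotations bring little benefit here.

```c
/* Illustrative OpenMP SIMD annotation on an inner vertical-level loop.
 * With the code memory-bound, this kind of vectorization yields little
 * benefit for MPAS-Ocean. */
void column_update(double *restrict thickness, const double *restrict tendency,
                   int nLevels, double dt) {
    #pragma omp simd
    for (int k = 0; k < nLevels; ++k)
        thickness[k] += dt * tendency[k];
}
```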


Roofline Performance Model


[Roofline plot: GFLOP/s vs. arithmetic intensity (FLOPs/byte), showing the L1-cache and DRAM bandwidth ceilings, theoretical peak, no-vectorization peak, empirical peak, and the measured MPAS-O simulation points against L1 and DRAM]
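The rooflines above express the usual bound: attainable performance is the minimum of the compute peak and the product of arithmetic intensity and memory bandwidth. A tiny sketch with placeholder machine numbers, not measured Cori values:

```c
#include <stdio.h>

/* Roofline bound: min(compute peak, arithmetic intensity x bandwidth). */
static double roofline_gflops(double ai, double peak_gflops, double bw_gbs) {
    double memory_bound = ai * bw_gbs;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    /* Placeholder machine numbers, not measured Cori values. */
    double peak = 1000.0;   /* GFLOP/s */
    double bw   = 100.0;    /* GB/s    */
    /* A low-intensity kernel (0.1 FLOPs/byte) sits on the bandwidth slope. */
    printf("attainable: %.1f GFLOP/s\n", roofline_gflops(0.1, peak, bw));
    return 0;
}
```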


Strong Scaling: Performance

Goal: What is the optimal concurrency?

[Plot: simulation time (s, log scale) vs. number of cores (32 to 16,384) for MPI-only and hybrid configurations 32x1, 16x2, 8x4, 4x8, 2x16]

Result: 3× speedup for the best hybrid runs (8 × 4 at n = 64) w.r.t. the best MPI-only runs (n = 16); 2× speedup for the best hybrid (8 × 4) w.r.t. the best MPI-only runs at n = 16.


Strong Scaling: Memory Footprint

[Plot: resident memory (VmRSS) per node (MB, log scale) vs. number of cores for MPI-only and hybrid configurations 32x1, 16x2, 8x4, 4x8, 2x16]

Minimum memory footprint per process ≈ 250 MB

The working set could therefore fit entirely in the HBM on KNL.
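If the working set does fit in on-package high-bandwidth memory, one possible way to place it there is the memkind library's hbwmalloc interface, sketched below with a DRAM fallback. This is only an illustration, not part of the MPAS-Ocean implementation described here.

```c
/* Sketch: place the working set in high-bandwidth memory via the memkind
 * hbwmalloc interface, falling back to regular DRAM when HBM is absent.
 * Illustrative only; the caller must free with hbw_free() or free()
 * according to the in_hbm flag. */
#include <stdlib.h>
#include <hbwmalloc.h>

double *alloc_working_set(size_t n, int *in_hbm) {
    if (hbw_check_available() == 0) {          /* 0 means HBM is present */
        *in_hbm = 1;
        return hbw_malloc(n * sizeof(double));
    }
    *in_hbm = 0;
    return malloc(n * sizeof(double));         /* DRAM fallback */
}
```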


End Notes

• Threaded buffer initialization is necessary; up to 2.5× speedup was observed on a single node.

• Thread scheduling with larger chunk sizes, such as the default static and guided policies, performs best with MPAS-Ocean; a single-element chunk size degrades performance by up to an order of magnitude.

• A well-balanced mix of MPI processes and OpenMP threads (e.g., 8×4) showed the best performance.

• Highly irregular memory accesses make depth-awareness redundant.

• Because the code is memory-bound, vectorization provides no benefit.

• The hybrid MPI+OpenMP implementation improves strong scaling at high concurrencies w.r.t. the MPI-only implementation; up to 3× improvement was observed for the best runs.

• A minimum of about 250 MB of memory per process is needed; the code can take advantage of high-bandwidth memory in upcoming architectures, potentially providing a significant performance improvement.


Acknowledgements

• Authors from LBNL were supported by the U.S. Department of Energy Office of Science's Advanced Scientific Computing Research program under contract number DE-AC02-05CH11231.

• Authors from LANL were supported by the U.S. Department of Energy Office of Science's Biological and Environmental Research program.

• Resources at NERSC were used, supported by the DOE Office of Science under Contract No. DE-AC02-05CH11231.


Thank you!