cori application readiness strategy and early experiences
TRANSCRIPT
![Page 1: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/1.jpg)
Cori Application Readiness Strategy and Early Experiences
March, 2016
![Page 2: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/2.jpg)
What is different about Cori?
Edison (Ivy-Bridge):● 12 Cores Per CPU● 24 Virtual Cores Per CPU
● 2.4-3.2 GHz
● Can do 4 Double Precision Operations per Cycle (+ multiply/add)
● 2.5 GB of Memory Per Core
● ~100 GB/s Memory Bandwidth
Cori (Knights-Landing):● Up to 72 Physical Cores Per CPU● Up to 288 Virtual Cores Per CPU
● Much slower GHz
● Can do 8 Double Precision Operations per Cycle (+ multiply/add)
● < 0.3 GB of Fast Memory Per Core < 2 GB of Slow Memory Per Core
● Fast memory has ~ 5x DDR4 bandwidth
![Page 3: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/3.jpg)
NESAPThe NERSC Exascale Science Application Program
![Page 4: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/4.jpg)
Code Coverage
![Page 5: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/5.jpg)
Resources for Code Teams•
––
•–
–
•––
![Page 6: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/6.jpg)
NESAP Postdocs
Taylor BarnesQuantum ESPRESSO
Brian FriesenBoxlib
Andrey OvsyannikovChombo-Crunch
Mathieu LobetWARP
Tuomas KoskelaXGC1
Tareq MalasEMGeo
![Page 7: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/7.jpg)
NERSC Staff associated with NESAP
Nick Wright Katie Antypas Brian Austin Zhengji Zhao
Jack DeslippeWoo-Sun Yang
Helen He Ankit Bhagatwala
Doug Doerfler
Richard Gerber
Rebecca Hartman-BakerBrandon Cook Thorsten Kurth
Stephen Leak
![Page 8: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/8.jpg)
Timeline
![Page 9: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/9.jpg)
Timeline
![Page 10: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/10.jpg)
Working With Vendors
Dungeon Session Speedups (From Session and Immediate Followup)
NERSC Is uniquely positioned between HPC Vendors and HPC Users and Applications developers.
NESAP provides a power venue for these two groups to interact.
![Page 11: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/11.jpg)
Optimization Strategy
![Page 12: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/12.jpg)
Important Optimization Concepts
![Page 13: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/13.jpg)
![Page 14: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/14.jpg)
Can You Increase Flops
Per Byte Loaded From Memory in Your Algorithm?
Make AlgorithmChanges
Explore Using HBM on Cori
For Key Arrays
Is Performance affected by Half-Clock
Speed?
Run Example at “Half Clock”
Speed
Run Example in “Half
Packed” Mode
Is Performance affected by
Half-Packing?
Your Code is at least Partially Memory Bandwidth Bound
You are at least
Partially CPU Bound
Make Sure Your Code is
Vectorized! Measure Cycles Per Instruction
with VTune
Likely Partially Memory Latency
Bound (assuming not IO or
Communication Bound)
Use IPM and Darshan to Measure and Remove Communication and IO Bottlenecks from Code
Can You Reduce Memory
Requests Per Flop In
Algorithm?
Try Running With as Many
Virtual Threads as Possible (>
240 Per Node on Cori)
Make AlgorithmChanges
YesYes
Yes Yes
No
No No
No
The Ant Farm Flow Chart
![Page 15: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/15.jpg)
Can You Increase Flops
Per Byte Loaded From Memory in Your Algorithm?
Make AlgorithmChanges
Explore Using HBM on Cori
For Key Arrays
Is Performance affected by Half-Clock
Speed?
Run Example at “Half Clock”
Speed
Run Example in “Half
Packed” Mode
Is Performance affected by
Half-Packing?
Your Code is at least Partially Memory Bandwidth Bound
You are at least
Partially CPU Bound
Make Sure Your Code is
Vectorized! Measure Cycles Per Instruction
with VTune
Likely Partially Memory Latency
Bound (assuming not IO or
Communication Bound)
Use IPM and Darshan to Measure and Remove Communication and IO Bottlenecks from Code
Can You Reduce Memory
Requests Per Flop In
Algorithm?
Try Running With as Many
Virtual Threads as Possible (>
240 Per Node on Cori)
Make AlgorithmChanges
Yes Yes
No
No No
No
![Page 16: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/16.jpg)
Are you memory or compute bound? Or both?
Run Example in “Half
Packed” Mode
srun -N 2 -n 24 -c 2 - S 6 ... VS srun -N 1 -n 24 -c 1 ...
If you run on only half of the cores on a node, each core you do run has access to more bandwidth
If your performance changes, you are at least partially memory bandwidth bound
![Page 17: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/17.jpg)
If your performance changes, you are at least partially memory bandwidth bound
Are you memory or compute bound? Or both?
Run Example in “Half
Packed” Mode
srun -n 24 -N 12 - S 6 ... VS aprun -n 24 -N 24 -S 12 ...
If you run on only half of the cores on a node, each core you do run has access to more bandwidth
![Page 18: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/18.jpg)
Measuring Your Memory Bandwidth Usage (VTune)
Measure memory bandwidth usage in VTune. (Next Talk)
Compare to Stream GB/s.
If 90% of stream, you are memory bandwidth
bound.
If less, more tests need to be done.
![Page 19: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/19.jpg)
Are you memory or compute bound? Or both?
srun --cpu-freq=2400000 ... VS srun --cpu-freq=1900000 ...
Reducing the CPU speed slows down computation, but doesn’t reduce memory bandwidth available.
If your performance changes, you are at least partially compute bound
Run Example at “Half Clock”
Speed
![Page 20: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/20.jpg)
So, you are Memory Bandwidth Bound?
What to do?
1. Try to improve memory locality, cache reuse
2. Identify the key arrays leading to high memory bandwidth usage and make sure they are/will-be allocated in HBM on Cori.
Profit by getting ~ 5x more bandwidth GB/s.
![Page 21: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/21.jpg)
So, you are Compute Bound?
What to do?1. Make sure you have good OpenMP scalability. Look at VTune to see thread activity for major
OpenMP regions.
2. Make sure your code is vectorizing. Look at Cycles per Instruction (CPI) and VPU utilization in vtune.
See whether intel compiler vectorized loop using compiler flag: -qopt-report=5
![Page 22: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/22.jpg)
Things that prevent vectorization in your code
Example From Cray COE Work on XGC1
![Page 23: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/23.jpg)
Things that prevent vectorization in your code
Example From Cray COE Work on XGC1
~40% speed up for kernel
![Page 24: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/24.jpg)
NESAP Case Studies (More on Thursday)
![Page 25: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/25.jpg)
WARP/PICSAR
Currentdeposition
Field gather
Fieldpush
Particlepush
● Current deposition (particle-to-grid) and Field gather (grid-to-particle) most time consuming subroutines
● Large time spent in memory accesses● Low vectorization
Memoryaccess
FloatingPointops
(scalar)
Floatingpointops
(vector)
NESAP Lead Ankit Bhagatwala, Mathieu Lobet
![Page 26: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/26.jpg)
Optimization 1: Tiling (Sep 2015)
▪ Improve memory locality by tiling particle and grid quantities
Former data layout in PICSAR Tiled layout
• Particles randomly distributed on the global process grid
• Poor cache reuse
• Particles grouped in tiles small enough that local particle/grid arrays fit in cache• Particles deposit charge/current on local grid array in cache• Reduction of local charge/current arrays in global array • Slight extra overhead of reduction
![Page 27: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/27.jpg)
Performance improvement from tiling
Lower is better
Tile size >> L2 Tile fits in L2
• Problem size: 80x80x80 cells• ~10 particles per cell
![Page 28: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/28.jpg)
Optimization 2: Vectorized current deposition
••
••
Lower is better
![Page 29: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/29.jpg)
VASP
NESAP Lead Zhengji Zhao
![Page 30: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/30.jpg)
VASP profiling- memory bandwidth boudn?
![Page 31: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/31.jpg)
Estimating the performance impact of HBW memory to VASP code using AutoHBW tool on Edison
Edison, a Cray XC30, with dual-socket Ivy Bridge nodes interconnected with Cray’s Aries network, the bandwidths of the near socket memory (simulating MCDRAM) and the far socket memory via QPI (simulating DDR) differ by 33%
![Page 32: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/32.jpg)
VASP+FASTMEM performance on Edison
VASP performance comparison between runs when everything was allocated in the DDR memory (blue/slow), when only a few selected arrays were allocated to HBM (red/mixed), and when everything was allocated to HBM (green/fast). The test case PdO@Pd-slab was used, and the tests were run on a single Edison node.
![Page 33: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/33.jpg)
•
•
![Page 34: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/34.jpg)
•
![Page 35: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/35.jpg)
•
Without blocking we spill out of L2 on KNC and Haswell. But, Haswell has L3 to catch us.
![Page 36: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/36.jpg)
•
Without blocking we spill out of L2 on KNC and Haswell. But, Haswell has L3 to catch us.
![Page 37: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/37.jpg)
••
![Page 38: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/38.jpg)
Conclusions
![Page 39: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/39.jpg)
![Page 40: Cori Application Readiness Strategy and Early Experiences](https://reader030.vdocuments.us/reader030/viewer/2022012407/616a20e811a7b741a34f1c1e/html5/thumbnails/40.jpg)