PMaC: Performance Modeling and Characterization
Efficient HPC Data Motion via Scratchpad Memory
Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington
Question 1: Do HPC workloads benefit from software-managed scratchpads? YES!
If so, how will we manage it?
Outline
– Motivation
– Scratchpad Background
– Simulation Framework and Methodology
– Initial Study
– Current Direction
Problem: HPC Power Wall
Can't scale old systems
– Power wall already reached by petaflop systems
– Must redesign for power savings
Efficiency must increase by 2x
Source: Exascale Report (Kogge, 2008)
How to Get Energy Savings
1. Redesign hardware
– Simpler hardware
– Transfer complexity to software
2. Minimize expensive data movement
– Memory is slower
– More cores = more contention
– HPC codes have large working set sizes
What is a Scratchpad?
Scratchpad memory (SPM)
– Local memory (like a cache)
– SPM: software-allocated memory
Simpler hardware
[Figure: scratchpad (memory array, decoder) vs. cache (memory array, decoder, tagging array)]
Scratchpad Allocation
Dynamic
– Move block of code
– Iterate over code
– Move another block
Static: Move block of code once
Strategies
– Knapsack
– Graph coloring (register allocation problem)
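The knapsack strategy can be sketched as a 0/1 knapsack over candidate blocks: each block has a size and an estimated benefit, and the SPM capacity is the knapsack limit. This is a minimal illustration, not the allocator used in this work; the block names, sizes, and benefit values are hypothetical.

```python
def knapsack_allocate(blocks, capacity):
    """Pick the subset of blocks with the highest total benefit that fits.

    blocks: list of (name, size, benefit) tuples with positive integer sizes.
    Returns (best_benefit, sorted list of chosen block names).
    """
    # best[c] = (benefit, chosen names) achievable with at most c units of SPM
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacities downward so each block is placed at most once
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    total, chosen = best[capacity]
    return total, sorted(chosen)


# Hypothetical candidates: (name, size in KB, estimated accesses served by SPM)
blocks = [
    ("stencil_buf", 24, 900),
    ("index_table", 40, 1000),
    ("halo", 16, 500),
    ("lookup", 32, 700),
]
total, chosen = knapsack_allocate(blocks, capacity=64)
print(total, chosen)
```

A static scheme would run this once before execution; a dynamic scheme could rerun it per program phase with updated benefit estimates.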
The Idea: Less Data Movement
Scratchpad saves energy
– Allocation burden now on software
– Less complexity in hardware
Move only what you use
– Uses temporal locality
Cache
– Spatial locality can fail: superfluous data movement
(Spatial locality is built into cache design; note the 8-word line size in most architectures)
[Figure: words A, B, C, D, E moved into cache]
Implication of Scratchpads
Current use: embedded systems
– Smaller working set size
– Predictable code
GPUs
– Coding overhead
Issue: HPC codes
– Large, unpredictable codes
– How to generalize codes?
– How to make it practical and efficient?
Question 2: Are there computation patterns that get the most benefit from SPM?
Why Idioms?
– Idiom: pattern of computation/memory access
– Characterize application data movement
– Metric to compare different scientific codes (good coverage)
– Easy to port
[Figure: HPC code]
The Methodology
1. Idiom characterization study: SPM vs. cache favorability of each idiom
2. Find idioms in HPC codes
3. Port SPM-favorable idioms in HPC codes to scratchpad
Tool: PEBIL
Binary instrumentation tool
– Executable binary => identify basic blocks => cache simulation
Cache simulator built on top of PEBIL
– User-defined cache structures
– Profiles executables (hit/miss)
[Figure: executable binary, Stage 1, basic blocks (Block 1, Block 2; sample contents "A op B", "A=b+3"), Stage 2, cache]
PEBIL output:
Block 1 {#hits} {#misses}
Block 2 {#hits} {#misses}
...
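To make the per-block hit/miss output concrete, here is a minimal sketch of a set-associative LRU cache model that attributes each access to a basic block. It is not PEBIL's simulator or its API; the class name, the `access` method, and the two synthetic address streams are assumptions made for illustration. The parameters follow the 64 KB, 8-way, 64-byte-line cache configuration used in the simulation environment.

```python
from collections import OrderedDict, defaultdict


class SetAssociativeCache:
    """LRU set-associative cache model counting hits/misses per block id."""

    def __init__(self, size_bytes, assoc, line_bytes):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (assoc * line_bytes)
        # One OrderedDict per set, ordered from least to most recently used
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.stats = defaultdict(lambda: [0, 0])  # block id -> [hits, misses]

    def access(self, block_id, address):
        line = address // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)       # refresh LRU position
            self.stats[block_id][0] += 1
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)     # evict the least recently used line
        lines[line] = True
        self.stats[block_id][1] += 1
        return False


cache = SetAssociativeCache(size_bytes=64 * 1024, assoc=8, line_bytes=64)

# Hypothetical "Block 1": sequential 8-byte loads over 1 KB
for addr in range(0, 1024, 8):
    cache.access("Block 1", addr)

# Hypothetical "Block 2": 1024 loads with a 4 KB stride, far from Block 1's data
base = 1 << 20
for addr in range(base, base + 1024 * 4096, 4096):
    cache.access("Block 2", addr)

for block, (hits, misses) in cache.stats.items():
    print(block, hits, misses)
```

The sequential block reuses each 64-byte line eight times, while the large-stride block touches a new line on every access, which is the kind of contrast the per-block profile is meant to expose.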
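Simulation Environment

| Configuration | Cache Size (KB) | Cache Assoc. | Cache Line Size (Bytes) | SPM Size (KB) | SPM Assoc. | SPM Line Size (Bytes) |
|---|---|---|---|---|---|---|
| Cache | 64 | 8 | 64 | - | - | - |
| Scratchpad | - | - | - | 64 | Full | 8 |
| Hybrid | 32 | 8 | 64 | 32 | Full | 8 |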
Cache/SPM Only
[Figure: the executable binary is split into basic blocks (Block 1, Block 2) across Stage 1 and Stage 2; all blocks are simulated on one structure, cache or SPM]
Hybrid System
[Figure: the executable binary is split into basic blocks (Block 1, Block 2) across Stage 1 and Stage 2; blocks are divided between the SPM and the cache]
Tool: PIR (find idioms in HPC)
Used for: automatically identifying idioms in large-scale HPC applications
Input: Idioms.txt
– Idioms are defined using a pattern language
Output:
– Idioms matched to source line number
[Figure: loops (Loop1, Loop2) matched to idioms (Transpose, Gather)]
Under the Hood: HPC Results
Fundamental question: Is there a benefit of SPM for HPC codes?
– Simulate full apps on cache and SPM
– Use a simple heuristic to define the mappings
– Simulate on hybrid
Pitfalls:
– Sometimes SPM moves more than cache: LRU
Metrics
$$\text{Data Movement Ratio} = \frac{\text{SPM Data Movement}}{\text{Cache Data Movement}}$$
$$\text{Data Moved} = (\text{Cache Misses}) \times (\text{Cache Line Size})$$
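The two metrics can be worked through in a few lines. The miss counts below are hypothetical, and applying the same misses-times-line-size formula to the SPM (with its 8-byte line) is an assumption; the slide states the formula for the cache only.

```python
def data_moved(misses, line_size_bytes):
    """Bytes moved = number of misses times the line size."""
    return misses * line_size_bytes


def data_movement_ratio(spm_bytes, cache_bytes):
    """A ratio below 1.0 means the SPM moved less data than the cache."""
    return spm_bytes / cache_bytes


# Hypothetical miss counts, for illustration only
cache_bytes = data_moved(misses=1_000, line_size_bytes=64)  # 64-byte cache lines
spm_bytes = data_moved(misses=4_000, line_size_bytes=8)     # 8-byte SPM lines (assumed)
ratio = data_movement_ratio(spm_bytes, cache_bytes)
print(cache_bytes, spm_bytes, ratio)
```

In this made-up case the SPM misses four times as often but still moves half the bytes, because each miss brings in one word instead of a full line.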
HPC Applications
Graph500
– Construct and traverse a weighted undirected graph
HYCOM
– Ocean model: hybrid isopycnal-sigma-pressure, generalized coordinate
SMG2000
– Parallel semi-coarsening multigrid solver
Sequoia Benchmarks
– SPHOT: Monte Carlo photon transport code
– UMT: Unstructured-mesh deterministic radiation transport code
– AMG2006: Algebraic multigrid linear system solver for unstructured meshes
HPC Results
Question 1: Do HPC workloads benefit from software-managed scratchpads? YES!
Idiom: Gather/Scatter
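As a reminder of what the idiom looks like, gather reads through an index array and scatter writes through one. The sketch below also counts how many lines an indexed read touches, to show why irregular indices can move far more data with 64-byte lines than with 8-byte lines. The helper `lines_touched`, the 8-byte element size, and the index values are assumptions for illustration.

```python
def gather(src, idx):
    """Gather: a[i] = b[idx[i]] (indirect read)."""
    return [src[j] for j in idx]


def scatter(dst, idx, values):
    """Scatter: a[idx[i]] = b[i] (indirect write); modifies dst in place."""
    for j, v in zip(idx, values):
        dst[j] = v
    return dst


def lines_touched(indices, elem_bytes=8, line_bytes=64):
    """Number of distinct lines touched when reading elements at the indices."""
    return len({(j * elem_bytes) // line_bytes for j in indices})


b = list(range(10_000))
idx = [0, 100, 17, 5000, 42]        # hypothetical irregular indices
a = gather(b, idx)

bytes_used = len(idx) * 8
bytes_moved_64 = lines_touched(idx, line_bytes=64) * 64
bytes_moved_8 = lines_touched(idx, line_bytes=8) * 8
print(a, bytes_used, bytes_moved_64, bytes_moved_8)
```

Each of the five reads lands in a different 64-byte line, so a cold cache would bring in whole lines to serve single words, while word-sized transfers move only the bytes that are read.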
Using the Methodology for HYCOM
1. Gather idiom: prefers SPM
2. Find gather in HYCOM: 33 instances
3. Port idiom blocks: hybrid structure
– Port gather basic blocks to SPM
– Rest on cache
Result: HYCOM (ocean modeling code) savings of 20% in data motion
Real SPM for PEBIL?
Extension of PEBIL simulator
– Fully associative cache
– Rethink replacement policy
Dynamic allocation scheme
– Idioms determine loops for allocation
– Reuse distance library: track how often data is used and the distance between uses
[Figure: access sequence A, B, C, A; reuse distance = 2]
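Reuse distance here is the number of distinct addresses touched between two consecutive accesses to the same address. A minimal sketch follows; the function name is hypothetical and this is not the reuse distance library mentioned above. It favors clarity over speed and is quadratic in the trace length.

```python
def reuse_distances(trace):
    """For each access, return the number of distinct addresses touched since
    the previous access to the same address (None for a first access)."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


print(reuse_distances(["A", "B", "C", "A"]))  # [None, None, None, 2]
```

For the sequence A, B, C, A, the second access to A has a reuse distance of 2 because B and C were touched in between.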
Results Summary
SPM
– Simpler hardware
– Efficient data movement
Developed methodology for SPM
– Idiom characterization
– Idiom identification in HPC codes
– Port SPM hotspots
– 20% data movement savings for HYCOM
Scratchpad shows potential
– Good when spatial locality fails
– HPC applications, SPM only: average 22% data movement saved
– HPC applications, hybrid: average 39%, max 69% data movement saved
– 4x improvement for gather idiom
– Current work on creating SPM for PEBIL
Acknowledgements
PMaC team
– Laura Carrington
– Ananta Tiwari
– Michael Laurenzano
– Pietro Cicotti
– Mitesh Meswani
Dedicated to: Allan Snavely
EXTRA
Idioms: Strided Access
i=i+stride
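A strided loop advances its index by a fixed step, so only every stride-th element is read. The sketch below shows the loop and estimates what fraction of the bytes brought in by full lines is actually read; the 8-byte element and 64-byte line sizes and the `line_utilization` helper are assumptions for illustration.

```python
def strided_sum(values, stride):
    """Sum every stride-th element, starting at index 0."""
    i, total = 0, 0
    while i < len(values):
        total += values[i]
        i = i + stride
    return total


def line_utilization(n_elems, stride, elem_bytes=8, line_bytes=64):
    """Fraction of the bytes moved in full lines that the strided loop reads."""
    touched = range(0, n_elems, stride)
    lines = {(i * elem_bytes) // line_bytes for i in touched}
    return (len(touched) * elem_bytes) / (len(lines) * line_bytes)


print(strided_sum(list(range(10)), 3))
for stride in (1, 4, 8):
    print(stride, line_utilization(1024, stride))
```

Once the stride reaches a full line (8 elements here), every access pulls in a line of which only one word is used.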
Looking Forward
Idiom-driven allocation
– PIR determines loops for allocation
Pre-allocated array for SPM
– Pointers to loops: trigger replacement
Mimic dynamic compiler replacement policy