PMaC: Performance Modeling and Characterization
Efficient HPC Data Motion via Scratchpad Memory
Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington
Question 1
Do HPC workloads benefit from software-managed scratchpads? YES!
If so, how will we manage it?
Outline
– Motivation
– Scratchpad Background
– Simulation Framework and Methodology
– Initial Study
– Current Direction
Problem: HPC Power Wall
Can't scale old systems
– Power wall already reached by petaflop systems
– Must redesign for power savings
Efficiency must increase by 2x
Source: Exascale Report (Kogge, 2008)
How to Get Energy Savings
1. Redesign hardware
– Simpler hardware
– Transfer complexity to software
2. Minimize expensive data movement
– Memory is slower
– More cores = more contention
– HPC codes have large working set sizes
What is a Scratchpad?
Scratchpad memory (SPM)
– Local memory (like a cache)
– SPM: software-allocated memory
Simpler hardware
[Figure: scratchpad (memory array, decoder) vs. cache (memory array, tag array, decoder)]
Scratchpad Allocation
Dynamic
– Move block of code
– Iterate over code
– Move another block
Static: move block of code once
Strategies
– Knapsack
– Graph coloring (register allocation problem)
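To make the knapsack strategy concrete, here is a minimal sketch of static allocation posed as a 0/1 knapsack: pick the set of blocks that fits in the scratchpad and maximizes total benefit. The block names, sizes, and benefit values (for example, profiled access counts) are hypothetical and do not come from the slides; this illustrates the general technique, not the authors' allocator.

```python
def knapsack_allocate(blocks, capacity):
    """Choose blocks for a scratchpad of `capacity` units (0/1 knapsack).

    blocks: list of (name, size, benefit) tuples; all values are hypothetical.
    Returns (total_benefit, [chosen block names]).
    """
    # best[c] holds the best (benefit, chosen names) using at most c units.
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacities downward so each block is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


# Hypothetical candidates: (name, size in KB, profiled access count).
blocks = [("A", 4, 40), ("B", 3, 50), ("C", 2, 30), ("D", 5, 10)]
total_benefit, chosen = knapsack_allocate(blocks, capacity=8)
```

With these made-up numbers the allocator keeps A and B (7 KB of the 8 KB budget), since adding C would overflow the scratchpad.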
The Idea: Less Data Movement
Scratchpad saves energy
– Allocation burden now on software
– Less complexity in hardware
– Move only what you use
– Uses temporal locality
Cache
– Spatial locality can fail: superfluous data movement
(Spatial locality is built into cache design; note the 8-word line size in most architectures)
[Figure: A, B, C, D, E moved into cache]
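The cost of failed spatial locality can be shown with simple arithmetic. The sketch below assumes 8-byte words, a 64-byte (8-word) cache line, and an 8-byte scratchpad transfer granularity; the sparse access pattern is hypothetical, and each distinct line is assumed to be moved exactly once.

```python
def bytes_moved(addresses, line_size):
    """Bytes transferred if each distinct line touched is moved in once."""
    lines = {addr // line_size for addr in addresses}
    return len(lines) * line_size


# Hypothetical sparse pattern: one 8-byte word used out of every 64 bytes.
addresses = [i * 64 for i in range(1000)]

cache_bytes = bytes_moved(addresses, line_size=64)  # whole 8-word lines
spm_bytes = bytes_moved(addresses, line_size=8)     # only the words used
ratio = spm_bytes / cache_bytes
```

For this pattern the cache moves eight times as many bytes as are used, while the word-granular scratchpad moves only the data that is touched.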
Implications of Scratchpads
Current use: embedded systems
– Smaller working set size
– Predictable code
GPUs
– Coding overhead
Issue: HPC codes
– Large, unpredictable codes
– How to generalize codes?
– How to make it practical and efficient?
Question 2
Are there computation patterns that get the most benefit from SPM?
Why Idioms?
– An idiom is a pattern of computation/memory access
– Characterize application data movement
– Metric to compare different scientific codes (good coverage)
– Easy to port
[Figure: HPC code]
The Methodology
1. Idiom characterization study: which idioms favor SPM vs. cache
2. Find idioms in HPC codes
3. Port SPM-favorable idioms in HPC codes to scratchpad
Tool: PEBIL
Binary instrumentation tool
– Executable binary => identify basic blocks => cache simulation
Cache simulator built on top of PEBIL
– User-defined cache structures
– Profiles executables (hit/miss)
[Figure: executable binary, Stage 1, Stage 2, basic blocks (Block 1, Block 2; example contents: A op B, A = b + 3), cache]
PEBIL output:
Block 1 {#hits} {#misses}
Block 2 {#hits} {#misses}
...
Simulation Environment
Title       Cache Size (KB)  Cache Assoc.  Cache Line Size (Bytes)  SPM Size (KB)  SPM Assoc.  SPM Line Size (Bytes)
Cache       64               8             64                       -              -           -
Scratchpad  -                -             -                        64             Full        8
Hybrid      32               8             64                       32             Full        8
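The configurations in the table can be modeled with a small set-associative simulator. The sketch below is not the PEBIL cache simulator: the class name `LRUMemory`, the LRU replacement policy, and the access trace are assumptions made for illustration. Following the table, the scratchpad is modeled as a fully associative memory with 8-byte lines.

```python
from collections import OrderedDict


class LRUMemory:
    """Set-associative memory with LRU replacement (illustrative only)."""

    def __init__(self, size_kb, line_size, assoc=None):
        n_lines = size_kb * 1024 // line_size
        self.line_size = line_size
        # assoc=None means fully associative: a single set holding all lines.
        self.assoc = n_lines if assoc is None else assoc
        self.n_sets = n_lines // self.assoc
        self.sets = [OrderedDict() for _ in range(self.n_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        entries = self.sets[line % self.n_sets]
        if line in entries:
            entries.move_to_end(line)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(entries) >= self.assoc:
                entries.popitem(last=False)  # evict least recently used
            entries[line] = True

    def bytes_moved(self):
        return self.misses * self.line_size


def run(memory, trace):
    for addr in trace:
        memory.access(addr)
    return memory


# Configurations from the Simulation Environment table.
cache = LRUMemory(size_kb=64, line_size=64, assoc=8)
spm = LRUMemory(size_kb=64, line_size=8)  # fully associative
hybrid_cache = LRUMemory(size_kb=32, line_size=64, assoc=8)
hybrid_spm = LRUMemory(size_kb=32, line_size=8)

# Hypothetical trace: two sequential passes of 8-byte reads over 4 KB.
trace = [i * 8 for i in range(512)] * 2
run(cache, trace)
run(spm, trace)
```

On this dense sequential trace both memories move the same 4096 bytes, which is the case where spatial locality works and the scratchpad has no advantage; the difference appears on sparse or irregular access patterns.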
Cache/SPM only
[Figure: basic blocks (Block 1, Block 2) of the executable binary pass through Stage 1 and Stage 2 to either the cache or the SPM]
Hybrid System
[Figure: basic blocks (Block 1, Block 2) of the executable binary pass through Stage 1 and Stage 2 and are split between the SPM and the cache]
Tool: PIR (Find Idioms in HPC)
Used for: automatically identifying idioms in large-scale HPC applications
Input: Idioms.txt
– Idioms are defined using a pattern language
Output:
– Idioms matched to source line numbers
[Figure: loops (Loop1, Loop2) matched to idioms (Transpose, Gather)]
Under the Hood: HPC Results
Fundamental question: is there a benefit of SPM for HPC codes?
– Simulate full apps on cache and SPM
– Use a simple heuristic to define the mappings
– Simulate on hybrid
Pitfalls:
– Sometimes SPM moves more than cache: LRU
Metrics
Data Movement Ratio = (SPM Data Movement) / (Cache Data Movement)
Data Moved = (Cache Misses) * (Cache Line Size)
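As a worked example of these two metrics, the sketch below computes the data movement ratio from miss counts. The miss counts are hypothetical, and applying the same misses-times-line-size formula to the SPM (with its 8-byte line) is an assumption, since the slide states the formula only for the cache.

```python
def data_moved(misses, line_size):
    """Data Moved = misses * line size, in bytes."""
    return misses * line_size


def data_movement_ratio(spm_misses, spm_line, cache_misses, cache_line):
    """Ratio below 1.0 means the SPM moved less data than the cache."""
    return data_moved(spm_misses, spm_line) / data_moved(cache_misses, cache_line)


# Hypothetical miss counts for one application.
cache_bytes = data_moved(misses=1000, line_size=64)
spm_bytes = data_moved(misses=4000, line_size=8)
ratio = data_movement_ratio(4000, 8, 1000, 64)
```

Here the SPM misses four times as often as the cache, yet moves half the bytes, because each miss transfers 8 bytes instead of 64.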
HPC Applications
Graph500
– Construct and traverse a weighted undirected graph
HYCOM
– Ocean model: hybrid isopycnal-sigma-pressure, generalized coordinate
SMG2000
– Parallel semicoarsening multigrid solver
Sequoia Benchmarks
– SPHOT: Monte Carlo photon transport code
– UMT: unstructured-mesh deterministic radiation transport code
– AMG2006: algebraic multigrid linear system solver for unstructured meshes
HPC Results
Question 1
Do HPC workloads benefit from software-managed scratchpads? YES!
Idiom: Gather/Scatter
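For readers unfamiliar with the idiom, a gather reads through an index array and a scatter writes through one. The sketch below shows the usual textbook form of both loops; the function names are illustrative, and the exact pattern that PIR matches may be defined differently.

```python
def gather(src, idx):
    """Gather: out[i] = src[idx[i]] (indirect reads)."""
    return [src[i] for i in idx]


def scatter(dst, idx, values):
    """Scatter: dst[idx[i]] = values[i] (indirect writes)."""
    for i, v in zip(idx, values):
        dst[i] = v
    return dst


gathered = gather([10, 20, 30, 40], [3, 0, 2])
scattered = scatter([0, 0, 0, 0], [3, 0, 2], [1, 2, 3])
```

Because consecutive iterations touch unrelated addresses, most of each cache line fetched by such a loop can go unused, which is why the idiom is a natural scratchpad candidate.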
Using the Methodology for HYCOM
1. Gather idiom: prefers SPM
2. Find gather in HYCOM: 33 instances
3. Port idiom blocks: hybrid structure
– Port gather basic blocks to SPM
– Rest on cache
Result for HYCOM (ocean modeling code): 20% savings in data motion
Real SPM for PEBIL?
Extension of PEBIL simulator
– Fully associative cache
– Rethink replacement policy
Dynamic allocation scheme
– Idioms determine loops for allocation
– Reuse distance library
– Track how often used
– Track distance of use
[Figure: access sequence A, B, C, A; reuse distance = 2]
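Reuse distance here is taken to mean the number of distinct elements accessed between two consecutive uses of the same element, which matches the A, B, C, A example. The sketch below is a simple quadratic-time illustration of that definition, not the reuse distance library mentioned on the slide.

```python
def reuse_distances(trace):
    """Return the reuse distance of each access, or None on first use."""
    last_seen = {}
    distances = []
    for pos, item in enumerate(trace):
        if item in last_seen:
            # Distinct elements touched since the previous use of `item`.
            between = set(trace[last_seen[item] + 1:pos])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[item] = pos
    return distances


example = reuse_distances(["A", "B", "C", "A"])
```

The final access to A has reuse distance 2 because B and C were touched in between; a small distance suggests data worth keeping resident in the scratchpad.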
Results Summary
SPM
– Simpler hardware
– Efficient data movement
Developed methodology for SPM
– Idiom characterization
– Idiom identification in HPC codes
– Port SPM hotspots
– 20% data movement savings for HYCOM
Scratchpad shows potential
– Good when spatial locality fails
– HPC applications
– SPM only: average 22% data movement saved
– Hybrid: average 39%, max 69% data movement saved
– 4x improvement for gather idiom
– Current work on creating SPM for PEBIL
Acknowledgements
PMaC team
– Laura Carrington
– Ananta Tiwari
– Michael Laurenzano
– Pietro Cicotti
– Mitesh Meswani
Dedicated to: Allan Snavely
EXTRA
Idioms: Strided Access
i = i + stride
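The strided idiom advances the index by a fixed step on every iteration. A minimal sketch, with a hypothetical function name and data:

```python
def strided_sum(data, stride):
    """Sum every `stride`-th element, starting at index 0."""
    total = 0
    i = 0
    while i < len(data):
        total += data[i]
        i = i + stride  # the idiom: index advances by a fixed stride
    return total


result = strided_sum(list(range(10)), stride=3)  # visits indices 0, 3, 6, 9
```

When the stride is large relative to the cache line, each line fetched contributes only one useful element.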
Looking Forward
Idiom-driven allocation
– PIR determines loops for allocation
Pre-allocated array for SPM
– Pointers to loops: trigger replacement
Mimic dynamic compiler
Replacement policy