PMaC Performance Modeling and Characterization Efficient HPC Data Motion via Scratchpad Memory Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington

Post on 12-Jan-2016

Page 1:

PMaC: Performance Modeling and Characterization

Efficient HPC Data Motion via Scratchpad Memory

Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington

Page 2: Question 1

Do HPC workloads benefit from software-managed scratchpads? YES!

If so, how will we manage them?

Page 3: Outline

Motivation

Scratchpad Background

Simulation Framework and Methodology

Initial Study

Current Direction

Page 5: Problem: HPC Power Wall

Can't scale old systems
– Power wall already reached by petaflop systems
– Must redesign for power savings: efficiency must increase by 2x

Source: Exascale Report (Kogge, 2008)

Page 6: How to Get Energy Savings

1. Redesign hardware
– Simpler hardware
– Transfer complexity to software

2. Minimize expensive data movement
– Memory is slower
– More cores = more contention
– HPC codes have large working set sizes

Page 8: What is a Scratchpad?

Scratchpad memory (SPM)
– Local memory (like a cache)
– SPM: software-allocated memory

Simpler hardware

[Figure: scratchpad (memory array, decoder) vs. cache (memory array, decoder, tagging array)]

Page 9: Scratchpad Allocation

Dynamic
– Move block of code
– Iterate over code
– Move another block

Static: move block of code once

Strategies
– Knapsack
– Graph coloring (as in the register allocation problem)
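The knapsack strategy for static allocation can be sketched as below. This is a minimal illustration, not the allocator used in this study: the block names, sizes (in KB), and benefit values (for example, profiled access counts) are hypothetical, and only the 64 KB capacity echoes the scratchpad size used later in the talk.

```python
def knapsack_allocate(blocks, capacity):
    """0/1 knapsack: pick blocks that maximize total benefit within capacity.

    blocks: list of (name, size, benefit); size and capacity share one unit.
    Returns (best_benefit, sorted list of chosen block names).
    """
    # best[c] = (benefit, chosen names) achievable with capacity c
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacity downward so each block is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity][0], sorted(best[capacity][1])


# Hypothetical blocks: (name, size in KB, benefit)
blocks = [("A", 32, 900), ("B", 24, 500), ("C", 16, 450), ("D", 40, 1000)]
benefit, chosen = knapsack_allocate(blocks, capacity=64)
print(benefit, chosen)
```

With these made-up numbers the best static placement is blocks B and D, which exactly fill the 64 KB scratchpad.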

Page 10: The Idea: Less Data Movement

Scratchpad saves energy
– Allocation burden now on software; less complexity in hardware

Move only what you use
– Uses temporal locality

Cache
– Spatial locality can fail: superfluous data movement

(Spatial locality is built into cache design; note the 8-word line size in most architectures)

[Figure: words A, B, C, D, E; label: Moved into Cache]
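A rough way to see the superfluous movement is to count bytes fetched at line granularity versus word granularity. The sketch below assumes 8-byte words, 64-byte (8-word) lines, unlimited capacity (first-touch misses only), and a hypothetical access pattern that uses one word per line; none of these numbers are measurements from the study.

```python
def bytes_moved(addresses, granule):
    """Bytes fetched when each first touch brings in one aligned granule."""
    touched = {addr // granule for addr in addresses}
    return len(touched) * granule


WORD = 8   # bytes per word (assumed)
LINE = 64  # bytes per cache line: 8 words

# Hypothetical pattern: read one word out of every cache line.
addresses = [i * LINE for i in range(1000)]

cache_bytes = bytes_moved(addresses, LINE)  # whole lines are moved
spm_bytes = bytes_moved(addresses, WORD)    # only the used words are moved
print(cache_bytes, spm_bytes)
```

In this worst case the cache moves eight times as many bytes as a word-granular scratchpad, because seven of every eight words fetched are never used.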

Page 11: Implications of Scratchpads

Current use: embedded systems
– Smaller working set size
– Predictable code

GPUs
– Coding overhead

Issue: HPC codes
– Large, unpredictable codes
– How to generalize codes?
– How to make it practical and efficient?

Page 13: Question 2

Are there computation patterns that benefit most from SPM?

Page 14: Why Idioms?

Pattern of computation/memory access

Characterize application data movement

Metric to compare different scientific codes (good coverage)

Easy to port

[Figure: HPC code]

Page 15: The Methodology

1. Idiom characterization study: SPM vs. cache favorability of each idiom

2. Find idioms in HPC codes

3. Port SPM-favorable idioms in HPC codes to scratchpad

Page 16: Tool: PEBIL

Binary instrumentation tool
– Executable binary => identify basic blocks => cache simulation

Cache simulator built on top of PEBIL
– User-defined cache structures
– Profiles executables (hit/miss)

[Figure: executable binary (e.g. "A op B", "A=b+3") -> Stage 1 -> Block1, Block2 -> Stage 2 -> cache]

PEBIL output
Block 1 {#hits} {#misses}
Block 2 {#hits} {#misses}
…
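The per-block hit/miss profile can be illustrated with a small stand-alone simulator. This is only a sketch of the idea and does not use or reproduce PEBIL's actual interface: the class name, the trace format of (block, address) pairs, and the addresses are all hypothetical. The cache geometry (64 KB, 8-way, 64-byte lines, LRU) is an assumed example.

```python
from collections import OrderedDict, defaultdict


class LRUCache:
    """Set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, size_bytes, assoc, line_bytes):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (assoc * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        line = addr // self.line_bytes
        ways = self.sets[line % self.num_sets]
        if line in ways:
            ways.move_to_end(line)  # mark as most recently used
            return True
        if len(ways) >= self.assoc:
            ways.popitem(last=False)  # evict the least recently used line
        ways[line] = True
        return False


def profile(trace, cache):
    """trace: iterable of (block_id, address). Returns {block_id: [hits, misses]}."""
    stats = defaultdict(lambda: [0, 0])
    for block, addr in trace:
        stats[block][0 if cache.access(addr) else 1] += 1
    return dict(stats)


cache = LRUCache(size_bytes=64 * 1024, assoc=8, line_bytes=64)
# Block1 reads consecutive 8-byte words; Block2 reads one word per line.
trace = [("Block1", a) for a in range(0, 1024, 8)]
trace += [("Block2", (1 << 20) + i * 64) for i in range(100)]
stats = profile(trace, cache)
for block, (hits, misses) in stats.items():
    print(block, hits, misses)
```

The sequential block hits on seven of every eight accesses, while the one-word-per-line block misses every time, which is the kind of per-block contrast the profile is meant to expose.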

Page 17: Simulation Environment

Configuration  Cache Size (KB)  Cache Assoc.  Cache Line Size (Bytes)  SPM Size (KB)  SPM Assoc.  SPM Line Size (Bytes)
Cache          64               8             64                       -              -           -
Scratchpad     -                -             -                        64             Full        8
Hybrid         32               8             64                       32             Full        8

Page 18: Cache/SPM only

[Figure: Stage 1 splits the executable binary into Block1 and Block2; Stage 2 sends all blocks to a single memory, either the cache or the SPM]

Page 19: Hybrid System

[Figure: Stage 1 splits the executable binary into Block1 and Block2; Stage 2 maps each block to either the SPM or the cache]

Page 20: Tool: PIR (find idioms in HPC codes)

Used for: automatically identifying idioms in large-scale HPC applications

Input: Idioms.txt
– Idioms are defined using a pattern language

Output:
– Idioms matched to source line numbers

[Figure: Loop1 and Loop2 labeled with matched idioms (Transpose, Gather)]

Page 22: Under the Hood: HPC Results

Fundamental question: Is there a benefit of SPM for HPC codes?
– Simulate full apps on cache and SPM
– Use a simple heuristic to define the mappings
– Simulate on hybrid

Pitfalls:
– Sometimes SPM moves more than cache: LRU

Page 23: Metrics

Data Movement Ratio = (SPM Data Movement) / (Cache Data Movement)

Data Moved = (Cache Misses) * (Cache Line Size)
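A worked instance of the two metrics, as a sketch. The miss counts are invented for illustration, and applying the Data Moved formula to the SPM with its own 8-byte line size is an assumption; the slide states the formula only for the cache.

```python
def data_moved(misses, line_size_bytes):
    """Data Moved = misses * line size, in bytes."""
    return misses * line_size_bytes


def data_movement_ratio(spm_misses, spm_line, cache_misses, cache_line):
    """Ratio below 1.0 means the SPM moved less data than the cache."""
    return data_moved(spm_misses, spm_line) / data_moved(cache_misses, cache_line)


# Hypothetical counts: 5000 SPM misses at 8 B vs. 1000 cache misses at 64 B.
ratio = data_movement_ratio(spm_misses=5000, spm_line=8,
                            cache_misses=1000, cache_line=64)
print(ratio)
```

Here the SPM misses five times as often but still moves less data (40,000 bytes against 64,000), so the ratio is below one.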

Page 24: HPC Applications

Graph500
– Construct and traverse a weighted undirected graph

HYCOM
– Ocean model with hybrid isopycnal-sigma-pressure (generalized) coordinates

SMG2000
– Parallel semi-coarsening multigrid solver

Sequoia Benchmarks
– SPHOT: Monte Carlo photon transport code
– UMT: unstructured-mesh deterministic radiation transport code
– AMG2006: algebraic multigrid linear system solver for unstructured meshes

Page 25: HPC Results

Page 26: Question 1

Do HPC workloads benefit from software-managed scratchpads? YES!

Page 27: Idiom: Gather/Scatter
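For readers unfamiliar with the idiom, gather and scatter are the indexed-read and indexed-write loop patterns. The sketch below shows the general pattern only; how PIR's pattern language defines these idioms is not shown in the slides, and the function names and data are illustrative.

```python
def gather(src, idx):
    """dst[i] = src[idx[i]]: indexed (irregular) reads, contiguous writes."""
    return [src[j] for j in idx]


def scatter(values, idx, size, fill=0):
    """dst[idx[i]] = values[i]: contiguous reads, indexed (irregular) writes."""
    dst = [fill] * size
    for v, j in zip(values, idx):
        dst[j] = v
    return dst


src = [10, 20, 30, 40, 50]
idx = [4, 0, 2]
gathered = gather(src, idx)
scattered = scatter(gathered, idx, size=len(src))
print(gathered, scattered)
```

Because the index array decides which elements are touched, neighbouring words in a fetched cache line are often unused, which is why this pattern is a natural candidate for word-granular placement.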

Page 28: Using Methodology for HYCOM

1. Gather idiom: prefers SPM

2. Find gather in HYCOM: 33 instances

3. Port idiom blocks: hybrid structure
– Port gather basic blocks to SPM
– Rest on cache

Result: HYCOM (ocean modeling code) savings: 20% in data motion

Page 30: Real SPM for PEBIL?

Extension of PEBIL simulator
– Fully associative cache; rethink replacement policy

Dynamic allocation scheme
– Idioms determine loops for allocation
– Reuse distance library: track how often data is used and the distance between uses

[Figure: access sequence A, B, C, A; Reuse Distance = 2]
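Reuse distance can be computed as the number of distinct other addresses touched between two uses of the same address. The sketch below is a naive quadratic version for illustration; it is not the reuse distance library mentioned on the slide, and the A, B, C, A trace order is inferred from the figure.

```python
def reuse_distances(trace):
    """Per access: distinct addresses seen since the previous use, or None on first use."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses strictly between the two uses of addr.
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


distances = reuse_distances(["A", "B", "C", "A"])
print(distances)
```

The second access to A has reuse distance 2 because B and C were touched in between, matching the figure.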

Page 31: Results Summary

SPM
– Simpler hardware
– Efficient data movement

Developed methodology for SPM
– Idiom characterization
– Idiom identification in HPC codes
– Port SPM hotspots
– 20% data movement savings for HYCOM

Scratchpad shows potential
– Good when spatial locality fails
– HPC applications
  – SPM only: average 22% data movement saved
  – Hybrid: average 39%, max 69% data movement saved
– 4x improvement for gather idiom
– Current work on creating SPM for PEBIL

Page 32: Acknowledgements

PMaC team
– Laura Carrington
– Ananta Tiwari
– Michael Laurenzano
– Pietro Cicotti
– Mitesh Meswani

Dedicated to: Allan Snavely

Page 33: EXTRA

Page 34: Idioms: Strided Access

i=i+stride
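The strided idiom is a loop whose index advances by a fixed stride, as in the update above. A minimal sketch follows; the function name and data are illustrative and not taken from the slides.

```python
def strided_sum(data, stride):
    """Sum every stride-th element of data, starting at index 0."""
    total = 0
    i = 0
    while i < len(data):
        total += data[i]
        i = i + stride  # the update that defines the strided idiom
    return total


result = strided_sum(list(range(10)), stride=3)
print(result)
```

With a stride of 3 over the values 0 through 9, the loop touches indices 0, 3, 6, and 9.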

Page 35: Looking Forward

Idiom-driven allocation
– PIR determines loops for allocation

Pre-allocated array for SPM
– Pointers to loops: trigger replacement

Mimic Dynamic Compiler Replacement Policy