
EXPLORING SOFTWARE SCALABILITY AND TRADE-OFFS IN THE MULTI-CORE ERA WITH FAST AND ACCURATE MICRO-ARCHITECTURAL SIMULATION

TREVOR E. CARLSON, WIM HEIRMAN, SOURADIP SARKAR, ZHE MA, PIETER GHYSELS, WIM VANROOSE, LIEVEN EECKHOUT

[email protected] HTTP://WWW.ELIS.UGENT.BE/~TCARLSON

WEDNESDAY, FEBRUARY 15TH, 2012 PP12, SAVANNAH, GA

HPC PERFORMANCE CHALLENGES

Source: Yelick, EXADAPT 2011

Single-core performance is not keeping pace

HPC POWER CHALLENGES

• #1 on TOP500 (K computer, Japan) consumes 12.7 MW @ 10.5 petaflops

• Exascale goal: 2018, 20 MW @ 1,000 petaflops

– ~2x power, 100x performance (beyond Moore’s law)

Source: Yelick, EXADAPT 2011
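The gap between these two data points can be worked out directly. The sketch below uses only the four figures quoted on the slide; the helper name `gflops_per_watt` is introduced here for illustration.

```python
# Figures quoted on the slide (power in watts, performance in FLOP/s).
K_POWER_W = 12.7e6
K_PERF_FLOPS = 10.5e15
EXA_POWER_W = 20e6
EXA_PERF_FLOPS = 1000e15


def gflops_per_watt(perf_flops, power_w):
    """Energy efficiency in GFLOP/s per watt (illustrative helper)."""
    return perf_flops / power_w / 1e9


k_eff = gflops_per_watt(K_PERF_FLOPS, K_POWER_W)
exa_eff = gflops_per_watt(EXA_PERF_FLOPS, EXA_POWER_W)
power_ratio = EXA_POWER_W / K_POWER_W
perf_ratio = EXA_PERF_FLOPS / K_PERF_FLOPS
efficiency_gain = exa_eff / k_eff

print(f"K computer:       {k_eff:.2f} GFLOPS/W")
print(f"Exascale goal:    {exa_eff:.2f} GFLOPS/W")
print(f"Power ratio:      {power_ratio:.2f}x")
print(f"Performance gain: {perf_ratio:.1f}x")
print(f"Efficiency gain:  {efficiency_gain:.1f}x")
```

With these numbers the exascale target needs roughly a 60x improvement in performance per watt, which is the quantity behind the "~2x power, 100x performance" summary.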

HPC SOFTWARE, HARDWARE CHALLENGES

• New programming models

– Pthreads / OpenMP / Cilk++ / MPI / PGAS

• Hardware is becoming increasingly heterogeneous and diverse

– NUMA

– CPU Turbo Mode / DVFS

– Out-of-Order (Xeon) vs. In-Order (Atom/MIC)

– NUCA (future)

• Energy consumption

– For current machines, for future systems

• Reliability at large thread and core counts is a big concern

– Large clusters of CPUs or GPUs will see regular system-wide failures (on the order of months for current systems, down to hours or days for very large systems)

OVERVIEW

• Why use a Simulator?

• About the Sniper Multi-core Simulator

– Interval core model

– Parallel, fast and accurate

• Application Feedback

– CPI Stacks and software scaling

– Software Optimization Case Study


TYPICAL SYSTEM CACHE HIERARCHY

[Figure: cache hierarchy diagram. Each core has private L1I and L1D caches; each L2 sits above two L1I/L1D pairs; each of the four L3 caches sits above two L2s; all L3s connect to DRAM.]

WHY IS MY CODE SLOWER THAN EXPECTED?

• Traditional on-line analysis routines do not provide the whole story

– Cache misses are not the whole story

• Performance counters and cache miss rates do not give an accurate picture of performance

• Tools like Valgrind can also report cache hits and misses, but they do not show the impact on runtime

– There is no easy way to understand where the software’s lost cycles are going

– VTune can help with specific problems, but does not provide a breakdown for each component

PERFORMANCE QUESTIONS NEED ANSWERS

• Scalability

– More cores vs. more nodes

– Strong vs. weak scaling analysis

• Performance

– How will it perform and scale on next-generation hardware?

• Hardware options

– Is it better to have fewer fast cores, or more slower cores?

– Will an in-order core be sufficient and power efficient?
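To make the strong vs. weak scaling distinction concrete, here is a minimal sketch using the textbook Amdahl and Gustafson formulas. These are standard analytical models, not something the talk presents, and the 5% serial fraction is an assumed value chosen only for illustration.

```python
def amdahl_speedup(serial_fraction, cores):
    """Strong scaling: fixed problem size, more cores (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction, cores):
    """Weak scaling: problem size grows with the core count (Gustafson's law)."""
    return serial_fraction + (1.0 - serial_fraction) * cores


SERIAL_FRACTION = 0.05  # assumed value, for illustration only

for cores in (1, 2, 4, 8, 16):
    strong = amdahl_speedup(SERIAL_FRACTION, cores)
    weak = gustafson_speedup(SERIAL_FRACTION, cores)
    print(f"{cores:2d} cores: strong {strong:5.2f}x, weak (scaled) {weak:5.2f}x")
```

Formulas like these give only an upper bound from a single parameter; they say nothing about cache, DRAM, or synchronization effects, which is where simulation comes in.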


INTERVAL SIMULATION

• Out-of-order core performance model with in-order simulation speed

[Figure: effective dispatch rate over time. Miss events (I-cache miss, branch misprediction, long-latency load miss) delimit interval 1, interval 2 and interval 3.]

D. Genbrugge et al., HPCA’10

S. Eyerman et al., ACM TOCS, May 2009

T. Karkhanis and J. E. Smith, ISCA’04, ISCA’07

KEY BENEFITS OF THE INTERVAL MODEL

• Models superscalar OOO execution

• Models impact of ILP

• Models second-order effects: MLP

• Allows for constructing CPI stacks

LONG-LATENCY MISS EVENTS: ISOLATED LONG-LATENCY LOAD

S. Eyerman et al., ACM TOCS, May 2009

LONG-LATENCY MISS EVENTS: OVERLAPPING LONG-LATENCY LOADS

S. Eyerman et al., ACM TOCS, May 2009
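The difference between isolated and overlapping long-latency loads can be illustrated with a toy version of the interval idea: execution proceeds at the dispatch width until a miss event adds a penalty, and load misses that fall within one reorder-buffer window of an earlier miss are hidden behind it. This is a deliberately simplified sketch, not Sniper's implementation or the cited model; `MissEvent`, `interval_cycles`, and all the numbers below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MissEvent:
    index: int    # dynamic instruction number at which the event occurs
    kind: str     # "icache", "branch" or "load"
    penalty: int  # cycles lost if the event is not hidden


def interval_cycles(num_instructions, dispatch_width, events, rob_size):
    """Toy interval model: base dispatch time plus non-overlapped penalties."""
    cycles = num_instructions / dispatch_width  # steady-state dispatch
    window_start = None  # index of the load miss currently blocking the ROB
    for ev in sorted(events, key=lambda e: e.index):
        if ev.kind == "load":
            if window_start is not None and ev.index - window_start < rob_size:
                continue  # overlaps the earlier miss (MLP): no extra penalty
            window_start = ev.index
        cycles += ev.penalty
    return cycles


isolated = [MissEvent(100, "load", 200), MissEvent(600, "load", 200)]
overlapping = [MissEvent(100, "load", 200), MissEvent(150, "load", 200)]

print(interval_cycles(1000, 4, isolated, rob_size=128))     # two full penalties
print(interval_cycles(1000, 4, overlapping, rob_size=128))  # second miss hidden
```

Two isolated misses each pay their full penalty, while two misses within the same window cost about as much as one, which is why miss counts alone do not predict the lost cycles.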


SNIPER SIMULATION ENVIRONMENT

• User-level, x86-64, parallel (multi-threaded)

• Based on the MIT Graphite Simulator

• Many features

– Interval core model, CPI stacks

– Shared cache models, DVFS

– OpenMP and TBB support, etc.

• Hardware-validated against a 16-core Intel Xeon X7460 Dunnington machine

INTERVAL PROVIDES NEEDED ACCURACY

The interval core model provides consistent accuracy (25% average absolute error) with minimal slowdown

T. E. Carlson et al., SC11

INTERVAL: GOOD OVERALL ACCURACY


Good accuracy for the entire benchmark suite


SIMULATION PERFORMANCE


Sniper currently scales to 2 MIPS

Typical simulators run at tens to hundreds of KIPS and do not scale
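To put these rates in perspective, the sketch below converts simulation speed into wall-clock time. The workload size of 100 billion instructions is an assumed figure for illustration, and 100 KIPS is taken as a point within the quoted range for typical simulators.

```python
def simulation_hours(instructions, sim_ips):
    """Wall-clock hours needed to simulate `instructions` at `sim_ips` instr/s."""
    return instructions / sim_ips / 3600.0


WORKLOAD_INSTRUCTIONS = 100e9  # assumed workload size, for illustration only

sniper_hours = simulation_hours(WORKLOAD_INSTRUCTIONS, 2e6)     # 2 MIPS
typical_hours = simulation_hours(WORKLOAD_INSTRUCTIONS, 100e3)  # 100 KIPS

print(f"At 2 MIPS:   {sniper_hours:.1f} hours")
print(f"At 100 KIPS: {typical_hours:.1f} hours")
```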



IPC TRACE – TYPICAL SIMULATOR OUTPUT

[Figure: per-thread IPC over time for ferret-large; y-axis ticks from 0 to 7.]

IPC traces do not provide insight into the application’s behavior

CYCLE STACKS

• Where did my cycles go?

• CPI stack: cycles per instruction, broken up into components

• Normalize by either

– Number of instructions (CPI stack)

– Execution time (time stack)

• Different from miss rates, as cycle stacks directly quantify the effect on performance

[Figure: example CPI stack with components Base, Branch, I-cache, L2 cache.]

Heirman et al., IISWC, Nov 2011
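The two normalizations can be sketched as follows. The component names follow the slide's figure labels, but the cycle counts are invented for illustration and the function names are not part of any Sniper interface.

```python
def cpi_stack(cycles_by_component, instructions):
    """Normalize per-component cycle counts by the instruction count."""
    return {name: cycles / instructions
            for name, cycles in cycles_by_component.items()}


def time_stack(cycles_by_component):
    """Normalize per-component cycle counts by the total execution time."""
    total = sum(cycles_by_component.values())
    return {name: cycles / total
            for name, cycles in cycles_by_component.items()}


# Hypothetical cycle counts for one thread (not measured data).
cycles = {"base": 500_000, "branch": 100_000,
          "i-cache": 50_000, "l2-cache": 350_000}
instructions = 1_000_000

cpi = cpi_stack(cycles, instructions)
fractions = time_stack(cycles)

for name in cycles:
    print(f"{name:9s} CPI {cpi[name]:.2f}  ({fractions[name]:.0%} of time)")
print(f"total     CPI {sum(cpi.values()):.2f}")
```

The components add up to the total CPI, so each entry reads directly as cycles lost per instruction rather than as an event rate.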

CYCLE STACKS FOR PARALLEL APPLICATIONS

• Homogeneous application with heterogeneous performance

Heirman et al., IISWC, Nov 2011

USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR

• Scale input: application becomes DRAM bound

• Scale core count: synchronization losses increase to 20%

Carlson et al., SC11, Nov 2011

SUGGEST APPLICATION IMPROVEMENTS

Replace a Pthread mutex with a LOCK INC instruction

ANALYZE SYNCHRONIZATION BEHAVIOR

Thread-state timeline for barnes: synchronization, critical sections, load imbalance

[Figure: thread state per core (cores 0 to 7) over time. Legend: Critical section, Blocked, Working.]

CASE STUDY: TILED HEAT TRANSFER

P. Ghysels, 2011

• 5-point stencil, applied to consecutive time steps

• Optimization: tiled to optimize locality, with multiple time steps per tile, but this requires redundant computation at tile edges
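The trade-off can be seen in a small pure-Python sketch of overlapped tiling for a 5-point stencil. It is not P. Ghysels's code: the update rule, the coefficient `ALPHA`, the fixed-boundary condition, and the function names are all assumptions made for illustration. Each tile is copied together with a halo as wide as the number of time steps, advanced locally, and only its interior is written back; the halo work is the redundant computation.

```python
ALPHA = 0.1  # assumed diffusion coefficient (illustrative value)


def step(grid):
    """One 5-point stencil update; cells on the edge of `grid` stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = grid[i][j] + ALPHA * (
                grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1] - 4.0 * grid[i][j])
    return new


def run_untiled(grid, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        grid = step(grid)
    return grid


def run_tiled(grid, steps, tile):
    """Advance each tile by `steps` time steps before moving to the next."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            i1, j1 = min(i0 + tile, rows), min(j0 + tile, cols)
            # Halo of `steps` cells per side, clamped to the domain. Halo
            # cells are recomputed by neighbouring tiles (redundant work).
            r0, r1 = max(i0 - steps, 0), min(i1 + steps, rows)
            c0, c1 = max(j0 - steps, 0), min(j1 + steps, cols)
            local = run_untiled([row[c0:c1] for row in grid[r0:r1]], steps)
            # Only the tile interior is valid after `steps` updates.
            for i in range(i0, i1):
                out[i][j0:j1] = local[i - r0][j0 - c0:j1 - c0]
    return out


grid = [[float((7 * i + 13 * j) % 10) for j in range(12)] for i in range(12)]
reference = run_untiled(grid, steps=3)
tiled = run_tiled(grid, steps=3, tile=4)
print(all(abs(a - b) < 1e-12
          for ra, rb in zip(reference, tiled) for a, b in zip(ra, rb)))
```

Larger tiles reduce the relative cost of the halo but are less likely to fit in cache, and more time steps per tile widen the halo; this sketch only checks correctness and does not measure either effect.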


TILE SIZE AND STEPS VS. CACHE BEHAVIOR

tile_heat


NOT JUST TIME, ENERGY AS WELL

Integration with McPAT provides application-specific estimates for power and energy (and EDP, ED2P, …)

Li et al., MICRO’09

WHERE IS THE ENERGY GOING?

EDP: Energy Delay Product
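As a reminder of how these metrics combine, the sketch below computes energy, EDP and ED2P (energy times delay squared) for two hypothetical configurations. The power and runtime values are invented for illustration and do not come from McPAT or from the talk.

```python
def energy_joules(power_w, time_s):
    """Energy = average power x execution time."""
    return power_w * time_s


def edp(power_w, time_s):
    """Energy-delay product: energy x time."""
    return energy_joules(power_w, time_s) * time_s


def ed2p(power_w, time_s):
    """Energy-delay-squared product: energy x time^2."""
    return energy_joules(power_w, time_s) * time_s ** 2


# Hypothetical configurations: (average power in W, runtime in s).
configs = {"fast core": (100.0, 10.0), "slow core": (40.0, 20.0)}

for name, (power_w, time_s) in configs.items():
    print(f"{name}: E = {energy_joules(power_w, time_s):.0f} J, "
          f"EDP = {edp(power_w, time_s):.0f} J*s, "
          f"ED2P = {ed2p(power_w, time_s):.0f} J*s^2")
```

In this made-up example the slow core uses less energy but has the worse EDP and ED2P, because both metrics penalize the longer runtime.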


ARCHITECTURAL EXPLORATION

Experiment: double the size of L2 and L3 caches


SNIPER SIMULATION ENVIRONMENT

• Source code is publicly available

• Discussion board for Q&A

• Open source (MIT license; interval model under an academic license), available at http://snipersim.org

CONCLUSIONS

• Detailed application understanding is needed for complex trade-off analysis

– Raw application performance

– Software algorithm optimization

– Energy/power analysis

• More accurate than instrumentation (no intrusion), higher visibility

• Simulation with Sniper is a fast and accurate way to model multi-core processors

• Faster than most simulators, so Sniper can be used to model the effects of large caches, large input sets, multiple runs

• Allows for architectural exploration (vs. performance counters)


SOFTWARE ANALYSIS AND EXPLORATION USING FAST AND ACCURATE MICRO-ARCHITECTURAL SIMULATION

