EXPLORING SOFTWARE SCALABILITY AND TRADE-OFFS IN THE MULTI-CORE ERA WITH
FAST AND ACCURATE MICRO-ARCHITECTURAL SIMULATION
TREVOR E. CARLSON, WIM HEIRMAN, SOURADIP SARKAR, ZHE MA, PIETER GHYSELS,
WIM VANROOSE, LIEVEN EECKHOUT
[email protected] HTTP://WWW.ELIS.UGENT.BE/~TCARLSON
WEDNESDAY, FEBRUARY 15TH, 2012 PP12, SAVANNAH, GA
HPC PERFORMANCE CHALLENGES
Source: Yelick, EXADAPT 2011
Single-core performance is not keeping pace
HPC POWER CHALLENGES
• #1 on TOP500 (K, Japan) consumes 12.7 MW @ 10.5 Petaflops
• Exascale goal: 2018, 20 MW @ 1,000 Petaflops
– ~2x power, 100x performance (beyond Moore’s law)
Source: Yelick, EXADAPT 2011
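As a rough check on the "~2x power, 100x performance" figures, the numbers on this slide can be turned into ratios. The sketch below uses only the values quoted above; the function and variable names are illustrative.

```python
# Figures quoted on the slide: K computer vs. the exascale target.
k_power_mw, k_pflops = 12.7, 10.5
exa_power_mw, exa_pflops = 20.0, 1000.0

def gflops_per_watt(pflops, megawatts):
    # 1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so the scale factors cancel.
    return pflops / megawatts

power_ratio = exa_power_mw / k_power_mw
perf_ratio = exa_pflops / k_pflops
efficiency_ratio = (gflops_per_watt(exa_pflops, exa_power_mw)
                    / gflops_per_watt(k_pflops, k_power_mw))

print(f"power: {power_ratio:.1f}x, performance: {perf_ratio:.0f}x, "
      f"efficiency: {efficiency_ratio:.0f}x")
```

Under these numbers the target implies roughly a 60x improvement in performance per watt.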
HPC SOFTWARE, HARDWARE CHALLENGES
• New programming models
– Pthreads / OpenMP / Cilk++ / MPI / PGAS
• Hardware is becoming increasingly heterogeneous and diverse
– NUMA
– CPU Turbo Mode / DVFS
– Out-of-Order (Xeon) vs. In-Order (Atom/MIC)
– NUCA (future)
• Energy consumption
– For current machines, for future systems
• Reliability at large thread and core counts is a big concern
– Large clusters of CPUs or GPUs will see regular system-wide failures (on the order of months for current systems, down to hours or days for very large systems)
OVERVIEW
• Why use a Simulator?
• About the Sniper Multi-core Simulator
– Interval core model
– Parallel, fast and accurate
• Application Feedback
– CPI Stacks and software scaling
– Software Optimization Case Study
TYPICAL SYSTEM CACHE HIERARCHY
[Figure: cache hierarchy diagram. Each core has private L1I and L1D caches; each L2 sits above two L1I/L1D pairs; each L3 sits above two L2s; four L3 groups connect to DRAM]
WHY IS MY CODE SLOWER THAN EXPECTED?
• Traditional on-line analysis routines do not provide the whole story
– Cache misses are not the whole story
• Performance counters and cache miss rates do not give an accurate picture of performance
• Tools like Valgrind can also report cache hits and misses, but they do not show the impact on runtime
– There is no easy way to understand where the software's lost cycles are going
– VTune can provide some help for specific problems, but does not provide a breakdown for each component
PERFORMANCE QUESTIONS NEED ANSWERS
• Scalability
– More cores vs. more nodes
– Strong vs. weak scaling analysis
• Performance
– How will it perform and scale on next generation hardware?
• Hardware options
– Is it better to have fewer fast cores, or more slower cores?
– Will an in-order core be sufficient and power efficient?
OVERVIEW
• Why use a Simulator?
• About the Sniper Multi-core Simulator
– Interval core model
– Parallel, fast and accurate
• Application Feedback
– CPI Stacks and software scaling
– Software Optimization Case Study
INTERVAL SIMULATION
• Out-of-order core performance model with in-order simulation speed
[Figure: effective dispatch rate over time, split into interval 1, interval 2 and interval 3; miss events marked: I-cache miss, branch misprediction, long-latency load miss]
D. Genbrugge et al., HPCA’10
S. Eyerman et al., ACM TOCS, May 2009
T. Karkhanis and J. E. Smith, ISCA’04, ISCA’07
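The idea in the figure can be sketched to first order: instructions dispatch smoothly at the core's width, and each miss event interrupts that flow and adds a penalty. The code below is a minimal illustration of that accounting, not Sniper's implementation; the function name, event counts and penalties are all hypothetical.

```python
def interval_cycles(instructions, dispatch_width, miss_events):
    """First-order interval estimate: base dispatch time plus miss penalties.

    miss_events maps an event name to (count, penalty_in_cycles).
    Returns a dict of cycles per component.
    """
    stack = {"base": instructions / dispatch_width}
    for name, (count, penalty) in miss_events.items():
        stack[name] = count * penalty
    return stack

# Hypothetical workload: 1M instructions on a 4-wide core.
events = {
    "icache_miss": (1_000, 10),
    "branch_mispredict": (5_000, 15),
    "long_latency_load": (2_000, 200),
}
stack = interval_cycles(1_000_000, 4, events)
total = sum(stack.values())
cpi = total / 1_000_000
print(stack, cpi)
```

Because each component is tracked separately, the same accounting yields the per-component breakdown that a CPI stack reports.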
KEY BENEFITS OF THE INTERVAL MODEL
• Models superscalar OOO execution
• Models impact of ILP
• Models second-order effects: MLP
• Allows for constructing CPI stacks
LONG LATENCY MISS EVENTS: ISOLATED LONG-LATENCY LOAD
S. Eyerman et al., ACM TOCS, May 2009
LONG LATENCY MISS EVENTS: OVERLAPPING LONG-LATENCY LOADS
S. Eyerman et al., ACM TOCS, May 2009
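The difference between isolated and overlapping long-latency loads (memory-level parallelism) can be illustrated with a simplified rule: independent misses that fall within one reorder-buffer window of an earlier miss overlap with it, so the penalty is charged once per group. This is a sketch under that assumption; the function name, ROB size and latency are hypothetical.

```python
def long_latency_penalty(miss_positions, rob_size, miss_latency):
    """Charge one miss latency per group of overlapping long-latency loads.

    miss_positions: instruction indices of loads that miss (assumed independent).
    A miss within rob_size instructions of the first miss in its group
    is assumed to be serviced in parallel with it.
    """
    groups = 0
    window_start = None
    for pos in sorted(miss_positions):
        if window_start is None or pos - window_start >= rob_size:
            groups += 1          # a new, non-overlapped miss
            window_start = pos
    return groups * miss_latency

isolated = long_latency_penalty([100, 1000, 5000], rob_size=128, miss_latency=200)
overlapping = long_latency_penalty([100, 150, 200, 5000], rob_size=128, miss_latency=200)
print(isolated, overlapping)
```

With these inputs, three isolated misses cost three full latencies, while four misses of which three overlap cost only two.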
SNIPER SIMULATION ENVIRONMENT
• User-level, x86-64, parallel (multi-threaded)
• Based on the MIT Graphite Simulator
• Many features
– Interval core model, CPI stacks
– Shared cache models, DVFS
– OpenMP and TBB support, etc.
• Hardware-validated against a 16-core Intel Xeon X7460 Dunnington machine
INTERVAL PROVIDES NEEDED ACCURACY
The interval core model provides consistent accuracy of 25% avg. abs. error, with a minimal slowdown
T. E. Carlson et al., SC11
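For reference, an average absolute error figure of this kind can be computed as below. This sketch assumes the metric is the mean of per-benchmark relative errors against hardware; the runtimes are made-up numbers, not results from the paper.

```python
def average_absolute_error(simulated, measured):
    """Mean of |simulated - measured| / measured over all benchmarks."""
    errors = [abs(s - m) / m for s, m in zip(simulated, measured)]
    return sum(errors) / len(errors)

# Hypothetical runtimes in seconds for three benchmarks.
measured = [10.0, 20.0, 40.0]
simulated = [12.0, 15.0, 50.0]
error = average_absolute_error(simulated, measured)
print(f"{error:.1%}")
```

Taking the absolute value before averaging keeps over- and under-estimates from cancelling out.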
INTERVAL: GOOD OVERALL ACCURACY
Good accuracy for the entire benchmark suite
T. E. Carlson et al., SC11
SIMULATION PERFORMANCE
Sniper currently scales to 2 MIPS
Typical simulators run at 10s-100s KIPS, without scaling
T. E. Carlson et al., SC11
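To put these rates in perspective, the sketch below converts them into wall-clock time for a hypothetical workload of one billion instructions. The workload size is an assumption; the two rates are the 2 MIPS figure and the upper end of the "10s-100s KIPS" range quoted above.

```python
def simulation_seconds(instructions, simulated_ips):
    """Wall-clock time to simulate a given number of instructions."""
    return instructions / simulated_ips

workload = 1_000_000_000                                # hypothetical: 1 billion instructions
sniper_time = simulation_seconds(workload, 2_000_000)   # 2 MIPS
slow_time = simulation_seconds(workload, 100_000)       # 100 KIPS
print(f"{sniper_time:.0f} s vs. {slow_time:.0f} s")
```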
OVERVIEW
• Why use a Simulator?
• About the Sniper Multi-core Simulator
– Interval core model
– Parallel, fast and accurate
• Application Feedback
– CPI Stacks and software scaling
– Software Optimization Case Study
IPC TRACE – TYPICAL SIMULATOR OUTPUT
[Figure: per-thread IPC over time (ferret-large); y-axis: IPC, 0 to 7; x-axis: time]
IPC Traces do not provide insight into the application’s behavior
CYCLE STACKS
• Where did my cycles go?
• CPI stack: cycles per instruction, broken up into components
• Normalize by either
– Number of instructions (CPI stack)
– Execution time (time stack)
• Different from miss rates, as cycle stacks directly quantify the effect on performance
[Figure: example CPI stack with components Base, Branch, I-cache, L2 cache]
Heirman et al., IISWC, Nov 2011
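The two normalizations can be sketched as follows, starting from cycles attributed to each component. The component names follow the figure; the cycle and instruction counts are hypothetical.

```python
def cpi_stack(cycles_by_component, instructions):
    """Normalize by instruction count: each entry is a CPI contribution."""
    return {name: cycles / instructions
            for name, cycles in cycles_by_component.items()}

def time_stack(cycles_by_component):
    """Normalize by total cycles: each entry is a fraction of execution time."""
    total = sum(cycles_by_component.values())
    return {name: cycles / total
            for name, cycles in cycles_by_component.items()}

# Hypothetical cycle attribution for a run of 2M instructions.
cycles = {"base": 500_000, "branch": 100_000, "icache": 50_000, "l2_cache": 350_000}
cpi = cpi_stack(cycles, 2_000_000)
fractions = time_stack(cycles)
print(cpi, fractions)
```

The CPI stack entries sum to the overall CPI, while the time stack entries sum to 1, which makes the latter easier to compare across runs with different instruction counts.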
CYCLE STACKS FOR PARALLEL APPLICATIONS
• Homogeneous application with heterogeneous performance
Heirman et al., IISWC, Nov 2011
USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
• Scale core count: synchronization losses increase to 20%
Carlson et al., SC11, Nov 2011
SUGGEST APPLICATION IMPROVEMENTS
Replace a Pthread mutex with an atomic LOCK INC instruction
ANALYZE SYNCHRONIZATION BEHAVIOR
Thread-state timeline for barnes: synchronization, critical sections, load imbalance
[Figure: thread-state timeline; x-axis: time; y-axis: core (0 to 7); legend: Critical section, Blocked, Working]
CASE STUDY: TILED HEAT TRANSFER
P. Ghysels, 2011
• 5-point stencil, applied to consecutive time steps
• Optimization: tiled to optimize locality, multiple time steps per tile
– but requires redundant computation at tile edges
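One way to realize this trade-off is overlapped tiling: each tile is extended by a halo as wide as the number of time steps, advanced locally, and only the tile's own cells are written back. The sketch below illustrates that scheme on a small grid with fixed border values; it is not the case study's actual code, and the function names, `alpha` and grid contents are assumptions.

```python
def step(grid, alpha=0.1):
    """One Jacobi sweep of the 5-point stencil; border cells stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = grid[i][j] + alpha * (
                grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1] - 4 * grid[i][j])
    return new

def untiled(grid, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        grid = step(grid)
    return grid

def tiled(grid, steps, tile):
    """Advance `steps` time steps tile by tile, using a halo of width `steps`."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Extend the tile by the halo, clamped to the grid.
            lo_r, hi_r = max(r0 - steps, 0), min(r0 + tile + steps, rows)
            lo_c, hi_c = max(c0 - steps, 0), min(c0 + tile + steps, cols)
            local = [row[lo_c:hi_c] for row in grid[lo_r:hi_r]]
            # Halo cells are also computed by neighbouring tiles: the redundant work.
            local = untiled(local, steps)
            for i in range(r0, min(r0 + tile, rows)):
                for j in range(c0, min(c0 + tile, cols)):
                    out[i][j] = local[i - lo_r][j - lo_c]
    return out

grid = [[float((i * 7 + j * 3) % 5) for j in range(10)] for i in range(10)]
reference = untiled(grid, steps=2)
result = tiled(grid, steps=2, tile=4)
```

Stale values at the edge of a local region move inward by one cell per step, so a halo of `steps` cells keeps the tile's own cells correct. Larger tiles reduce the share of redundant halo work but need more cache, which is the trade-off explored in this case study.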
TILE SIZE AND STEPS VS. CACHE BEHAVIOR
tile_heat
NOT JUST TIME, ENERGY AS WELL
Integration with McPAT provides application-specific estimates for power and energy (and EDP, ED2P, …)
Li et al., MICRO’09
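The derived metrics follow directly from average power and runtime: energy is power times delay, EDP multiplies energy by delay once more, and ED2P by delay squared. The sketch below uses hypothetical power and runtime values, not McPAT output.

```python
def energy_metrics(avg_power_watts, runtime_seconds):
    """Energy (J), energy-delay product (J*s) and energy-delay^2 product (J*s^2)."""
    energy = avg_power_watts * runtime_seconds
    return {
        "energy": energy,
        "edp": energy * runtime_seconds,
        "ed2p": energy * runtime_seconds ** 2,
    }

# Hypothetical configurations: a baseline and a faster but more power-hungry variant.
baseline = energy_metrics(100.0, 10.0)
variant = energy_metrics(120.0, 8.0)
print(baseline, variant)
```

In this made-up comparison the variant draws more power but wins on all three metrics, and the gap widens from energy to EDP to ED2P because those metrics weight delay more heavily.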
WHERE IS THE ENERGY GOING?
EDP: Energy Delay Product
ARCHITECTURAL EXPLORATION
Experiment: double the size of L2 and L3 caches
SNIPER SIMULATION ENVIRONMENT
• Source code is publicly available
• Discussion board for Q&A
• Open source (MIT license, interval model with academic license), available at http://snipersim.org
CONCLUSIONS
• Detailed application understanding is needed for complex trade-off analysis
– Raw application performance
– Software algorithm optimization
– Energy/power analysis
• More accurate than instrumentation (no intrusion), with higher visibility
• Sniper provides fast and accurate simulation of multi-core processors
• Faster than most simulators, so Sniper can be used to model the effects of large caches, large input sets and multiple runs
• Allows for architectural exploration (vs. performance counters)
SOFTWARE ANALYSIS AND EXPLORATION USING FAST AND ACCURATE MICRO-ARCHITECTURAL SIMULATION