
Detail at scale in performance analysis

Jesus Labarta

Director, Computer Sciences Dept.

BSC

Jesus Labarta, Detail@scale, EuroMPI, September 2010 2

Outline

• On the title

• Performance analysis

• Scale

• Detail

• Some examples

• Visualizing variability

• Relevant information

• Instrumentation and sampling


Performance analysis tools objective

Who can I blame?

Generate nice color plots


Performance analysis tools objective

Fly with instruments

Understand our systems

How is my application performing?

Can I describe it in a simple way? Quantitatively?

Is there anything I can do to improve its performance?

What? Preferably with minimum effort/cost


Scale and Detail: typical perception

• Scalability: It is all about size

• Space: #cores

• Time

• Detail: Granularity / #metrics

• Routine, loop, line

• Metrics: time, message sizes, hardware counters, …

• Size × Detail is unmanageable: a scalability problem!

• Drop detail

• Main practices:

• Data handling mechanisms (i.e. parallelize the tool)

• Profiles, aggregates,…


Performance analysis tools objective

Fly with instruments

Understand our systems

How is my application performing?

Can I describe it in a simple way? Quantitatively?

Is there anything I can do to improve its performance?

What? Preferably with minimum effort/cost

Information, not data


This talk

• Scalability is more an issue of dynamic range than absolute size

• Details ARE important

• To understand

• variability in space and time

• Microscopic causes of macroscopic effects

• We need to be able to handle/measure/analyze different levels of detail

• Some example techniques


Scalability

• Scalability is more an issue of dynamic range than absolute size

• It is more a matter of intelligence (data processing) than force (data handling)

• First decide what functionality is useful, then how far it can go in size

• Many performance issues do give signs at small sizes (others appear suddenly at a given size)

10^6


CEPBA – tools framework

[Figure: CEPBA-tools framework diagram. Labels: Paraver; PeekPerf; Data Display Tools; .prv + .pcf; .trf; Machine description; Time Analysis, filters; .cfg; Stats Gen; .prv; Valgrind; Dyninst, PAPI; Instr. Level Simulators; how2gen.xml; .viz; .txt; .cube; .xls; MRNET; XML control; Extrae; DIMEMAS; VENUS (IBM-ZRL); Trace handling & display; Simulators]

Open Source (Linux and Windows)

http://www.bsc.es/paraver


The butterfly effect

• Sensitivity to initial conditions

• Huge impacts of small causes

• Strong nonlinearities with cumulative effects

“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”

Common in computer systems behavior

Interconnects … a valley of butterflies

[Figure: timelines for 64 nodes, G=8, 4 MB and for 512 nodes, 4 MB. Annotations: external contention; internal contention; propagation of internal contention; bubble propagation]

• Dependence on application phase (communication pattern)

• All2all - 32

• 1 μs delay in arrival → 1.5 ms longer call duration

• Protocol/data message interaction in the adapter


Examples

• Analyzing variability

• Histograms

• Scatter plots

• hardware counts: all in one

• Can be done at scale: Selective data emission

• Communication, Load balance, micro load imbalance, OS noise

• Sampling + instrumentation


Visualizing variability


Visualizing variability: Histograms

• Variability is out there, often more than we are aware of (e.g. load balance)

• Histograms of any metric

[Figure: SPECFEM3D histograms of useful duration, instructions, IPC and L2 miss ratio. Courtesy Dimitri Komatitsch]
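As a minimal sketch of what such a histogram computes, the snippet below bins one metric across processes. The function names, the bin count and the sample durations are illustrative assumptions and are not taken from Paraver. The average/maximum ratio is included only to show how much a single summary number hides compared with the full distribution.

```python
def histogram(values, n_bins, lo=None, hi=None):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        # Clamp so that v == hi (or an out-of-range value) lands in an edge bin.
        counts[max(0, min(idx, n_bins - 1))] += 1
    return counts


def load_balance(values):
    """Average over maximum: 1.0 means perfectly balanced."""
    return sum(values) / len(values) / max(values)


# Hypothetical useful durations in ms, one value per process.
useful_ms = [10.0, 10.5, 11.0, 19.5, 20.0]
print(histogram(useful_ms, n_bins=2))  # how many processes fall in each bin
print(load_balance(useful_ms))         # one number, far less informative
```

The same function applies to any per-process or per-burst metric (instructions, IPC, miss ratio) once it has been extracted from the trace.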


Visualizing variability: Histograms

• Six months later …

[Figure: histograms of useful duration, instructions, IPC and L2 miss ratio]


Visualizing variability: scatter plots

• Burst = continuous computation region

• e.g. between the exit of one MPI call and the entry to the next, an instrumented routine, …

• Scatter plot on some relevant metrics

• Instructions: idea of computational complexity, computational load imbalance, …

• IPC: idea of absolute performance and performance imbalance

• Automatically identify clusters

[Figure: scatter plots for WRF, GROMACS and SPECFEM3D]
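The burst definition above can be sketched as follows. The record layout (`MpiCall`, `Burst`) and the counter values are hypothetical; the sketch assumes cumulative instruction and cycle counters are read at the entry and exit of every instrumented MPI call. Clustering is left out: the resulting points would be handed to whatever clustering algorithm is in use.

```python
from collections import namedtuple

# Hypothetical record for one instrumented MPI call. The counter fields hold
# cumulative hardware-counter readings taken at call entry and exit.
MpiCall = namedtuple(
    "MpiCall",
    "entry_time exit_time instr_entry cycles_entry instr_exit cycles_exit",
)
Burst = namedtuple("Burst", "start duration instructions ipc")


def extract_bursts(calls):
    """Return the computation bursts between consecutive MPI calls of one process."""
    bursts = []
    for prev, nxt in zip(calls, calls[1:]):
        instructions = nxt.instr_entry - prev.instr_exit
        cycles = nxt.cycles_entry - prev.cycles_exit
        if cycles <= 0:
            continue  # back-to-back calls: no computation in between
        bursts.append(Burst(
            start=prev.exit_time,
            duration=nxt.entry_time - prev.exit_time,
            instructions=instructions,
            ipc=instructions / cycles,
        ))
    return bursts


calls = [
    MpiCall(0.0, 1.0, 0, 0, 10, 100),
    MpiCall(5.0, 6.0, 4010, 2100, 4020, 2200),
    MpiCall(8.0, 9.0, 5020, 4200, 5030, 4300),
]
# Each (instructions, ipc) pair is one point of the scatter plot.
points = [(b.instructions, b.ipc) for b in extract_bursts(calls)]
print(points)
```

Two bursts with similar instruction counts but different IPC would show up as vertically separated points, which is the performance imbalance the slide refers to.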


Visualizing variability: scatter plots

• Time/space Distribution

WRF@128 cores

Detail as completeness of metrics

• Limited set of hardware counters

• How can we have a complete/precise/accurate characterization of hardware counters for the different regions of a program?

• From a single run?


Emitting “relevant” information


Emitting “relevant” data

• Detail for what is important, software counters(*) for what is not that important

• What is important?

• First order approach: Computation !!!

• MPI: a gas. Fills whatever space you give it. Very often not the major cause of problems

• Major computation bursts (i.e. > X ms)

• Entry and exit timestamps and hardware counters

• Communication phases.

• Software counters:

• # MPI calls, aggregated bytes, %time in MPI, …

(*) Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005
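A rough sketch of this selective emission, under stated assumptions: the record format, the function name `emit_relevant` and the threshold are hypothetical, and hardware counters are omitted for brevity. Bursts at or above the threshold are kept with their entry and exit timestamps; everything between two such bursts is collapsed into one software-counter record.

```python
def emit_relevant(records, min_burst):
    """Keep long bursts in detail; fold everything else into software counters.

    records: chronological list for one process, each either
             ("burst", start, end) or ("mpi", start, end, n_bytes).
    """
    out = []
    phase = None  # software counters of the phase being accumulated

    for rec in records:
        kind, start, end = rec[0], rec[1], rec[2]
        if kind == "burst" and end - start >= min_burst:
            if phase is not None:
                out.append(("counters", phase))
                phase = None
            out.append(("burst", start, end))
            continue
        # Short bursts and MPI calls only update the aggregate counters.
        if phase is None:
            phase = {"start": start, "end": end,
                     "mpi_calls": 0, "bytes": 0, "mpi_time": 0.0}
        phase["end"] = end
        if kind == "mpi":
            phase["mpi_calls"] += 1
            phase["bytes"] += rec[3]
            phase["mpi_time"] += end - start

    if phase is not None:
        out.append(("counters", phase))
    return out


records = [
    ("burst", 0.0, 10.0),
    ("mpi", 10.0, 11.0, 100),
    ("burst", 11.0, 11.5),   # too short to be emitted in detail
    ("mpi", 11.5, 12.0, 300),
    ("burst", 12.0, 30.0),
]
for item in emit_relevant(records, min_burst=5.0):
    print(item)
```

The fraction of time in MPI for a phase follows as `mpi_time / (end - start)`. The emitted volume then depends on the number of long bursts rather than on the number of MPI calls.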

GADGET Case A @ BGP, 1024 processes

[Figure: timelines of useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes and p2p BW]

[Figure: speedup versus number of processors (0 to 10000), showing S(P) and model speedup; series labels: 167 gravtree.c, 188 density.c, 246 hydra.c, 385 pm_periodic.c, 0 transpose_mpi.c]

GADGET Case A @ BGP, 2048 processes

[Figure: timelines of useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes and p2p BW]

[Figure: speedup versus number of processors (0 to 10000), showing S(P) and model speedup; series labels: 167 gravtree.c, 188 density.c, 246 hydra.c, 385 pm_periodic.c, 0 transpose_mpi.c]

GADGET Case A @ BGP, 4096 processes

[Figure: timelines of useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes and p2p BW]

[Figure: speedup versus number of processors (0 to 10000), showing S(P) and model speedup; series labels: 167 gravtree.c, 188 density.c, 246 hydra.c, 385 pm_periodic.c, 0 transpose_mpi.c]

PFLOTRAN @ jugene: 1 iteration

[Figure: timelines at 8K, 12K and 16K cores]


PFLOTRAN @ jugene – network traffic

[Figure: bytes on X dimension, bytes on Y dimension, bytes on Z dimension, and bytes on 3 dimensions (400K)]

Imbalance on link/direction utilization will limit communication performance

PFLOTRAN @ jugene - Detailed network traffic

• Zoomed region of the previous slide

[Figure: collective send bytes (110K), bytes out of node (400K), bandwidth (< 15 MB/s)]

How much network bandwidth do we need?

Can we improve the way we manage and use networks?


PFLOTRAN @ jaguar

Color indicates cluster ID; length indicates computation burst length

[Figure labels: Jacobian, KSPSolve]

K. Huck et al. “Analysis of PFLOTRAN on Jaguar”, CScADS Workshop on Performance Tools for Petascale Computing, August 2-5, 2010

Outliers as small as ~0 seconds!

PFLOTRAN @ Jaguar: OS noise impact

Default (pin to core) – 488 seconds

Explicit Pin to Core (“fastest”) – 463 seconds

Pin to CPU (NUMA) – 455 seconds

No pinning (slowest) – 620 seconds

Color indicates Cycles per microsecond

(timelines not to scale)

PFLOTRAN @ Jaguar: OS noise impact – zoomed view

Default

Pin to Core (“fastest”)

Color indicates Cycles per microsecond

Pre-emptions have a significant effect in the FLOW stage

...but not in the TRAN stage

(timelines not to scale)


PFLOTRAN @ Jaguar: “Spare” core results – no improvement

682 nodes, 7502 total cores – 538 seconds

744 nodes, 8184 total cores – 448 seconds

682 nodes, 6820 total cores – 566 seconds

819 nodes, 8184 total cores (last 6 unused) – 536 seconds

150 Seconds!

(timelines not to scale)


Example

• PEPC 16384 tasks on Jaguar

Duration of the computation bursts

# of MPI collective operations


PEPC @ jugene: 8K cores

MPI calls

Useful duration

Microscopic load imbalance!!!!


Variability in microscopic behavior

• GROMACS: Only computation phases parallelized with SMPSs

SMPSs tasks and MPI calls ( ~ multispectral)

Variability in microscopic behavior

Four loops/routines

Sequential order


Instrumentation + sampling

Instrumentation

• Events correlated to specific program activity

• Start/exit of iterations, functions, loops, …

• Different intervals:

• May be very large, may be very short

• Variable precision

• Captured data: hardware counters, call arguments, call path, …

• Accurate statistics: profiles, …

[Figure: timeline of two iterations (Start Iter, fA, fB, MPICall) with an MFLOPS plot]

Sampling

• Events uncorrelated to program activity (at least not to specific activity)

• Time (or counter) overflow

• Controlled granularity:

• Sufficiently large to minimize overhead

• Guaranteed acquisition interval/precision

• Statistical projection

• %time (or metric) = f( %counts )

• Assuming no correlation, sufficiently large #samples

[Figure: timeline of fA, fB, MPICall over two iterations with an MFLOPS plot]
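The statistical projection can be illustrated with a short sketch. It assumes the simplest choice of f, namely that the share of time equals the share of samples; the function name, region labels and numbers are hypothetical. The standard error is the usual binomial estimate and holds only under the two conditions listed above (no correlation with program activity, enough samples).

```python
import math
from collections import Counter


def project_time(sampled_regions, total_time):
    """Estimate time per region from the share of samples that hit it."""
    n = len(sampled_regions)
    estimates = {}
    for region, count in Counter(sampled_regions).items():
        share = count / n
        estimates[region] = {
            "time": share * total_time,
            # Binomial standard error of the share, valid only if samples
            # are independent of program activity.
            "std_error": math.sqrt(share * (1 - share) / n),
        }
    return estimates


# Hypothetical run: 10 samples over 20 s, each labelled with the region it hit.
samples = ["fA"] * 6 + ["fB"] * 3 + ["MPI"]
for region, est in project_time(samples, total_time=20.0).items():
    print(region, est)
```

With only ten samples the error on each share is large, which is why the projection needs a sufficiently large number of samples.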

Instrumentation + sampling

• Both

• Guaranteed interval

• Captured data:

• Hardware counters (since previous probe)

• Call path

• Call arguments in some probes

[Figure: timeline of two iterations (Start Iter, fA, fB, MPICall) with an MFLOPS plot]
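One way to picture "hardware counters since previous probe" is to merge both probe streams by time and difference a cumulative counter between consecutive probes, whichever kind they are. This is only a sketch: the tuple layout, the labels and the function name `merge_probes` are assumptions, not the format used by Extrae.

```python
import heapq


def merge_probes(instrumented, sampled):
    """Merge two time-sorted probe streams into one and compute rates.

    Each stream is a list of (time_us, cumulative_flops, label).
    Returns (time_us, kind, label, flops_since_previous_probe, mflops).
    """
    tagged = heapq.merge(
        [(t, c, "instr", label) for t, c, label in instrumented],
        [(t, c, "sample", label) for t, c, label in sampled],
    )
    out = []
    prev_t = prev_c = None
    for t, c, kind, label in tagged:
        if prev_t is not None and t > prev_t:
            delta = c - prev_c
            # Flops per microsecond is numerically equal to MFLOPS.
            out.append((t, kind, label, delta, delta / (t - prev_t)))
        prev_t, prev_c = t, c
    return out


instrumented = [(0, 0, "Start Iter"), (100, 5000, "MPI entry")]
sampled = [(40, 1000, "fA"), (80, 4200, "fB")]
for row in merge_probes(instrumented, sampled):
    print(row)
```

Without the two samples, the only available figure would be the average over the whole 100 µs interval; with them, the interval is split into three segments with distinct rates.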


Instrumentation + Sampling

• High sampling frequency (>> Nyquist)

• Guaranteed detail. Probably useful for many analyses.

• Large data size

[Figure labels: Safe sampled functions; MFLOPS at each interval; Instrumented MPI calls]


Sampling frequency

• Trade-off:

• Too low → no detail

• Too high → too much overhead

• Challenge: can we get

• A lot of detail, very fine-grain information:

• e.g. “instantaneous” performance metric rates

• With very little overhead:

• e.g. sampling a few times per second

• Work by Harald Servat

New roles

• Instrumentation: reference

• Identify different instances of a region for which to obtain the detailed time evolution of metrics

• Stationary behaviour assumed

• Target region:

• Iteration

• Routine

• Routine excluding MPI calls

• …

[Figure: timeline of fA, fB, MPICall over two iterations]

Harald Servat et al. “Detailed performance analysis using coarse grain sampling”, PROPER, 2009

H. Servat “Folding: providing detailed performance metrics using coarse grain sampling”, UPC-DAC-RR-2010-37

New roles

• Sampling role: relative data

• Guarantee granularity

• Provide data to increase granularity

[Figure: timeline of fA, fB, MPICall over two iterations, and the separate instances of fA]

Folding counters: Projecting

• Cumulative count since reference

• Variance in duration

– Eliminate outliers

– Scale

[Figure: samples from several instances of fA projected onto a single instance]


Folding counters: Fitting

• Eliminate outliers

• Kriging interpolation
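The projecting and fitting steps can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the data layout and all function names are hypothetical, outliers are dropped with a simple median-duration tolerance, and the Kriging interpolation is replaced by plain bin averaging with piecewise-linear segments.

```python
from bisect import bisect_right
from statistics import median


def fold(instances, samples, tolerance=0.2):
    """Project samples from many instances onto one normalized instance.

    instances: time-sorted, non-overlapping (t_start, t_end, c_start, c_end),
               with the cumulative counter read at the instrumented reference.
    samples:   (t, c) readings of the same cumulative counter.
    Returns sorted (x, y) points, both normalized to [0, 1].
    """
    med = median(t1 - t0 for t0, t1, _, _ in instances)
    starts = [inst[0] for inst in instances]
    points = []
    for t, c in samples:
        i = bisect_right(starts, t) - 1
        if i < 0:
            continue  # sample taken before the first instance
        t0, t1, c0, c1 = instances[i]
        if t > t1 or t1 == t0 or c1 == c0:
            continue  # outside any instance, or degenerate instance
        if abs((t1 - t0) - med) > tolerance * med:
            continue  # eliminate instances with outlier duration
        # Scale: time and count since the reference, relative to the instance.
        points.append(((t - t0) / (t1 - t0), (c - c0) / (c1 - c0)))
    return sorted(points)


def fit(points, n_bins):
    """Bin-average the folded points (a simple stand-in for Kriging)."""
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for x, y in points:
        b = min(int(x * n_bins), n_bins - 1)
        bins[b][0] += x
        bins[b][1] += y
        bins[b][2] += 1
    # Normalization pins the curve to (0, 0) and (1, 1).
    curve = [(0.0, 0.0)]
    curve += [(sx / n, sy / n) for sx, sy, n in bins if n]
    curve.append((1.0, 1.0))
    return curve


def rates(curve):
    """Slope per segment: metric rate relative to the instance average."""
    return [((x0 + x1) / 2, (y1 - y0) / (x1 - x0))
            for (x0, y0), (x1, y1) in zip(curve, curve[1:]) if x1 > x0]


# Three regular instances plus one that takes three times longer.
instances = [(0, 10, 0, 1000), (20, 30, 1000, 2000),
             (40, 50, 2000, 3000), (60, 90, 3000, 4000)]
# One coarse sample per instance, each at a different relative position.
samples = [(2.5, 400), (25.0, 1800), (47.5, 2900), (75.0, 3500)]
curve = fit(fold(instances, samples), n_bins=4)
print(rates(curve))
```

Each instance contributes a single sample, yet the folded curve has four segments. Multiplying a relative rate by the average count per instance divided by the average duration gives an absolute rate such as MIPS or MFLOPS.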


Impact of the number of folded instances

• The more samples are folded, the more detailed the results

• Longer executions

• Increase frequency

• Reach stability?

• Example:

• NAS BT class B copy_faces

• showing from 10 to 200 iterations

• 20 samples per second @ SGI Altix

Impact of the number of folded instances

• Experiments comparing a few samples per second to a 1000 times higher sampling frequency.

• It is not necessary to fold a very large number of instances, so the technique is potentially applicable even to slowly time-varying programs.


Emitted data

• Timelines

• Performance counters:

• Resample the fitted function and inject synthetic events into the trace

• Call stack

• Truncated by specifying routines of interest
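Resampling the fitted function might look like the sketch below. It assumes the fit is available as a normalized piecewise-linear curve through (0, 0) and (1, 1); the function names and the event layout (timestamp, counter increment) are hypothetical and do not reflect the Paraver trace format.

```python
def interpolate(curve, x):
    """Evaluate a piecewise-linear curve given as sorted (x, y) points."""
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x1 > x0 and x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the fitted range")


def synthetic_events(curve, t_start, t_end, total_count, n_events):
    """Sample the curve at n_events uniform points inside one instance.

    Returns (timestamp, counter_increment_since_previous_event) pairs.
    """
    events = []
    prev_y = 0.0
    for k in range(1, n_events + 1):
        x = k / n_events
        y = interpolate(curve, x)
        events.append((t_start + x * (t_end - t_start),
                       (y - prev_y) * total_count))
        prev_y = y
    return events


# Hypothetical fitted curve: 80% of the count is done in the first half.
curve = [(0.0, 0.0), (0.5, 0.8), (1.0, 1.0)]
for event in synthetic_events(curve, t_start=100.0, t_end=110.0,
                              total_count=1000, n_events=4):
    print(event)
```

The increments add up to the total count of the instance, so the injected events stay consistent with the instrumented counter readings at its boundaries.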


Emitted data

• Plots, statistics

• Time, IPC,…

• Could think of emitting an analytical expression

• Scalability impact !!!!

• Even if generating traces

• Example (Gadget2 using 128 tasks)

• 100 iterations, 5 samples/s during 90 minutes ~ 236 MB

• Folding on 1 iteration @ 200 samples/s ~ 64 MB

[Figure: MIPS and MFLOPS plots for NAS BT, ALYA and SIESTA]


PFLOTRAN (data obtained with 5 samples/s)


PEPC (data obtained with 5 samples/s)


Summary

• Performance tools are needed more and more!

• To tune our applications, to design our system software

• To understand what really happens, how our systems really behave, …

• Great progress is taking place

• Functionality

• Scalability: dynamic range

• Detail IS important and can be obtained/handled

• A lot of open research

“I have seen things you people wouldn't believe...” (Roy Batty, Blade Runner)

“Seeing is believing ... measuring is better” (free adaptation of a Spanish saying)