
Benefits of sampling in tracefiles

Harald Servat

Program Development for Extreme-Scale Computing, May 3rd, 2010



Outline

Instrumentation and sampling
Folding
Summarized traces
Some results
Current work



Instrumentation

Performance tools based on instrumentation
Granularity of the results depends on the application structure
Data gathered includes:
  Performance counters, callstack, message size…



Sampling

Sampling reaches any application point at a given interval
Easily tunable frequency
Gathers performance counters and callstack



Main objective

Combine both mechanisms
  Deeper performance details
  Using PAPI_overflow(..)
... what about the frequency trade-off?
  Not so high that it disrupts the performance data
  Not so low that it yields no useful information
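
The trade-off can be made concrete with a little arithmetic. The sketch below is only illustrative: it assumes samples are triggered by a cycle-counter overflow, so the overflow threshold is roughly the clock rate divided by the target sampling rate, and it assumes a fixed cost per sample handler. The clock rate, the handler cost, and both function names are hypothetical, not values from the talk.

```python
def overflow_threshold(clock_hz: float, samples_per_second: float) -> int:
    """Cycles to count between two overflow callbacks for a target rate."""
    if samples_per_second <= 0:
        raise ValueError("samples_per_second must be positive")
    return round(clock_hz / samples_per_second)


def sampling_overhead(samples_per_second: float, handler_cost_s: float) -> float:
    """Fraction of run time spent inside the sample handler."""
    return samples_per_second * handler_cost_s


CLOCK_HZ = 1.6e9         # hypothetical core clock
HANDLER_COST_S = 20e-6   # hypothetical cost of gathering one sample

for rate in (5, 20, 200, 20000):
    threshold = overflow_threshold(CLOCK_HZ, rate)
    overhead = sampling_overhead(rate, HANDLER_COST_S)
    print(f"{rate:>6} samples/s  threshold={threshold:>12} cycles  "
          f"overhead={overhead:.2%}")
```

Under these assumptions the overhead grows linearly with the rate, which is why a coarse rate is attractive if the detail can be recovered some other way.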



Work done: Folding

Harald Servat, Germán Llort, Judit Giménez, Jesús Labarta: Detailed performance analysis using coarse grain sampling. PROPER, 2009.

Objective: get detailed metrics with few samples
  Benefits from both high and low frequencies!
Take advantage of stationary behavior of scientific applications
Build synthetic region from scattered samples
Reintroduce into the tracefile at chosen ratio



Folding: Moving samples

Main idea: Move samples to the target iteration preserving their original relative time.

Steps
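
The move itself can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each instance of the region has a known start and end time (from instrumentation), that each sample carries a counter value accumulated since the start of its instance, and that relative time is the fraction of the instance elapsed. The function name and the data layout are hypothetical.

```python
from bisect import bisect_right


def fold_samples(starts, ends, samples, target_start, target_end):
    """Move samples from many region instances onto one target instance.

    starts, ends: sorted start and end times of every instance.
    samples: iterable of (timestamp, value_since_instance_start).
    Returns (folded_time, value) pairs sorted by folded time.
    """
    target_len = target_end - target_start
    folded = []
    for t, value in samples:
        i = bisect_right(starts, t) - 1          # instance that contains t
        if i < 0 or t > ends[i]:
            continue                             # sample outside every instance
        rel = (t - starts[i]) / (ends[i] - starts[i])   # relative time in [0, 1]
        folded.append((target_start + rel * target_len, value))
    folded.sort()
    return folded


# Three instances of 10 time units each, one sample in each of them.
starts, ends = [0.0, 10.0, 20.0], [10.0, 20.0, 30.0]
samples = [(2.0, 20), (15.0, 50), (28.0, 80)]
folded = fold_samples(starts, ends, samples, target_start=0.0, target_end=10.0)
print(folded)
```

Each instance contributes a single sample, yet the target instance ends up with three points spread along its duration.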



Folding: Interpolation

Instructions evolution for routine copy_faces of NAS MPI BT B

No instrumentation points within the routine, but we got details

Red crosses represent the folded samples and show the completed instructions from the start of the routine

Green line is the curve fitting of the folded samples and is used to reintroduce the values into the tracefile

Blue line is the derivative of the curve fitting
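
The relation between the three curves can be sketched as follows. The slide does not say which fitting method is used, so a low-degree least-squares polynomial stands in for it here purely as an assumption. The folded samples are synthetic, and `polyfit`, `polyval` and `polyder` are small helpers written for this example.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the Vandermonde matrix A.
    m = [[sum(x ** (r + c) for x in xs) for c in range(n)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def polyval(coeffs, x):
    """Evaluate the polynomial at x (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def polyder(coeffs):
    """Coefficients of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]


# Synthetic folded samples: relative time in the routine and instructions
# completed since its start (in millions).
xs = [0.0, 0.1, 0.3, 0.5, 0.8, 1.0]
ys = [100 * x + 50 * x * x for x in xs]

fit = polyfit(xs, ys, degree=2)      # plays the role of the fitted curve
rate = polyder(fit)                  # plays the role of its derivative

# Values to reintroduce into the tracefile at an evenly spaced, chosen ratio.
for k in range(5):
    t = k / 4
    print(f"t={t:.2f}  instructions={polyval(fit, t):8.2f}  "
          f"rate={polyval(rate, t):8.2f}")
```

The derivative is what turns accumulated counts into an instantaneous rate inside a routine that has no instrumentation points of its own.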



Folding areas

Folding is applied to delimited regions
  Previously instrumented
    User function
    Iteration
  Automatically obtained from the gathered results
    Clusters of computation bursts
      Juan González, Judit Giménez, Jesús Labarta: Automatic detection of parallel applications computation phases. IPDPS 2009.
    Delimited time regions
      Marc Casas, Rosa M. Badia, Jesús Labarta: Automatic Structure Extraction from MPI Applications Tracefiles. Euro-Par 2007.



Impact of the sampling frequency

The more samples are folded, the more detailed the results
  Longer executions
  Increase frequency
  Reach stability?

Example:

NAS BT class B copy_faces

showing from 10 to 200 iterations

20 samples per second @ SGI Altix



Impact of the sampling frequency

Choosing a sampling frequency is important
Sampling frequency can couple with application frequency
Choose frequencies based on prime factors
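
The coupling effect can be illustrated with a toy model. It assumes a perfectly periodic application and a perfectly regular sampler, and it reads "based on prime factors" as choosing a sampling period that shares no factor with the iteration period; both are assumptions of this sketch. Periods are in integer microseconds and the function names are hypothetical.

```python
from math import gcd


def distinct_fold_positions(sample_period_us: int, iteration_period_us: int) -> int:
    """How many different relative positions folded samples can ever land on."""
    return iteration_period_us // gcd(sample_period_us, iteration_period_us)


def folded_offsets(sample_period_us: int, iteration_period_us: int, n_samples: int):
    """Offsets inside the iteration hit by the first n_samples samples."""
    return sorted({(k * sample_period_us) % iteration_period_us
                   for k in range(n_samples)})


ITERATION_US = 100_000   # hypothetical 100 ms iteration

# 20 samples/s (50 000 us) divides the iteration period evenly: coupled.
print(distinct_fold_positions(50_000, ITERATION_US),
      folded_offsets(50_000, ITERATION_US, 10))

# A period with no common factor with the iteration period keeps drifting.
print(distinct_fold_positions(49_999, ITERATION_US),
      folded_offsets(49_999, ITERATION_US, 10))
```

In the coupled case every sample folds onto the same two points no matter how long the run is, so extra iterations add no detail.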



Outline


Summarized traces
Some results
Current work



Dealing with large scale traces

Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005.

Application’s behavior can be divided into:
  Communication phases
  Intensive computation phases

Instrumentation library that identifies relevant computation phases



Dealing with large scale traces

Information emitted at phase change
  Punctual (callstack)
  Aggregated
    Hardware counters
    Software counters
      Number of point-to-point and collective operations
      Number of bytes transferred
      Time in MPI
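
A rough sketch of the summarization idea follows. It is not the instrumentation library's actual logic: the event tuples, the `min_burst_s` threshold for deciding that a computation burst is relevant, and the record layout are all hypothetical. Only the software counters from the list above are modelled.

```python
from dataclasses import dataclass


@dataclass
class PhaseSummary:
    """Software counters aggregated over one communication phase."""
    p2p_ops: int = 0
    collective_ops: int = 0
    bytes_transferred: int = 0
    time_in_mpi: float = 0.0


def summarize(events, min_burst_s):
    """Replace individual MPI events by one summary per communication phase.

    events: time-ordered (kind, start, end, nbytes) tuples, where kind is
    "compute", "p2p" or "collective". Computation bursts shorter than
    min_burst_s are dropped and do not end the current phase.
    """
    out, pending, dirty = [], PhaseSummary(), False
    for kind, start, end, nbytes in events:
        if kind == "compute":
            if end - start >= min_burst_s:       # phase change
                if dirty:
                    out.append(("summary", pending))
                    pending, dirty = PhaseSummary(), False
                out.append(("burst", start, end))
            continue
        if kind == "collective":
            pending.collective_ops += 1
        elif kind == "p2p":
            pending.p2p_ops += 1
        else:
            raise ValueError(f"unknown event kind: {kind}")
        pending.bytes_transferred += nbytes
        pending.time_in_mpi += end - start
        dirty = True
    if dirty:
        out.append(("summary", pending))
    return out


events = [
    ("compute", 0.0, 5.0, 0),
    ("p2p", 5.0, 5.5, 1024),
    ("compute", 5.5, 5.6, 0),        # too short to count as relevant
    ("p2p", 5.6, 6.0, 2048),
    ("collective", 6.0, 7.0, 64),
    ("compute", 7.0, 12.0, 0),
]
for record in summarize(events, min_burst_s=1.0):
    print(record)
```

Six input events collapse into three records here, and the saving grows with the number of MPI calls per communication phase.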



Example

PEPC 16384 tasks on Jaguar

Duration of the computation bursts

# of MPI collective operations



Benefits of summarized tracefiles

Important trace size reduction
  Gadget2 (128) – 10 Gbytes down to 428 Mbytes
  PEPC (16k) – 19 Gbytes down to 400 Mbytes
  PFLOTRAN (16k) – more than 250 Gbytes down to 6 Gbytes

Whole execution analysis



Working with large traces?

We're dealing with large scale executions
  Maintain scalability of tracing + sampling
    By adding more data?
    Use folding to reduce data
Example (Gadget2 using 128 tasks)
  100 iterations, 5 samples/s during 90 minutes ~ 236 MB
  Folding on 1 iteration @ 200 samples/s ~ 64 MB
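
The numbers in this example can be checked with back-of-the-envelope arithmetic. The sketch assumes that the 100 iterations have equal length and that every coarse sample falls inside the folded region; neither is stated on the slide. The function name is hypothetical.

```python
def folding_budget(run_seconds, iterations, coarse_rate, folded_rate):
    """Per-task sample counts before and after folding onto one iteration."""
    iteration_seconds = run_seconds / iterations
    coarse_samples = coarse_rate * run_seconds           # gathered over the run
    folded_samples = folded_rate * iteration_seconds     # reintroduced after folding
    available_rate = coarse_samples / iteration_seconds  # density once folded
    return iteration_seconds, coarse_samples, folded_samples, available_rate


it_s, coarse, folded, available = folding_budget(
    run_seconds=90 * 60, iterations=100, coarse_rate=5, folded_rate=200)

print(it_s)        # 54.0 seconds per iteration
print(coarse)      # 27000 samples per task at 5 samples/s
print(folded)      # 10800.0 samples per task at 200 samples/s on one iteration
print(available)   # 500.0 folded samples per second of iteration time
print(236 / 64)    # 3.6875, the trace size ratio reported above
```

Under these assumptions the folded iteration holds about 500 raw samples per second of iteration time, so reintroducing 200 samples/s is well within what the coarse run provides.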



Outline


Summarized traces
Combining mechanisms
Some results
Current work



Gadget2 analysis, 128 tasks

[Figure: Gadget2 analysis. Percentages shown: 32%, 16%, 13%, 8%. Region labels:
  force_tree.c +75 - gravity_tree.c +167
  gravity_tree.c +528 - density.c +167
  force_tree.c +1701 - hydra.c +246
  predict.c +92 - pm_periodic.c +385]



PEPC analysis, 32 tasks

[Figure: PEPC analysis. Percentages shown: 45%, 37%, 5%, 3%. Region labels:
  tree_aswalk.f90 +162 - tree_aswalk.f90 +380
  tree_domains.f90 +548 - tree_branches.f90 +155
  tree_branches.f90 +548 - tree_properties.f90 +328
  tree_aswalk.f90 +380 - tree_aswalk.f90 +162]



Current directions

We work on:
  Is there an optimal sampling frequency?
  Quantify correctness and validate the results
  Callstack analysis



Thank you!
