Benefits of sampling in tracefiles
Harald Servat
Program Development for Extreme-Scale Computing, May 3rd, 2010
Outline
Instrumentation and sampling
Folding
Summarized traces
Some results
Current work
Instrumentation
Performance tools based on instrumentation
Granularity of the results depends on the application structure
Data gathered includes:
  Performance counters, callstack, message size…
Sampling
Sampling reaches any application point at a given interval
Easily tunable frequency
Gathers performance counters and callstack
Main objective
Combine both mechanisms
  Deeper performance details
  Using PAPI_overflow(..)
... what about the frequency trade-off?
  Not so high that it disrupts the performance data
  Not so low that it yields no useful information
Work done: Folding
Harald Servat, Germán Llort, Judit Giménez, Jesús Labarta: Detailed performance analysis using coarse grain sampling. PROPER, 2009.
Objective: get detailed metrics with few samples
  Benefits from both high and low frequencies!
Take advantage of the stationary behavior of scientific applications
Build a synthetic region from scattered samples
  Reintroduce it into the tracefile at a chosen ratio
Folding: Moving samples
Main idea: move samples to the target iteration, preserving their original relative time.
Steps: [illustrated in a figure on the slide; not preserved in this transcript]
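The main idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function name `fold_samples`, the list-of-boundaries input format, and the choice to normalize each sample's offset by the length of its own iteration are all assumptions made for the example.

```python
from bisect import bisect_right


def fold_samples(iteration_bounds, sample_times, target_index=0):
    """Fold samples from every iteration into one target iteration.

    iteration_bounds: sorted start times of each iteration followed by the end
                      time of the last one (N iterations need N + 1 entries).
    sample_times:     sample timestamps, in the same time unit.
    Returns the folded timestamps, sorted.
    """
    target_start = iteration_bounds[target_index]
    target_length = iteration_bounds[target_index + 1] - target_start
    folded = []
    for t in sample_times:
        i = bisect_right(iteration_bounds, t) - 1
        if i < 0 or i >= len(iteration_bounds) - 1:
            continue  # sample lies outside every iteration
        start, end = iteration_bounds[i], iteration_bounds[i + 1]
        relative = (t - start) / (end - start)  # position within its iteration
        folded.append(target_start + relative * target_length)
    return sorted(folded)


# Three iterations of 10 time units, each hit by one sample at a different offset.
folded = fold_samples([0.0, 10.0, 20.0, 30.0], [2.0, 15.0, 28.0])
print(folded)
```

With three 10-unit iterations and samples at 2, 15 and 28, the folded samples land at 2, 5 and 8 inside the first iteration, so one sample per iteration becomes three samples on the target.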
Folding: Interpolation
Evolution of completed instructions for routine copy_faces of NAS MPI BT class B
There are no instrumentation points within the routine, yet we still obtain details
Red crosses represent the folded samples and show the completed instructions from the start of the routine
The green line is the curve fitted to the folded samples and is used to reintroduce the values into the tracefile
The blue line is the derivative of the fitted curve
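The relation between the folded samples, the fitted curve and its derivative can be illustrated with a small Python sketch. The slide does not name the curve-fitting method, so this example swaps in a deliberately simple one: per-bin averages joined by straight lines. The function names, the bin count and the sample values are hypothetical.

```python
def fit_piecewise(samples, bins=4):
    """Average folded samples per time bin and return (time, value) knots.

    samples: (relative_time, accumulated_count) pairs with time in [0, 1].
    The curve is anchored at (0, 0) because the counter is measured from the
    start of the region.
    """
    sums = [[0.0, 0.0, 0] for _ in range(bins)]
    for t, v in samples:
        b = min(int(t * bins), bins - 1)
        sums[b][0] += t
        sums[b][1] += v
        sums[b][2] += 1
    knots = [(0.0, 0.0)]
    for st, sv, n in sums:
        if n > 0 and st / n > knots[-1][0]:  # keep knot times strictly increasing
            knots.append((st / n, sv / n))
    return knots


def evaluate(knots, t):
    """Linearly interpolate the fitted curve at relative time t."""
    for (t0, v0), (t1, v1) in zip(knots, knots[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return knots[-1][1] if t > knots[-1][0] else knots[0][1]


def slopes(knots):
    """Derivative of the fitted curve: one rate per segment."""
    return [(v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(knots, knots[1:])]


# Hypothetical folded samples: a slow first half followed by a faster second half.
samples = [(0.1, 10), (0.2, 20), (0.3, 30), (0.4, 40),
           (0.6, 80), (0.7, 110), (0.8, 140), (0.9, 170)]
knots = fit_piecewise(samples)
rates = slopes(knots)
print(knots)
print(rates)
```

The accumulated counter plays the role of the red crosses, `evaluate` the green line, and `slopes` the blue line: the change in rate between the two halves only becomes visible in the derivative.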
Folding areas
Folding is applied to delimited regions
  Previously instrumented
    User function
    Iteration
  Automatically obtained from the gathered results
    Clusters of computation bursts
      Juan González, Judit Giménez, Jesús Labarta: Automatic detection of parallel applications computation phases. IPDPS, 2009.
    Delimited time regions
      Marc Casas, Rosa M. Badia, Jesús Labarta: Automatic Structure Extraction from MPI Applications Tracefiles. Euro-Par, 2007.
Impact of the sampling frequency
The more samples are folded, the more detailed the results
  Longer executions
  Increase frequency
  Reach stability?
Example:
  NAS BT class B, copy_faces
  Showing from 10 to 200 iterations
  20 samples per second @ SGI Altix
Impact of the sampling frequency
Choosing a sampling frequency is important
Sampling frequency can couple with application frequency
Choose frequencies based on prime factors
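The coupling effect can be shown with integer arithmetic. The sketch below is an illustration under assumed numbers: a hypothetical application whose iterations last exactly 100 ms, sampled once with a 50 ms period and once with a 50 021 µs period, which shares no prime factor with the iteration period. The function names are made up for the example.

```python
from math import gcd


def distinct_offsets(sampling_period_us, iteration_period_us, n_samples):
    """Offsets within an iteration that periodic sampling actually hits."""
    return {(k * sampling_period_us) % iteration_period_us
            for k in range(n_samples)}


def max_distinct_offsets(sampling_period_us, iteration_period_us):
    """Upper bound on distinct offsets, however long the run lasts."""
    return iteration_period_us // gcd(sampling_period_us, iteration_period_us)


iteration_us = 100_000  # hypothetical 100 ms iteration

coupled = distinct_offsets(50_000, iteration_us, 1000)    # 20 samples/s
decoupled = distinct_offsets(50_021, iteration_us, 1000)  # slightly detuned

print(len(coupled), max_distinct_offsets(50_000, iteration_us))
print(len(decoupled), max_distinct_offsets(50_021, iteration_us))
```

In the coupled case every sample falls on one of only two offsets, so folding adds no detail no matter how long the run is. In the detuned case each of the 1000 samples falls on a different offset.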
Outline
Instrumentation and sampling
Folding
Summarized traces
Some results
Current work
Dealing with large scale traces
Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005.
Application’s behavior can be divided into:
  Communication phases
  Intensive computation phases
Instrumentation library that identifies relevant computation phases
Dealing with large scale traces
Information emitted at phase change
  Punctual (callstack)
  Aggregated
    Hardware counters
    Software counters
      Number of point-to-point and collective operations
      Number of bytes transferred
      Time in MPI
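As a rough illustration of the software counters listed above, the following Python sketch collapses the MPI calls of one communication phase into a single summary record. The `MpiCall` record, its fields and the `summarize_phase` function are hypothetical. They are not the data structures of the instrumentation library described in the talk.

```python
from dataclasses import dataclass


@dataclass
class MpiCall:
    """Hypothetical record for one MPI call seen by the instrumentation."""
    start: float      # seconds
    end: float        # seconds
    collective: bool  # True for collectives, False for point-to-point
    nbytes: int       # bytes transferred by this call


def summarize_phase(calls):
    """Collapse the MPI calls of one communication phase into counters."""
    return {
        "p2p_ops": sum(1 for c in calls if not c.collective),
        "collective_ops": sum(1 for c in calls if c.collective),
        "bytes": sum(c.nbytes for c in calls),
        "time_in_mpi": sum(c.end - c.start for c in calls),
    }


phase = [
    MpiCall(0.0, 0.5, False, 1024),
    MpiCall(0.5, 0.75, False, 2048),
    MpiCall(1.0, 1.5, True, 64),
]
summary = summarize_phase(phase)
print(summary)
```

Emitting one such record per phase change, instead of one record per MPI call, is what keeps the trace size bounded by the number of phases rather than by the number of calls.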
Example
PEPC 16384 tasks on Jaguar
Duration of the computation bursts
# of MPI collective operations
Benefits of summarized tracefiles
Important trace size reduction
  Gadget2 (128): 10 Gbytes down to 428 Mbytes
  PEPC (16k): 19 Gbytes down to 400 Mbytes
  PFLOTRAN (16k): more than 250 Gbytes down to 6 Gbytes
Whole execution analysis
Working with large traces?
We're dealing with large scale executions
  Maintain scalability of tracing + sampling
    By adding more data?
    Use folding to reduce data
Example (Gadget2 using 128 tasks)
  100 iterations, 5 samples/s during 90 minutes: ~236 MB
  Folding on 1 iteration @ 200 samples/s: ~64 MB
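The numbers in the Gadget2 example can be cross-checked with a little arithmetic. The sketch below assumes that the 90 minutes consist entirely of the 100 iterations and that all iterations have equal length. Neither is stated on the slide, and the variable names are only illustrative.

```python
iterations = 100
run_seconds = 90 * 60
gather_rate = 5    # samples/s while tracing the whole run
fold_rate = 200    # samples/s reintroduced on the single folded iteration

# Per-task sample counts (assumes equal-length iterations filling the whole run).
samples_gathered = gather_rate * run_seconds
iteration_seconds = run_seconds / iterations
samples_reintroduced = fold_rate * iteration_seconds

# Sampling density obtained if every gathered sample is folded onto one iteration.
folded_density = samples_gathered / iteration_seconds

# Trace sizes quoted on the slide, in MB.
size_ratio = 236 / 64

print(samples_gathered, iteration_seconds, samples_reintroduced)
print(folded_density, round(size_ratio, 2))
```

Under these assumptions each task gathers 27 000 samples, and folding them onto one 54-second iteration yields an equivalent density of 500 samples/s. That is enough to back the 200 samples/s reintroduced into a trace roughly 3.7 times smaller.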
Outline
Instrumentation and sampling
Folding
Summarized traces
Combining mechanisms
Some results
Current work
Gadget2 analysis, 128 tasks
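[Figure: four code regions, each labeled by its start and end source location (file +line), with the percentages 32%, 16%, 13% and 8%]
Region labels:
  force_tree.c +75 - gravity_tree.c +167
  gravity_tree.c +528 - density.c +167
  force_tree.c +1701 - hydra.c +246
  predict.c +92 - pm_periodic.c +385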
PEPC analysis, 32 tasks
[Figure: four code regions, each labeled by its start and end source location (file +line), with the percentages 45%, 37%, 5% and 3%]
Region labels:
  tree_aswalk.f90 +162 - tree_aswalk.f90 +380
  tree_domains.f90 +548 - tree_branches.f90 +155
  tree_branches.f90 +548 - tree_properties.f90 +328
  tree_aswalk.f90 +380 - tree_aswalk.f90 +162
Current directions
We work on:
  Is there an optimal sampling frequency?
  Quantify correctness and validate the results
  Callstack analysis
Thank you!