Performance Tools (Paraver/Dimemas)
TRANSCRIPT
www.bsc.es
ENES Workshop on Exascale Technologies, Hamburg, March 18th, 2014
Jesús Labarta, Judit Gimenez (BSC)
Our Tools
- Since 1991
- Based on traces
- Open Source – http://www.bsc.es/paraver
- Core tools:
  – Paraver (paramedir) – offline trace analysis
  – Dimemas – message passing simulator
  – Extrae – instrumentation
- Focus: detail, flexibility, intelligence
A “different” view point
- Look at structure …
  – of behavior, not syntax
  – differentiated or repetitive patterns in time and space
  – focus on computation regions (bursts)
[Timeline view, 0–3.5 s]
A “different” view point
- … and fundamental metrics
[Useful duration of user functions @ NMMB; adv2 (gather–fft–scatter)* mono]
[Table of efficiency factors per configuration: LB (load balance), Ser (serialization), Trf (transfer), Eff (parallel efficiency)]
M. Casas et al., “Automatic analysis of speedup of MPI applications”, ICS 2008.
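The multiplicative model behind these factors can be written down directly: overall parallel efficiency is the product of the load balance, serialization, and transfer terms. A minimal sketch in Python; the factor values are illustrative, not taken from the figure.

```python
# Multiplicative efficiency model used throughout these slides:
#   Eff = LB * Ser * Trf
# LB  - load balance across processes
# Ser - serialization due to dependencies
# Trf - data transfer efficiency
# The sample values below are illustrative only.

def parallel_efficiency(lb, ser, trf):
    """Combine the three factors into overall parallel efficiency."""
    return lb * ser * trf

eff = parallel_efficiency(lb=0.83, ser=0.97, trf=0.80)
print(f"Eff = {eff:.2f}")  # 0.83 * 0.97 * 0.80 ≈ 0.64
```

Because the model is a plain product, a low overall efficiency can be attributed to whichever factor is smallest, which is what makes the decomposition actionable.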
More on structure and concurrency
Scalability tradeoffs between processes at different phases?
More on structure and concurrency
How to find out:
- Discussion with the developer
- Automatically?
V. Subotic et al., “Automatic exploration of potential parallelism in sequential applications”, ISC 2014.
More on structure and concurrency
Huge potential for concurrency and overlap to:
- tolerate latencies
- spread load across resources: cores and network !!
More on structure and concurrency
You may even want to constrain potential concurrency !!!
More on structure and concurrency and syntax
WIP:
- Taskify with OmpSs
- OpenMP 4.0 accelerator features in OmpSs
Performance analytics
Using Clustering to identify structure
[Scatter plot: IPC vs. completed instructions, colored by cluster]
J. Gonzalez et al., “Automatic Detection of Parallel Applications Computation Phases”, IPDPS 2009.
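A minimal sketch of the clustering step, in the spirit of the cited work (which uses density-based clustering): computation bursts are grouped by their (IPC, completed instructions) signature, without fixing the number of clusters up front. The tiny DBSCAN implementation and the synthetic burst data below are illustrative, not the tool's actual code.

```python
import math
import random

def dbscan(points, eps, min_pts):
    """Tiny DBSCAN sketch: return a cluster id per point, -1 = noise."""
    n = len(points)
    labels = [-1] * n
    visited = [False] * n
    cluster = 0

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neigh = neighbors(i)
        if len(neigh) < min_pts:
            continue                      # noise, unless claimed by a cluster later
        labels[i] = cluster
        queue = list(neigh)
        while queue:                      # grow the cluster density-reachably
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                jn = neighbors(j)
                if len(jn) >= min_pts:
                    queue.extend(jn)
        cluster += 1
    return labels

# Two synthetic burst populations (IPC, instructions in millions),
# standing in for per-burst hardware-counter readings.
random.seed(0)
bursts = ([(random.gauss(1.2, 0.03), random.gauss(3.0, 0.05)) for _ in range(40)]
          + [(random.gauss(0.6, 0.03), random.gauss(1.0, 0.05)) for _ in range(40)])
labels = dbscan(bursts, eps=0.2, min_pts=4)
print("clusters found:", len({c for c in labels if c >= 0}))  # 2
```

Each cluster then corresponds to one repeated computation phase of the application, which is what the scatter plot on the slide visualizes.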
Projecting hardware counters based on clustering
- Full per-region HWC characterization from a single run
[Projected metrics: miss ratios, instruction mix, stalls]
Tracking structural evolution
- Frame sequence: clustered scatter plot as core count increases
[OpenMX strong scaling: frames at 64, 128, 192, 256, 384, 512 cores]
G. Llort et al., “On the Usefulness of Object Tracking Techniques in Performance Analysis”, SC 2013.
Mixing instrumentation and sampling …
- … to get extreme detail with minimal overhead
- Different roles:
  – Instrumentation delimits regions
  – Sampling reports progress within a region
[Samples from Iteration #1, #2, #3 folded into one synthetic iteration]
Harald Servat et al., “Unveiling Internal Evolution of Parallel Application Computation Phases”, ICPP 2011.
Harald Servat et al., “Detailed performance analysis using coarse grain sampling”, PROPER@EUROPAR, 2009.
Folding hardware counters
- Instruction evolution for routine copy_faces of NAS MPI BT.B
- Red crosses: the folded samples, showing completed instructions since the start of the routine
- Green line: curve fit of the folded samples, used to reintroduce the values into the tracefile
- Blue line: derivative of the curve fit over time (counter rate)
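The folding step can be sketched in a few lines: sample timestamps from many iterations are mapped onto one synthetic iteration by their offset from the region start, a polynomial is fitted to the folded points, and its derivative gives the instantaneous counter rate. The iteration length and sample data below are synthetic, for illustration only.

```python
# Folding sketch: samples of a cumulative counter (completed instructions)
# from many iterations are folded onto one synthetic iteration; a fit
# recovers the counter's evolution and its derivative the rate.
import numpy as np

iter_len = 1.0e-3                       # assumed iteration length: 1 ms
rng = np.random.default_rng(1)

# Synthetic ground truth: instructions grow at 2e9 instr/s within an iteration.
t_abs = rng.uniform(0.0, 100 * iter_len, size=200)       # sample timestamps
offset = t_abs % iter_len                                # fold onto one iteration
instr = 2.0e9 * offset + rng.normal(0, 1e4, size=200)    # noisy counter readings

fit = np.polyfit(offset, instr, deg=2)   # fitted cumulative curve (green line)
rate = np.polyder(fit)                   # derivative = counter rate (blue line)

mid = np.polyval(rate, iter_len / 2)
print(f"rate mid-iteration ≈ {mid / 1e9:.2f} G instr/s")
```

The fitted curve is what gets reintroduced into the tracefile; the derivative exposes intra-region behavior that instrumentation alone, which only sees region boundaries, cannot.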
Combined clustering + folding
- Instantaneous values
- All metrics
- From a single run
- “No” overhead
[CGPOP-1D: per-cluster detail between MPI calls — 17.20 M instructions ~ 1000 MIPS; 24.92 M instructions ~ 1100 MIPS; 32.53 M instructions ~ 1200 MIPS]
CESM v18 – v19 trace
- User functions not instrumented
- Process counts: ATM 384, LND 16, ICE 32, OCN 10, CPL 128
[Figure annotations: 2.54 GB, 160 s, 5 200 ms, 2.55 GB, 4.5 MB, 11.5 MB, 570]
CESM CAM v18
[Timeline of routines: Convect_shallow_tend, Microp_driver_tend, aer_rad_props_sw, aer_rad_props_lw, rrtmg_sw, rad_rrtmg_lw]
CESM CAM v19
[Timeline of routines: Convect_shallow_tend, Svp_water, M_list_mp_init_, Vertical_diffusion, rrtmg_sw, rad_rrtmg_lw, Microp_driver_tend, aer_rad_props_sw, Aerosol_dryed_intr_]
Dimemas
Dimemas: coarse-grain, trace-driven simulation
- Simulation: highly non-linear model
  – Linear components
    • Point-to-point communication
    • Sequential processor performance
      – Global CPU speed
      – Per block/subroutine
  – Non-linear components
    • Synchronization semantics
      – Blocking receives
      – Rendezvous
    • Resource contention
      – CPU
      – Communication subsystem: links (half/full duplex), busses
[Machine model diagram: nodes of CPUs with local memory, node links (L) and interconnect busses (B)]
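The linear point-to-point component of the model can be sketched as the usual latency/bandwidth formula; the ideal machine of the next slide is its limit L = 0, BW = ∞. The latency and bandwidth values below are illustrative, not Dimemas defaults.

```python
# Linear point-to-point model for the communication component:
#   T(msg) = latency + size / bandwidth
# The "ideal machine" is the limit latency = 0, bandwidth = infinity,
# so every transfer takes zero time. Parameter values are illustrative.
import math

def ptp_time(size_bytes, latency_s, bandwidth_Bps):
    """Transfer time of one point-to-point message."""
    return latency_s + size_bytes / bandwidth_Bps

real = ptp_time(1 << 20, latency_s=5e-6, bandwidth_Bps=10e9)    # 1 MiB message
ideal = ptp_time(1 << 20, latency_s=0.0, bandwidth_Bps=math.inf)
print(f"real: {real * 1e6:.1f} us, ideal: {ideal:.1f} s")
```

Replaying a trace under different (latency, bandwidth) pairs is what lets Dimemas answer "what if" questions about the network without rerunning the application.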
Ideal machine
- The impossible machine: BW = ∞, L = 0
- Actually describes/characterizes intrinsic application behavior
  – Load balance problems?
  – Dependence problems?
[GADGET @ Nehalem cluster, 256 processes: real run vs. ideal network; phases dominated by waitall, sendrecv, alltoall, allgather + sendrecv, allreduce]
Impact on practical machines?
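One way to quantify the gap between the two timelines: the ratio of ideal-network time to real time gives the transfer factor of the efficiency model used earlier. The runtimes below are invented examples, not GADGET measurements.

```python
# Comparing a real run against the Dimemas ideal network (BW = inf, L = 0)
# separates time spent in data transfer from the application's intrinsic
# behavior. Runtimes below are invented examples.
def transfer_efficiency(t_ideal_s, t_real_s):
    """Fraction of real runtime explained by the ideal-network run."""
    return t_ideal_s / t_real_s

trf = transfer_efficiency(t_ideal_s=2.8, t_real_s=3.5)
print(f"Trf = {trf:.2f}")  # 0.80
```

Whatever imbalance or dependence stalls remain in the ideal-network run are intrinsic to the application and will not be fixed by a faster interconnect.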
The potential of hybrid/accelerator parallelization
- Hybrid parallelization
  – Speed up SELECTED regions by the CPUratio factor
- We do need to overcome the hybrid Amdahl's law
  – asynchrony + load balancing mechanisms !!!
[GADGET, 128 procs: % elapsed time vs. code regions covered — 93.67%, 97.49%, 99.11%]
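The hybrid Amdahl's law mentioned above can be sketched directly: if a fraction f of elapsed time lies in the selected regions and they are sped up by CPUratio r, overall speedup is bounded by 1 / ((1 - f) + f/r). The coverage fractions reuse the percentages on the slide; the CPUratio value is an assumption.

```python
# Amdahl bound when only the SELECTED regions are accelerated:
#   S = 1 / ((1 - f) + f / r)
# f = fraction of elapsed time covered, r = CPUratio speedup factor.
def hybrid_speedup(f_covered, cpu_ratio):
    """Upper bound on overall speedup under partial acceleration."""
    return 1.0 / ((1.0 - f_covered) + f_covered / cpu_ratio)

# Coverage values from the slide; CPUratio = 16 is an assumed factor.
for f in (0.9367, 0.9749, 0.9911):
    print(f"coverage {f:.2%}: speedup <= {hybrid_speedup(f, 16):.1f}x")
```

Even at 99.11% coverage the uncovered residue caps the speedup well below the CPUratio, which is why the slide insists on asynchrony and load balancing to shrink that residue.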
Conclusion
- BSC tools
  – Extremely powerful visualization and analysis capabilities
  – Performance analytics
    • Performance data is big data
    • Management – analytics
    • Capturing knowledge and methodologies in algorithmic workflows
- Useful insight for informed decisions on code refactoring

http://www.bsc.es/paraver
[email protected]

THANKS
Insight
- Observations / highly probable speculations / good questions
  – about fundamental behavior
  – suggesting possibilities for optimization
- Identification of specific poorly performing sequential code
- Bimodal behavior in alternating “iterations”?
- Bimodal behavior in space:
  – Day–night imbalance
  – Moving load imbalance
    • Separate cause and potential solution
- Repetitive fine-grain structure within a phase
  – 2/3 sub-iterations? Parallelizable? Potential source for overlap of communication/computation?
A call for performance analytics
- Data acquisition
  – A lot of data is captured
- Presentation
  – Profile: a few (or not so few) precomputed first-order statistics
    • Far too summarized
  – Trace visualization
    • No summarization at all

Need for intelligent data processing to derive actual insight
CESM CLM v18

CESM POP v18

NMMB

Measuring parallel efficiency