![Page 1: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/1.jpg)
www.bsc.es
Understanding applications using the
BSC performance tools
Judit Gimenez, Harald Servat
HPCKP’15 - February 3rd, 2015
![Page 2: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/2.jpg)
Humans are visual creatures
Films or books? PROCESS
– Two hours vs. days (months)
Memorizing a deck of playing cards STORE
– Each card translated to an image (person, action, location)
Our brain loves pattern recognition IDENTIFY
– What do you see on the pictures?
![Page 3: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/3.jpg)
Our Tools
Since 1991
Based on traces
Open Source – http://www.bsc.es/paraver
Core tools:
– Paraver (paramedir) – offline trace analysis
– Dimemas – message passing simulator
– Extrae – instrumentation
Focus:
– Detail, variability, flexibility
– Behavioral structure vs. syntactic structure
– Intelligence: Performance Analytics
![Page 4: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/4.jpg)
Paraver
![Page 5: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/5.jpg)
Paraver – Performance data browser
Timelines
Raw data
2/3D tables
(Statistics)
Goal = Flexibility
No semantics
Programmable
Comparative analyses
Multiple traces
Synchronize scales
+ trace manipulation
Trace visualization/analysis
![Page 6: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/6.jpg)
Timelines
Each window displays one view
– Piecewise constant function of time
Types of functions
– Categorical
• State, user function, outlined routine
– Logical
• In specific user function, In MPI call, In long MPI call
– Numerical
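• IPC, L2 miss ratio, duration of MPI call, duration of computation burst
Piecewise constant function: $s(t) = S_i,\ t \in [t_i, t_{i+1})$
Categorical: $S_i \in [0, n] \subset \mathbb{N}$
Logical: $S_i \in \{0, 1\}$
Numerical: $S_i \in \mathbb{R}$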
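A minimal sketch of how such a view could be evaluated, assuming a timeline is stored as sorted change times t_i with one value S_i each. The function name `make_timeline` and the sample IPC values are hypothetical and are not part of Paraver.

```python
from bisect import bisect_right

def make_timeline(change_times, values):
    """Return s(t) for a piecewise constant signal.

    change_times[i] is t_i (sorted ascending) and values[i] is S_i,
    so s(t) = S_i for t_i <= t < t_{i+1}. The last value holds forever.
    """
    if len(change_times) != len(values) or not change_times:
        raise ValueError("need one value per change time")

    def s(t):
        # Index of the last change time that is <= t.
        i = bisect_right(change_times, t) - 1
        if i < 0:
            raise ValueError("t precedes the first record")
        return values[i]

    return s

# Hypothetical numerical view: IPC over time.
ipc = make_timeline([0.0, 2.0, 5.0], [1.2, 0.4, 0.9])

# A logical view derived from it: "is IPC below 0.5?"
def low_ipc(t):
    return 1 if ipc(t) < 0.5 else 0
```

Deriving the logical view from the numerical one mirrors the idea that every window is just another function of time over the same records.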
![Page 7: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/7.jpg)
Tables: Profiles, histograms, correlations
From timelines to tables
MPI calls profile
Useful Duration
Histogram Useful Duration
MPI calls
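The step from timelines to tables can be sketched as follows, assuming a timeline is a list of (start, end, state) intervals. The layout, the function names and the sample values are illustrative assumptions, not Paraver's internal representation.

```python
from collections import defaultdict

def profile(intervals):
    """Total time spent in each categorical value (e.g. each MPI call)."""
    totals = defaultdict(float)
    for start, end, value in intervals:
        totals[value] += end - start
    return dict(totals)

def histogram(durations, bin_width):
    """Count how many durations fall in each bin [k*w, (k+1)*w)."""
    counts = defaultdict(int)
    for d in durations:
        counts[int(d // bin_width)] += 1
    return dict(counts)

# Hypothetical timeline of one process: (start, end, state)
timeline = [
    (0.0, 4.0, "useful"),
    (4.0, 5.0, "MPI_Send"),
    (5.0, 8.0, "useful"),
    (8.0, 10.0, "MPI_Allreduce"),
    (10.0, 14.5, "useful"),
]

# Profile of MPI calls: time per call type.
mpi_profile = profile([iv for iv in timeline if iv[2] != "useful"])

# Histogram of useful duration: one entry per computation burst.
useful = [end - start for start, end, state in timeline if state == "useful"]
useful_hist = histogram(useful, bin_width=2.0)
```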
![Page 8: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/8.jpg)
Analyzing variability through histograms and timelines
Useful Duration
Instructions
IPC
L2 miss ratio
![Page 9: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/9.jpg)
By the way: six months later ….
Useful Duration
Instructions
IPC
L2 miss ratio
Analyzing variability through histograms and timelines
![Page 10: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/10.jpg)
From tables to timelines
Where in the timeline do the values in certain table columns appear?
e.g., what is the time distribution of a given routine?
![Page 11: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/11.jpg)
Variability … is everywhere
CESM: 16 processes, 2 simulated days
Histogram useful computation duration
shows high variability
How is it distributed?
Dynamic imbalance
– In space and time
– Day and night.
– Season ?
![Page 12: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/12.jpg)
Data handling/summarization capability
– Filtering
• Subset of records in original trace
• By duration, type, value, …
• Filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data is kept)
– Cutting
• All records in a given time interval
• Only some processes
– Software counters
• Summarized values computed from those in the original trace, emitted as new event types
• #MPI calls, total hardware count, …
[Figure: WRF-NMM, Peninsula 4km, 128 procs, trace with MPI and HWC. Trace durations and sizes shown: 570 s / 2.2 GB, 570 s / 5 MB, 4.6 s / 36.5 MB]
Trace manipulation
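The three operations can be sketched on a toy in-memory trace. The record layout (time, process, event_type, value) and all function names are assumptions made for illustration; this is neither the Paraver trace format nor its filtering tool.

```python
from collections import Counter

# Hypothetical record layout: (time, process, event_type, value)

def filter_records(records, keep_types):
    """Filtering: keep only the records of the selected event types."""
    return [r for r in records if r[2] in keep_types]

def cut(records, t_begin, t_end, processes=None):
    """Cutting: keep records in [t_begin, t_end), optionally for some processes."""
    return [r for r in records
            if t_begin <= r[0] < t_end
            and (processes is None or r[1] in processes)]

def software_counters(records, event_type, period):
    """Software counters: number of events of one type per (process, period)."""
    counts = Counter()
    for time, process, etype, _value in records:
        if etype == event_type:
            counts[(process, int(time // period))] += 1
    return counts

trace = [
    (0.5, 0, "mpi", 1),
    (1.5, 0, "mpi", 2),
    (1.7, 1, "hwc", 100),
    (2.2, 1, "mpi", 1),
    (3.9, 0, "hwc", 250),
]

only_mpi = filter_records(trace, {"mpi"})
window = cut(trace, 1.0, 3.0)
mpi_counts = software_counters(trace, "mpi", period=2.0)
```

Each result is again a list of records or a summary of them, which is the property that lets a filtered or cut trace be analysed with the same views as the original.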
![Page 13: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/13.jpg)
Dimemas
![Page 14: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/14.jpg)
Dimemas: Coarse grain, Trace driven simulation
Simulation: Highly non linear model
– Linear components
• Point to point communication
• Sequential processor performance
– Global CPU speed
– Per block/subroutine
– Non linear components
• Synchronization semantics
– Blocking receives
– Rendezvous
– Eager limit
• Resource contention
– CPU
– Communication subsystem
» links (half/full duplex), busses
$T = \dfrac{MessageSize}{BW} + L$
[Figure: abstract target architecture: nodes with several CPUs and a local memory, connected through links (L) and buses (B)]
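The linear point-to-point component, T = MessageSize / BW + L, can be evaluated directly. The sketch below is only that formula; it ignores the non-linear parts (synchronization semantics and contention) that Dimemas simulates. The function name and the choice of units are assumptions.

```python
def transfer_time(message_size_bytes, bandwidth_mb_s, latency_s):
    """T = MessageSize / BW + L, with BW in MB/s (taken as 10**6 bytes/s)."""
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return message_size_bytes / (bandwidth_mb_s * 1e6) + latency_s

# A hypothetical 1 MB message under three network settings.
one_mb = 1_000_000
fast = transfer_time(one_mb, 100, 25e-6)            # L = 25 us, BW = 100 MB/s
high_latency = transfer_time(one_mb, 100, 1000e-6)  # L = 1000 us, BW = 100 MB/s
low_bandwidth = transfer_time(one_mb, 10, 25e-6)    # L = 25 us, BW = 10 MB/s
```

For a message this large, the tenfold bandwidth reduction costs far more than the fortyfold latency increase; for small messages the balance reverses.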
![Page 15: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/15.jpg)
Dimemas: Type of analysis
Parametric sweeps
– On abstract architectures
– On application computational regions
What if analysis
– Ideal machine (instantaneous network)
– Estimating impact of ports to MPI+OpenMP/CUDA/…
– Should I use asynchronous communications?
– Are all parts of an app. equally sensitive to network?
MPI sanity check
– Modeling nominal
Detailed feedback on simulation (trace)
[Figure: Impact of BW (L=8; B=0): efficiency (0.00 to 1.20) vs. bandwidth (1 to 1024) for NMM and ARW at 512, 256 and 128 processes; annotation: 4ms]
![Page 16: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/16.jpg)
[Figure: speedup (0 to 20) as a function of bandwidth (64 to 16384 MB/s) and CPU ratio (1 to 64)]
Paraver & Dimemas: Ecosystem and Integration
Paraver – Dimemas tandem
– Analysis and prediction
– What-if from selected time window
Analytics
– Clustering to model speeding or balancing regions
Scalability model
– Separate transfer and serialization efficiencies
Coupled simulation
– Multiscale simulation
– Combine with detailed network / CPU simulators
[Figure: Performance Factors for GROMACS v4.5: Parallel_Eff, LB, uLB and Transfer (0 to 1.2) vs. processors (0 to 300); annotation: 93.67%]
González J, Giménez J, Casas M, Moretó M, Ramirez A, Labarta J, Valero M.
Simulating Whole Supercomputer Applications. IEEE Micro. 2011 ;31:32-45.
![Page 17: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/17.jpg)
Network sensitivity
MPIRE 32 tasks, no network contention
PATC Tools Training. Barcelona, May 12–13th 2014
– L = 25 µs, BW = 100 MB/s
– L = 1000 µs, BW = 100 MB/s
– L = 25 µs, BW = 10 MB/s
All windows same scale
![Page 18: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/18.jpg)
Ideal machine
The impossible machine: BW = ∞, L = 0
Actually describes/characterizes intrinsic application behavior
– Load balance problems?
– Dependence problems?
[Figure: GADGET @ Nehalem cluster, 256 processes: timelines of the real run vs. the ideal network; labeled MPI phases: waitall, sendrecv, alltoall, allgather + sendrecv, allreduce]
Impact on practical machines?
![Page 19: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/19.jpg)
Impact of architectural parameters
Ideal: speeding up ALL the computation bursts by the CPUratio factor
– The more processes, the less speedup (higher impact of bandwidth limitations)!!
[Figure: GADGET, three panels (64, 128 and 256 procs): speedup (0 to 140) as a function of bandwidth (64 to 16384 MB/s) and CPU ratio (1 to 64)]
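The shape of such a sweep can be illustrated with a toy model in which only the computation shrinks with the CPU ratio while the communication time depends on bandwidth. This is a deliberately simplified assumption, not the Dimemas simulation, and the workload numbers are hypothetical.

```python
def run_time(compute_s, comm_bytes, cpu_ratio, bandwidth_mb_s, latency_s=0.0):
    """Toy model: computation scaled by cpu_ratio plus one linear transfer."""
    return compute_s / cpu_ratio + comm_bytes / (bandwidth_mb_s * 1e6) + latency_s

def speedup(compute_s, comm_bytes, cpu_ratio, bandwidth_mb_s, base_bandwidth_mb_s):
    """Speedup relative to cpu_ratio = 1 at the baseline bandwidth."""
    base = run_time(compute_s, comm_bytes, 1, base_bandwidth_mb_s)
    return base / run_time(compute_s, comm_bytes, cpu_ratio, bandwidth_mb_s)

# Hypothetical workload: 10 s of computation and 256 MB of traffic.
compute_s, comm_bytes = 10.0, 256e6

# 64x faster CPUs on the baseline 256 MB/s network: communication dominates.
cpu_only = speedup(compute_s, comm_bytes, 64, 256, 256)

# 64x faster CPUs and 64x more bandwidth: the full factor is recovered.
cpu_and_bw = speedup(compute_s, comm_bytes, 64, 16384, 256)
```

In this toy model a larger share of communication, as expected when the same problem is spread over more processes, lowers the speedup that faster CPUs alone can deliver.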
![Page 20: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/20.jpg)
Models
![Page 21: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/21.jpg)
Scaling model
$\eta = LB \cdot CommEff$
$LB = \dfrac{\sum_{i=1}^{P} eff_i}{P \cdot \max(eff_i)}$
$CommEff = \max(eff_i)$
$eff_i = \dfrac{T_i}{T}$
Directly from real execution metrics
[Figure labels: $T_i$, $T$, IPC, #instr]
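A minimal sketch of the model, assuming T_i is the useful computation time of process i, T is the elapsed time of the run, eff_i = T_i / T, LB is the average eff_i divided by the maximum, and CommEff is the maximum eff_i. The function name and the sample times are hypothetical.

```python
def scaling_model(useful_times, elapsed):
    """Compute LB, CommEff and eta = LB * CommEff.

    useful_times[i] is T_i (useful computation time of process i);
    elapsed is T (total elapsed time of the run).
    """
    if not useful_times or elapsed <= 0:
        raise ValueError("need at least one process and a positive elapsed time")
    P = len(useful_times)
    eff = [t / elapsed for t in useful_times]  # eff_i = T_i / T
    comm_eff = max(eff)                        # CommEff = max(eff_i)
    lb = sum(eff) / (P * comm_eff)             # LB = sum(eff_i) / (P * max(eff_i))
    return {"LB": lb, "CommEff": comm_eff, "eta": lb * comm_eff}

# Hypothetical run: 4 processes, 10 s elapsed.
factors = scaling_model([8.0, 6.0, 4.0, 6.0], elapsed=10.0)
```

With these numbers LB is 0.75 and CommEff is 0.8, so eta is 0.6, which equals the average fraction of time the processes spend in useful computation.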
![Page 22: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/22.jpg)
Communication efficiency = µLB * Transfer
Serializations / dependences (µLB)
Measured using Dimemas
– On an ideal network, Transfer (efficiency) = 1
[Figure: computation/communication timeline of chained MPI_Send / MPI_Recv pairs, annotated LB = 1]
![Page 23: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/23.jpg)
Scaling model
Dimemas simulation with ideal target
– Latency =0; BW = ∞
$CommEff = \mu LB \cdot Transfer$
$\mu LB = \dfrac{\max(T_i)}{T_{ideal}}$ (migrating/local load imbalance, serialization)
$Transfer = \dfrac{T_{ideal}}{T}$
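As a consistency check, taking µLB = max(T_i) / T_ideal and Transfer = T_ideal / T, the product of the two factors reduces to the largest per-process efficiency eff_i = T_i / T, which is CommEff. The numeric values in the second line are hypothetical.

```latex
\begin{align*}
\mu LB \cdot Transfer
  &= \frac{\max(T_i)}{T_{ideal}} \cdot \frac{T_{ideal}}{T}
   = \frac{\max(T_i)}{T} = \max(eff_i) = CommEff \\
\text{e.g. } T = 10,\ T_{ideal} = 9,\ \max(T_i) = 8:\quad
\mu LB &= \tfrac{8}{9} \approx 0.89,\quad
Transfer = \tfrac{9}{10} = 0.9,\quad
CommEff = 0.8
\end{align*}
```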
![Page 24: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/24.jpg)
Why scaling?
[Figure: speed up (0 to 20) vs. processes (0 to 400), CG-POP mpi2s1D - 180x120]
Good scalability!! Should we be happy?
$\eta = LB \cdot Ser \cdot Trf$
[Figure: Parallel eff, LB, uLB and transfer (0.5 to 1.1) vs. processes (0 to 400)]
Efficiency $= \eta \cdot \eta_{instr} \cdot \eta_{IPC}$
[Figure: Efficiency (0.4 to 1.6) vs. processes (0 to 400): Parallel eff, instr. eff, IPC eff]
M. Casas et al, “Automatic analysis of speedup of MPI applications”. ICS 2008.
![Page 25: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/25.jpg)
Clustering
![Page 26: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/26.jpg)
Using Clustering to identify structure
[Figure: scatter plot of computation bursts, IPC vs. Completed Instructions]
Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
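One way to group computation bursts in the (Completed Instructions, IPC) plane is a density-based clustering. The sketch below uses a small pure-Python DBSCAN chosen for illustration; the slide does not name the algorithm, and the burst values, `eps` and `min_pts` are hypothetical.

```python
import math

def normalise(points):
    """Min-max scale each dimension to [0, 1] so one eps fits both axes."""
    dims = list(zip(*points))
    lo = [min(d) for d in dims]
    span = [(max(d) - min(d)) or 1.0 for d in dims]
    return [tuple((v - l) / s for v, l, s in zip(p, lo, span)) for p in points]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical bursts: (completed instructions, IPC)
bursts = [
    (1.00e6, 0.50), (1.05e6, 0.52), (0.98e6, 0.49),
    (5.00e6, 1.50), (5.10e6, 1.48), (4.95e6, 1.52),
    (3.00e6, 0.10),
]
labels = dbscan(normalise(bursts), eps=0.1, min_pts=3)
```

Normalising first matters because instruction counts and IPC differ by orders of magnitude; without it the distance would be dominated by the instruction axis.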
![Page 27: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/27.jpg)
Performance @ serial computation bursts
WRF 128 cores
– Asynchronous SPMD
– Balanced #instr, variability in IPC
GROMACS
– MPMD structure
– Different coupled imbalance trends
SPECFEM3D
– SPMD
– Repeated substructure
– Coupled imbalance
![Page 28: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/28.jpg)
Full per region HWC characterization from a single run
Projecting hardware counters based on clustering
Miss ratios Instruction mix Stalls
González J. et al , “Performance Data Extrapolation in Parallel Codes”. ICPADS '10
![Page 29: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/29.jpg)
Integrating models and analytics
PEPC
What if …
– … we increase the IPC of Cluster1?
– … we balance Clusters 1 & 2?
[Figure annotations: 19% gain, 13% gain]
![Page 30: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/30.jpg)
Tracking: scalability through clustering
OpenMX (strong scaling from 64 to 512 tasks)
[Figure: two rows of cluster plots at 64, 128, 192, 256, 384 and 512 tasks]
![Page 31: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/31.jpg)
Folding
![Page 32: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/32.jpg)
Folding: Detailed metrics evolution
• Performance of a sequential region = 2000 MIPS
Is it good enough?
Is it easy to improve?
![Page 33: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/33.jpg)
Folding: Instantaneous CPI stack
MRGENESIS
• Trivial fix (loop interchange)
• Easy to locate?
• Next step?
• Availability of CPI stack models for production processors?
• Provided by manufacturers?
![Page 34: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/34.jpg)
“Blind” optimization
From folded samples of a few levels to timeline structure of “relevant” routines
Recommendation without access to source code
Harald Servat et al. “Bio-inspired call-stack reconstruction for performance analysis” Submitted, 2015
![Page 35: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/35.jpg)
[Figure: three computation bursts between MPI calls]
– 17.20 M instructions, ~1000 MIPS
– 24.92 M instructions, ~1100 MIPS
– 32.53 M instructions, ~1200 MIPS
Unbalanced MPI application
– Same code
– Different duration
– Different performance
CG-POP multicore MN3 study
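The three figures on this slide can be turned into burst durations with duration = instructions / rate. The sketch below treats the approximate MIPS values as exact, which is an assumption; the function name is hypothetical.

```python
def burst_duration_ms(instructions_millions, mips):
    """Duration in milliseconds: (M instructions) / (M instructions per second)."""
    return instructions_millions / mips * 1000.0

# (M instructions, approximate MIPS) as listed on the slide.
bursts = [(17.20, 1000), (24.92, 1100), (32.53, 1200)]
durations = [burst_duration_ms(instr, mips) for instr, mips in bursts]
```

This gives roughly 17.2 ms, 22.7 ms and 27.1 ms: the burst with the most instructions also runs at the highest MIPS, yet it still takes the longest, so the higher rate only partly offsets the extra instructions.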
![Page 36: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/36.jpg)
Methodology
![Page 37: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/37.jpg)
Performance analysis tools objective
Help validate hypotheses
Help generate hypotheses
Qualitatively
Quantitatively
![Page 38: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/38.jpg)
First steps
Parallel efficiency – percentage of time invested in computation
– Identify sources of “inefficiency”:
• Load balance
• Communication / synchronization
Serial efficiency – how far from peak performance?
– IPC, correlate with other counters
Scalability – code replication?
– Total #instructions
Behavioral structure? Variability?
Paraver Tutorial:
Introduction to Paraver and Dimemas methodology
![Page 39: Understanding applications using the BSC performance tools](https://reader031.vdocuments.us/reader031/viewer/2022020621/61e78c648ff10f67d33e07da/html5/thumbnails/39.jpg)
BSC Tools web site
www.bsc.es/paraver
• downloads
– Sources / Binaries
– Linux / Windows / Mac
• documentation
– Training guides
– Tutorial slides
Getting started
– Start wxparaver
– Help → Tutorials and follow instructions
– Follow training guides
• Paraver introduction (MPI): Navigation and basic understanding of Paraver operation