TRANSCRIPT
Performance Analysis with CEPBA-Tools
Judit Gimenez – [email protected]
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Traces!
• Instrumentation of the application produces the traces
• .prv / .pcf / .row → Paraver
• .trf → Dimemas
Since 1991
• Variance is important
• Along time, across processors
• Information is in the details
• Highly non-linear systems
  • Microscopic effects may have large macroscopic impact
• PPC / x86 clusters
• Power 4/5/6 AIX
• BG/L / BGP
• SGI Altix (supported by NASA AMES)
• Cell BE – by Rainer Keller (HLRS)
• PERUSE (UTK)
• MPICH collective internals
Dimemas
• Some translators: TAU, KOJAK, OTF, HPC Toolkit...
• Very easy to add new programming models
• Raw data, tunable
• Seeing is believing
• Measuring is better
Timelines
• Identifier of function
• Hardware counts
• Miss ratios
• Performance (IPC, Mflops, …)
• Routine duration
• …
Statistics / Profiles
• Average miss ratio per routine
• Histogram of routine duration
• Number of messages
• …
• From raw events: piece-wise constant functions of time → plots / colors (color encoding)
• Basic metrics
  • Instructions
  • MPI calls
  • Useful duration
• Derived metrics
  IPC = instr_useful / cycles_useful
  MPI call cost = MPI call duration / bytes
• Models (sketched below)
  L2 = cycles − instr / ideal_IPC
  preempted time = elapsed − cycles / clock
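A minimal sketch of how these derived metrics and models can be computed from raw counter values. All function and variable names here are illustrative, not actual Paraver semantics:

    # Sketch: deriving Paraver-style metrics from raw counters.
    def ipc(useful_instr, useful_cycles):
        # IPC = useful instructions / useful cycles
        return useful_instr / useful_cycles

    def mpi_call_cost(call_duration, nbytes):
        # cost of moving one byte through this MPI call
        return call_duration / nbytes

    def preempted_time(elapsed, cycles, clock_hz):
        # elapsed time not backed by consumed cycles
        return elapsed - cycles / clock_hz

    def l2_model(cycles, instr, ideal_ipc):
        # cycles not explained by ideal-IPC execution, attributed to L2 misses
        return cycles - instr / ideal_ipc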
• 2D Tables: accumulate over the different values of a control window
  • Profile: categorical timeline (user function, MPI call, …)
  • Histogram: numeric timeline (duration, instructions, IPC, …)
  • Different data window – correlations
    • Average IPC @ each user function
    • Number of instructions for each range of durations
  • Communication patterns
• 3D extensions – 2 control windows
  • Duration – IPC
  • MPI calls profile while computing
  • Communication pattern
(6 tasks trace; see the sketch below)
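A sketch of the 2D-table idea, assuming a flat list of timeline records (process, user function, duration, IPC) that is not the actual Paraver data layout:

    # Profile (categorical control window) and histogram (numeric control window).
    from collections import defaultdict

    records = [
        (0, "solver", 120.0, 1.1), (0, "exchange", 30.0, 0.4),
        (1, "solver", 150.0, 0.9), (1, "exchange", 25.0, 0.5),
    ]

    ipc_sum = defaultdict(float)
    count = defaultdict(int)
    hist = defaultdict(int)
    for proc, func, duration, ipc in records:
        ipc_sum[func] += ipc                 # data window: IPC
        count[func] += 1
        hist[int(duration // 50) * 50] += 1  # 50-unit-wide duration bins

    avg_ipc = {f: ipc_sum[f] / count[f] for f in ipc_sum}  # average IPC @ function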
• Simulation: highly non-linear model
• Linear components
  • Point to point communication
  • Sequential processor performance
    • Global CPU speed
    • Per block/subroutine
• Non-linear components
  • Synchronization semantics
    • Blocking receives
    • Rendezvous
  • Resource contention
    • CPU
    • Communication subsystem: links (half/full duplex), busses
[Architecture diagram: nodes of CPUs with local memory, interconnected through links (L) and buses (B)]
• GRID extensions
  • Dedicated connections
  • External network (traffic)
• Point to point model
  • MPI_send → logical transfer → MPI_recv
  • Latency phase: uses CPU, independent of size
  • Physical transfer: Size / BW, process blocked
  • Simulated contention for machine resources (links & buses)

    Time = Latency + Size / Bandwidth

• Collectives model
  • Barrier / Fan-in / Fan-out (internal and external phases between Machine 1 and Machine 2)
  • Cost: generic / per call
  • Processor time / block time / comm. time

    Time = (Latency + Size / MODEL_Bandwidth) × model factor (generic / per call)
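A sketch of these transfer-time models. The function and parameter names are mine, not Dimemas configuration syntax, and the real simulator resolves link/bus contention dynamically rather than taking a fixed delay:

    # Point-to-point: CPU-bound logical phase (latency, size-independent)
    # followed by the physical transfer (Size/BW, process blocked).
    def p2p_time(size, latency, bandwidth, contention_delay=0.0):
        return latency + contention_delay + size / bandwidth

    # Collective: fan-in/fan-out phases scaled by a per-collective factor.
    def collective_time(size, latency, bandwidth, model_factor=1.0):
        return model_factor * (latency + size / bandwidth)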
• Simulations with 4 processes per node
• NMM Iberia 4 km
  • Not sensitive to latency
  • 512 sensitive to contention?
  • 256 MB/s OK
• ARW Iberia 4 km
  • Not sensitive to latency
  • Sensitive to contention
  • Need 1 GB/s
[Plot: Contention Impact (L=8; BW=256) – speedup vs. full connectivity]
[Plot: Impact of latency (BW=256; B=0) – speedup vs. nominal latency, latencies 0–32]
[Plot: Impact of BW (L=8; B=0) – efficiency of NMM and ARW at 128, 256 and 512 tasks, BW 1–1024]
• MPIRE, 32 tasks, no network contention
  • L=25 µs, BW=100 MB/s
  • L=25 µs, BW=10 MB/s
  • L=1000 µs, BW=100 MB/s
  (all windows same scale)
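As a worked example of the Latency + Size/Bandwidth model under these three settings (a 1 MB message is an assumed size for illustration): with L=25 µs and BW=100 MB/s, T ≈ 25 µs + 10 ms ≈ 10 ms; with L=25 µs and BW=10 MB/s, T ≈ 25 µs + 100 ms ≈ 100 ms; with L=1000 µs and BW=100 MB/s, T = 1 ms + 10 ms = 11 ms. For messages this large the bandwidth term dominates, which is why sweeping only the latency barely changes the picture.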
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Time analysis
• Periodic structure: reference focus for detailed analysis; automatically generate a representative subtrace of few iterations for data reduction
• Data handling/summarization capability
  • Software counters, filtering and cutting
• Signals:
  • To “clean” data: flushing processes, % preempted time, #msgs/BW (clogged system)
  • To identify structure: #tasks computing, sum burst duration, #processes in MPI, average IPC, …
• Automatizable through signal processing techniques (sketched below):
  • Mathematical morphology to clean up perturbed regions
  • Wavelet transform to identify coarse regions
  • Spectral analysis for detailed periodic pattern
• Example: WRF-NMM Peninsula 4 km, MPI + HWC, 128 procs: whole run 570 s / 2.2 GB → summarized 570 s / 5 MB → detailed chop 4.6 s / 36.5 MB
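A sketch of the spectral step only, assuming the "useful duration" signal has already been extracted and resampled at a fixed step (the production pipeline combines this with morphology and wavelets):

    import numpy as np

    def dominant_period(signal, dt):
        # strongest non-DC harmonic of the signal = candidate iteration period
        sig = np.asarray(signal) - np.mean(signal)
        spectrum = np.abs(np.fft.rfft(sig))
        freqs = np.fft.rfftfreq(len(sig), d=dt)
        k = spectrum[1:].argmax() + 1   # skip the DC component
        return 1.0 / freqs[k]

    # e.g. a noisy 10 s iteration pattern sampled every 0.1 s over a 570 s run
    t = np.arange(0.0, 570.0, 0.1)
    sig = np.sin(2 * np.pi * t / 10.0) + 0.1 * np.random.randn(t.size)
    print(dominant_period(sig, 0.1))    # close to 10 s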
Clustering
• Useful for identifying and highlighting structure
• Cluster information injected in trace
  • Phases within routines
  • Different routines may have similar behavior
• Compact trace encoding
• Input to time analysis
Examples: CPMD, NAS BT, WRF
[Scatter plot: DBSCAN (Eps=0.01, MinPoints=20) clustering of trace WRF-128-PI.chop2.trf, instructions-completed axis 0–1]
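A sketch of the clustering step with scikit-learn's DBSCAN, reusing the parameters quoted in the figure; the burst feature extraction and normalization are assumptions here:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # One row per computation burst: (normalized instructions completed, IPC).
    # Random stand-in data; real bursts form dense, well-separated clusters.
    bursts = np.random.rand(5000, 2)
    labels = DBSCAN(eps=0.01, min_samples=20).fit_predict(bursts)

    # "Cluster information injected in trace": pair each burst with its
    # cluster id so it can be emitted back as a synthetic event.
    tagged = list(zip(range(len(labels)), labels))   # (burst_id, cluster_id)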
Focusing analysis
• Projection of HWC within a cluster
• Apply analysis at cluster level
  • Statistics
  • CPI stack model
  • My favorite metrics

On-line analysis
• Based on MRNet
• 1st experiment: collective duration threshold (sketched below)
  • 245 MB, >15 500 collectives → reduced to <25 MB / <1 MB
• Current development
  • Periodic frequency analysis
  • Periodic clustering: clusters distribution, analysis at minute 10 / minute 20 / minute 30
  • Collective internals
  • CPI stack model (generic)
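A sketch of the duration-threshold idea from that first experiment; the event layout is invented for illustration, and the real implementation runs on-line over MRNet:

    def slow_collectives(event_stream, threshold):
        # forward only collectives whose duration exceeds the threshold,
        # discarding the bulk of short collective records
        for e in event_stream:
            if e["is_collective"] and e["duration"] >= threshold:
                yield e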
Sampling
• Useful
  • To control granularity (not driven by MPI calls, selected user functions)
  • To identify timeline profile, application structure
• Adding sampling information on the tracefile
  • Based on the overflow mechanism offered by PAPI
  • Period = f (cycles / instructions / cache misses...)
  • Sampled information:
    • Call stack – as reference to source
    • Other hardware counters (not sampled)
  • Interval between samples:
    • High frequency sampling (> Nyquist)
    • Low frequency sampling
How to increase precision?
• Folding based on known periodic structure of application (tagged iterations); applies to stationary applications
• Result: trace for one iteration with synthetic Paraver events
• Refer counts/timestamps to start of iteration
• Call stack
  • Search for consecutive sequences of folded samples within same function and generate synthetic events
• Hardware counters
  • Noise reduction
  • Fit folded samples
  • Sample fitting curve to generate synthetic events
• Useful to identify internal behavior
  • Density of L1 misses within MPI in an SMP: when data actually arrives
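A sketch of the folding step, assuming iteration start times are already tagged in the trace and every sample falls after the first start:

    import numpy as np

    def fold(sample_times, sample_values, iter_starts):
        # map each sample to its offset within the enclosing iteration
        t = np.asarray(sample_times)
        starts = np.asarray(iter_starts)
        idx = np.searchsorted(starts, t, side="right") - 1
        offsets = t - starts[idx]
        order = np.argsort(offsets)
        return offsets[order], np.asarray(sample_values)[order]

    def fit_curve(offsets, values, degree=5):
        # noise reduction: fit the folded samples, then sample the fitted
        # curve to generate the synthetic events
        return np.polynomial.Polynomial.fit(offsets, values, degree)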
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Parallel efficiency model

    eff_i = T_i / T
    LB = ( Σ_{i=1..P} eff_i ) / ( P · max_i eff_i )
    CommEff = max_i eff_i

Directly from real execution measurements:

    Speedup = (P / P0) · (LB / LB0) · (CommEff / CommEff0) · (IPC / IPC0) · (instr0 / instr)

Splitting load balance into macroLB (migrating/local load imbalance) and microLB (serialization) requires a Dimemas simulation on an ideal network:

    CommEff = T_ideal / T
    microLB = max_i(T_i) / T_ideal

    Speedup = (P / P0) · (macroLB / macroLB0) · (microLB / microLB0) · (CommEff / CommEff0) · (IPC / IPC0) · (instr0 / instr)

[Factor breakdown examples – communication, micro load balance, macro load balance, IPC, computation: poor communication efficiency, replication of computation, improved macro/micro load balance]
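A sketch of the measured part of the model, computed from per-process useful computation times T_i and total elapsed time T (inputs are illustrative):

    import numpy as np

    def factors(T_i, T):
        # eff_i = T_i / T; LB and CommEff as defined above
        eff = np.asarray(T_i) / T
        LB = eff.mean() / eff.max()
        CommEff = eff.max()
        return LB, CommEff, LB * CommEff   # parallel efficiency = mean(eff_i)

    def speedup(P, P0, LB, LB0, CommEff, CommEff0, IPC, IPC0, instr, instr0):
        # multiplicative decomposition relative to the baseline run at P0
        return ((P / P0) * (LB / LB0) * (CommEff / CommEff0)
                * (IPC / IPC0) * (instr0 / instr))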
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
[Tools integration diagram: MPITrace and translators (otf2prv for OTF/Vampir, plus KOJAK and TAU converters) produce .prv/.pcf/.row traces for PARAVER, driven by .cfg configurations; filters, clustering and bursts selection generate reduced traces; .trf traces feed DIMEMAS and VENUS (IBM-ZRL); MPSim, Valgrind, Stats, HPCT and Pattern Analysis plug into the same flow]

[MPSim CPI-stack chart, BT:C.64, task 20, cluster 1: cycle breakdown (0–3) for real vs. ideal lFU / nFU / ROB / L1 instruction and data caches]

• Profile presentation tools
  • Reduction/aggregation of the performance dimensions
  • Time dimension disappeared / space dimension sometimes
• Traces
  • “All” data is there
  • Profiles / stats / variance over time and space
• Translator of Vampir and TAU traces
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Test case: NUCLEOSOME
[Plots: ns/day vs. tasks (2–1024) for GMX4_mn_work_nucleo_nsday, GMX4_mn_stop_nucleo_nsday, GMX4_cluster_work_nsday and NAMD 2.7; raw processor performance, Cluster / MN]
• Raw processor performance vs. communication efficiency?
• Trace reduction, 64 tasks: whole run 14 GB → filter only “large enough” computations (25…) → chop 1 iteration: 13 MB / 5 MB (as sketched below)
[Timelines: 64 tasks / 256 tasks]
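A sketch of that reduction chain on a generic event list; the record layout, threshold and iteration window are assumptions (the actual mechanisms are the filtering and cutting capabilities mentioned earlier):

    def reduce_trace(events, min_burst, iter_start, iter_end):
        # keep only "large enough" computation bursts, then chop one iteration
        large = [e for e in events if e["duration"] >= min_burst]
        return [e for e in large if iter_start <= e["time"] < iter_end]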
Main scalability problems
• Load balance
• Computation balance: 4% of code replication
• Communication problem
• 64 tasks already poor efficiency
[Tables of parallel eff., comm. eff., load balance, micro load balance, comp. balance, IPC balance and IPC for 64 vs. 256 tasks]
Instructions histogram: 64 tasks / 256 tasks
• 64 tasks: globally balanced but some local imbalance
• Instructions imbalance
• IPC imbalance
MPI calls: 64 tasks / 256 tasks
Duration of the computation: 64 tasks / 256 tasks
• Load balance!!!
• 64 procs
  – Trapezoidal
  – IPC ∝ Instr
• 256 procs
  – 2× instr
  – 20% IPC
  – IPC ∝ 1/Instr
Long waits for FFTs

Tasks  Phase      Parallel eff.  Load balance  Comp. balance
64     all        0.55           0.93          0.81
64     FFTs       0.55           0.91          0.95
64     particles  0.54           0.94          0.87
256    all        0.31           0.61          0.58
256    FFTs       0.35           0.55          0.69
256    particles  0.29           0.74          0.72
Predictions: 64 tasks / 256 tasks
• Real
• Prediction nominal
• Prediction ideal
  – Not much gain
  – Significant serialization
• Dependences?
Particles
• The imbalanced computation region where OpenMP would help for load balance is between sendrecv calls in line 1533 in domdec.c
• MPI_Sendrecv: [email protected]
• MPI_Waitall: 126@domdec_network.c
• MPI_Allreduce: 286@pme_pp.c
• MPI_Recv: 427@pme_pp.c
(Color represents MPI caller line)
FFTs
• Imbalanced computation between Alltoallvs in line [email protected]
  – Strange?
  – Better selection of # fft procs?
  – OpenMP?
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: 344@gmx_parallel_3dfft.c
• MPI_Alltoallv: [email protected]
• MPI_Recv: 204@pme_pp.c
(Color represents MPI call / MPI caller line)
• Main scalability problem is load unbalance
  • Mixing MPI+OpenMP with dynamic run time load balance
• Mixing of Particles and FFTs tasks makes things difficult
  • Two different granularities
• Even if it was initially considered to have a good performance, the 64 tasks run has poor communication efficiency
• Performance analysis: fun and colorful
• Simple tools to analyze complex problems
• Scalability requires tools to drive/support the analyst on where to look
• Importance of the details to understand
• Benefits from the tools interoperability
• Tools freely available ([email protected]); best effort support and training