TRANSCRIPT
Performance Analysis with CEPBA-Tools
Judit Gimenez – [email protected]
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Traces!
• Instrumentation of the application produces the traces
• .prv / .pcf / .row → Paraver
• .trf → Dimemas
Since 1991
• Variance is important
• Along time, across processors
• Information is in the details
• Highly non-linear systems
  • Microscopic effects may have large macroscopic impact
• PPC / x86 clusters
• Power 4/5/6 AIX
• BG/L / BGP
• SGI Altix (supported by NASA AMES)
• Cell BE – by Rainer Keller (HLRS)
• PERUSE (UTK)
• MPICH collective internals
Dimemas
• Some translators: TAU, KOJAK, OTF, HPC Toolkit...
• Very easy to add new programming models
• Raw data, tunable
• Seeing is believing
• Measuring is better
Timelines
• Identifier of function
• Hardware counts
• Miss ratios
• Performance (IPC, Mflops, …)
• Routine duration
• …
Statistics / Profiles
• Average miss ratio per routine
• Histogram of routine duration
• Number of messages
• …
• From raw events: piece-wise constant functions of time → plots / colors (color encoding)
• Basic metrics
  • Instructions
  • MPI calls
  • Useful duration
• Derived metrics
  IPC = instr_useful / cycles_useful
  MPI call cost = MPI call duration / bytes
• Models (sketched below)
  L2 = cycles − instr / ideal_IPC
  preempted time = elapsed − cycles / clock
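A minimal sketch of how these derived metrics and models can be computed from raw counter values. All function and variable names here are illustrative, not actual Paraver semantics:

    # Sketch: deriving Paraver-style metrics from raw counters.
    def ipc(useful_instr, useful_cycles):
        # IPC = useful instructions / useful cycles
        return useful_instr / useful_cycles

    def mpi_call_cost(call_duration, nbytes):
        # cost of moving one byte through this MPI call
        return call_duration / nbytes

    def preempted_time(elapsed, cycles, clock_hz):
        # elapsed time not backed by consumed cycles
        return elapsed - cycles / clock_hz

    def l2_model(cycles, instr, ideal_ipc):
        # cycles not explained by ideal-IPC execution, attributed to L2 misses
        return cycles - instr / ideal_ipc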
• 2D Tables: accumulate over the different values of a control window
  • Profile: categorical timeline (user function, MPI call, …)
  • Histogram: numeric timeline (duration, instructions, IPC, …)
  • Different data window – correlations
    • Average IPC @ each user function
    • Number of instructions for each range of durations
  • Communication patterns
• 3D extensions – 2 control windows
  • Duration – IPC
  • MPI calls profile while computing
  • Communication pattern
(6 tasks trace; see the sketch below)
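A sketch of the 2D-table idea, assuming a flat list of timeline records (process, user function, duration, IPC) that is not the actual Paraver data layout:

    # Profile (categorical control window) and histogram (numeric control window).
    from collections import defaultdict

    records = [
        (0, "solver", 120.0, 1.1), (0, "exchange", 30.0, 0.4),
        (1, "solver", 150.0, 0.9), (1, "exchange", 25.0, 0.5),
    ]

    ipc_sum = defaultdict(float)
    count = defaultdict(int)
    hist = defaultdict(int)
    for proc, func, duration, ipc in records:
        ipc_sum[func] += ipc                 # data window: IPC
        count[func] += 1
        hist[int(duration // 50) * 50] += 1  # 50-unit-wide duration bins

    avg_ipc = {f: ipc_sum[f] / count[f] for f in ipc_sum}  # average IPC @ function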
• Simulation: highly non-linear model
• Linear components
  • Point to point communication
  • Sequential processor performance
    • Global CPU speed
    • Per block/subroutine
• Non-linear components
  • Synchronization semantics
    • Blocking receives
    • Rendezvous
  • Resource contention
    • CPU
    • Communication subsystem: links (half/full duplex), busses
[Architecture diagram: nodes of CPUs with local memory, interconnected through links (L) and buses (B)]
• GRID extensions
  • Dedicated connections
  • External network (traffic)
• Point to point model
  • MPI_send → logical transfer → MPI_recv
  • Latency phase: uses CPU, independent of size
  • Physical transfer: Size / BW, process blocked
  • Simulated contention for machine resources (links & buses)

    Time = Latency + Size / Bandwidth

• Collectives model
  • Barrier / Fan-in / Fan-out (internal and external phases between Machine 1 and Machine 2)
  • Cost: generic / per call
  • Processor time / block time / comm. time

    Time = (Latency + Size / MODEL_Bandwidth) × model factor (generic / per call)
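A sketch of these transfer-time models. The function and parameter names are mine, not Dimemas configuration syntax, and the real simulator resolves link/bus contention dynamically rather than taking a fixed delay:

    # Point-to-point: CPU-bound logical phase (latency, size-independent)
    # followed by the physical transfer (Size/BW, process blocked).
    def p2p_time(size, latency, bandwidth, contention_delay=0.0):
        return latency + contention_delay + size / bandwidth

    # Collective: fan-in/fan-out phases scaled by a per-collective factor.
    def collective_time(size, latency, bandwidth, model_factor=1.0):
        return model_factor * (latency + size / bandwidth)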
• Simulations with 4 processes per node
• NMM Iberia 4 km
  • Not sensitive to latency
  • 512 sensitive to contention?
  • 256 MB/s OK
• ARW Iberia 4 km
  • Not sensitive to latency
  • Sensitive to contention
  • Need 1 GB/s
[Plot: Contention Impact (L=8; BW=256) – speedup vs. full connectivity]
[Plot: Impact of latency (BW=256; B=0) – speedup vs. nominal latency, latencies 0–32]
[Plot: Impact of BW (L=8; B=0) – efficiency of NMM and ARW at 128, 256 and 512 tasks, BW 1–1024]
• MPIRE, 32 tasks, no network contention
  • L=25 µs, BW=100 MB/s
  • L=25 µs, BW=10 MB/s
  • L=1000 µs, BW=100 MB/s
  (all windows same scale)
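As a worked example of the Latency + Size/Bandwidth model under these three settings (a 1 MB message is an assumed size for illustration): with L=25 µs and BW=100 MB/s, T ≈ 25 µs + 10 ms ≈ 10 ms; with L=25 µs and BW=10 MB/s, T ≈ 25 µs + 100 ms ≈ 100 ms; with L=1000 µs and BW=100 MB/s, T = 1 ms + 10 ms = 11 ms. For messages this large the bandwidth term dominates, which is why sweeping only the latency barely changes the picture.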
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Time analysis
• Periodic structure: reference focus for detailed analysis; automatically generate a representative subtrace of few iterations for data reduction
• Data handling/summarization capability
  • Software counters, filtering and cutting
• Signals:
  • To “clean” data: flushing processes, % preempted time, #msgs/BW (clogged system)
  • To identify structure: #tasks computing, sum burst duration, #processes in MPI, average IPC, …
• Automatizable through signal processing techniques (sketched below):
  • Mathematical morphology to clean up perturbed regions
  • Wavelet transform to identify coarse regions
  • Spectral analysis for detailed periodic pattern
• Example: WRF-NMM Peninsula 4 km, MPI + HWC, 128 procs: whole run 570 s / 2.2 GB → summarized 570 s / 5 MB → detailed chop 4.6 s / 36.5 MB
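A sketch of the spectral step only, assuming the "useful duration" signal has already been extracted and resampled at a fixed step (the production pipeline combines this with morphology and wavelets):

    import numpy as np

    def dominant_period(signal, dt):
        # strongest non-DC harmonic of the signal = candidate iteration period
        sig = np.asarray(signal) - np.mean(signal)
        spectrum = np.abs(np.fft.rfft(sig))
        freqs = np.fft.rfftfreq(len(sig), d=dt)
        k = spectrum[1:].argmax() + 1   # skip the DC component
        return 1.0 / freqs[k]

    # e.g. a noisy 10 s iteration pattern sampled every 0.1 s over a 570 s run
    t = np.arange(0.0, 570.0, 0.1)
    sig = np.sin(2 * np.pi * t / 10.0) + 0.1 * np.random.randn(t.size)
    print(dominant_period(sig, 0.1))    # close to 10 s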
Clustering
• Useful for identifying and highlighting structure
• Cluster information injected in trace
  • Phases within routines
  • Different routines may have similar behavior
• Compact trace encoding
• Input to time analysis
Examples: CPMD, NAS BT, WRF
[Scatter plot: DBSCAN (Eps=0.01, MinPoints=20) clustering of trace WRF-128-PI.chop2.trf, instructions-completed axis 0–1]
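A sketch of the clustering step with scikit-learn's DBSCAN, reusing the parameters quoted in the figure; the burst feature extraction and normalization are assumptions here:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # One row per computation burst: (normalized instructions completed, IPC).
    # Random stand-in data; real bursts form dense, well-separated clusters.
    bursts = np.random.rand(5000, 2)
    labels = DBSCAN(eps=0.01, min_samples=20).fit_predict(bursts)

    # "Cluster information injected in trace": pair each burst with its
    # cluster id so it can be emitted back as a synthetic event.
    tagged = list(zip(range(len(labels)), labels))   # (burst_id, cluster_id)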
Focusing analysis
• Projection of HWC within a cluster
• Apply analysis at cluster level
  • Statistics
  • CPI stack model
  • My favorite metrics

On-line analysis
• Based on MRNet
• 1st experiment: collective duration threshold (sketched below)
  • 245 MB, >15 500 collectives → reduced to <25 MB / <1 MB
• Current development
  • Periodic frequency analysis
  • Periodic clustering: clusters distribution, analysis at minute 10 / minute 20 / minute 30
  • Collective internals
  • CPI stack model (generic)
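A sketch of the duration-threshold idea from that first experiment; the event layout is invented for illustration, and the real implementation runs on-line over MRNet:

    def slow_collectives(event_stream, threshold):
        # forward only collectives whose duration exceeds the threshold,
        # discarding the bulk of short collective records
        for e in event_stream:
            if e["is_collective"] and e["duration"] >= threshold:
                yield e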
Sampling
• Useful
  • To control granularity (not driven by MPI calls, selected user functions)
  • To identify timeline profile, application structure
• Adding sampling information on the tracefile
  • Based on the overflow mechanism offered by PAPI
  • Period = f (cycles / instructions / cache misses...)
  • Sampled information:
    • Call stack – as reference to source
    • Other hardware counters (not sampled)
  • Interval between samples:
    • High frequency sampling (> Nyquist)
    • Low frequency sampling
How to increase precision?
• Folding based on known periodic structure of application (tagged iterations); applies to stationary applications
• Result: trace for one iteration with synthetic Paraver events
• Refer counts/timestamps to start of iteration
• Call stack
  • Search for consecutive sequences of folded samples within same function and generate synthetic events
• Hardware counters
  • Noise reduction
  • Fit folded samples
  • Sample fitting curve to generate synthetic events
• Useful to identify internal behavior
  • Density of L1 misses within MPI in an SMP: when data actually arrives
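A sketch of the folding step, assuming iteration start times are already tagged in the trace and every sample falls after the first start:

    import numpy as np

    def fold(sample_times, sample_values, iter_starts):
        # map each sample to its offset within the enclosing iteration
        t = np.asarray(sample_times)
        starts = np.asarray(iter_starts)
        idx = np.searchsorted(starts, t, side="right") - 1
        offsets = t - starts[idx]
        order = np.argsort(offsets)
        return offsets[order], np.asarray(sample_values)[order]

    def fit_curve(offsets, values, degree=5):
        # noise reduction: fit the folded samples, then sample the fitted
        # curve to generate the synthetic events
        return np.polynomial.Polynomial.fit(offsets, values, degree)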
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Parallel efficiency model

    eff_i = T_i / T
    LB = ( Σ_{i=1..P} eff_i ) / ( P · max_i eff_i )
    CommEff = max_i eff_i

Directly from real execution measurements:

    Speedup = (P / P0) · (LB / LB0) · (CommEff / CommEff0) · (IPC / IPC0) · (instr0 / instr)

Splitting load balance into macroLB (migrating/local load imbalance) and microLB (serialization) requires a Dimemas simulation on an ideal network:

    CommEff = T_ideal / T
    microLB = max_i(T_i) / T_ideal

    Speedup = (P / P0) · (macroLB / macroLB0) · (microLB / microLB0) · (CommEff / CommEff0) · (IPC / IPC0) · (instr0 / instr)

[Factor breakdown examples – communication, micro load balance, macro load balance, IPC, computation: poor communication efficiency, replication of computation, improved macro/micro load balance]
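A sketch of the measured part of the model, computed from per-process useful computation times T_i and total elapsed time T (inputs are illustrative):

    import numpy as np

    def factors(T_i, T):
        # eff_i = T_i / T; LB and CommEff as defined above
        eff = np.asarray(T_i) / T
        LB = eff.mean() / eff.max()
        CommEff = eff.max()
        return LB, CommEff, LB * CommEff   # parallel efficiency = mean(eff_i)

    def speedup(P, P0, LB, LB0, CommEff, CommEff0, IPC, IPC0, instr, instr0):
        # multiplicative decomposition relative to the baseline run at P0
        return ((P / P0) * (LB / LB0) * (CommEff / CommEff0)
                * (IPC / IPC0) * (instr0 / instr))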
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
[Tools integration diagram: MPITrace and translators (otf2prv for OTF/Vampir, plus KOJAK and TAU converters) produce .prv/.pcf/.row traces for PARAVER, driven by .cfg configurations; filters, clustering and bursts selection generate reduced traces; .trf traces feed DIMEMAS and VENUS (IBM-ZRL); MPSim, Valgrind, Stats, HPCT and Pattern Analysis plug into the same flow]

[MPSim CPI-stack chart, BT:C.64, task 20, cluster 1: cycle breakdown (0–3) for real vs. ideal lFU / nFU / ROB / L1 instruction and data caches]

• Profile presentation tools
  • Reduction/aggregation of the performance dimensions
  • Time dimension disappeared / space dimension sometimes
• Traces
  • “All” data is there
  • Profiles / stats / variance over time and space
• Translator of Vampir and TAU traces
CEPBA-Tools environment
• Paraver
• Dimemas
Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling
Models
Tools integration
How to? GROMACS analysis
Test case: NUCLEOSOME
[Plots: ns/day vs. tasks (2–1024) for GMX4_mn_work_nucleo_nsday, GMX4_mn_stop_nucleo_nsday, GMX4_cluster_work_nsday and NAMD 2.7; raw processor performance, Cluster / MN]
• Raw processor performance vs. communication efficiency?
• Trace reduction, 64 tasks: whole run 14 GB → filter only “large enough” computations (25…) → chop 1 iteration: 13 MB / 5 MB (as sketched below)
[Timelines: 64 tasks / 256 tasks]
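A sketch of that reduction chain on a generic event list; the record layout, threshold and iteration window are assumptions (the actual mechanisms are the filtering and cutting capabilities mentioned earlier):

    def reduce_trace(events, min_burst, iter_start, iter_end):
        # keep only "large enough" computation bursts, then chop one iteration
        large = [e for e in events if e["duration"] >= min_burst]
        return [e for e in large if iter_start <= e["time"] < iter_end]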
Main scalability problems
• Load balance
• Computation balance: 4% of code replication
• Communication problem
• 64 tasks already poor efficiency
[Tables of parallel eff., comm. eff., load balance, micro load balance, comp. balance, IPC balance and IPC for 64 vs. 256 tasks]
Instructions histogram: 64 tasks / 256 tasks
• 64 tasks: globally balanced but some local imbalance
• Instructions imbalance
• IPC imbalance
MPI calls: 64 tasks / 256 tasks
Duration of the computation: 64 tasks / 256 tasks
• Load balance!!!
• 64 procs
  – Trapezoidal
  – IPC ∝ Instr
• 256 procs
  – 2× instr
  – 20% IPC
  – IPC ∝ 1/Instr
Long waits for FFTs

Tasks  Phase      Parallel eff.  Load balance  Comp. balance
64     all        0.55           0.93          0.81
64     FFTs       0.55           0.91          0.95
64     particles  0.54           0.94          0.87
256    all        0.31           0.61          0.58
256    FFTs       0.35           0.55          0.69
256    particles  0.29           0.74          0.72
Predictions: 64 tasks / 256 tasks
• Real
• Prediction nominal
• Prediction ideal
  – Not much gain
  – Significant serialization
• Dependences?
Particles
• The imbalanced computation region where OpenMP would help for load balance is between sendrecv calls in line 1533 in domdec.c
• MPI_Sendrecv: [email protected]
• MPI_Waitall: 126@domdec_network.c
• MPI_Allreduce: 286@pme_pp.c
• MPI_Recv: 427@pme_pp.c
(Color represents MPI caller line)
FFTs
• Imbalanced computation between Alltoallvs in line [email protected]
  – Strange?
  – Better selection of # fft procs?
  – OpenMP?
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: [email protected]
• MPI_Sendrecv: 344@gmx_parallel_3dfft.c
• MPI_Alltoallv: [email protected]
• MPI_Recv: 204@pme_pp.c
(Color represents MPI call / MPI caller line)
• Main scalability problem is load unbalance
  • Mixing MPI+OpenMP with dynamic run time load balance
• Mixing of Particles and FFTs tasks makes things difficult
  • Two different granularities
• Even if it was initially considered to have a good performance, the 64 tasks run has poor communication efficiency
• Performance analysis: fun and colorful
• Simple tools to analyze complex problems
• Scalability requires tools to drive/support the analyst on where to look
• Importance of the details to understand
• Benefits from the tools interoperability
• Tools freely available ([email protected]); best effort support and training