hardware performance monitoring hardware performance monitoring
TRANSCRIPT
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
1/44
Hardware Performance MonitoringHardware Performance Monitoring
with PAPIwith PAPI
CScADS Autotuning WorkshopJuly 2007
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
2/44
CScADS Autotuning
WhatWhat
s PAPI?s PAPI?
Middleware that provides a consistent programminginterface for the performance counter hardware foundin most major micro-processors.
Countable events are defined in two ways: Platform-neutral Preset Events Platform-dependent Native Events
Preset Events can be derived from multiple NativeEvents
All events are referenced by name and collected intoEventSets for sampling
Events can be multiplexed if counters are limited Statistical sampling is implemented by:
Software overflow with timer driven sampling Hardware overflow if supported by the platform
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
3/44
CScADS Autotuning
WhereWhere
s PAPIs PAPI
PAPI runs on most modern processors andOperating Systems of interest to HPC:
IBM POWER{3, 4, 5} / AIXPOWER{4, 5, 6} / LinuxPowerPC{-32, -64, 970} / LinuxBlue Gene / L
Intel Pentium II, III, 4, M, Core, etc. / LinuxIntel Itanium{1, 2, Montecito?}AMD Athlon, Opteron / Linux
Cray T3E, X1, XD3, XT{3, 4} CatamountAltix, Sparc, SiCortexand even Windows!but not Mac
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
4/44
CScADS Autotuning
History of PAPIHistory of PAPI
http://icl.cs.utk.edu/papi/
Started as a Parallel ToolsConsortium project in 1998
Goal:Produce a specification for a portable
interface to the hardware performance
counters available on most modernmicroprocessors.
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
5/44
CScADS Autotuning
Release TimelineRelease Timeline
SDCI HPC Improvement:
High-Productivity Performance Engineering(Tools, Methods, Training) for the NSF HPC Applications
Submitted January 22, 2007
Allen D. Malony, Sameer Shende, Shirley Moore, Nicholas Nystrom, Rick Kufrin
Figure 2: Timeline of releases for each tool represented in the project. The vertical dashed lines indicate SC conference
dates where the tools are regularly demonstrated. TAUs v1.0 release occurred at SC97.
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
6/44
CScADS Autotuning
TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
HPCToolkit (Rice Univ) http://hipersoft.cs.rice.edu/hpctoolkit/
KOJAK (UTK, FZ Juelich) http://icl.cs.utk.edu/kojak/
PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
Titanium (UC Berkeley)http://www.cs.berkeley.edu/Research/Projects/titanium/
SCALEA (Thomas Fahringer, U Innsbruck)http://www.par.univie.ac.at/project/scalea/
Open|Speedshop (SGI)http://oss.sgi.com/projects/openspeedshop/
SvPablo (UNC Renaissance Computing Institute)http://www.renci.unc.edu/Software/Pablo/pablo.htm
Some Tools that use PAPISome Tools that use PAPI
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
7/44
CScADS Autotuning
Current PAPI DesignCurrent PAPI Design
Two exposed interfaces to the underlying counter hardware:1. The low level API manages hardware events in userdefined groups called EventSets, and provides access toadvanced features.
2. The high level API provides the ability to start, stop andread the counters for a specified list of events.
PAPI Framework Layer
LowLevelAPI
HiLevelAPI
PAPI Component Layer
Perf Counter Hardware
Operating System
Kernel Patch
Portable
PlatformDependent
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
8/44
CScADS Autotuning
PAPI Hardware EventsPAPI Hardware Events
Preset Events Standard set of over 100 events for application
performance tuning
No standardization of the exact definition Mapped to either single or linear combinations of native
events on each platform Usepapi_availutility to see what preset events are
available on a given platform Native Events
Any event countable by the CPU
Same interface as for preset events
Usepapi_native_availutility to see all available nativeevents
Usepapi_event_chooserutility to select a
compatible set of events
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
9/44
CScADS Autotuning
Data and Instruction Range QualificationData and Instruction Range Qualification
Generalized PAPI interface for data structure andinstruction address range qualification
Applied to the specific instance of the Itanium2 Extended an existing PAPI call, PAPI_set_opt(),
to specify starting and ending addresses of datastructures or instructions to be instrumented
option.addr.eventset = EventSet;
option.addr.start = (caddr_t)array;
option.addr.end = (caddr_t)(array + size_array);
retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &option);
An instruction range can be set usingPAPI_INSTR_ADDRESS
papi_native_avail was modified to list eventsthat support data or instruction address range
qualification.
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
10/44
CScADS Autotuning
PAPI Preset EventsPAPI Preset Events
Of ~100 events, over half are cache related:
PAPI_L1_DCH: Level 1 data cache hits
PAPI_L1_DCA: Level 1 data cache accesses
PAPI_L1_DCR: Level 1 data cache readsPAPI_L1_DCW: Level 1 data cache writes
PAPI_L1_DCM: Level 1 data cache misses
PAPI_L1_ICH: Level 1 instruction cache hits
PAPI_L1_ICA: Level 1 instruction cache accesses
PAPI_L1_ICR: Level 1 instruction cache reads
PAPI_L1_ICW: Level 1 instruction cache writes
PAPI_L1_ICM: Level 1 instruction cache misses
PAPI_L1_TCH: Level 1 total cache hits
PAPI_L1_TCA: Level 1 total cache accesses
PAPI_L1_TCR: Level 1 total cache reads
PAPI_L1_TCW: Level 1 total cache writesPAPI_L1_TCM: Level 1 cache misses
PAPI_L1_LDM: Level 1 load misses
PAPI_L1_STM: Level 1 store misses Repeat forLevels 2 and 3
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
11/44
CScADS Autotuning
PAPI Preset Events (ii)PAPI Preset Events (ii)
Other cache and memory events:PAPI_CA_SNP: Requests for a snoop
PAPI_CA_SHR: Requests for exclusive access to shared cache linePAPI_CA_CLN: Requests for exclusive access to clean cache line
PAPI_CA_INV: Requests for cache line invalidation
PAPI_CA_ITV: Requests for cache line intervention
PAPI_TLB_DM: Data translation lookaside buffer misses
PAPI_TLB_IM: Instruction translation lookaside buffer misses
PAPI_TLB_TL: Total translation lookaside buffer misses
PAPI_TLB_SD: Translation lookaside buffer shootdowns
PAPI_LD_INS: Load instructions
PAPI_SR_INS: Store instructions
PAPI_MEM_SCY: Cycles Stalled Waiting for memory accessesPAPI_MEM_RCY: Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY: Cycles Stalled Waiting for memory writes
PAPI_RES_STL: Cycles stalled on any resource
PAPI_FP_STAL: Cycles the FP unit(s) are stalled
Sharedcache
TLB
ResourceStalls
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
12/44
CScADS Autotuning
PAPI Preset Events (iii)PAPI Preset Events (iii)
Program flow:PAPI_BR_INS: Branch instructions
PAPI_BR_UCN: Unconditional branch instructionsPAPI_BR_CN: Conditional branch instructions
PAPI_BR_TKN: Conditional branch instructions taken
PAPI_BR_NTK: Conditional branch instructions not taken
PAPI_BR_MSP: Conditional branch instructions mispredicted
PAPI_BR_PRC: Conditional branch instructions correctly predicted
PAPI_BTAC_M: Branch target address cache misses
PAPI_CSR_FAL: Failed store conditional instructions
PAPI_CSR_SUC: Successful store conditional instructions
PAPI_CSR_TOT: Total store conditional instructions
Branches
ConditionalStores
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
13/44
CScADS Autotuning
PAPI_TOT_CYC: Total cycles
PAPI_TOT_IIS: Instructions issued
PAPI_TOT_INS: Instructions completed
PAPI_INT_INS: Integer instructions completed
PAPI_LST_INS: Load/store instructions completed
PAPI_SYC_INS: Synchronization instructions completed
PAPI_BRU_IDL: Cycles branch units are idle
PAPI_FXU_IDL: Cycles integer units are idle
PAPI_FPU_IDL: Cycles floating point units are idle
PAPI_LSU_IDL: Cycles load/store units are idle
PAPI_STL_ICY: Cycles with no instruction issue
PAPI_FUL_ICY: Cycles with maximum instruction issuePAPI_STL_CCY: Cycles with no instructions completed
PAPI_FUL_CCY: Cycles with maximum instructions completed
PAPI_HW_INT: Hardware interrupts
PAPI Preset Events (iv)PAPI Preset Events (iv)
Timing, efficiency, pipeline:
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
14/44
CScADS Autotuning
PAPI Preset Events (v)PAPI Preset Events (v)
Floating point:
PAPI_FP_INS: Floating point instructions
PAPI_FP_OPS: Floating point operationsPAPI_FML_INS: Floating point multiply instructions
PAPI_FAD_INS: Floating point add instructions
PAPI_FDV_INS: Floating point divide instructions
PAPI_FSQ_INS: Floating point square root instructions
PAPI_FNV_INS: Floating point inverse instructions
PAPI_FMA_INS: FMA instructions completed
PAPI_VEC_INS: Vector/SIMD instructions
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
15/44
CScADS Autotuning
WhatWhats a Native Event?s a Native Event?
8 mask bits 8 bits: 256 events
16 mask bits6 bits: 64 events
PMC: Pentium 4
PMC: Intel Pentium II, III, M, Core; AMD Athlon, Opteron
PMD: AMD Athlon, Opteron
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
16/44
CScADS Autotuning
Intel Pentium Core: L2_STIntel Pentium Core: L2_ST
{ .pme_uname = "SELF",
.pme_udesc = "This core",
.pme_ucode = 0x40
},
{ .pme_uname = "BOTH_CORES",
.pme_udesc = "Both cores",
.pme_ucode = 0xc0
}
},
.pme_numasks = 7
},
{ .pme_name = "L2_ST",
.pme_code = 0x2a,
.pme_flags = PFMLIB_CORE_CSPEC,
.pme_desc = "L2 store requests",
.pme_umasks = {
{ .pme_uname = "MESI",.pme_udesc = "Any cacheline access",
.pme_ucode = 0xf
},
{ .pme_uname = "I_STATE",
.pme_udesc = "Invalid cacheline",
.pme_ucode = 0x1
},
{ .pme_uname = "S_STATE",
.pme_udesc = "Shared cacheline",
.pme_ucode = 0x2
},
{ .pme_uname = "E_STATE",
.pme_udesc = "Exclusive cacheline",
.pme_ucode = 0x4
},{ .pme_uname = "M_STATE",
.pme_udesc = "Modified cacheline",
.pme_ucode = 0x8
}
PRESET,PAPI_L2_DCA,
DERIVED_ADD,L2_LD:SELF:ANY:MESI,
L2_ST:SELF:MESI
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
17/44
CScADS Autotuning
How many counters does it takeHow many counters does it take
Penti
um
Penti
umII
Penti
umIII
Penti
umM
Itaniu
m
AMD
Athlo
n
AMD
Opter
on
S
iCort
exMIP
S
Intel
Core
POW
ER3
POW
ER4
Cell
P
OWER
5
Pe
ntium
4BG
/L
Cray
X1
S1
0
10
20
30
40
50
60
70
24
5(3 fixed)
8
8:164:32
12(6:2)
18
48+2+2
32+16+16
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
18/44
CScADS Autotuning
PAPI and BG/LPAPI and BG/L
Performance Counters: 48 UPC Counters
shared by both CPUs
External to CPU cores
32 bits :(
2 Counters on each FPU 1 counts load/stores
1 counts arithmetic operations
Accessed via blg_perfctr
15 Preset Events 10 PAPI presets
5 Custom BG/L presets
328 native events available
2 FPU PMCs
2 FPU PMCs
UPC Module48 SharedCounters
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
19/44
CScADS Autotuning
Cell Broadband EngineCell Broadband Engine
Each Cell contains: 1 PPE and 8 SPEs. and 1 PMU external to all of these.
8 16-bit counters configurable as 4 32-bit counters.
1024 slot 128-bit trace buffer
400 native events
Working with IBM
engineers on developing perfmon2
libpfm layer for Cell BE
Linux Cell BE
kernel modifications Porting PAPI-C
(LANL grant)
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
20/44
CScADS Autotuning
Top500 Operating SystemsTop500 Operating Systems
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
21/44
CScADS Autotuning
PerfctrPerfctr
Written by Mikael PettersonLabor of love
First available: Fall 1999
First PAPI use: Fall 2000
Supports:Intel Pentium II, III, 4, M, Core
AMD K7 (Athlon), K8 (Opteron)
IBM PowerPC 970, POWER4, POWER5
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
22/44
CScADS Autotuning
PerfctrPerfctr FeaturesFeatures
Patches the Linux kernel
Saves perf counters on context switchVirtualizes counters to 64-bitsMemory-maps counters for fastaccess
Supports counter overflow interruptswhere available
User Space LibraryPAPI uses about a dozen calls
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
23/44
CScADS Autotuning
PerfctrPerfctr TimelineTimeline
Steady development
1999 2004Concerted effort for kernel inclusionMay 2004 May 2005
Ported to Cray Catamount; Power Linux~ 2005
Maintenance only2005
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
24/44
CScADS Autotuning
PerfmonPerfmon
Written by Stephane Eranian @ HP
Originally Itanium onlyBuilt-in to the Linux-ia64 kernel since
2.4.0System call interface
libpfm helper library for bookkeeping
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
25/44
CScADS Autotuning
Perfmon2*Perfmon2*
Provides a generic interface to access PMU Not dedicated to one app, avoid fragmentation
Must be portable across all PMU models: Almost all PMU-specific knowledge in user level libraries
Supports per-thread monitoring Self-monitoring, unmodified binaries, attach/detach
multi-threaded and multi-process workloads Supports system-wide monitoring Supports counting and sampling No modification to applications or system Built-in, efficient, robust, secure, simple,
documented
* Slide contents courtesy Stephane Eranian* Slide contents courtesy Stephane Eranian
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
26/44
CScADS Autotuning
Perfmon2Perfmon2
Setup done through external support library Uses a system call for counting operations
More flexibility, ties with ctxsw, exit, fork
Kernel compile-time option on Linux
Perfmon2 context encapsulates all PMU state Each context uniquely identified by file descriptor
int perfmonctl(int fd, int cmd, void *arg, int narg)
PFM_GET_CONFIG
PFM_GETINFO_PMCS
PFM_CREATE_EVTSETPFM_WRITE_PMDS
PFM_WRITE_PMCS
PFM_CREATE_CONTEXT
PFM_SET_CONFIG
PFM_GETINFO_PMDS
PFM_DELETE_EVTSETPFM_UNLOAD_CONTEXT
PFM_LOAD_CONTEXT
PFM_READ_PMDS
PFM_GETINFO_EVTSETPFM_RESTART
PFM_STOP
PFM_START
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
27/44
CScADS Autotuning
Perfmon2 FeaturesPerfmon2 Features
Support today for:
Intel Itanium, P6, M, Core, Pentium4, AMDOpteron, IBM Power, MIPS, SiCortex
Full native event tables for supportedprocessors
Kernel based MultiplexingEvent set chaining
Kernel based Sampling/OverflowTime or event basedCustom sampling buffers
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
28/44
CScADS Autotuning
Next StepsNext Steps
Kernel integration Final integration testing underway
Possible inclusion in 2.6.22 kernel Implemented by Cray in CNK, X2 Cell BE
Port with IBM engineers is underway
Leverage libpfm for PAPI native events Migration completed for P6, Core, P4, Opteron
PAPI testing on perfmon2 patched kernels
Opteron currently being tested Woodcrest/Clovertown testing planned
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
29/44
CScADS Autotuning
Component PAPI (PAPIComponent PAPI (PAPI--C)C)
Goals:Support simultaneous access to on- and off-
processor countersIsolate hardware dependent code in a separable
component module
Extend platform independent framework code tosupport multiple simultaneous components
Add or modify API calls to support access toany of several components
Modify build environment for easy selection andconfiguration of multiple available components
Will be released (RSN*) as PAPI 4.0
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
30/44
CScADS Autotuning
Current PAPI DesignCurrent PAPI Design
PAPI Framework Layer
LowLevelAPI
HiLevelAPI
PAPI Component Layer
Perf Counter Hardware
Operating System
Kernel Patch
Portable
PlatformDependent
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
31/44
CScADS Autotuning
Component PAPI DesignComponent PAPI Design
PAPI Framework Layer
LowLevelAPI
HiLevelAPI
PAPI Component Layer(network)
Perf Counter Hardware
Operating System
Kernel Patch
PAPI Component Layer(CPU)
Perf Counter Hardware
Operating System
Kernel Patch
PAPI Component Layer(thermal)
Perf Counter Hardware
Operating System
Kernel Patch
D
evelAPI Devel
APIDevel
API
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
32/44
CScADS Autotuning
PAPIPAPI--C StatusC Status PAPI 3.9 pre-release available with documentation
Implemented Myrinet substrate (native counters)
Implemented ACPI temperature sensor substrate
Working on Infiniband and Cray Seastar substrates(access to Seastar counters not available underCatamount but expected under CNL)
Asked by Cray engineers for input on desired metrics fornext network switch
Tested on HPC Challenge benchmarks
Tested platforms include Pentium III, Pentium 4,
Core2Duo, Itanium (I and II) and AMD Opteron
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
33/44
CScADS Autotuning
PAPIPAPI--C New RoutinesC New Routines
PAPI_get_component_info()
PAPI_num_cmp_hwctrs() PAPI_get_cmp_opt()
PAPI_set_cmp_opt() PAPI_set_cmp_domain()
PAPI_set_cmp_granularity()
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
34/44
CScADS Autotuning
PAPIPAPI--C Building and LinkingC Building and Linking
CPU components are automatically detected byconfigureand included in the build
CPU component assumed to be present andalways configured as component 0
To include additional components, use configureoption
--with- = yes Currently supported componentswith-acpi = yes
with-mx = yeswith-net = yes
The make process compiles and links sources forall requested components into a single library
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
35/44
CScADS Autotuning
MyrinetMyrinet MX CountersMX Counters
ROUTE_DISPERSION
OUT_OF_SEND_HANDLES
OUT_OF_PULL_HANDLES
OUT_OF_PUSH_HANDLES
MEDIUM_CONT_RACE
CMD_TYPE_UNKNOWNUREQ_TYPE_UNKNOWN
INTERRUPTS_OVERRUN
WAITING_FOR_INTERRUPT_DMA
WAITING_FOR_INTERRUPT_ACK
WAITING_FOR_INTERRUPT_TIM
ER
SLABS_RECYCLING
SLABS_PRESSURE
SLABS_STARVATION
OUT_OF_RDMA_HANDLES
EVENTQ_FULL
BUFFER_DROP
MEMORY_DROP
HARDWARE_FLOW_CONTROL
SIMULATED_PACKETS_LOSTLOGGING_FRAMES_DUMPED
WAKE_INTERRUPTS
AVERTED_WAKEUP_RACE
DMA_METADATA_RACE
REPLY_SEND
REPLY_RECV
QUERY_UNKNOWN
DATA_SEND_NULL
DATA_SEND_SMALL
DATA_SEND_MEDIUMDATA_SEND_RNDV
DATA_SEND_PULL
DATA_RECV_NULL
DATA_RECV_SMALL_INLINE
DATA_RECV_SMALL_COPY
DATA_RECV_MEDIUM
DATA_RECV_RNDV
DATA_RECV_PULL
ETHER_SEND_UNICAST_CNT
ETHER_SEND_MULTICAST_C
NT
ETHER_RECV_SMALL_CNT
ETHER_RECV_BIG_CNT
ETHER_OVERRUN
ETHER_OVERSIZEDDATA_RECV_NO_CREDITS
PACKETS_RESENT
PACKETS_DROPPED
MAPPER_ROUTES_UPDATE
ACK_NACK_FRAMES_IN_PIPE
NACK_BAD_ENDPT
NACK_ENDPT_CLOSED
NACK_BAD_SESSION
NACK_BAD_RDMAWIN
NACK_EVENTQ_FULLSEND_BAD_RDMAWIN
CONNECT_TIMEOUT
CONNECT_SRC_UNKNOWN
QUERY_BAD_MAGIC
QUERY_TIMED_OUT
QUERY_SRC_UNKNOWN
RAW_SENDS
RAW_RECEIVES
RAW_OVERSIZED_PACKETS
RAW_RECV_OVERRUN
RAW_DISABLED
CONNECT_SEND
CONNECT_RECV
ACK_SEND
ACK_RECVPUSH_SEND
PUSH_RECV
QUERY_SEND
QUERY_RECV
LANAI_UPTIME
COUNTERS_UPTIME
BAD_CRC8
BAD_CRC32
UNSTRIPPED_ROUTE
PKT_DESC_INVALIDRECV_PKT_ERRORS
PKT_MISROUTED
DATA_SRC_UNKNOWN
DATA_BAD_ENDPT
DATA_ENDPT_CLOSED
DATA_BAD_SESSION
PUSH_BAD_WINDOW
PUSH_DUPLICATE
PUSH_OBSOLETE
PUSH_RACE_DRIVER
PUSH_BAD_SEND_HANDLE
_MAGIC
PUSH_BAD_SRC_MAGIC
PULL_OBSOLETE
PULL_NOTIFY_OBSOLETEPULL_RACE_DRIVER
ACK_BAD_TYPE
ACK_BAD_MAGIC
ACK_RESEND_RACE
LATE_ACK
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
36/44
CScADS Autotuning
MyrinetMyrinet MX CountersMX Counters
ROUTE_DISPERSION
OUT_OF_SEND_HANDLES
OUT_OF_PULL_HANDLES
OUT_OF_PUSH_HANDLES
MEDIUM_CONT_RACE
CMD_TYPE_UNKNOWNUREQ_TYPE_UNKNOWN
INTERRUPTS_OVERRUN
WAITING_FOR_INTERRUPT_DMA
WAITING_FOR_INTERRUPT_ACK
WAITING_FOR_INTERRUPT_TIM
ER
SLABS_RECYCLING
SLABS_PRESSURESLABS_STARVATION
OUT_OF_RDMA_HANDLES
EVENTQ_FULL
BUFFER_DROP
MEMORY_DROP
HARDWARE_FLOW_CONTROL
SIMULATED_PACKETS_LOSTLOGGING_FRAMES_DUMPED
WAKE_INTERRUPTS
AVERTED_WAKEUP_RACE
DMA_METADATA_RACE
REPLY_SEND
REPLY_RECV
QUERY_UNKNOWN
DATA_SEND_NULL
DATA_SEND_SMALL
DATA_SEND_MEDIUMDATA_SEND_RNDV
DATA_SEND_PULL
DATA_RECV_NULL
DATA_RECV_SMALL_INLINE
DATA_RECV_SMALL_COPY
DATA_RECV_MEDIUM
DATA_RECV_RNDV
DATA_RECV_PULLETHER_SEND_UNICAST_CNT
ETHER_SEND_MULTICAST_C
NT
ETHER_RECV_SMALL_CNT
ETHER_RECV_BIG_CNT
ETHER_OVERRUN
ETHER_OVERSIZEDDATA_RECV_NO_CREDITS
PACKETS_RESENT
PACKETS_DROPPED
MAPPER_ROUTES_UPDATE
ACK_NACK_FRAMES_IN_PIPE
NACK_BAD_ENDPT
NACK_ENDPT_CLOSED
NACK_BAD_SESSION
NACK_BAD_RDMAWIN
NACK_EVENTQ_FULLSEND_BAD_RDMAWIN
CONNECT_TIMEOUT
CONNECT_SRC_UNKNOWN
QUERY_BAD_MAGIC
QUERY_TIMED_OUT
QUERY_SRC_UNKNOWN
RAW_SENDS
RAW_RECEIVESRAW_OVERSIZED_PACKETS
RAW_RECV_OVERRUN
RAW_DISABLED
CONNECT_SEND
CONNECT_RECV
ACK_SEND
ACK_RECVPUSH_SEND
PUSH_RECV
QUERY_SEND
QUERY_RECV
LANAI_UPTIME
COUNTERS_UPTIME
BAD_CRC8
BAD_CRC32
UNSTRIPPED_ROUTE
PKT_DESC_INVALIDRECV_PKT_ERRORS
PKT_MISROUTED
DATA_SRC_UNKNOWN
DATA_BAD_ENDPT
DATA_ENDPT_CLOSED
DATA_BAD_SESSION
PUSH_BAD_WINDOW
PUSH_DUPLICATEPUSH_OBSOLETE
PUSH_RACE_DRIVER
PUSH_BAD_SEND_HANDLE
_MAGIC
PUSH_BAD_SRC_MAGIC
PULL_OBSOLETE
PULL_NOTIFY_OBSOLETEPULL_RACE_DRIVER
ACK_BAD_TYPE
ACK_BAD_MAGIC
ACK_RESEND_RACE
LATE_ACK
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
37/44
CScADS Autotuning
Multiple MeasurementsMultiple Measurements
The HPCC HPL benchmark with 3 performance metrics: FLOPS; Temperature; Network Sends/Receives
Temperature is from an on-chip thermal diode
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
38/44
CScADS Autotuning
Multiple Measurements (2)Multiple Measurements (2)
The HPCC HPL benchmark with 3 performance metrics: FLOPS; Temperature; Network Sends/Receives
Temperature is from an on-chip thermal diode
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
39/44
CScADS Autotuning
Eclipse PTP IDEEclipse PTP IDE
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
40/44
CScADS Autotuning
Performance Evaluation within Eclipse PTPPerformance Evaluation within Eclipse PTP
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
41/44
CScADS Autotuning
TAU and PAPITAU and PAPI PluginsPlugins for Eclipse PTPfor Eclipse PTP
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
42/44
CScADS Autotuning
PotentialPotential AutotuningAutotuning OpportunitiesOpportunities
Provide feedback to compilers orsearch engines
Run-time monitoring for dynamictuning or selection
Minimally intrusive collection ofalgorithm/application statisticsHow little data do we need?
How can we find the needle in thehaystack?
Other suggestions?
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
43/44
CScADS Autotuning
ConclusionsConclusions
PAPI has a long track record ofsuccessful adoption and use.
New architectures pose a challengefor off-processor hardware
monitoring as well as interpretationof counter values.
Integration of perfmon2 into the
Linux kernel will broaden the base ofPAPI users still further.
-
8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring
44/44
Hardware PerformanceHardware Performance
Monitoring with PAPIMonitoring with PAPI
CScADS Autotuning WorkshopJuly 2007