hardware performance monitoring hardware performance monitoring

Upload: miguelzas1

Post on 06-Apr-2018

234 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    1/44

    Hardware Performance MonitoringHardware Performance Monitoring

    with PAPIwith PAPI

    Dan [email protected]

    CScADS Autotuning WorkshopJuly 2007

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    2/44

    CScADS Autotuning

    WhatWhat

    s PAPI?s PAPI?

    Middleware that provides a consistent programminginterface for the performance counter hardware foundin most major micro-processors.

    Countable events are defined in two ways: Platform-neutral Preset Events Platform-dependent Native Events

    Preset Events can be derived from multiple NativeEvents

    All events are referenced by name and collected intoEventSets for sampling

    Events can be multiplexed if counters are limited Statistical sampling is implemented by:

    Software overflow with timer driven sampling Hardware overflow if supported by the platform

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    3/44

    CScADS Autotuning

    WhereWhere

    s PAPIs PAPI

    PAPI runs on most modern processors andOperating Systems of interest to HPC:

    IBM POWER{3, 4, 5} / AIXPOWER{4, 5, 6} / LinuxPowerPC{-32, -64, 970} / LinuxBlue Gene / L

    Intel Pentium II, III, 4, M, Core, etc. / LinuxIntel Itanium{1, 2, Montecito?}AMD Athlon, Opteron / Linux

    Cray T3E, X1, XD3, XT{3, 4} CatamountAltix, Sparc, SiCortexand even Windows!but not Mac

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    4/44

    CScADS Autotuning

    History of PAPIHistory of PAPI

    http://icl.cs.utk.edu/papi/

    Started as a Parallel ToolsConsortium project in 1998

    Goal:Produce a specification for a portable

    interface to the hardware performance

    counters available on most modernmicroprocessors.

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    5/44

    CScADS Autotuning

    Release TimelineRelease Timeline

    SDCI HPC Improvement:

    High-Productivity Performance Engineering(Tools, Methods, Training) for the NSF HPC Applications

    Submitted January 22, 2007

    Allen D. Malony, Sameer Shende, Shirley Moore, Nicholas Nystrom, Rick Kufrin

    Figure 2: Timeline of releases for each tool represented in the project. The vertical dashed lines indicate SC conference

    dates where the tools are regularly demonstrated. TAUs v1.0 release occurred at SC97.

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    6/44

    CScADS Autotuning

    TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/

    HPCToolkit (Rice Univ) http://hipersoft.cs.rice.edu/hpctoolkit/

    KOJAK (UTK, FZ Juelich) http://icl.cs.utk.edu/kojak/

    PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/

    Titanium (UC Berkeley)http://www.cs.berkeley.edu/Research/Projects/titanium/

    SCALEA (Thomas Fahringer, U Innsbruck)http://www.par.univie.ac.at/project/scalea/

    Open|Speedshop (SGI)http://oss.sgi.com/projects/openspeedshop/

    SvPablo (UNC Renaissance Computing Institute)http://www.renci.unc.edu/Software/Pablo/pablo.htm

    Some Tools that use PAPISome Tools that use PAPI

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    7/44

    CScADS Autotuning

    Current PAPI DesignCurrent PAPI Design

    Two exposed interfaces to the underlying counter hardware:1. The low level API manages hardware events in userdefined groups called EventSets, and provides access toadvanced features.

    2. The high level API provides the ability to start, stop andread the counters for a specified list of events.

    PAPI Framework Layer

    LowLevelAPI

    HiLevelAPI

    PAPI Component Layer

    Perf Counter Hardware

    Operating System

    Kernel Patch

    Portable

    PlatformDependent

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    8/44

    CScADS Autotuning

    PAPI Hardware EventsPAPI Hardware Events

    Preset Events Standard set of over 100 events for application

    performance tuning

    No standardization of the exact definition Mapped to either single or linear combinations of native

    events on each platform Usepapi_availutility to see what preset events are

    available on a given platform Native Events

    Any event countable by the CPU

    Same interface as for preset events

    Usepapi_native_availutility to see all available nativeevents

    Usepapi_event_chooserutility to select a

    compatible set of events

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    9/44

    CScADS Autotuning

    Data and Instruction Range QualificationData and Instruction Range Qualification

    Generalized PAPI interface for data structure andinstruction address range qualification

    Applied to the specific instance of the Itanium2 Extended an existing PAPI call, PAPI_set_opt(),

    to specify starting and ending addresses of datastructures or instructions to be instrumented

    option.addr.eventset = EventSet;

    option.addr.start = (caddr_t)array;

    option.addr.end = (caddr_t)(array + size_array);

    retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &option);

    An instruction range can be set usingPAPI_INSTR_ADDRESS

    papi_native_avail was modified to list eventsthat support data or instruction address range

    qualification.

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    10/44

    CScADS Autotuning

    PAPI Preset EventsPAPI Preset Events

    Of ~100 events, over half are cache related:

    PAPI_L1_DCH: Level 1 data cache hits

    PAPI_L1_DCA: Level 1 data cache accesses

    PAPI_L1_DCR: Level 1 data cache readsPAPI_L1_DCW: Level 1 data cache writes

    PAPI_L1_DCM: Level 1 data cache misses

    PAPI_L1_ICH: Level 1 instruction cache hits

    PAPI_L1_ICA: Level 1 instruction cache accesses

    PAPI_L1_ICR: Level 1 instruction cache reads

    PAPI_L1_ICW: Level 1 instruction cache writes

    PAPI_L1_ICM: Level 1 instruction cache misses

    PAPI_L1_TCH: Level 1 total cache hits

    PAPI_L1_TCA: Level 1 total cache accesses

    PAPI_L1_TCR: Level 1 total cache reads

    PAPI_L1_TCW: Level 1 total cache writesPAPI_L1_TCM: Level 1 cache misses

    PAPI_L1_LDM: Level 1 load misses

    PAPI_L1_STM: Level 1 store misses Repeat forLevels 2 and 3

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    11/44

    CScADS Autotuning

    PAPI Preset Events (ii)PAPI Preset Events (ii)

    Other cache and memory events:PAPI_CA_SNP: Requests for a snoop

    PAPI_CA_SHR: Requests for exclusive access to shared cache linePAPI_CA_CLN: Requests for exclusive access to clean cache line

    PAPI_CA_INV: Requests for cache line invalidation

    PAPI_CA_ITV: Requests for cache line intervention

    PAPI_TLB_DM: Data translation lookaside buffer misses

    PAPI_TLB_IM: Instruction translation lookaside buffer misses

    PAPI_TLB_TL: Total translation lookaside buffer misses

    PAPI_TLB_SD: Translation lookaside buffer shootdowns

    PAPI_LD_INS: Load instructions

    PAPI_SR_INS: Store instructions

    PAPI_MEM_SCY: Cycles Stalled Waiting for memory accessesPAPI_MEM_RCY: Cycles Stalled Waiting for memory Reads

    PAPI_MEM_WCY: Cycles Stalled Waiting for memory writes

    PAPI_RES_STL: Cycles stalled on any resource

    PAPI_FP_STAL: Cycles the FP unit(s) are stalled

    Sharedcache

    TLB

    ResourceStalls

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    12/44

    CScADS Autotuning

    PAPI Preset Events (iii)PAPI Preset Events (iii)

    Program flow:PAPI_BR_INS: Branch instructions

    PAPI_BR_UCN: Unconditional branch instructionsPAPI_BR_CN: Conditional branch instructions

    PAPI_BR_TKN: Conditional branch instructions taken

    PAPI_BR_NTK: Conditional branch instructions not taken

    PAPI_BR_MSP: Conditional branch instructions mispredicted

    PAPI_BR_PRC: Conditional branch instructions correctly predicted

    PAPI_BTAC_M: Branch target address cache misses

    PAPI_CSR_FAL: Failed store conditional instructions

    PAPI_CSR_SUC: Successful store conditional instructions

    PAPI_CSR_TOT: Total store conditional instructions

    Branches

    ConditionalStores

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    13/44

    CScADS Autotuning

    PAPI_TOT_CYC: Total cycles

    PAPI_TOT_IIS: Instructions issued

    PAPI_TOT_INS: Instructions completed

    PAPI_INT_INS: Integer instructions completed

    PAPI_LST_INS: Load/store instructions completed

    PAPI_SYC_INS: Synchronization instructions completed

    PAPI_BRU_IDL: Cycles branch units are idle

    PAPI_FXU_IDL: Cycles integer units are idle

    PAPI_FPU_IDL: Cycles floating point units are idle

    PAPI_LSU_IDL: Cycles load/store units are idle

    PAPI_STL_ICY: Cycles with no instruction issue

    PAPI_FUL_ICY: Cycles with maximum instruction issuePAPI_STL_CCY: Cycles with no instructions completed

    PAPI_FUL_CCY: Cycles with maximum instructions completed

    PAPI_HW_INT: Hardware interrupts

    PAPI Preset Events (iv)PAPI Preset Events (iv)

    Timing, efficiency, pipeline:

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    14/44

    CScADS Autotuning

    PAPI Preset Events (v)PAPI Preset Events (v)

    Floating point:

    PAPI_FP_INS: Floating point instructions

    PAPI_FP_OPS: Floating point operationsPAPI_FML_INS: Floating point multiply instructions

    PAPI_FAD_INS: Floating point add instructions

    PAPI_FDV_INS: Floating point divide instructions

    PAPI_FSQ_INS: Floating point square root instructions

    PAPI_FNV_INS: Floating point inverse instructions

    PAPI_FMA_INS: FMA instructions completed

    PAPI_VEC_INS: Vector/SIMD instructions

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    15/44

    CScADS Autotuning

    WhatWhats a Native Event?s a Native Event?

    8 mask bits 8 bits: 256 events

    16 mask bits6 bits: 64 events

    PMC: Pentium 4

    PMC: Intel Pentium II, III, M, Core; AMD Athlon, Opteron

    PMD: AMD Athlon, Opteron

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    16/44

    CScADS Autotuning

    Intel Pentium Core: L2_STIntel Pentium Core: L2_ST

    { .pme_uname = "SELF",

    .pme_udesc = "This core",

    .pme_ucode = 0x40

    },

    { .pme_uname = "BOTH_CORES",

    .pme_udesc = "Both cores",

    .pme_ucode = 0xc0

    }

    },

    .pme_numasks = 7

    },

    { .pme_name = "L2_ST",

    .pme_code = 0x2a,

    .pme_flags = PFMLIB_CORE_CSPEC,

    .pme_desc = "L2 store requests",

    .pme_umasks = {

    { .pme_uname = "MESI",.pme_udesc = "Any cacheline access",

    .pme_ucode = 0xf

    },

    { .pme_uname = "I_STATE",

    .pme_udesc = "Invalid cacheline",

    .pme_ucode = 0x1

    },

    { .pme_uname = "S_STATE",

    .pme_udesc = "Shared cacheline",

    .pme_ucode = 0x2

    },

    { .pme_uname = "E_STATE",

    .pme_udesc = "Exclusive cacheline",

    .pme_ucode = 0x4

    },{ .pme_uname = "M_STATE",

    .pme_udesc = "Modified cacheline",

    .pme_ucode = 0x8

    }

    PRESET,PAPI_L2_DCA,

    DERIVED_ADD,L2_LD:SELF:ANY:MESI,

    L2_ST:SELF:MESI

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    17/44

    CScADS Autotuning

    How many counters does it takeHow many counters does it take

    Penti

    um

    Penti

    umII

    Penti

    umIII

    Penti

    umM

    Itaniu

    m

    AMD

    Athlo

    n

    AMD

    Opter

    on

    S

    iCort

    exMIP

    S

    Intel

    Core

    POW

    ER3

    POW

    ER4

    Cell

    P

    OWER

    5

    Pe

    ntium

    4BG

    /L

    Cray

    X1

    S1

    0

    10

    20

    30

    40

    50

    60

    70

    24

    5(3 fixed)

    8

    8:164:32

    12(6:2)

    18

    48+2+2

    32+16+16

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    18/44

    CScADS Autotuning

    PAPI and BG/LPAPI and BG/L

    Performance Counters: 48 UPC Counters

    shared by both CPUs

    External to CPU cores

    32 bits :(

    2 Counters on each FPU 1 counts load/stores

    1 counts arithmetic operations

    Accessed via blg_perfctr

    15 Preset Events 10 PAPI presets

    5 Custom BG/L presets

    328 native events available

    2 FPU PMCs

    2 FPU PMCs

    UPC Module48 SharedCounters

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    19/44

    CScADS Autotuning

    Cell Broadband EngineCell Broadband Engine

    Each Cell contains: 1 PPE and 8 SPEs. and 1 PMU external to all of these.

    8 16-bit counters configurable as 4 32-bit counters.

    1024 slot 128-bit trace buffer

    400 native events

    Working with IBM

    engineers on developing perfmon2

    libpfm layer for Cell BE

    Linux Cell BE

    kernel modifications Porting PAPI-C

    (LANL grant)

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    20/44

    CScADS Autotuning

    Top500 Operating SystemsTop500 Operating Systems

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    21/44

    CScADS Autotuning

    PerfctrPerfctr

    Written by Mikael PettersonLabor of love

    First available: Fall 1999

    First PAPI use: Fall 2000

    Supports:Intel Pentium II, III, 4, M, Core

    AMD K7 (Athlon), K8 (Opteron)

    IBM PowerPC 970, POWER4, POWER5

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    22/44

    CScADS Autotuning

    PerfctrPerfctr FeaturesFeatures

    Patches the Linux kernel

    Saves perf counters on context switchVirtualizes counters to 64-bitsMemory-maps counters for fastaccess

    Supports counter overflow interruptswhere available

    User Space LibraryPAPI uses about a dozen calls

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    23/44

    CScADS Autotuning

    PerfctrPerfctr TimelineTimeline

    Steady development

    1999 2004Concerted effort for kernel inclusionMay 2004 May 2005

    Ported to Cray Catamount; Power Linux~ 2005

    Maintenance only2005

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    24/44

    CScADS Autotuning

    PerfmonPerfmon

    Written by Stephane Eranian @ HP

    Originally Itanium onlyBuilt-in to the Linux-ia64 kernel since

    2.4.0System call interface

    libpfm helper library for bookkeeping

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    25/44

    CScADS Autotuning

    Perfmon2*Perfmon2*

    Provides a generic interface to access PMU Not dedicated to one app, avoid fragmentation

    Must be portable across all PMU models: Almost all PMU-specific knowledge in user level libraries

    Supports per-thread monitoring Self-monitoring, unmodified binaries, attach/detach

    multi-threaded and multi-process workloads Supports system-wide monitoring Supports counting and sampling No modification to applications or system Built-in, efficient, robust, secure, simple,

    documented

    * Slide contents courtesy Stephane Eranian* Slide contents courtesy Stephane Eranian

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    26/44

    CScADS Autotuning

    Perfmon2Perfmon2

    Setup done through external support library Uses a system call for counting operations

    More flexibility, ties with ctxsw, exit, fork

    Kernel compile-time option on Linux

    Perfmon2 context encapsulates all PMU state Each context uniquely identified by file descriptor

    int perfmonctl(int fd, int cmd, void *arg, int narg)

    PFM_GET_CONFIG

    PFM_GETINFO_PMCS

    PFM_CREATE_EVTSETPFM_WRITE_PMDS

    PFM_WRITE_PMCS

    PFM_CREATE_CONTEXT

    PFM_SET_CONFIG

    PFM_GETINFO_PMDS

    PFM_DELETE_EVTSETPFM_UNLOAD_CONTEXT

    PFM_LOAD_CONTEXT

    PFM_READ_PMDS

    PFM_GETINFO_EVTSETPFM_RESTART

    PFM_STOP

    PFM_START

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    27/44

    CScADS Autotuning

    Perfmon2 FeaturesPerfmon2 Features

    Support today for:

    Intel Itanium, P6, M, Core, Pentium4, AMDOpteron, IBM Power, MIPS, SiCortex

    Full native event tables for supportedprocessors

    Kernel based MultiplexingEvent set chaining

    Kernel based Sampling/OverflowTime or event basedCustom sampling buffers

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    28/44

    CScADS Autotuning

    Next StepsNext Steps

    Kernel integration Final integration testing underway

    Possible inclusion in 2.6.22 kernel Implemented by Cray in CNK, X2 Cell BE

    Port with IBM engineers is underway

    Leverage libpfm for PAPI native events Migration completed for P6, Core, P4, Opteron

    PAPI testing on perfmon2 patched kernels

    Opteron currently being tested Woodcrest/Clovertown testing planned

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    29/44

    CScADS Autotuning

    Component PAPI (PAPIComponent PAPI (PAPI--C)C)

    Goals:Support simultaneous access to on- and off-

    processor countersIsolate hardware dependent code in a separable

    component module

    Extend platform independent framework code tosupport multiple simultaneous components

    Add or modify API calls to support access toany of several components

    Modify build environment for easy selection andconfiguration of multiple available components

    Will be released (RSN*) as PAPI 4.0

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    30/44

    CScADS Autotuning

    Current PAPI DesignCurrent PAPI Design

    PAPI Framework Layer

    LowLevelAPI

    HiLevelAPI

    PAPI Component Layer

    Perf Counter Hardware

    Operating System

    Kernel Patch

    Portable

    PlatformDependent

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    31/44

    CScADS Autotuning

    Component PAPI DesignComponent PAPI Design

    PAPI Framework Layer

    LowLevelAPI

    HiLevelAPI

    PAPI Component Layer(network)

    Perf Counter Hardware

    Operating System

    Kernel Patch

    PAPI Component Layer(CPU)

    Perf Counter Hardware

    Operating System

    Kernel Patch

    PAPI Component Layer(thermal)

    Perf Counter Hardware

    Operating System

    Kernel Patch

    D

    evelAPI Devel

    APIDevel

    API

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    32/44

    CScADS Autotuning

    PAPIPAPI--C StatusC Status PAPI 3.9 pre-release available with documentation

    Implemented Myrinet substrate (native counters)

    Implemented ACPI temperature sensor substrate

    Working on Infiniband and Cray Seastar substrates(access to Seastar counters not available underCatamount but expected under CNL)

    Asked by Cray engineers for input on desired metrics fornext network switch

    Tested on HPC Challenge benchmarks

    Tested platforms include Pentium III, Pentium 4,

    Core2Duo, Itanium (I and II) and AMD Opteron

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    33/44

    CScADS Autotuning

    PAPIPAPI--C New RoutinesC New Routines

    PAPI_get_component_info()

    PAPI_num_cmp_hwctrs() PAPI_get_cmp_opt()

    PAPI_set_cmp_opt() PAPI_set_cmp_domain()

    PAPI_set_cmp_granularity()

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    34/44

    CScADS Autotuning

    PAPIPAPI--C Building and LinkingC Building and Linking

    CPU components are automatically detected byconfigureand included in the build

    CPU component assumed to be present andalways configured as component 0

    To include additional components, use configureoption

    --with- = yes Currently supported componentswith-acpi = yes

    with-mx = yeswith-net = yes

    The make process compiles and links sources forall requested components into a single library

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    35/44

    CScADS Autotuning

    MyrinetMyrinet MX CountersMX Counters

    ROUTE_DISPERSION

    OUT_OF_SEND_HANDLES

    OUT_OF_PULL_HANDLES

    OUT_OF_PUSH_HANDLES

    MEDIUM_CONT_RACE

    CMD_TYPE_UNKNOWNUREQ_TYPE_UNKNOWN

    INTERRUPTS_OVERRUN

    WAITING_FOR_INTERRUPT_DMA

    WAITING_FOR_INTERRUPT_ACK

    WAITING_FOR_INTERRUPT_TIM

    ER

    SLABS_RECYCLING

    SLABS_PRESSURE

    SLABS_STARVATION

    OUT_OF_RDMA_HANDLES

    EVENTQ_FULL

    BUFFER_DROP

    MEMORY_DROP

    HARDWARE_FLOW_CONTROL

    SIMULATED_PACKETS_LOSTLOGGING_FRAMES_DUMPED

    WAKE_INTERRUPTS

    AVERTED_WAKEUP_RACE

    DMA_METADATA_RACE

    REPLY_SEND

    REPLY_RECV

    QUERY_UNKNOWN

    DATA_SEND_NULL

    DATA_SEND_SMALL

    DATA_SEND_MEDIUMDATA_SEND_RNDV

    DATA_SEND_PULL

    DATA_RECV_NULL

    DATA_RECV_SMALL_INLINE

    DATA_RECV_SMALL_COPY

    DATA_RECV_MEDIUM

    DATA_RECV_RNDV

    DATA_RECV_PULL

    ETHER_SEND_UNICAST_CNT

    ETHER_SEND_MULTICAST_C

    NT

    ETHER_RECV_SMALL_CNT

    ETHER_RECV_BIG_CNT

    ETHER_OVERRUN

    ETHER_OVERSIZEDDATA_RECV_NO_CREDITS

    PACKETS_RESENT

    PACKETS_DROPPED

    MAPPER_ROUTES_UPDATE

    ACK_NACK_FRAMES_IN_PIPE

    NACK_BAD_ENDPT

    NACK_ENDPT_CLOSED

    NACK_BAD_SESSION

    NACK_BAD_RDMAWIN

    NACK_EVENTQ_FULLSEND_BAD_RDMAWIN

    CONNECT_TIMEOUT

    CONNECT_SRC_UNKNOWN

    QUERY_BAD_MAGIC

    QUERY_TIMED_OUT

    QUERY_SRC_UNKNOWN

    RAW_SENDS

    RAW_RECEIVES

    RAW_OVERSIZED_PACKETS

    RAW_RECV_OVERRUN

    RAW_DISABLED

    CONNECT_SEND

    CONNECT_RECV

    ACK_SEND

    ACK_RECVPUSH_SEND

    PUSH_RECV

    QUERY_SEND

    QUERY_RECV

    LANAI_UPTIME

    COUNTERS_UPTIME

    BAD_CRC8

    BAD_CRC32

    UNSTRIPPED_ROUTE

    PKT_DESC_INVALIDRECV_PKT_ERRORS

    PKT_MISROUTED

    DATA_SRC_UNKNOWN

    DATA_BAD_ENDPT

    DATA_ENDPT_CLOSED

    DATA_BAD_SESSION

    PUSH_BAD_WINDOW

    PUSH_DUPLICATE

    PUSH_OBSOLETE

    PUSH_RACE_DRIVER

    PUSH_BAD_SEND_HANDLE

    _MAGIC

    PUSH_BAD_SRC_MAGIC

    PULL_OBSOLETE

    PULL_NOTIFY_OBSOLETEPULL_RACE_DRIVER

    ACK_BAD_TYPE

    ACK_BAD_MAGIC

    ACK_RESEND_RACE

    LATE_ACK

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    36/44

    CScADS Autotuning

    MyrinetMyrinet MX CountersMX Counters

    ROUTE_DISPERSION

    OUT_OF_SEND_HANDLES

    OUT_OF_PULL_HANDLES

    OUT_OF_PUSH_HANDLES

    MEDIUM_CONT_RACE

    CMD_TYPE_UNKNOWNUREQ_TYPE_UNKNOWN

    INTERRUPTS_OVERRUN

    WAITING_FOR_INTERRUPT_DMA

    WAITING_FOR_INTERRUPT_ACK

    WAITING_FOR_INTERRUPT_TIM

    ER

    SLABS_RECYCLING

    SLABS_PRESSURESLABS_STARVATION

    OUT_OF_RDMA_HANDLES

    EVENTQ_FULL

    BUFFER_DROP

    MEMORY_DROP

    HARDWARE_FLOW_CONTROL

    SIMULATED_PACKETS_LOSTLOGGING_FRAMES_DUMPED

    WAKE_INTERRUPTS

    AVERTED_WAKEUP_RACE

    DMA_METADATA_RACE

    REPLY_SEND

    REPLY_RECV

    QUERY_UNKNOWN

    DATA_SEND_NULL

    DATA_SEND_SMALL

    DATA_SEND_MEDIUMDATA_SEND_RNDV

    DATA_SEND_PULL

    DATA_RECV_NULL

    DATA_RECV_SMALL_INLINE

    DATA_RECV_SMALL_COPY

    DATA_RECV_MEDIUM

    DATA_RECV_RNDV

    DATA_RECV_PULLETHER_SEND_UNICAST_CNT

    ETHER_SEND_MULTICAST_C

    NT

    ETHER_RECV_SMALL_CNT

    ETHER_RECV_BIG_CNT

    ETHER_OVERRUN

    ETHER_OVERSIZEDDATA_RECV_NO_CREDITS

    PACKETS_RESENT

    PACKETS_DROPPED

    MAPPER_ROUTES_UPDATE

    ACK_NACK_FRAMES_IN_PIPE

    NACK_BAD_ENDPT

    NACK_ENDPT_CLOSED

    NACK_BAD_SESSION

    NACK_BAD_RDMAWIN

    NACK_EVENTQ_FULLSEND_BAD_RDMAWIN

    CONNECT_TIMEOUT

    CONNECT_SRC_UNKNOWN

    QUERY_BAD_MAGIC

    QUERY_TIMED_OUT

    QUERY_SRC_UNKNOWN

    RAW_SENDS

    RAW_RECEIVESRAW_OVERSIZED_PACKETS

    RAW_RECV_OVERRUN

    RAW_DISABLED

    CONNECT_SEND

    CONNECT_RECV

    ACK_SEND

    ACK_RECVPUSH_SEND

    PUSH_RECV

    QUERY_SEND

    QUERY_RECV

    LANAI_UPTIME

    COUNTERS_UPTIME

    BAD_CRC8

    BAD_CRC32

    UNSTRIPPED_ROUTE

    PKT_DESC_INVALIDRECV_PKT_ERRORS

    PKT_MISROUTED

    DATA_SRC_UNKNOWN

    DATA_BAD_ENDPT

    DATA_ENDPT_CLOSED

    DATA_BAD_SESSION

    PUSH_BAD_WINDOW

    PUSH_DUPLICATEPUSH_OBSOLETE

    PUSH_RACE_DRIVER

    PUSH_BAD_SEND_HANDLE

    _MAGIC

    PUSH_BAD_SRC_MAGIC

    PULL_OBSOLETE

    PULL_NOTIFY_OBSOLETEPULL_RACE_DRIVER

    ACK_BAD_TYPE

    ACK_BAD_MAGIC

    ACK_RESEND_RACE

    LATE_ACK

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    37/44

    CScADS Autotuning

    Multiple MeasurementsMultiple Measurements

    The HPCC HPL benchmark with 3 performance metrics: FLOPS; Temperature; Network Sends/Receives

    Temperature is from an on-chip thermal diode

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    38/44

    CScADS Autotuning

    Multiple Measurements (2)Multiple Measurements (2)

    The HPCC HPL benchmark with 3 performance metrics: FLOPS; Temperature; Network Sends/Receives

    Temperature is from an on-chip thermal diode

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    39/44

    CScADS Autotuning

    Eclipse PTP IDEEclipse PTP IDE

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    40/44

    CScADS Autotuning

    Performance Evaluation within Eclipse PTPPerformance Evaluation within Eclipse PTP

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    41/44

    CScADS Autotuning

    TAU and PAPITAU and PAPI PluginsPlugins for Eclipse PTPfor Eclipse PTP

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    42/44

    CScADS Autotuning

    PotentialPotential AutotuningAutotuning OpportunitiesOpportunities

    Provide feedback to compilers orsearch engines

    Run-time monitoring for dynamictuning or selection

    Minimally intrusive collection ofalgorithm/application statisticsHow little data do we need?

    How can we find the needle in thehaystack?

    Other suggestions?

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    43/44

    CScADS Autotuning

    ConclusionsConclusions

    PAPI has a long track record ofsuccessful adoption and use.

    New architectures pose a challengefor off-processor hardware

    monitoring as well as interpretationof counter values.

    Integration of perfmon2 into the

    Linux kernel will broaden the base ofPAPI users still further.

  • 8/3/2019 Hardware Performance Monitoring Hardware Performance Monitoring

    44/44

    Hardware PerformanceHardware Performance

    Monitoring with PAPIMonitoring with PAPI

    Dan [email protected]

    CScADS Autotuning WorkshopJuly 2007