Recent Advances in the Performance API (PAPI)
Anthony Danalis
10th Scalable Tools Workshop
Lake Tahoe, California, August 1-4, 2016
Collaborators: Heike Jagode, Asim YarKhan, Jack Dongarra
PAPI
• Middleware that provides a consistent interface and methodology for the performance counter hardware found in most major microprocessors
• PAPI enables software engineers to see, in near real time, the relation between software performance and hardware events
SUPPORTED ARCHITECTURES:
• AMD
• Cray: Aries, Gemini, power
• IBM Blue Gene series, Q: 5D-Torus, I/O system, CNK, EMON power/energy
• IBM Power series
• Intel Westmere, Sandy|Ivy Bridge, Haswell, Broadwell, Skylake, Knights Corner|Landing
• ARM Cortex A8, A9, A15, ARM64
• NVIDIA Tesla, Kepler, NVML: CUDA support for multiple GPUs; PC sampling
• InfiniBand
• Intel RAPL (power/energy); power capping
• Intel KNC, KNL power/energy
COMPONENT PAPI:
• Provides access to a collection of components that expose performance measurement opportunities across the system as a whole, including network, the I/O system, the Compute Node Kernel, power/energy
PAPI CPU Components: KNC vs. KNL
PAPI offers two components to the CPU counters:
• perf_event: Linux perf_event CPU core events
  • Knights Corner: PMUs supported: perf, perf_raw, knc; # of native events: 140; # of preset events: 14; # of counters: 2
  • Knights Landing: PMUs supported: perf, perf_raw, knl, ix86arch; # of native events: 182; # of preset events: 26 (see next slide); # of counters: 5
• perf_event_uncore: Linux perf_event CPU uncore and northbridge events
  • Knights Corner: ---
  • Knights Landing: PMUs supported: e.g. memory, on-die interconnect, IO, memory-to-PCIe event support; # of native events: 894
List of the 26 PAPI Preset Events for KNL
Preset Event    Description
PAPI_L1_DCM     Level 1 data cache misses
PAPI_L1_ICM     Level 1 instruction cache misses
PAPI_L1_TCM     Level 1 cache misses
PAPI_L2_TCM     Level 2 cache misses
PAPI_TLB_DM     Data translation lookaside buffer misses
PAPI_L1_LDM     Level 1 load misses
PAPI_L2_LDM     Level 2 load misses
PAPI_STL_ICY    Cycles with no instruction issue
PAPI_BR_UCN     Unconditional branch instructions
PAPI_BR_CN      Conditional branch instructions
PAPI_BR_TKN     Conditional branch instructions taken
PAPI_BR_NTK     Conditional branch instructions not taken
PAPI_BR_MSP     Conditional branch instructions mispredicted
PAPI_TOT_INS    Instructions completed
PAPI_LD_INS     Load instructions
PAPI_ST_INS     Store instructions
PAPI_BR_INS     Branch instructions
PAPI_RES_STL    Cycles stalled on any resource
PAPI_TOT_CYC    Total cycles
PAPI_LST_INS    Load/store instructions completed
PAPI_L1_DCA     Level 1 data cache accesses
PAPI_L1_ICH     Level 1 instruction cache hits
PAPI_L1_ICA     Level 1 instruction cache accesses
PAPI_L2_TCH     Level 2 total cache hits
PAPI_L2_TCA     Level 2 total cache accesses
PAPI_REF_CYC    Reference clock cycles
PAPI POWER Components: KNC vs. KNL
PAPI offers four components for power/energy monitoring:
• powercap: Reads RAPL results via the Linux POWERCAP interface
  • KNC: ---
  • KNL: # of native events: 15 (requires no special permissions)
• rapl: RAPL results (raw access to the underlying MSRs)
  • KNC: ---
  • KNL: # of native events: 14 (requires root privilege)
• micpower: Reading power in native mode
  • KNC: Power values reported in /sys/class/micras/power; # of native events: 16
  • KNL: ---
• host_micpower: Reading power in offload mode
  • KNC: Power values exported via MicAccess API (MPSS); # of native events: 16
  • KNL: ---
PAPI POWER on Knights Landing
The following power domains are supported:
• PACKAGE: Processor die
• DRAM (Memory): Directly-attached DRAM
• PP0 (Power Plane 0): Processor cores subsystem
Simple Verification Test: Naive MMM (1024x1024):
Scaled Energy Measurements:
PACKAGE_ENERGY 8216.792419 J (Average Power 84.5W)
DRAM_ENERGY 2539.645264 J (Average Power 26.1W)
Energy Measurement Counts:
PACKAGE_ENERGY_CNT 134623927
DRAM_ENERGY_CNT 41609548
Scaled Fixed Values:
THERMAL_SPEC 215.000 W
MAXIMUM_POWER 258.000 W
MAXIMUM_TIME_WINDOW 0.046 s
Fixed Value Counts:
THERMAL_SPEC_CNT 1720
MAXIMUM_POWER_CNT 2064
MAXIMUM_TIME_WINDOW_CNT 47
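
The scaled values and the raw counts above are related by fixed unit sizes. The following is a minimal sketch of that conversion. The unit sizes (2^-14 J per energy count, 1/8 W per power count, 2^-10 s per time count) are assumptions inferred from the numbers on this slide, not values read from the hardware, and the `scale` helper is hypothetical.

```python
# Unit sizes are assumptions inferred from the slide's own numbers;
# on a real system they would be read from the RAPL unit register.
ENERGY_UNIT_J = 2.0 ** -14   # joules per energy count
POWER_UNIT_W = 1.0 / 8       # watts per power count
TIME_UNIT_S = 2.0 ** -10     # seconds per time-window count


def scale(count, unit):
    """Convert a raw counter value into a physical quantity."""
    return count * unit


# Raw counts copied from the slide.
package_j = scale(134623927, ENERGY_UNIT_J)
dram_j = scale(41609548, ENERGY_UNIT_J)
thermal_spec_w = scale(1720, POWER_UNIT_W)
maximum_power_w = scale(2064, POWER_UNIT_W)
max_time_window_s = scale(47, TIME_UNIT_S)

# Run time implied by the reported average package power (84.5 W).
elapsed_s = package_j / 84.5

print(f"PACKAGE_ENERGY      {package_j:.6f} J")
print(f"DRAM_ENERGY         {dram_j:.6f} J")
print(f"THERMAL_SPEC        {thermal_spec_w:.3f} W")
print(f"MAXIMUM_POWER       {maximum_power_w:.3f} W")
print(f"MAXIMUM_TIME_WINDOW {max_time_window_s:.3f} s")
print(f"implied run time    {elapsed_s:.1f} s")
```

Under these assumed units the counts reproduce the scaled values shown on the slide, and dividing either energy by its reported average power gives a run time of roughly 97 seconds.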
PAPI POWER on KNL: Hessenberg (MKL's dgehrd)
Intel® Xeon Phi™ Knights Landing, 68 cores (4 HW threads/core)
+ memory-bound kernel (GEMVs and GEMMs)
+ 9 computations with different matrix sizes
Power usage mimics computational intensity:
+ Factorization starts on entire matrix → consumes most power
+ As factorization progresses, it operates on smaller matrices → consumes less power
[Figure: Power (watts) vs. time (seconds) passing as different size Hessenberg reductions are done on KNL (68x4 cores). Series: Accelerator Power Usage (Watts) (PACKAGE); Memory Power Usage (Watts) (DRAM). Matrix sizes: 1088 (6.3 GFlops), 2112 (13.6 GFlops), 3136 (21.4 GFlops), 4160 (27.9 GFlops), 5184 (34.4 GFlops), 6082 (36.0 GFlops), 7232 (46.1 GFlops), 8256 (50.5 GFlops), 9280 (56.1 GFlops).]
PAPI FOR PARSEC
PARALLEL RUNTIME SCHEDULING AND EXECUTION CONTROLLER
Dataflow-driven Programming Models
• Develop for a portability layer, not an architecture
• Let the runtime deal with the hardware characteristics
• Task-scheduling: PaRSEC, StarSS, StarPU, Swift, Parallex, Quark, Kaapi, DuctTeip
PaRSEC Features
PAPI and PaRSEC
Parallel Runtime Scheduling and Execution Controller
PaRSEC:
• Generic framework for architecture-aware scheduling of micro-tasks on distributed many-core heterogeneous architectures
→ Performance tools become more and more important for task-based dataflow and execution systems
→ Analysis features that show the connection of the dataflow and the execution profile/trace are extremely beneficial
PAPI in PaRSEC:
• Integrated in PaRSEC's Performance INStrumentation modules
• PINS modules can be selectively loaded and used by users
• Enables users to measure performance counter data for each task/node in a DAG (Directed Acyclic Graph)
• Everything supported by PAPI can be measured in PaRSEC at "per task" granularity
PAPI Power per Task: PaRSEC
[Figure: Average power (watts) vs. time (seconds). Series: PACKAGE_ENERGY:PACKAGE0 and PACKAGE_ENERGY:PACKAGE1 (total power on socket 0, 1; includes everything); PP0_ENERGY:PACKAGE0 and PP0_ENERGY:PACKAGE1 (power of cores only on socket 0, 1). Matrix size = 11,584 -- Tile size = 724. Sandy Bridge EP 2.60 GHz, 2 sockets, running on 1 (out of 8) core per socket.]
PAPI Power Sampling: ScaLAPACK
[Figure: Average power (watts) vs. time (seconds). Series: PACKAGE_ENERGY:PACKAGE0 and PACKAGE_ENERGY:PACKAGE1 (total power on socket 0, 1; includes everything); PP0_ENERGY:PACKAGE0 and PP0_ENERGY:PACKAGE1 (power of cores only on socket 0, 1). Matrix size = 11,584 -- Tile size = 724. Sandy Bridge EP 2.60 GHz, 2 sockets, running on 1 (out of 8) core per socket.]
PAPI Power Measurements
PaRSEC: 30.8 GFLOPs, Total Energy: 4.35 kWs
ScaLAPACK: 19.6 GFLOPs, Total Energy: 7.79 kWs
[Figure: Average power usage for (p)dgeqrf -- Matrix size = 11,584 -- Tile size = 724. Sandy Bridge EP 2.60 GHz, 2 sockets, running on 1 (out of 8) core per socket. Average power (watts) vs. time (seconds). Series: PACKAGE_ENERGY:PACKAGE0, PACKAGE_ENERGY:PACKAGE1, PP0_ENERGY:PACKAGE0, PP0_ENERGY:PACKAGE1.]
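
The two totals on this slide can be put side by side with a few lines of arithmetic. This is only a sketch using the figures quoted on the slide (1 kWs equals 1 kJ); the dictionary layout and variable names are illustrative assumptions.

```python
# Figures copied from the slide; names and layout are illustrative.
runs = {
    "PaRSEC": {"gflops": 30.8, "energy_kws": 4.35},
    "ScaLAPACK": {"gflops": 19.6, "energy_kws": 7.79},
}

parsec = runs["PaRSEC"]
scalapack = runs["ScaLAPACK"]

# How much faster PaRSEC ran the same factorization.
speedup = parsec["gflops"] / scalapack["gflops"]

# How much more energy ScaLAPACK used for the same work.
energy_ratio = scalapack["energy_kws"] / parsec["energy_kws"]

# Fraction of ScaLAPACK's energy that PaRSEC saved.
energy_saving = 1.0 - parsec["energy_kws"] / scalapack["energy_kws"]

print(f"speedup:       {speedup:.2f}x")
print(f"energy ratio:  {energy_ratio:.2f}x")
print(f"energy saving: {energy_saving:.1%}")
```

With the slide's numbers, the PaRSEC run is about 1.57 times faster and uses roughly 44% less energy than the ScaLAPACK run.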
PAPI POWER CONTROLLING
READING AND WRITING POWER
RAPL & msr-safe & libmsr
• RAPL – Running Average Power Limit
  • Intel Sandy Bridge or better
  • Models energy at package, DRAM controller, CPU core (PP0), graphics uncore (PP1)
• Providing write access to MSRs can be unsafe
  • Can have large effect on machine
  • To use, you need to make MSRs writeable, capability-executable, static-only executable, paranoid setting in kernel!!!
• msr-safe and libmsr from LLNL
  • msr-safe provides a safer, whitelist-controlled access to MSRs
    • kernel module provides /dev/cpu/*/msr_safe
  • libmsr is a library to simplify access to MSRs
• PAPI libmsr component in PAPI to WRITE values
  • Wraps the RAPL power calls in libmsr
  • Set RAPL power limits over a time-window (two windows)
    • set limit on socket, low, high, time-window
• Collaboration with Barry Rountree (LLNL)
Controlling Power with PAPI-libmsr
[Figure: Using the PAPI libmsr component to read and set power caps on a 2x8-core Xeon E5-2690 Sandy Bridge-EP at 2.9 GHz. X axis: elapsed time (seconds). Y axis: watts; y2 axis: unit work time (seconds). Series: Set/Request Avg Power Cap (watts in 1 sec); Read Power Consumption (watts); Time for Unit Work (seconds on y2 axis). Annotations: initial power consumption; set/write power cap; time taken for work after setting to lowest power (y2 axis); try to set power above maximum; time for work decreases as power increases.]
Usage Scenarios
• Sample usage scenarios
  • When computation requirements will decrease due to communication (I/O-bound phases), the CPU power can be capped temporarily without the overall execution time suffering.
  • We can schedule the critical path of the DAG on fast resources and decrease power consumption on sockets running the rest of the DAG.
LU factorization and DAG
[Figure: Execution traces over time (sec) for SOC 0 CPU 0-3 and SOC 1 CPU 4-7, shown for two runs. Legend: RED: GETRF; BLUE/BROWN: LASWP/TRSM; GREEN: GEMM; BLUE: LASWP.]
• Both socket 0 and 1 running at full power. Note that the panel factorization (red) is long and on the critical path, so there is white space where no tasks can be run on the second socket.
• Slow down socket 1 using RAPL and lock the critical path to socket 0. The gemm tasks (green) take longer, filling out the white space on the second socket. This occurs without any overall loss in time for the full computation.
• Tiled LU (N = 17920 = 224 x 80) using a Sandy Bridge EP (2 sockets, 4 cores/socket)
• Demonstrating running the critical path at a higher speed than other tasks.
  • High power: Both sockets running at full power.
  • Low power: Second socket running at low power; critical path (panel) on first socket.
  • Total energy used by processors in Joules: High: 4136 Joules; Low: 4001 Joules
[Figure: Processor power, socket 0+1 (watts) vs. elapsed time (sec). Series: High Power (Total Joules: 2069+2067=4136), High Power Usage (Socket 0), High Power Usage (Socket 1); Low Power (Total Joules: 2634+1371=4001), Low Power Usage (Socket 0), Low Power Usage (Socket 1).]
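
The energy saving implied by these totals can be worked out directly. The sketch below uses only the figures quoted on the slide; the helper name `saving` is hypothetical. Note that the per-socket low-power values in the legend add up to 4005 J rather than the stated 4001 J, so the sketch reports both readings.

```python
# Per-socket energies (joules) and stated totals, copied from the slide.
high_sockets = (2069, 2067)
low_sockets = (2634, 1371)
high_total_stated = 4136
low_total_stated = 4001


def saving(high, low):
    """Return (joules saved, fraction of the high-power energy saved)."""
    return high - low, (high - low) / high


# Saving according to the totals as stated on the slide.
stated_j, stated_frac = saving(high_total_stated, low_total_stated)

# Saving according to the sum of the per-socket values.
summed_j, summed_frac = saving(sum(high_sockets), sum(low_sockets))

print(f"stated totals:     {stated_j} J saved ({stated_frac:.2%})")
print(f"per-socket sums:   {summed_j} J saved ({summed_frac:.2%})")
```

Either way the saving is a little over 3% of the processor energy, obtained without lengthening the run.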
DGEMM Power & Performance: KNC vs. KNL
Knights Corner: 60 cores (4 HW threads/core)
Knights Landing: 68 cores (1 HW thread/core)
[Figure: Two panels for each architecture. DGEMM Performance: Gflop/s (axis 0 to 2400) vs. matrix size (2k to 20k). Accelerator Power Usage (PACKAGE): average power (watts, axis 0 to 280) vs. time (s).]
PAPI COUNTER INSPECTION TOOLKIT
Counter Inspection Toolkit
Define an "accurate mapping" between high-level concepts of performance metrics and the underlying low-level hardware events.
Benchmarks and analyses for:
1. Validating native events
2. Defining high-level events ("pre-defined")
Random Pointer Chasing
[Figure: Benchmark timing, Memory Hierarchy (ig). Average access latency (ns, axis 0 to 26) vs. buffer size (KBytes, 1 to 8192).]
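
For readers unfamiliar with the technique, random pointer chasing can be sketched as follows. This is an illustrative sketch, not the toolkit's implementation: the function names are hypothetical, and Sattolo's algorithm is one common way (assumed here) to build a random chain that visits every slot exactly once per lap. In Python the interpreter overhead dominates the timing, so the printed numbers will not show the cache steps a compiled benchmark would.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm produces a permutation consisting of a single
    cycle, so a chase touches all n slots before it repeats.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` dependent loads; return the last index."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]  # each load depends on the previous one
    return idx


def average_latency_ns(n, steps=100_000):
    """Time a chase over a buffer of n slots, in nanoseconds per access."""
    chain = build_chain(n)
    start = time.perf_counter()
    chase(chain, steps)
    return (time.perf_counter() - start) / steps * 1e9


if __name__ == "__main__":
    for n in (1 << 10, 1 << 14, 1 << 18):
        print(f"{n:>8} slots: {average_latency_ns(n):.1f} ns/access")
```

Because every load depends on the result of the previous one and the order is random, hardware prefetching and overlap cannot hide the latency, which is why sweeping the buffer size exposes the levels of the memory hierarchy.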
Defining high level events (Presets)
| LLC_MISSES |
| Alias for LAST_LEVEL_CACHE_MISSES |
--------------------------------------------------------------------------------
| LAST_LEVEL_CACHE_MISSES |
| This is an alias for L3_LAT_CACHE:MISS |
--------------------------------------------------------------------------------
| L3_LAT_CACHE |
| Core-originated cacheable demand requests to L3 |
| :MISS |
| Core-originated cacheable demand requests missed L3 |
| :REFERENCE |
| Core-originated cacheable demand requests that refer to L3 |
| :e=0 |
| edge level (may require counter-mask >= 1) |
| :i=0 |
| invert |
| :c=0 |
| counter-mask in range [0-255] |
| :t=0 |
| measure any thread |
| :u=0 |
| monitor at user level |
| :k=0 |
| monitor at kernel level |
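
The alias chain in the listing above can be followed mechanically. The sketch below transcribes the two aliases shown for this machine into a dictionary and resolves an event name down to its native event and unit mask; the table layout and the function names `resolve` and `split_event` are illustrative assumptions, not part of PAPI.

```python
# Alias table transcribed from the listing above; the dictionary layout
# and the function names are illustrative assumptions.
ALIASES = {
    "LLC_MISSES": "LAST_LEVEL_CACHE_MISSES",
    "LAST_LEVEL_CACHE_MISSES": "L3_LAT_CACHE:MISS",
}


def resolve(event, aliases=ALIASES):
    """Follow aliases until a name with no further alias is reached."""
    seen = set()
    while event in aliases:
        if event in seen:
            raise ValueError(f"alias cycle at {event}")
        seen.add(event)
        event = aliases[event]
    return event


def split_event(event):
    """Split 'EVENT:UMASK' into (event, umask); umask is None if absent."""
    base, _, umask = event.partition(":")
    return base, umask or None


native = resolve("LLC_MISSES")
print(native)
print(split_event(native))
```

Defining a preset then amounts to choosing, per architecture, which resolved native event (or combination of events) stands behind the high-level name.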
PAPI 6 (a.k.a. PAPI-EX)
System-wide measurements:
• Shared hardware counter support is complex
• limited vendor and kernel support
Counter Inspection Toolkit:
• kernels that stress on-core + shared hardware features
Deeper integration of PAPI for dataflow-based programming models
New Architectures:
• Xeon Phi Knights Landing, Cavium ThunderX, …
3rd Party Tools applying PAPI
• PaRSEC (UTK) http://icl.cs.utk.edu/parsec/
• TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
• PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
• HPCToolkit (Rice University) http://hpctoolkit.org/
• KOJAK and SCALASCA (FZ Juelich, UTK) http://icl.cs.utk.edu/kojak/
• VampirTrace and Vampir (TU Dresden) http://www.vampir.eu
• Open|Speedshop (SGI) http://oss.sgi.com/projects/openspeedshop/
• SvPablo (UNC Renaissance Computing Institute) http://www.renci.org/research/pablo/
• ompP (UTK) http://www.ompp-tool.com