New Developments for PAPI 5.6+
Anthony Danalis, Heike Jagode, and Jack Dongarra (University of Tennessee, Knoxville)
Vince Weaver and Yan Liu (University of Maine)
INTRODUCTION
PART 1: PAPI for Power-Aware Computing
PART 2: PAPI's Counter Inspection Toolkit (CIT)
PART 3: Modernizing PAPI Infrastructure
SUPPORTED ARCHITECTURES
3RD PARTY TOOLS APPLYING PAPI
• PAPI provides a consistent interface (and methodology) for the hardware performance counters found across a compute system: i.e., CPUs, GPUs, on- and off-chip memory, interconnects, I/O system, file system, energy/power, etc.
• PAPI enables software engineers to see, in near real time, the relationship between software performance and hardware events across the entire compute system.
Intel Knights Landing: FLAT mode
• The entire MCDRAM is used as addressable memory ➔ memory allocations are treated similarly to DDR4 memory allocations.
PAPI for power-aware computing
• We use PAPI's latest powercap component for measurement, control, and performance analysis.
• In the past, PAPI power components supported only reading power information.
• New component exposes RAPL functionality to allow users to read and write power.
• Study numerical building blocks of varying computational intensity.
• Use PAPI powercap component to detect power optimization opportunities.
• Cap the power on the architecture to reduce power usage while keeping the execution time constant ➔ energy savings.
Key Concepts
Goal:
• Create a set of micro-benchmarks for illustrating details in hardware events and how they relate to the behavior of the micro-architecture
Target audience:
• Performance-conscious application developers
• PAPI developers working on new architectures (think preset events)
• Developers interested in validating hardware-event counters
QUESTION
How many branch instructions will these codes execute per iteration?
The expectation is that both codes will execute 2.
The measured answer is 2 for the first and 2.5 for the second!
Can you guess why?
Improved PAPI Test Infrastructure
• The existing PAPI test suite is used to verify the correctness of PAPI before release.
• The hardware and operating systems used by PAPI are always changing, and some of the existing tests were outdated or gave false negatives.
• Existing tests were checked to ensure accurate results on modern hardware.
• New counter validation tests were created which should provide a sanity check when bringing up support for a new processor architecture.
Low-Overhead PAPI_read() support
• Traditionally, PAPI_read() counter reads went through the standard Linux read() system call, which can be slow (around 1,000 cycles).
• x86 hardware supports the userspace rdpmc instruction, which bypasses the kernel and takes around 200 cycles (a 5× speedup).
• Various bugs in the Linux kernel around this interface were found and fixed so that rdpmc can be enabled by default.
Enhanced Sampling Interface
• PAPI currently has a limited counter sampling interface that only allows gathering the instruction pointer at regular intervals.
• Modern processors support much richer sampling information, including the cause of cache misses, where in the cache hierarchy the miss happened, and the cycles taken.
• We extend the PAPI sampling interface to provide this additional sampling information.
Heike Jagode, Azzam Haidar, Phil Vaccaro, Asim YarKhan, and Stan Tomov
ACKNOWLEDGMENTS This material is based upon work supported in part by the National Science Foundation (NSF) under award No. 1450429. A portion of this research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.
Vince Weaver and Yan Liu
Anthony Danalis, Heike Jagode, and Hanumanthrayappa
Power awareness for Level-3 BLAS: DGEMM
68-core KNL, Peak DP = 2,662 Gflop/s; Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s
• DGEMM is run repeatedly for a fixed matrix size of 18K per iteration.
• For each iteration, we dropped the power in steps of 10 or 20 Watts.
• Powercap achieves its power limiting through DVFS; the frequency curve drops in step with each decrease in power.
Power awareness for Level-2 BLAS: DGEMV
68-core KNL, Peak DP = 2,662 Gflop/s; Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s
• DGEMV is run repeatedly for a fixed matrix size of 18K per iteration.
• For each iteration, we dropped the power in steps of 10 or 20 Watts.
• Frequency is not affected until the power caps kick in.
Power awareness for applications: Jacobi
• Solving the Helmholtz equation with the Jacobi iterative method (grid size 12,800 x 12,800, 2D 5-point stencil)
➔ requires multiple memory accesses per update
➔ high memory bandwidth and low computational intensity
Lesson for DGEMM-type operations (compute intensive):
• Both data allocation schemes provide roughly the same performance. This is caused by the high ratio of FLOPs to bytes in compute-intensive routines, which allows for as much cache reuse as possible while hiding the data transfer between memory and cache levels.
• Performance drops with decreasing power caps, which is expected for compute-intensive kernels.
Lesson for DGEMV-type operations (memory bound):
• Performance varies between the two storage options: data allocation in MCDRAM results in much higher performance for memory-bound kernels (BW is 425 GB/s for MCDRAM vs. 90 GB/s for DDR4).
• DGEMV is bound by the reading of matrix A. Depending on its storage location, performance and power may be affected.
• We see lower package power usage but higher DDR4 power usage.
• Capping at 140 Watts improves energy efficiency by ~10%.
• Capping at 120 Watts results in 17% energy savings with some performance degradation (8%).
• Capping at 180 Watts yields ~15% energy savings with a negligible performance penalty (less than 5%).
Lesson for Jacobi iteration:
• Computation is about 3.5X faster when the data is allocated in MCDRAM compared to DDR4.
• MCDRAM: Capping at 170 Watts improves energy efficiency by ~14% without any loss in time to solution.
• DDR4: capping at 135 Watts improves energy efficiency by ~25% without any loss in time to solution.
Power awareness for applications: Lattice-Boltzmann
• Lattice-Boltzmann simulation of computational fluid dynamics (from the SPEC 2006 benchmark suite)
➔ high memory bandwidth and low computational intensity
Lesson for Lattice-Boltzmann:
• Computation is about 3.6X faster when the data is allocated in MCDRAM compared to DDR4.
• For MCDRAM, capping at 190 Watts results in energy savings of ~6% without loss of time to solution.
• For DDR4, capping at 130 Watts improves energy efficiency by ~12%.
Native Event Characterization
Highlighting non-obvious behavior
• ARM Cortex A8, A9, A15; ARM64
• Cray: Gemini and Aries interconnect, power
• Blue Gene Series; Q: 5-D Torus, I/O system, EMON power, energy
• NVIDIA Tesla, Kepler: CUDA support for multiple GPUs; PC sampling; NVML
• Virtual environments
• IBM Power Series
• Intel Westmere, Sandy/Ivy Bridge, Haswell, Broadwell, Skylake(-X), Kaby Lake
• RAPL (power/energy), power capping
PaRSEC (UTK): http://icl.utk.edu/parsec/
HPCToolkit (Rice University): http://hpctoolkit.org/
TAU (University of Oregon): http://www.cs.uoregon.edu/research/tau/
Scalasca (FZ Juelich, TU Darmstadt): http://scalasca.org/
Vampir (TU Dresden): http://www.vampir.eu/
PerfSuite (NCSA): http://perfsuite.ncsa.uiuc.edu/
Open|SpeedShop (SGI): http://oss.sgi.com/projects/openspeedshop/
SvPablo (RENCI at UNC): http://www.renci.org/research/pablo/
ompP (UTK): http://www.ompp-tool.com/
Score-P: http://score-p.org/
• KNC, Knights Landing, including power/energy
[Figures: CIT micro-benchmark plots of average count per 100 accesses vs. buffer size in KB. Panels: L1/L2/L3 HIT counts and L1/L2/L3 MISS counts (buffer sizes 16 KB to 65536 KB); access unit sizes of 64B, 128B, and 256B (16 KB to 4096 KB); and ppb = 4, 8, 16, 32, 33, 34, 35 (256 KB to 3840 KB).]
TDP = 215W
Boxplot showing read latency for various versions of PAPI and the large improvement by using rdpmc
Comparison of historical performance counter interfaces (perfmon2, perfctr) showing that perf_event rdpmc matches even the best historical interface.
[Figures: DGEMM power-capping runs with data in DDR4 and in MCDRAM. Each panel plots average power (Watts) and frequency (MHz) over time (~110 s), showing accelerator (package) power usage, memory (DDR4) power usage, and the max power limit set per step. Per-step annotations give performance in Gflop/s, Gflops/Watt, and Joules: one panel ranges from 1991 Gflop/s / 8.22 Gflops/W / 1383 J at the highest cap down to 575 / 4.64 / 2489 at the lowest; the other from 1997 / 8.82 / 1303 down to 560 / 4.72 / 2443.]
[Figures: DGEMV power-capping runs with data in DDR4 and in MCDRAM. Each panel plots average power (Watts) and frequency (MHz) over time, showing accelerator (package) power usage, memory (DDR4) power usage, and the max power limit set per step; per-step annotations give performance in Gflop/s, achieved bandwidth in GB/s, Gflops/Watt, and Joules.]
[Figures: Jacobi average power (Watts) over time under power caps of 215/200/180/160/140/120 Watts. DDR4 allocation: 2770, 2878, 2689, 2438, 2137, and 2731 joules at those caps, respectively. MCDRAM allocation: 826, 842, 724, 803, 1066, and 1865 joules.]
[Figures: Lattice-Boltzmann average power (Watts) over time under power caps of 270/215/200/180/160/140/120 Watts. DDR4 allocation: 9064, 9196, 9194, 9178, 9005, 8084, and 8467 joules at those caps, respectively. MCDRAM allocation: 3050, 2981, 2872, 3043, 3383, 4033, and 5734 joules.]