m page scicomp14 hd3d
Benchmark Power5 vs. Power5+
Revisiting Power5/Power5+ Performance Differences
Mike Page
ScicomP 14
Poughkeepsie, New York
May 23, 2008
NCAR/CISL/HSS/CSG
Consulting Services Group
[email protected]@ucar.edu
Model Performance: Bluevista vs. Blueice

MODEL       PROCS   BV         BL         FAST   100*BL/BV   DIFF %
WRF             1   19.8319    17.6361    BL         88.92   -11.07
WRF             2   9.8538     9.5474     BL         96.89    -3.10
WRF             4   4.9209     4.8376     BL         98.30    -1.69
WRF             8   2.6308     2.5783     BL         98.00    -1.99
WRF            16   1.3584     1.407      BV        103.57     3.57
WRF            32   0.6736     0.7274     BV        107.98     7.98
WRF            64   0.3849     0.3934     BV        102.20     2.20
WRF           128   0.2316     0.239      BV        103.19     3.19
WRF           256   0.158      0.1466     BL         92.78    -7.21
hd3D            8   1.7917     1.7028     BL         95.03    -4.96
hd3D           16   0.9408     0.9459     BV        100.54     0.54
hd3D           32   0.5261     0.5655     BV        107.48     7.48
hd3D           64   0.3046     0.3961     BV        130.03    30.03
hd3D          128   0.1857     0.2188     BV        117.82    17.82
POP             8   197.11     231.32     BV        117.35    17.35
POP            16   103.96     112.75     BV        108.45     8.45
POP            24   71.83      79.96      BV        111.31    11.31
POP            32   56.8       65.34      BV        115.03    15.03
POP            48   42.2       43.76      BV        103.69     3.69
POP            64   34.96      36.24      BV        103.66     3.66
POP           128   22.29      21.67      BL         97.21    -2.78
cam_waccm      32   1.13       1.15       BL        101.76    -1.76
cam_waccm      64   2.13       2.13       SAME      100        0
cam_waccm     128   3.34       3.12       BV         93.41     6.58
cam_waccm     256   5.61       5.96       BL        106.23    -6.23
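The derived columns of the table follow directly from the two timings. Here is a minimal sketch of that arithmetic, limited to the hd3D rows; the function name `compare` and the dictionary layout are illustrative choices, not part of the original slides. It assumes that a smaller benchmark time is better, which matches the FAST column for WRF, hd3D and POP. The cam_waccm rows appear to use a metric where larger is better, so they are left out here.

```python
def compare(bl_time, bv_time):
    """Return (100*BL/BV, DIFF %, faster system) for one table row.

    Assumes both values are elapsed times, so the smaller one wins.
    """
    ratio = 100.0 * bl_time / bv_time
    diff = ratio - 100.0
    if bl_time < bv_time:
        fast = "BL"
    elif bv_time < bl_time:
        fast = "BV"
    else:
        fast = "SAME"
    return ratio, diff, fast


# hd3D rows from the table: processor count -> (Blueice time, Bluevista time)
hd3d = {
    8: (1.7028, 1.7917),
    16: (0.9459, 0.9408),
    32: (0.5655, 0.5261),
    64: (0.3961, 0.3046),
    128: (0.2188, 0.1857),
}

for procs, (bl, bv) in hd3d.items():
    ratio, diff, fast = compare(bl, bv)
    print(f"hd3D {procs:4d}  100*BL/BV={ratio:7.2f}  DIFF%={diff:6.2f}  FAST={fast}")
```

The printed values can differ from the table in the last digit because the slide seems to truncate rather than round.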
A Graphical Look at the Data
[Figure: four panels (CAM_WACCM, POP, HD3D, WRF), each plotting ICESS Benchmark Time (sec) against Processor Count for Blueice and Bluevista.]
A Graphical Look at the Data: Blueice/Bluevista performance
[Figure: Ratio of ICESS Benchmark Times (0.800 to 1.400) vs. Processor Count (0 to 256) for cam_waccm, POP, hd3D and WRF, with an "Equal" reference line and regions labeled "Blueice Faster" and "Bluevista Faster".]
Huang/Ghosh
• Analysis of POP performance
  • Varied run configuration (ptile) for POP
• Conclusions
  • POP performance on Blueice improves by 13% when nodes are undersubscribed
    • Undersubscription uses only 8 of 16 processors on a Blueice node
    • Undersubscription avoids sharing L3 cache
  • POP performance on Blueice exceeds that on Bluevista if Blueice nodes are undersubscribed
  • POP Blueice vs. fully-subscribed-Bluevista performance difference is mainly due to L2 cache misses
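Undersubscription trades nodes for cache: the same task count spreads over twice as many nodes. The sketch below shows that bookkeeping, assuming that "ptile" means tasks placed per node and using the 16 processors per Blueice node stated above. The helper name `nodes_needed` is hypothetical.

```python
import math


def nodes_needed(tasks, ptile):
    """Number of nodes required when at most `ptile` tasks are placed on each node."""
    if ptile <= 0:
        raise ValueError("ptile must be positive")
    return math.ceil(tasks / ptile)


PROCS_PER_NODE = 16  # Blueice node size, as stated on the slide

for tasks in (32, 64, 128):
    full = nodes_needed(tasks, PROCS_PER_NODE)        # fully subscribed
    half = nodes_needed(tasks, PROCS_PER_NODE // 2)   # undersubscribed: 8 of 16
    print(f"{tasks:4d} tasks: {full} nodes fully subscribed, {half} nodes undersubscribed")
```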
But here’s what caught my interest!
Model Performance: Bluevista vs. Blueice (table repeated from the earlier slide)
[Figure: Blueice/Bluevista ratio of ICESS Benchmark Time (0.800 to 1.400) vs. Processor Count (0 to 256) for cam_waccm, POP, hd3D and WRF, with an "Equal" reference line and regions labeled "Blueice Faster" and "Bluevista Faster".]
hd3D shows the largest performance variation of all the apps in the ICESS suite
hd3D needed to be studied too
With the new Power 5+ system and an AIX upgrade there were many new factors that could affect performance:
• SMT
• Varied page sizes
• Processor Binding
hd3D Details
HD3D is a pseudospectral three-dimensional periodic hydrodynamic/magnetohydrodynamic/Hall-MHD turbulence model.
The results presented here are derived from a numerical solution of the incompressible Navier-Stokes equations in 3 dimensions with periodic boundary conditions on a 256 x 256 x 256 grid.
hd3D uses a pseudo-spectral method to compute spatial derivatives, while an adjustable-order Runge-Kutta method is used to evolve the system in the time domain.
This benchmark does a free-decay simulation of Taylor-Green vortices.
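To illustrate what "pseudo-spectral" means for spatial derivatives, here is a minimal one-dimensional sketch using NumPy: transform to Fourier space, multiply by i*k, and transform back. It is not taken from hd3D, which works in three dimensions; the function name `spectral_derivative`, the test function sin(3x) and the domain length 2*pi are all assumptions chosen for illustration.

```python
import numpy as np


def spectral_derivative(u, length=2.0 * np.pi):
    """Derivative of a periodic, uniformly sampled signal via the FFT."""
    n = u.size
    # Angular wavenumbers for a periodic domain of the given length.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    # Differentiation in physical space is multiplication by i*k in Fourier space.
    return np.real(np.fft.ifft(1j * k * np.fft.fft(u)))


n = 256  # same number of points per dimension as the benchmark grid
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
u = np.sin(3.0 * x)

du = spectral_derivative(u)
max_error = np.max(np.abs(du - 3.0 * np.cos(3.0 * x)))
print(f"max error vs. analytic derivative: {max_error:.2e}")
```

For a smooth periodic function like this one, the error should sit near machine precision, which is the usual motivation for spectral methods in turbulence codes.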
A Closer Look at hd3D - Bluevista runs
[Figure: Bluevista - Single Core, Private L3 (smt degrades performance). ICESS Benchmark Time (sec) vs. Processor Count (32 to 128); series: non-smt, smt, Huang_Ghosh.]
A Closer Look at hd3D - Blueice runs
[Figure: Blueice HD3D (Shared L3) (smt degrades performance). ICESS Benchmark (sec) vs. Processor Count (32 to 128); series: non-smt, smt, Huang_Ghosh.]
Huang/Ghosh Run Configurations
Model Performance: Bluevista vs. Blueice

MODEL       PROCS   BV        BL        SMT?            FAST   DIFF %
WRF            32   0.6736    0.7274    2TPP            BV       7.98
WRF            64   0.3849    0.3934    2TPP            BV       2.20
WRF           128   0.2316    0.239     2TPP            BV       3.19
hd3D           32   0.5261    0.5655    1TPP            BV       7.48
hd3D           64   0.3046    0.3961    1TPP            BV      30.03
hd3D          128   0.1857    0.2188    1TPP            BV      17.82
POP            32   56.8      65.34     2TPP            BV      15.03
POP            48   42.2      43.76     2TPP            BV       3.69
POP            64   34.96     36.24     2TPP            BV       3.66
POP           128   22.29     21.67     1TPP            BL      -2.78
cam_waccm      32   1.13      1.15      2TPP (16 OMP)   BL      -1.76
cam_waccm      64   2.13      2.13      2TPP (16 OMP)   SAME     0
cam_waccm     128   3.34      3.12      2TPP (16 OMP)   BV       6.58
Give up on SMT for hd3D, look at shared L3 effects
A Closer Look at hd3D - Blueice runs
[Figure: Blueice HD3D (no smt), "Private L3 improves performance". ICESS Benchmark (sec) vs. Processor Count (32 to 128); series: Private L3, Shared L3, Huang/Ghosh.]
This supports the conclusion that undersubscribing Blueice nodes, making the L3 cache private, improves Blueice performance.
A Closer Look at hd3D
• Undersubscribing Blueice improves hd3D performance. It is close to, but still slower than, Bluevista.
• For hd3D, Bluevista still outperforms Blueice with ~6% difference for 32 and 64 processors and ~2% for 128.
• While Blueice POP is 13% faster than Bluevista POP on 16 logical(?) cpus, hd3D shows the opposite behavior.
[Figure: Bluevista (no smt) vs. Blueice (no smt, private L3). ICESS Benchmark Time (sec) vs. Processor Count (32 to 128); series: Bluevista, Blueice.]
A Closer Look at hd3D - memory
[Figure: HD3D Memory Footprint. Mem Req per Task (Mb), axis range 20 to 70, vs. Processor Count (32 to 128).]
A Closer Look at hd3D
Two issues to investigate:
- 1TPP/2TPP differences
- Blueice/Bluevista 1TPP differences
[Figure (repeated): Bluevista - Single Core, Private L3 (smt degrades performance). ICESS Benchmark Time (sec) vs. Processor Count; series: non-smt, smt, Huang_Ghosh.]
[Figure (repeated): Bluevista (no smt) vs. Blueice (no smt, private L3). ICESS Benchmark Time (sec) vs. Processor Count; series: Bluevista, Blueice.]
A Closer Look at hd3D
CPI Breakdown Analysis
• Uses multiple Hardware Performance Counters on the processor to:
  • Track processor cycles required to complete a given workload
    • hd3D computational kernel with hpmcount API
  • Track events in processor core
  • Track events in the memory subsystem
• 17 counters required for Power5/Power5+ CPI Breakdown

PM_1PLUS_PPC_CMPL
PM_CMPLU_STALL_FPU            PM_GCT_NOSLOT_BR_MPRED
PM_CMPLU_STALL_FDIV           PM_GCT_NOSLOT_SRQ_FULL
PM_CMPLU_STALL_DIV            PM_GCT_NOSLOT_IC_MISS
PM_CMPLU_STALL_FXU            PM_GCT_NOSLOT_CYC
PM_CMPLU_STALL_ERAT_MISS      PM_GRP_CMPL
PM_CMPLU_STALL_DCACHE_MISS    PM_RUN_CYC
PM_CMPLU_STALL_REJECT         PM_INST_CMPL
PM_CMPLU_STALL_LSU            PM_IOPS_CMPL
A Closer Look at hd3D 2TPP/1TPP performance differences
[Figure: Bluevista HPM Counters, hd3D on 32 cpus. Bar chart of raw counts (0 to 3.0E+12) for the 17 CPI-breakdown counters, PM_IOPS_CMPL through PM_1PLUS_PPC_CMPL; series: nosmt, smt.]
A Closer Look at hd3D 2TPP/1TPP performance differences
Ratio of smt counters to non-smt counters - Bluevista

Ratios: smt/nosmt             32 tasks   64 tasks   128 tasks
PM_IOPS_CMPL                      0.94       0.96        1.02
PM_INST_CMPL                      0.94       0.96        1.02
PM_RUN_CYC                        1.67       1.66        1.67
PM_GRP_CMPL                       0.94       0.95        1.03
PM_GCT_NOSLOT_CYC                 3.21       2.72        2.94
PM_GCT_NOSLOT_IC_MISS             1.61       1.74        1.71
PM_GCT_NOSLOT_SRQ_FULL         5101.47    2380.99      470.68
PM_GCT_NOSLOT_BR_MPRED            3.67       3.27        3.78
PM_CMPLU_STALL_LSU                2.48       2.62        2.28
PM_CMPLU_STALL_REJECT             2.82       3.90        3.21
PM_CMPLU_STALL_DCACHE_MISS        2.42       2.36        2.21
PM_CMPLU_STALL_ERAT_MISS          2.32       2.67        3.06
PM_CMPLU_STALL_FXU                1.43       1.48        1.61
PM_CMPLU_STALL_DIV                1.13       1.20        1.17
PM_CMPLU_STALL_FDIV               1.15       1.30        1.11
PM_CMPLU_STALL_FPU                2.11       2.15        2.05

pmlist -d -c 2,244
PM_LSU_SRQ_FULL_CYC,Cycles SRQ full
Cycles the Store Request Queue is full.
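One quick way to read a ratio table like this is to sort it. The sketch below ranks the 32-task smt/nosmt ratios copied from the table, largest first. The variable names are illustrative, and the ranking says nothing about how much each counter matters in absolute cycles.

```python
# smt/nosmt ratios for 32 tasks on Bluevista, copied from the table
ratios_32 = {
    "PM_IOPS_CMPL": 0.94,
    "PM_INST_CMPL": 0.94,
    "PM_RUN_CYC": 1.67,
    "PM_GRP_CMPL": 0.94,
    "PM_GCT_NOSLOT_CYC": 3.21,
    "PM_GCT_NOSLOT_IC_MISS": 1.61,
    "PM_GCT_NOSLOT_SRQ_FULL": 5101.47,
    "PM_GCT_NOSLOT_BR_MPRED": 3.67,
    "PM_CMPLU_STALL_LSU": 2.48,
    "PM_CMPLU_STALL_REJECT": 2.82,
    "PM_CMPLU_STALL_DCACHE_MISS": 2.42,
    "PM_CMPLU_STALL_ERAT_MISS": 2.32,
    "PM_CMPLU_STALL_FXU": 1.43,
    "PM_CMPLU_STALL_DIV": 1.13,
    "PM_CMPLU_STALL_FDIV": 1.15,
    "PM_CMPLU_STALL_FPU": 2.11,
}

# Largest growth under SMT first.
ranked = sorted(ratios_32.items(), key=lambda item: item[1], reverse=True)

for name, ratio in ranked:
    marker = "  <-- grows under smt" if ratio > 2.0 else ""
    print(f"{name:28s} {ratio:9.2f}{marker}")
```

The threshold of 2.0 used for the marker is arbitrary. A very large ratio such as the one for PM_GCT_NOSLOT_SRQ_FULL may simply reflect a near-zero count in the non-smt run.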
A Closer Look at hd3D 2TPP/1TPP performance differences
hd3D Bluevista

ID      Counter or formula            Description             nosmt_32   smt_32
T       PM_RUN_CYC
A       PM_GRP_CMPL                   Completed(A)               0.373    0.210
A1      PM_1PLUS_PPC_CMPL             Completion Cycles(A1)      0.355    0.200
A1A     PM_INST_CMPL/5                Completed(A1A)             0.251    0.142
        A1-A1A                        Overhead(A1B)              0.104    0.058
        A-A1                          Overhead(A2)               0.018    0.011
B       PM_GCT_NOSLOT_CYC             Empty(B)                   0.025    0.048
B1      PM_GCT_NOSLOT_IC_MISS         Miss(B1)                   0.003    0.003
B2      PM_GCT_NOSLOT_BR_MPRED        Mispredict(B2)             0.012    0.027
B3      PM_GCT_NOSLOT_SRQ_FULL                                   0.000    0.000
        B-B1-B2-B3                                               0.009    0.017
C       T-A-B                                                    0.602    0.742
C1      PM_CMPLU_STALL_LSU                                       0.260    0.387
C1A     PM_CMPLU_STALL_REJECT                                    0.086    0.146
C1A1    PM_CMPLU_STALL_ERAT_MISS      Miss(C1A1)                 0.022    0.031
        C1A-C1A1                                                 0.064    0.115
C1B     PM_CMPLU_STALL_DCACHE_MISS    Miss(C1B)                  0.088    0.128
        C1-C1A-C1B                                               0.428    0.469
C2      PM_CMPLU_STALL_FXU                                       0.057    0.049
C2A     PM_CMPLU_STALL_DIV                                       0.015    0.010
        C2-C2A                                                   0.042    0.039
C3      PM_CMPLU_STALL_FPU                                       0.205    0.260
C3A     PM_CMPLU_STALL_FDIV                                      0.046    0.032
        C3-C3A                                                   0.159    0.229
        C-C1-C2-C3                                               0.254    0.287
CPI                                                              3.480    3.638
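Several rows of the breakdown are differences of other rows. The sketch below recomputes a few of them from the nosmt_32 column. It assumes the entries are fractions of PM_RUN_CYC (T), which is suggested by A, B and T-A-B summing to 1 but is not stated on the slide. The function name `derived_rows` is hypothetical.

```python
# nosmt_32 entries from the table, treated as fractions of PM_RUN_CYC (assumption)
nosmt_32 = {
    "A": 0.373,     # PM_GRP_CMPL
    "A1": 0.355,    # PM_1PLUS_PPC_CMPL
    "A1A": 0.251,   # PM_INST_CMPL/5
    "B": 0.025,     # PM_GCT_NOSLOT_CYC
    "C1A": 0.086,   # PM_CMPLU_STALL_REJECT
    "C1A1": 0.022,  # PM_CMPLU_STALL_ERAT_MISS
    "C2": 0.057,    # PM_CMPLU_STALL_FXU
    "C2A": 0.015,   # PM_CMPLU_STALL_DIV
    "C3": 0.205,    # PM_CMPLU_STALL_FPU
    "C3A": 0.046,   # PM_CMPLU_STALL_FDIV
}


def derived_rows(f):
    """Recompute some of the difference rows of the CPI breakdown."""
    return {
        "T-A-B": 1.0 - f["A"] - f["B"],  # completion-stall share; T is 1 by construction
        "A1-A1A": f["A1"] - f["A1A"],
        "A-A1": f["A"] - f["A1"],
        "C1A-C1A1": f["C1A"] - f["C1A1"],
        "C2-C2A": f["C2"] - f["C2A"],
        "C3-C3A": f["C3"] - f["C3A"],
    }


rows = derived_rows(nosmt_32)
for name, value in rows.items():
    print(f"{name:10s} {value:.3f}")
```

Small mismatches in the third decimal against the slide are expected, since the inputs are already rounded.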
A Closer Look at hd3D Blueice/Bluevista 1TPP differences
[Figure: Bv and Bl HPM Counters, nosmt_32, private L3. Bar chart of raw counts (0 to 3.E+12) for the counters PM_IOPS_CMPL through PM_CMPLU_STALL_FPU; series: Bluevista, Blueice.]
A Closer Look at hd3D Blueice/Bluevista 1TPP differences
[Figure: Ratio of PM Counters (axis 0.000 to 6.000), one bar per counter, PM_IOPS_CMPL through PM_CMPLU_STALL_FPU.]

pmlist -p POWER5 -d -c 4,8
PM_CMPLU_STALL_ERAT_MISS,Completion stall caused by ERAT miss
Following a completion stall (any period when no groups completed) the last instruction to finish before completion resumes suffered an ERAT miss. This is a subset of PM_CMPLU_STALL_REJECT.
A Closer Look at hd3D Blueice/Bluevista 1TPP differences
[Table repeated from the 2TPP/1TPP slide: hd3D Bluevista CPI breakdown, smt_32 vs. nosmt_32.]
Conclusions
> Nothing new <
Sharing cache can degrade performance
But lots of questions remain:
• Gathering and processing the data from performance counters was extremely tedious.
  • Is there an easier way?
  • Does difficulty increase exponentially with level of detail?
• Are the Power 5/5+ performance counters accurate?
  • Some say not
  • Eyerman, et al. (ASPLOS, 2004)
• What do the counters mean?
  • Are there expanded references besides pmlist?
  • What is an ERAT miss?
  • What does it say about code performance?
• Will ACTC tools give more info than what’s available via pmlist?
?