m page scicomp14 hd3d

Revisiting Power5/Power5+ Performance Differences

Mike Page
ScicomP 14

Poughkeepsie, New York
May 23, 2008

NCAR/CISL/HSS/CSG
Consulting Services Group

[email protected]@ucar.edu

Model Performance: Bluevista vs. Blueice

BV = Bluevista, BL = Blueice; FAST names the faster system.

MODEL      PROCS  BV        BL        FAST  100*BL/BV  DIFF %
WRF        1      19.8319   17.6361   BL    88.92      -11.07
WRF        2      9.8538    9.5474    BL    96.89      -3.10
WRF        4      4.9209    4.8376    BL    98.30      -1.69
WRF        8      2.6308    2.5783    BL    98.00      -1.99
WRF        16     1.3584    1.407     BV    103.57     3.57
WRF        32     0.6736    0.7274    BV    107.98     7.98
WRF        64     0.3849    0.3934    BV    102.20     2.20
WRF        128    0.2316    0.239     BV    103.19     3.19
WRF        256    0.158     0.1466    BL    92.78      -7.21
hd3D       8      1.7917    1.7028    BL    95.03      -4.96
hd3D       16     0.9408    0.9459    BV    100.54     0.54
hd3D       32     0.5261    0.5655    BV    107.48     7.48
hd3D       64     0.3046    0.3961    BV    130.03     30.03
hd3D       128    0.1857    0.2188    BV    117.82     17.82
POP        8      197.11    231.32    BV    117.35     17.35
POP        16     103.96    112.75    BV    108.45     8.45
POP        24     71.83     79.96     BV    111.31     11.31
POP        32     56.8      65.34     BV    115.03     15.03
POP        48     42.2      43.76     BV    103.69     3.69
POP        64     34.96     36.24     BV    103.66     3.66
POP        128    22.29     21.67     BL    97.21      -2.78
cam_waccm  32     1.13      1.15      BL    101.76     -1.76
cam_waccm  64     2.13      2.13      SAME  100        0
cam_waccm  128    3.34      3.12      BV    93.41      6.58
cam_waccm  256    5.61      5.96      BL    106.23     -6.23
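The two derived columns follow directly from the timings. The sketch below assumes that DIFF % is simply 100*BL/BV minus 100, which matches the WRF, hd3D, and POP rows to within rounding; the function name `compare` is illustrative only and is not taken from the slides.

```python
def compare(bl_time, bv_time):
    """Return (100*BL/BV, DIFF %) for one benchmark row.

    Assumption: DIFF % = 100*BL/BV - 100, so a negative value means
    Blueice (BL) finished faster than Bluevista (BV).
    """
    ratio = 100.0 * bl_time / bv_time
    return ratio, ratio - 100.0


def faster_system(diff):
    """Name the faster system from the sign of DIFF %."""
    if diff == 0:
        return "SAME"
    return "BL" if diff < 0 else "BV"


# Timings copied from the table: hd3D on 64 processors, WRF on 1.
for label, bl, bv in [("hd3D 64", 0.3961, 0.3046), ("WRF 1", 17.6361, 19.8319)]:
    ratio, diff = compare(bl, bv)
    print(f"{label}: 100*BL/BV = {ratio:.2f}, DIFF % = {diff:+.2f}, "
          f"faster = {faster_system(diff)}")
```

Recomputed values can differ from the slide in the last digit, because the slide's percentages appear to be truncated rather than rounded.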

A Graphical Look at the Data

[Figure: four panels (CAM_WACCM, POP, HD3D, WRF), each plotting ICESS Benchmark Time (sec) against Processor Count for Blueice and Bluevista.]

A Graphical Look at the Data: Blueice/Bluevista performance

[Figure: Ratio of ICESS Benchmark Times (0.800 to 1.400) against Processor Count (0 to 256) for cam_waccm, POP, hd3D, and WRF, with an "Equal" reference line separating the "Blueice Faster" and "Bluevista Faster" regions.]

Huang/Ghosh

• Analysis of POP performance
• Varied run configuration (ptile) for POP
• Conclusions
  • POP performance on Blueice improves by 13% when nodes are undersubscribed
  • Undersubscription uses only 8 of 16 processors on a Blueice node
  • Undersubscription avoids sharing L3 cache
  • POP performance on Blueice exceeds that on Bluevista if Blueice nodes are undersubscribed
  • POP Blueice vs. fully-subscribed-Bluevista performance difference is mainly due to L2 cache misses
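To make the undersubscription trade-off concrete, here is a small sketch of the node arithmetic. It assumes that ptile means the number of tasks placed on each node and uses the 16-processor Blueice node quoted above; the function name `nodes_needed` is hypothetical.

```python
import math

BLUEICE_CPUS_PER_NODE = 16  # from the slide: undersubscription uses 8 of 16


def nodes_needed(tasks, ptile):
    """Nodes required when at most `ptile` tasks are placed on each node."""
    if not 1 <= ptile <= BLUEICE_CPUS_PER_NODE:
        raise ValueError("ptile must be between 1 and the node's processor count")
    return math.ceil(tasks / ptile)


# Fully subscribed (ptile=16) versus undersubscribed (ptile=8) for 64 tasks.
for ptile in (16, 8):
    nodes = nodes_needed(64, ptile)
    idle = nodes * BLUEICE_CPUS_PER_NODE - 64
    print(f"64 tasks, ptile={ptile}: {nodes} nodes, {idle} processors left idle")
```

Halving ptile doubles the node count for the same number of tasks, which is the price paid for avoiding the shared L3 cache.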

But here’s what caught my interest!


[Figure: Blueice/Bluevista ratio of ICESS benchmark times (0.800 to 1.400) against Processor Count (0 to 256) for cam_waccm, POP, hd3D, and WRF, with an "Equal" reference line separating the "Blueice Faster" and "Bluevista Faster" regions.]

hd3D shows the largest performance variation of all the apps in the ICESS suite

hd3D needed to be studied too

With the new Power 5+ system and an AIX upgrade there were many new factors that could affect performance:

• SMT
• Varied page sizes
• Processor Binding

hd3D Details

HD3D is a pseudospectral three-dimensional periodic hydrodynamic/magnetohydrodynamic/Hall-MHD turbulence model.

The results presented here are derived from a numerical solution of the incompressible Navier-Stokes equations in 3 dimensions with periodic boundary conditions on a 256 x 256 x 256 grid.

hd3D uses a pseudo-spectral method to compute spatial derivatives, while an adjustable-order Runge-Kutta method is used to evolve the system in the time domain.

This benchmark does a free-decay simulation of Taylor-Green vortices.
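As an illustration of the pseudo-spectral idea behind hd3D's spatial derivatives, the sketch below differentiates a periodic one-dimensional signal by transforming it, multiplying each Fourier mode by i*k, and transforming back. It is a toy under stated assumptions and not hd3D's code: it is one-dimensional, uses a naive O(N^2) DFT from the Python standard library rather than an FFT, and every function name is made up for the example.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform, adequate for a small demo."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for j, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * math.pi * j * k / n)
        out.append(total / n if inverse else total)
    return out


def spectral_derivative(samples, length=2 * math.pi):
    """Differentiate periodic samples by scaling each Fourier mode by i*k."""
    n = len(samples)
    coeffs = dft(samples)
    scaled = []
    for idx, c in enumerate(coeffs):
        # Map the DFT index to a signed mode number.
        m = idx if idx < n / 2 else idx - n
        # The Nyquist mode has no well-defined derivative; zero it.
        if n % 2 == 0 and idx == n // 2:
            m = 0
        k = 2 * math.pi * m / length
        scaled.append(1j * k * c)
    return [z.real for z in dft(scaled, inverse=True)]


# Check against a known answer: d/dx sin(x) = cos(x) on a periodic grid.
n = 16
xs = [2 * math.pi * j / n for j in range(n)]
deriv = spectral_derivative([math.sin(x) for x in xs])
max_err = max(abs(d - math.cos(x)) for d, x in zip(deriv, xs))
print(f"max error vs cos(x): {max_err:.2e}")
```

For a smooth periodic field the derivative is accurate to roundoff even on a coarse grid, which is the usual motivation for spectral methods in turbulence codes.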

A Closer Look at hd3D - Bluevista runs

[Figure: Bluevista - Single Core, Private L3 (smt degrades performance). Curves for non-smt, smt, and Huang_Ghosh runs plotted against Processor Count (32 to 128).]

A Closer Look at hd3D - Blueice runs

[Figure: Blueice HD3D (Shared L3) (smt degrades performance). ICESS Benchmark (sec) against Processor Count (32 to 128) for non-smt, smt, and Huang_Ghosh runs.]

Huang/Ghosh Run Configurations
Model Performance: Bluevista vs. Blueice

MODEL      PROCS  BV      BL      SMT?           FAST  DIFF %
WRF        32     0.6736  0.7274  2TPP           BV    7.98
WRF        64     0.3849  0.3934  2TPP           BV    2.20
WRF        128    0.2316  0.239   2TPP           BV    3.19
hd3D       32     0.5261  0.5655  1TPP           BV    7.48
hd3D       64     0.3046  0.3961  1TPP           BV    30.03
hd3D       128    0.1857  0.2188  1TPP           BV    17.82
POP        32     56.8    65.34   2TPP           BV    15.03
POP        48     42.2    43.76   2TPP           BV    3.69
POP        64     34.96   36.24   2TPP           BV    3.66
POP        128    22.29   21.67   1TPP           BL    -2.78
cam_waccm  32     1.13    1.15    2TPP (16 OMP)  BL    -1.76
cam_waccm  64     2.13    2.13    2TPP (16 OMP)  SAME  0
cam_waccm  128    3.34    3.12    2TPP (16 OMP)  BV    6.58

Give up on SMT for hd3D, look at shared L3 effects

A Closer Look at hd3D - Blueice runs

[Figure: Blueice HD3D (no smt), "Private L3 improves performance". ICESS Benchmark (sec) against Processor Count (32 to 128) for Private L3, Shared L3, and Huang/Ghosh runs.]

This supports the conclusion that undersubscribing Blueice nodes, which makes the L3 cache private, improves Blueice performance.

A Closer Look at hd3D

• Undersubscribing Blueice improves hd3D performance. It is close to, but still slower than, Bluevista.

• For hd3D, Bluevista still outperforms Blueice with ~6% difference for 32 and 64 processors and ~2% for 128.

• While Blueice POP is 13% faster than Bluevista POP on 16 logical(?) cpus, hd3D shows the opposite behavior.

[Figure: Bluevista (no smt) vs. Blueice (no smt, private L3), plotted against Processor Count (32 to 128).]

A Closer Look at hd3D - memory

[Figure: HD3D Memory Footprint. Mem Req per Task (MB, axis from 20 to 70) against Processor Count (32 to 128).]

A Closer Look at hd3D

Two issues to investigate:
- 1TPP/2TPP differences
- Blueice/Bluevista 1TPP differences

[Figure: Bluevista - Single Core, Private L3 (smt degrades performance). Curves for non-smt, smt, and Huang_Ghosh runs plotted against Processor Count (32 to 128).]

[Figure: Bluevista (no smt) vs. Blueice (no smt, private L3), plotted against Processor Count (32 to 128).]

A Closer Look at hd3D - CPI Breakdown Analysis

• Uses multiple Hardware Performance Counters on the processor to:
  • Track processor cycles required to complete a given workload
    • hd3D computational kernel with hpmcount API
  • Track events in processor core
  • Track events in the memory subsystem
• 17 counters required for Power5/Power5+ CPI Breakdown

PM_1PLUS_PPC_CMPL
PM_CMPLU_STALL_FPU
PM_CMPLU_STALL_FDIV
PM_CMPLU_STALL_DIV
PM_CMPLU_STALL_FXU
PM_CMPLU_STALL_ERAT_MISS
PM_CMPLU_STALL_DCACHE_MISS
PM_CMPLU_STALL_REJECT
PM_CMPLU_STALL_LSU
PM_GCT_NOSLOT_BR_MPRED
PM_GCT_NOSLOT_SRQ_FULL
PM_GCT_NOSLOT_IC_MISS
PM_GCT_NOSLOT_CYC
PM_GRP_CMPL
PM_RUN_CYC
PM_INST_CMPL
PM_IOPS_CMPL

A Closer Look at hd3D 2TPP/1TPP performance differences

[Figure: Bluevista HPM Counters, hd3D on 32 cpus. Raw counts (0.0E+00 to 3.0E+12) for each of the 17 CPI-breakdown counters, nosmt vs. smt.]

A Closer Look at hd3D 2TPP/1TPP performance differences

Ratio of smt counters to non-smt counters - Bluevista

Ratios: smt/nosmt           32 tasks  64 tasks  128 tasks
PM_IOPS_CMPL                0.94      0.96      1.02
PM_INST_CMPL                0.94      0.96      1.02
PM_RUN_CYC                  1.67      1.66      1.67
PM_GRP_CMPL                 0.94      0.95      1.03
PM_GCT_NOSLOT_CYC           3.21      2.72      2.94
PM_GCT_NOSLOT_IC_MISS       1.61      1.74      1.71
PM_GCT_NOSLOT_SRQ_FULL      5101.47   2380.99   470.68
PM_GCT_NOSLOT_BR_MPRED      3.67      3.27      3.78
PM_CMPLU_STALL_LSU          2.48      2.62      2.28
PM_CMPLU_STALL_REJECT       2.82      3.90      3.21
PM_CMPLU_STALL_DCACHE_MISS  2.42      2.36      2.21
PM_CMPLU_STALL_ERAT_MISS    2.32      2.67      3.06
PM_CMPLU_STALL_FXU          1.43      1.48      1.61
PM_CMPLU_STALL_DIV          1.13      1.20      1.17
PM_CMPLU_STALL_FDIV         1.15      1.30      1.11
PM_CMPLU_STALL_FPU          2.11      2.15      2.05

pmlist -d -c 2,244
PM_LSU_SRQ_FULL_CYC,Cycles SRQ full
Cycles the Store Request Queue is full.

A Closer Look at hd3D 2TPP/1TPP performance differences

hd3D Bluevista

Label  Counter or formula          Description             nosmt_32  smt_32
T      PM_RUN_CYC
A      PM_GRP_CMPL                 Completed(A)            0.373     0.210
A1     PM_1PLUS_PPC_CMPL           Completion Cycles(A1)   0.355     0.200
A1A    PM_INST_CMPL/5              Completed(A1A)          0.251     0.142
       A1-A1A                      Overhead(A1B)           0.104     0.058
       A-A1                        Overhead(A2)            0.018     0.011
B      PM_GCT_NOSLOT_CYC           Empty(B)                0.025     0.048
B1     PM_GCT_NOSLOT_IC_MISS       Miss(B1)                0.003     0.003
B2     PM_GCT_NOSLOT_BR_MPRED      Mispredict(B2)          0.012     0.027
B3     PM_GCT_NOSLOT_SRQ_FULL                              0.000     0.000
       B-B1-B2-B3                                          0.009     0.017
C      T-A-B                                               0.602     0.742
C1     PM_CMPLU_STALL_LSU                                  0.260     0.387
C1A    PM_CMPLU_STALL_REJECT                               0.086     0.146
C1A1   PM_CMPLU_STALL_ERAT_MISS    Miss(C1A1)              0.022     0.031
       C1A-C1A1                                            0.064     0.115
C1B    PM_CMPLU_STALL_DCACHE_MISS  Miss(C1B)               0.088     0.128
       C1-C1A-C1B                                          0.428     0.469
C2     PM_CMPLU_STALL_FXU                                  0.057     0.049
C2A    PM_CMPLU_STALL_DIV                                  0.015     0.010
       C2-C2A                                              0.042     0.039
C3     PM_CMPLU_STALL_FPU                                  0.205     0.260
C3A    PM_CMPLU_STALL_FDIV                                 0.046     0.032
       C3-C3A                                              0.159     0.229
       C-C1-C2-C3                                          0.254     0.287
CPI                                                        3.480     3.638
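Several rows of the breakdown are plain differences of other rows, so they can be recomputed as a sanity check. The sketch below assumes that the tabulated values are fractions of PM_RUN_CYC (the A, B, and T-A-B entries sum to 1.000 in both columns) and covers only a handful of the formula rows; the dictionary layout and the function name are illustrative, not part of any IBM tool.

```python
# Fractions of PM_RUN_CYC (T) for hd3D on Bluevista with 32 tasks,
# copied from the nosmt_32 and smt_32 columns of the breakdown table.
RUNS = {
    "nosmt_32": {"A": 0.373, "A1": 0.355, "A1A": 0.251, "B": 0.025,
                 "C1A": 0.086, "C1A1": 0.022, "C2": 0.057, "C2A": 0.015,
                 "C3": 0.205, "C3A": 0.046},
    "smt_32": {"A": 0.210, "A1": 0.200, "A1A": 0.142, "B": 0.048,
               "C1A": 0.146, "C1A1": 0.031, "C2": 0.049, "C2A": 0.010,
               "C3": 0.260, "C3A": 0.032},
}


def derived_rows(f, total=1.0):
    """Recompute a few formula rows; `total` is T after normalization (assumed 1)."""
    return {
        "T-A-B": total - f["A"] - f["B"],      # completion stall cycles (C)
        "A-A1": f["A"] - f["A1"],
        "A1-A1A": f["A1"] - f["A1A"],
        "C1A-C1A1": f["C1A"] - f["C1A1"],      # rejects not caused by ERAT misses
        "C2-C2A": f["C2"] - f["C2A"],          # FXU stalls other than divides
        "C3-C3A": f["C3"] - f["C3A"],          # FPU stalls other than divides
    }


for name, fractions in RUNS.items():
    rows = derived_rows(fractions)
    print(name, {key: round(value, 3) for key, value in rows.items()})
```

With SMT enabled the stall share T-A-B rises from about 0.60 to about 0.74 of run cycles, which matches the slower SMT timings reported earlier.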

A Closer Look at hd3D Blueice/Bluevista 1TPP differences

[Figure: Bv and Bl HPM Counters, nosmt_32, private L3. Raw counts for the CPI-breakdown counters (PM_IOPS_CMPL through PM_CMPLU_STALL_FPU), Bluevista vs. Blueice.]

[Figure: Ratio of PM Counters (0.000 to 6.000) for the same counters.]

pmlist -p POWER5 -d -c 4,8
PM_CMPLU_STALL_ERAT_MISS,Completion stall caused by ERAT miss
Following a completion stall (any period when no groups completed) the last instruction to finish before completion resumes suffered an ERAT miss. This is a subset of PM_CMPLU_STALL_REJECT.



Conclusions

> Nothing new <

Sharing cache can degrade performance

But lots of questions remain:
• Gathering and processing the data from performance counters was extremely tedious.
  • Is there an easier way?
  • Does difficulty increase exponentially with level of detail?
• Are the Power 5/5+ performance counters accurate?
  • Some say not
  • Eyerman et al. (ASPLOS, 2004)
• What do the counters mean?
  • Are there expanded references besides pmlist?
  • What is an ERAT miss?
  • What does it say about code performance?
• Will ACTC tools give more info than what’s available via pmlist?
