Revisiting Power5/Power5+ Performance Differences. Mike Page, ScicomP 14, Poughkeepsie, New York, May 23, 2008. NCAR/CISL/HSS/CSG Consulting Services Group. [email protected]

Page 1: M Page ScicomP14 Hd3D

Revisiting Power5/Power5+ Performance Differences

Mike Page
ScicomP 14

Poughkeepsie, New York
May 23, 2008

NCAR/CISL/HSS/CSG
Consulting Services Group

[email protected]@ucar.edu

Page 2: M Page ScicomP14 Hd3D

Model Performance: Bluevista vs. Blueice

MODEL       PROCS         BV         BL   FAST   100*BL/BV   DIFF %
WRF             1    19.8319    17.6361   BL         88.92   -11.07
WRF             2     9.8538     9.5474   BL         96.89    -3.10
WRF             4     4.9209     4.8376   BL         98.30    -1.69
WRF             8     2.6308     2.5783   BL         98.00    -1.99
WRF            16     1.3584     1.407    BV        103.57     3.57
WRF            32     0.6736     0.7274   BV        107.98     7.98
WRF            64     0.3849     0.3934   BV        102.20     2.20
WRF           128     0.2316     0.239    BV        103.19     3.19
WRF           256     0.158      0.1466   BL         92.78    -7.21
hd3D            8     1.7917     1.7028   BL         95.03    -4.96
hd3D           16     0.9408     0.9459   BV        100.54     0.54
hd3D           32     0.5261     0.5655   BV        107.48     7.48
hd3D           64     0.3046     0.3961   BV        130.03    30.03
hd3D          128     0.1857     0.2188   BV        117.82    17.82
POP             8   197.11     231.32     BV        117.35    17.35
POP            16   103.96     112.75     BV        108.45     8.45
POP            24    71.83      79.96     BV        111.31    11.31
POP            32    56.8       65.34     BV        115.03    15.03
POP            48    42.2       43.76     BV        103.69     3.69
POP            64    34.96      36.24     BV        103.66     3.66
POP           128    22.29      21.67     BL         97.21    -2.78
cam_waccm      32     1.13       1.15     BL        101.76    -1.76
cam_waccm      64     2.13       2.13     SAME      100        0
cam_waccm     128     3.34       3.12     BV         93.41     6.58
cam_waccm     256     5.61       5.96     BL        106.23    -6.23
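The derived columns can be recomputed from the two timing columns. The sketch below is illustrative only: it assumes DIFF % is simply 100*BL/BV minus 100 and that the shorter time marks the faster system, which matches the WRF, hd3D and POP rows to rounding. The cam_waccm rows label FAST the other way round, so they are left out. The function name `compare` is hypothetical.

```python
# Benchmark times in seconds copied from the table: (BV, BL) per (model, procs).
# BV = Bluevista, BL = Blueice.
times = {
    ("WRF", 1): (19.8319, 17.6361),
    ("hd3D", 64): (0.3046, 0.3961),
    ("POP", 128): (22.29, 21.67),
}


def compare(bv, bl):
    """Return (100*BL/BV, DIFF %, faster system) for one table row.

    Assumption: DIFF % = 100*BL/BV - 100, and lower time means faster.
    """
    ratio = 100.0 * bl / bv
    diff = ratio - 100.0
    if bl < bv:
        faster = "BL"
    elif bv < bl:
        faster = "BV"
    else:
        faster = "SAME"
    return ratio, diff, faster


for (model, procs), (bv, bl) in times.items():
    ratio, diff, faster = compare(bv, bl)
    print(f"{model:5s} {procs:4d}  {ratio:7.2f}  {diff:+7.2f}  {faster}")
```

The recomputed values agree with the slide to within about 0.01, so the tabulated figures appear to be rounded or truncated from more precise timings.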

Page 3: M Page ScicomP14 Hd3D

A Graphical Look at the Data

[Figure: four panels (CAM_WACCM, POP, HD3D, WRF), each plotting ICESS Benchmark Time (sec) against Processor Count for Blueice and Bluevista. Processor Count ranges: CAM_WACCM 32-256, POP 0-128, HD3D 0-144, WRF 0-260.]

Page 4: M Page ScicomP14 Hd3D

A Graphical Look at the Data: Blueice/Bluevista performance

[Figure: Ratio of ICESS Benchmark Times (0.800-1.400) vs. Processor Count (0-256) for cam_waccm, POP, hd3D and WRF, with an "Equal" reference line and regions labeled "Blueice Faster" and "Bluevista Faster".]

Page 5: M Page ScicomP14 Hd3D

Huang/Ghosh

• Analysis of POP performance
  • Varied run configuration (ptile) for POP

• Conclusions
  • POP performance on Blueice improves by 13% when nodes are undersubscribed
    • Undersubscription uses only 8 of 16 processors on a Blueice node
    • Undersubscription avoids sharing L3 cache
  • POP performance on Blueice exceeds that on Bluevista if Blueice nodes are undersubscribed
  • POP Blueice vs. fully-subscribed-Bluevista performance difference is mainly due to L2 cache misses
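Undersubscription trades nodes for cache: using 8 of the 16 processors doubles the node count for a given job size. A minimal sketch of that arithmetic follows, assuming one task per processor and treating "ptile" as the number of tasks placed on each node; the helper name `nodes_needed` is hypothetical.

```python
import math

CORES_PER_NODE = 16  # Blueice node size as stated on the slide


def nodes_needed(tasks, ptile):
    """Nodes required when at most `ptile` tasks are placed on each node."""
    return math.ceil(tasks / ptile)


for tasks in (32, 64, 128):
    full = nodes_needed(tasks, CORES_PER_NODE)  # fully subscribed
    half = nodes_needed(tasks, 8)               # undersubscribed: 8 of 16
    print(f"{tasks:4d} tasks: {full} nodes fully subscribed, {half} undersubscribed")
```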

Page 6: M Page ScicomP14 Hd3D

But here’s what caught my interest!

Model Performance: Bluevista vs. Blueice

[Table: same Bluevista vs. Blueice timings as on Page 2.]

[Figure: Blueice/Bluevista. ICESS Benchmark Time ratio (0.800-1.400) vs. Processor Count (0-256) for cam_waccm, POP, hd3D and WRF, with an "Equal" reference line and regions labeled "Blueice Faster" and "Bluevista Faster".]

hd3D shows the largest performance variation of all the apps in the ICESS suite

Page 7: M Page ScicomP14 Hd3D

hd3D needed to be studied too

With the new Power 5+ system and an AIX upgrade there were many new factors that could affect performance:

• SMT
• Varied page sizes
• Processor Binding

Page 8: M Page ScicomP14 Hd3D

hd3D Details

HD3D is a pseudospectral three-dimensional periodic hydrodynamic/magnetohydrodynamic/Hall-MHD turbulence model.

The results presented here are derived from a numerical solution of the incompressible Navier-Stokes equations in 3 dimensions with periodic boundary conditions on a 256 x 256 x 256 grid.

hd3D uses a pseudo-spectral method to compute spatial derivatives, while an adjustable-order Runge-Kutta method is used to evolve the system in the time domain.

This benchmark does a free-decay simulation of Taylor-Green vortices.
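To illustrate what "pseudo-spectral" means for a spatial derivative, here is a one-dimensional sketch: transform a periodic sample to Fourier space, multiply each coefficient by i*k, and transform back. This is not hd3D's code. hd3D works in three dimensions with FFTs, whereas this sketch uses a naive O(N^2) transform so that it needs only the standard library. The function names are hypothetical.

```python
import cmath
import math


def dft(values, sign):
    """Naive O(N^2) discrete Fourier transform.

    sign=-1 gives the forward transform, sign=+1 the unnormalised inverse.
    """
    n = len(values)
    return [
        sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
            for m, v in enumerate(values))
        for k in range(n)
    ]


def spectral_derivative(samples):
    """Differentiate a periodic function sampled at N equispaced points on [0, 2*pi)."""
    n = len(samples)
    coeffs = dft(samples, -1)
    # Integer wavenumbers in transform order: 0, 1, ..., then the negative ones.
    wavenumbers = [k if k <= (n - 1) // 2 else k - n for k in range(n)]
    if n % 2 == 0:
        wavenumbers[n // 2] = 0  # drop the Nyquist mode for a first derivative
    deriv_coeffs = [1j * k * c for k, c in zip(wavenumbers, coeffs)]
    # Inverse transform, normalised by N; the result is real up to round-off.
    return [(z / n).real for z in dft(deriv_coeffs, +1)]


n = 16
x = [2 * math.pi * m / n for m in range(n)]
u = [math.sin(xm) for xm in x]
du = spectral_derivative(u)
max_err = max(abs(d - math.cos(xm)) for d, xm in zip(du, x))
print(f"max error against cos(x): {max_err:.2e}")
```

For a smooth periodic input such as sin(x) the derivative is recovered to round-off, which is the appeal of the method. The cost is the transform itself, and at 256 x 256 x 256 that makes FFT and memory behavior central to performance.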

Page 9: M Page ScicomP14 Hd3D

A Closer Look at hd3D - Bluevista runs

[Figure: Bluevista - Single Core, Private L3 (smt degrades performance). ICESS Benchmark Time (sec) vs. Processor Count (32-128); series: non-smt, smt, Huang_Ghosh.]

Page 10: M Page ScicomP14 Hd3D

A Closer Look at hd3D - Blueice runs

[Figure: Blueice HD3D (Shared L3) (smt degrades performance). ICESS Benchmark (sec) vs. Processor Count (32-128); series: non-smt, smt, Huang_Ghosh.]

Page 11: M Page ScicomP14 Hd3D

Huang/Ghosh Run Configurations

Model Performance: Bluevista vs. Blueice

MODEL       PROCS        BV        BL   SMT?            FAST   DIFF %
WRF            32    0.6736    0.7274   2TPP            BV       7.98
WRF            64    0.3849    0.3934   2TPP            BV       2.20
WRF           128    0.2316    0.239    2TPP            BV       3.19
hd3D           32    0.5261    0.5655   1TPP            BV       7.48
hd3D           64    0.3046    0.3961   1TPP            BV      30.03
hd3D          128    0.1857    0.2188   1TPP            BV      17.82
POP            32   56.8      65.34     2TPP            BV      15.03
POP            48   42.2      43.76     2TPP            BV       3.69
POP            64   34.96     36.24     2TPP            BV       3.66
POP           128   22.29     21.67     1TPP            BL      -2.78
cam_waccm      32    1.13      1.15     2TPP (16 OMP)   BL      -1.76
cam_waccm      64    2.13      2.13     2TPP (16 OMP)   SAME     0
cam_waccm     128    3.34      3.12     2TPP (16 OMP)   BV       6.58

Give up on SMT for hd3D, look at shared L3 effects

Page 12: M Page ScicomP14 Hd3D

A Closer Look at hd3D - Blueice runs

This supports the conclusion that undersubscribing Blueice nodes, making L3 cache private, improves Blueice performance.

[Figure: Blueice HD3D (no smt), Private L3 improves performance. ICESS Benchmark (sec) vs. Processor Count (32-128); series: Private L3, Shared L3, Huang/Ghosh.]

Page 13: M Page ScicomP14 Hd3D

A Closer Look at hd3D

• Undersubscribing Blueice improves hd3D performance. It is close to, but still slower than, Bluevista.

• For hd3D, Bluevista still outperforms Blueice with ~6% difference for 32 and 64 processors and ~2% for 128.

• While Blueice POP is 13% faster than Bluevista POP on 16 logical(?) cpus, hd3D shows the opposite behavior.

[Figure: Bluevista (no smt) vs. Blueice (no smt, private L3). ICESS Benchmark Time (sec) vs. Processor Count (32-128); series: Bluevista, Blueice.]

Page 14: M Page ScicomP14 Hd3D

A Closer Look at hd3D - memory

[Figure: HD3D Memory Footprint. Mem Req per Task (Mb), axis range 20-70, vs. Processor Count (32-128).]

Page 15: M Page ScicomP14 Hd3D

A Closer Look at hd3D

Two issues to investigate:
- 1TPP/2TPP differences
- Blueice/Bluevista 1TPP differences

[Figure: Bluevista - Single Core, Private L3 (smt degrades performance). ICESS Benchmark Time (sec) vs. Processor Count (32-128); series: non-smt, smt, Huang_Ghosh.]

[Figure: Bluevista (no smt) vs. Blueice (no smt, private L3). ICESS Benchmark Time (sec) vs. Processor Count (32-128); series: Bluevista, Blueice.]

Page 16: M Page ScicomP14 Hd3D

A Closer Look at hd3D

CPI Breakdown Analysis

• Uses multiple Hardware Performance Counters on the processor to:
  • Track processor cycles required to complete a given workload
    • hd3D computational kernel with hpmcount API
  • Track events in processor core
  • Track events in the memory subsystem

• 17 counters required for Power5/Power5+ CPI Breakdown

PM_1PLUS_PPC_CMPL
PM_CMPLU_STALL_FPU            PM_GCT_NOSLOT_BR_MPRED
PM_CMPLU_STALL_FDIV           PM_GCT_NOSLOT_SRQ_FULL
PM_CMPLU_STALL_DIV            PM_GCT_NOSLOT_IC_MISS
PM_CMPLU_STALL_FXU            PM_GCT_NOSLOT_CYC
PM_CMPLU_STALL_ERAT_MISS      PM_GRP_CMPL
PM_CMPLU_STALL_DCACHE_MISS    PM_RUN_CYC
PM_CMPLU_STALL_REJECT         PM_INST_CMPL
PM_CMPLU_STALL_LSU            PM_IOPS_CMPL

Page 17: M Page ScicomP14 Hd3D

A Closer Look at hd3D 2TPP/1TPP performance differences

[Figure: Bluevista HPM Counters, hd3D on 32 cpus. Bar chart of raw counts (0.0E+00 to 3.0E+12) for the 17 CPI-breakdown counters (PM_IOPS_CMPL through PM_1PLUS_PPC_CMPL); series: nosmt, smt.]

Page 18: M Page ScicomP14 Hd3D

A Closer Look at hd3D 2TPP/1TPP performance differences

Ratio of smt counters to non-smt counters - Bluevista

Ratios: smt/nosmt             32 tasks   64 tasks   128 tasks
PM_IOPS_CMPL                      0.94       0.96        1.02
PM_INST_CMPL                      0.94       0.96        1.02
PM_RUN_CYC                        1.67       1.66        1.67
PM_GRP_CMPL                       0.94       0.95        1.03
PM_GCT_NOSLOT_CYC                 3.21       2.72        2.94
PM_GCT_NOSLOT_IC_MISS             1.61       1.74        1.71
PM_GCT_NOSLOT_SRQ_FULL         5101.47    2380.99      470.68
PM_GCT_NOSLOT_BR_MPRED            3.67       3.27        3.78
PM_CMPLU_STALL_LSU                2.48       2.62        2.28
PM_CMPLU_STALL_REJECT             2.82       3.90        3.21
PM_CMPLU_STALL_DCACHE_MISS        2.42       2.36        2.21
PM_CMPLU_STALL_ERAT_MISS          2.32       2.67        3.06
PM_CMPLU_STALL_FXU                1.43       1.48        1.61
PM_CMPLU_STALL_DIV                1.13       1.20        1.17
PM_CMPLU_STALL_FDIV               1.15       1.30        1.11
PM_CMPLU_STALL_FPU                2.11       2.15        2.05

pmlist -d -c 2,244
PM_LSU_SRQ_FULL_CYC, Cycles SRQ full
Cycles the Store Request Queue is full.
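With sixteen ratios per task count, ranking them is a quick way to see which events change most when SMT is switched on. The sketch below sorts the 32-task column from the table above. The values are copied from the slide and the variable names are illustrative.

```python
# smt/nosmt counter ratios for 32 tasks on Bluevista, copied from the table.
ratios_32 = {
    "PM_IOPS_CMPL": 0.94,
    "PM_INST_CMPL": 0.94,
    "PM_RUN_CYC": 1.67,
    "PM_GRP_CMPL": 0.94,
    "PM_GCT_NOSLOT_CYC": 3.21,
    "PM_GCT_NOSLOT_IC_MISS": 1.61,
    "PM_GCT_NOSLOT_SRQ_FULL": 5101.47,
    "PM_GCT_NOSLOT_BR_MPRED": 3.67,
    "PM_CMPLU_STALL_LSU": 2.48,
    "PM_CMPLU_STALL_REJECT": 2.82,
    "PM_CMPLU_STALL_DCACHE_MISS": 2.42,
    "PM_CMPLU_STALL_ERAT_MISS": 2.32,
    "PM_CMPLU_STALL_FXU": 1.43,
    "PM_CMPLU_STALL_DIV": 1.13,
    "PM_CMPLU_STALL_FDIV": 1.15,
    "PM_CMPLU_STALL_FPU": 2.11,
}

# Largest ratio first: the events that grow most under SMT.
ranked = sorted(ratios_32.items(), key=lambda item: item[1], reverse=True)

for name, ratio in ranked[:5]:
    print(f"{name:28s} {ratio:10.2f}")
```

A ratio says nothing about absolute magnitude: PM_GCT_NOSLOT_SRQ_FULL tops the list by a wide margin, yet it rounds to 0.000 of the cycles in the CPI breakdown, so a large ratio can come from a very small base.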

Page 19: M Page ScicomP14 Hd3D

A Closer Look at hd3D 2TPP/1TPP performance differences

hd3D Bluevista

Description             ID     Counter / formula             nosmt_32   smt_32
                        T      PM_RUN_CYC
Completed(A)            A      PM_GRP_CMPL                      0.373    0.210
Completion Cycles(A1)   A1     PM_1PLUS_PPC_CMPL                0.355    0.200
Completed(A1A)          A1A    PM_INST_CMPL/5                   0.251    0.142
Overhead(A1B)                  A1-A1A                           0.104    0.058
Overhead(A2)                   A-A1                             0.018    0.011
Empty(B)                B      PM_GCT_NOSLOT_CYC                0.025    0.048
Miss(B1)                B1     PM_GCT_NOSLOT_IC_MISS            0.003    0.003
Mispredict(B2)          B2     PM_GCT_NOSLOT_BR_MPRED           0.012    0.027
                        B3     PM_GCT_NOSLOT_SRQ_FULL           0.000    0.000
                               B-B1-B2-B3                       0.009    0.017
                        C      T-A-B                            0.602    0.742
                        C1     PM_CMPLU_STALL_LSU               0.260    0.387
                        C1A    PM_CMPLU_STALL_REJECT            0.086    0.146
Miss(C1A1)              C1A1   PM_CMPLU_STALL_ERAT_MISS         0.022    0.031
                               C1A-C1A1                         0.064    0.115
Miss(C1B)               C1B    PM_CMPLU_STALL_DCACHE_MISS       0.088    0.128
                               C1-C1A-C1B                       0.428    0.469
                        C2     PM_CMPLU_STALL_FXU               0.057    0.049
                        C2A    PM_CMPLU_STALL_DIV               0.015    0.010
                               C2-C2A                           0.042    0.039
                        C3     PM_CMPLU_STALL_FPU               0.205    0.260
                        C3A    PM_CMPLU_STALL_FDIV              0.046    0.032
                               C3-C3A                           0.159    0.229
                               C-C1-C2-C3                       0.254    0.287
CPI                                                             3.480    3.638
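Several rows of the breakdown are remainders formed by subtracting measured components from their parent. The sketch below recomputes a few of them from the nosmt_32 column. It assumes the entries are fractions of total run cycles, so that T = 1, which is consistent with A + B + C summing to 1.000 in both columns. The function name `derived_rows` is hypothetical, and only remainder rows that the tabulated inputs reproduce to rounding are included.

```python
# nosmt_32 entries copied from the CPI breakdown table (hd3D on Bluevista).
nosmt_32 = {
    "A": 0.373, "A1": 0.355, "A1A": 0.251,
    "B": 0.025, "B1": 0.003, "B2": 0.012, "B3": 0.000,
    "C1A": 0.086, "C1A1": 0.022,
    "C2": 0.057, "C2A": 0.015,
    "C3": 0.205, "C3A": 0.046,
}


def derived_rows(m, total=1.0):
    """Recompute remainder rows; `total` is T, assumed normalised to 1."""
    return {
        "A1-A1A": m["A1"] - m["A1A"],                    # Overhead(A1B)
        "A-A1": m["A"] - m["A1"],                        # Overhead(A2)
        "B-B1-B2-B3": m["B"] - m["B1"] - m["B2"] - m["B3"],
        "T-A-B": total - m["A"] - m["B"],                # C, all stall cycles
        "C1A-C1A1": m["C1A"] - m["C1A1"],
        "C2-C2A": m["C2"] - m["C2A"],
        "C3-C3A": m["C3"] - m["C3A"],
    }


for label, value in derived_rows(nosmt_32).items():
    print(f"{label:12s} {value:6.3f}")
```

Read this way, roughly 60% of the non-SMT cycles fall under C, the completion-stall bucket, with the LSU and FPU stalls as its largest measured parts.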

Page 20: M Page ScicomP14 Hd3D

A Closer Look at hd3D Blueice/Bluevista 1TPP differences

[Figure: Bv and Bl HPM Counters, nosmt_32, private L3. Bar chart of raw counts (0.E+00 to 3.E+12) for the CPI-breakdown counters (PM_IOPS_CMPL through PM_CMPLU_STALL_FPU); series: Bluevista, Blueice.]

Page 21: M Page ScicomP14 Hd3D

A Closer Look at hd3D Blueice/Bluevista 1TPP differences

[Figure: Ratio of PM Counters (axis 0.000-6.000) for PM_IOPS_CMPL, PM_INST_CMPL, PM_RUN_CYC, PM_GRP_CMPL, PM_GCT_NOSLOT_CYC, PM_GCT_NOSLOT_IC_MISS, PM_GCT_NOSLOT_SRQ_FULL, PM_GCT_NOSLOT_BR_MPRED, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_ERAT_MISS, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_DIV, PM_CMPLU_STALL_FDIV, PM_CMPLU_STALL_FPU.]

pmlist -p POWER5 -d -c 4,8
PM_CMPLU_STALL_ERAT_MISS, Completion stall caused by ERAT miss
Following a completion stall (any period when no groups completed) the last instruction to finish before completion resumes suffered an ERAT miss. This is a subset of PM_CMPLU_STALL_REJECT.

Page 22: M Page ScicomP14 Hd3D

A Closer Look at hd3D Blueice/Bluevista 1TPP differences

[Table: CPI breakdown for hd3D on Bluevista (nosmt_32 vs. smt_32), identical to the table on Page 19.]

Page 23: M Page ScicomP14 Hd3D

Conclusions

> Nothing new <

Sharing cache can degrade performance

But lots of questions remain:

• Gathering and processing the data from performance counters was extremely tedious.
  • Is there an easier way?
  • Does difficulty increase exponentially with level of detail?

• Are the Power 5/5+ performance counters accurate?
  • Some say not
  • Eyerman et al. (ASPLOS, 2004)

• What do the counters mean?
  • Are there expanded references besides pmlist?
  • What is an ERAT miss?
  • What does it say about code performance?

• Will ACTC tools give more info than what's available via pmlist?

Page 24: M Page ScicomP14 Hd3D

?