Analytical Performance Modeling and
Simulation for Blue Waters
Torsten Hoefler
With contributions from: Greg Bauer, Steven Gottlieb, William
Gropp, William Kramer, Marc Snir, IBM, and the Blue Waters team
Keynote at the Workshop on Productivity and Performance (PROPER 2010)
August 30th, Ischia, Italy
Imagine …
• … you're planning to construct a multi-million-
dollar supercomputer …
• … that consumes as much energy as a small
[European] town …
• … to solve computational problems at an
international scale and advance science to the
next level …
• … with "hero runs" of [insert verb here] scientific
applications that cost $10k and more per run …
Performance Modeling and Simulation on Blue Waters
… and all you have (now) is …
• … then you better plan ahead! (same for Exascale)
HPC is all about “Performance” – What is this?
• FLOP/s?
• Merriam Webster “flop: to fail completely”
• HPCC: MB/s? GUPS? FFT-rate?
• It's a multi-dimensional vector space
• Identify independent vectors
• network: bandwidth, latency, injection rate, …
• memory and I/O: bandwidth, latency, random access rate, …
• CPU: latency (pipeline depth), # execution units, clock speed, …
• Categorize each vector for a particular machine
• This opens a "feasible region" in the n-dimensional space
Example: Memory Subsystem
• each application has particular coordinates
[Figure: applications plotted in a latency vs. injection-rate plane. Regular mesh computations, highly irregular mesh computations, and some graph or "informatics" applications (Applications A and B marked) sit at different coordinates.]
Multi-dimensional Performance Metrics
• Very helpful for HW/SW co-design
• Good for comparison of applications
• Good for comparison of machines
• Design application-optimized (specific) HW
• Not good for absolute estimates
• App. A on machine A vs. app. B on machine B
• “Absolute performance” is very hard to define
• Most useful (universal) metric: “time to solve problem X”
• Also most important as time is fundamentally limited
What is Performance Modeling?
• Understand the resource usage of an application
on a particular architecture
• We focus mostly on time as a resource
• Generate analytic expressions to estimate runtime
• Closely related to “Performance Engineering”
• Often builds on empirical techniques
• Also cutting into complexity theory
• More pragmatic (asymptotes often insufficient)
• Complex (low-order terms cannot be dropped)
Performance Modeling and Productivity?
• Predict resources and costs to solve a particular
problem instance
• Evaluate effectiveness of a computing platform to
solve a particular problem
• Understand bottlenecks, i.e., HW/SW Co-Design
• Find performance bugs and assess upper and
lower bounds of code optimizations
• Makes programmers think about the structured
performance profile of an application or platform
Performance Modeling from 10,000 Feet
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Of course it’s more complex than that
How to Derive a Platform Model
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Example: Blue Waters - Overview
• >300,000 compute cores
• Based on POWER7
• 10 PF/s peak (not important, more later)
• 1 PF/s sustained
• Direct network with 1.1 TB/s per hub
• Total approx. 10 PB/s
• >1.2 PB main memory
• >18 PB on-line disk storage
• >500 PB near-line storage
From Chip to Entire Integrated System
[Figure: integration hierarchy from chip to system: P7 chip (8 cores), SMP node (32 cores), drawer (256 cores), SuperNode (1,024 cores; 32 nodes / 4 CEC, joined by L-link cables), building block, and the full Blue Waters system at NPCF, with on-line and near-line storage. Source: IBM]
POWER7 Chip (8 cores)
• Base Technology
• 45 nm, 576 mm²
• 1.2 B transistors
• Chip
• 8 cores
• 4 FMAs/cycle/core
• 32 MB L3 (private/shared)
• Dual DDR3 memory
• 128 GiB/s peak bandwidth
• (1/2 byte/flop)
• Clock range of 3.5 – 4 GHz
[Photo: quad-chip MCM. Source: IBM]
L3 Cache/On-Chip Communication
• L1 32KB Instruction / core
• L1 32KB Data / core
• L2 = 256KB / core
• L3 = 4MB eDRAM / core
• Fast private and
shared region
[Figure: quad-chip module wiring: four POWER7 chips (P7-0 to P7-3), each with two memory controllers (MC0/MC1) and four clock groups (A-D), fully interconnected by the A/B/C and W/X/Y/Z SMP buses. Source: IBM]
Quad Chip Module (4 chips)
• 32 cores
• 32 cores × 8 flop/cycle/core × 4 GHz ≈ 1 TF
• 4 threads per core (max)
• 128 threads per package
• 4x32 MiB L3 cache
• 512 GB/s RAM BW (0.5 B/F)
• 800 W (0.8 W/F)
Adding a Network Interface (Hub)
[Figure: P7-IH quad block diagram: a 32-way SMP with two-tier SMP fabric, 4-chip processor MCM plus hub SCM with on-board optics. The hub provides seven board-level LL buses (copper, 22+22 GB/s each; 8B+8B at 3.0 Gb/s, 90% sustained peak), 24 LR links (optical, 5+5 GB/s each, 240 GB/s total), 16 D links (optical, 10+10 GB/s each, 320 GB/s total), and PCIe (two 16x, one 8x). Source: IBM]
• Connects QCM to PCI-e
• Two 16x and one 8x PCI-e slot
• Connects 8 QCM's via low latency,
high bandwidth, copper fabric.
• Provides a message passing
mechanism with very
high bandwidth
• Provides the lowest possible
latency between 8 QCM's
[Figure: hub (Torrent) chip diagram: EI-3 PHYs to the QCM (W/X/Y/Z host buses with TOD sync), seven copper LL buses to the other hubs in the drawer, 24 optical LR buses for the four-drawer supernode interconnect, 16 optical D buses for the supernode-to-supernode interconnect, three PCIe buses (two 16x, one 8x) with hot-plug control, and service interfaces (FSI, I2C). Source: IBM]
1.1 TB/s HUB
• 192 GB/s Host Connection
• 336 GB/s to 7 other local nodes
• 240 GB/s to local-remote nodes
• 320 GB/s to remote nodes
• 40 GB/s to general purpose I/O
• cf. "The PERCS interconnect" @ HotI'10
[Photo: hub chip module. Source: IBM]
First Level Interconnect
L-Local
HUB to HUB Copper Wiring
256 Cores
[Figure: drawer layout (first-level local interconnect, 256 cores): eight QCMs (QCM 0-7, four P7 chips each) with hubs HUB 0-7, fully connected hub-to-hub by copper wiring; 17 PCIe slots, 16 DIMM slots per node, optical fan-out from the hub modules (2,304-fiber L-links and 64/40 optical D-links), dual DCA connectors. Source: IBM]
Drawer
• 8 nodes
• 32 chips
• 256 cores
[Photo taken at SC09]
P7-IH Board Layout: 2nd Level of Interconnect (1,024 cores)
• 4.6 TB/s bisection BW; BW of 1,150 10G-E ports
[Figure: second-level interconnect: four P7-IH drawers (each with eight QCMs and eight hub modules, 61x96 mm), joined by optical L-remote links and 64/40 optical D-links; redundant service processors and clocks (FSP/CLK-A, FSP/CLK-B). Source: IBM]
Second Level Interconnect
• Optical "L-Remote" links from the hubs
• 4 drawers
• 1,024 cores
• Forms a supernode (32 nodes / 4 CEC), cabled with L-links
Source: IBM
National Petascale Computing Facility
[Photo: NCSA]
Designing a Platform Model
• Key Platform parameters:
• Memory
• CPU and Caches
• Network
• I/O subsystem (e.g., disk)
• OS (e.g., noise)
• Middleware Parameters (e.g., MPI)
Determine a Memory Model
• Memory parameters
• Latency – time to access a single element (cf. L)
• Access Rate – per-element overhead (cf. g)
• Bandwidth - linear access time (cf. G)
• Netgauge memory benchmark
• Random pointer-chase to measure latency
• Bulk random access to measure access rate (cf. GUPS)
• Bulk linear access to measure bandwidth (cf. stream)
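The pointer-chase idea can be sketched in a few lines (a Python illustration of the measurement principle only; Netgauge itself is compiled code, and interpreter overhead dominates any Python timing):

```python
import random
import time

def pointer_chase(n, iters=100000):
    """Follow one random cycle through n elements; every load depends
    on the previous one, so the average step time approximates latency."""
    perm = list(range(n))
    random.shuffle(perm)
    nxt = [0] * n
    for a, b in zip(perm, perm[1:] + perm[:1]):
        nxt[a] = b          # build a single n-element cycle
    i, t0 = 0, time.perf_counter()
    for _ in range(iters):
        i = nxt[i]          # serialized dependent accesses
    return (time.perf_counter() - t0) / iters

small = pointer_chase(1 << 10)   # working set fits in cache
large = pointer_chase(1 << 20)   # working set spills toward memory
```

Bulk random access (GUPS-style) and bulk linear access (stream-style) replace the dependent chain with independent accesses, measuring access rate and bandwidth instead of latency.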
Single Core Memory on the IBM 780 (P7 MR)
512 MB (8 Byte) random read: 143 MB/s, write: 320 MB/s, copy: 121 MB/s
latency: 132 ns
compiled with gcc 4.3.4
[Plot: access time vs. working-set size, with cached and uncached regions]
Single Core Modeling Strategy
• Modeling performance of modern CPUs is well
understood but very complex
• Upper bounds are (among others) FLOP-rate and
memory performance
• Details are not too important as we mostly care
about scalability
• Semi-empirical modeling: run application
benchmarks to determine single-core models
• Simulation: simulate CPU performance (Mambo)
Determining a Network Model
• The network is most important
• Becomes dominant at scale
• We cannot simply “run” benchmarks to determine a
model as in the single-core case (too expensive)
• Parameters:
• Topology (bisection bandwidth)
• Collective parameters (e.g., Alltoall bandwidth)
• Point-to-point parameters (e.g., LogGP)
• latency, CPU overhead, injection rate/gap, bandwidth
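As a sketch, the LogGP cost of point-to-point traffic combines exactly these parameters (standard LogGP estimates; the values used in any example are placeholders, not Blue Waters numbers):

```python
def loggp_send_time(s, L, o, g, G):
    """One s-byte message: sender overhead o, per-byte gap G,
    wire latency L, receiver overhead o."""
    return o + (s - 1) * G + L + o

def loggp_k_messages(k, s, L, o, g, G):
    """k back-to-back messages: successive injections are spaced by at
    least the gap g (inverse injection rate); the last message still
    pays the full single-message cost."""
    return (k - 1) * max(g, o + (s - 1) * G) + loggp_send_time(s, L, o, g, G)
```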
• LL Topology
• 24 GB/s
• 7 links/Hub
• Fully connected
• 8 Hubs
Source: B. Arimilli et al., "The PERCS High-Performance Interconnect" (HotI'10)
• LR Topology
• 5 GB/s
• 24 links/Hub
• Fully connected
• 4 Drawers
• 32 Hubs
31
• D Topology
• 10 GB/s
• 16 links/Hub
• Fully
connected
• 512 SNs
• 2048 Drawers
• 16384 Hubs
Routing
• Deterministic routing
• Hits the Borodin-Hopcroft bound (worst-case congestion on the order of √N for oblivious routing on N nodes)
• Bad worst-case performance
• Indirect (random) routing
• Increases latency
• Effectively halves (global) bandwidth
• Worst-case becomes very unlikely
SN-SN Direct Routing
SuperNode A
Direct Routing
SuperNode B
L-Hops: 2
D-Hops: 1
Total: 3 hops
D3
LR12
LR21
SN-SN Indirect Routing
SuperNode A SuperNode B
SuperNode x
L-Hops: 3
D-Hops: 2
Total: 5 hops
D5
D12
LR21
LL7
LR30
Total paths = # SNs – 2
Example Model: BW Global Alltoall Bandwidth
• Assume all communications happen simultaneously
• Derive upper (expected) bound for direct routing
• Simple counting argument
• Each CN can be reached through series of LL, LR, D
• Not more than one D link or LL-LR, LR-LL, LL-LL, LR-LR
• Denote e(P) as number of nodes reachable through path P
• e(LL) =7, e(LR)=24, e(D)=d (variable number of D-links)
• # nodes reachable in two hops:
• e(LL-D)=e(D-LL)=7d
• e(LR-D)=e(D-LR) = 24d
Example: Alltoall Bandwidth Continued …
• # nodes reachable in three hops:
• e(LL-D-LL) = 49d
• e(LL-D-LR)=e(LR-D-LL) = 168d
• e(LR-D-LR)=576d
• Number of paths from each source: 31+1024d
• Now count the number of paths (i.e., congestion) through
each LL, LR, D link c(L): (omitted uninteresting parts of summation)
• c(LL) = 1+64d
• c(LR) = 1+64d
• c(D) = 1024
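The counting argument is easy to reproduce mechanically (a sketch that follows the slide's path enumeration; d is the variable number of D-links per hub):

```python
def percs_paths(d):
    """Destinations reachable from one node under direct routing."""
    e = {"LL": 7, "LR": 24, "D": d}
    one_hop = e["LL"] + e["LR"] + e["D"]                   # 31 + d
    two_hop = 2 * e["LL"] * e["D"] + 2 * e["LR"] * e["D"]  # 62d
    three_hop = (e["LL"] * e["D"] * e["LL"]                # 49d
                 + 2 * e["LL"] * e["D"] * e["LR"]          # 336d
                 + e["LR"] * e["D"] * e["LR"])             # 576d
    return one_hop + two_hop + three_hop

assert percs_paths(9) == 31 + 1024 * 9   # matches the closed form 31 + 1024d
```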
Example: Alltoall Bandwidth Continued …
• Effective (expected) bandwidths for each link-type:
• bandwidth = link bandwidth / congestion
• b(LL) = 24 GB/s / (1+64d)
• b(LR) = 5 GB/s / (1+64d)
• b(D) = 10 GB/s / 1024
• If d<8, D is the bottleneck, otherwise LR
• Bandwidth per PE: 5 GB/s / (1+64d) · (total # paths)
• d=9 (294912 cores): 80.13 GB/s
• d=10 (327680 cores): 80.11 GB/s total ~ 0.8 PB/s
• Connection to P7 chips: 4*24 GB/s = 96 GB/s ~ 83%
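Putting the pieces together reproduces the quoted numbers (a sketch of the slide's arithmetic; valid for d >= 8, where the LR links are the bottleneck):

```python
def alltoall_bw_per_pe(d, lr_link_bw=5.0):
    """Expected per-node alltoall bandwidth in GB/s with d D-links per
    hub: LR link bandwidth divided by its congestion (1 + 64d),
    multiplied by the 31 + 1024d source-destination paths."""
    return lr_link_bw / (1 + 64 * d) * (31 + 1024 * d)

assert round(alltoall_bw_per_pe(9), 1) == 80.1   # slide: 80.13 GB/s
```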
Influence of the Routing Strategy
• Indirect routing would effectively halve the bandwidth
• First route to intermediate supernodes
• Not desirable for global Alltoall case
• Other patterns could benefit
• Example: bad pattern for direct routing:
• all nodes in SN A send to all nodes in SN B
• congestion of 32, i.e., 312.5 MB/s
• (optimal) indirect routing would improve to 5 GB/s
• Detailed routing strategy is still under discussion
• May be influenced by outcome of application modeling
Complete Network Models
• LogGP parameters can be measured (Netgauge)
• Or estimated based on documentation
• Traffic patterns need to be modeled or simulated
• Performance depends on task-to-node mapping
• Significant effort but only done once per machine
• NCSA is working on a detailed system model of
Blue Waters
How to Derive an Application Model
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Application Models
• Are often much more complex than platform models
• Serial performance models:
• CPU + Cache requirements
• Memory requirements
• I/O requirements
• Parallel performance models:
• Amdahl's law (scaling)
• Communication overhead
• Communication/computation overlap
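For the first item, Amdahl's law itself is a one-line model (p is the parallelizable fraction of the serial runtime):

```python
def amdahl_speedup(p, n):
    """Speedup on n processors when a fraction p of the work scales."""
    return 1.0 / ((1.0 - p) + p / n)

# even at 95% parallel work, speedup saturates below 20x
assert amdahl_speedup(0.95, 10**6) < 20.0
```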
Rough Application Classification
Static           Adaptive                  Dynamic
fixed control-   fixed control-flow,       variable control-
and dataflow     variable dataflow         and dataflow

Examples, ordered from easy to model … to hard to model:
• structured n-dimensional meshes/tori
• MILC, WRF, Abinit, Sweep3D
• TDDFT/Octopus, Enzo, ACES III
• AMR, N-body, unstructured mesh
• high-impact AMR, graph computations
• PBGL, BFS
A Strategy for Static Applications
• Two of the three NSF Petascale benchmarks are static
• Turbulence & Quantum Chromodynamics
• Application modeling in three simple steps:
1. Identify all performance-critical parameters
2. Identify code-blocks that share a performance signature
• derive serial performance model for each
3. Determine application communication pattern
• including communication sizes
An Application Modeling Example: MILC
• MIMD Lattice Computation
• Gains deeper insights in
fundamental laws of physics
• Determine the predictions of
lattice field theories (QCD &
Beyond Standard Model)
• Major NSF application
• Challenge:
• High accuracy (computationally intensive) required for
comparison with results from experimental programs in
high energy & nuclear physics
MILC - Performance-critical Parameters
Name                                       Simple/Complex  Comment
P                                          simple          Number of processes
nx, ny, nz, nt                             simple          Lattice size in x, y, z, t
warms, trajecs                             simple          Warmup rounds and trajectories
traj_between_meas                          simple          Number of "steps" in each trajectory
beta, mass1, mass2, error_for_propagator   complex         Physical parameters; influence
                                                           convergence of the conjugate gradient
max_cg_iterations                          simple          Limits CG iterations per step

• If parameters are more complex (e.g., input files), the
user has to distill them into single values (domain specific)
MILC – Critical Blocks
• Identify sub-trees in the
call-graph with the same
performance characteristic
(insignificant sub-trees are ignored)
• Five blocks in MILC:

Name  Function
LL    load_longlinks
FL    load_fatlinks
CG    ks_congrad
GF    imp_gauge_force
FF    eo_fermion_force
Communication Pattern
• Four-dimensional p2p communication topology
• Prime-factor decomposition of P (→ square)
• Total number of p2p messages
• Counted manually (profiling tools and source)
• Collective Communication
• Single MPI_Allreduce per CG iteration
Type  Number of Messages
FF    (trajecs + warms) · steps · 1616
GF    … (likewise for LL, FL, CG)
Ideas for Automation – Tool support
• Model each function as a critical block
• Automatic decomposition leads to more blocks
• User needs to provide asymptotic scaling function
• e.g., T ~ nx*ny*nz*nt,
• Statistical runs could automatically fit the
parameters (can model cache linearly)
• Similar techniques can be used for message
counting
• Should investigate tool support for this!
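The fitting step can be sketched with ordinary least squares (illustrative only; the single linear term in V and the sample timings are assumptions, and real blocks need the stepwise cache terms used for the sequential baseline):

```python
def fit_linear(samples):
    """Fit T(V) = a*V + b to measured (V, T) pairs via the
    closed-form normal equations of least squares."""
    n = len(samples)
    sx = sum(v for v, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(v * v for v, _ in samples)
    sxy = sum(v * t for v, t in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# hypothetical (volume, runtime) measurements
a, b = fit_linear([(64, 1.30), (128, 2.58), (256, 5.14)])
```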
How to Derive an Application Model
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Applying a Machine Model
1. Determine sequential baseline
• Use single-core application runs as accurate
CPU/memory model
2. Process-to-node mapping
• On-node vs. off-node communication
• Network congestion
3. Model communication overheads
• Uses message counts and collective counts
Test system for Model Validation
• BW hardware details are still confidential
• Single Power7 MR is not useful for validation
• No scaling runs possible
• We use a 1920-core Power5+ system
• 120 nodes, 16 cores per node
• HPS switch
• POE/IBM MPI Version 5.1
Sequential Baseline
• Stepwise linear function to represent cache influence
• We chose two steps; different CPUs might need more
• Volume V = nx*ny*nz*nt; Type B = {LL, FL, GF, CG, FF}
• Cache holds s(B) data elements
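Such a two-step model can be written down directly (a sketch; the segment parameters a1, a2, c would be fitted per block B, and s(B) is the block's effective cache capacity in elements):

```python
def block_time(V, a1, a2, c, s_B, elems_per_site=1):
    """Stepwise-linear serial model for one critical block: one slope
    while the working set fits in cache, another slope plus a constant
    once V * elems_per_site exceeds the cache capacity s_B."""
    if V * elems_per_site <= s_B:
        return a1 * V
    return a2 * V + c
```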
Example block: GF
System Model: Communication Parameters
[Tables: measured intra-node and inter-node communication parameters]
On-node Memory Contention
• Two cores share one memory controller
• Congestion has non-trivial performance effects
• Empirical analysis
• Assume fixed 10%
slowdown
Process-to-Node Mapping
• Trivial linear
default mapping
• With 4 processes
per node:
• 6 internal edges
• 10 remote edges
• Wrap-around
• Loses two internal edges
• Unbalanced communication
Optimized Process-to-Node Mapping
• Optimal mapping
• cf. Lagrange multiplier
• 8 internal edges
• 8 remote edges
• Similar for 4d mapping
• 16 cores, optimal sub-block:
• ½ remote edges
Parallel Performance Example: GF
Parallel Performance Model
• Example: quantify maximum
benefit of MPI
datatypes!
• cf. EuroMPI'10
Scaling with the Number of Processes
[Plots for V=64 and V=104]
Generating a P7 Model and Prediction
• Back to this guy:
Sequential Baseline – Now very Simple!
[Plots: Power5+ vs. Power7 MR]
Power7 MR Sequential Model
• relative error is less than 10%
Communication Parameters (P7 MR)
• Intra-node parameters benchmarked (Netgauge)
• Network parameters from (public) documentation
• Latency ~ 1
• Overhead ~ 0.5
• Gap ~ 0.5
• Gap per Byte (bandwidth) ~ 0.2 ns (5 GB/s)
• Allreduce: [model formula lost in transcription]
Process-to-Node Mapping
• assume two 2x2x2x2 sub-blocks per node
• 8 edges per process
• 16 * 4 local edges per sub-block
• 2*2*2 edges between the two blocks
• 72 local edges
• 56 remote edges
• local : remote ratio = 1 : 0.7778
Complete P7 MR Prediction
• Shared memory communication causes high overhead
Weak Scaling to 300000 Cores
[Plots for V=64 and V=104]
Benefits of the Model
• Allows estimating (with varying P and V):
• Communication overheads
• on-node vs. off-node communication
• evaluate decomposition and mapping strategies offline
• Time spent in message packing
• upper bound for benefits of datatypes (cf. EuroMPI'10)
• Scaling bottlenecks
• Allreduce is relatively non-critical for scaling up
• ignoring OS noise for now
• Predict runtime at scale
• estimate overheads at large scale
• …
Takeaways, Questions & Discussion
• Performance Modeling should become routine!
• Modeling during application and system design
• Maintain models through
application lifetime
• Improves productivity
• Needs tool support
• Existing, needs glue
• It's relatively simple!
• For static applications