Analytical Performance Modeling and
Simulation for Blue Waters
Torsten Hoefler
With contributions from: Greg Bauer, Steven Gottlieb, William
Gropp, William Kramer, Marc Snir, IBM, and the Blue Waters team
Keynote at the Workshop on Productivity and Performance (PROPER 2010)
August 30th, Ischia, Italy
Imagine …
• … you're planning to construct a multi-million-
dollar supercomputer …
• … that consumes as much energy as a small
[European] town …
• … to solve computational problems at an
international scale and advance science to the
next level …
• … with "hero runs" of [insert verb here] scientific
applications that cost $10k and more per run …
Performance Modeling and Simulation on Blue Waters
… and all you have (now) is …
• … then you better plan ahead! (same for Exascale)
HPC is all about “Performance” – What is this?
• FLOP/s?
• Merriam Webster “flop: to fail completely”
• HPCC: MB/s? GUPS? FFT-rate?
• It's a multi-dimensional vector space
• Identify independent vectors
• network: bandwidth, latency, injection rate, …
• memory and I/O: bandwidth, latency, random access rate, …
• CPU: latency (pipeline depth), # execution units, clock speed, …
• Categorize each vector for a particular machine
• This opens a "feasible region" in the n-dimensional space
Example: Memory Subsystem
• each application has particular coordinates
[Figure: applications plotted in a latency vs. injection-rate plane. Regular mesh computations, highly irregular mesh computations, and some graph or "informatics" applications (Applications A and B marked) sit at different coordinates.]
Multi-dimensional Performance Metrics
• Very helpful for HW/SW co-design
• Good for comparison of applications
• Good for comparison of machines
• Design application-optimized (specific) HW
• Not good for absolute estimates
• App. A on machine A vs. app. B on machine B
• “Absolute performance” is very hard to define
• Most useful (universal) metric: “time to solve problem X”
• Also most important as time is fundamentally limited
What is Performance Modeling?
• Understand the resource usage of an application
on a particular architecture
• We focus mostly on time as a resource
• Generate analytic expressions to estimate runtime
• Closely related to “Performance Engineering”
• Often builds on empirical techniques
• Also cutting into complexity theory
• More pragmatic (asymptotes often insufficient)
• Complex (low-order terms cannot be dropped)
Performance Modeling and Productivity?
• Predict resources and costs to solve a particular
problem instance
• Evaluate effectiveness of a computing platform to
solve a particular problem
• Understand bottlenecks, i.e., HW/SW Co-Design
• Find performance bugs and assess upper and
lower bounds of code optimizations
• Makes programmers think about the structured
performance profile of an application or platform
Performance Modeling from 10,000 Feet
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Of course it’s more complex than that
How to Derive a Platform Model
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Example: Blue Waters - Overview
• >300,000 compute cores
• Based on POWER7
• 10 PF/s peak (not important, more later)
• 1 PF/s sustained
• Direct network with 1.1 TB/s per hub
• Total approx. 10 PB/s
• >1.2 PB main memory
• >18 PB on-line disk storage
• >500 PB near-line storage
From Chip to Entire Integrated System
[Figure: integration hierarchy from chip to system: P7 chip (8 cores), SMP node (32 cores), drawer (256 cores), SuperNode (1,024 cores; 32 nodes / 4 CEC, joined by L-link cables), building block, and the full Blue Waters system at NPCF, with on-line and near-line storage. Source: IBM]
POWER7 Chip (8 cores)
• Base Technology
• 45 nm, 576 mm²
• 1.2 B transistors
• Chip
• 8 cores
• 4 FMAs/cycle/core
• 32 MB L3 (private/shared)
• Dual DDR3 memory
• 128 GiB/s peak bandwidth
• (1/2 byte/flop)
• Clock range of 3.5 – 4 GHz
[Photo: quad-chip MCM. Source: IBM]
L3 Cache/On-Chip Communication
• L1 32KB Instruction / core
• L1 32KB Data / core
• L2 = 256KB / core
• L3 = 4MB eDRAM / core
• Fast private and
shared region
[Figure: quad-chip module wiring: four POWER7 chips (P7-0 to P7-3), each with two memory controllers (MC0/MC1) and four clock groups (A-D), fully interconnected by the A/B/C and W/X/Y/Z SMP buses. Source: IBM]
Quad Chip Module (4 chips)
• 32 cores
• 32 cores × 8 flop/cycle/core × 4 GHz ≈ 1 TF
• 4 threads per core (max)
• 128 threads per package
• 4x32 MiB L3 cache
• 512 GB/s RAM BW (0.5 B/F)
• 800 W (0.8 W/F)
Adding a Network Interface (Hub)
[Figure: P7-IH quad block diagram: a 32-way SMP with two-tier SMP fabric, 4-chip processor MCM plus hub SCM with on-board optics. The hub provides seven board-level LL buses (copper, 22+22 GB/s each; 8B+8B at 3.0 Gb/s, 90% sustained peak), 24 LR links (optical, 5+5 GB/s each, 240 GB/s total), 16 D links (optical, 10+10 GB/s each, 320 GB/s total), and PCIe (two 16x, one 8x). Source: IBM]
• Connects QCM to PCI-e
• Two 16x and one 8x PCI-e slot
• Connects 8 QCM's via low latency,
high bandwidth, copper fabric.
• Provides a message passing
mechanism with very
high bandwidth
• Provides the lowest possible
latency between 8 QCM's
[Figure: hub (Torrent) chip diagram: EI-3 PHYs to the QCM (W/X/Y/Z host buses with TOD sync), seven copper LL buses to the other hubs in the drawer, 24 optical LR buses for the four-drawer supernode interconnect, 16 optical D buses for the supernode-to-supernode interconnect, three PCIe buses (two 16x, one 8x) with hot-plug control, and service interfaces (FSI, I2C). Source: IBM]
1.1 TB/s HUB
• 192 GB/s Host Connection
• 336 GB/s to 7 other local nodes
• 240 GB/s to local-remote nodes
• 320 GB/s to remote nodes
• 40 GB/s to general purpose I/O
• cf. "The PERCS interconnect" @ HotI'10
[Photo: hub chip module. Source: IBM]
First Level Interconnect
L-Local
HUB to HUB Copper Wiring
256 Cores
[Figure: drawer layout (first-level local interconnect, 256 cores): eight QCMs (QCM 0-7, four P7 chips each) with hubs HUB 0-7, fully connected hub-to-hub by copper wiring; 17 PCIe slots, 16 DIMM slots per node, optical fan-out from the hub modules (2,304-fiber L-links and 64/40 optical D-links), dual DCA connectors. Source: IBM]
Drawer
• 8 nodes
• 32 chips
• 256 cores
[Photo taken at SC09]
P7-IH Board Layout: 2nd Level of Interconnect (1,024 cores)
• 4.6 TB/s bisection BW; BW of 1,150 10G-E ports
[Figure: second-level interconnect: four P7-IH drawers (each with eight QCMs and eight hub modules, 61x96 mm), joined by optical L-remote links and 64/40 optical D-links; redundant service processors and clocks (FSP/CLK-A, FSP/CLK-B). Source: IBM]
Second Level Interconnect
• Optical "L-Remote" links from the hubs
• 4 drawers
• 1,024 cores
• Forms a supernode (32 nodes / 4 CEC), cabled with L-links
Source: IBM
National Petascale Computing Facility
[Photo: NCSA]
Designing a Platform Model
• Key Platform parameters:
• Memory
• CPU and Caches
• Network
• I/O subsystem (e.g., disk)
• OS (e.g., noise)
• Middleware Parameters (e.g., MPI)
Determine a Memory Model
• Memory parameters
• Latency – time to access a single element (cf. L)
• Access Rate – per-element overhead (cf. g)
• Bandwidth - linear access time (cf. G)
• Netgauge memory benchmark
• Random pointer-chase to measure latency
• Bulk random access to measure access rate (cf. GUPS)
• Bulk linear access to measure bandwidth (cf. stream)
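The pointer-chase idea can be sketched in a few lines (a Python illustration of the measurement principle only; Netgauge itself is compiled code, and interpreter overhead dominates any Python timing):

```python
import random
import time

def pointer_chase(n, iters=100000):
    """Follow one random cycle through n elements; every load depends
    on the previous one, so the average step time approximates latency."""
    perm = list(range(n))
    random.shuffle(perm)
    nxt = [0] * n
    for a, b in zip(perm, perm[1:] + perm[:1]):
        nxt[a] = b          # build a single n-element cycle
    i, t0 = 0, time.perf_counter()
    for _ in range(iters):
        i = nxt[i]          # serialized dependent accesses
    return (time.perf_counter() - t0) / iters

small = pointer_chase(1 << 10)   # working set fits in cache
large = pointer_chase(1 << 20)   # working set spills toward memory
```

Bulk random access (GUPS-style) and bulk linear access (stream-style) replace the dependent chain with independent accesses, measuring access rate and bandwidth instead of latency.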
Single Core Memory on the IBM 780 (P7 MR)
512 MB (8 Byte) random read: 143 MB/s, write: 320 MB/s, copy: 121 MB/s
latency: 132 ns
compiled with gcc 4.3.4
[Plot: access time vs. working-set size, with cached and uncached regions]
Single Core Modeling Strategy
• Modeling performance of modern CPUs is well
understood but very complex
• Upper bounds are (among others) FLOP-rate and
memory performance
• Details are not too important as we mostly care
about scalability
• Semi-empirical modeling: run application
benchmarks to determine single-core models
• Simulation: simulate CPU performance (Mambo)
Determining a Network Model
• The network is most important
• Becomes dominant at scale
• We cannot simply “run” benchmarks to determine a
model as in the single-core case (too expensive)
• Parameters:
• Topology (bisection bandwidth)
• Collective parameters (e.g., Alltoall bandwidth)
• Point-to-point parameters (e.g., LogGP)
• latency, CPU overhead, injection rate/gap, bandwidth
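As a sketch, the LogGP cost of point-to-point traffic combines exactly these parameters (standard LogGP estimates; the values used in any example are placeholders, not Blue Waters numbers):

```python
def loggp_send_time(s, L, o, g, G):
    """One s-byte message: sender overhead o, per-byte gap G,
    wire latency L, receiver overhead o."""
    return o + (s - 1) * G + L + o

def loggp_k_messages(k, s, L, o, g, G):
    """k back-to-back messages: successive injections are spaced by at
    least the gap g (inverse injection rate); the last message still
    pays the full single-message cost."""
    return (k - 1) * max(g, o + (s - 1) * G) + loggp_send_time(s, L, o, g, G)
```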
• LL Topology
• 24 GB/s
• 7 links/Hub
• Fully connected
• 8 Hubs
Source: B. Arimilli et al., "The PERCS High-Performance Interconnect" (HotI'10)
• LR Topology
• 5 GB/s
• 24 links/Hub
• Fully connected
• 4 Drawers
• 32 Hubs
31
• D Topology
• 10 GB/s
• 16 links/Hub
• Fully
connected
• 512 SNs
• 2048 Drawers
• 16384 Hubs
Routing
• Deterministic routing
• Hits the Borodin-Hopcroft bound (worst-case congestion on the order of √N for oblivious routing on N nodes)
• Bad worst-case performance
• Indirect (random) routing
• Increases latency
• Effectively halves (global) bandwidth
• Worst-case becomes very unlikely
SN-SN Direct Routing
SuperNode A
Direct Routing
SuperNode B
L-Hops: 2
D-Hops: 1
Total: 3 hops
D3
LR12
LR21
SN-SN Indirect Routing
SuperNode A SuperNode B
SuperNode x
L-Hops: 3
D-Hops: 2
Total: 5 hops
D5
D12
LR21
LL7
LR30
Total paths = # SNs – 2
Example Model: BW Global Alltoall Bandwidth
• Assume all communications happen simultaneously
• Derive upper (expected) bound for direct routing
• Simple counting argument
• Each CN can be reached through series of LL, LR, D
• Not more than one D link or LL-LR, LR-LL, LL-LL, LR-LR
• Denote e(P) as number of nodes reachable through path P
• e(LL) =7, e(LR)=24, e(D)=d (variable number of D-links)
• # nodes reachable in two hops:
• e(LL-D)=e(D-LL)=7d
• e(LR-D)=e(D-LR) = 24d
Example: Alltoall Bandwidth Continued …
• # nodes reachable in three hops:
• e(LL-D-LL) = 49d
• e(LL-D-LR)=e(LR-D-LL) = 168d
• e(LR-D-LR)=576d
• Number of paths from each source: 31+1024d
• Now count the number of paths (i.e., congestion) through
each LL, LR, D link c(L): (omitted uninteresting parts of summation)
• c(LL) = 1+64d
• c(LR) = 1+64d
• c(D) = 1024
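The counting argument is easy to reproduce mechanically (a sketch that follows the slide's path enumeration; d is the variable number of D-links per hub):

```python
def percs_paths(d):
    """Destinations reachable from one node under direct routing."""
    e = {"LL": 7, "LR": 24, "D": d}
    one_hop = e["LL"] + e["LR"] + e["D"]                   # 31 + d
    two_hop = 2 * e["LL"] * e["D"] + 2 * e["LR"] * e["D"]  # 62d
    three_hop = (e["LL"] * e["D"] * e["LL"]                # 49d
                 + 2 * e["LL"] * e["D"] * e["LR"]          # 336d
                 + e["LR"] * e["D"] * e["LR"])             # 576d
    return one_hop + two_hop + three_hop

assert percs_paths(9) == 31 + 1024 * 9   # matches the closed form 31 + 1024d
```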
Example: Alltoall Bandwidth Continued …
• Effective (expected) bandwidths for each link-type:
• bandwidth = link bandwidth / congestion
• b(LL) = 24 GB/s / (1+64d)
• b(LR) = 5 GB/s / (1+64d)
• b(D) = 10 GB/s / 1024
• If d<8, D is the bottleneck, otherwise LR
• Bandwidth per PE: 5 GB/s / (1+64d) · (total # paths)
• d=9 (294912 cores): 80.13 GB/s
• d=10 (327680 cores): 80.11 GB/s total ~ 0.8 PB/s
• Connection to P7 chips: 4*24 GB/s = 96 GB/s ~ 83%
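Putting the pieces together reproduces the quoted numbers (a sketch of the slide's arithmetic; valid for d >= 8, where the LR links are the bottleneck):

```python
def alltoall_bw_per_pe(d, lr_link_bw=5.0):
    """Expected per-node alltoall bandwidth in GB/s with d D-links per
    hub: LR link bandwidth divided by its congestion (1 + 64d),
    multiplied by the 31 + 1024d source-destination paths."""
    return lr_link_bw / (1 + 64 * d) * (31 + 1024 * d)

assert round(alltoall_bw_per_pe(9), 1) == 80.1   # slide: 80.13 GB/s
```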
Influence of the Routing Strategy
• Indirect routing would effectively halve the bandwidth
• First route to intermediate supernodes
• Not desirable for global Alltoall case
• Other patterns could benefit
• Example: bad pattern for direct routing:
• all nodes in SN A send to all nodes in SN B
• congestion of 32, i.e., 312.5 MB/s
• (optimal) indirect routing would improve to 5 GB/s
• Detailed routing strategy is still under discussion
• May be influenced by outcome of application modeling
Complete Network Models
• LogGP parameters can be measured (Netgauge)
• Or estimated based on documentation
• Traffic patterns need to be modeled or simulated
• Performance depends on task-to-node mapping
• Significant effort but only done once per machine
• NCSA is working on a detailed system model of
Blue Waters
How to Derive an Application Model
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Application Models
• Are often much more complex than platform models
• Serial performance models:
• CPU + Cache requirements
• Memory requirements
• I/O requirements
• Parallel performance models:
• Amdahl's law (scaling)
• Communication overhead
• Communication/computation overlap
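For the first item, Amdahl's law itself is a one-line model (p is the parallelizable fraction of the serial runtime):

```python
def amdahl_speedup(p, n):
    """Speedup on n processors when a fraction p of the work scales."""
    return 1.0 / ((1.0 - p) + p / n)

# even at 95% parallel work, speedup saturates below 20x
assert amdahl_speedup(0.95, 10**6) < 20.0
```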
Rough Application Classification
Static           Adaptive                  Dynamic
fixed control-   fixed control-flow,       variable control-
and dataflow     variable dataflow         and dataflow

Examples, ordered from easy to model … to hard to model:
• structured n-dimensional meshes/tori
• MILC, WRF, Abinit, Sweep3D
• TDDFT/Octopus, Enzo, ACES III
• AMR, N-body, unstructured mesh
• high-impact AMR, graph computations
• PBGL, BFS
A Strategy for Static Applications
• Two of the three NSF Petascale benchmarks are static
• Turbulence & Quantum Chromodynamics
• Application modeling in three simple steps:
1. Identify all performance-critical parameters
2. Identify code-blocks that share a performance signature
• derive serial performance model for each
3. Determine application communication pattern
• including communication sizes
An Application Modeling Example: MILC
• MIMD Lattice Computation
• Gains deeper insights in
fundamental laws of physics
• Determine the predictions of
lattice field theories (QCD &
Beyond Standard Model)
• Major NSF application
• Challenge:
• High accuracy (computationally intensive) required for
comparison with results from experimental programs in
high energy & nuclear physics
MILC - Performance-critical Parameters
Name                                       Simple/Complex  Comment
P                                          simple          Number of processes
nx, ny, nz, nt                             simple          Lattice size in x, y, z, t
warms, trajecs                             simple          Warmup rounds and trajectories
traj_between_meas                          simple          Number of "steps" in each trajectory
beta, mass1, mass2, error_for_propagator   complex         Physical parameters; influence
                                                           convergence of the conjugate gradient
max_cg_iterations                          simple          Limits CG iterations per step

• If parameters are more complex (e.g., input files), the
user has to distill them into single values (domain specific)
MILC – Critical Blocks
• Identify sub-trees in the
call-graph with the same
performance characteristic
(insignificant sub-trees are ignored)
• Five blocks in MILC:

Name  Function
LL    load_longlinks
FL    load_fatlinks
CG    ks_congrad
GF    imp_gauge_force
FF    eo_fermion_force
Communication Pattern
• Four-dimensional p2p communication topology
• Prime-factor decomposition of P (→ square)
• Total number of p2p messages
• Counted manually (profiling tools and source)
• Collective Communication
• Single MPI_Allreduce per CG iteration
Type  Number of Messages
FF    (trajecs + warms) · steps · 1616
GF    … (likewise for LL, FL, CG)
Ideas for Automation – Tool support
• Model each function as a critical block
• Automatic decomposition leads to more blocks
• User needs to provide asymptotic scaling function
• e.g., T ~ nx*ny*nz*nt,
• Statistical runs could automatically fit the
parameters (can model cache linearly)
• Similar techniques can be used for message
counting
• Should investigate tool support for this!
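The fitting step can be sketched with ordinary least squares (illustrative only; the single linear term in V and the sample timings are assumptions, and real blocks need the stepwise cache terms used for the sequential baseline):

```python
def fit_linear(samples):
    """Fit T(V) = a*V + b to measured (V, T) pairs via the
    closed-form normal equations of least squares."""
    n = len(samples)
    sx = sum(v for v, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(v * v for v, _ in samples)
    sxy = sum(v * t for v, t in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# hypothetical (volume, runtime) measurements
a, b = fit_linear([(64, 1.30), (128, 2.58), (256, 5.14)])
```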
How to Derive an Application Model
Platform or System Model
(Hardware, Middleware)
Application Model
(Algorithm, Structure)
Performance Model
Applying a Machine Model
1. Determine sequential baseline
• Use single-core application runs as accurate
CPU/memory model
2. Process-to-node mapping
• On-node vs. off-node communication
• Network congestion
3. Model communication overheads
• Uses message counts and collective counts
Test system for Model Validation
• BW hardware details are still confidential
• Single Power7 MR is not useful for validation
• No scaling runs possible
• We use a 1920-core Power5+ system
• 120 nodes, 16 cores per node
• HPS switch
• POE/IBM MPI Version 5.1
Sequential Baseline
• Stepwise linear function to represent cache influence
• We chose two steps; different CPUs might need more
• Volume V = nx*ny*nz*nt; Type B = {LL, FL, GF, CG, FF}
• Cache holds s(B) data elements
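Such a two-step model can be written down directly (a sketch; the segment parameters a1, a2, c would be fitted per block B, and s(B) is the block's effective cache capacity in elements):

```python
def block_time(V, a1, a2, c, s_B, elems_per_site=1):
    """Stepwise-linear serial model for one critical block: one slope
    while the working set fits in cache, another slope plus a constant
    once V * elems_per_site exceeds the cache capacity s_B."""
    if V * elems_per_site <= s_B:
        return a1 * V
    return a2 * V + c
```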
Example block: GF
System Model: Communication Parameters
[Tables: measured intra-node and inter-node communication parameters]
On-node Memory Contention
• Two cores share one memory controller
• Congestion has non-trivial performance effects
• Empirical analysis
• Assume fixed 10%
slowdown
Process-to-Node Mapping
• Trivial linear
default mapping
• With 4 processes
per node:
• 6 internal edges
• 10 remote edges
• Wrap-around
• Loses two internal edges
• Unbalanced communication
Optimized Process-to-Node Mapping
• Optimal mapping
• cf. Lagrange multiplier
• 8 internal edges
• 8 remote edges
• Similar for 4d mapping
• 16 cores, optimal sub-block:
• ½ remote edges
Parallel Performance Example: GF
Parallel Performance Model
• Example: quantify maximum
benefit of MPI
datatypes!
• cf. EuroMPI'10
Scaling with the Number of Processes
[Plots for V=64 and V=104]
Generating a P7 Model and Prediction
• Back to this guy:
Sequential Baseline – Now very Simple!
[Plots: Power5+ vs. Power7 MR]
Power7 MR Sequential Model
• relative error is less than 10%
Communication Parameters (P7 MR)
• Intra-node parameters benchmarked (Netgauge)
• Network parameters from (public) documentation
• Latency ~ 1
• Overhead ~ 0.5
• Gap ~ 0.5
• Gap per Byte (bandwidth) ~ 0.2 ns (5 GB/s)
• Allreduce: [model formula lost in transcription]
Process-to-Node Mapping
• assume two 2x2x2x2 sub-blocks per node
• 8 edges per process
• 16 * 4 local edges per sub-block
• 2*2*2 edges between the two blocks
• 72 local edges
• 56 remote edges
• local : remote ratio = 1 : 0.7778
Complete P7 MR Prediction
• Shared memory communication causes high overhead
Weak Scaling to 300000 Cores
[Plots for V=64 and V=104]
Benefits of the Model
• Allows estimating (with varying P and V):
• Communication overheads
• on-node vs. off-node communication
• evaluate decomposition and mapping strategies offline
• Time spent in message packing
• upper bound for benefits of datatypes (cf. EuroMPI'10)
• Scaling bottlenecks
• Allreduce is relatively non-critical for scaling up
• ignoring OS noise for now
• Predict runtime at scale
• estimate overheads at large scale
• …
Takeaways, Questions & Discussion
• Performance Modeling should become routine!
• Modeling during application and system design
• Maintain models through
application lifetime
• Improves productivity
• Needs tool support
• Existing, needs glue
• It's relatively simple!
• For static applications