A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers
Maxime Martinasso∗, Grzegorz Kwasniewski†, Sadaf R. Alam∗, Thomas C. Schulthess∗‡§, Torsten Hoefler†
∗Swiss National Supercomputing Centre, ETH Zurich, 6900 Lugano, Switzerland
†Department of Computer Science, ETH Zurich, Universitatstr. 6, 8092 Zurich, Switzerland
‡Institute for Theoretical Physics, ETH Zurich, 8093 Zurich, Switzerland
§Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA
Why more densely populated accelerator servers?
accelerators are faster and more energy-efficient than CPUs
densely populated accelerator servers are high performance nodes
reduce space occupancy of the data center
SC16 – A PCIe performance model | 2
Cray CS Storm – new MeteoSwiss supercomputer
2 cabinets
12 hybrid computing nodes per cabinet
2 Intel Haswell 12-core CPUs per node
8 NVIDIA Tesla K80 GPU accelerators per node
2 GPU processors per accelerator
192 GPU processors in total
360 GPU teraflops in total
Production system
GPUs connected by PCI-Express
Building a densely populated accelerator server with PCIe:
generation 3, 16 GB/s using a 16-lane (x16) link
dual simplex (a pair of unidirectional links)
exchange of buffer availability between the pair of ports of a link
tree-based topology
[Figure: PCIe tree topology with two CPUs, the root complex (RC), PCIe switches, K80 boards, and GPUs numbered 0 to 7. Legend: port, PCIe root complex (RC), PCIe switch, GPU, x16 link.]
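Communication conflicts
Example communications: 0→7, 1→4, 6→7
[Figure: PCIe tree with the root complex (RC), GPUs 0 to 7, and intermediate nodes labeled 8 to 19. The conflict types below are highlighted on the tree.]
Upstream port conflict
Downstream port conflict
Crossing the root complex conflict
Head-of-Line blocking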
Motivation – COSMO halo exchange
[Figure: 2D domain decomposition and 3D domain decomposition over GPU0 to GPU7.]
Which order of communications is the fastest?
0→4 1→0 2→1 3→7 4→0 5→4 6→2 7→6
0→1 1→2 2→6 3→2 4→5 5→6 6→7 7→3
1→5 2→3 5→1 6→5

0→1 1→0 2→1 3→2 4→0 5→1 6→2 7→3
0→4 1→5 2→6 3→7 4→5 5→6 6→5 7→6
1→2 2→3 5→4 6→7
. . .
2D domain decomposition example: 20,736 possibilities
3D domain decomposition has more than 1.6 million possibilities
PCIe performance model
We want to identify the congestion factors ρ ∈ [0, 1] which limit the available bandwidth per communication at each communication phase.
[Flow diagram: the communication graph enters the model ("model: compute"), which runs Steps (A) to (D). Stage labels: Source arbitration, Upstream port conflicts, Downstream port conflicts + RC, Head-of-Line blocking, Update messages. Then "Remaining messages?": if yes, advance to next time step and update the communication graph; if no, end.]
Communication phase – update messages
Elapsed time $L_c$, message size $M_c$, set of communication phases $S_c$:
[Figure: timeline of the communication phases; axis label: Time.]
One phase:
$$L_{C_0} = t_0 = \frac{1}{B} \cdot \frac{m_{0,0}}{\rho_{0,0}} \quad \text{and} \quad M_{C_0} = m_{0,0}$$
Two phases:
$$L_{C_0} = t_0 + t_1 = \frac{1}{B} \cdot \left( \frac{m_{0,0}}{\rho_{0,0}} + \frac{m_{0,1}}{\rho_{0,1}} \right) \quad \text{and} \quad M_{C_0} = m_{0,0} + m_{0,1}$$
General case:
$$L_c = \sum_{i \in S_c} t_i = \frac{1}{B} \sum_{i \in S_c} \frac{m_{c,i}}{\rho_{c,i}} \quad \text{and} \quad M_c = \sum_{i \in S_c} m_{c,i}$$
(1) start time and $M_c$ are known
(2) $\rho_{c,i}$ are given by the model
With (1) and (2), $L_c$ is computable.
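The elapsed-time formula L_c = (1/B) Σ m_{c,i} / ρ_{c,i} can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: the function name `elapsed_time` and the input values are hypothetical and chosen only to exercise the sum over phases.

```python
from fractions import Fraction


def elapsed_time(bandwidth, phases):
    """Return (L_c, M_c) for one communication.

    bandwidth: link bandwidth B, in data units per second.
    phases: list of (m, rho) pairs, one per communication phase, where m is
            the data moved in that phase and rho in (0, 1] is the congestion
            factor that applied during the phase.
    """
    total_time = Fraction(0)
    total_size = Fraction(0)
    for m, rho in phases:
        if not 0 < rho <= 1:
            raise ValueError("congestion factor must be in (0, 1]")
        # t_i = m_i / (rho_i * B): the phase runs at a fraction rho of B.
        total_time += Fraction(m) / (Fraction(rho) * bandwidth)
        total_size += Fraction(m)
    return total_time, total_size


# Hypothetical values: B = 10 units/s, 30 units at rho = 1/2, then 20 at rho = 1.
L, M = elapsed_time(10, [(30, Fraction(1, 2)), (20, 1)])
print(L, M)  # 8 50
```

Exact fractions are used here only so that the arithmetic can be compared by hand; floating point would work equally well.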
Model conflicts on switch
We want to identify the congestion factors ρ ∈ [0, 1] which limit the available bandwidth per communication at each communication phase.
Each communication enters a switch with a congestion factor ρ and leaves with a congestion factor ρ′.
If $\sum_i \rho_i > 1$ then an arbitration policy is required; $\rho' = \rho$ otherwise.
Upstream port conflict
Proportional sharing of available bandwidth
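One possible reading of proportional sharing is sketched below: when the congestion factors entering a port sum to more than 1, each is scaled by the same ratio so that they sum to exactly 1; otherwise they pass through unchanged. The scaling rule and the function name `share_upstream` are assumptions made for illustration, since the slide only names the policy.

```python
from fractions import Fraction


def share_upstream(rhos):
    """Scale incoming congestion factors so that their sum is at most 1.

    Assumed rule: rho'_i = rho_i / sum(rho) when sum(rho) > 1,
    and rho'_i = rho_i otherwise.
    """
    total = sum(rhos, Fraction(0))
    if total <= 1:
        return list(rhos)
    return [Fraction(r) / total for r in rhos]


# Two communications that each ask for the full link end up with 1/2 each.
print(share_upstream([Fraction(1), Fraction(1)]))
```

Under this reading, two communications leaving GPUs that share an upstream port each drop from 1 to 1/2, which is the behaviour the complete example shows for 0→2 and 1→4.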
Downstream port conflict
Round-robin policy
Performance reduction for crossing the root complex
$C_R$: set of communications crossing the root complex
$n$: number of grouped communication sets
$R$: congestion factor of a grouped communication set
$\tau$: congestion factor for crossing the root complex

If $C_R = \emptyset$ then
$$R' = \frac{1}{n}$$
If $C_R \neq \emptyset$ then
$$R' = \begin{cases} \min\left(\max\left(\frac{1}{n} - \tau,\, 0\right),\, R\right) & \text{if } R \text{ contains comm.} \in C_R \\ \min\left(\frac{1}{n} + \tau,\, R\right) & \text{otherwise} \end{cases}$$
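The case distinction above translates directly into a small function. This is a sketch under stated assumptions: the name `downstream_factor` and its boolean arguments are hypothetical, and the sample inputs assume that 1→4 crosses the root complex while 6→4 does not, which is what the values 3/10 and 7/10 in the complete example suggest.

```python
from fractions import Fraction


def downstream_factor(R, n, tau, crosses_rc, any_crossing):
    """Return R' for one grouped communication set at a downstream port.

    R: congestion factor of the set when it reaches the port.
    n: number of grouped communication sets competing for the port.
    tau: congestion factor for crossing the root complex.
    crosses_rc: True if this set contains a communication in C_R.
    any_crossing: True if C_R is not empty.
    """
    fair = Fraction(1, n)  # round-robin share
    if not any_crossing:
        return fair
    if crosses_rc:
        return min(max(fair - tau, Fraction(0)), R)
    return min(fair + tau, R)


tau = Fraction(1, 5)  # tau = 0.2
print(downstream_factor(Fraction(1, 2), 2, tau, True, True))   # 3/10
print(downstream_factor(Fraction(1), 2, tau, False, True))     # 7/10
```

The penalty τ taken from the set that crosses the root complex is handed to the other set, so the two shares still sum to 1 in this two-set case.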
Complete example
[Figure: PCIe tree with the root complex (RC), GPUs 0 to 7, and intermediate nodes labeled 8 to 19.]
τ = 0.2
Message size: 300 MB
Bandwidth: 11.6 GB/s
Model stages (from the diagram): Source arbitration, Upstream port conflicts, Downstream port conflicts + RC, Head-of-Line blocking, Update messages.

| comm. | | Step (A) | Step (B) | Step (C) | Step (D) |
|---|---|---|---|---|---|
| (a) 0→2 | 1 | 1/2 | 1/2 | 3/10 | 3/10 |
| (b) 1→4 | 1 | 1/2 | 3/10 | 3/10 | 3/10 |
| (c) 3→2 | 1 | 1 | 1/2 | 1/2 | 7/10 |
| (d) 6→4 | 1 | 1 | 7/10 | 7/10 | 7/10 |

Congestion graph step 1

| comm. | cong. factor | data remaining | elapsed time |
|---|---|---|---|
| (a) 0→2 | 3/10 | 128 MB | 36 ms |
| (b) 1→4 | 3/10 | 128 MB | 36 ms |
| (c) 3→2 | 7/10 | 0 MB | 36 ms |
| (d) 6→4 | 7/10 | 0 MB | 36 ms |

Congestion graph step 2

| comm. | cong. factor | data remaining | elapsed time |
|---|---|---|---|
| (a) 0→2 | 1/2 | 0 MB | 65 ms |
| (b) 1→4 | 1/2 | 0 MB | 65 ms |
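The two congestion graph steps can be replayed with a short time-stepping loop. The sketch below takes the congestion factors of each step as given inputs (it does not compute them) and assumes 1 GB = 1024 MB, a unit choice that is not stated on the slide. The function name `run_phases` and the data layout are hypothetical.

```python
B_MB_PER_S = 11.6 * 1024  # 11.6 GB/s, assuming 1 GB = 1024 MB


def run_phases(sizes_mb, factors_per_phase, bandwidth=B_MB_PER_S):
    """Advance phase by phase; return the finish time in seconds per communication.

    sizes_mb: {name: message size in MB}
    factors_per_phase: list of {name: rho}, one dict per phase, covering
                       every communication still in flight in that phase.
    """
    remaining = dict(sizes_mb)
    finish = {}
    now = 0.0
    for factors in factors_per_phase:
        # A phase ends when the first in-flight communication completes.
        dt = min(remaining[c] / (factors[c] * bandwidth) for c in remaining)
        now += dt
        for c in list(remaining):
            remaining[c] -= factors[c] * bandwidth * dt
            if remaining[c] <= 1e-9:  # tolerate floating-point residue
                finish[c] = now
                del remaining[c]
        if not remaining:
            break
    return finish


sizes = {"a": 300, "b": 300, "c": 300, "d": 300}
phases = [
    {"a": 0.3, "b": 0.3, "c": 0.7, "d": 0.7},  # congestion graph step 1
    {"a": 0.5, "b": 0.5},                      # congestion graph step 2
]
for name, t in sorted(run_phases(sizes, phases).items()):
    print(name, round(t * 1000, 1), "ms")
```

Under these assumptions the loop gives about 36 ms for (c) and (d) and about 65 ms for (a) and (b). When step 1 ends, (a) and (b) have each moved roughly 128.6 MB of their 300 MB.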
Model Validation
Architecture parameters:
  B = 11.6 GB/s
  τ = 0.1735
22,259 graphs:
  non-isomorphic
  cudaMemcpyAsync
  Communication pattern: scatter, gather, all-to-all
  Entire set of graphs for subsets of GPUs
  Randomly generated
  ≃100K communications
Message size: 300 MB
Time no contention: Tref = 25.3 ms
95% of communications are in range +/- 15%
Back to the motivation – COSMO halo exchange
Upper limit on time to solution, throughput approach
Running mode: one instance per socket (8 GPUs)
Large domain size: 256x256x80 per GPU
One step triggers 312 halo exchanges
Message size: 40 KB to 254 KB
Uses MPI
(3!)^4 × (2!)^4 = 20,736 communication graphs for 2D domain
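One reading of this count is that each GPU orders its own outgoing messages independently, so a GPU with k neighbors contributes k! orderings. The sketch below assumes a non-periodic 2×4 grid for the 2D case and a 2×2×2 cube for the 3D case; both layouts are inferred from the GPU count and the factors in the formula, and are not stated explicitly on the slide. The helper names are hypothetical.

```python
from itertools import product
from math import factorial


def grid_neighbors(dims):
    """Neighbor lists for a non-periodic grid with the given dimensions."""
    cells = list(product(*(range(d) for d in dims)))
    index = {cell: i for i, cell in enumerate(cells)}
    neighbors = {i: [] for i in index.values()}
    for cell, i in index.items():
        for axis in range(len(dims)):
            for step in (-1, 1):
                other = list(cell)
                other[axis] += step
                if 0 <= other[axis] < dims[axis]:
                    neighbors[i].append(index[tuple(other)])
    return neighbors


def count_orderings(neighbors):
    """Product over GPUs of (number of neighbors)!."""
    total = 1
    for peers in neighbors.values():
        total *= factorial(len(peers))
    return total


print(count_orderings(grid_neighbors((2, 4))))     # 20736
print(count_orderings(grid_neighbors((2, 2, 2))))  # 1679616
```

In the 2×4 grid, four corner GPUs have two neighbors and four have three, which gives (2!)^4 × (3!)^4. In the cube every GPU has three neighbors, which gives (3!)^8, a value above 1.6 million.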
[Figure: congestion graphs sorted by elapsed time. Y-axis: Time [s], 0.14 to 0.28. The graph currently implemented in COSMO is marked; annotations: 1.6x and 1.9x. Insets: 2D and 3D domain decompositions over GPU0 to GPU7.]
Fastest schedule for 2D decomposition
[Figure: three panels, each showing the PCIe tree (RC, GPUs 0 to 7) next to the GPU grid (rows 4 5 6 7 and 0 1 2 3).]
Fastest schedule for 3D decomposition
[Figure: three panels, each showing the PCIe tree (RC, GPUs 0 to 7) next to the GPU layout (rows 4 5 6 7 and 0 1 2 3).]
COSMO improvement – fastest schedule
[Figure: three panels, each showing the PCIe tree (RC, GPUs 0 to 7) next to the GPU grid (rows 4 5 6 7 and 0 1 2 3).]
COSMO gain: 5.6% per halo exchange step; the gain is limited by MPI 2-sided overhead.
Conclusion
- Latency not modeled
- MPI 2-sided overhead not modeled (use one-sided?)
+ Captures all PCIe features including congestion
+ Simple model: only 2 parameters (B and τ)
+ Precise for large messages
+ Design of topology-aware algorithms
+ COSMO halo exchange performance gain
Thank you for your attention.
[Figure: measured bandwidth [GB/s] (0 to 12) versus message size [Byte] (2 to 32M). Curves: 0→1 or 0→3 alone; 4→1 alone; 0→3 in conflict 1; 0→1 in conflict 2; 0→1 in conflict 3. Annotations: 1.88x slower for conflict 1; 1.80x slower for conflict 2; 1.44x slower for conflict 3; 1.21x.]
τ = 1 − 1/1.21
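As a quick arithmetic check, the expression for τ can be evaluated directly. The helper name `tau_from_slowdown` is hypothetical; the only input is the 1.21x factor annotated on the plot.

```python
def tau_from_slowdown(slowdown):
    """Evaluate tau = 1 - 1/slowdown for a measured slowdown factor."""
    return 1 - 1 / slowdown


print(round(tau_from_slowdown(1.21), 4))  # 0.1736
```

The result agrees, up to rounding of the last digit, with the τ = 0.1735 listed among the architecture parameters in the model validation.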
[Figure: topologies T1 and T2. Each shows a socket root complex (RC), K80 boards, and GPUs 0 to 7. Legend: port, PCIe root complex (RC), PCIe switch, GPU, x16 link.]