Reducing Communication in Parallel Graph Computations
Aydın Buluç Berkeley Lab August 8, 2013 ASCR Applied Math PI Meeting Albuquerque, NM
Acknowledgments (past & present)
• Krste Asanovic (UC Berkeley)
• Grey Ballard (UC Berkeley)
• Scott Beamer (UC Berkeley)
• Jim Demmel (UC Berkeley)
• Erika Duriakova (UC Dublin)
• Armando Fox (UC Berkeley)
• John Gilbert (UCSB)
• Laura Grigori (INRIA)
• Shoaib Kamil (MIT)
• Ben Lipshitz (UC Berkeley)
• Adam Lugowski (UCSB)
• Lenny Oliker (Berkeley Lab)
• Dave Patterson (UC Berkeley)
• Steve Reinhardt (YarcData)
• Oded Schwartz (UC Berkeley)
• Edgar Solomonik (UC Berkeley)
• Sivan Toledo (Tel Aviv Univ)
• Sam Williams (Berkeley Lab)
This work is funded by:
Informal Outline
• Large-scale graphs
• Communication costs of parallel graph computations
• Communication takes energy
• Linear algebraic primitives enable scalable graph computation
• The semiring abstraction glues graphs to linear algebra
• Direction-optimizing BFS does >6X less communication
• New communication-optimal sparse matrix-matrix multiply
• New communication-optimal dense all-pairs shortest-paths
Graph abstractions in Computer Science
Compiler optimization: Control flow graph: graph dominators. Register allocation: graph coloring.
Scientific computing: Preconditioning: support graphs, spanning trees. Sparse direct solvers: chordal graphs. Parallel computing: graph separators.
[Figure: filled graph G+(A) on vertices 1-10; G+(A) is chordal]
Computer networks: Routing: shortest path algorithms. Web crawling: graph traversal. Interconnect design: Cayley graphs.
Large graphs are everywhere
[Figure: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong]
Internet structure. Social interactions.
Scientific datasets: biological, chemical, cosmological, ecological, …
Two kinds of costs: arithmetic (FLOPs) and communication (moving data).
Running time = γ ⋅ #FLOPs + β ⋅ #Words + (α ⋅ #Messages)
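To make this cost model concrete, here is a minimal Python sketch that evaluates it; the machine parameters below are hypothetical placeholders, not values from this slide.

```python
# Minimal sketch of the cost model above: time = gamma*#FLOPs + beta*#Words + alpha*#Messages.
# The machine parameters are hypothetical placeholders, not measurements of any real system.
gamma = 1e-10   # seconds per flop         (hypothetical)
beta  = 1e-9    # seconds per word moved   (hypothetical)
alpha = 1e-6    # seconds per message      (hypothetical)

def running_time(flops, words, messages):
    return gamma * flops + beta * words + alpha * messages

# An algorithm doing 1e9 flops but moving 5e8 words in 1e4 messages is
# dominated by communication (0.5 s) rather than arithmetic (0.1 s).
print(running_time(flops=1e9, words=5e8, messages=1e4))
```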
[Figure: two machine models. Sequential: a CPU with fast/local memory of size M backed by RAM. Distributed: P processors, each a CPU with local memory M, connected by a network. Annual improvement rates: flop rates improve ~59%/year, while bandwidth improves only ~23-26%/year.]
2004: trend transition into multi-core further increases communication costs.
Develop faster algorithms: minimize communication (down to the lower bound, if possible).
Model and Motivation
Communication-avoiding algorithms: Save time
Communication-avoiding algorithms: Save energy
Image courtesy of John Shalf (LBNL)
Communication → Energy
• Often no surface-to-volume ratio.
• Very little data reuse in existing algorithmic formulations.
• Already heavily communication bound.
Communication crucial for graphs
[Figure: percentage of time spent in computation vs. communication, for 1 to 4096 cores]
2D sparse matrix-matrix multiply emulating:
• Graph contraction
• AMG restriction operations
Scale-23 R-MAT (scale-free graph) times an order-4 restriction operator. Cray XT4 (Franklin), NERSC.
B., Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing, 2012.
Sparse matrix-sparse matrix multiplication (Sparse GEMM)
The Combinatorial BLAS implements these, and more, on arbitrary semirings, e.g. (×, +), (and, or), (+, min)
Sparse matrix-sparse vector multiplication (SpMV)
Linear-algebraic primitives
Element-wise operations. Sparse matrix indexing.
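As an illustration of the semiring abstraction mentioned above, here is a minimal plain-Python sketch of sparse matrix-sparse matrix multiplication parameterized by the add and multiply operations. It is not the Combinatorial BLAS API; matrices are simply dictionaries of dictionaries.

```python
# Illustrative SpGEMM over an arbitrary semiring (add, multiply); not the CombBLAS API.
def spgemm(A, B, add, multiply):
    """A and B are sparse matrices stored as {row: {col: value}}; returns C = A (x) B."""
    C = {}
    for i, Arow in A.items():
        Crow = {}
        for k, a_ik in Arow.items():
            for j, b_kj in B.get(k, {}).items():
                term = multiply(a_ik, b_kj)
                Crow[j] = add(Crow[j], term) if j in Crow else term
        if Crow:
            C[i] = Crow
    return C

# Tropical semiring (multiply = +, add = min): one step of shortest-path relaxation.
A = {0: {1: 5.0, 2: 9.0}, 1: {2: 1.0}}
print(spgemm(A, A, add=min, multiply=lambda x, y: x + y))   # {0: {2: 6.0}}
```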
The case for sparse matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations vs. graphs in the language of linear algebra:
• Data driven, unpredictable communication → fixed communication patterns
• Irregular and unstructured, poor locality of reference → operations on matrix blocks exploit the memory hierarchy
• Fine grained data accesses, dominated by latency → coarse grained parallelism, bandwidth limited
Breadth-first search in the language of linear algebra
[Figure: a directed graph on 7 vertices and its transposed adjacency matrix AT (rows indexed by "to", columns by "from")]
[Figure: first BFS step: the frontier vector x is multiplied by AT; newly discovered vertices record their parents]
Particular semiring operations: Multiply: select; Add: minimum.
[Figure: second BFS step: the new frontier is multiplied by AT; each newly reached vertex chooses a parent among the frontier vertices]
Multiply traverses outgoing edges; Add chooses among incoming edges.
[Figure: third BFS step]
Select vertex with minimum label as parent.
[Figure: final BFS step]
Result: Deterministic breadth-first search
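Putting the preceding steps together, the following plain-Python sketch (a serial illustration, not the actual distributed implementation) runs the whole BFS as repeated frontier expansions in the spirit of y = ATx over the (select, min) semiring, maintaining a parents array.

```python
# Illustrative BFS "in the language of linear algebra" (serial sketch).
# out_nbrs[v] lists the out-neighbors of v, i.e. the nonzero rows of column v of AT.
def bfs_semiring(out_nbrs, source):
    n = len(out_nbrs)
    parents = [-1] * n
    parents[source] = source
    frontier = {source: source}              # sparse vector: vertex -> proposed parent
    while frontier:
        next_frontier = {}
        for v in frontier:                   # "multiply" (select): traverse outgoing edges
            for w in out_nbrs[v]:
                if parents[w] == -1:
                    # "add" (min): keep the smallest-labeled candidate parent
                    if w not in next_frontier or v < next_frontier[w]:
                        next_frontier[w] = v
        for w, p in next_frontier.items():
            parents[w] = p
        frontier = next_frontier
    return parents

# Hypothetical 7-vertex example (0-indexed); not the exact graph on the slides.
out_nbrs = [[1, 3], [2, 3], [4], [5], [6], [6], []]
print(bfs_semiring(out_nbrs, source=0))      # [0, 0, 1, 0, 2, 3, 5]
```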
Performance observations of level-synchronous BFS
[Figure: per-phase breakdown on a YouTube social network and a synthetic network: number of vertices in the frontier array and execution time, as a percentage of the total, across BFS phases 1-15]
When the frontier is at its peak, almost all edge examinations "fail" to claim a child.
Bottom-up BFS algorithm [Single-node: Beamer et al. 2012]
The classical (top-down) algorithm is optimal in the worst case, but pessimistic for low-diameter graphs (previous slide).
Direction-optimizing approach: switch from top-down to bottom-up search when the majority of the vertices have been discovered (see the sketch below).
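A minimal serial sketch of the idea, assuming an undirected graph and using the simple "majority of vertices discovered" test from this slide as the switch; the heuristics in Beamer et al. are more refined, and the real algorithm can also switch back to top-down.

```python
# Illustrative direction-optimizing BFS (serial sketch with a simplified switch heuristic).
def direction_optimizing_bfs(adj, source):
    """adj[v] lists the neighbors of v (undirected graph assumed)."""
    n = len(adj)
    parents = [-1] * n
    parents[source] = source
    frontier = {source}
    discovered = 1
    while frontier:
        next_frontier = set()
        if discovered > n // 2:
            # Bottom-up step: each UNvisited vertex scans its neighbors and
            # stops at the first one that is in the frontier.
            for v in range(n):
                if parents[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parents[v] = u
                            next_frontier.add(v)
                            break
        else:
            # Top-down step: frontier vertices claim all of their unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if parents[v] == -1:
                        parents[v] = u
                        next_frontier.add(v)
        discovered += len(next_frontier)
        frontier = next_frontier
    return parents
```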
Direction-optimizing BFS on distributed memory
• Kronecker (Graph500): 16 billion vertices and 256 billion edges.
• Implemented on top of the Combinatorial BLAS (i.e., uses a 2D decomposition).
Beamer, B., Asanović, and Patterson. "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search." MTAAP'13, in conjunction with IPDPS.
Direction-optimizing BFS reduces communication
k: average vertex degree of the graph
Communication volume reduction based on the analytical model derived in the paper, assuming 4 bottom-up steps (typical for power-law graphs).
Direction-optimizing BFS saves energy
For a fixed-size real input, the direction-optimizing algorithm completes in the same time using 1/16th of the processors (and energy).
• Parallel breadth-first search is implemented with sparse GEMM
• Work-efficient, highly-parallel implementation of Brandes’ algorithm
Betweenness Centrality using Sparse GEMM
[Figure: batched BFS frontiers as the columns of a sparse matrix B; one traversal step is the sparse matrix product AT · B]
Betweenness Centrality is a measure of influence.
CB(v): Among all the shortest paths, what fraction passes through v?
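To illustrate why sparse GEMM shows up here, the sketch below advances several BFS searches at once by "multiplying" AT with a sparse matrix B whose columns are the current frontiers. This is only the traversal skeleton of such an approach, not Brandes' full dependency accumulation, and the dictionary-of-dictionaries layout is purely illustrative.

```python
# Illustrative sketch: one step of multi-source BFS as a sparse matrix product AT * B.
# AT[v][u] nonzero means there is an edge u -> v; column j of B is the frontier of search j.
def multi_source_step(AT, B):
    next_frontiers = {}
    for v, row in AT.items():
        for u in row:
            for j in B.get(u, {}):                        # u is in the frontier of search j,
                next_frontiers.setdefault(v, {})[j] = 1   # so v is reached by search j
    return next_frontiers

AT = {1: {0: 1}, 2: {0: 1, 1: 1}}            # edges 0->1, 0->2, 1->2
B  = {0: {0: 1}, 1: {1: 1}}                  # search 0 starts at vertex 0, search 1 at vertex 1
print(multi_source_step(AT, B))              # {1: {0: 1}, 2: {0: 1, 1: 1}}
```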
Sparse Matrix-Matrix Multiplication
Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, …
How do dense and sparse GEMM compare? Dense: lower bounds match algorithms; allows extensive data reuse. Sparse: significant gap; inherent poor reuse?
What do we obtain here? Improved (higher) lower bound and new optimal algorithms, but only for random matrices:
Erdős-Rényi(n,d) graphs aka G(n, p=d/n)
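For reference, Erdős-Rényi(n, d) here means each entry is nonzero independently with probability d/n, i.e. d nonzeros per row in expectation. A minimal (naive, O(n²)-time) generator sketch:

```python
import random

# Illustrative Erdos-Renyi(n, d) generator: each entry is nonzero with probability d/n,
# so every row has d nonzeros in expectation. Naive O(n^2) loop, for illustration only.
def erdos_renyi(n, d, seed=0):
    rng = random.Random(seed)
    p = d / n
    A = {}
    for i in range(n):
        row = {j: 1.0 for j in range(n) if rng.random() < p}
        if row:
            A[i] = row
    return A

A = erdos_renyi(n=1000, d=4)
print(sum(len(row) for row in A.values()) / 1000)   # close to 4 nonzeros per row
```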
The computation cube and sparsity-independent algorithms
Matrix multiplication: ∀(i,j) ∈ n x n, C(i,j) = Σk A(i,k)B(k,j),
The computation (discrete) cube:
• A face for each (input/output) matrix (A, B, C)
• A grid point for each multiplication
1D algorithms 2D algorithms 3D algorithms
How about sparse algorithms?
Sparsity-independent algorithms: the assignment of grid points to processors is independent of the sparsity structure. In particular: if Cij is nonzero, who holds it? All standard algorithms are sparsity independent.
Assumptions: sparsity-independent algorithm; input (and output) are sparse; the algorithm is load balanced.
Algorithms attaining lower bounds
Previous sparse classical lower bound [Ballard et al., SIMAX'11]:
W = Ω( M · #FLOPs / (P · M^(3/2)) ) = Ω( d²n / (P · √M) )
New lower bound for Erdős-Rényi(n,d) [here], on the expected communication (under some technical assumptions):
W = Ω( min{ d·n / √P , d²·n / P } )
✗ No previous algorithm attains this bound!
✓ Two new algorithms achieve the bounds (up to a logarithmic factor):
i. Recursive 3D, based on [Ballard et al., SPAA'12]
ii. Iterative 3D, based on [Solomonik & Demmel, EuroPar'11]
Ballard, B., Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
3D Iterative Sparse GEMM [Dense case: Solomonik and Demmel, 2011]
Optimal number of replicas (theoretically): c = Θ( p / d² )
3D algorithms: using extra memory (c replicas) reduces communication volume by a factor of c^(1/2) compared to 2D.
Performance results (Erdős-Rényi graphs)
Iterative 2D-3D performance results (R-MAT graphs)
[Figure: time in seconds vs. number of grid layers c ∈ {1, 4, 16} for p = 4096 (scale 20), p = 16384 (scale 22), and p = 65536 (scale 24); time broken down into Imbalance, Merge, HypersparseGEMM, Comm_Reduce, and Comm_Bcast]
2X Faster
5X faster if load balanced
• Input: directed graph with "costs" on edges
• Find least-cost paths between all reachable vertex pairs
• Classical algorithm: Floyd-Warshall
• It turns out a previously overlooked recursive version is more parallelizable than the triple nested loop
for k = 1:n   // the induction sequence
  for i = 1:n
    for j = 1:n
      if (w(i→k) + w(k→j) < w(i→j))
        w(i→j) := w(i→k) + w(k→j)
[Figure: illustration of the k = 1 case on a 5×5 distance matrix]
All-pairs shortest-paths problem
[Figure: 6-vertex weighted directed graph used as the running example; edge weights as in the matrix below]
   0   5   9   ∞   ∞   4
   ∞   0   1  −2   ∞   ∞
   3   ∞   0   4   ∞   3
   ∞   ∞   5   0   4   ∞
   ∞   ∞  −1   ∞   0   ∞
  −3   ∞   ∞   ∞   7   0
The matrix is partitioned into 2×2 blocks A, B, C, D, corresponding to a split of the vertices into sets V1 and V2:
A = A*;      % recursive call
B = AB;
C = CA;
D = D + CB;
D = D*;      % recursive call
B = BD;
C = DC;
A = A + BC;
+ is “min”, × is “add”
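A compact serial Python sketch of this recursive scheme, with block "multiplication" and "addition" interpreted over the (min, +) semiring; it mirrors the eight block operations above but, for brevity, omits the parent matrix Π maintained on the next slides.

```python
INF = float('inf')

def min_plus(X, Y):
    """Semiring product over (min, +): result[i][j] = min_k X[i][k] + Y[k][j]."""
    return [[min((X[i][k] + Y[k][j] for k in range(len(Y))), default=INF)
             for j in range(len(Y[0]))] for i in range(len(X))]

def elem_min(X, Y):
    """Semiring addition: element-wise minimum."""
    return [[min(x, y) for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def apsp(W):
    """Recursive blocked all-pairs shortest paths on a square cost matrix W."""
    n = len(W)
    if n == 1:
        return [[min(W[0][0], 0)]]
    h = n // 2
    A = [row[:h] for row in W[:h]]; B = [row[h:] for row in W[:h]]
    C = [row[:h] for row in W[h:]]; D = [row[h:] for row in W[h:]]
    A = apsp(A)                        # A = A*   (recursive call)
    B = min_plus(A, B)                 # B = AB
    C = min_plus(C, A)                 # C = CA
    D = elem_min(D, min_plus(C, B))    # D = D + CB
    D = apsp(D)                        # D = D*   (recursive call)
    B = min_plus(B, D)                 # B = BD
    C = min_plus(D, C)                 # C = DC
    A = elem_min(A, min_plus(B, C))    # A = A + BC
    return [a + b for a, b in zip(A, B)] + [c + d for c, d in zip(C, D)]

W = [[0, 5, 9], [INF, 0, 1], [3, INF, 0]]
print(apsp(W))   # entry (0, 2) becomes 6 via the path 0 -> 1 -> 2
```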
[Figure: the same 6-vertex weighted graph; the newly discovered 3-1-2 path is highlighted]
Distances:
   0   5   9   ∞   ∞   4
   ∞   0   1  −2   ∞   ∞
   3   8   0   4   ∞   3
   ∞   ∞   5   0   4   ∞
   ∞   ∞  −1   ∞   0   ∞
  −3   ∞   ∞   ∞   7   0
The block product CB (rows separated by ";"): CB = [∞; 3] · [5 9] = [∞ ∞; 8 12], where the 8 is the cost of the 3-1-2 path: 3 + 5.
Parents:
Π =
   1   1   1   1   1   1
   2   2   2   2   2   2
   3   1   3   3   3   3
   4   4   4   4   4   4
   5   5   5   5   5   5
   6   6   6   6   6   6
a(3,2) = a(3,1) + a(1,2), hence Π(3,2) = Π(1,2).
[Figure: the same 6-vertex weighted graph; the 1-2-3 path shortens the 1→3 distance to 6]
Distances:
   0   5   6   ∞   ∞   4
   ∞   0   1  −2   ∞   ∞
   3   8   0   4   ∞   3
   ∞   ∞   5   0   4   ∞
   ∞   ∞  −1   ∞   0   ∞
  −3   ∞   ∞   ∞   7   0
Parents:
Π =
   1   1   2   1   1   1
   2   2   2   2   2   2
   3   1   3   3   3   3
   4   4   4   4   4   4
   5   5   5   5   5   5
   6   6   6   6   6   6
D = D*: no change
The block product BD (rows separated by ";"): BD = [5 9] · [0 1; 8 0] = [5 6]
Path: 1-2-3
a(1,3) = a(1,2) + a(2,3), hence Π(1,3) = Π(2,3).
Bandwidth: W_bc-2.5D(n, p) = O( n² / √(c·p) )
Latency: S_bc-2.5D(p) = O( √(c·p) · log²(p) )
c: number of replicas. Optimal for any memory size!
Communication-avoiding APSP on distributed memory
[Figure: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192, with c = 1, 4, and 16 replicas]
Communication-avoiding APSP on distributed memory
Solomonik, B., and J. Demmel. “Minimizing communication in all-pairs shortest paths”, Proceedings of the IPDPS. 2013.
Conclusions