![Page 1: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/1.jpg)
Optimizing Graph Algorithms for Improved Cache Performance
Aya Mire & Amir Nahir
Based on: Optimizing Graph Algorithms for Improved Cache Performance – Joon-Sang Park, Michael Penner, Viktor K. Prasanna
![Page 2: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/2.jpg)
The Problem with Graphs…
Graph problems pose unique challenges to improving cache performance due to their irregular data access patterns.
![Page 3: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/3.jpg)
Agenda
• A recursive implementation of the Floyd-Warshall Algorithm.
• A tiled implementation of the Floyd-Warshall Algorithm.
• Efficient data structures for general graph problems.
• Optimizations for the maximum matching algorithm.
![Page 4: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/4.jpg)
Analysis model
• All proofs and complexity analysis will be based on the I/O model, i.e., the goal of the improved algorithm is to minimize the number of CPU–memory transactions.
[Diagram: memory hierarchy CPU ↔ Cache ↔ Main Memory, with transaction costs A, B, C satisfying cost(A) ≪ cost(B) and cost(C) ≪ cost(B).]
![Page 5: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/5.jpg)
Analysis model
All proofs will assume total control of the cache, i.e., if the cache is big enough to hold two data blocks, then the two can be held in the cache without running over each other (no conflict misses).
![Page 6: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/6.jpg)
The Floyd Warshall Algorithm
• An ‘all pairs shortest paths’ algorithm.
• Works by iteratively calculating D(k), where D(k) is the matrix of all-pairs shortest paths whose intermediate vertices are among 1, 2, …, k.
• Each iteration depends on the result of the previous one.
• Time complexity: Θ(|V|³).
![Page 7: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/7.jpg)
The Floyd Warshall Algorithm
Pseudo code:
for k from 1 to |V|
  for i from 1 to |V|
    for j from 1 to |V|
      Di,j(k) ← min( Di,j(k-1) , Di,k(k-1) + Dk,j(k-1) )
return D(|V|)
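The triply nested update above can be sketched in Python (a minimal illustration, not from the slides; `INF` marking an absent edge is our convention):

```python
INF = float("inf")  # marks an absent edge (our convention)

def floyd_warshall(d):
    """In-place all-pairs shortest paths on an n x n distance matrix d."""
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # D_ij(k) <- min( D_ij(k-1), D_ik(k-1) + D_kj(k-1) )
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

Note that each value of k sweeps the entire matrix, which is exactly the access pattern the next slide points out.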
![Page 8: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/8.jpg)
The Floyd Warshall Algorithm
The algorithm accesses the entire matrix in each iteration.
The dependency of the kth iteration on the results of the (k-1)th iteration eliminates the ability to perform data reuse.
![Page 9: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/9.jpg)
Lemma 1
Suppose Di,j(k) is computed as
Di,j(k) ← min( Di,j(k-1) , Di,k(k') + Dk,j(k'') )
for k-1 ≤ k', k'' ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths.
(The traditional update is the special case k' = k'' = k-1:
Di,j(k) ← min( Di,j(k-1) , Di,k(k-1) + Dk,j(k-1) ).)
![Page 10: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/10.jpg)
Lemma 1 - Proof
To distinguish from the traditional FW algorithm, we use Ti,j(k) to denote the results calculated using the “new” computation:
Ti,j(k) ← min( Ti,j(k-1) , Ti,k(k') + Tk,j(k'') )
for k-1 ≤ k', k'' ≤ |V|.
![Page 11: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/11.jpg)
Lemma 1 - Proof
First, we show that for 1 ≤ k ≤ |V| the following inequality holds:
Ti,j(k) ≤ Di,j(k)
We prove this by induction.
Base case: by definition we have Ti,j(0) = Di,j(0).
![Page 12: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/12.jpg)
Lemma 1 - Proof
Induction step: suppose Ti,j(k) ≤ Di,j(k) holds for k = m-1. Then:
Ti,j(m) ← min( Ti,j(m-1) , Ti,m(m') + Tm,j(m'') )
  ≤ min( Di,j(m-1) , Ti,m(m') + Tm,j(m'') )   (by the induction step)
  ≤ min( Di,j(m-1) , Ti,m(m-1) + Tm,j(m-1) )   (limiting the choices of intermediate vertices makes a path the same or longer)
  ≤ min( Di,j(m-1) , Di,m(m-1) + Dm,j(m-1) )   (by the induction step)
  = Di,j(m)   (by definition)
Hence, for 1 ≤ k ≤ |V|: Ti,j(k) ≤ Di,j(k).
![Page 13: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/13.jpg)
Lemma 1 - Proof
On the other hand, since the traditional algorithm computes the shortest paths at termination, and since Ti,j(|V|) is the length of some path, we have:
Di,j(|V|) ≤ Ti,j(|V|)
Together with Ti,j(|V|) ≤ Di,j(|V|), this gives:
⇒ Di,j(|V|) = Ti,j(|V|)
![Page 14: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/14.jpg)
FW’s Algorithm – Recursive Implementation
We first consider the base case: a two-node graph with edge weights w1 (1 → 2) and w2 (2 → 1), i.e. the 2×2 matrix T with off-diagonal entries w1 and w2.
Floyd-Warshall (T)
T11 = min( T11, T11 + T11 )
T12 = min( T12, T11 + T12 )
T21 = min( T21, T21 + T11 )
T22 = min( T22, T21 + T12 )
T22 = min( T22, T22 + T22 )
T21 = min( T21, T22 + T21 )
T12 = min( T12, T12 + T22 )
T11 = min( T11, T12 + T21 )
![Page 15: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/15.jpg)
FW’s Algorithm – Recursive Implementation
The general case: divide the matrix into quadrants I, II, III, IV.
Floyd-Warshall (T)
If (not base case)
  TI = min( TI , TI + TI )
  TII = min( TII , TI + TII )
  TIII = min( TIII , TIII + TI )
  TIV = min( TIV , TIII + TII )
  TIV = min( TIV , TIV + TIV )
  TIII = min( TIII , TIV + TIII )
  TII = min( TII , TII + TIV )
  TI = min( TI , TII + TIII )
else …
(each line is a recursive call, with + denoting the min-plus product of quadrants)
![Page 16: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/16.jpg)
FW’s Recursive Algorithm – Correctness
It can be shown that for each update
Di,j(k) ← min( Di,j(k-1) , Di,k(k-1) + Dk,j(k-1) )
in FW’s traditional implementation, there is a corresponding update
Ti,j(k) ← min( Ti,j(k-1) , Ti,k(k') + Tk,j(k'') ),
where k-1 ≤ k', k'' ≤ |V|. Hence the algorithm’s correctness follows from Lemma 1.
![Page 17: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/17.jpg)
FW’s Recursive Algorithm – How does it actually work…
[Illustration: starting from T(0), the first four recursive calls carry the quadrants to TI(|V|/2), TII(|V|/2), TIII(|V|/2), TIV(|V|/2); the last four carry them to TI(|V|), TII(|V|), TIII(|V|), TIV(|V|), i.e. from T(0) to T(|V|).]
![Page 18: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/18.jpg)
FW’s Recursive Algorithm - Example
[Example: an 8-vertex weighted graph and its initial 8×8 distance matrix.]
![Page 19: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/19.jpg)
FW’s Recursive Algorithm – Example
[Example continued: the recursive calls update the matrix; discovered paths include 1-3-4 and 7-6-8.]
FW’s Recursive Algorithm – Example
[Example continued: further recursive calls discover paths such as 7-2-4, 7-6-5, 7-2-8, 7-6-4, 1-6-8, 1-6-5, 2-6-5, 1-6-4, and 2-6-4.]
![Page 21: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/21.jpg)
Representing the Matrix in an efficient way
We usually store matrices in memory in one of two ways: the row-major layout (rows stored one after another) or the column-major layout (columns stored one after another).
Using either of these layouts will not improve performance, since the algorithm breaks the matrix into quadrants.
[Illustration: a 4×4 matrix with elements 0–15, flattened in row-major order (0 1 2 3 4 5 …) and in column-major order (0 4 8 12 1 5 …).]
![Page 22: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/22.jpg)
Representing the Matrix in an efficient way
The Z-Morton layout: perform the following operations recursively until the quadrant size is a single data unit:
1. divide the matrix into four quadrants;
2. store quadrants I, II, III, IV in memory, in that order.
For example, the 4×4 matrix with elements 0–15 (numbered row-major) is stored as: 0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15.
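The Z-Morton position of an element can be computed by interleaving the bits of its row and column indices (our illustration; the slides only give the recursive definition):

```python
def z_index(row, col, bits=16):
    """Morton index: col bits in even positions, row bits in odd ones."""
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z

def to_z_morton(matrix):
    """Flatten a 2^k x 2^k matrix into Z-Morton storage order."""
    n = len(matrix)
    flat = [0] * (n * n)
    for r in range(n):
        for c in range(n):
            flat[z_index(r, c)] = matrix[r][c]
    return flat
```

On the 4×4 example this reproduces the quadrant order I, II, III, IV at every level of the recursion.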
![Page 23: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/23.jpg)
Complexity Analysis
The running time of the algorithm is given by the recurrence T(|V|) = 8·T(|V|/2), i.e. Θ(|V|³).
Without considering the cache, the number of CPU–memory transactions is exactly the running time.
![Page 24: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/24.jpg)
Complexity Analysis - Theorem
There exists some B, where B = O(√|cache|), such that, when using the FW-Recursive implementation with the matrix stored in the Z-Morton layout, the number of CPU–memory transactions is reduced by a factor of B.
⇒ there will be O(|V|³/B) CPU–memory transactions.
![Page 25: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/25.jpg)
Complexity Analysis
After k recursive calls, the size of a quadrant’s dimension is |V|/2^k.
There exists some k such that B ≜ |V|/2^k and 3·B² ≤ |cache|.
Once this condition is fulfilled, 3 matrices of size B² can be placed in the cache, and no further CPU–memory transactions are required.
⇒ B = O(√|cache|)
![Page 26: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/26.jpg)
Complexity Analysis
Therefore we get:
O(|V|/B)³ · O(B²) = O(|V|³/B)
where O(|V|/B)³ is the transaction complexity of FW on a matrix of dimension |V|/B with no cache, and O(B²) is the number of transactions required to bring a B×B quadrant into the cache.
⇒ the number of CPU–memory transactions is reduced by a factor of B.
![Page 27: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/27.jpg)
Complexity Analysis – lower bound
In “I/O Complexity: The Red-Blue Pebble Game”, J. Hong and H. Kung showed that the lower bound on CPU–memory transactions for multiplying matrices is Ω(N³/B), where B = O(√|cache|).
![Page 28: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/28.jpg)
Complexity Analysis – lower bound – Theorem
The lower bound on CPU–memory transactions for the Floyd-Warshall algorithm is Ω(|V|³/B), where B = O(√|cache|).
Proof: by reduction
![Page 29: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/29.jpg)
Complexity Analysis – lower bound theorem - Proof
Matrix multiplication:
for k from 1 to N
  for i from 1 to N
    for j from 1 to N
      Ck,i += Ak,j · Bj,i
has the same structure as the FW update
Di,j(k) ← min( Di,j(k-1) , Di,k(k-1) + Dk,j(k-1) )
on |V|×|V| matrices.
![Page 30: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/30.jpg)
Complexity Analysis - Conclusion
The algorithm’s complexity: O(|V|³/B). Lower bound for FW: Ω(|V|³/B).
⇒ The recursive implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to CPU–memory transactions).
![Page 31: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/31.jpg)
FW’s Algorithm – Recursive Implementation - Comments
Note that the size of the cache is not one of the algorithm’s parameters, nor is it needed in order to store the matrix in the Z-Morton layout.
Therefore: the algorithm is cache-oblivious.
![Page 32: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/32.jpg)
FW’s Algorithm – Recursive Implementation - Comments
Though the analysis model included only a single level of cache, since no special cache attributes were assumed, the proofs generalize to multiple levels of cache.
[Diagram: multi-level hierarchy with L0, L1, and L2 caches and main memory.]
![Page 33: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/33.jpg)
FW’s Algorithm – Recursive Implementation - Comments
Since cache parameters have been disregarded, the best (and simplest) way to find the optimal size B is by experiment.
![Page 34: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/34.jpg)
FW’s Algorithm – Recursive Implementation - Improvement
The algorithm can be further improved by making it cache-conscious: perform the recursive calls until the problem size is reduced to B, and solve the B-size problem in the traditional way (this saves the recursive calls’ overhead).
This modification showed up to a 2x improvement in running time on some of the machines.
![Page 35: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/35.jpg)
![Page 36: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/36.jpg)
FW’s Algorithm – Tiled Implementation
Consider a special case of Lemma 1 in which k', k'' are restricted such that
k-1 ≤ k', k'' ≤ k-1+B
where 3·B² ≤ |cache| (B = O(√|cache|)).
This leads to the following tiled implementation of FW’s algorithm.
![Page 37: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/37.jpg)
FW’s Algorithm – Tiled Implementation
Divide the matrix into B×B tiles.
Perform |V|/B iterations; during the tth iteration:
I. update the (t,t)th tile;
II. update the remainder of the tth tile-row and tile-column;
III. update the rest of the matrix.
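The three phases can be sketched as follows (our illustration, assuming n is divisible by B; `fw_update` applies the FW relaxation for k in `ks`, rows in `rows`, columns in `cols`):

```python
def fw_update(d, ks, rows, cols):
    for k in ks:
        for i in rows:
            for j in cols:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]

def floyd_warshall_tiled(d, B):
    n = len(d)
    for t in range(0, n, B):                    # one round per diagonal tile
        ks = range(t, t + B)
        fw_update(d, ks, ks, ks)                # phase I: the (t,t) tile
        others = [s for s in range(0, n, B) if s != t]
        for s in others:                        # phase II: tile-row/column t
            fw_update(d, ks, ks, range(s, s + B))
            fw_update(d, ks, range(s, s + B), ks)
        for s1 in others:                       # phase III: the rest
            for s2 in others:
                fw_update(d, ks, range(s1, s1 + B), range(s2, s2 + B))
    return d
```

Each phase reads at most three B×B tiles at a time, which is what makes the working set fit in the cache.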
![Page 38: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/38.jpg)
FW’s Algorithm – Tiled Implementation
Each iteration consists of three phases.
Phase I: perform FW’s algorithm on the (t,t)th tile (which is self-dependent).
![Page 39: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/39.jpg)
FW’s Algorithm – Tiled Implementation
Phase II: update the remainder of tile-row t:
Ai,j(k) ← min( Ai,j(k-1) , Ai,k(tB) + Ak,j(k-1) )
and the remainder of tile-column t:
Ai,j(k) ← min( Ai,j(k-1) , Ai,k(k-1) + Ak,j(tB) )
During the tth iteration, k runs over that round’s B values, from (t-1)·B + 1 to t·B.
![Page 40: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/40.jpg)
FW’s Algorithm – Tiled Implementation
Phase III: update the rest of the matrix:
Ai,j(k) ← min( Ai,j(k-1) , Ai,k(tB) + Ak,j(tB) )
During the tth iteration, k runs over that round’s B values, from (t-1)·B + 1 to t·B.
![Page 41: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/41.jpg)
FW’s Algorithm – Tiled Example
[Example: the same 8-vertex weighted graph and its initial 8×8 distance matrix.]
![Page 42: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/42.jpg)
FW’s Algorithm – Tiled Example
[Example continued: phase I updates the (t,t)th tile; the path 7-2-8 is found.]
![Page 43: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/43.jpg)
FW’s Algorithm – Tiled Example
[Example continued: phase II updates the remainder of tile-row t and tile-column t; the path 1-3-4 is found.]
FW’s Algorithm – Tiled Example
[Example continued: phase III updates the rest of the matrix; paths such as 1-6-5, 2-6-5, 7-6-5, 1-6-4, 2-6-4, 1-6-8, and 7-6-4 are found.]
![Page 45: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/45.jpg)
FW’s Algorithm – Tiled Example
[Example continued: the updated 8×8 distance matrix after the iteration.]
![Page 46: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/46.jpg)
Representing the Matrix in an efficient way
In order to match the data access pattern, a tile must be stored in contiguous memory.
Therefore, the Z-Morton layout is used.
[Illustration: the 4×4 matrix stored in Z-Morton order: 0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15.]
![Page 47: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/47.jpg)
FW’s Tiled Algorithm –correctness
Let Di,j(k) be the result of the kth iteration of the traditional FW implementation.
Even though Di,j(k) and Ai,j(k) may not be equal during the “inner” iterations, it can be shown by induction that at the end of each iteration (i.e. for k = t·B), Di,j(k) = Ai,j(k).
![Page 48: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/48.jpg)
Complexity Analysis - Theorem
There exists some B, where B = O(√|cache|), such that, when using the FW-Tiled implementation, the number of CPU–memory transactions is reduced by a factor of B.
⇒ there will be O(|V|³/B) CPU–memory transactions.
![Page 49: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/49.jpg)
Complexity Analysis
There are |V|/B × |V|/B tiles in the matrix.
There are |V|/B iterations; in each iteration, all tiles are accessed.
Updating a tile requires holding at most 3 tiles in the cache ⇒ 3·B² ≤ |cache|.
![Page 50: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/50.jpg)
Complexity Analysis
Therefore we get:
(|V|/B) · [(|V|/B) × (|V|/B)] · O(B²) = O(|V|³/B)
where (|V|/B) is the number of iterations, (|V|/B) × (|V|/B) is the number of tiles in the matrix, and O(B²) is the number of transactions required to bring a B×B tile into the cache.
⇒ the number of CPU–memory transactions is reduced by a factor of B.
![Page 51: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/51.jpg)
Complexity Analysis - Conclusion
The algorithm’s complexity: O(|V|³/B). Lower bound for FW: Ω(|V|³/B).
⇒ The tiled implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to CPU–memory transactions).
![Page 52: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/52.jpg)
FW’s Algorithm – Tiled Implementation - Comments
Note that when using the tiling method, the size of the cache is one of the algorithm’s parameters.
Therefore: the tiled algorithm is cache-conscious.
![Page 53: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/53.jpg)
FW’s Algorithm – Tiled Implementation - Comments
Since the model disregards the cache’s finer parameters, the best (and simplest) way to find the optimal size B is by experiment.
![Page 54: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/54.jpg)
FW’s Algorithm – experimental results
Both algorithms (recursive and tiled) showed a 30% improvement in L1 cache misses and a 50% improvement in L2 cache misses for problem sizes of 1024 and 2048 vertices.
The results for both algorithms are nearly identical (less than 1% difference).
![Page 55: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/55.jpg)
Dijkstra’s algorithm for Single Source Shortest Paths & Prim’s Algorithm for
Minimum Spanning Tree
Dijkstra’s Algorithm:
S ← ∅
Q ← V
while Q ≠ ∅
  u ← extract-min(Q)
  S ← S ∪ {u}
  for each v ∊ adj(u)
    update d[v]
return S

Prim’s Algorithm:
Q ← V
for each u ∊ Q do key(u) ← ∞
key(root) ← 0
while Q ≠ ∅
  u ← extract-min(Q)
  for each v ∊ adj(u)
    if v ∊ Q and weight(u,v) < key(v) then key(v) ← weight(u,v)

Both algorithms have the same data access pattern.
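Dijkstra’s loop above can be sketched with a binary heap (our illustration; `adj[u]` holds (v, w) pairs, matching the adjacency-array style the slides discuss):

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths; adj[u] is a list of (v, w) pairs."""
    INF = float("inf")
    dist = [INF] * len(adj)
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)   # extract-min(Q)
        if d > dist[u]:
            continue                 # stale heap entry, skip
        for v, w in adj[u]:          # scan adj(u), update d[v]
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist
```

The inner loop is the part whose memory behavior depends on the graph representation: it scans all edges of u in one pass.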
![Page 56: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/56.jpg)
Graph representation
There are two commonly used graph representations.
The adjacency matrix: A(i,j) = the cost of the edge from vertex i to vertex j.
Elements are accessed in a contiguous fashion.
Representation size: O(|V|²).
![Page 57: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/57.jpg)
Graph representation
The adjacency list: a pointer-based representation where a list of adjacent vertices is stored for each vertex in the graph; each node in the list holds the cost of the edge from the given vertex to the adjacent vertex.
Representation size: O(|V| + |E|).
The pointer-based representation leads to cache pollution.
![Page 58: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/58.jpg)
Adjacency Array representation
For each vertex in the graph, there is an array of adjacent vertices, with edge costs stored inline.
Representation size: O(|V| + |E|).
Elements are accessed in a contiguous fashion.
[Illustration: an index over vertices 1, 2, 3, …, |V|, each pointing to a packed array of (vi, wi), (vj, wj), … pairs.]
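One common way to build such a structure is CSR-style packing (our illustration, not the paper’s exact layout): an offsets array plus flat destination and weight arrays, so vertex u’s edges occupy a contiguous slice.

```python
def build_adjacency_array(num_vertices, edges):
    """CSR-style packing: edges of vertex u live in
    dest[off[u]:off[u+1]] and weight[off[u]:off[u+1]]."""
    deg = [0] * num_vertices
    for u, _, _ in edges:
        deg[u] += 1
    off = [0] * (num_vertices + 1)
    for u in range(num_vertices):
        off[u + 1] = off[u] + deg[u]
    dest = [0] * len(edges)
    weight = [0] * len(edges)
    pos = list(off[:-1])             # next free slot per vertex
    for u, v, w in edges:
        dest[pos[u]] = v
        weight[pos[u]] = w
        pos[u] += 1
    return off, dest, weight
```

Scanning a vertex’s neighbors then walks adjacent memory, avoiding the pointer chasing of the list representation.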
![Page 59: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/59.jpg)
Matching Algorithm for Bipartite Graph
Matching: a set M of edges in a graph is a matching if no vertex of the graph is an end of more than one edge in M.
A matching is maximum if it is larger than any other matching.
[Example: a bipartite graph on vertices 1, 2 (one side) and 3, 4 (the other): 1–4 is a maximal matching; 1–3, 2–4 is a maximum matching.]
![Page 60: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/60.jpg)
Matching Algorithm for Bipartite Graph
Let M be a matching. All edges in the graph are divided into two groups: matching edges and non-matching edges.
A vertex is called free if it is not an end of any matching edge.
![Page 61: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/61.jpg)
Matching Algorithm for Bipartite Graph
A path P = u0, e1, u1, …, ek, uk is called an augmenting path (with respect to M) if:
- u0 and uk are free;
- the even-numbered edges e2, e4, …, e(k-1) are matching edges.
The set of edges M \ {e2, e4, …, e(k-1)} ∪ {e1, e3, …, ek} is also a matching, and it has one more edge than M.
So, if we find an augmenting path, we can construct a larger matching.
![Page 62: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/62.jpg)
Finding Augmenting paths in a Bipartite Graph
In bipartite graphs, each augmenting path has one end in A and one end in B. Following such a path starting from its end in A, we traverse non-matching edges from A to B and matching edges from B to A.
By turning the graph into a directed graph (all matching edges directed vB → vA, all the rest vA → vB), we turn the problem into simple path finding in a directed graph.
![Page 63: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/63.jpg)
Matching Algorithm for Bipartite Graph
The Algorithm:
while (there exists an augmenting path)
  increase |M| by one using the augmenting path
return M
Algorithm’s complexity: O(|V|·|E|).
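The loop above can be sketched with the standard DFS-based augmenting-path search (Kuhn’s algorithm; our illustration, with A-vertices indexed by `adj` and B-vertices numbered 0..num_b-1):

```python
def max_bipartite_matching(adj, num_b):
    """adj[u] lists the B-neighbors of A-vertex u; returns |M|."""
    match_b = [-1] * num_b              # B-vertex -> matched A-vertex

    def augment(u, visited):
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                # v is free, or its current partner can be re-matched:
                # either way we have completed an augmenting path.
                if match_b[v] == -1 or augment(match_b[v], visited):
                    match_b[v] = u
                    return True
        return False

    # Each successful search increases |M| by one.
    return sum(augment(u, set()) for u in range(len(adj)))
```

The recursive call is exactly the “flip matched and unmatched edges along the path” step, implemented by rewriting `match_b` on the way back.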
![Page 64: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/64.jpg)
Matching Algorithm for Bipartite Graph – first optimization
In order to find augmenting paths, we use the BFS algorithm, which has a data access pattern similar to that of Dijkstra/Prim.
Therefore, using the adjacency array instead of the adjacency list / matrix improves running time.
![Page 65: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/65.jpg)
Matching Algorithm for Bipartite Graph – second optimization
We reduce the size of the working set, as in tiling:
I. Partition G into g[1], g[2], …, g[p].
II. Find the maximum matching in g[i] for each i ∊ {1, 2, …, p} using the basic algorithm.
III. Unite all sub-matchings into M.
IV. Find the maximum matching in G using the basic algorithm (starting with M).
![Page 66: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/66.jpg)
Matching Algorithm for Bipartite Graph – second optimization
If the sizes of the sub-graphs are chosen appropriately, so that each fits into the cache, phase II generates only O(|V| + |E|) CPU–memory transactions, because each data element needs to be loaded into the cache only once.
The best size for a sub-graph is found by experiment.
![Page 67: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/67.jpg)
Matching Algorithm for Bipartite Graph – best case
In the best case, the maximum matching is already found in phase II, and the algorithm’s CPU–memory transaction complexity is O(|V| + |E|).
That leaves us with the problem of partitioning the graph optimally.
![Page 68: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/68.jpg)
Partitioning the Bipartite Graph
The goal: partition the edges into two groups such that the best possible matching can be found within each group.
Algorithm:
I. Arbitrarily partition the vertices into 4 equal partitions.
II. Count the number of edges between each pair of partitions.
III. Combine the partitions into two, such that as many “internal” edges as possible are created.
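The three steps can be sketched as follows (our illustration; the initial 4-way split here is simply v mod 4, and edges are (u, v) pairs):

```python
from itertools import combinations

def partition_graph(num_vertices, edges):
    group = [v % 4 for v in range(num_vertices)]      # step I: arbitrary split
    cross = {pair: 0 for pair in combinations(range(4), 2)}
    for u, v in edges:                                # step II: count edges
        gu, gv = sorted((group[u], group[v]))
        if gu != gv:
            cross[(gu, gv)] += 1
    # Step III: of the three ways to pair four groups into two halves,
    # keep the pairing whose halves contain the most internal edges.
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    return max(pairings, key=lambda p: cross[p[0]] + cross[p[1]])
```

Edges between the two merged groups of a half become internal to that half, which is what the heuristic maximizes.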
![Page 69: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/69.jpg)
Conclusions
Using efficient data representations can greatly improve an algorithm’s running time.
Further improvement can be achieved by methods such as tiling and recursion.
Other graph algorithms, such as Bellman-Ford, BFS, and DFS, can be improved by the above techniques because of their data access patterns.
![Page 70: Optimizing Graph Algorithms for Improved Cache Performance](https://reader033.vdocuments.us/reader033/viewer/2022051517/568157f2550346895dc56eb4/html5/thumbnails/70.jpg)