![Page 1: Partitioning for PageRank (rezab/dao/notes/Partitioning_PageRank.pdf). Start each page at a rank of 1. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors.](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/1.jpg)
Reza Zadeh
Partitioning for PageRank
@Reza_Zadeh | http://reza-zadeh.com
![Page 2: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/2.jpg)
Motivation
Recall from the first lecture that network bandwidth is ~100× as expensive as memory bandwidth
One way Spark avoids using the network is through locality-aware scheduling for RAM and disk
Another important tool is controlling the partitioning of RDD contents across nodes
![Page 3: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/3.jpg)
Spark PageRank
Given a directed graph, compute node importance. Two RDDs:
» Neighbors (a sparse graph/matrix)
» Current guess (a vector)
Best representation for vector and matrix?
![Page 4: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/4.jpg)
Example
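A minimal sketch of the example in plain Python (not Spark code). It follows the two steps named in the deck description: start each page at a rank of 1, then on each iteration have page p contribute rank_p / |neighbors_p| to its neighbors. The final update rule with constants 0.15 and 0.85, the function name `pagerank`, and the toy graph are assumptions for illustration. The two dictionaries stand in for the neighbors RDD and the current-guess RDD.

```python
from collections import defaultdict

def pagerank(links, iterations=10, damping=0.85):
    """links: dict mapping page -> list of outgoing neighbors."""
    # Step 1: start each page at a rank of 1.
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = defaultdict(float)
        # Step 2: each page contributes rank / |neighbors| to its neighbors.
        # In Spark this step would be a join of links with ranks.
        for page, neighbors in links.items():
            for dest in neighbors:
                contribs[dest] += ranks[page] / len(neighbors)
        # Assumed update rule: (1 - damping) + damping * sum of contributions.
        ranks = {page: (1 - damping) + damping * contribs[page] for page in links}
    return ranks

# Hypothetical toy graph: every page has at least one outgoing link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Because the join of links with ranks happens on every iteration, how the two datasets are laid out across nodes decides how much data crosses the network each time.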
![Page 5: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/5.jpg)
Execution
![Page 6: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/6.jpg)
Solution
![Page 7: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/7.jpg)
New Execution
![Page 8: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/8.jpg)
How it works
Each RDD has an optional Partitioner object
Any shuffle operation on an RDD with a Partitioner will respect that Partitioner
Any shuffle operation on two RDDs will take on the Partitioner of one of them, if one is set
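The effect of these rules can be illustrated with a small plain-Python simulation (not Spark code). It assumes a hash partitioner that sends key k to partition hash(k) mod n, and counts how many link records would have to move across the network so that a join can line up matching keys. The helper names `hash_partition` and `records_moved` are hypothetical.

```python
import zlib

def hash_partition(key, num_partitions):
    # Deterministic stand-in for a hash partitioner
    # (Python's built-in hash() is salted per process for strings).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def records_moved(placement, num_partitions):
    """Count records whose current node differs from the node their key hashes to."""
    return sum(1 for key, node in placement.items()
               if node != hash_partition(key, num_partitions))

num_partitions = 4
pages = [f"page{i}" for i in range(1000)]

# Unpartitioned links: records sit wherever they were loaded (round-robin here).
links_arbitrary = {p: i % num_partitions for i, p in enumerate(pages)}
# Pre-partitioned links: each record already sits on the node its key hashes to.
links_partitioned = {p: hash_partition(p, num_partitions) for p in pages}

moved_without = records_moved(links_arbitrary, num_partitions)
moved_with = records_moved(links_partitioned, num_partitions)
```

In this sketch the pre-partitioned links never move, so only the (much smaller) ranks would need to be shuffled to the partitions where the links already live.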
![Page 9: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/9.jpg)
Examples
![Page 10: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/10.jpg)
Main Conclusion
Controlled partitioning can avoid unnecessary all-to-all communication, saving computation
Repeated joins generalize to repeated matrix multiplication, opening up many algorithms from numerical linear algebra
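One way to read the link between joins and matrix multiplication, sketched under assumed notation (the symbol M is not from the slides): treat the neighbors RDD as a sparse matrix and the current guess as a vector. The join pairs each nonzero entry in column j with rank_j, and the sum over contributions per destination page is one row of the product.

```latex
M_{ij} =
\begin{cases}
  \dfrac{1}{|\mathrm{neighbors}_j|} & \text{if page } j \text{ links to page } i \\[6pt]
  0 & \text{otherwise}
\end{cases}
\qquad
\mathrm{contribs}_i \;=\; \sum_j M_{ij}\,\mathrm{rank}_j \;=\; (M\,\mathrm{rank})_i
```

Each PageRank iteration is then one sparse matrix-vector product, so keeping M partitioned in place across iterations plays the same role as keeping the links RDD partitioned.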
![Page 11: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/11.jpg)
Performance
![Page 12: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/12.jpg)
RDD partitioner
![Page 13: Partitioning for PageRank](https://reader034.vdocuments.us/reader034/viewer/2022052009/601ecba90ea61746707bd397/html5/thumbnails/13.jpg)
Custom Partitioning
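The slide text gives only the title here, so the following is an assumption about what a custom partitioner might look like, written in plain Python rather than Spark. It sketches one commonly cited idea for link graphs: assign pages to partitions by host name, so that pages from the same site land together. The function name `domain_partition` and the sample URLs are hypothetical.

```python
import zlib
from urllib.parse import urlparse

def domain_partition(url, num_partitions):
    """Map a URL to a partition based only on its host name."""
    host = urlparse(url).netloc
    # crc32 gives a deterministic hash across processes, unlike hash() on str.
    return zlib.crc32(host.encode("utf-8")) % num_partitions

# Hypothetical sample keys.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/index",
]
assignments = {u: domain_partition(u, 8) for u in urls}
```

In PySpark, a function of the key like this (with the partition count bound in advance) could in principle be supplied as the `partitionFunc` argument of `RDD.partitionBy`; whether partitioning by host actually helps depends on how links are distributed in the graph.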