partitioning for pagerankrezab/dao/notes/partitioning_pagerank.pdf1. 2. 3. start each page at a rank...

13
Reza Zadeh Partitioning for PageRank @Reza_Zadeh | http://reza-zadeh.com

Upload: others

Post on 01-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Reza Zadeh

Partitioning for PageRank

@Reza_Zadeh | http://reza-zadeh.com

Page 2: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Motivation Recall from first lecture that network bandwidth is ~100× as expensive as memory bandwidth

One way Spark avoids using it is through locality-aware scheduling for RAM and disk

Another important tool is controlling the partitioning of RDD contents across nodes

Page 3: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Spark PageRank Given directed graph, compute node importance. Two RDDs: »  Neighbors (a sparse graph/matrix)

»  Current guess (a vector)

Best representation for vector and matrix?

Page 4: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Example

Page 5: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Execution

Page 6: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Solution

Page 7: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

New Execution

Page 8: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

How it works Each RDD has an optional Partitioner object# Any shuffle operation on an RDD with a Partitioner will respect that Partitioner#

Any shuffle operation on two RDDs will take on the Partitioner of one of them, if one is set

#

Page 9: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Examples

Page 10: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Main Conclusion Controlled partitioning can avoid unnecessary all-to-all communication, saving computation

Repeated joins generalizes to repeated Matrix Multiplication, opening many algorithms from Numerical Linear Algebra

Page 11: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Performance

Page 12: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

RDD partitioner

Page 13: Partitioning for PageRankrezab/dao/notes/Partitioning_PageRank.pdf1. 2. 3. Start each page at a rank of 1 On each iteration, have page contribute rank /lneighborspl to its neighbors

Custom Partitioning