pagerank in multithreading

PageRank Multithreading Shujian Zhang

Upload: shujian-zhang

Post on 19-Feb-2017


Category:

Software



TRANSCRIPT

Page 1: PageRank in Multithreading

PageRank Multithreading

Shujian Zhang

Page 2: PageRank in Multithreading

PageRank is fun!

Look at how much fun they’re having.

...however, there are a lot of web pages and a lot of links, and it becomes a LOT of work to calculate.

Page 3: PageRank in Multithreading

Overview

How does PageRank work?

- Directed graph (nodes point to other nodes, but it’s a one-way street)
- Adjacency matrix constructed from the graph
- Each page is given an equal weight to distribute to the pages it points to
- Pages without any other pages pointing to them are given a weight of 1/(total # of pages) to represent the random chance that someone goes directly to that page
- The adjacency matrix is multiplied by the PageRank vector iteratively until the PageRanks approach an equilibrium and change no further
- A “Damping Factor” is applied to simulate a random stop in page exploration

Formula Represented as:

Where R is the PageRank vector, M is the adjacency matrix, t is the number of iterations done, d is the damping factor, and N is the total number of pages. After a limited number of iterations, the PageRank values converge.
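A form of the update that is consistent with the symbols defined above, assuming the standard damped power iteration and assuming M has already been normalized so that each column sums to 1, would be the following. The all-ones column vector written as a bold 1 is notation introduced here for illustration.

```latex
R(t+1) = d \, M \, R(t) + \frac{1 - d}{N} \, \mathbf{1}
```

With every entry of R(0) set to 1/N, each step keeps the entries of R summing to 1, because the first term contributes d and the second contributes 1 - d.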

Page 4: PageRank in Multithreading

Overview

The adjacency matrix is populated with 1’s initially, and each 1 is then replaced by 1/#NodesPointedAt, since a given site pointing to 3 other sites gives a ⅓ chance of navigating to any one of the 3.

In a very basic representation, the adjacency matrix is multiplied by the PageRank vector, in this case on its initial run. All pages have equal weight from the beginning.

To deal with pages that never link to any other page, we need to set up a probability representing that the person may jump to another page by typing its address into the browser.

Page 5: PageRank in Multithreading

Overview

Damping Factor

- When a node points to no one else, over time it may become a sink and hold all of the weight.

The damping factor simulates a user getting bored of their current train of pages and going to a random website. This makes sure that sinks don’t happen, as someone stuck on C might become bored and navigate to any other node.

Page 6: PageRank in Multithreading

Sequential Implementation

The sequential implementation is basically just a big loop through the matrix: perform the PageRank calculation on every row of the matrix, update the PageRank vector, and then repeat n times.

Pseudocode:

for n times
    for row in adjmat
        tempPR[rowIndex] = 0
        for i in row
            tempPR[rowIndex] += PRcalc(row[i], PR[i])
    PR = tempPR

Page 7: PageRank in Multithreading

Parallel implementation

Running this problem concurrently is actually very slick. Each thread can handle a row of the adjacency matrix and calculate the PageRank for that node. Each thread writes the new PageRank to a temporary vector, and after all threads have calculated the PageRank for each node, the new PageRank vector replaces the old one. No two threads ever write to the same location, so there is no need for mutexes or any other kind of read/write control. The only thing to control is that the threads don’t outpace each other and get ahead or behind on the iteration.

*pseudocode for thread:
    row = adjmat[next]
    for i in row
        thisRank += (do PR formula on row[i], pagerank[i])
    tempPR[next] = thisRank

*main thread
    pagerank = tempPR

Page 8: PageRank in Multithreading

Parallel implementation (CPP)

Global variables:

Create Matrix (in main function):

Page 9: PageRank in Multithreading

Parallel implementation (CPP)

Create the Pthreads; here we can also decide how many times the program runs.

Page 10: PageRank in Multithreading

Parallel implementation (CPP)

PageRank algorithm for the parallel implementation

Page 11: PageRank in Multithreading

Output (two iterations)

To test whether the algorithm is correct, we ran our code on a data set with 4 nodes.

Sequential (one thread) | Parallel (two threads)

Page 12: PageRank in Multithreading

The PageRank values are converging after 30 iterations.

Output (30 iterations, sequential)

Page 13: PageRank in Multithreading

Output (30 iterations, parallel)

The PageRank values are converging after 30 iterations.

So the algorithm in our code is CORRECT!
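Both checks on these slides, convergence between iterations and agreement between the sequential and parallel outputs, could be made concrete with a small helper like the one below. This is an illustrative sketch: the names `maxAbsDifference` and `hasConverged` are hypothetical, and any tolerance passed in is an assumption, not a value from the slides.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute per-page difference between two PageRank vectors.
// Only the common prefix is compared if the sizes differ.
double maxAbsDifference(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// True when no page changed by `tolerance` or more between two vectors.
bool hasConverged(const std::vector<double>& previous, const std::vector<double>& next,
                  double tolerance) {
    return maxAbsDifference(previous, next) < tolerance;
}
```

Calling `hasConverged` on the vectors from iterations 29 and 30 tests convergence, and calling `maxAbsDifference` on the sequential and parallel results tests that both versions agree.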

Page 14: PageRank in Multithreading

Running time

For this part, we ran our code on a big data set named Wiki-Vote (about 8000 nodes), and we ran it for 20 iterations.

Wiki-Vote was downloaded from http://snap.stanford.edu/data/. It is a directed graph, described as the Wikipedia who-votes-on-whom network.

Sequential (1 thread): 15.18 sec.

2 threads: 7.89 sec.

4 threads: 4.21 sec.

16 threads: 3.77 sec.

Page 15: PageRank in Multithreading
Page 16: PageRank in Multithreading

Conclusion

After more than 20 iterations, the PageRank values show convergence.

When running the small data set, there is no difference in running time between the sequential and multithreaded versions.

When running the big data set for many iterations, there is an obvious difference in running time between the sequential and multithreaded versions. As the number of threads increases, the running time decreases, although the gain from 4 threads (4.21 sec) to 16 threads (3.77 sec) is small.

Thank you!