pagerank in multithreading

Post on 19-Feb-2017

PageRank Multithreading

Shujian Zhang

PageRank is fun!

Look at how much fun they’re having. However, there are a lot of web pages and a lot of links, and it becomes a LOT of work to calculate.

Overview

How does PageRank work?

- Directed graph (nodes point to other nodes, but it’s a one-way street)

- Adjacency matrix constructed from graph

- Each page given an equal weight to distribute to the pages it points to

- Pages without any other pages pointing to them given a weight of 1/total#pages, to represent the random chance that someone goes directly to that page

- Adjacency matrix is multiplied by the PageRank vector iteratively until the PageRanks begin to approach an equilibrium and change no further

- “Damping Factor” applied to simulate a random stop in page exploration

Formula represented as:

Here R is the PageRank vector, M is the adjacency matrix, t is the iteration number, d is the damping factor, and N is the total number of pages. After a limited number of iterations, the PageRank values converge.
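
With the symbols defined above, the iteration is commonly written as follows. This is the standard textbook form and is assumed here to match the slide; the vector of ones, written 1 in bold, is notation introduced for this sketch.

```latex
R(t+1) = d \, M \, R(t) + \frac{1-d}{N} \, \mathbf{1}
```

Each step scales the link-following term M R(t) by d and hands the remaining (1-d) share out equally over all N pages.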

Overview

The adjacency matrix is populated with 1’s initially, and each entry is then scaled to 1/#NodesPointedAt, since a given site pointing to 3 other sites gives a ⅓ chance of navigating to any one of the 3.

In a very basic representation, the adjacency matrix is multiplied by the PageRank vector, in this case on its initial run. All pages have equal weight from the beginning.

Some pages never link to any other page. To deal with this situation, we need to set up a probability representing that the person may jump to another page by typing its address into the browser.

Overview

Damping Factor

- When a node points to no one else, over time it may become a sink and hold all of the weight.

The damping factor simulates a user getting bored of their current train of pages and going to a random website. This makes sure that sinks don’t happen, as someone stuck on C might become bored and navigate to any other node.

Sequential Implementation

The sequential implementation is basically just a big loop through the matrix, performing the page rank calculation on every row of the matrix, updating the page rank vector, and then repeating n times.

Pseudocode:

for n times
    for row in adjmat
        for i in row
            tempPR[rowIndex] += PRcalc(row[i], PR[i])
    PR = tempPR

Parallel implementation

Running this problem concurrently is actually very slick. Each thread handles one row of the adjacency matrix at a time and calculates the PageRank for that row's node. Each thread writes the new PageRank to a temporary vector, and after all threads have calculated the PageRank for every node, the new PageRank vector replaces the old one. No two threads ever write to the same location, so there is no need for mutexes or any other kind of read/write control. The only thing to control is that the threads stay in step: none may get ahead or fall behind on the iteration.

*pseudocode for thread:

row = adjmat[next]
for i in row
    thisRank += (do PR formula on row[i], pagerank[i])
tempPR[next] = thisRank

*main thread:

pagerank = tempPR

Parallel implementation (CPP)

Global variables:

Create Matrix (in main function):

Create Pthread, and we can decide how many times we need to run the program.

PageRank algorithm for Parallel implementation

Output (two iterations)

To test whether the algorithm is correct, we ran our code on a data set with 4 nodes.

Sequential (one thread) Parallel (two threads)

Output (30 iterations, sequential)

Output (30 iterations, parallel)

The PageRank values converge after 30 iterations.

So the algorithm in the code is CORRECT!

Running time

For this part, we ran our code on a big data set named Wiki-Vote (about 8000 nodes), and we ran it for 20 iterations.

Wiki-Vote was downloaded from http://snap.stanford.edu/data/. It is a directed graph, described as the Wikipedia who-votes-on-whom network.

Sequential (1 thread): 15.18 sec.

2 threads: 7.89 sec.

4 threads: 4.21 sec.

16 threads: 3.77 sec.

Conclusion

After more than 20 iterations, the PageRank values show convergence.

On the small data set, there is no noticeable difference in running time between the sequential and multithreaded versions.

On the big data set with many iterations, the difference is obvious: as the number of threads increases, the running time decreases (15.18 sec. with 1 thread down to 3.77 sec. with 16 threads), although going from 4 to 16 threads gains little.

Thank you!
