A Parallel Data Mining Algorithm for
PageRank Computation
Massinissa Saoudi, Massinissa Lounis, Ahcène Bounceur, Reinhardt Euler, M-Tahar Kechadi
Lab-STICC UMR CNRS 6285, Université de Bretagne Occidentale
Lab-CASL, University College Dublin, Ireland
This work is part of the research project PERSEPTEUR supported by the French National Research Agency (ANR).
Journées Big Data Mining and Visualization, 23/06/2016
Outline
• Problem statement & Motivation
• PageRank algorithm
• Example of computing
• Proposed solution
• Conclusion & Perspectives

Problem statement & Motivation (1/2)
• One of the most difficult problems in web search is how to accurately rank the results of a user query, which can match hundreds of thousands of pages containing the query terms.
PageRank is the data mining algorithm that allows us to determine the importance of web pages.
Problem statement & Motivation (2/2)
• The PageRank algorithm is general and not limited to any particular web graph;
• it is limited only by the size of the device memory (RAM) and the computation capacity (CPU).
We need to parallelize the computation of PageRank using CUDA.
We need to reduce data storage.
PageRank
• The PageRank vector [1] is a measure (ranking weight) of web page quality based on the structure of the hyperlink graph.
• The basic idea of PageRank is to assign higher scores (ri) to web pages which have many in-links, but relatively few out-links.
[1] L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab Technical Report, 1999.
PageRank
• The PageRank equation can be written as follows:

M = d × Aᵗ + ((1 − d) / n) × E

• where
– d is the damping factor, d ∈ [0, 1],
– n is the number of pages,
– Aᵗ represents the transpose of the column-stochastic matrix A,
– A is an n × n link matrix whose entries a_ij are defined as follows:
  a_ij = 1 / OutLink(j)  if there is a link from page j to page i,
  a_ij = 1 / n           if page j is a dangling page,
  a_ij = 0               otherwise,
– E is a constant n × n matrix with entries e_ij equal to 1.
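To make the three-case entry rule concrete, here is an illustrative sketch (not from the slides; the 3-page graph, the function name `a_entry`, and the `links` mapping are our own) that builds A and checks it is column-stochastic:

```python
def a_entry(i, j, links, n):
    """Entry a_ij of the column-stochastic link matrix A.

    links[j] is the set of pages that page j links to."""
    out = links[j]
    if not out:              # page j is dangling: distribute rank uniformly
        return 1.0 / n
    if i in out:             # there is a link from page j to page i
        return 1.0 / len(out)
    return 0.0               # otherwise

# Hypothetical 3-page graph: page 0 -> {1, 2}, page 1 -> {0}, page 2 dangling
links = {0: {1, 2}, 1: {0}, 2: set()}
n = 3
A = [[a_entry(i, j, links, n) for j in range(n)] for i in range(n)]

# Every column of A sums to 1, i.e., A is column-stochastic
assert all(abs(sum(A[i][j] for i in range(n)) - 1.0) < 1e-9 for j in range(n))
```

The dangling-page case is what keeps the columns stochastic: without it, the column of a page with no out-links would sum to 0 and rank would leak out of the iteration.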
PageRank
• To calculate the PageRank vector r_i, we use the matrix M defined above:
– r_{i+1} = r_i × M
• Initialize r_0 to an n-component vector with non-negative components,
• Repeatedly replace r_i by the product r_i × M until it converges, i.e.,
– ‖r_{i+1} − r_i‖ < ε
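The power iteration above can be sketched as follows (an illustrative reference implementation, not the paper's code; `pagerank_iterate` is our name, and the row-vector product r_i × M is written out explicitly):

```python
def pagerank_iterate(M, eps=1e-5):
    """Iterate r_{i+1} = r_i x M until ||r_{i+1} - r_i||_1 < eps."""
    n = len(M)
    r = [1.0] * n                    # non-negative initial vector r_0
    while True:
        # row vector times matrix: r_next[j] = sum_i r[i] * M[i][j]
        r_next = [sum(r[i] * M[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Two pages linking only to each other, with d = 0.85 and n = 2:
# M = d * A^t + ((1 - d)/n) * E where A^t = [[0, 1], [1, 0]]
M2 = [[0.075, 0.925], [0.925, 0.075]]
print(pagerank_iterate(M2))          # by symmetry, both ranks converge to 1
```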
Example of computing
• We will represent the structure of a network with a matrix.
• The example network has four pages A, B, C, D with links A → B, C, D; B → D; C → A, D; D → A, C. Its adjacency matrix is:

G =
0 1 1 1
0 0 0 1
1 0 0 1
1 0 1 0

• We compute the column-stochastic matrix A for G, and its transpose Aᵗ:

A =
 0    0   1/2  1/2
1/3   0    0    0
1/3   0    0   1/2
1/3   1   1/2   0

Aᵗ =
 0   1/3  1/3  1/3
 0    0    0    1
1/2   0    0   1/2
1/2   0   1/2   0
Example of computing
• Form the stochastic matrix M from our matrix Aᵗ, where d = 0.85 and n = 4:

M =
0.0375  0.3208  0.3208  0.3208
0.0375  0.0375  0.0375  0.8875
0.4625  0.0375  0.0375  0.4625
0.4625  0.0375  0.4625  0.0375

• Initialize r_0 and compute the PageRank vector, with ε = 0.00001:

r_0 = (1, 1, 1, 1)
r_0 × M = (1, 0.433, 0.858, 1.708)
r_1 × M = (1.241, 0.433, 1.159, 1.166)
…
r_14 × M = (1.156, 0.477, 1.041, 1.326)

• The final ranks on the example graph: A = 1.156, B = 0.477, C = 1.041, D = 1.326.
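The whole worked example can be reproduced with a short sketch (Python here purely for illustration; the paper's implementation is in CUDA). The `links` mapping encodes the example graph; everything else follows the formulas above:

```python
d, n, eps = 0.85, 4, 1e-5
links = {0: [1, 2, 3], 1: [3], 2: [0, 3], 3: [0, 2]}   # A, B, C, D as 0..3

# M = d * A^t + c * E, i.e. M[i][j] = c + d / OutLink(i) when page i links to j
c = (1 - d) / n
M = [[c] * n for _ in range(n)]
for i, outs in links.items():
    for j in outs:
        M[i][j] += d / len(outs)

# Power iteration r_{i+1} = r_i x M; the first step gives (1, 0.433, 0.858, 1.708)
r = [1.0] * n
while True:
    r_new = [sum(r[i] * M[i][j] for i in range(n)) for j in range(n)]
    if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
        break
    r = r_new
# r_new ends up close to the slide's r_14 x M = (1.156, 0.477, 1.041, 1.326)
```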
• Loading the data from a file into the matrix A is very hard due to its size: the graph may contain millions to billions of web pages and hyperlinks, which would be impossible to load into RAM.
Solution:
• We have proposed a new structure for representing the web graph G, called Optimization Structure File (OSF):
• We save only the positions (i, j) where the entry equals 1 in the adjacency matrix, rather than storing the unnecessary zero entries.
Solution

G =
0 1 1 1
0 0 0 1
1 0 0 1
1 0 1 0

Gᵗ =
0 0 1 1
1 0 0 0
1 0 0 1
1 1 1 0

OSF_G' =
Line:   0 0 1 2 2 3 3 3
Column: 2 3 0 0 3 0 1 2
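A minimal sketch of how OSF_G' can be built from an edge list (variable names are ours; the paper stores this structure in a file):

```python
# Links of the example graph, as (source, destination) pairs: i -> j
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 0), (2, 3), (3, 0), (3, 2)]

# G^t has a 1 at position (j, i) for every link i -> j;
# OSF_G' keeps only these coordinates, sorted row-major
coords = sorted((j, i) for (i, j) in edges)
line = [l for (l, c) in coords]
column = [c for (l, c) in coords]

print(line)    # [0, 0, 1, 2, 2, 3, 3, 3]
print(column)  # [2, 3, 0, 0, 3, 0, 1, 2]
```

This reproduces the Line/Column rows of OSF_G' above: 8 stored pairs instead of 16 dense matrix entries.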
Solution
• To compute M, we have proved that:
– if a_ij = 0 in Aᵗ, then m_ij = (1 − d) / n, which is constant (c).
• The new structure of M therefore stores only the OSF coordinates together with the non-constant values:

OSF_M =
Line:   0 0 1 2 2 3 3 3
Column: 2 3 0 0 3 0 1 2
Value:  0.4625 0.4625 0.3208 0.3208 0.4625 0.3208 0.8875 0.4625

c = 0.0375
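The Value row can be derived from the coordinates and the out-link counts alone, since a stored pair (l, c) corresponds to a link c → l, whose matrix weight is 1/OutLink(c). An illustrative sketch (our variable names):

```python
d, n = 0.85, 4
const = (1 - d) / n                    # the constant c = 0.0375

out = {0: 3, 1: 1, 2: 2, 3: 2}         # out-link counts of the example graph
line = [0, 0, 1, 2, 2, 3, 3, 3]
column = [2, 3, 0, 0, 3, 0, 1, 2]

# Each stored pair (l, c) is the link c -> l, so its value is d / OutLink(c) + const
value = [round(d / out[c] + const, 4) for c in column]
print(value)  # [0.4625, 0.4625, 0.3208, 0.3208, 0.4625, 0.3208, 0.8875, 0.4625]
```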
Solution
• We have proposed a parallel GPU model:
• We transform all treatments into elementary matrix operations.
• Each operation is implemented in a CUDA kernel (GPU function) using threads in one or two dimensions.
• These kernels are executed within a loop until the convergence condition is achieved:
– delta = ‖r_{i+1} − r_i‖, precision = ε
• The convergence test is reduced to a complexity of O(1) on the GPU, instead of a sequential loop with complexity O(n) on the CPU.
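As a CPU reference for what the kernels compute together, one iteration r_{i+1} = r_i × M can be evaluated from OSF_M and the constant c alone, without ever materializing the dense n × n matrix (an illustrative Python sketch, not the paper's CUDA code):

```python
d, n = 0.85, 4
c = (1 - d) / n
# OSF_M of the example graph: a stored pair (l, col) is the link col -> l
line = [0, 0, 1, 2, 2, 3, 3, 3]
column = [2, 3, 0, 0, 3, 0, 1, 2]
value = [0.4625, 0.4625, 0.3208, 0.3208, 0.4625, 0.3208, 0.8875, 0.4625]

def step(r):
    """One step r -> r x M using only the sparse triples plus c."""
    s = sum(r)
    r_new = [c * s] * n                 # constant part of M adds c * sum(r) to every page
    for l, col, v in zip(line, column, value):
        r_new[l] += r[col] * (v - c)    # correction where M differs from c
    return r_new

r1 = step([1.0] * n)
delta = sum(abs(a - b) for a, b in zip(r1, [1.0] * n))   # ||r_1 - r_0||_1, compared to epsilon
print([round(x, 3) for x in r1])   # matches the slide's r_0 x M = (1, 0.433, 0.858, 1.708)
```

Per iteration this touches only the 8 stored triples instead of all 16 matrix entries, which is the point of OSF_M; on the GPU, each kernel parallelizes one of these elementary operations.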
Solution
• We have proposed a blocks division model. Two types of block disposition architectures are proposed:
• The first is a one-dimensional architecture, i.e., all threads are organized in a vector format; we have used this architecture (1) in kernels 4 and 5.
• The second architecture uses a two-dimensional disposition; we have used this architecture (2) in kernels 1, 2 and 3.
Simulation Parameters
• Use of the Compute Unified Device Architecture (CUDA) to program a GPU device for general-purpose computation.
• Use of an NVIDIA GeForce GTX 680 card, which has:
– 1536 cores,
– 6.0 GB/s memory speed,
– 2048 MB of RAM.
• PageRank is evaluated on various real web graphs from the UF Sparse Matrix Collection repository [2].

[2] A. van Heukelum, UF Sparse Matrix Collection, Institute for Theoretical Physics, Utrecht University.
Results
• The GPU computation is faster than the CPU computation by a factor of 100.
Conclusion & Perspectives
• Proposition of a parallel model for PageRank computation,
• Proposition of a new structure (OSF) to store the web graph data.
Future work:
• Apply the PageRank algorithm in different fields, in particular wireless sensor networks.