A Parallel Data Mining Algorithm for PageRank Computation

Massinissa Saoudi, Massinissa Lounis, Ahcène Bounceur, Reinhardt Euler, M-Tahar Kechadi.

Lab-STICC UMR CNRS 6285 - Université de Bretagne Occidentale

Lab-CASL- University College Dublin, Ireland.

This work is part of the research project PERSEPTEUR supported by the French National Research Agency ANR.

23/06/2016

Journées Big Data Mining and Visualization

Outline

• Problem statement & Motivation

• PageRank algorithm

• Example of computing

• Proposed solution

• Conclusion & Perspectives


Problem statement & Motivation (1/2)

• One of the most difficult problems in web search is how to rank accurately the results of a user query that returns hundreds of thousands of pages containing the query terms.

PageRank is the data mining algorithm that determines the importance of web pages.


Problem statement & Motivation (2/2)

• The PageRank algorithm is general and can be applied to any large web graph,

• In practice, it is limited by the size of the device memory (RAM) and by the computation capacity (CPU).

We need to parallelize the computation of PageRank using CUDA.

We need to reduce data storage.

PageRank


• The PageRank vector [1] is a measure (ranking weight) of web page quality based on the structure of the hyperlink graph.

• The basic idea of PageRank is to assign higher scores ($r_i$) to web pages which have many in-links, but relatively few out-links.

[1] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web.

PageRank


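• The PageRank equation can be written as follows:

$$M = d \times A^t + \frac{1 - d}{n} \times E$$

• where

– $d \in [0, 1]$ is the damping factor,

– $n$ is the number of pages,

– $A^t$ is the transpose of the row-stochastic matrix $A$ (so $A^t$ is column-stochastic),

– $A$ is an $n \times n$ link matrix whose entries $a_{ij}$ are defined as follows, where $\mathrm{OutLink}(i)$ is the number of out-links of page $i$:

$$a_{ij} = \begin{cases} \dfrac{1}{\mathrm{OutLink}(i)} & \text{if there is a link from page } i \text{ to page } j \\ \dfrac{1}{n} & \text{if page } i \text{ is a dangling page} \\ 0 & \text{otherwise} \end{cases}$$

– $E$ is a constant $n \times n$ matrix with all entries $e_{ij}$ equal to 1.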
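As a minimal sketch of this definition, the following Python function builds $M$ from a 0/1 adjacency matrix. It assumes that `G[i][j] == 1` means page i links to page j, as in the four-page example of this talk. The function name `google_matrix` and the dense list-of-lists layout are illustrative choices, not the authors' implementation.

```python
def google_matrix(G, d=0.85):
    """Build M = d * A^t + (1 - d)/n * E from a 0/1 adjacency matrix G.

    Assumed convention: G[i][j] == 1 means page i links to page j.
    """
    n = len(G)
    # Normalise each row of G by its number of out-links.
    # A dangling page (no out-links) gets 1/n in every position.
    A = []
    for row in G:
        out_links = sum(row)
        if out_links == 0:
            A.append([1.0 / n] * n)
        else:
            A.append([g / out_links for g in row])
    c = (1.0 - d) / n  # contribution of the all-ones matrix E
    # M[i][j] = d * A^t[i][j] + c, and A^t[i][j] = A[j][i].
    return [[d * A[j][i] + c for j in range(n)] for i in range(n)]


# Adjacency matrix of the four-page example (pages A, B, C, D).
G = [[0, 1, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [1, 0, 1, 0]]
M = google_matrix(G)
```

With this convention every column of `M` sums to 1, which is what the iteration on a column vector relies on.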

PageRank

• To calculate the PageRank vector $r$, we use the matrix $M$ as follows:

– $r_{i+1} = M \times r_i$

• Initialize $r_0$ to an $n$-component column vector with non-negative components,

• Repeatedly replace $r_i$ by the product $M \times r_i$ until it converges, i.e.,

– $\lVert r_{i+1} - r_i \rVert < \varepsilon$
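The iteration can be sketched in a few lines of sequential Python. This is only an illustration under stated assumptions: $M$ is applied to a column vector, $r_0$ is the all-ones vector, and the norm is taken to be the largest absolute change of any component (the slides do not say which norm is used). The name `pagerank` and the `max_iter` safeguard are hypothetical.

```python
def pagerank(M, eps=1e-5, max_iter=1000):
    """Power iteration r_{i+1} = M r_i, starting from r_0 = (1, ..., 1)."""
    n = len(M)
    r = [1.0] * n
    for _ in range(max_iter):
        # One matrix-vector product: component i is row i of M times r.
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        # Assumed norm: largest absolute change of any component.
        delta = max(abs(a - b) for a, b in zip(r_next, r))
        r = r_next
        if delta < eps:
            break
    return r


# Matrix M of the four-page example (d = 0.85, n = 4).
c = 0.15 / 4        # (1 - d) / n = 0.0375
a = 0.85 / 3 + c    # about 0.3208
b = 0.85 / 2 + c    # 0.4625
f = 0.85 + c        # 0.8875
M = [[c, c, b, b],
     [a, c, c, c],
     [a, c, c, b],
     [a, f, b, c]]

ranks = pagerank(M)
print([round(x, 3) for x in ranks])
```

The printed ranks should be close to the final vector reported in the example for pages A, B, C and D.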


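• We will represent the structure of a network with a matrix.

• The adjacency matrix for the four-page network (pages A, B, C, D) is:

$$G = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

• We compute the row-stochastic matrix $A$ for $G$ and its column-stochastic transpose $A^t$:

$$A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 0 & 0 & 0 & 1 \\ 1/2 & 0 & 0 & 1/2 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}, \quad A^t = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1/3 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1 & 1/2 & 0 \end{pmatrix}$$

[Figure: the example network with nodes A, B, C, D]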


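• Form a stochastic matrix $M$ from our matrix $A^t$, where $d = 0.85$ and $n = 4$:

$$M = \begin{pmatrix} 0.0375 & 0.0375 & 0.4625 & 0.4625 \\ 0.3208 & 0.0375 & 0.0375 & 0.0375 \\ 0.3208 & 0.0375 & 0.0375 & 0.4625 \\ 0.3208 & 0.8875 & 0.4625 & 0.0375 \end{pmatrix}$$

• Initialize $r_0$ and compute the PageRank vector, where $\varepsilon = 0.00001$:

$$r_0 = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \quad r_1 = M r_0 = \begin{pmatrix} 1 \\ 0.433 \\ 0.858 \\ 1.708 \end{pmatrix}, \quad r_2 = M r_1 = \begin{pmatrix} 1.241 \\ 0.433 \\ 1.159 \\ 1.166 \end{pmatrix}, \quad \dots, \quad r_{15} = M r_{14} = \begin{pmatrix} 1.156 \\ 0.477 \\ 1.041 \\ 1.326 \end{pmatrix}$$

[Figure: the example network with the final ranks A = 1.156, B = 0.477, C = 1.041, D = 1.326]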
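To make the first product of the example explicit: because every component of $r_0$ equals 1, applying $M$ to the column vector $r_0$ simply sums each row of $M$. The subscripts A to D below name the pages and are added here only for readability.

```latex
\begin{aligned}
(r_1)_A &= 0.0375 + 0.0375 + 0.4625 + 0.4625 = 1.000 \\
(r_1)_B &= 0.3208 + 0.0375 + 0.0375 + 0.0375 \approx 0.433 \\
(r_1)_C &= 0.3208 + 0.0375 + 0.0375 + 0.4625 \approx 0.858 \\
(r_1)_D &= 0.3208 + 0.8875 + 0.4625 + 0.0375 \approx 1.708
\end{aligned}
```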

• Loading the data from a file into the matrix $A$ is very hard because of its size: the web graph may contain millions to billions of web pages and hyperlinks, which are impossible to load into RAM.

Solution:

• We have proposed a new structure for representing the web graph G, called Optimization Structure File (OSF):

• We save only the positions (i, j) where the entry of the adjacency matrix G equals 1, rather than storing the unnecessary information about the other entries.

Solution

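$$G = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \quad G^t = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

OSF_G' =

Line:   0 0 1 2 2 3 3 3
Column: 2 3 0 0 3 0 1 2

Each (Line, Column) pair is the position of an entry equal to 1 in $G^t$.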
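A minimal Python sketch of how the two OSF arrays can be derived from the adjacency matrix of the example. The function name `build_osf` and the in-memory lists are assumptions for illustration; the slides do not describe the actual file layout.

```python
def build_osf(G):
    """Return (lines, columns): positions of the 1 entries of the transpose of G."""
    n = len(G)
    lines, columns = [], []
    for i in range(n):            # row index in G^t
        for j in range(n):        # column index in G^t
            if G[j][i] == 1:      # G^t[i][j] is G[j][i]
                lines.append(i)
                columns.append(j)
    return lines, columns


G = [[0, 1, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [1, 0, 1, 0]]
lines, columns = build_osf(G)
print(lines)    # [0, 0, 1, 2, 2, 3, 3, 3]
print(columns)  # [2, 3, 0, 0, 3, 0, 1, 2]
```

Only 8 positions are kept for this graph instead of the 16 entries of the full matrix.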

Solution

• To compute $M$, we have proved that:

– If the entry $(i, j)$ of $A^t$ is 0, then

• $m_{ij} = \frac{1 - d}{n}$, which is a constant ($c$)

• The new structure of $M$ is:

$$M = \begin{pmatrix} 0.0375 & 0.0375 & 0.4625 & 0.4625 \\ 0.3208 & 0.0375 & 0.0375 & 0.0375 \\ 0.3208 & 0.0375 & 0.0375 & 0.4625 \\ 0.3208 & 0.8875 & 0.4625 & 0.0375 \end{pmatrix}$$

OSF_M =

Line:   0 0 1 2 2 3 3 3
Column: 2 3 0 0 3 0 1 2
Value:  0.4625 0.4625 0.3208 0.3208 0.4625 0.3208 0.8875 0.4625

c = 0.0375
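One way the (Line, Column, Value) triplets and the constant c could be used is sketched below: every entry of $M$ contributes at least c, and only the stored positions need a correction of (value - c). This is an assumption about how the structure might be exploited, not the authors' kernel; the name `osf_matvec` is hypothetical, and the values are read from $M$ at the stored positions.

```python
def osf_matvec(lines, columns, values, c, r):
    """Compute M r when only the entries of M that differ from c are stored."""
    base = c * sum(r)                 # every entry of a row contributes at least c
    out = [base] * len(r)
    for i, j, v in zip(lines, columns, values):
        out[i] += (v - c) * r[j]      # correction at the stored position (i, j)
    return out


c = 0.0375
a, b, f = 0.85 / 3 + c, 0.4625, 0.8875   # about 0.3208, 0.4625, 0.8875
lines = [0, 0, 1, 2, 2, 3, 3, 3]
columns = [2, 3, 0, 0, 3, 0, 1, 2]
values = [b, b, a, a, b, a, f, b]        # M[line][column] for each stored pair
r1 = osf_matvec(lines, columns, values, c, [1.0, 1.0, 1.0, 1.0])
```

For the all-ones start vector this gives the row sums of $M$, so the dense matrix never has to be built.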

Solution


• Histogram for 10K pages is approximately 3GB

Reduction of 94 %

Solution

• We have proposed a parallel GPU model:

• We transform all processing steps into elementary matrix operations.

• Each operation is implemented in a CUDA kernel (GPU function) using threads in one or two dimensions.

• These kernels are executed within a loop until the convergence condition is reached.

• delta = $\lVert r_{i+1} - r_i \rVert$, precision = $\varepsilon$

• The convergence test has a complexity of O(1) on the GPU, instead of a sequential loop with a complexity of O(n) on the CPU.
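The convergence test can be pictured as one independent comparison per component. The Python sketch below emulates this sequentially; on a GPU each comparison would be handled by its own thread. How the slides' kernels combine the per-thread results is not stated, so the flag list and the name `converged` are assumptions.

```python
def converged(r_new, r_old, eps):
    """Return True when every component changed by less than eps."""
    # Each comparison is independent of the others, so on a GPU it could be
    # done by one thread per component; here they run one after another.
    flags = [abs(a - b) >= eps for a, b in zip(r_new, r_old)]
    return not any(flags)


print(converged([1.0, 0.5], [1.000001, 0.5], 1e-5))  # True
print(converged([1.0, 0.5], [1.1, 0.5], 1e-5))       # False
```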

Solution


• We have proposed a block division model:

Two types of block disposition architectures are proposed:

• The first is a one-dimensional architecture, i.e., all threads are organized in a vector format,

• We have used this architecture (1) in kernels 4 and 5,

• The second architecture uses a two-dimensional disposition,

• We have used this architecture (2) in kernels 1, 2 and 3.
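The two dispositions differ in how a thread derives the data index it works on. The Python sketch below mirrors the usual CUDA index arithmetic (`blockIdx * blockDim + threadIdx`) for both cases. The block sizes (256 threads, 16 x 16 threads) and the function names are assumptions; the slides do not give the launch configuration.

```python
def grid_1d(n, block_size=256):
    """Number of blocks so that blocks * block_size covers n elements."""
    return (n + block_size - 1) // block_size


def global_index_1d(block_idx, block_dim, thread_idx):
    """Vector position handled by a thread in a one-dimensional disposition."""
    return block_idx * block_dim + thread_idx


def global_index_2d(block_idx, block_dim, thread_idx):
    """Matrix position (row, col) handled by a thread in a 2D disposition.

    Each argument is an (x, y) pair, as in CUDA's .x and .y components.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    return row, col


print(grid_1d(1000))                                  # 4
print(global_index_1d(3, 256, 10))                    # 778
print(global_index_2d((1, 2), (16, 16), (3, 4)))      # (36, 19)
```

A one-dimensional index suits vector operations such as updating the rank vector, while a (row, col) pair suits operations over matrix entries.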

Simulation Parameters

• We use the Compute Unified Device Architecture (CUDA) to program a GPU device for general-purpose computation.

• We use an NVIDIA GeForce GTX 680 card, which has:

– 1536 cores,

– 6.0 Gbps memory speed,

– 2048 MB of RAM capacity.

• PageRank is evaluated on various real web graphs from the UF Sparse Matrix Collection repository of Utrecht University [2].

[2] A. van Heukelum, UF Sparse Matrix Collection, Institute for Theoretical Physics, Utrecht University.

Results


• The GPU computation is 100 times faster than the CPU computation.


Conclusion & Perspectives

We proposed a parallel model for PageRank computation,

We proposed a new structure (OSF) to store the web graph data,

Future work:

Apply the PageRank algorithm in different fields, in particular wireless sensor networks.

Thank you for your attention

Any questions?

Massinissa SAOUDI
