A Parallel Data Mining Algorithm for
PageRank Computation
Massinissa Saoudi, Massinissa Lounis, Ahcène Bounceur, Reinhardt Euler, M-Tahar Kechadi
Lab-STICC UMR CNRS 6285, Université de Bretagne Occidentale
Lab-CASL, University College Dublin, Ireland
This work is part of the research project PERSEPTEUR supported by the French National Research Agency (ANR).
Journées Big Data Mining and Visualization, 23/06/2016
Outline
• Problem statement & Motivation
• PageRank algorithm
• Example of computing
• Proposed solution
• Conclusion & Perspectives

Problem statement & Motivation (1/2)
• One of the most difficult problems in web search is how to accurately rank the results of a user query, which can match hundreds of thousands of pages containing the query terms.
PageRank is the data mining algorithm that allows us to determine the importance of web pages.
Problem statement & Motivation (2/2)
• The PageRank algorithm is general and not limited to any particular web graph;
• it is limited only by the size of the device memory (RAM) and the computation capacity (CPU).
We need to parallelize the computation of PageRank using CUDA.
We need to reduce data storage.
PageRank
• The PageRank vector [1] is a measure (ranking weight) of web page quality based on the structure of the hyperlink graph.
• The basic idea of PageRank is to assign higher scores (ri) to web pages which have many in-links, but relatively few out-links.
[1] L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab Technical Report, 1999.
PageRank
• The PageRank equation can be written as follows:

M = d × Aᵗ + ((1 − d) / n) × E

• where
– d is the damping factor, d ∈ [0, 1],
– n is the number of pages,
– Aᵗ represents the transpose of the column-stochastic matrix A,
– A is an n × n link matrix whose entries a_ij are defined as follows:
  a_ij = 1 / OutLink(j)  if there is a link from page j to page i,
  a_ij = 1 / n           if page j is a dangling page,
  a_ij = 0               otherwise,
– E is a constant n × n matrix with entries e_ij equal to 1.
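To make the three-case entry rule concrete, here is an illustrative sketch (not from the slides; the 3-page graph, the function name `a_entry`, and the `links` mapping are our own) that builds A and checks it is column-stochastic:

```python
def a_entry(i, j, links, n):
    """Entry a_ij of the column-stochastic link matrix A.

    links[j] is the set of pages that page j links to."""
    out = links[j]
    if not out:              # page j is dangling: distribute rank uniformly
        return 1.0 / n
    if i in out:             # there is a link from page j to page i
        return 1.0 / len(out)
    return 0.0               # otherwise

# Hypothetical 3-page graph: page 0 -> {1, 2}, page 1 -> {0}, page 2 dangling
links = {0: {1, 2}, 1: {0}, 2: set()}
n = 3
A = [[a_entry(i, j, links, n) for j in range(n)] for i in range(n)]

# Every column of A sums to 1, i.e., A is column-stochastic
assert all(abs(sum(A[i][j] for i in range(n)) - 1.0) < 1e-9 for j in range(n))
```

The dangling-page case is what keeps the columns stochastic: without it, the column of a page with no out-links would sum to 0 and rank would leak out of the iteration.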
PageRank
• To calculate the PageRank vector r_i, we use the matrix M defined above:
– r_{i+1} = r_i × M
• Initialize r_0 to an n-component vector with non-negative components,
• Repeatedly replace r_i by the product r_i × M until it converges, i.e.,
– ‖r_{i+1} − r_i‖ < ε
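The power iteration above can be sketched as follows (an illustrative reference implementation, not the paper's code; `pagerank_iterate` is our name, and the row-vector product r_i × M is written out explicitly):

```python
def pagerank_iterate(M, eps=1e-5):
    """Iterate r_{i+1} = r_i x M until ||r_{i+1} - r_i||_1 < eps."""
    n = len(M)
    r = [1.0] * n                    # non-negative initial vector r_0
    while True:
        # row vector times matrix: r_next[j] = sum_i r[i] * M[i][j]
        r_next = [sum(r[i] * M[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Two pages linking only to each other, with d = 0.85 and n = 2:
# M = d * A^t + ((1 - d)/n) * E where A^t = [[0, 1], [1, 0]]
M2 = [[0.075, 0.925], [0.925, 0.075]]
print(pagerank_iterate(M2))          # by symmetry, both ranks converge to 1
```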
Example of computing
• We will represent the structure of a network with a matrix.
• The example network has four pages A, B, C, D with links A → B, C, D; B → D; C → A, D; D → A, C. Its adjacency matrix is:

G =
0 1 1 1
0 0 0 1
1 0 0 1
1 0 1 0

• We compute the column-stochastic matrix A for G, and its transpose Aᵗ:

A =
 0    0   1/2  1/2
1/3   0    0    0
1/3   0    0   1/2
1/3   1   1/2   0

Aᵗ =
 0   1/3  1/3  1/3
 0    0    0    1
1/2   0    0   1/2
1/2   0   1/2   0
Example of computing
• Form the stochastic matrix M from our matrix Aᵗ, where d = 0.85 and n = 4:

M =
0.0375  0.3208  0.3208  0.3208
0.0375  0.0375  0.0375  0.8875
0.4625  0.0375  0.0375  0.4625
0.4625  0.0375  0.4625  0.0375

• Initialize r_0 and compute the PageRank vector, with ε = 0.00001:

r_0 = (1, 1, 1, 1)
r_0 × M = (1, 0.433, 0.858, 1.708)
r_1 × M = (1.241, 0.433, 1.159, 1.166)
…
r_14 × M = (1.156, 0.477, 1.041, 1.326)

• The final ranks on the example graph: A = 1.156, B = 0.477, C = 1.041, D = 1.326.
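The whole worked example can be reproduced with a short sketch (Python here purely for illustration; the paper's implementation is in CUDA). The `links` mapping encodes the example graph; everything else follows the formulas above:

```python
d, n, eps = 0.85, 4, 1e-5
links = {0: [1, 2, 3], 1: [3], 2: [0, 3], 3: [0, 2]}   # A, B, C, D as 0..3

# M = d * A^t + c * E, i.e. M[i][j] = c + d / OutLink(i) when page i links to j
c = (1 - d) / n
M = [[c] * n for _ in range(n)]
for i, outs in links.items():
    for j in outs:
        M[i][j] += d / len(outs)

# Power iteration r_{i+1} = r_i x M; the first step gives (1, 0.433, 0.858, 1.708)
r = [1.0] * n
while True:
    r_new = [sum(r[i] * M[i][j] for i in range(n)) for j in range(n)]
    if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
        break
    r = r_new
# r_new ends up close to the slide's r_14 x M = (1.156, 0.477, 1.041, 1.326)
```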
• Loading the data from a file into the matrix A is very hard due to its size: the graph may contain millions to billions of web pages and hyperlinks, which would be impossible to load into RAM.
Solution:
• We have proposed a new structure for representing the web graph G, called Optimization Structure File (OSF):
• We save only the positions (i, j) where the entry equals 1 in the adjacency matrix, rather than storing the unnecessary zero entries.
Solution

G =
0 1 1 1
0 0 0 1
1 0 0 1
1 0 1 0

Gᵗ =
0 0 1 1
1 0 0 0
1 0 0 1
1 1 1 0

OSF_G' =
Line:   0 0 1 2 2 3 3 3
Column: 2 3 0 0 3 0 1 2
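A minimal sketch of how OSF_G' can be built from an edge list (variable names are ours; the paper stores this structure in a file):

```python
# Links of the example graph, as (source, destination) pairs: i -> j
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 0), (2, 3), (3, 0), (3, 2)]

# G^t has a 1 at position (j, i) for every link i -> j;
# OSF_G' keeps only these coordinates, sorted row-major
coords = sorted((j, i) for (i, j) in edges)
line = [l for (l, c) in coords]
column = [c for (l, c) in coords]

print(line)    # [0, 0, 1, 2, 2, 3, 3, 3]
print(column)  # [2, 3, 0, 0, 3, 0, 1, 2]
```

This reproduces the Line/Column rows of OSF_G' above: 8 stored pairs instead of 16 dense matrix entries.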
Solution
• To compute M, we have proved that:
– if a_ij = 0 in Aᵗ, then m_ij = (1 − d) / n, which is constant (c).
• The new structure of M therefore stores only the OSF coordinates together with the non-constant values:

OSF_M =
Line:   0 0 1 2 2 3 3 3
Column: 2 3 0 0 3 0 1 2
Value:  0.4625 0.4625 0.3208 0.3208 0.4625 0.3208 0.8875 0.4625

c = 0.0375
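The Value row can be derived from the coordinates and the out-link counts alone, since a stored pair (l, c) corresponds to a link c → l, whose matrix weight is 1/OutLink(c). An illustrative sketch (our variable names):

```python
d, n = 0.85, 4
const = (1 - d) / n                    # the constant c = 0.0375

out = {0: 3, 1: 1, 2: 2, 3: 2}         # out-link counts of the example graph
line = [0, 0, 1, 2, 2, 3, 3, 3]
column = [2, 3, 0, 0, 3, 0, 1, 2]

# Each stored pair (l, c) is the link c -> l, so its value is d / OutLink(c) + const
value = [round(d / out[c] + const, 4) for c in column]
print(value)  # [0.4625, 0.4625, 0.3208, 0.3208, 0.4625, 0.3208, 0.8875, 0.4625]
```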
Solution
• We have proposed a parallel GPU model:
• We transform all treatments into elementary matrix operations.
• Each operation is implemented in a CUDA kernel (GPU function) using threads in one or two dimensions.
• These kernels are executed within a loop until the convergence condition is achieved:
– delta = ‖r_{i+1} − r_i‖, precision = ε
• The convergence test is reduced to a complexity of O(1) on the GPU, instead of a sequential loop with complexity O(n) on the CPU.
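As a CPU reference for what the kernels compute together, one iteration r_{i+1} = r_i × M can be evaluated from OSF_M and the constant c alone, without ever materializing the dense n × n matrix (an illustrative Python sketch, not the paper's CUDA code):

```python
d, n = 0.85, 4
c = (1 - d) / n
# OSF_M of the example graph: a stored pair (l, col) is the link col -> l
line = [0, 0, 1, 2, 2, 3, 3, 3]
column = [2, 3, 0, 0, 3, 0, 1, 2]
value = [0.4625, 0.4625, 0.3208, 0.3208, 0.4625, 0.3208, 0.8875, 0.4625]

def step(r):
    """One step r -> r x M using only the sparse triples plus c."""
    s = sum(r)
    r_new = [c * s] * n                 # constant part of M adds c * sum(r) to every page
    for l, col, v in zip(line, column, value):
        r_new[l] += r[col] * (v - c)    # correction where M differs from c
    return r_new

r1 = step([1.0] * n)
delta = sum(abs(a - b) for a, b in zip(r1, [1.0] * n))   # ||r_1 - r_0||_1, compared to epsilon
print([round(x, 3) for x in r1])   # matches the slide's r_0 x M = (1, 0.433, 0.858, 1.708)
```

Per iteration this touches only the 8 stored triples instead of all 16 matrix entries, which is the point of OSF_M; on the GPU, each kernel parallelizes one of these elementary operations.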
Solution
• We have proposed a blocks division model. Two types of block disposition architectures are proposed:
• The first is a one-dimensional architecture, i.e., all threads are organized in a vector format; we have used this architecture (1) in kernels 4 and 5.
• The second architecture uses a two-dimensional disposition; we have used this architecture (2) in kernels 1, 2 and 3.
Simulation Parameters
• Use of the Compute Unified Device Architecture (CUDA) to program a GPU device for general-purpose computation.
• Use of an NVIDIA GeForce GTX 680 card, which has:
– 1536 cores,
– 6.0 GB/s memory speed,
– 2048 MB of RAM.
• PageRank is evaluated on various real web graphs from the UF Sparse Matrix Collection repository [2].

[2] A. van Heukelum, UF Sparse Matrix Collection, Institute for Theoretical Physics, Utrecht University.
Results
• The GPU computation is faster than the CPU computation by a factor of 100.
Conclusion & Perspectives
• Proposition of a parallel model for PageRank computation,
• Proposition of a new structure (OSF) to store the web graph data.
Future work:
• Apply the PageRank algorithm in different fields, in particular wireless sensor networks.