“Having Fun with PageRank and MapReduce”
Hadoop User Group (HUG) UK – 14th April 2009 – http://huguk.org
Paolo Castagna – HP Labs, Bristol, UK
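
PageRank
[Figure: page p with backward links from pages p1, p2 and p3 (ranks r1, r2, r3); each linking page passes on its rank divided by its number of forward links.]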

PageRank
\[ r(p_i) = \sum_{p_j \in B_{p_i}} \frac{r(p_j)}{|p_j|} \]
B_{p_i}: backward links (i.e. links to p_i)
|p_j|: number of forward links (i.e. links from p_j)
Recursive definition

PageRank
\[ r_{k+1}(p_i) = \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|}, \qquad r_0(p_i) = \frac{1}{N} \]
r_k(p_i): PageRank of page p_i at iteration k
N: total number of pages
Iterative computation

Random Surfer
• A surfer follows links at random, indefinitely
• Time spent on a given page measures the importance of that page
• Problems
  – rank sinks (accumulate too much rank)
  – cycles (could cause periodicity)
• Dangling pages? Jump to any other page
• Bored? Teleportation (fixes rank sinks and eliminates cycles)
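
Dangling Pages
\[ r_{k+1}(p_i) = \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|} + \sum_{p_j : |p_j| = 0} \frac{r_k(p_j)}{N} \]
Second sum: taken over the pages p_j for which |p_j| is zero; a random jump, independent from p_i
N: total number of pages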
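
Teleportation
\[ r_{k+1}(p_i) = d \left( \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|} + \sum_{p_j : |p_j| = 0} \frac{r_k(p_j)}{N} \right) + \frac{1 - d}{N} \sum_{p_j} r_k(p_j), \qquad \sum_{p_j} r_k(p_j) = 1 \]
Last term: applies if there are loops or someone gets bored; a random jump, independent from p_i
d = 0.85: damping factor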

PageRank
\[ r_{k+1}(p_i) = d \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|} + \frac{d}{N} \sum_{p_j : |p_j| = 0} r_k(p_j) + \frac{1 - d}{N} \]
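
The final PageRank formula follows from the teleportation formula once the ranks are known to sum to 1 at every iteration. A short derivation, using only the symbols already defined on these slides (the intermediate step is a reconstruction, not taken from the talk):

```latex
\begin{align*}
r_{k+1}(p_i)
  &= d \left( \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|}
     + \sum_{p_j : |p_j| = 0} \frac{r_k(p_j)}{N} \right)
     + \frac{1 - d}{N} \sum_{p_j} r_k(p_j) \\
  &= d \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|}
     + \frac{d}{N} \sum_{p_j : |p_j| = 0} r_k(p_j)
     + \frac{1 - d}{N} \cdot 1
     && \text{since } \sum_{p_j} r_k(p_j) = 1 \\
  &= d \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|}
     + \frac{d}{N} \sum_{p_j : |p_j| = 0} r_k(p_j)
     + \frac{1 - d}{N}
\end{align*}
```

Only the first term depends on p_i; the other two are the same for every page, which is why they can be computed once per iteration.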
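
Adjacency Matrix
[Figure: directed graph with pages 1 to 6]
Row i, column j is 1 if page i links to page j:
0 1 1 1 0 0
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 1 1
0 0 0 0 0 0
0 0 0 0 0 0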

Hyperlink Matrix
[Figure: directed graph with pages 1 to 6]
H =
0 1/3 1/3 1/3 0   0
0 0   0   0   1   0
0 0   0   0   1   0
0 0   0   0   1/2 1/2
0 0   0   0   0   0
0 0   0   0   0   0
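
A minimal sketch of how the hyperlink matrix H can be derived from forward links, assuming the six-page example graph of this talk and no duplicate links. The helper name `hyperlink_matrix` is illustrative, not from the slides; exact fractions are used so that the row sums can be checked without rounding.

```python
from fractions import Fraction


def hyperlink_matrix(links, pages):
    """Build H: row p holds 1/|p| in the column of every page p links to."""
    index = {p: i for i, p in enumerate(pages)}
    H = [[Fraction(0)] * len(pages) for _ in pages]
    for p in pages:
        out = links.get(p, [])
        for q in out:
            # Assumes each target appears once in the list of forward links.
            H[index[p]][index[q]] = Fraction(1, len(out))
    return H


pages = [1, 2, 3, 4, 5, 6]
links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
H = hyperlink_matrix(links, pages)

# Rows of non-dangling pages are probability distributions (sum to 1);
# rows of dangling pages (5 and 6) are all zero.
row_sums = [sum(row) for row in H]
```

The zero rows are what the dangling-page correction has to compensate for.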
Sparse Matrices

Adjacency List
[Figure: directed graph with pages 1 to 6]
Better for sparse matrices:
1 2 3 4
2 5
3 5
4 5 6
5
6
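
PageRank
\[ r_{k+1}^T = d \, r_k^T H + \frac{d}{N} \, r_k^T a \, e^T + \frac{1 - d}{N} \, e^T \]
a: dangling node vector
e^T: vector of all 1
For the six-page example: r_k^T = (r_1, r_2, r_3, r_4, r_5, r_6)_k, a = (0, 0, 0, 0, 1, 1)^T, e^T = (1, 1, 1, 1, 1, 1), and H is the hyperlink matrix with rows (0, 1/3, 1/3, 1/3, 0, 0), (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 1/2, 1/2), (0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0).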
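
As an illustration of the matrix form, here is one iteration written with plain Python lists, assuming the six-page example and d = 0.85. The function name `matrix_step` is hypothetical, and a dense H is used only for readability; a real implementation would keep H sparse.

```python
def matrix_step(r, H, a, d=0.85):
    """Compute r_{k+1}^T = d r_k^T H + (d/N) r_k^T a e^T + ((1-d)/N) e^T."""
    n = len(r)
    # r^T H: entry i is the sum over j of r[j] * H[j][i].
    rH = [sum(r[j] * H[j][i] for j in range(n)) for i in range(n)]
    # r^T a: total rank currently held by dangling nodes.
    ra = sum(r[j] * a[j] for j in range(n))
    return [d * rH[i] + d * ra / n + (1.0 - d) / n for i in range(n)]


H = [
    [0, 1 / 3, 1 / 3, 1 / 3, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
a = [0, 0, 0, 0, 1, 1]   # pages 5 and 6 are dangling
r0 = [1 / 6] * 6         # uniform start, r_0 = 1/N
r1 = matrix_step(r0, H, a)
```

The ranks still sum to 1 after the step, because the dangling and teleportation terms return exactly the rank that H alone would lose.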
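
Convergence
How many iterations? How to check convergence?
\[ \text{iterations} \approx \frac{-n}{\log_{10} d} \]
n: number of significant digits
ε = 10^{-n}: tolerance
\[ \sum_i \left| r_{k+1}(p_i) - r_k(p_i) \right| < \varepsilon \]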

Convergence
significant digits    iterations
 1                     15
 2                     29
 3                     43
 4                     57
 5                     71
 6                     86
 7                    100
 8                    114
 9                    128
10                    142
11                    156
12                    171
13                    185

Power Method
• Slow to converge
• Each iteration: complexity O(N)
• Overall complexity: iterations × O(N) = O(N) (the number of iterations depends on d and on the tolerance, not on N)
• Minimal storage: H is a sparse matrix; no completely dense matrices need to be stored

Storage Requirements
• Sparse hyperlink matrix H
  – number of non-zero elements (each a double)
• Sparse binary dangling node vector
  – number of dangling nodes (each a boolean)
• PageRank values for the current iteration
  – N elements (each a double)
• (optional) PageRank values for the previous iteration, to measure the tolerance error
  – N elements (each a double)

Implementing PageRank with MapReduce

Adjacency List
1  r_1  2 3 4
2  r_2  5
3  r_3  5
4  r_4  5 6
5  r_5
6  r_6

MapReduce job 3 – page ranks from backward links
• map
  – input
    • key = p, value = ( r_p, { p_1, p_2, ..., p_n } )
  – output
    • key = p_i, value = r_p / n, for i = 1, 2, ..., n
    • key = p, value = { p_1, p_2, ..., p_n }
• reduce
  – input
    • key = p, values = { r_j / n_j }, { p_1, p_2, ..., p_n }
  – output
    • key = p, value = ( r_p, { p_1, p_2, ..., p_n } )
\[ r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1 - d}{N} \]
r_d: contribution from dangling pages (output of job 2)
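
Job 3 can be sketched as plain Python functions with an in-memory stand-in for the shuffle phase. This is an illustration under stated assumptions, not the talk's Hadoop code: the function names are hypothetical, records are `(rank, forward_links)` tuples, and the reducer tells rank shares from the link list by type.

```python
from collections import defaultdict


def map_job3(page, value):
    """Emit one rank share per forward link, plus the page's own link list."""
    rank, out_links = value
    for target in out_links:
        yield target, rank / len(out_links)
    yield page, list(out_links)


def reduce_job3(page, values, r_d, n_pages, d=0.85):
    """Sum the incoming shares and rebuild the (rank, links) record."""
    out_links, incoming = [], 0.0
    for v in values:
        if isinstance(v, list):   # the structure record
            out_links = v
        else:                     # a rank share r_j / n_j
            incoming += v
    rank = d * incoming + r_d + (1.0 - d) / n_pages
    return page, (rank, out_links)


def run_job3(graph, r_d, n_pages, d=0.85):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for page, value in graph.items():
        for key, v in map_job3(page, value):
            grouped[key].append(v)
    return dict(reduce_job3(p, vs, r_d, n_pages, d) for p, vs in grouped.items())


n = 6
forward = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
graph = {p: (1.0 / n, out) for p, out in forward.items()}
# r_d as job 2 would produce it: (d / N) * sum of ranks of dangling pages.
r_d = 0.85 / n * (graph[5][0] + graph[6][0])
new_graph = run_job3(graph, r_d, n)
```

Re-emitting the link list from the mapper is what lets the output of one iteration serve as the input of the next.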

MapReduce job 2 – contribution from dangling pages
• map
  – input
    • key = , value = r_p (p: dangling page)
  – output
    • key = 1, value = r_p
• combine and reduce
  – input
    • key = 1, values = { r_j }
  – output
    • key = , value = r_d (only one value)
\[ r_d = \frac{d}{N} \sum_j r_j \]
N: total number of pages
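
A minimal sketch of job 2, assuming the mapper sees `(rank, forward_links)` records and filters the dangling pages itself (the slide appears to feed it dangling pages directly). Function names and the two input splits are hypothetical.

```python
def map_job2(page, value):
    """Emit the rank of a dangling page under the single key 1."""
    rank, out_links = value
    if not out_links:
        yield 1, rank


def combine_job2(values):
    """Partial sum computed per input split, before the shuffle."""
    return sum(values)


def reduce_job2(partial_sums, n_pages, d=0.85):
    """r_d = (d / N) * sum of the ranks of all dangling pages."""
    return d / n_pages * sum(partial_sums)


n = 6
forward = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
graph = {p: (1.0 / n, out) for p, out in forward.items()}

splits = [[1, 2, 3], [4, 5, 6]]   # hypothetical input splits
partials = [
    combine_job2([v for p in split for _, v in map_job2(p, graph[p])])
    for split in splits
]
r_d = reduce_job2(partials, n)
```

Because every mapper emits the same key, a single reducer receives everything; the combiner keeps that reducer's input down to one number per split.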

MapReduce job 1 – total number of pages
• map
  – input
    • key = p, value =
  – output
    • key = 1, value = 1
• combine and reduce
  – input
    • key = 1, values = { v_j }
  – output
    • key = , value = N
\[ N = \sum_j v_j \]
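
Job 1 is a plain count. The sketch below (hypothetical function names, in-memory data) also shows why the same function can act as both combiner and reducer: summation is associative, so partial sums can be summed again.

```python
def map_job1(page, value):
    """Emit a 1 for every page, all under the same key."""
    yield 1, 1


def reduce_job1(values):
    """Sum the values; usable as a combiner as well as a reducer."""
    return sum(values)


forward = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
emitted = [v for p, out in forward.items() for _, v in map_job1(p, out)]

# Pretend the first three and last three records came from different mappers.
partials = [reduce_job1(emitted[:3]), reduce_job1(emitted[3:])]
N = reduce_job1(partials)
```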

Adjacency List
1  r_1^{k+1}  r_1^k  2 3 4
2  r_2^{k+1}  r_2^k  5
3  r_3^{k+1}  r_3^k  5
4  r_4^{k+1}  r_4^k  5 6
5  r_5^{k+1}  r_5^k
6  r_6^{k+1}  r_6^k

MapReduce job 4 – check for convergence
• map
  – input
    • key = p, value = ( r_p^{k+1}, r_p^k, { p_1, p_2, ..., p_n } )
  – output
    • key = 1, value = abs( r_p^{k+1} - r_p^k )
• combine and reduce
  – input
    • key = 1, values = { v_j }
  – output
    • key = , value = ε
\[ \varepsilon = \sum_j v_j \]
ε: tolerance
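
A sketch of the convergence check, assuming each record carries both the new and the previous rank as in the adjacency list with r^{k+1} and r^k. The names, the two-page data and the threshold are made up for illustration.

```python
def map_job4(page, value):
    """Emit the absolute rank change of one page under the single key 1."""
    new_rank, old_rank, _out_links = value
    yield 1, abs(new_rank - old_rank)


def reduce_job4(values):
    """Total change over all pages."""
    return sum(values)


# Hypothetical two-page state: (new rank, previous rank, forward links).
state = {
    "x": (0.10, 0.12, ["y"]),
    "y": (0.90, 0.88, []),
}
epsilon = reduce_job4(v for p, rec in state.items() for _, v in map_job4(p, rec))
converged = epsilon < 1e-3   # illustrative tolerance
```

Here the total change is about 0.04, so with this tolerance the iteration would continue.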

Putting all together
• job 1 – total number of pages
• for max n iterations or until convergence
  – job 2 – contribution from dangling pages
  – job 3 – page ranks from backward links
  – every y iterations
    • job 4 – check for convergence
• Total number of jobs ≤ 1 + 2n + n/y
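
The control flow above can be sketched as a driver loop. Each "job" here is an ordinary in-memory function standing in for a MapReduce job, so this is only an illustration of the orchestration and of the job count; all names and default parameters are assumptions.

```python
def count_pages(graph):                               # stands in for job 1
    return sum(1 for _ in graph)


def dangling_contribution(graph, ranks, n, d):        # stands in for job 2
    return d / n * sum(ranks[p] for p, out in graph.items() if not out)


def update_ranks(graph, ranks, r_d, n, d):            # stands in for job 3
    new = {p: r_d + (1.0 - d) / n for p in graph}
    for p, out in graph.items():
        for q in out:
            new[q] += d * ranks[p] / len(out)
    return new


def total_change(new, old):                           # stands in for job 4
    return sum(abs(new[p] - old[p]) for p in new)


def run(graph, d=0.85, max_iterations=100, check_every=5, tolerance=1e-6):
    jobs = 1                                          # job 1 runs once
    n = count_pages(graph)
    ranks = {p: 1.0 / n for p in graph}
    for k in range(1, max_iterations + 1):
        r_d = dangling_contribution(graph, ranks, n, d)
        new_ranks = update_ranks(graph, ranks, r_d, n, d)
        jobs += 2                                     # jobs 2 and 3
        converged = False
        if k % check_every == 0:                      # job 4 every y iterations
            jobs += 1
            converged = total_change(new_ranks, ranks) < tolerance
        ranks = new_ranks
        if converged:
            break
    return ranks, k, jobs


graph = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
ranks, iterations, jobs = run(graph)
```

Checking convergence only every y iterations trades a few possibly unnecessary iterations for fewer job launches, which matters when each launch has a fixed overhead.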

Having Fun with PageRank
• Intelligent surfer
  – Change rows of the hyperlink matrix H, so long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)

Ranking Papers: CiteSeer Dataset

CiteSeer  (1)  (2)  PageRank    Title (Year)
340126    yes  yes  0.00157983  New Directions in Cryptography, Invited Paper (1976)
549100    no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351    no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057    yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071    no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289     yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631    no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445    no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544    no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879    yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874    no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436     no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail, Part II (1993)
219414    no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail, Part III (1993)
567230    no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336     no   no   0.00075584  Generalised Additive Models (1995)
524648    yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205     no   no   0.00073840  Congestion Avoidance and Control (1988)
35316     no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766     yes  no   0.00068785  The UNIX Time-Sharing System (1974)
351230    no   no   0.00067404  History of Circumscription (1993)
CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng

Dirty Data
• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and a unique solution as well as convergence are not guaranteed (with a sufficiently high number of iterations you might get ∞ as a result)
Example adjacency list:
a b c
b h k p
c
d a a c
e s
f f b
g
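
A minimal cleaning pass over the example above might look like the following. It assumes one whitespace-separated line per source page, with each source appearing on a single line; the function name `clean_adjacency` is illustrative.

```python
def clean_adjacency(lines):
    """Drop duplicate links and self-references; add implicit dangling nodes."""
    graph = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        source, targets = tokens[0], tokens[1:]
        cleaned = []
        for t in targets:
            # Skip self-references and links already seen for this source.
            if t != source and t not in cleaned:
                cleaned.append(t)
        graph[source] = cleaned
    # Pages that only ever appear as link targets become explicit dangling nodes.
    for targets in list(graph.values()):
        for t in targets:
            graph.setdefault(t, [])
    return graph


raw = ["a b c", "b h k p", "c", "d a a c", "e s", "f f b", "g"]
graph = clean_adjacency(raw)
dangling = sorted(p for p, out in graph.items() if not out)
```

After this pass every page has a row, so the total number of pages N includes h, k, p and s, and the dangling-page correction can account for their rank.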

Convergence in Practice
[Figures: ranking positions over successive iterations]
Rows: ranking positions. Columns: iterations. Cells: document ids.
Red: document should not be in the first 20 results.
Yellow: document in the first 20 results, but wrong position.
Green: document in the first 20 results, correct position.

References
• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”. Amy N. Langville and Carl D. Meyer. Princeton University Press (2006). ISBN 0-691-12202-4. http://press.princeton.edu/titles/8216.html
• “The anatomy of a large-scale hypertextual Web search engine”. Sergey Brin and Lawrence Page. In Proc. of the Seventh International World Wide Web Conference (WWW 1998). http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”. Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd. Technical Report, Stanford InfoLab (1999). http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”. Matthew Richardson and Pedro Domingos. In Proc. of Advances in Neural Information Processing Systems (2002). http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”. Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov. Journal of Statistical Mechanics (2007). http://arxiv.org/abs/physics/0612122v1
Power Method
• Slow to converge
• Complexity of each iteration: O(N)
• Overall complexity: iterations × O(N) = O(N), since the number of iterations depends on d and on the tolerance, not on N
• Minimal storage: H is a sparse matrix, and no completely dense matrices need to be stored
16
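A minimal single-machine sketch of the power method, assuming the update rule from the earlier PageRank slides (damping factor d, dangling pages jumping uniformly to any page) and the six-page example graph. The function name `pagerank` and its parameters are illustrative, not taken from the talk.

```python
from typing import Dict, List


def pagerank(links: Dict[int, List[int]], d: float = 0.85,
             tol: float = 1e-10, max_iter: int = 1000) -> Dict[int, float]:
    """Power iteration over an adjacency list (page -> forward links)."""
    n = len(links)
    ranks = {p: 1.0 / n for p in links}           # r_0(p_i) = 1 / N
    for _ in range(max_iter):
        # Rank held by dangling pages is spread evenly over all pages.
        dangling = sum(ranks[p] for p, out in links.items() if not out)
        base = d * dangling / n + (1.0 - d) / n
        new_ranks = {p: base for p in links}
        # Each page passes d * r / |out| along every forward link.
        for p, out in links.items():
            for q in out:
                new_ranks[q] += d * ranks[p] / len(out)
        # L1 distance between successive iterates.
        err = sum(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if err < tol:
            break
    return ranks


# The six-page example graph from the adjacency-list slide.
GRAPH = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
ranks = pagerank(GRAPH)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 4))
```

Each pass touches every page and every link once, which is where the per-iteration cost on the slide comes from when the number of links per page is bounded.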
Storage Requirements
• Sparse hyperlink matrix H
  - number of non-zero elements (each a double)
• Sparse binary dangling node vector
  - number of dangling nodes (each a boolean)
• PageRank values for the current iteration
  - N elements (each a double)
• (optional) PageRank values for the previous iteration, to measure the tolerance error
  - N elements (each a double)
17
Implementing PageRank with MapReduce
18
Adjacency List
(each record: page, current PageRank, forward links)
1  r_1  2 3 4
2  r_2  5
3  r_3  5
4  r_4  5 6
5  r_5
6  r_6
19
MapReduce job 3 - page ranks from backward links
• map
  - input
    • key = p, value = ( r_p, p_1, p_2, ..., p_n )
  - output
    • key = p_i, value = r_p / n, for i = ( 1, 2, ..., n )
    • key = p, value = ( p_1, p_2, ..., p_n )
• reduce
  - input
    • key = p, values = ( r_j / n_j ), ( p_1, p_2, ..., p_n )
  - output
    • key = p, value = ( r_p, p_1, p_2, ..., p_n )
$$r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1 - d}{N}$$
where r_d is the contribution from dangling pages computed by job 2.
20
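The map and reduce steps of job 3 can be sketched in plain Python, with a dictionary standing in for the shuffle phase. This is an in-memory illustration under stated assumptions, not Hadoop code: the names `map_job3`, `reduce_job3` and `run_job3` are hypothetical, and r_d is assumed to already include the factor d / N, as produced by job 2.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple, Union

Record = Tuple[float, List[int]]      # (r_p, forward links of p)
Value = Union[float, List[int]]       # a rank share r_p / n, or a link list


def map_job3(p: int, record: Record) -> Iterator[Tuple[int, Value]]:
    r_p, links = record
    for q in links:
        yield q, r_p / len(links)     # key = p_i, value = r_p / n
    yield p, links                    # key = p,   value = forward links


def reduce_job3(p: int, values: List[Value], r_d: float,
                d: float, n_pages: int) -> Tuple[int, Record]:
    links: List[int] = []
    shares = 0.0
    for v in values:
        if isinstance(v, list):
            links = v                 # the page's own link list
        else:
            shares += v               # r_j / n_j from one backward link
    r_p = d * shares + r_d + (1.0 - d) / n_pages
    return p, (r_p, links)


def run_job3(records: Dict[int, Record], r_d: float,
             d: float = 0.85) -> Dict[int, Record]:
    """In-memory stand-in for the shuffle: group map output by key."""
    grouped: Dict[int, List[Value]] = defaultdict(list)
    for p, record in records.items():
        for key, value in map_job3(p, record):
            grouped[key].append(value)
    n_pages = len(records)
    return dict(reduce_job3(p, vs, r_d, d, n_pages)
                for p, vs in grouped.items())


GRAPH = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
records = {p: (1.0 / 6, links) for p, links in GRAPH.items()}
r_d = 0.85 / 6 * (records[5][0] + records[6][0])   # what job 2 would emit
updated = run_job3(records, r_d)
```

Emitting the link list alongside the rank shares is what lets the reducer rebuild a record in the same shape as its input, so the output of one iteration can feed the next.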
MapReduce job 2 - contribution from dangling pages
• map
  - input (dangling pages)
    • key = , value = r_p
  - output
    • key = 1, value = r_p
• combine and reduce
  - input
    • key = 1, values = ( r_j )
  - output (only one value)
    • key = , value = r_d
$$r_d = \frac{d}{N} \sum_j r_j$$
where N is the total number of pages.
21
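A small sketch of job 2, assuming each input record carries the page's rank and its forward links so that the mapper can tell which pages are dangling. The function names and the two input splits are hypothetical; the combiner is shown as a partial sum per split.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[float, List[int]]      # (r_p, forward links of p)


def map_job2(p: int, record: Record) -> Iterator[Tuple[int, float]]:
    r_p, links = record
    if not links:                     # only dangling pages emit
        yield 1, r_p                  # constant key: one reducer sees all


def combine_job2(values: Iterable[float]) -> float:
    return sum(values)                # partial sum on one input split


def reduce_job2(partials: Iterable[float], d: float, n_pages: int) -> float:
    return d / n_pages * sum(partials)    # r_d = (d / N) * sum_j r_j


GRAPH = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
records = {p: (1.0 / 6, links) for p, links in GRAPH.items()}
splits = [[1, 2, 3], [4, 5, 6]]       # two hypothetical input splits
partials = [
    combine_job2(v for p in split for _, v in map_job2(p, records[p]))
    for split in splits
]
r_d = reduce_job2(partials, d=0.85, n_pages=len(records))
print(r_d)
```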
MapReduce job 1 - total number of pages
• map
  - input
    • key = p, value =
  - output
    • key = 1, value = 1
• combine and reduce
  - input
    • key = 1, values = ( v_j )
  - output
    • key = , value = N
$$N = \sum_j v_j$$
22
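Job 1 is a plain count. The sketch below (hypothetical function names, in-memory splits) shows why a combiner helps here: every record shares the key 1, so without pre-aggregation a single reducer would receive one value per page.

```python
from typing import Iterable, Iterator, List, Tuple


def map_job1(p: int) -> Iterator[Tuple[int, int]]:
    yield 1, 1                        # key = 1, value = 1 for every page


def combine_job1(values: Iterable[int]) -> int:
    return sum(values)                # one partial count per input split


def reduce_job1(partials: Iterable[int]) -> int:
    return sum(partials)              # N = sum_j v_j


splits: List[List[int]] = [[1, 2, 3], [4, 5, 6]]   # hypothetical splits
partials = [
    combine_job1(v for p in split for _, v in map_job1(p))
    for split in splits
]
n_pages = reduce_job1(partials)
print(n_pages)
```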
Adjacency List
1  r_1^{k+1}  r_1^k  2 3 4
2  r_2^{k+1}  r_2^k  5
3  r_3^{k+1}  r_3^k  5
4  r_4^{k+1}  r_4^k  5 6
5  r_5^{k+1}  r_5^k
6  r_6^{k+1}  r_6^k
23
MapReduce job 4 - check for convergence
• map
  - input
    • key = p, value = ( r_p^{k+1}, r_p^k, p_1, p_2, ..., p_n )
  - output
    • key = 1, value = abs( r_p^{k+1} - r_p^k )
• combine and reduce
  - input
    • key = 1, values = ( v_j )
  - output
    • key = , value = ε (ε: tolerance)
$$\varepsilon = \sum_j v_j$$
24
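The convergence check can be sketched as follows, assuming each record stores both the new and the previous rank as in the adjacency list above. The helper names, the sample records, and the strict `<` comparison against the tolerance are illustrative assumptions.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[float, float, List[int]]   # (r_p^{k+1}, r_p^k, forward links)


def map_job4(p: int, record: Record) -> Iterator[Tuple[int, float]]:
    r_new, r_old, _links = record
    yield 1, abs(r_new - r_old)           # key = 1, value = |r^{k+1} - r^k|


def reduce_job4(values: Iterable[float]) -> float:
    return sum(values)                    # epsilon = sum_j v_j


def has_converged(records: Dict[int, Record],
                  tolerance: float) -> Tuple[bool, float]:
    error = reduce_job4(
        v for p, record in records.items() for _, v in map_job4(p, record)
    )
    return error < tolerance, error


# Two hypothetical records, mid-computation.
records = {1: (0.30, 0.25, [2]), 2: (0.70, 0.75, [])}
converged, error = has_converged(records, tolerance=1e-3)
print(converged, error)
```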
Putting it all together
• job 1 - total number of pages
• for at most n iterations, or until convergence
  - job 2 - contribution from dangling pages
  - job 3 - page ranks from backward links
  - every y iterations
    • job 4 - check for convergence
• Total number of jobs <= 1 + 2n + n/y
25
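The driver loop can be sketched in memory, with each job reduced to a plain function and a counter tracking how many jobs are launched. This is a sketch under assumptions: `run`, `check_every` (the slide's y) and `max_iter` (the slide's n) are names chosen here, and each function stands in for a full MapReduce job.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]
Ranks = Dict[int, float]


def job1_count(graph: Graph) -> int:
    return len(graph)


def job2_dangling(graph: Graph, ranks: Ranks, d: float, n: int) -> float:
    return d / n * sum(ranks[p] for p, links in graph.items() if not links)


def job3_ranks(graph: Graph, ranks: Ranks, r_d: float,
               d: float, n: int) -> Ranks:
    new = {p: r_d + (1.0 - d) / n for p in graph}
    for p, links in graph.items():
        for q in links:
            new[q] += d * ranks[p] / len(links)
    return new


def job4_error(new: Ranks, old: Ranks) -> float:
    return sum(abs(new[p] - old[p]) for p in new)


def run(graph: Graph, d: float = 0.85, tol: float = 1e-6,
        max_iter: int = 100, check_every: int = 5) -> Tuple[Ranks, int, int]:
    jobs = 1                                  # job 1 runs once
    n = job1_count(graph)
    ranks = {p: 1.0 / n for p in graph}
    iterations = 0
    for k in range(1, max_iter + 1):
        r_d = job2_dangling(graph, ranks, d, n)
        new = job3_ranks(graph, ranks, r_d, d, n)
        jobs += 2                             # jobs 2 and 3, every iteration
        iterations = k
        converged = False
        if k % check_every == 0:              # job 4, every y iterations
            jobs += 1
            converged = job4_error(new, ranks) < tol
        ranks = new
        if converged:
            break
    return ranks, iterations, jobs


GRAPH = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
ranks, iterations, jobs = run(GRAPH)
print(iterations, jobs)
```

Checking convergence only every few iterations trades a handful of possibly unnecessary iterations for fewer job launches, which matters when each job has a fixed start-up cost.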
Having Fun with PageRank
• Intelligent surfer
  - Change rows of the hyperlink matrix H, so long as they remain probability distributions
  - Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
26
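As an illustration of the teleportation (personalization) vector, the sketch below replaces the uniform 1/N jump with a distribution v. It assumes that dangling pages also jump according to v, which is one common choice and not necessarily the one used in the talk; `personalized_pagerank` and its parameters are hypothetical names.

```python
from typing import Dict, List


def personalized_pagerank(links: Dict[int, List[int]], v: Dict[int, float],
                          d: float = 0.85, tol: float = 1e-12,
                          max_iter: int = 1000) -> Dict[int, float]:
    """Power iteration where every random jump lands on p with probability v[p]."""
    ranks = dict(v)                           # start from the jump distribution
    for _ in range(max_iter):
        dangling = sum(ranks[p] for p, out in links.items() if not out)
        # Teleportation and dangling mass are both distributed according to v.
        new = {p: (d * dangling + 1.0 - d) * v[p] for p in links}
        for p, out in links.items():
            for q in out:
                new[q] += d * ranks[p] / len(out)
        err = sum(abs(new[p] - ranks[p]) for p in links)
        ranks = new
        if err < tol:
            break
    return ranks


GRAPH = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
uniform = {p: 1.0 / len(GRAPH) for p in GRAPH}
towards_page_1 = {p: (1.0 if p == 1 else 0.0) for p in GRAPH}

plain = personalized_pagerank(GRAPH, uniform)
biased = personalized_pagerank(GRAPH, towards_page_1)
print(round(plain[1], 4), round(biased[1], 4))
```

With a uniform v this reduces to the ordinary random jump; concentrating v on a page raises the rank of that page and of the pages reachable from it.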
Ranking Papers: CiteSeer Dataset
27
CiteSeer  (1)  (2)  PageRank    Title (Year)
340126    yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100    no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351    no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057    yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071    no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289     yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631    no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445    no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544    no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879    yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874    no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436     no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail Part II (1993)
219414    no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail Part III (1993)
567230    no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336     no   no   0.00075584  Generalised Additive Models (1995)
524648    yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205     no   no   0.00073840  Congestion Avoidance and Control (1988)
35316     no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766     yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230    no   no   0.00067404  History of Circumscription (1993)
CiteSeer - http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng
Dirty Data
• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and a unique solution as well as convergence are not guaranteed (with a sufficiently high number of iterations you might get ∞ as the result)
28
a b c
b h k p
c
d a a c
e s
f f b
g
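The clean-up rules on this slide can be sketched as a small preprocessing step over the raw adjacency list: drop duplicate links, drop self-references, and add an explicit empty record for every page that only appears as a link target. The function name `clean_adjacency_list` is hypothetical.

```python
from typing import Dict, Iterable, List


def clean_adjacency_list(lines: Iterable[str]) -> Dict[str, List[str]]:
    graph: Dict[str, List[str]] = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        source, targets = tokens[0], tokens[1:]
        links = graph.setdefault(source, [])
        for target in targets:
            # Skip self-references and links already recorded.
            if target != source and target not in links:
                links.append(target)
    # Implicit dangling nodes: linked to, but never listed as a source.
    for links in list(graph.values()):
        for target in links:
            graph.setdefault(target, [])
    return graph


RAW = ["a b c", "b h k p", "c", "d a a c", "e s", "f f b", "g"]
graph = clean_adjacency_list(RAW)
for page in sorted(graph):
    print(page, " ".join(graph[page]))
```

After this step every page has exactly one record, so the page count from job 1 and the dangling-page sum from job 2 cover the implicit nodes h, k, p and s as well.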
Convergence in Practice
29
Rows: ranking positions. Columns: iterations. Cells: document ids.
Red: document should not be in the first 20 results.
Yellow: document in the first 20 results, but in the wrong position.
Green: document in the first 20 results, in the correct position.
Convergence in Practice
30
Convergence in Practice
31
Convergence in Practice
32
References
• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”. Amy N. Langville and Carl D. Meyer. Princeton University Press (2006). ISBN 0-691-12202-4. http://press.princeton.edu/titles/8216.html
• “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. Sergey Brin and Lawrence Page. In Proc. of the Seventh International World Wide Web Conference (WWW 1998). http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”. Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd. Technical Report, Stanford InfoLab (1999). http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”. Matthew Richardson and Pedro Domingos. In Proc. of Advances in Neural Information Processing Systems (2002). http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”. Dylan Walker, Huafeng Xie, Koon-Kiu Yan and Sergei Maslov. Journal of Statistical Mechanics (2007). http://arxiv.org/abs/physics/0612122v1
33
Adjacency Matrix
000000
000000
110000
010000
010000
001110
1
2
3
4
5
6
9
Hyperlink Matrix
1
2
3
4
5
6
000000
000000
0000
010000
010000
000
21
21
31
31
31
H
10
Sparse Matrices
11
Adjacency List
1
2
3
4
5
6
better for sparse matrices
1 2 3 4
2 5
3 5
4 5 6
5
6
12
PageRank
TTT
k6
5
4
3
2
1
21
21
31
31
31
T
k6
5
4
3
2
1
T
16
5
4
3
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
000000
000000
0000
010000
010000
000
N
d
r
r
r
r
r
r
N
d
r
r
r
r
r
r
d
r
r
r
r
r
r
k
TTTTT1
1eaerHrr
N
d
N
dd kkk
a dangling node vector
eT vector of all 113
Convergence
How many iterations How to check convergence
d
n
10log
n number of significant digits
ε = 10-n tolerance
ikik prpr 1
14
Convergence
significant digits iterations1 15
2 29
3 43
4 57
5 71
6 86
7 100
8 114
9 128
10 142
11 156
12 171
13 185
15
Power Method
bull Slow to converge
bull Each iteration complexity O(N)
bull Overall complexity iterations O(N) = O(N)
bull Minimal storage H sparse matrix no completely dense matrices need to be stored
16
Storage Requirements
bull Sparse hyperlink matrix H
ndash number of non zero elements (each a double)
bull Sparse binary dangling node vector
ndash number of dangling nodes (each a boolean)
bull PageRank values for the current iteration
ndash N elements (each a double)
bull (optional) PageRank values for the previous iteration to measure tolerance error
ndash N elements (each a double)
17
Implementing PageRank with MapReduce
Adjacency List (each line: page, current rank, outgoing links)
1 r1 2 3 4
2 r2 5
3 r3 5
4 r4 5 6
5 r5
6 r6
MapReduce job 3 – page ranks from backward links
• map
  – input
    • key = p, value = ( r_p, p_1, p_2, …, p_n )
  – output
    • key = p_i, value = r_p / n, for i = ( 1, 2, …, n )
    • key = p, value = ( p_1, p_2, …, p_n )
• reduce
  – input
    • key = p, values = ( r_j / n_j ), ( p_1, p_2, …, p_n )
  – output
    • key = p, value = ( r_p, p_1, p_2, …, p_n )
$$
r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1-d}{N}
$$
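The map and reduce steps of job 3 could be sketched in plain Python as follows. This is not Hadoop code: `run_job3` is a hypothetical in-memory stand-in for the shuffle phase, all function names are assumptions, and the dangling contribution r_d is computed inline here only so the example runs on its own.

```python
from collections import defaultdict

D = 0.85  # damping factor d from the slides

def map_job3(page, value):
    """Emit r_p / n to every outlink, then re-emit the outlink list itself."""
    rank, outlinks = value
    for target in outlinks:
        yield target, rank / len(outlinks)
    yield page, list(outlinks)

def reduce_job3(page, values, r_d, n_pages, d=D):
    """r_p = d * sum(r_j / n_j) + r_d + (1 - d) / N."""
    outlinks, total = [], 0.0
    for v in values:
        if isinstance(v, list):   # the graph-structure record
            outlinks = v
        else:                     # a rank contribution r_j / n_j
            total += v
    return page, (d * total + r_d + (1.0 - d) / n_pages, outlinks)

def run_job3(records, r_d, n_pages, d=D):
    """Group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for page, value in records.items():
        for key, v in map_job3(page, value):
            groups[key].append(v)
    return dict(reduce_job3(k, vs, r_d, n_pages, d) for k, vs in groups.items())

links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
N = len(links)
records = {p: (1.0 / N, outs) for p, outs in links.items()}
# Dangling contribution, normally produced by a separate job.
r_d = D / N * sum(rank for rank, outs in records.values() if not outs)
updated = run_job3(records, r_d, N)
print(updated)
```

Re-emitting the outlink list from the mapper is what lets the reducer write records in the same format it read, so the output of one iteration can feed the next.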
MapReduce job 2 – contribution from dangling pages
• map
  – input
    • key = [missing], value = r_p (dangling page)
  – output
    • key = 1, value = r_p
• combine and reduce
  – input
    • key = 1, values = ( r_j )
  – output
    • key = [missing], value = r_d (only one value)
$$
r_d = \frac{d}{N} \sum_j r_j
$$
$N$: total number of pages
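A sketch of job 2 in plain Python, assuming each input record carries a page's rank and outlink list so that the mapper can recognise dangling pages itself (the slide instead feeds the mapper dangling pages only). Function names are illustrative.

```python
def map_job2(page, value):
    """Emit the rank of a dangling page under the single key 1."""
    rank, outlinks = value
    if not outlinks:
        yield 1, rank

def reduce_job2(key, values, n_pages, d=0.85):
    """r_d = (d / N) * sum of the dangling ranks."""
    return d / n_pages * sum(values)

def run_job2(records, n_pages, d=0.85):
    emitted = [v for page, value in records.items() for _, v in map_job2(page, value)]
    return reduce_job2(1, emitted, n_pages, d)

links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
N = len(links)
records = {p: (1.0 / N, outs) for p, outs in links.items()}
r_d = run_job2(records, N)
print(r_d)
```

Since every value shares one key, a combiner that pre-sums ranks per mapper keeps the traffic to the single reducer small.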
MapReduce job 1 – total number of pages
• map
  – input
    • key = p, value = [missing]
  – output
    • key = 1, value = 1
• combine and reduce
  – input
    • key = 1, values = ( v_j )
  – output
    • key = [missing], value = N
$$
N = \sum_j v_j
$$
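Counting pages is the simplest of the jobs. The sketch below imitates the combine step with two hypothetical input splits; `sum_values` serves as both combiner and reducer because addition is associative. All names are assumptions for illustration.

```python
def map_job1(page, value):
    """Emit a 1 for every page, all under the same key."""
    yield 1, 1

def sum_values(key, values):
    """Usable as both combiner and reducer."""
    return key, sum(values)

links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
splits = [[1, 2, 3, 4], [5, 6]]   # hypothetical input splits

partials = []
for split in splits:
    emitted = [v for p in split for _, v in map_job1(p, links[p])]
    partials.append(sum_values(1, emitted)[1])   # combiner output per split

_, N = sum_values(1, partials)                   # single reducer
print(N)
```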
Adjacency List (each line: page, new rank, previous rank, outgoing links)
1 r_1^{k+1} r_1^k 2 3 4
2 r_2^{k+1} r_2^k 5
3 r_3^{k+1} r_3^k 5
4 r_4^{k+1} r_4^k 5 6
5 r_5^{k+1} r_5^k
6 r_6^{k+1} r_6^k
MapReduce job 4 – check for convergence
• map
  – input
    • key = p, value = ( r_p^{k+1}, r_p^k, p_1, p_2, …, p_n )
  – output
    • key = 1, value = abs( r_p^{k+1} - r_p^k )
• combine and reduce
  – input
    • key = 1, values = ( v_j )
  – output
    • key = [missing], value = ε (ε: tolerance)
$$
\varepsilon = \sum_j v_j
$$
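A sketch of the convergence check, assuming the driver compares the summed absolute differences against a tolerance it chooses. The record layout follows the slide; the function names and the sample ranks are made up for illustration.

```python
def map_job4(page, value):
    """Emit the absolute change in rank of one page under the single key 1."""
    new_rank, old_rank, outlinks = value
    yield 1, abs(new_rank - old_rank)

def reduce_job4(key, values):
    """Total change across all pages."""
    return sum(values)

def has_converged(records, tolerance):
    diffs = [v for page, value in records.items() for _, v in map_job4(page, value)]
    return reduce_job4(1, diffs) < tolerance

# Hypothetical ranks for three pages: (new rank, previous rank, outlinks).
records = {
    1: (0.10, 0.10, [2, 3]),
    2: (0.20, 0.25, [3]),
    3: (0.70, 0.65, []),
}
print(has_converged(records, 1e-3), has_converged(records, 0.2))
```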
Putting all together
• job 1 – total number of pages
• for max n iterations or until convergence
  – job 2 – contribution from dangling pages
  – job 3 – page ranks from backward links
  – every y iterations
    • job 4 – check for convergence
• Total number of jobs <= 1 + 2n + n/y
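The driver loop can be sketched as a job counter: one counting job, two jobs per iteration, and one convergence check every y iterations. The `converges_at` parameter is a hypothetical stand-in for the real result of the check.

```python
def count_jobs(max_iterations, check_every, converges_at=None):
    """Count MapReduce jobs launched by the driver loop."""
    jobs = 1                                  # job 1: total number of pages
    for k in range(1, max_iterations + 1):
        jobs += 2                             # job 2 (dangling) + job 3 (ranks)
        if k % check_every == 0:
            jobs += 1                         # job 4: convergence check
            if converges_at is not None and k >= converges_at:
                break                         # stop at the first passing check
    return jobs

print(count_jobs(10, 5))                      # no early stop
print(count_jobs(100, 5, converges_at=12))    # stops at the check in iteration 15
```

A larger y saves convergence jobs but may run up to y - 1 iterations past the point where the ranks had already settled.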
Having Fun with PageRank
• Intelligent surfer
  – Change rows of the hyperlink matrix H, as long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes: evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
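As one illustration of the teleportation-vector idea, the uniform jump term (1 - d)/N can be replaced by (1 - d) times a probability vector v. The sketch assumes dangling mass is still spread uniformly, which is a simplification of this example rather than something the slides specify; `personalized_step` is a hypothetical name.

```python
def personalized_step(links, r, v, d=0.85):
    """One iteration where teleportation follows the distribution v."""
    n = len(links)
    dangling = sum(r[p] for p, outs in links.items() if not outs)
    # Teleportation lands on page p with probability v[p] instead of 1/N.
    nxt = {p: (1.0 - d) * v[p] + d * dangling / n for p in links}
    for p, outs in links.items():
        for q in outs:
            nxt[q] += d * r[p] / len(outs)
    return nxt

links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
v = {p: 0.0 for p in links}
v[1] = 1.0                                    # always teleport to page 1
r = {p: 1.0 / len(links) for p in links}
for _ in range(100):
    r = personalized_step(links, r, v)
print(r)
```

Biasing v toward page 1 lifts its rank even though no page links to it.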
Ranking Papers: CiteSeer Dataset
CiteSeer | (1) | (2) | PageRank   | Title (Year)
340126   | yes | yes | 0.00157983 | New Directions in Cryptography, Invited Paper (1976)
549100   | no  | no  | 0.00121952 | Structure and Complexity of Relational Queries (1982)
548351   | no  | no  | 0.00120267 | Computable Queries for Relational Data Bases (1980)
527057   | yes | yes | 0.00114733 | Optimization by Simulated Annealing (1983)
516071   | no  | no  | 0.00112389 | Probabilistic Methods in Combinatorics (1974)
28289    | yes | no  | 0.00108669 | A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631   | no  | no  | 0.00107492 | Fast Anisotropic Gauss Filtering (2001)
328445   | no  | yes | 0.00101743 | Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544   | no  | no  | 0.00096108 | Discrepancy in Arithmetic Progressions (1996)
148879   | yes | no  | 0.00094061 | Yacc: Yet Another Compiler-Compiler (1975)
311874   | no  | yes | 0.00091705 | Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436    | no  | no  | 0.00090491 | Privacy Enhancement for Internet Electronic Mail: Part II (1993)
219414   | no  | no  | 0.00084181 | Privacy Enhancement for Internet Electronic Mail: Part III (1993)
567230   | no  | no  | 0.00080948 | A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336    | no  | no  | 0.00075584 | Generalised Additive Models (1995)
524648   | yes | no  | 0.00074717 | Implementing Remote Procedure Calls (1984)
15205    | no  | no  | 0.00073840 | Congestion Avoidance and Control (1988)
35316    | no  | no  | 0.00069750 | Relational Queries Computable in Polynomial Time (1986)
76766    | yes | no  | 0.00068785 | The UNIX Time-Sharing (1974)
351230   | no  | no  | 0.00067404 | History of Circumscription (1993)
CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng
Dirty Data
• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes ( h k p s )
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as the result)

a b c
b h k p
c
d a a c
e s
f f b
g
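The clean-up rules could be sketched as a small pre-processing function applied before any ranking job. The name `clean_adjacency` and the dictionary representation are assumptions; the input is the dirty example from the slide.

```python
def clean_adjacency(raw):
    """Drop duplicate links and self-references; add implicit dangling nodes."""
    cleaned = {}
    for page, outs in raw.items():
        kept = []
        for q in outs:
            if q != page and q not in kept:   # skip self-references and duplicates
                kept.append(q)
        cleaned[page] = kept
    # Pages that are only ever link targets become explicit dangling nodes.
    for outs in list(cleaned.values()):
        for q in outs:
            cleaned.setdefault(q, [])
    return cleaned

raw = {
    "a": ["b", "c"],
    "b": ["h", "k", "p"],
    "c": [],
    "d": ["a", "a", "c"],
    "e": ["s"],
    "f": ["f", "b"],
    "g": [],
}
cleaned = clean_adjacency(raw)
for page in sorted(cleaned):
    print(page, *cleaned[page])
```

After cleaning, every link target has its own record, so the page count and the dangling-rank sum both cover h, k, p and s.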
Convergence in Practice
[Figures on four slides, not reproduced in the transcript]
Rows: ranking positions. Columns: iterations. Cells: document ids.
Red: document should not be in the first 20 results. Yellow: document in the first 20 results but wrong position. Green: document in the first 20 results, correct position.
References
• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”, Amy N. Langville and Carl D. Meyer, Princeton University Press (2006), ISBN 0-691-12202-4, http://press.princeton.edu/titles/8216.html
• “The anatomy of a large-scale hypertextual Web search engine”, Sergey Brin and Lawrence Page, in Proc. of the Seventh International World Wide Web Conference (WWW 1998), http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Technical Report, Stanford InfoLab (1999), http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”, Matthew Richardson and Pedro Domingos, in Proc. of Advances in Neural Information Processing Systems (2002), http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”, Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov, Journal of Statistical Mechanics (2007), http://arxiv.org/abs/physics/0612122v1
Convergence
How many iterations? How to check convergence?
Estimated number of iterations: $-n / \log_{10} d$
n: number of significant digits
ε = $10^{-n}$: tolerance
Convergence check: $|r_{k+1}(p_i) - r_k(p_i)| < \varepsilon$
14
Convergence
significant digits  iterations
1 15
2 29
3 43
4 57
5 71
6 86
7 100
8 114
9 128
10 142
11 156
12 171
13 185
15
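The table of iteration counts can be reproduced with a few lines of Python. This is a minimal sketch that assumes the estimate is -n / log10(d), rounded up, with d = 0.85; the function name `iterations_needed` is made up for illustration.

```python
import math

def iterations_needed(significant_digits: int, d: float = 0.85) -> int:
    """Smallest whole number of iterations k such that d**k <= 10**-n."""
    return math.ceil(-significant_digits / math.log10(d))

# Print the estimate for 1 to 13 significant digits.
for n in range(1, 14):
    print(n, iterations_needed(n))
```

Rounding up is an assumption of this sketch; it is what makes the values line up with the table.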
Power Method
• Slow to converge
• Complexity of each iteration: O(N)
• Overall complexity: iterations × O(N) = O(N), because the number of iterations depends on d and on the tolerance, not on N
• Minimal storage: H is a sparse matrix, so no completely dense matrices need to be stored
16
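A single-machine version of the power method helps to see what each iteration does before distributing it. The sketch below is an illustration only, not the implementation from the talk. It works on an adjacency list, spreads the rank of dangling pages evenly over all pages, adds the teleportation term, and stops when the summed absolute change falls below a tolerance. The function name `pagerank` and its parameters are assumptions.

```python
def pagerank(links, d=0.85, tolerance=1e-8, max_iterations=200):
    """Power iteration over an adjacency list {page: [out-links]}."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for iteration in range(1, max_iterations + 1):
        # Rank held by dangling pages is spread evenly over all pages.
        dangling = sum(ranks[p] for p in pages if not links[p])
        new_ranks = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each out-link.
        for p in pages:
            for target in links[p]:
                new_ranks[target] += d * ranks[p] / len(links[p])
        error = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if error < tolerance:
            break
    return ranks, iteration

# The six-page example graph used in the adjacency list slides.
links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
ranks, iterations = pagerank(links)
```

Each iteration visits every page and every link once, which is why the cost per iteration grows linearly for a sparse graph.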
Storage Requirements
• Sparse hyperlink matrix H
  - number of non-zero elements (each a double)
• Sparse binary dangling node vector
  - number of dangling nodes (each a boolean)
• PageRank values for the current iteration
  - N elements (each a double)
• (optional) PageRank values for the previous iteration, to measure the tolerance error
  - N elements (each a double)
17
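As a rough worked example, the four items above can be added up for a given graph. The sketch assumes 8 bytes per double and 1 byte per boolean and ignores the index overhead of a real sparse format; the function name `storage_bytes` and its parameters are illustrative.

```python
def storage_bytes(nonzero_links, dangling_pages, total_pages,
                  keep_previous_ranks=True,
                  bytes_per_double=8, bytes_per_boolean=1):
    """Rough storage estimate; sizes per element are assumptions."""
    hyperlink_matrix = nonzero_links * bytes_per_double
    dangling_vector = dangling_pages * bytes_per_boolean
    current_ranks = total_pages * bytes_per_double
    previous_ranks = total_pages * bytes_per_double if keep_previous_ranks else 0
    return hyperlink_matrix + dangling_vector + current_ranks + previous_ranks

# Six-page example: 7 links, 2 dangling pages, 6 pages.
example = storage_bytes(nonzero_links=7, dangling_pages=2, total_pages=6)
print(example)
```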
Implementing PageRank with MapReduce
18
Adjacency List
1 r1 2 3 4
2 r2 5
3 r3 5
4 r4 5 6
5 r5
6 r6
19
MapReduce job 3: page ranks from backward links
• map
  - input
    • key = p, value = ( r_p, p_1, p_2, ..., p_n )
  - output
    • key = p_i, value = r_p / n, for i = 1, 2, ..., n
    • key = p, value = ( p_1, p_2, ..., p_n )
• reduce
  - input
    • key = p, values = ( r_j / n_j ), ( p_1, p_2, ..., p_n )
  - output
    • key = p, value = ( r_p, p_1, p_2, ..., p_n )
$r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1 - d}{N}$
20
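Job 3 can be tried out without a cluster. The following is a sketch under stated assumptions: `run_job` is a made-up in-memory stand-in for a MapReduce run (map, group by key, reduce), not a Hadoop API. The reducer assumes the update r_p = d * sum(r_j / n_j) + r_d + (1 - d) / N, where r_d is the dangling contribution that job 2 would supply. Mapper outputs are tagged as "rank" or "links" so the reducer can tell rank shares from the graph structure.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical in-memory MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def make_job3(d, total_pages, dangling_term):
    def mapper(page, value):
        rank, out_links = value
        for target in out_links:
            yield target, ("rank", rank / len(out_links))  # share r_p / n
        yield page, ("links", out_links)  # pass the graph structure along

    def reducer(page, values):
        out_links, incoming = [], 0.0
        for kind, payload in values:
            if kind == "links":
                out_links = payload
            else:
                incoming += payload
        rank = d * incoming + dangling_term + (1.0 - d) / total_pages
        return page, (rank, out_links)

    return mapper, reducer

graph = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
N, d = len(graph), 0.85
records = [(page, (1.0 / N, links)) for page, links in graph.items()]
# Dangling contribution computed inline here; in the pipeline it comes from job 2.
r_d = d / N * sum(rank for _, (rank, links) in records if not links)
mapper, reducer = make_job3(d, N, r_d)
next_records = run_job(records, mapper, reducer)
```

The output records have the same shape as the input records, so the job can be chained with itself for the next iteration.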
MapReduce job 2: contribution from dangling pages
• map
  - input
    • key = ..., value = r_p (dangling page)
  - output
    • key = 1, value = r_p
• combine and reduce
  - input
    • key = 1, values = ( r_j )
  - output
    • key = ..., value = r_d (only one value)
$r_d = \frac{d}{N} \sum_j r_j$
N: total number of pages
21
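A small sketch of job 2, again using a made-up in-memory `run_job` helper rather than a real MapReduce API. It assumes the reducer computes r_d = (d / N) * sum of the ranks of the dangling pages. Only dangling pages emit a value, all under the same key, so the reduce step returns a single number.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical in-memory MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def dangling_mapper(page, value):
    rank, out_links = value
    if not out_links:  # only dangling pages contribute
        yield 1, rank

def make_dangling_reducer(d, total_pages):
    def reducer(key, ranks):
        return d / total_pages * sum(ranks)
    return reducer

graph = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
N, d = len(graph), 0.85
records = [(page, (1.0 / N, links)) for page, links in graph.items()]
results = run_job(records, dangling_mapper, make_dangling_reducer(d, N))
r_d = results[0] if results else 0.0  # no dangling pages means no contribution
```

In a real job a combiner could pre-sum the ranks on each node, but the d / N scaling should then be applied only once, in the final reduce.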
MapReduce job 1: total number of pages
• map
  - input
    • key = p, value = ...
  - output
    • key = 1, value = 1
• combine and reduce
  - input
    • key = 1, values = ( v_j )
  - output
    • key = ..., value = N
$N = \sum_j v_j$
22
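Job 1 is a plain counting job. The sketch below uses a made-up in-memory `run_job` helper, not a real MapReduce API: every page emits the pair (1, 1) and the single reducer adds the ones up.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical in-memory MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def count_mapper(page, value):
    yield 1, 1  # one unit per page, all under the same key

def count_reducer(key, values):
    return sum(values)

graph = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
records = list(graph.items())
N = run_job(records, count_mapper, count_reducer)[0]
```

Because addition is associative, the same summing function could also serve as the combiner.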
Adjacency List
1  r^{k+1}_1  r^k_1  2 3 4
2  r^{k+1}_2  r^k_2  5
3  r^{k+1}_3  r^k_3  5
4  r^{k+1}_4  r^k_4  5 6
5  r^{k+1}_5  r^k_5
6  r^{k+1}_6  r^k_6
23
MapReduce job 4: check for convergence
• map
  - input
    • key = p, value = ( r^{k+1}_p, r^k_p, p_1, p_2, ..., p_n )
  - output
    • key = 1, value = abs( r^{k+1}_p - r^k_p )
• combine and reduce
  - input
    • key = 1, values = ( v_j )
  - output
    • key = ..., value = ε (ε: tolerance)
$\varepsilon = \sum_j v_j$
24
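The convergence check can be sketched in the same way. The `run_job` helper is a made-up in-memory stand-in for a MapReduce run, and the two-page records with hand-picked ranks are invented for illustration. Each record carries the new rank, the previous rank, and the out-links; the job sums the absolute differences.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical in-memory MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def change_mapper(page, value):
    new_rank, old_rank, out_links = value
    yield 1, abs(new_rank - old_rank)

def change_reducer(key, values):
    return sum(values)

# Invented records: (page, (new rank, previous rank, out-links)).
records = [(1, (0.30, 0.25, [2])), (2, (0.70, 0.75, []))]
error = run_job(records, change_mapper, change_reducer)[0]
converged = error < 1e-6
```

The driver compares the single output value with the chosen tolerance to decide whether to stop iterating.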
Putting it all together
• job 1: total number of pages
• for at most n iterations, or until convergence
  - job 2: contribution from dangling pages
  - job 3: page ranks from backward links
  - every y iterations
    • job 4: check for convergence
• Total number of jobs <= 1 + 2n + n/y
25
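The control flow of the driver, and the bound on the number of jobs, can be sketched with plain Python functions standing in for the four jobs. This is an assumption-laden illustration: the function names, the default tolerance, and the choice to keep the previous ranks in a separate dictionary are not taken from the talk.

```python
def count_pages(graph):  # stands in for job 1
    return sum(1 for _ in graph)

def dangling_contribution(records, d, n):  # stands in for job 2
    return d / n * sum(rank for rank, links in records.values() if not links)

def update_ranks(records, d, n, r_d):  # stands in for job 3
    incoming = {page: 0.0 for page in records}
    for rank, links in records.values():
        for target in links:
            incoming[target] += rank / len(links)
    return {page: (d * incoming[page] + r_d + (1.0 - d) / n, links)
            for page, (rank, links) in records.items()}

def total_change(old, new):  # stands in for job 4
    return sum(abs(new[page][0] - old[page][0]) for page in old)

def run(graph, d=0.85, tolerance=1e-6, max_iterations=100, check_every=5):
    n = count_pages(graph)
    jobs = 1
    records = {page: (1.0 / n, links) for page, links in graph.items()}
    for iteration in range(1, max_iterations + 1):
        r_d = dangling_contribution(records, d, n)
        previous, records = records, update_ranks(records, d, n, r_d)
        jobs += 2
        if iteration % check_every == 0:  # check convergence only every y iterations
            jobs += 1
            if total_change(previous, records) < tolerance:
                break
    return records, iteration, jobs

graph = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
records, iterations, jobs = run(graph)
```

Checking convergence only every few iterations saves jobs, at the price of possibly running a few iterations more than strictly needed.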
Having Fun with PageRank
• Intelligent surfer
  - Change rows of the hyperlink matrix H, as long as they remain probability distributions
  - Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
26
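To illustrate the teleportation vector idea, the sketch below replaces the uniform random jump with a probability distribution over pages. It is an assumption-based example, not the intelligent surfer model from the references. The name `personalized_pagerank` is made up, and sending the rank of dangling pages through the same teleportation vector is a design choice of this sketch.

```python
def personalized_pagerank(links, teleport, d=0.85, iterations=100):
    """Power iteration where jumps follow `teleport` instead of 1/N."""
    pages = list(links)
    ranks = dict(teleport)  # start from the teleportation distribution
    for _ in range(iterations):
        dangling = sum(ranks[p] for p in pages if not links[p])
        # Jumps (bored surfer or dangling page) land according to `teleport`.
        new_ranks = {p: (d * dangling + 1.0 - d) * teleport[p] for p in pages}
        for p in pages:
            for target in links[p]:
                new_ranks[target] += d * ranks[p] / len(links[p])
        ranks = new_ranks
    return ranks

links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
uniform = {p: 1.0 / len(links) for p in links}
biased = {p: (1.0 if p == 1 else 0.0) for p in links}  # always jump to page 1

uniform_ranks = personalized_pagerank(links, uniform)
biased_ranks = personalized_pagerank(links, biased)
```

With the biased vector, page 1 and the pages reachable from it gain rank at the expense of the rest.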
Ranking Papers CiteSeer Dataset
27
CiteSeer  (1)  (2)  PageRank    Title (Year)
340126    yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100    no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351    no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057    yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071    no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289     yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631    no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445    no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544    no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879    yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874    no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436     no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail: Part II (1993)
219414    no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail: Part III (1993)
567230    no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336     no   no   0.00075584  Generalised Additive Models (1995)
524648    yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205     no   no   0.00073840  Congestion Avoidance and Control (1988)
35316     no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766     yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230    no   no   0.00067404  History of Circumscription (1993)
CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng
Dirty Data
• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h k p s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as a result)
28
a b c
b h k p
c
d a a c
e s
f f b
g
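The clean-up rules above can be sketched as a small preprocessing step. This is an illustration under assumptions: each input line is taken to be a page followed by its out-links, separated by whitespace, and the function name `clean_adjacency_list` is made up.

```python
def clean_adjacency_list(lines):
    """Drop duplicate links and self-references, add implicit dangling nodes."""
    graph = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        page, targets = fields[0], fields[1:]
        unique = []
        for target in targets:
            if target != page and target not in unique:
                unique.append(target)
        graph[page] = unique
    # Pages that are only ever linked to become explicit dangling nodes.
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, [])
    return graph

lines = ["a b c", "b h k p", "c", "d a a c", "e s", "f f b", "g"]
graph = clean_adjacency_list(lines)
```

After this step every page that appears anywhere in the data has its own record, so the rank it receives is not lost between iterations.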
Convergence in Practice
Rows: ranking positions. Columns: iterations. Cells: document ids.
Red: document should not be in the first 20 results. Yellow: document in the first 20 results, but in the wrong position. Green: document in the first 20 results, in the correct position.
[Slides 29 to 32: colour-coded ranking tables, one column per iteration; figures not available in this transcript.]
References
• "Google's PageRank and Beyond: The Science of Search Engine Rankings", Amy N. Langville and Carl D. Meyer, Princeton University Press (2006), ISBN 0-691-12202-4, http://press.princeton.edu/titles/8216.html
• "The anatomy of a large-scale hypertextual Web search engine", Sergey Brin and Lawrence Page, in Proc. of the Seventh International World Wide Web Conference (WWW 1998), http://ilpubs.stanford.edu:8090/361/
• "The PageRank Citation Ranking: Bringing Order to the Web", Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Technical Report, Stanford InfoLab (1999), http://ilpubs.stanford.edu:8090/422/
• "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank", Matthew Richardson and Pedro Domingos, in Proc. of Advances in Neural Information Processing Systems (2002), http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• "Ranking Scientific Publications Using a Simple Model of Network Traffic", Dylan Walker, Huafeng Xie, Koon-Kiu Yan and Sergei Maslov, Journal of Statistical Mechanics (2007), http://arxiv.org/abs/physics/0612122v1
33
Adjacency List
1  r_1  2 3 4
2  r_2  5
3  r_3  5
4  r_4  5 6
5  r_5
6  r_6
MapReduce job 3 – page ranks from backward links
• map
  – input
    • key = p    value = ( r_p, p_1, p_2, ..., p_n )
  – output
    • key = p_i    value = r_p / n    for i = 1, 2, ..., n
    • key = p    value = ( p_1, p_2, ..., p_n )
• reduce
  – input
    • key = p    values = ( r_j / n_j ), ( p_1, p_2, ..., p_n )
  – output
    • key = p    value = ( r_p, p_1, p_2, ..., p_n )

$$ r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1 - d}{N} $$
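A minimal in-memory sketch of job 3 in plain Python, not Hadoop code. The names `map_job3`, `reduce_job3` and `run_job3`, and the `"share"` / `"links"` tags used to tell the two kinds of map output apart, are assumptions made for illustration. The sketch also assumes clean data (every page has its own record) and that `r_d` is the dangling contribution from job 2, already scaled by d / N.

```python
from collections import defaultdict

def map_job3(page, value):
    """Emit a rank share to each forward link, plus the link list itself."""
    rank, links = value
    for target in links:
        # each of the n forward links receives an equal share r_p / n
        yield target, ("share", rank / len(links))
    # pass the graph structure through so the reducer can rebuild the record
    yield page, ("links", links)

def reduce_job3(page, values, d, r_d, n_pages):
    """Combine the incoming shares into the new rank of `page`."""
    links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            total += payload
    rank = d * total + r_d + (1.0 - d) / n_pages
    return page, (rank, links)

def run_job3(graph, d, r_d):
    """Simulate shuffle and sort with a dict; `graph` maps page -> (rank, links)."""
    grouped = defaultdict(list)
    for page, value in graph.items():
        for key, out in map_job3(page, value):
            grouped[key].append(out)
    n_pages = len(graph)
    return dict(reduce_job3(p, vs, d, r_d, n_pages) for p, vs in grouped.items())
```

With the six-page example graph and uniform starting ranks, the new ranks still sum to 1, which is a quick sanity check that no rank leaks through the dangling pages.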
MapReduce job 2 – contribution from dangling pages
• map
  – input
    • key =      value = r_p    (dangling page)
  – output
    • key = 1    value = r_p
• combine and reduce
  – input
    • key = 1    values = ( r_j )
  – output
    • key =      value = r_d    (only one value)

$$ r_d = \frac{d}{N} \sum_j r_j $$

N: total number of pages
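A small sketch of job 2 in plain Python, under stated assumptions: the slide feeds only dangling pages to the map, whereas this version (hypothetical names `map_job2`, `reduce_job2`, `dangling_contribution`) takes the whole graph and filters inside the map. The result is scaled by d / N so it can be added directly in job 3.

```python
def map_job2(page, value):
    """Emit the rank of a dangling page under the single key 1."""
    rank, links = value
    if not links:          # a dangling page has no forward links
        yield 1, rank

def reduce_job2(values, d, n_pages):
    """Sum the dangling ranks and scale by d / N; also usable as a combiner sum."""
    return d * sum(values) / n_pages

def dangling_contribution(graph, d):
    """`graph` maps page -> (rank, links); returns the single value r_d."""
    ranks = [r for page, value in graph.items() for _, r in map_job2(page, value)]
    return reduce_job2(ranks, d, len(graph))
```

Because every record is emitted under the same key, the job produces one value, which matches the "only one value" note on the slide.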
MapReduce job 1 – total number of pages
• map
  – input
    • key = p    value =
  – output
    • key = 1    value = 1
• combine and reduce
  – input
    • key = 1    values = ( v_j )
  – output
    • key =      value = N

$$ N = \sum_j v_j $$
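A sketch of why job 1 can use the same function as combiner and reducer: addition is associative and commutative, so partial sums per input split can be summed again at the end. The names `map_job1`, `combine_or_reduce`, `count_pages` and the `split_size` parameter are hypothetical, and input splits are simulated with `itertools.islice`.

```python
from itertools import islice

def map_job1(page, _value):
    """Emit a 1 for every page record, all under the same key."""
    yield 1, 1

def combine_or_reduce(values):
    """Sum of counts; valid both as combiner and as reducer."""
    return sum(values)

def count_pages(pages, split_size=2):
    """Count pages by combining per split, then reducing the partial sums."""
    it = iter(pages)
    partials = []
    while True:
        split = list(islice(it, split_size))
        if not split:
            break
        ones = [v for page in split for _, v in map_job1(page, None)]
        partials.append(combine_or_reduce(ones))   # combiner, once per split
    return combine_or_reduce(partials)             # final reduce gives N
```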
Adjacency List
1  r_1^{k+1}  r_1^k  2 3 4
2  r_2^{k+1}  r_2^k  5
3  r_3^{k+1}  r_3^k  5
4  r_4^{k+1}  r_4^k  5 6
5  r_5^{k+1}  r_5^k
6  r_6^{k+1}  r_6^k
MapReduce job 4 – check for convergence
• map
  – input
    • key = p    value = ( r_p^{k+1}, r_p^k, p_1, p_2, ..., p_n )
  – output
    • key = 1    value = abs( r_p^{k+1} - r_p^k )
• combine and reduce
  – input
    • key = 1    values = ( v_j )
  – output
    • key =      value = ε    (ε: tolerance)

$$ \varepsilon = \sum_j v_j $$
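A sketch of the convergence check in plain Python. The names `map_job4`, `reduce_job4` and `has_converged` are hypothetical, and comparing the summed difference against a caller-supplied tolerance is an assumption about how the job's single output value is used.

```python
def map_job4(page, value):
    """Emit the absolute rank change of one page under the single key 1."""
    new_rank, old_rank, _links = value
    yield 1, abs(new_rank - old_rank)

def reduce_job4(values):
    """Sum the per-page changes; also usable as a combiner."""
    return sum(values)

def has_converged(graph, tolerance):
    """`graph` maps page -> (new rank, old rank, links)."""
    residual = reduce_job4(
        v for page, value in graph.items() for _, v in map_job4(page, value)
    )
    return residual, residual < tolerance
```

The residual is the sum of absolute rank changes over all pages, so the tolerance should be chosen with the total number of pages in mind.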
Putting it all together
• job 1 – total number of pages
• for max n iterations or until convergence
  – job 2 – contribution from dangling pages
  – job 3 – page ranks from backward links
  – every y iterations
    • job 4 – check for convergence
• Total number of jobs ≤ 1 + 2n + n/y
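The loop above can be sketched as a single-process driver that also counts how many jobs it would have launched. This is an illustration under assumptions, not the talk's Hadoop code: the function name `pagerank`, its parameters (`max_iterations` for n, `check_every` for y, `tolerance`) and the uniform starting ranks are all choices made here, and each "job" is just an in-memory computation.

```python
def pagerank(links, d=0.85, max_iterations=50, check_every=5, tolerance=1e-6):
    """`links` maps page -> list of forward links (clean data assumed)."""
    jobs = 1                                   # job 1: total number of pages
    n_pages = len(links)
    ranks = {p: 1.0 / n_pages for p in links}
    k = 0
    for k in range(1, max_iterations + 1):
        # job 2: contribution from dangling pages, scaled by d / N
        r_d = d * sum(ranks[p] for p, out in links.items() if not out) / n_pages
        # job 3: page ranks from backward links
        incoming = {p: 0.0 for p in links}
        for p, out in links.items():
            for q in out:
                incoming[q] += ranks[p] / len(out)
        new_ranks = {p: d * incoming[p] + r_d + (1.0 - d) / n_pages for p in links}
        jobs += 2
        residual = None
        if k % check_every == 0:
            # job 4: sum of absolute rank changes, run only every y iterations
            residual = sum(abs(new_ranks[p] - ranks[p]) for p in links)
            jobs += 1
        ranks = new_ranks
        if residual is not None and residual < tolerance:
            break
    return ranks, k, jobs
```

After k iterations the driver has run 1 + 2k + floor(k / y) jobs, which stays within the bound on the slide because k never exceeds n.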
Having Fun with PageRank
• Intelligent surfer
  – Change rows of the hyperlink matrix H as long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes: evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
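One way to picture the teleportation vector is a single update step in which the uniform term (1 - d) / N is replaced by (1 - d) · v_p. This is a sketch under assumptions: the name `personalized_step` is hypothetical, `v` must be a probability distribution over the pages, and jumps out of dangling pages are kept uniform here, which is only one of several possible conventions.

```python
def personalized_step(links, ranks, v, d=0.85):
    """One PageRank update with teleportation vector `v` (page -> probability)."""
    n_pages = len(links)
    # dangling pages still jump uniformly in this sketch
    r_d = d * sum(ranks[p] for p, out in links.items() if not out) / n_pages
    incoming = {p: 0.0 for p in links}
    for p, out in links.items():
        for q in out:
            incoming[q] += ranks[p] / len(out)
    # (1 - d) * v[p] replaces the uniform (1 - d) / N teleportation term
    return {p: d * incoming[p] + r_d + (1.0 - d) * v[p] for p in links}
```

With a uniform `v` the step reduces to the ordinary update; concentrating `v` on a few pages shifts rank toward them while the ranks still sum to 1.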
Ranking Papers: CiteSeer Dataset

CiteSeer  (1)  (2)  PageRank    Title (Year)
340126    yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100    no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351    no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057    yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071    no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289     yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631    no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445    no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544    no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879    yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874    no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436     no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail: Part II (1993)
219414    no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail: Part III (1993)
567230    no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336     no   no   0.00075584  Generalised Additive Models (1995)
524648    yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205     no   no   0.00073840  Congestion Avoidance and Control (1988)
35316     no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766     yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230    no   no   0.00067404  History of Circumscription (1993)

CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng
Dirty Data
• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h k p s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as the result)

a b c
b h k p
c
d a a c
e s
f f b
g
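A cleaning pass over the adjacency list above might look like the following sketch. The function name `clean_adjacency` and the whitespace-separated line format (first token is the page, the rest are its forward links) are assumptions based on the example data.

```python
def clean_adjacency(lines):
    """Return page -> list of forward links with the dirt removed."""
    graph = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        page, targets = tokens[0], tokens[1:]
        links = graph.setdefault(page, [])
        for target in targets:
            # drop self-references and duplicate links
            if target != page and target not in links:
                links.append(target)
    # pages that only appear as link targets are implicit dangling nodes
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, [])
    return graph
```

On the example data this yields eleven pages, with h, k, p and s added as explicit dangling nodes next to c and g.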
Convergence in Practice
Rows: ranking positions. Columns: iterations. Cells: document ids.
• Red: document should not be in the first 20 results
• Yellow: document in the first 20 results, but in the wrong position
• Green: document in the first 20 results, in the correct position
References
• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”
  Amy N. Langville and Carl D. Meyer
  Princeton University Press (2006), ISBN 0-691-12202-4
  http://press.princeton.edu/titles/8216.html
• “The anatomy of a large-scale hypertextual Web search engine”
  Sergey Brin and Lawrence Page
  In Proc. of the Seventh International World Wide Web Conference (WWW 1998)
  http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”
  Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd
  Technical Report, Stanford InfoLab (1999)
  http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”
  Matthew Richardson and Pedro Domingos
  In Proc. of Advances in Neural Information Processing Systems (2002)
  http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”
  Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov
  Journal of Statistical Mechanics (2007)
  http://arxiv.org/abs/physics/0612122v1
Putting all together
• job 1: total number of pages
• for max n iterations or until convergence:
  – job 2: contribution from dangling pages
  – job 3: page ranks from backward links
  – every y iterations:
    • job 4: check for convergence
• Total number of jobs ≤ 1 + 2n + n/y
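The job sequence above can be sketched in plain Python, with one function standing in for each MapReduce job. This is a minimal in-memory sketch and not the Hadoop implementation from the talk: the function names, the dictionary-based graph, the L1 convergence test and the default parameters are all assumptions made for illustration. The example graph is the six-page adjacency list used earlier in the slides.

```python
# Each function stands in for one MapReduce job (hypothetical names).
# The graph maps every page to the list of pages it links to.

def count_pages(graph):
    """Job 1: total number of pages N."""
    return len(graph)

def dangling_contribution(graph, ranks):
    """Job 2: rank mass held by pages without forward links."""
    return sum(ranks[p] for p, links in graph.items() if not links)

def update_ranks(graph, ranks, dangling, n_pages, d):
    """Job 3: new ranks from backward links, dangling mass and teleportation."""
    base = d * dangling / n_pages + (1.0 - d) / n_pages
    new_ranks = {p: base for p in graph}
    for p, links in graph.items():
        for q in links:
            new_ranks[q] += d * ranks[p] / len(links)
    return new_ranks

def has_converged(old, new, tolerance):
    """Job 4: L1 distance between successive rank vectors (an assumed test)."""
    return sum(abs(new[p] - old[p]) for p in new) < tolerance

def pagerank(graph, d=0.85, max_iterations=50, check_every=5, tolerance=1e-8):
    jobs = 1                                   # job 1
    n_pages = count_pages(graph)
    ranks = {p: 1.0 / n_pages for p in graph}  # r_0 = 1/N
    for k in range(1, max_iterations + 1):
        dangling = dangling_contribution(graph, ranks)
        new_ranks = update_ranks(graph, ranks, dangling, n_pages, d)
        jobs += 2                              # jobs 2 and 3
        done = False
        if k % check_every == 0:               # every y iterations
            jobs += 1                          # job 4
            done = has_converged(ranks, new_ranks, tolerance)
        ranks = new_ranks
        if done:
            break
    return ranks, jobs

example_graph = {
    "1": ["2", "3", "4"],
    "2": ["5"],
    "3": ["5"],
    "4": ["5", "6"],
    "5": [],
    "6": [],
}
example_ranks, example_jobs = pagerank(example_graph)
```

Because the convergence check runs only every `check_every` iterations, the loop may run a few iterations past the point of convergence; that is the price of the fewer job 4 launches counted in the bound above.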
Having Fun with PageRank
• Intelligent surfer
  – Change rows of the hyperlink matrix H, as long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
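As an illustration of the teleportation (personalization) vector, the sketch below replaces the uniform 1/N jump with a caller-supplied probability distribution. It is a hedged example only: the function name, the choice to redistribute dangling mass according to the same vector, and the fixed iteration count are assumptions, not details taken from the talk.

```python
def personalized_pagerank(graph, teleport, d=0.85, iterations=50):
    """PageRank where jumps land on page p with probability teleport[p].

    `graph` maps every page to its forward links; `teleport` must have the
    same keys as `graph` and sum to 1 (assumed, not checked here).
    """
    ranks = dict(teleport)  # start from the teleportation distribution
    for _ in range(iterations):
        # Mass on dangling pages is redistributed via the teleport vector
        # (an assumption; a uniform redistribution is another common choice).
        dangling = sum(ranks[p] for p, links in graph.items() if not links)
        new_ranks = {p: (d * dangling + 1.0 - d) * teleport[p] for p in graph}
        for p, links in graph.items():
            for q in links:
                new_ranks[q] += d * ranks[p] / len(links)
        ranks = new_ranks
    return ranks

graph = {
    "1": ["2", "3", "4"],
    "2": ["5"],
    "3": ["5"],
    "4": ["5", "6"],
    "5": [],
    "6": [],
}
# A surfer who only ever teleports to page "5".
teleport = {"1": 0.0, "2": 0.0, "3": 0.0, "4": 0.0, "5": 1.0, "6": 0.0}
ranks = personalized_pagerank(graph, teleport)
```

With a uniform `teleport` vector (1/N for every page) this reduces to the ordinary teleportation case.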
Ranking Papers: CiteSeer Dataset
CiteSeer ID  (1)  (2)  PageRank    Title (Year)
340126       yes  yes  0.00157983  New Directions in Cryptography, Invited Paper (1976)
549100       no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351       no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057       yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071       no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289        yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631       no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445       no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544       no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879       yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874       no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436        no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail: Part II (1993)
219414       no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail: Part III (1993)
567230       no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336        no   no   0.00075584  Generalised Additive Models (1995)
524648       yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205        no   no   0.00073840  Congestion Avoidance and Control (1988)
35316        no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766        yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230       no   no   0.00067404  History of Circumscription (1993)
CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng
Dirty Data
• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as the result)
a b c
b h k p
c
d a a c
e s
f f b
g
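The cleaning rules above can be sketched as a small pre-processing step over adjacency-list lines like the ones on this slide. The function name and the in-memory dictionary are assumptions for illustration; in a MapReduce setting the same rules would be applied while parsing each input record.

```python
def clean_adjacency_list(lines):
    """Build a graph from 'page link link ...' lines, cleaning dirty data.

    - duplicate links are kept once (d a a c -> d: a, c)
    - self-references are dropped   (f f b   -> f: b)
    - pages that only appear as link targets become explicit dangling nodes
    """
    graph = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        page, targets = tokens[0], tokens[1:]
        links = graph.setdefault(page, [])
        for target in targets:
            if target != page and target not in links:
                links.append(target)
    # Add implicit dangling nodes (h, k, p, s in the slide's example).
    for links in list(graph.values()):
        for target in links:
            graph.setdefault(target, [])
    return graph

dirty_lines = [
    "a b c",
    "b h k p",
    "c",
    "d a a c",
    "e s",
    "f f b",
    "g",
]
clean_graph = clean_adjacency_list(dirty_lines)
```

After cleaning, every link target is also a key of the graph, so the rank mass of each page is either passed along its forward links or handled as dangling mass, which keeps the rank vector a probability distribution.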
Convergence in Practice
[Figures on four slides; legend below]
Rows: ranking positions. Columns: iterations. Cells: document ids.
• Red: document should not be in the first 20 results
• Yellow: document in the first 20 results, but in the wrong position
• Green: document in the first 20 results, in the correct position
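The colour coding in the legend can be reproduced for one column (one iteration) with a short function. This is a sketch under an assumption the slides do not state explicitly: the "correct" ranking is taken to be a reference ranking, for example the one obtained after the final iteration. The function and parameter names are hypothetical.

```python
def classify_top(ranking, reference, top=20):
    """Colour each of the first `top` positions of `ranking`.

    `ranking` and `reference` are lists of document ids, best first.
    `reference` is assumed to be the converged (correct) ranking.
    """
    ref_top = reference[:top]
    colours = []
    for position, doc in enumerate(ranking[:top]):
        if doc not in ref_top:
            colours.append("red")      # should not be in the first `top`
        elif position < len(ref_top) and ref_top[position] == doc:
            colours.append("green")    # right document, right position
        else:
            colours.append("yellow")   # right document, wrong position
    return colours
```

Applying it to the ranking produced after each iteration gives one column of the grid; a column that is entirely green means the first `top` positions no longer differ from the reference.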
References
• “Google's PageRank and Beyond: The Science of Search Engine Rankings”, Amy N. Langville and Carl D. Meyer, Princeton University Press (2006), ISBN 0-691-12202-4, http://press.princeton.edu/titles/8216.html
• “The anatomy of a large-scale hypertextual Web search engine”, Sergey Brin and Lawrence Page, in Proc. of the Seventh International World Wide Web Conference (WWW 1998), http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Technical Report, Stanford InfoLab (1999), http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”, Matthew Richardson and Pedro Domingos, in Proc. of Advances in Neural Information Processing Systems (2002), http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”, Dylan Walker, Huafeng Xie, Koon-Kiu Yan and Sergei Maslov, Journal of Statistical Mechanics (2007), http://arxiv.org/abs/physics/0612122v1