Having Fun with PageRank and MapReduce

Hadoop User Group (HUG) UK, 14th April 2009, http://huguk.org/

Paolo Castagna, HP Labs, Bristol, UK



PageRank

[Figure: a page p with backward links from pages p1, p2 and p3; their ranks r1, r2 and r3 each contribute a share to the rank of p.]

PageRank

$$r(p_i) = \sum_{p_j \in B_{p_i}} \frac{r(p_j)}{|p_j|}$$

$B_{p_i}$: backward links (i.e. links to $p_i$)

$|p_j|$: number of forward links (i.e. links from $p_j$)

recursive definition

PageRank

$$r_{k+1}(p_i) = \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|}, \qquad r_0(p_i) = \frac{1}{N}$$

$r_k(p_i)$: PageRank of page $p_i$ at iteration $k$

$N$: total number of pages

iterative computation
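A minimal sketch of this iterative computation in Python. The three-page graph and the function name `iterate_pagerank` are illustrative assumptions, not taken from the slides, and the sketch assumes every page has at least one forward link (dangling pages are not handled at this stage).

```python
def iterate_pagerank(links, iterations):
    """Repeat r_{k+1}(p_i) = sum of r_k(p_j) / |p_j| over the backward links of p_i."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}  # r_0(p_i) = 1/N
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for page, forward in links.items():
            share = ranks[page] / len(forward)  # r_k(p_j) / |p_j|
            for target in forward:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: a -> b, c; b -> c; c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = iterate_pagerank(links, 50)
print(ranks)
```

On this small graph the values settle near 0.4, 0.2 and 0.4 for a, b and c, and they keep summing to 1 because no page is dangling.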

Random Surfer

• A surfer follows links at random, indefinitely
• Time spent on a given page measures the importance of that page
• Problems
  - rank sinks (accumulate too much)
  - cycles (could cause periodicity)
• Dangling pages: jump to any other page
• Bored: teleportation (fixes rank sinks and eliminates cycles)

Dangling Pages

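$$r_{k+1}(p_i) = \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|} + \sum_{p_j : |p_j| = 0} \frac{r_k(p_j)}{N}$$

The second sum covers the pages where $|p_j|$ is zero: a random jump, independent from $p_i$.

$N$: total number of pages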

Teleportation

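$$r_{k+1}(p_i) = d \left( \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|} + \sum_{p_j : |p_j| = 0} \frac{r_k(p_j)}{N} \right) + \frac{1 - d}{N} \sum_{p_j} r_k(p_j)$$

The last term applies if there are loops or someone gets bored: a random jump, independent from $p_i$.

$d = 0.85$: damping factor

$$\sum_{p_j} r_k(p_j) = 1$$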

PageRank

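$$r_{k+1}(p_i) = d \left( \sum_{p_j \in B_{p_i}} \frac{r_k(p_j)}{|p_j|} + \sum_{p_j : |p_j| = 0} \frac{r_k(p_j)}{N} \right) + \frac{1 - d}{N}, \qquad r_0(p_i) = \frac{1}{N}$$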

Adjacency Matrix

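Example graph with six pages, 1 to 6 (row i, column j is 1 if page i links to page j):

$$\begin{pmatrix}
0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$$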

Hyperlink Matrix

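$$H = \begin{pmatrix}
0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$$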
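One way to read the hyperlink matrix is as the adjacency matrix with each row divided by that page's number of forward links. The sketch below illustrates this for the six-page example; the function name `hyperlink_matrix` is an assumption made for illustration, and exact fractions are used only to keep the output readable.

```python
from fractions import Fraction


def hyperlink_matrix(adjacency):
    """Divide each row by its number of forward links; all-zero rows stay zero."""
    matrix = []
    for row in adjacency:
        forward = sum(row)  # |p_j|, the number of forward links of this page
        matrix.append([Fraction(x, forward) if forward else Fraction(0) for x in row])
    return matrix


# Adjacency matrix of the six-page example graph
adjacency = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
H = hyperlink_matrix(adjacency)
for row in H:
    print(" ".join(str(x) for x in row))
```

Rows 5 and 6 stay all zero because those pages are dangling, which is why H on its own is not a stochastic matrix.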

Sparse Matrices


Adjacency List

1 2 3 4
2 5
3 5
4 5 6
5
6

better for sparse matrices
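The adjacency list maps naturally onto a dictionary from each page to its forward links. The following is a sketch, under that assumption, of the PageRank update with the dangling-page and teleportation terms applied to the six-page example; the function name `pagerank` and the fixed iteration count are illustrative choices, not the author's implementation.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate the update with link, dangling-page and teleportation terms."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}  # r_0(p_i) = 1/N
    for _ in range(iterations):
        # Rank held by dangling pages is spread evenly over all pages.
        dangling = sum(ranks[page] for page, forward in links.items() if not forward)
        base = d * dangling / n + (1.0 - d) / n
        new_ranks = {page: base for page in links}
        for page, forward in links.items():
            for target in forward:
                new_ranks[target] += d * ranks[page] / len(forward)
        ranks = new_ranks
    return ranks


# Adjacency list of the six-page example graph
links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Only the non-empty link lists are traversed in each iteration, which is the practical advantage of the adjacency list over a dense matrix.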

PageRank

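$$r_{k+1}^T = d \, r_k^T H + \frac{d}{N} \, r_k^T a \, e^T + \frac{1 - d}{N} \, e^T$$

For the six-page example:

$$\begin{pmatrix} r_1 & r_2 & r_3 & r_4 & r_5 & r_6 \end{pmatrix}_{k+1}
= d \begin{pmatrix} r_1 & r_2 & r_3 & r_4 & r_5 & r_6 \end{pmatrix}_k
\begin{pmatrix}
0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
+ \frac{d}{N} \begin{pmatrix} r_1 & r_2 & r_3 & r_4 & r_5 & r_6 \end{pmatrix}_k
\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}
+ \frac{1 - d}{N} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$$

$a$: dangling node vector

$e^T$: vector of all 1s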
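As a worked check of the matrix form, here is a sketch of a single step starting from the uniform vector, written with plain Python lists. The helper name `power_step` and the use of exact fractions are assumptions for illustration; a real implementation would use a sparse matrix library.

```python
from fractions import Fraction


def power_step(r, H, a, d):
    """One step of r_{k+1}^T = d r_k^T H + (d/N) r_k^T a e^T + ((1-d)/N) e^T."""
    n = len(r)
    dangling_mass = sum(ri * ai for ri, ai in zip(r, a))  # scalar r_k^T a
    return [
        d * sum(r[i] * H[i][j] for i in range(n))  # column j of d r_k^T H
        + d * dangling_mass / n                    # (d/N) r_k^T a e^T
        + (1 - d) / n                              # ((1-d)/N) e^T
        for j in range(n)
    ]


third, half = Fraction(1, 3), Fraction(1, 2)
H = [
    [0, third, third, third, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, half, half],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
a = [0, 0, 0, 0, 1, 1]   # pages 5 and 6 are dangling
d = Fraction(85, 100)    # damping factor 0.85
r0 = [Fraction(1, 6)] * 6
r1 = power_step(r0, H, a, d)
print([str(x) for x in r1])
```

After one step the vector is (13/180, 43/360, 43/360, 43/360, 307/720, 103/720), which still sums to 1.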

Convergence

How many iterations? How to check convergence?

$$\text{iterations} \approx \frac{-n}{\log_{10} d}$$

$n$: number of significant digits

$\varepsilon = 10^{-n}$: tolerance

$$|r_{k+1}(p_i) - r_k(p_i)| < \varepsilon$$

Convergence

significant digits   iterations
 1                    15
 2                    29
 3                    43
 4                    57
 5                    71
 6                    86
 7                   100
 8                   114
 9                   128
10                   142
11                   156
12                   171
13                   185
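The iteration counts in this table are consistent with rounding $-n / \log_{10} d$ up to the next integer for $d = 0.85$. A short sketch that recomputes them (the function name `iterations_needed` is an assumption for illustration):

```python
import math


def iterations_needed(digits, d=0.85):
    """Estimate the iterations for `digits` significant digits: ceil(-n / log10(d))."""
    return math.ceil(-digits / math.log10(d))


table = {n: iterations_needed(n) for n in range(1, 14)}
for n, iterations in table.items():
    print(n, iterations)
```

A damping factor closer to 1 makes $\log_{10} d$ smaller in magnitude, so the same precision needs more iterations.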

Power Method

• Slow to converge
• Each iteration: complexity O(N)
• Overall complexity: iterations × O(N) = O(N)
• Minimal storage: H is a sparse matrix, no completely dense matrices need to be stored

Storage Requirements

• Sparse hyperlink matrix H
  - number of non-zero elements (each a double)
• Sparse binary dangling node vector
  - number of dangling nodes (each a boolean)
• PageRank values for the current iteration
  - N elements (each a double)
• (optional) PageRank values for the previous iteration, to measure the tolerance error
  - N elements (each a double)

Implementing PageRank with MapReduce

Adjacency List

1 r1 2 3 4
2 r2 5
3 r3 5
4 r4 5 6
5 r5
6 r6

MapReduce job 3: page ranks from backward links

• map
  - input
    • key = p, value = ( r_p, p_1, p_2, ..., p_n )
  - output
    • key = p_i, value = r_p / n, for i = ( 1, 2, ..., n )
    • key = p, value = ( p_1, p_2, ..., p_n )
• reduce
  - input
    • key = p, values = ( r_j / n_j ), ( p_1, p_2, ..., p_n )
  - output
    • key = p, value = ( r_p, p_1, p_2, ..., p_n )

$$r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1 - d}{N}$$
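The map and reduce steps of job 3 can be simulated in memory with ordinary Python functions. This is a sketch under stated assumptions rather than Hadoop code: the names `map_job3`, `reduce_job3` and `run_job3` are hypothetical, the shuffle phase is replaced by a dictionary of lists, and $r_d$ and $N$ are passed in as if jobs 1 and 2 had already produced them.

```python
from collections import defaultdict


def map_job3(page, value):
    rank, links = value
    for target in links:
        yield target, rank / len(links)  # contribution r_p / n to each forward link
    yield page, list(links)              # pass the link structure through


def reduce_job3(page, values, d, n_pages, r_d):
    links, total = [], 0.0
    for value in values:
        if isinstance(value, list):
            links = value                # the page's own forward links
        else:
            total += value               # a contribution r_j / n_j
    return page, (d * total + r_d + (1.0 - d) / n_pages, links)


def run_job3(records, d, r_d):
    groups = defaultdict(list)           # stands in for the shuffle phase
    for page, value in records.items():
        for key, out in map_job3(page, value):
            groups[key].append(out)
    return dict(reduce_job3(p, vs, d, len(records), r_d) for p, vs in groups.items())


links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
records = {page: (1.0 / 6, forward) for page, forward in links.items()}
d = 0.85
r_d = d * (1.0 / 6 + 1.0 / 6) / 6       # dangling contribution of pages 5 and 6
result = run_job3(records, d, r_d)
print(result[5])
```

Emitting the link list alongside the contributions is what lets the reducer output a record with the same shape as its input, so the job can be run again for the next iteration.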

MapReduce job 2: contribution from dangling pages

• map
  - input
    • key = ..., value = r_p (dangling page)
  - output
    • key = 1, value = r_p
• combine and reduce
  - input
    • key = 1, values = ( r_j )
  - output
    • key = ..., value = r_d (only one value)

$$r_d = \frac{d}{N} \sum_j r_j$$

$N$: total number of pages
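A sketch of job 2 in the same in-memory style. The function names `map_job2` and `reduce_job2` are hypothetical, and the record layout (rank plus list of forward links) is an assumption carried over from the adjacency list with ranks.

```python
def map_job2(page, value):
    rank, links = value
    if not links:        # dangling page: no forward links
        yield 1, rank    # a single shared key, so one reducer sees every value


def reduce_job2(values, d, n_pages):
    return d * sum(values) / n_pages  # r_d = (d / N) * sum of dangling ranks


links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
records = {page: (1.0 / 6, forward) for page, forward in links.items()}
emitted = [rank for page, value in records.items() for _, rank in map_job2(page, value)]
r_d = reduce_job2(emitted, d=0.85, n_pages=len(records))
print(r_d)
```

Because the reduction is a plain sum, the same function can serve as a combiner on each map task before the final reduce.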

MapReduce job 1: total number of pages

• map
  - input
    • key = p, value = ...
  - output
    • key = 1, value = 1
• combine and reduce
  - input
    • key = 1, values = ( v_j )
  - output
    • key = ..., value = N

$$N = \sum_j v_j$$
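Job 1 is a plain count. In the sketch below, the split into two partial sums stands in for combiners running on two hypothetical map tasks; the function names are assumptions for illustration.

```python
def map_job1(page, value):
    yield 1, 1  # one unit per page, all under the same key


def combine_job1(values):
    return sum(values)  # used both as combiner and as reducer


links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
emitted = [v for page, value in links.items() for _, v in map_job1(page, value)]

# Two hypothetical map tasks, each pre-aggregated by the combiner.
partials = [combine_job1(emitted[:3]), combine_job1(emitted[3:])]
N = combine_job1(partials)
print(N)
```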

Adjacency List

1 r_1^{k+1} r_1^{k} 2 3 4
2 r_2^{k+1} r_2^{k} 5
3 r_3^{k+1} r_3^{k} 5
4 r_4^{k+1} r_4^{k} 5 6
5 r_5^{k+1} r_5^{k}
6 r_6^{k+1} r_6^{k}

MapReduce job 4: check for convergence

• map
  - input
    • key = p, value = ( r_p^{k+1}, r_p^{k}, p_1, p_2, ..., p_n )
  - output
    • key = 1, value = abs( r_p^{k+1} - r_p^{k} )
• combine and reduce
  - input
    • key = 1, values = ( v_j )
  - output
    • key = ..., value = ε

$$\varepsilon = \sum_j v_j$$

ε: tolerance
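A sketch of job 4, assuming each record carries the new rank, the previous rank and the forward links. The sample ranks and the tolerance value are made up for illustration, and the function names are hypothetical.

```python
def map_job4(page, value):
    new_rank, old_rank, links = value
    yield 1, abs(new_rank - old_rank)


def reduce_job4(values):
    return sum(values)  # total absolute change across all pages


# Hypothetical records: page -> (new rank, previous rank, forward links)
records = {
    "a": (0.10, 0.12, ["b"]),
    "b": (0.90, 0.88, ["a"]),
}
diffs = [v for page, value in records.items() for _, v in map_job4(page, value)]
total_change = reduce_job4(diffs)
tolerance = 1e-3  # assumed target, i.e. three significant digits
converged = total_change < tolerance
print(total_change, converged)
```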

Putting all together

• job 1: total number of pages
• for max n iterations or until convergence
  - job 2: contribution from dangling pages
  - job 3: page ranks from backward links
  - every y iterations
    • job 4: check for convergence
• Total number of jobs <= 1 + 2n + n/y
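The bound on the number of jobs can be checked by walking through the driver loop and counting job launches. This sketch does not run any jobs; `count_jobs` and its `converged_at` parameter (the first iteration at which the check would succeed) are hypothetical names used only for the count.

```python
def count_jobs(max_iterations, check_every, converged_at=None):
    """Count the MapReduce jobs launched by the driver loop."""
    jobs = 1                                   # job 1: total number of pages
    for k in range(1, max_iterations + 1):
        jobs += 2                              # job 2 and job 3
        if k % check_every == 0:
            jobs += 1                          # job 4: check for convergence
            if converged_at is not None and k >= converged_at:
                break                          # stop as soon as a check succeeds
    return jobs


n, y = 10, 5
print(count_jobs(n, y))                  # no early convergence: 1 + 2n + n/y
print(count_jobs(n, y, converged_at=3))  # convergence detected at the first check
```

With n = 10 and y = 5 the loop launches 23 jobs when it never converges, matching 1 + 2n + n/y, and 12 jobs when the first check at iteration 5 succeeds.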

Having Fun with PageRank

• Intelligent surfer
  - Change rows of the hyperlink matrix H so long as they remain probability distributions
  - Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multi-graphs (aka RDF)

Ranking Papers: CiteSeer Dataset

CiteSeer  (1)  (2)  PageRank    Title (Year)
340126    yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100    no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351    no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057    yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071    no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289     yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631    no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445    no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544    no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879    yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874    no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436     no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail Part II (1993)
219414    no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail Part III (1993)
567230    no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336     no   no   0.00075584  Generalised Additive Models (1995)
524648    yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205     no   no   0.00073840  Congestion Avoidance and Control (1988)
35316     no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766     yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230    no   no   0.00067404  History of Circumscription (1993)

CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng

Dirty Data

• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h k p s)
• If the data is dirty, the Google matrix will not be stochastic, and a unique solution as well as convergence are not guaranteed (with a sufficiently high number of iterations you might get ∞ as a result)

a b c
b h k p
c
d a a c
e s
f f b
g
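A sketch of how such an adjacency list could be cleaned before running PageRank, assuming each line is a page followed by its forward links separated by whitespace. The function name `clean_graph` is hypothetical.

```python
def clean_graph(lines):
    """Drop duplicate links and self-references, then add implicit dangling nodes."""
    graph = {}
    for line in lines:
        page, *targets = line.split()
        kept = []
        for target in targets:
            if target != page and target not in kept:  # skip self-references and repeats
                kept.append(target)
        graph[page] = kept
    # Pages that are only ever linked to become explicit dangling nodes.
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, [])
    return graph


raw = ["a b c", "b h k p", "c", "d a a c", "e s", "f f b", "g"]
graph = clean_graph(raw)
for page in sorted(graph):
    print(page, graph[page])
```

After cleaning, every page mentioned anywhere has its own entry, so N counts the implicit dangling nodes h, k, p and s as well.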

Convergence in Practice

[Figures: ranking positions across iterations. Rows: ranking positions. Columns: iterations. Cells: document ids.]

Red: document should not be in the first 20 results
Yellow: document in the first 20 results but wrong position
Green: document in the first 20 results, correct position

References

• "Google's PageRank and Beyond: The Science of Search Engine Rankings", Amy N. Langville and Carl D. Meyer, Princeton University Press (2006), ISBN 0-691-12202-4, http://press.princeton.edu/titles/8216.html
• "The anatomy of a large-scale hypertextual Web search engine", Sergey Brin and Lawrence Page, in Proc. of the Seventh International World Wide Web Conference (WWW 1998), http://ilpubs.stanford.edu:8090/361/
• "The PageRank Citation Ranking: Bringing Order to the Web", Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Technical Report, Stanford InfoLab (1999), http://ilpubs.stanford.edu:8090/422/
• "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank", Matthew Richardson and Pedro Domingos, in Proc. of Advances in Neural Information Processing Systems (2002), http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• "Ranking Scientific Publications Using a Simple Model of Network Traffic", Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov, Journal of Statistical Mechanics (2007), http://arxiv.org/abs/physics/0612122v1



Page 4: Having Fun with PageRank and MapReducestatic.last.fm/.../paolo_castagna-pagerank.pdf · 2020-04-03 · Having Fun with PageRank •Intelligent surfer –Change rows of the hyperlink

PageRank

N

pr

p

prpr

i

Bp j

jk

ik

ipj

10

1

rk(pi) pagerank of page pi at k iteration

N total number of pages

iterative computation

4

Random Surfer

bull A surfer follows links at random indefinitely

bull Time spent on a given page measure the importance of that page

bull Problems

ndash rank sinks (accumulate too much)

ndash cycles (could cause periodicity)

bull Dangling pages Jump to any other page

bull Bored Teleportation (fixes rank sinks and eliminates cycles)

5

Dangling Pages

00

1

jp

j

jpipj p

jk

Bp j

jk

ikN

pr

p

prpr

if |pj| is zero

N total number of pages

random jump

independent from pi

6

Teleportation

j

jp

j

jpipj p

jk

p

jk

Bp j

jk

ikN

prd

N

pr

p

prdpr 1

00

1

if there are loops or someone gets bored

d=085 dumping factor

random jump

independent from pi

jp

jk pr 1

7

PageRank

N

pr

N

d

N

pr

p

prdpr

i

p

jk

Bp j

jk

ik

jp

j

jpipj

1

1

0

1

00

8

Adjacency Matrix

000000

000000

110000

010000

010000

001110

1

2

3

4

5

6

9

Hyperlink Matrix

1

2

3

4

5

6

000000

000000

0000

010000

010000

000

21

21

31

31

31

H

10

Sparse Matrices

11

Adjacency List

1

2

3

4

5

6

better for sparse matrices

1 2 3 4

2 5

3 5

4 5 6

5

6

12

PageRank

TTT

k6

5

4

3

2

1

21

21

31

31

31

T

k6

5

4

3

2

1

T

16

5

4

3

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

0

0

0

000000

000000

0000

010000

010000

000

N

d

r

r

r

r

r

r

N

d

r

r

r

r

r

r

d

r

r

r

r

r

r

k

TTTTT1

1eaerHrr

N

d

N

dd kkk

a dangling node vector

eT vector of all 113

Convergence

How many iterations How to check convergence

d

n

10log

n number of significant digits

ε = 10-n tolerance

ikik prpr 1

14

Convergence

significant digits iterations1 15

2 29

3 43

4 57

5 71

6 86

7 100

8 114

9 128

10 142

11 156

12 171

13 185

15

Power Method

bull Slow to converge

bull Each iteration complexity O(N)

bull Overall complexity iterations O(N) = O(N)

bull Minimal storage H sparse matrix no completely dense matrices need to be stored

16

Storage Requirements

bull Sparse hyperlink matrix H

ndash number of non zero elements (each a double)

bull Sparse binary dangling node vector

ndash number of dangling nodes (each a boolean)

bull PageRank values for the current iteration

ndash N elements (each a double)

bull (optional) PageRank values for the previous iteration to measure tolerance error

ndash N elements (each a double)

17

Implementing PageRankwith MapReduce

18

Adjacency List

1 r1 2 3 4

2 r2 5

3 r3 5

4 r4 5 6

5 r5

6 r6

19

MapReducejob 3 ndash page ranks from backward links

bull mapndash input

bull key = p value = ( rp p1 p2 pn )

ndash outputbull key = pi value = rp n i = ( 1 2 n )

bull key = p value = ( p1 p2 pn )

bull reducendash input

bull key = p values = ( rj nj ) ( p1 p2 pn )

ndash outputbull key = p value = ( rp p1 p2 pn )

N

dr

n

rdr d

j j

j

p

1

20

MapReducejob 2 ndash contribution from dangling pages

bull map

ndash input

bull key = value = rp dangling page

ndash output

bull key = 1 value = rp N total number of pages

bull combine and reduce

ndash input

bull key = 1 values = ( rj )

ndash output

bull key = value = rd only one value

j

jd rN

dr

21

MapReducejob 1 ndash total number of pages

bull map

ndash input

bull key = p value =

ndash output

bull key = 1 value = 1

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = N

j

jvN

22

Adjacency List

1 rk+11 rk

1 2 3 4

2 rk+12 rk

2 5

3 rk+13 rk

3 5

4 rk+14 rk

4 5 6

5 rk+15 rk

5

6 rk+16 rk

6

23

MapReducejob 4 ndash check for convergence

bull map

ndash input

bull key = p value = ( rk+1p rk

p p1 p2 pn )

ndash output

bull key = 1 value = abs ( rk+1p - rk

p )

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = ε ε tolerance

j

jv

24

Putting all together

bull job 1 ndash total number of pages

bull for max n iterations or until convergence

ndash job 2 ndash contribution from dangling pages

ndash job 3 ndash page ranks from backward links

ndash every y iterations

bull job 4 ndash check for convergence

bull Total number of jobs lt= 1 + 2n + ny

25

Having Fun with PageRank

bull Intelligent surferndash Change rows of the hyperlink matrix H so long they

remain probability distributionsndash Teleportation vector (aka personalization vector)

instead of random jumps

bull CiteRank to rank papers (using a time dependant decay factor to shape probability distributions of the hyperlink matrix)

bull Social networks bull Ranking schemes evaluation and comparison

techniques (without involving humans)bull Ranking schemes for directed labelled multi-

graphs (aka RDF)

26

Ranking Papers: CiteSeer Dataset

| CiteSeer | (1) | (2) | PageRank | Title (Year) |
|---|---|---|---|---|
| 340126 | yes | yes | 0.00157983 | New Directions in Cryptography Invited Paper (1976) |
| 549100 | no | no | 0.00121952 | Structure and Complexity of Relational Queries (1982) |
| 548351 | no | no | 0.00120267 | Computable Queries for Relational Data Bases (1980) |
| 527057 | yes | yes | 0.00114733 | Optimization by Simulated Annealing (1983) |
| 516071 | no | no | 0.00112389 | Probabilistic Methods in Combinatorics (1974) |
| 28289 | yes | no | 0.00108669 | A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978) |
| 552631 | no | no | 0.00107492 | Fast Anisotropic Gauss Filtering (2001) |
| 328445 | no | yes | 0.00101743 | Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973) |
| 239544 | no | no | 0.00096108 | Discrepancy in Arithmetic Progressions (1996) |
| 148879 | yes | no | 0.00094061 | Yacc: Yet Another Compiler-Compiler (1975) |
| 311874 | no | yes | 0.00091705 | Graph-Based Algorithms for Boolean Function Manipulation (1986) |
| 93436 | no | no | 0.00090491 | Privacy Enhancement for Internet Electronic Mail: Part II (1993) |
| 219414 | no | no | 0.00084181 | Privacy Enhancement for Internet Electronic Mail: Part III (1993) |
| 567230 | no | no | 0.00080948 | A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986) |
| 20336 | no | no | 0.00075584 | Generalised Additive Models (1995) |
| 524648 | yes | no | 0.00074717 | Implementing Remote Procedure Calls (1984) |
| 15205 | no | no | 0.00073840 | Congestion Avoidance and Control (1988) |
| 35316 | no | no | 0.00069750 | Relational Queries Computable in Polynomial Time (1986) |
| 76766 | yes | no | 0.00068785 | The UNIX Time-Sharing (1974) |
| 351230 | no | no | 0.00067404 | History of Circumscription (1993) |

CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng

Dirty Data

• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as a result)

Example adjacency list:

a b c
b h k p
c
d a a c
e s
f f b
g
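The cleaning rules above can be sketched as a small preprocessing function, applied here to the example adjacency list from the slide. The function name `clean_graph` and the dictionary representation are assumptions for illustration; a real pipeline would do the same work as a preliminary MapReduce job.

```python
def clean_graph(adjacency):
    """Return a cleaned copy of a page -> forward-links mapping.

    - duplicate links are dropped (first occurrence kept)
    - self-references are dropped
    - pages that only ever appear as link targets are added as explicit
      dangling nodes with an empty list of links
    """
    cleaned = {}
    for page, links in adjacency.items():
        kept = []
        for target in links:
            if target != page and target not in kept:
                kept.append(target)
        cleaned[page] = kept
    # Make implicit dangling nodes explicit.
    for links in list(cleaned.values()):
        for target in links:
            cleaned.setdefault(target, [])
    return cleaned


raw = {
    "a": ["b", "c"],
    "b": ["h", "k", "p"],
    "c": [],
    "d": ["a", "a", "c"],
    "e": ["s"],
    "f": ["f", "b"],
    "g": [],
}
for page, links in sorted(clean_graph(raw).items()):
    print(page, links)
```

After cleaning, every page is a key and every row of the hyperlink matrix is either empty (a dangling node) or a valid probability distribution once each entry is divided by the number of forward links.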

Convergence in Practice

[Four figure slides. Rows: ranking positions. Columns: iterations. Cells: document ids.]

• Red: document should not be in the first 20 results
• Yellow: document in the first 20 results, but in the wrong position
• Green: document in the first 20 results, in the correct position

References

• "Google's PageRank and Beyond: The Science of Search Engine Rankings". Amy N. Langville and Carl D. Meyer. Princeton University Press (2006). ISBN 0-691-12202-4. http://press.princeton.edu/titles/8216.html
• "The Anatomy of a Large-Scale Hypertextual Web Search Engine". Sergey Brin and Lawrence Page. In Proc. of the Seventh International World Wide Web Conference (WWW 1998). http://ilpubs.stanford.edu:8090/361/
• "The PageRank Citation Ranking: Bringing Order to the Web". Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd. Technical Report, Stanford InfoLab (1999). http://ilpubs.stanford.edu:8090/422/
• "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank". Matthew Richardson and Pedro Domingos. In Proc. of Advances in Neural Information Processing Systems (2002). http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• "Ranking Scientific Publications Using a Simple Model of Network Traffic". Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov. Journal of Statistical Mechanics (2007). http://arxiv.org/abs/physics/0612122v1

Page 8: Having Fun with PageRank and MapReducestatic.last.fm/.../paolo_castagna-pagerank.pdf · 2020-04-03 · Having Fun with PageRank •Intelligent surfer –Change rows of the hyperlink

PageRank

N

pr

N

d

N

pr

p

prdpr

i

p

jk

Bp j

jk

ik

jp

j

jpipj

1

1

0

1

00

8

Adjacency Matrix

000000

000000

110000

010000

010000

001110

1

2

3

4

5

6

9

Hyperlink Matrix

1

2

3

4

5

6

000000

000000

0000

010000

010000

000

21

21

31

31

31

H

10

Sparse Matrices

11

Adjacency List

1

2

3

4

5

6

better for sparse matrices

1 2 3 4

2 5

3 5

4 5 6

5

6

12

PageRank

TTT

k6

5

4

3

2

1

21

21

31

31

31

T

k6

5

4

3

2

1

T

16

5

4

3

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

0

0

0

000000

000000

0000

010000

010000

000

N

d

r

r

r

r

r

r

N

d

r

r

r

r

r

r

d

r

r

r

r

r

r

k

TTTTT1

1eaerHrr

N

d

N

dd kkk

a dangling node vector

eT vector of all 113

Convergence

How many iterations How to check convergence

d

n

10log

n number of significant digits

ε = 10-n tolerance

ikik prpr 1

14

Convergence

significant digits iterations1 15

2 29

3 43

4 57

5 71

6 86

7 100

8 114

9 128

10 142

11 156

12 171

13 185

15

Power Method

bull Slow to converge

bull Each iteration complexity O(N)

bull Overall complexity iterations O(N) = O(N)

bull Minimal storage H sparse matrix no completely dense matrices need to be stored

16

Storage Requirements

bull Sparse hyperlink matrix H

ndash number of non zero elements (each a double)

bull Sparse binary dangling node vector

ndash number of dangling nodes (each a boolean)

bull PageRank values for the current iteration

ndash N elements (each a double)

bull (optional) PageRank values for the previous iteration to measure tolerance error

ndash N elements (each a double)

17

Implementing PageRankwith MapReduce

18

Adjacency List

1 r1 2 3 4

2 r2 5

3 r3 5

4 r4 5 6

5 r5

6 r6

19

MapReducejob 3 ndash page ranks from backward links

bull mapndash input

bull key = p value = ( rp p1 p2 pn )

ndash outputbull key = pi value = rp n i = ( 1 2 n )

bull key = p value = ( p1 p2 pn )

bull reducendash input

bull key = p values = ( rj nj ) ( p1 p2 pn )

ndash outputbull key = p value = ( rp p1 p2 pn )

N

dr

n

rdr d

j j

j

p

1

20

MapReducejob 2 ndash contribution from dangling pages

bull map

ndash input

bull key = value = rp dangling page

ndash output

bull key = 1 value = rp N total number of pages

bull combine and reduce

ndash input

bull key = 1 values = ( rj )

ndash output

bull key = value = rd only one value

j

jd rN

dr

21

MapReducejob 1 ndash total number of pages

bull map

ndash input

bull key = p value =

ndash output

bull key = 1 value = 1

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = N

j

jvN

22

Adjacency List

1 rk+11 rk

1 2 3 4

2 rk+12 rk

2 5

3 rk+13 rk

3 5

4 rk+14 rk

4 5 6

5 rk+15 rk

5

6 rk+16 rk

6

23

MapReducejob 4 ndash check for convergence

bull map

ndash input

bull key = p value = ( rk+1p rk

p p1 p2 pn )

ndash output

bull key = 1 value = abs ( rk+1p - rk

p )

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = ε ε tolerance

j

jv

24

Putting all together

bull job 1 ndash total number of pages

bull for max n iterations or until convergence

ndash job 2 ndash contribution from dangling pages

ndash job 3 ndash page ranks from backward links

ndash every y iterations

bull job 4 ndash check for convergence

bull Total number of jobs lt= 1 + 2n + ny

25

Having Fun with PageRank

bull Intelligent surferndash Change rows of the hyperlink matrix H so long they

remain probability distributionsndash Teleportation vector (aka personalization vector)

instead of random jumps

bull CiteRank to rank papers (using a time dependant decay factor to shape probability distributions of the hyperlink matrix)

bull Social networks bull Ranking schemes evaluation and comparison

techniques (without involving humans)bull Ranking schemes for directed labelled multi-

graphs (aka RDF)

26

Ranking Papers: CiteSeer Dataset

CiteSeer  (1)  (2)  PageRank    Title (Year)
340126    yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100    no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351    no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057    yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071    no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289     yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631    no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445    no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544    no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879    yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874    no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436     no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail Part II (1993)
219414    no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail Part III (1993)
567230    no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336     no   no   0.00075584  Generalised Additive Models (1995)
524648    yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205     no   no   0.00073840  Congestion Avoidance and Control (1988)
35316     no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766     yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230    no   no   0.00067404  History of Circumscription (1993)

CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng

Dirty Data

• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h k p s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as the result)

a b c
b h k p
c
d a a c
e s
f f b
g
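The clean-up rules can be sketched as a small preprocessing step over the example lines. It assumes each line is "source target target …" and that each source appears on one line only. `clean_adjacency` is a hypothetical helper name.

```python
RAW = """a b c
b h k p
c
d a a c
e s
f f b
g"""


def clean_adjacency(lines):
    """Drop duplicate links and self-references, then add implicit dangling nodes."""
    graph = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        source, targets = tokens[0], tokens[1:]
        # dict.fromkeys removes duplicates while keeping first-seen order
        graph[source] = list(dict.fromkeys(t for t in targets if t != source))
    # pages that are only ever link targets become explicit dangling nodes
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, [])
    return graph


GRAPH = clean_adjacency(RAW.splitlines())
DANGLING = sorted(page for page, links in GRAPH.items() if not links)
```

After cleaning, every link target is also a key of the graph, so the rank mass sent along links always lands on a known page.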

Convergence in Practice

[Figures on four slides. Rows: ranking positions. Columns: iterations. Cells: document ids. Red: document should not be in the first 20 results. Yellow: document is in the first 20 results but in the wrong position. Green: document is in the first 20 results and in the correct position.]

References

• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”, Amy N. Langville and Carl D. Meyer, Princeton University Press (2006), ISBN 0-691-12202-4, http://press.princeton.edu/titles/8216.html
• “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, in Proc. of the Seventh International World Wide Web Conference (WWW 1998), http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Technical Report, Stanford InfoLab (1999), http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”, Matthew Richardson and Pedro Domingos, in Proc. of Advances in Neural Information Processing Systems (2002), http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”, Dylan Walker, Huafeng Xie, Koon-Kiu Yan and Sergei Maslov, Journal of Statistical Mechanics (2007), http://arxiv.org/abs/physics/0612122v1


Adjacency Matrix

     1  2  3  4  5  6
1    0  1  1  1  0  0
2    0  0  0  0  1  0
3    0  0  0  0  1  0
4    0  0  0  0  1  1
5    0  0  0  0  0  0
6    0  0  0  0  0  0

Hyperlink Matrix

H =
     1    2    3    4    5    6
1    0    1/3  1/3  1/3  0    0
2    0    0    0    0    1    0
3    0    0    0    0    1    0
4    0    0    0    0    1/2  1/2
5    0    0    0    0    0    0
6    0    0    0    0    0    0
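The hyperlink matrix H of the six-page example can be derived from the forward links: each non-zero entry in row p is 1 divided by the number of forward links of p. The sketch below uses exact fractions for readability. `hyperlink_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

# Forward links of the six-page example.
ADJACENCY = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}


def hyperlink_matrix(adjacency):
    """Return H as a list of rows; H[p][q] = 1/|p| if p links to q, else 0."""
    pages = sorted(adjacency)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    H = [[Fraction(0)] * n for _ in range(n)]
    for page, links in adjacency.items():
        for target in links:
            H[index[page]][index[target]] = Fraction(1, len(links))
    return H


H = hyperlink_matrix(ADJACENCY)
ROW_SUMS = [sum(row) for row in H]
```

Rows of pages with forward links sum to 1, while the rows of the dangling pages 5 and 6 sum to 0, which is why H alone is not a stochastic matrix.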

Sparse Matrices

Adjacency List (better for sparse matrices)

page  forward links
1     2 3 4
2     5
3     5
4     5 6
5
6

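PageRank

For the six-page example:

$$
\begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \end{pmatrix}_{k+1}^{T}
= d \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \end{pmatrix}_{k}^{T}
\begin{pmatrix}
0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
+ \frac{d}{N} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \end{pmatrix}_{k}^{T}
\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}
+ \frac{1 - d}{N} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}
$$

In general:

$$ r_{k+1}^T = d\, r_k^T H + \frac{d}{N}\, r_k^T a\, e^T + \frac{1 - d}{N}\, e^T $$

a: dangling node vector
e^T: vector of all 1s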
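A sketch of one iteration of the matrix form, written with plain Python lists so that it runs without extra libraries. It assumes the six-page example with d = 0.85 and a uniform starting vector. `power_iteration_step` is a hypothetical name.

```python
def power_iteration_step(r, H, a, d=0.85):
    """Compute r_{k+1}^T = d r_k^T H + (d/N) (r_k^T a) e^T + ((1-d)/N) e^T."""
    n = len(r)
    # r_k^T H: rank received through backward links
    link_part = [sum(r[i] * H[i][j] for i in range(n)) for j in range(n)]
    # r_k^T a: total rank currently held by dangling pages (a scalar)
    dangling_mass = sum(ri * ai for ri, ai in zip(r, a))
    return [d * link_part[j] + d * dangling_mass / n + (1 - d) / n for j in range(n)]


H = [
    [0, 1 / 3, 1 / 3, 1 / 3, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
A = [0, 0, 0, 0, 1, 1]  # pages 5 and 6 are dangling
R0 = [1 / 6] * 6
R1 = power_iteration_step(R0, H, A)
```

Only the sparse product with H depends on the link structure. The other two terms are the same for every page, so no dense matrix needs to be formed.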

Convergence

How many iterations? How to check convergence?

$$ \text{iterations} \approx \frac{-n}{\log_{10} d} $$

n: number of significant digits
ε = 10^{-n}: tolerance

$$ \left| r_{k+1}(p_i) - r_k(p_i) \right| < \varepsilon $$

Convergence

significant digits  iterations
1                   15
2                   29
3                   43
4                   57
5                   71
6                   86
7                   100
8                   114
9                   128
10                  142
11                  156
12                  171
13                  185
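The iteration counts in the table are consistent with rounding -n / log10(d) up to the next integer for d = 0.85. A short check, assuming that rounding rule (`iterations_needed` is a hypothetical name):

```python
import math


def iterations_needed(digits, d=0.85):
    """Estimate the iterations needed for the given number of significant digits."""
    return math.ceil(-digits / math.log10(d))


TABLE = {n: iterations_needed(n) for n in range(1, 14)}
```

Since log10(0.85) is about -0.0706, each additional significant digit costs roughly 14 more iterations.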

Power Method

• Slow to converge
• Complexity of each iteration: O(N)
• Overall complexity: iterations × O(N) = O(N) (the number of iterations depends on d and the tolerance, not on N)
• Minimal storage: H is a sparse matrix, and no completely dense matrices need to be stored

Storage Requirements

• Sparse hyperlink matrix H
  – number of non-zero elements (each a double)
• Sparse binary dangling node vector
  – number of dangling nodes (each a boolean)
• PageRank values for the current iteration
  – N elements (each a double)
• (optional) PageRank values for the previous iteration, to measure the tolerance error
  – N elements (each a double)

Implementing PageRank with MapReduce

Adjacency List

page  rank  forward links
1     r_1   2 3 4
2     r_2   5
3     r_3   5
4     r_4   5 6
5     r_5
6     r_6

MapReduce job 3 – page ranks from backward links

• map
  – input
    • key = p, value = ( r_p, p1, p2, …, pn )
  – output
    • key = p_i, value = r_p / n, for i = 1, 2, …, n
    • key = p, value = ( p1, p2, …, pn )
• reduce
  – input
    • key = p, values = ( r_j / n_j ), ( p1, p2, …, pn )
  – output
    • key = p, value = ( r_p, p1, p2, …, pn )

$$ r_p = d \sum_j \frac{r_j}{n_j} + d\, r_d + \frac{1 - d}{N} $$
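Job 3 can be sketched in memory as follows. The mapper emits one rank share per forward link and also passes the link list through, so the reducer can rebuild the record. Telling the two kinds of value apart by type is an assumption of this sketch, as are all function names. r_d is taken as given (the output of job 2).

```python
from collections import defaultdict


def map_rank(page, record):
    rank, links = record
    for target in links:
        yield target, rank / len(links)  # share r_p / n for each forward link
    yield page, links                    # keep the graph structure


def reduce_rank(page, values, r_d, total_pages, d=0.85):
    links, incoming = [], 0.0
    for value in values:
        if isinstance(value, list):      # the page's own forward links
            links = value
        else:                            # a rank share from a backward link
            incoming += value
    rank = d * incoming + d * r_d + (1 - d) / total_pages
    return page, (rank, links)


def run_job3(graph, r_d, d=0.85):
    grouped = defaultdict(list)          # stand-in for the shuffle phase
    for page, record in graph.items():
        for key, value in map_rank(page, record):
            grouped[key].append(value)
    n = len(graph)
    return dict(reduce_rank(p, vs, r_d, n, d) for p, vs in grouped.items())


GRAPH = {
    1: (1 / 6, [2, 3, 4]), 2: (1 / 6, [5]), 3: (1 / 6, [5]),
    4: (1 / 6, [5, 6]), 5: (1 / 6, []), 6: (1 / 6, []),
}
NEW_GRAPH = run_job3(GRAPH, r_d=1 / 18)  # 1/18 is r_d for uniform ranks here
```

Because the output records have the same shape as the input records, the job can be run again on its own output for the next iteration.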




Page 12: Having Fun with PageRank and MapReducestatic.last.fm/.../paolo_castagna-pagerank.pdf · 2020-04-03 · Having Fun with PageRank •Intelligent surfer –Change rows of the hyperlink

Adjacency List

1

2

3

4

5

6

better for sparse matrices

1 2 3 4

2 5

3 5

4 5 6

5

6

12

PageRank

TTT

k6

5

4

3

2

1

21

21

31

31

31

T

k6

5

4

3

2

1

T

16

5

4

3

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

0

0

0

000000

000000

0000

010000

010000

000

N

d

r

r

r

r

r

r

N

d

r

r

r

r

r

r

d

r

r

r

r

r

r

k

TTTTT1

1eaerHrr

N

d

N

dd kkk

a dangling node vector

eT vector of all 113

Convergence

How many iterations How to check convergence

d

n

10log

n number of significant digits

ε = 10-n tolerance

ikik prpr 1

14

Convergence

significant digits iterations1 15

2 29

3 43

4 57

5 71

6 86

7 100

8 114

9 128

10 142

11 156

12 171

13 185

15

Power Method

bull Slow to converge

bull Each iteration complexity O(N)

bull Overall complexity iterations O(N) = O(N)

bull Minimal storage H sparse matrix no completely dense matrices need to be stored

16

Storage Requirements

bull Sparse hyperlink matrix H

ndash number of non zero elements (each a double)

bull Sparse binary dangling node vector

ndash number of dangling nodes (each a boolean)

bull PageRank values for the current iteration

ndash N elements (each a double)

bull (optional) PageRank values for the previous iteration to measure tolerance error

ndash N elements (each a double)

17

Implementing PageRankwith MapReduce

18

Adjacency List

1 r1 2 3 4

2 r2 5

3 r3 5

4 r4 5 6

5 r5

6 r6

19

MapReducejob 3 ndash page ranks from backward links

bull mapndash input

bull key = p value = ( rp p1 p2 pn )

ndash outputbull key = pi value = rp n i = ( 1 2 n )

bull key = p value = ( p1 p2 pn )

bull reducendash input

bull key = p values = ( rj nj ) ( p1 p2 pn )

ndash outputbull key = p value = ( rp p1 p2 pn )

N

dr

n

rdr d

j j

j

p

1

20

MapReducejob 2 ndash contribution from dangling pages

bull map

ndash input

bull key = value = rp dangling page

ndash output

bull key = 1 value = rp N total number of pages

bull combine and reduce

ndash input

bull key = 1 values = ( rj )

ndash output

bull key = value = rd only one value

j

jd rN

dr

21

MapReducejob 1 ndash total number of pages

bull map

ndash input

bull key = p value =

ndash output

bull key = 1 value = 1

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = N

j

jvN

22

Adjacency List

1 rk+11 rk

1 2 3 4

2 rk+12 rk

2 5

3 rk+13 rk

3 5

4 rk+14 rk

4 5 6

5 rk+15 rk

5

6 rk+16 rk

6

23

MapReducejob 4 ndash check for convergence

bull map

ndash input

bull key = p value = ( rk+1p rk

p p1 p2 pn )

ndash output

bull key = 1 value = abs ( rk+1p - rk

p )

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = ε ε tolerance

j

jv

24

Putting all together

bull job 1 ndash total number of pages

bull for max n iterations or until convergence

ndash job 2 ndash contribution from dangling pages

ndash job 3 ndash page ranks from backward links

ndash every y iterations

bull job 4 ndash check for convergence

bull Total number of jobs lt= 1 + 2n + ny

25

Having Fun with PageRank

bull Intelligent surferndash Change rows of the hyperlink matrix H so long they

remain probability distributionsndash Teleportation vector (aka personalization vector)

instead of random jumps

bull CiteRank to rank papers (using a time dependant decay factor to shape probability distributions of the hyperlink matrix)

bull Social networks bull Ranking schemes evaluation and comparison

techniques (without involving humans)bull Ranking schemes for directed labelled multi-

graphs (aka RDF)

26

Ranking Papers CiteSeer Dataset

27

CiteSeer (1) (2) PageRank Title (Year)

340126 yes yes 000157983 New Directions in Cryptography Invited Paper (1976)

549100 no no 000121952 Structure and Complexity of Relational Queries (1982)

548351 no no 000120267 Computable Queries for Relational Data Bases (1980)

527057 yes yes 000114733 Optimization by Simulated Annealing (1983)

516071 no no 000112389 Probabilistic Methods in Combinatorics (1974)

28289 yes no 000108669 A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)

552631 no no 000107492 Fast Anisotropic Gauss Filtering (2001)

328445 no yes 000101743 Scheduling Algorithms for Multiprogramming in a Hard-Read-Time Environment (1973)

239544 no no 000096108 Discrepancy in Arithmetic Progressions (1996)

148879 yes no 000094061 Yacc Yet Another Compiler-Compiler (1975)

311874 no yes 000091705 Graph-Based Algorithms for Boolean Function Manipulation (1986)

93436 no no 000090491 Privacy Enhancement for Internet Electronic Mail Part II (1993)

219414 no no 000084181 Privacy Enhancement for Internet Electronic Mail Part III (1993)

567230 no no 000080948 A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)

20336 no no 000075584 Generalised Additive Models (1995)

524648 yes no 000074717 Implementing Remote Procedure Calls (1984)

15205 no no 000073840 Congestion Avoidance and Control (1988)

35316 no no 000069750 Relational Queries Computable in Polynomial Time (1986)

76766 yes no 000068785 The UNIX Time-Sharing (1974)

351230 no no 000067404 History of Circumscription (1993)

CiteSeer - httpciteseeristpsueduCID(1) httpenwikipediaorgwikiList_of_important_publications_in_computer_science(2) httpscholargooglecomscholaras_q=22+22ampnum=100ampas_subj=eng

Dirty Data

• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations the result may be ∞)

Example adjacency list (each line is a page followed by the pages it links to):

a b c
b h k p
c
d a a c
e s
f f b
g

g

Convergence in Practice

[Figures omitted (four slides). Rows: ranking positions. Columns: iterations. Cells: document ids. Red: the document should not be in the first 20 results. Yellow: the document is in the first 20 results but in the wrong position. Green: the document is in the first 20 results and in the correct position.]

References

• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”. Amy N. Langville and Carl D. Meyer. Princeton University Press (2006). ISBN 0-691-12202-4. http://press.princeton.edu/titles/8216.html

• “The anatomy of a large-scale hypertextual Web search engine”. Sergey Brin and Lawrence Page. In Proc. of the Seventh International World Wide Web Conference (WWW 1998). http://ilpubs.stanford.edu:8090/361/

• “The PageRank Citation Ranking: Bringing Order to the Web”. Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd. Technical Report, Stanford InfoLab (1999). http://ilpubs.stanford.edu:8090/422/

• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”. Matthew Richardson and Pedro Domingos. In Proc. of Advances in Neural Information Processing Systems (2002). http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf

• “Ranking Scientific Publications Using a Simple Model of Network Traffic”. Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov. Journal of Statistical Mechanics (2007). http://arxiv.org/abs/physics/0612122v1


PageRank

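Matrix form of the iteration:

$$ r_{k+1}^T = d\, r_k^T H + \frac{d}{N}\, r_k^T a\, e^T + \frac{1-d}{N}\, e^T $$

For the six-page example, with $r_k^T = (r_1\ r_2\ r_3\ r_4\ r_5\ r_6)_k$:

$$
r_{k+1}^T = d\, r_k^T
\begin{pmatrix}
0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
+ \frac{d}{N}\, r_k^T
\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}
+ \frac{1-d}{N}
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}
$$

a: dangling node vector

e^T: vector of all 1s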
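A minimal sketch of one iteration of the matrix form, written with plain Python lists so that the sparse structure of H stays visible. It assumes pages are numbered from 0 (so page 1 of the example is index 0) and that `links[p]` holds the forward links of page `p`; the function name `pagerank_step` is illustrative only.

```python
def pagerank_step(ranks, links, d=0.85):
    """One step of r_{k+1}^T = d r_k^T H + (d/N) r_k^T a e^T + ((1-d)/N) e^T."""
    n = len(ranks)
    # r_k^T a: total rank currently held by dangling pages.
    dangling_mass = sum(r for r, outs in zip(ranks, links) if not outs)
    # The two e^T terms are the same for every page.
    base = (d * dangling_mass + (1.0 - d)) / n
    new_ranks = [base] * n
    # d r_k^T H: each page shares its rank equally among its forward links.
    for r, outs in zip(ranks, links):
        for target in outs:
            new_ranks[target] += d * r / len(outs)
    return new_ranks


# Six-page example: 1 -> 2 3 4, 2 -> 5, 3 -> 5, 4 -> 5 6, pages 5 and 6 dangling.
links = [[1, 2, 3], [4], [4], [4, 5], [], []]
ranks = [1.0 / 6] * 6
for _ in range(50):
    ranks = pagerank_step(ranks, links)
```

Because only the non-zero entries of H are visited, the work per step is proportional to the number of links plus the number of pages.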

Convergence

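How many iterations? How to check convergence?

$$ \text{iterations} \approx \frac{-n}{\log_{10} d} $$

n: number of significant digits

ε = 10^{-n}: tolerance

$$ \sum_i \left| r_{k+1}(p_i) - r_k(p_i) \right| < \varepsilon $$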

Convergence

| significant digits | iterations |
|---|---|
| 1 | 15 |
| 2 | 29 |
| 3 | 43 |
| 4 | 57 |
| 5 | 71 |
| 6 | 86 |
| 7 | 100 |
| 8 | 114 |
| 9 | 128 |
| 10 | 142 |
| 11 | 156 |
| 12 | 171 |
| 13 | 185 |
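The table can be reproduced from the relation between significant digits and the damping factor. The sketch below assumes d = 0.85 and rounds up to a whole number of iterations; the function name `iterations_needed` is illustrative.

```python
import math


def iterations_needed(digits, d=0.85):
    """Iterations for `digits` significant digits: ceil(-digits / log10(d))."""
    return math.ceil(-digits / math.log10(d))


table = {n: iterations_needed(n) for n in range(1, 14)}
```

With a larger damping factor the denominator shrinks, so the same precision needs more iterations.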

Power Method

• Slow to converge

• Each iteration: complexity O(N)

• Overall complexity: iterations × O(N) = O(N), since the number of iterations depends on d and the required precision, not on N

• Minimal storage: H is a sparse matrix, so no completely dense matrices need to be stored

Storage Requirements

• Sparse hyperlink matrix H
  – number of non-zero elements (each a double)

• Sparse binary dangling node vector
  – number of dangling nodes (each a boolean)

• PageRank values for the current iteration
  – N elements (each a double)

• (optional) PageRank values for the previous iteration, to measure the tolerance error
  – N elements (each a double)
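As a rough back-of-the-envelope sketch, the list above translates into a byte count. The sizes used here (8 bytes per double, 1 byte per boolean) are assumptions, and the row and column indices a real sparse format needs are deliberately ignored.

```python
DOUBLE_BYTES = 8   # assumed size of a double
BOOLEAN_BYTES = 1  # assumed size of a boolean


def storage_bytes(nonzeros, dangling, pages, keep_previous=False):
    """Estimate storage for H, the dangling node vector and the rank vector(s)."""
    total = nonzeros * DOUBLE_BYTES      # sparse hyperlink matrix H
    total += dangling * BOOLEAN_BYTES    # sparse binary dangling node vector
    total += pages * DOUBLE_BYTES        # ranks for the current iteration
    if keep_previous:
        total += pages * DOUBLE_BYTES    # optional ranks of the previous iteration
    return total


# Six-page example: 7 links, 2 dangling pages, 6 pages.
example = storage_bytes(nonzeros=7, dangling=2, pages=6, keep_previous=True)
```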

Implementing PageRank with MapReduce

Adjacency List

| page | rank | forward links |
|---|---|---|
| 1 | r_1 | 2 3 4 |
| 2 | r_2 | 5 |
| 3 | r_3 | 5 |
| 4 | r_4 | 5 6 |
| 5 | r_5 | |
| 6 | r_6 | |
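One possible in-memory picture of these records, useful for experimenting before touching Hadoop, is a dictionary keyed by page. The starting rank of 1/N follows the iterative definition given earlier in the deck; the function name `initial_records` and the tuple layout are assumptions made for this sketch.

```python
def initial_records(links_by_page):
    """Build {page: (rank, forward_links)} with every rank set to 1/N."""
    n = len(links_by_page)
    return {page: (1.0 / n, list(links)) for page, links in links_by_page.items()}


records = initial_records({1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []})
```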

MapReduce job 3 – page ranks from backward links

• map
  – input
    • key = p, value = (r_p, p_1, p_2, ..., p_n)
  – output
    • key = p_i, value = r_p / n, for i = 1, 2, ..., n
    • key = p, value = (p_1, p_2, ..., p_n)

• reduce
  – input
    • key = p, values = (r_j / n_j), (p_1, p_2, ..., p_n)
  – output
    • key = p, value = (r_p, p_1, p_2, ..., p_n)

$$ r_p = d \sum_j \frac{r_j}{n_j} + r_d + \frac{1 - d}{N} $$
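The map and reduce steps of job 3 can be imitated in plain Python with an in-memory shuffle. This is a sketch, not Hadoop code: the function names are illustrative, the forward-link list is told apart from rank contributions by its type, and r_d is assumed to already include the factor d/N as computed by job 2.

```python
from collections import defaultdict


def map_job3(page, value):
    rank, links = value
    for target in links:
        yield target, rank / len(links)  # contribution r_p / n to each forward link
    yield page, list(links)              # pass the link structure through


def reduce_job3(page, values, r_d, n_pages, d=0.85):
    links, total = [], 0.0
    for v in values:
        if isinstance(v, list):
            links = v                    # the page's own forward links
        else:
            total += v                   # a contribution r_j / n_j from a backward link
    rank = d * total + r_d + (1.0 - d) / n_pages
    return page, (rank, links)


def run_job3(records, r_d, d=0.85):
    grouped = defaultdict(list)          # stands in for the shuffle phase
    for page, value in records.items():
        for key, out in map_job3(page, value):
            grouped[key].append(out)
    return dict(reduce_job3(p, vs, r_d, len(records), d) for p, vs in grouped.items())


records = {1: (1 / 6, [2, 3, 4]), 2: (1 / 6, [5]), 3: (1 / 6, [5]),
           4: (1 / 6, [5, 6]), 5: (1 / 6, []), 6: (1 / 6, [])}
r_d = 0.85 / 6 * (1 / 6 + 1 / 6)         # dangling contribution from pages 5 and 6
updated = run_job3(records, r_d)
```

Emitting the link list alongside the contributions is what lets the reducer write a record in the same shape as its input, so the job can be chained with itself.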

MapReduce job 2 – contribution from dangling pages

• map
  – input
    • key = (unused), value = r_p (p: dangling page)
  – output
    • key = 1, value = r_p

• combine and reduce
  – input
    • key = 1, values = (r_j)
  – output
    • key = (unused), value = r_d (only one value)

$$ r_d = \frac{d}{N} \sum_j r_j $$

N: total number of pages
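A small Python imitation of job 2, assuming records of the form `{page: (rank, forward_links)}` and treating a page with no forward links as dangling. The function names are illustrative, and the single key 1 mirrors the description above.

```python
def map_job2(records):
    for rank, links in records.values():
        if not links:          # dangling page: no forward links
            yield 1, rank


def reduce_job2(values, n_pages, d=0.85):
    # r_d = (d / N) * sum of the ranks of all dangling pages
    return d / n_pages * sum(values)


records = {1: (1 / 6, [2, 3, 4]), 2: (1 / 6, [5]), 3: (1 / 6, [5]),
           4: (1 / 6, [5, 6]), 5: (1 / 6, []), 6: (1 / 6, [])}
r_d = reduce_job2((value for _key, value in map_job2(records)), len(records))
```

Since summation is associative, the same function can serve as the combiner, provided the d/N factor is applied only once in the final reduce.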

MapReduce job 1 – total number of pages

• map
  – input
    • key = p, value = (unused)
  – output
    • key = 1, value = 1

• combine and reduce
  – input
    • key = 1, values = (v_j)
  – output
    • key = (unused), value = N

$$ N = \sum_j v_j $$
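The counting job, sketched in Python with two hypothetical input splits to show why a combiner helps: each split is reduced to a single partial count before anything reaches the reducer. Function names and the split layout are assumptions for illustration.

```python
def map_job1(pages):
    for _page in pages:
        yield 1, 1


def combine_job1(pairs):
    # Partial count for one input split.
    yield 1, sum(value for _key, value in pairs)


def reduce_job1(values):
    return sum(values)


splits = [[1, 2, 3, 4], [5, 6]]  # two hypothetical input splits
partials = [v for split in splits for _k, v in combine_job1(map_job1(split))]
total_pages = reduce_job1(partials)
```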

Adjacency List

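| page | current rank | previous rank | forward links |
|---|---|---|---|
| 1 | r_1^{k+1} | r_1^k | 2 3 4 |
| 2 | r_2^{k+1} | r_2^k | 5 |
| 3 | r_3^{k+1} | r_3^k | 5 |
| 4 | r_4^{k+1} | r_4^k | 5 6 |
| 5 | r_5^{k+1} | r_5^k | |
| 6 | r_6^{k+1} | r_6^k | |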

MapReduce job 4 – check for convergence

• map
  – input
    • key = p, value = (r_p^{k+1}, r_p^k, p_1, p_2, ..., p_n)
  – output
    • key = 1, value = abs(r_p^{k+1} - r_p^k)

• combine and reduce
  – input
    • key = 1, values = (v_j)
  – output
    • key = (unused), value = ε (ε: tolerance)

$$ \varepsilon = \sum_j v_j $$
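A sketch of the convergence check in Python, assuming each record carries the current rank, the previous rank and the forward links. The two-page data and the threshold of 1e-3 are made up purely to exercise the code.

```python
def map_job4(records):
    for new_rank, old_rank, _links in records.values():
        yield 1, abs(new_rank - old_rank)


def reduce_job4(values):
    return sum(values)


# Hypothetical records: (current rank, previous rank, forward links).
records = {1: (0.30, 0.25, [2]), 2: (0.70, 0.75, [])}
epsilon = reduce_job4(value for _key, value in map_job4(records))
converged = epsilon < 1e-3
```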

Putting It All Together

• job 1 – total number of pages

• for max n iterations or until convergence
  – job 2 – contribution from dangling pages
  – job 3 – page ranks from backward links
  – every y iterations
    • job 4 – check for convergence

• Total number of jobs <= 1 + 2n + n/y
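The control flow above can be sketched as a single-process driver that counts how many jobs it would have launched. Each "job" is replaced by an in-memory computation, so this is an illustration of the loop and the job bound rather than a Hadoop driver; the parameter names `max_iterations`, `check_every` and `tolerance` are assumptions standing in for n, y and ε.

```python
def run_pagerank(links, d=0.85, max_iterations=100, check_every=5, tolerance=1e-6):
    jobs = 1                                  # job 1: total number of pages
    n = len(links)
    ranks = {p: 1.0 / n for p in links}
    for iteration in range(1, max_iterations + 1):
        jobs += 1                             # job 2: contribution from dangling pages
        r_d = d / n * sum(ranks[p] for p, outs in links.items() if not outs)
        jobs += 1                             # job 3: page ranks from backward links
        new_ranks = {p: r_d + (1.0 - d) / n for p in links}
        for p, outs in links.items():
            for q in outs:
                new_ranks[q] += d * ranks[p] / len(outs)
        previous, ranks = ranks, new_ranks
        if iteration % check_every == 0:
            jobs += 1                         # job 4: check for convergence
            if sum(abs(ranks[p] - previous[p]) for p in links) < tolerance:
                break
    return ranks, jobs, iteration


links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
ranks, jobs, iterations = run_pagerank(links)
```

Checking only every few iterations trades a handful of possibly unnecessary rank updates for fewer convergence jobs, which matters when every job has a fixed start-up cost.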

Having Fun with PageRank

• Intelligent surfer
  – Change rows of the hyperlink matrix H as long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps

• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)

• Social networks

• Ranking schemes evaluation and comparison techniques (without involving humans)

• Ranking schemes for directed labelled multigraphs (aka RDF)
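To make the teleportation vector idea concrete, here is a hedged sketch in which the uniform (1 - d)/N jump is replaced by (1 - d) times a personalization vector. It covers only that one variation, not the full intelligent surfer model of Richardson and Domingos; dangling pages are assumed to keep jumping uniformly, and all names are illustrative.

```python
def personalized_step(ranks, links, teleport, d=0.85):
    """One PageRank step where `teleport` replaces the uniform random jump."""
    n = len(ranks)
    dangling_mass = sum(r for r, outs in zip(ranks, links) if not outs)
    # Dangling pages still jump uniformly (an assumption of this sketch).
    new_ranks = [d * dangling_mass / n + (1.0 - d) * t for t in teleport]
    for r, outs in zip(ranks, links):
        for target in outs:
            new_ranks[target] += d * r / len(outs)
    return new_ranks


def iterate(links, teleport, steps=100):
    ranks = [1.0 / len(links)] * len(links)
    for _ in range(steps):
        ranks = personalized_step(ranks, links, teleport)
    return ranks


# Six-page example with 0-based indices; the biased surfer always teleports to page 6.
links = [[1, 2, 3], [4], [4], [4, 5], [], []]
uniform = iterate(links, [1.0 / 6] * 6)
biased = iterate(links, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
```

As long as the teleportation vector sums to 1, the ranks still sum to 1 after every step, which is the same requirement the slide places on the modified rows of H.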





Page 17: Having Fun with PageRank and MapReducestatic.last.fm/.../paolo_castagna-pagerank.pdf · 2020-04-03 · Having Fun with PageRank •Intelligent surfer –Change rows of the hyperlink

Storage Requirements

bull Sparse hyperlink matrix H

ndash number of non zero elements (each a double)

bull Sparse binary dangling node vector

ndash number of dangling nodes (each a boolean)

bull PageRank values for the current iteration

ndash N elements (each a double)

bull (optional) PageRank values for the previous iteration to measure tolerance error

ndash N elements (each a double)

17

Implementing PageRankwith MapReduce

18

Adjacency List

1 r1 2 3 4

2 r2 5

3 r3 5

4 r4 5 6

5 r5

6 r6

19

MapReducejob 3 ndash page ranks from backward links

bull mapndash input

bull key = p value = ( rp p1 p2 pn )

ndash outputbull key = pi value = rp n i = ( 1 2 n )

bull key = p value = ( p1 p2 pn )

bull reducendash input

bull key = p values = ( rj nj ) ( p1 p2 pn )

ndash outputbull key = p value = ( rp p1 p2 pn )

N

dr

n

rdr d

j j

j

p

1

20

MapReducejob 2 ndash contribution from dangling pages

bull map

ndash input

bull key = value = rp dangling page

ndash output

bull key = 1 value = rp N total number of pages

bull combine and reduce

ndash input

bull key = 1 values = ( rj )

ndash output

bull key = value = rd only one value

j

jd rN

dr

21

MapReducejob 1 ndash total number of pages

bull map

ndash input

bull key = p value =

ndash output

bull key = 1 value = 1

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = N

j

jvN

22

Adjacency List

1 rk+11 rk

1 2 3 4

2 rk+12 rk

2 5

3 rk+13 rk

3 5

4 rk+14 rk

4 5 6

5 rk+15 rk

5

6 rk+16 rk

6

23

MapReducejob 4 ndash check for convergence

bull map

ndash input

bull key = p value = ( rk+1p rk

p p1 p2 pn )

ndash output

bull key = 1 value = abs ( rk+1p - rk

p )

bull combine and reduce

ndash input

bull key = 1 values = ( vj )

ndash output

bull key = value = ε ε tolerance

j

jv

24

Putting all together

bull job 1 ndash total number of pages

bull for max n iterations or until convergence

ndash job 2 ndash contribution from dangling pages

ndash job 3 ndash page ranks from backward links

ndash every y iterations

bull job 4 ndash check for convergence

bull Total number of jobs lt= 1 + 2n + ny

25

Having Fun with PageRank

• Intelligent surfer
  – Change rows of the hyperlink matrix H, so long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
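As an illustration of the teleportation vector, the sketch below replaces the uniform (1 − d)/N term with (1 − d)·v_p, where v is a probability distribution over pages. It is an assumption of this example that dangling pages still jump uniformly, as in the earlier slides; the function name and the two vectors are hypothetical.

```python
def personalized_step(links, ranks, v, d=0.85):
    """One iteration with teleportation vector v (must sum to 1)."""
    n_pages = len(links)
    r_d = sum(ranks[p] for p, out in links.items() if not out) / n_pages
    incoming = {p: 0.0 for p in links}
    for p, out in links.items():
        for q in out:
            incoming[q] += ranks[p] / len(out)
    return {p: d * (incoming[p] + r_d) + (1.0 - d) * v[p] for p in links}


links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
uniform = {p: 1.0 / len(links) for p in links}
biased = {p: (1.0 if p == 1 else 0.0) for p in links}  # always teleport to page 1

step_uniform = personalized_step(links, uniform, uniform)
step_biased = personalized_step(links, uniform, biased)
```

With the uniform vector this reduces to the standard update; with the biased vector page 1, which has no backward links, gains rank purely from teleportation.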

Ranking Papers: CiteSeer Dataset

CiteSeer ID  (1)  (2)  PageRank    Title (Year)
340126       yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100       no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351       no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057       yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071       no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289        yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631       no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445       no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544       no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879       yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874       no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436        no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail: Part II (1993)
219414       no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail: Part III (1993)
567230       no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336        no   no   0.00075584  Generalised Additive Models (1995)
524648       yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205        no   no   0.00073840  Congestion Avoidance and Control (1988)
35316        no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766        yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230       no   no   0.00067404  History of Circumscription (1993)

CiteSeer ID: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng

Dirty Data

• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as the result)

a  b c
b  h k p
c
d  a a c
e  s
f  f b
g
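The three cleaning rules can be sketched as a single pass over the raw adjacency list. The function name `clean` and the dict-of-lists representation are assumptions for the example; the input is the dirty list shown on the slide.

```python
def clean(raw):
    """Drop duplicate links and self-references, add implicit dangling nodes."""
    cleaned = {}
    for src, targets in raw.items():
        kept = []
        for t in targets:
            if t != src and t not in kept:   # skip self-references and repeats
                kept.append(t)
        cleaned[src] = kept
    # Pages that appear only as link targets become explicit dangling nodes.
    for targets in list(cleaned.values()):
        for t in targets:
            cleaned.setdefault(t, [])
    return cleaned


raw = {"a": ["b", "c"], "b": ["h", "k", "p"], "c": [], "d": ["a", "a", "c"],
       "e": ["s"], "f": ["f", "b"], "g": []}
cleaned = clean(raw)
```

After cleaning, every node has a record of its own, so the page count N and the dangling contribution r_d cover h, k, p and s as well.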

Convergence in Practice

[Figure: rows are ranking positions, columns are iterations, cells are document ids.]
• Red: document should not be in the first 20 results
• Yellow: document in the first 20 results, but in the wrong position
• Green: document in the first 20 results, in the correct position

References

• “Google’s PageRank and Beyond: The Science of Search Engine Rankings”. Amy N. Langville and Carl D. Meyer. Princeton University Press (2006). ISBN 0-691-12202-4. http://press.princeton.edu/titles/8216.html
• “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. Sergey Brin and Lawrence Page. In Proc. of the Seventh International World Wide Web Conference (WWW 1998). http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”. Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd. Technical Report, Stanford InfoLab (1999). http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”. Matthew Richardson and Pedro Domingos. In Proc. of Advances in Neural Information Processing Systems (2002). http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”. Dylan Walker, Huafeng Xie, Koon-Kiu Yan and Sergei Maslov. Journal of Statistical Mechanics (2007). http://arxiv.org/abs/physics/0612122v1

Page 23: Having Fun with PageRank and MapReducestatic.last.fm/.../paolo_castagna-pagerank.pdf · 2020-04-03 · Having Fun with PageRank •Intelligent surfer –Change rows of the hyperlink

Adjacency List

page  new rank     current rank  forward links
1     $r^{k+1}_1$  $r^k_1$       2 3 4
2     $r^{k+1}_2$  $r^k_2$       5
3     $r^{k+1}_3$  $r^k_3$       5
4     $r^{k+1}_4$  $r^k_4$       5 6
5     $r^{k+1}_5$  $r^k_5$
6     $r^{k+1}_6$  $r^k_6$

MapReduce job 4 – check for convergence

• map
  – input
    • key = p, value = ( $r^{k+1}_p$, $r^k_p$, p1, p2, ..., pn )
  – output
    • key = 1, value = $|r^{k+1}_p - r^k_p|$
• combine and reduce
  – input
    • key = 1, values = ( $v_j$ )
  – output
    • key = , value = $\sum_j v_j$, checked against the tolerance $\varepsilon$
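The convergence check above can be simulated in a few lines of plain Python. This is only a sketch and does not use the Hadoop API: the function names, the in-memory `records` dictionary, the rank values, and the choice of `<` for the comparison with the tolerance are all assumptions made for illustration.

```python
# Sketch of job 4: map emits |r_next - r_cur| under the single key 1,
# and the same summing function serves as combiner and reducer.

def map_convergence(page, value):
    """value = (r_next, r_cur, forward_links); emit (1, |r_next - r_cur|)."""
    r_next, r_cur, _links = value
    yield 1, abs(r_next - r_cur)


def sum_values(key, values):
    """Combiner and reducer: add up the differences seen for one key."""
    yield key, sum(values)


def total_change(records):
    """Run map, group by key (the shuffle), then reduce; return the total."""
    grouped = {}
    for page, value in records.items():
        for key, diff in map_convergence(page, value):
            grouped.setdefault(key, []).append(diff)
    totals = {}
    for key, diffs in grouped.items():
        for out_key, total in sum_values(key, diffs):
            totals[out_key] = total
    return totals.get(1, 0.0)


# Hypothetical records: page -> (new rank, current rank, forward links).
records = {
    1: (0.10, 0.12, [2, 3, 4]),
    2: (0.20, 0.20, [5]),
    5: (0.35, 0.34, []),
}
EPSILON = 1e-3  # illustrative tolerance
change = total_change(records)
converged = change < EPSILON
```

Because addition is associative and commutative, running the summing step first as a combiner on each mapper's output and then again as the reducer gives the same total as a single reduce.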

Putting all together

• job 1 – total number of pages
• for max n iterations or until convergence
  – job 2 – contribution from dangling pages
  – job 3 – page ranks from backward links
  – every y iterations
    • job 4 – check for convergence
• Total number of jobs ≤ 1 + 2n + n/y
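The job schedule can be sketched as a small driver loop. The `run_job` callback is hypothetical: it stands in for whatever launches a MapReduce job, and it is assumed to return a truthy value only when job 4 reports convergence. The sketch counts the jobs it launches, so the bound on the total can be checked directly.

```python
def run_pagerank(run_job, max_iterations, check_every):
    """Drive the four jobs and return how many jobs were launched.

    run_job(name) is a hypothetical callback; for "job4" it should
    return True when the ranks have converged.
    """
    jobs = 0
    run_job("job1")  # total number of pages, computed once
    jobs += 1
    for k in range(1, max_iterations + 1):
        run_job("job2")  # contribution from dangling pages
        run_job("job3")  # page ranks from backward links
        jobs += 2
        if k % check_every == 0:
            jobs += 1
            if run_job("job4"):  # convergence check
                break
    return jobs


# With n = 10 iterations and a check every y = 5 iterations, a run that
# never converges launches 1 + 2*10 + 10/5 jobs.
never_converges = run_pagerank(lambda name: False, 10, 5)
```

Checking only every y iterations trades a few possibly unnecessary iterations for fewer convergence jobs, which matters when each job has a fixed start-up cost.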

Having Fun with PageRank

• Intelligent surfer
  – Change the rows of the hyperlink matrix H, as long as they remain probability distributions
  – Teleportation vector (aka personalization vector) instead of random jumps
• CiteRank to rank papers (using a time-dependent decay factor to shape the probability distributions of the hyperlink matrix)
• Social networks
• Ranking schemes: evaluation and comparison techniques (without involving humans)
• Ranking schemes for directed labelled multigraphs (aka RDF)
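As a minimal sketch of the teleportation-vector idea, the iteration below replaces the uniform teleportation term (1 - d)/N with (1 - d) * v[p], where v is a probability distribution over pages. It runs in memory rather than as MapReduce jobs; the function name, the choice to keep dangling pages jumping uniformly, and the example vector are assumptions for illustration.

```python
def personalized_pagerank(links, v, d=0.85, iterations=50):
    """Power iteration with a teleportation (personalization) vector v.

    links: page -> list of forward links; v: page -> probability, summing to 1.
    Dangling pages are assumed to jump uniformly to any page.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without forward links, spread uniformly.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: d * dangling / n + (1.0 - d) * v[p] for p in pages}
        # Each page passes its rank in equal shares along its forward links.
        for p in pages:
            for q in links[p]:
                new_rank[q] += d * rank[p] / len(links[p])
        rank = new_rank
    return rank


# Six-page example graph: 1 -> 2 3 4, 2 -> 5, 3 -> 5, 4 -> 5 6.
links = {1: [2, 3, 4], 2: [5], 3: [5], 4: [5, 6], 5: [], 6: []}
uniform = personalized_pagerank(links, {p: 1.0 / 6 for p in links})
biased = personalized_pagerank(links, {p: 1.0 if p == 6 else 0.0 for p in links})
```

With the uniform vector this reduces to ordinary PageRank with damping factor d; concentrating v on page 6 shifts rank towards page 6.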

Ranking Papers: CiteSeer Dataset

CiteSeer ID  (1)  (2)  PageRank    Title (Year)
340126       yes  yes  0.00157983  New Directions in Cryptography Invited Paper (1976)
549100       no   no   0.00121952  Structure and Complexity of Relational Queries (1982)
548351       no   no   0.00120267  Computable Queries for Relational Data Bases (1980)
527057       yes  yes  0.00114733  Optimization by Simulated Annealing (1983)
516071       no   no   0.00112389  Probabilistic Methods in Combinatorics (1974)
28289        yes  no   0.00108669  A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (1978)
552631       no   no   0.00107492  Fast Anisotropic Gauss Filtering (2001)
328445       no   yes  0.00101743  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)
239544       no   no   0.00096108  Discrepancy in Arithmetic Progressions (1996)
148879       yes  no   0.00094061  Yacc: Yet Another Compiler-Compiler (1975)
311874       no   yes  0.00091705  Graph-Based Algorithms for Boolean Function Manipulation (1986)
93436        no   no   0.00090491  Privacy Enhancement for Internet Electronic Mail Part II (1993)
219414       no   no   0.00084181  Privacy Enhancement for Internet Electronic Mail Part III (1993)
567230       no   no   0.00080948  A Timeout-Based Congestion Control Scheme for Window Flow-Controlled Networks (1986)
20336        no   no   0.00075584  Generalised Additive Models (1995)
524648       yes  no   0.00074717  Implementing Remote Procedure Calls (1984)
15205        no   no   0.00073840  Congestion Avoidance and Control (1988)
35316        no   no   0.00069750  Relational Queries Computable in Polynomial Time (1986)
76766        yes  no   0.00068785  The UNIX Time-Sharing (1974)
351230       no   no   0.00067404  History of Circumscription (1993)

CiteSeer: http://citeseer.ist.psu.edu/CID
(1) http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science
(2) http://scholar.google.com/scholar?as_q=%22+%22&num=100&as_subj=eng

Dirty Data

• Ignore duplicate links (d a a c) and self-references (f f b)
• Implicit dangling nodes (h, k, p, s)
• If the data is dirty, the Google matrix will not be stochastic, and neither a unique solution nor convergence is guaranteed (with a sufficiently high number of iterations you might get ∞ as a result)

    a b c
    b h k p
    c
    d a a c
    e s
    f f b
    g
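A minimal cleaning pass over adjacency lines like the ones above might look as follows. It assumes each line is a page followed by its forward links, separated by whitespace, and it reads "ignore" as "drop"; the function name is made up for this sketch.

```python
def clean_adjacency(lines):
    """Drop duplicate links and self-references; add implicit dangling nodes."""
    graph = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        source, targets = tokens[0], tokens[1:]
        kept = graph.setdefault(source, [])
        for target in targets:
            # Skip self-references and links already recorded for this page.
            if target != source and target not in kept:
                kept.append(target)
    # Pages that only ever appear as link targets get an empty link list.
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, [])
    return graph


dirty = ["a b c", "b h k p", "c", "d a a c", "e s", "f f b", "g"]
graph = clean_adjacency(dirty)
dangling = sorted(page for page, targets in graph.items() if not targets)
```

After cleaning, every page that is linked to also has its own (possibly empty) entry, so the dangling-page step can account for all rank mass.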

Convergence in Practice

(Figures not reproduced.) Rows: ranking positions. Columns: iterations. Cells: document ids.
Red: document should not be in the first 20 results.
Yellow: document in the first 20 results, but in the wrong position.
Green: document in the first 20 results, in the correct position.

References

• “Google's PageRank and Beyond: The Science of Search Engine Rankings”
  Amy N. Langville and Carl D. Meyer
  Princeton University Press (2006), ISBN 0-691-12202-4
  http://press.princeton.edu/titles/8216.html
• “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
  Sergey Brin and Lawrence Page
  In Proc. of the Seventh International World Wide Web Conference (WWW 1998)
  http://ilpubs.stanford.edu:8090/361/
• “The PageRank Citation Ranking: Bringing Order to the Web”
  Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd
  Technical Report, Stanford InfoLab (1999)
  http://ilpubs.stanford.edu:8090/422/
• “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”
  Matthew Richardson and Pedro Domingos
  In Proc. of Advances in Neural Information Processing Systems (2002)
  http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf
• “Ranking Scientific Publications Using a Simple Model of Network Traffic”
  Dylan Walker, Huafeng Xie, Koon-Kiu Yan, Sergei Maslov
  Journal of Statistical Mechanics (2007)
  http://arxiv.org/abs/physics/0612122v1
