
Page 1: Google PageRank - Free University of Bozen-Bolzano, ricci/ISR/slides-2015/1B-Google.pdf · 2015-03-05

Google PageRank

Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano [email protected]

1

Page 2:

Content

p  Linear Algebra
p  Matrices
p  Eigenvalues and eigenvectors
p  Markov chains
p  Google PageRank

2

Page 3:

Literature

p  C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. Chapter 21
p  Markov chains description on Wikipedia
p  Amy N. Langville & Carl D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006.

3

Page 4:

Google

p  Google is the leading search and online advertising company - founded by Larry Page and Sergey Brin (Ph.D. students at Stanford University)
p  Google was named after "googol", the mathematical term for 10^100
p  Google's success in search is largely based on its PageRank™ algorithm
p  Gartner reckons that Google now makes use of more than 1 million servers, spitting out search results, images, videos, emails and ads
p  Google reports that it spends some 200 to 250 million US dollars a year on IT equipment.

4

Page 5:

Matrices

p  A matrix is a rectangular array of numbers
p  aij is the element of matrix A in row i and column j
p  A is said to be an n x m matrix if it has n rows and m columns
p  A square matrix is an n x n matrix
p  The transpose AT of a matrix A is the matrix obtained by exchanging the rows and the columns

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$$

$$A^T = \begin{pmatrix} a^T_{11} & a^T_{12} \\ a^T_{21} & a^T_{22} \\ a^T_{31} & a^T_{32} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{pmatrix} = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$

5

Page 6:

Exercise

p  What is the size of these matrices?
p  Compute their transpose

6

Page 7:

Exercise

p  What is the size of these matrices?
p  Compute their transpose

7

2x3, 3x1, 3x4

$$\begin{pmatrix} 1 & 20 \\ 9 & 5 \\ -13 & -6 \end{pmatrix} \qquad \begin{pmatrix} 4 & 1 & 8 \end{pmatrix} \qquad \begin{pmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \\ -3 & -2 & -1 \end{pmatrix}$$

Page 8:

Matrices

p  A square matrix is diagonal iff aij = 0 for all i≠j
p  The identity matrix 1 is the diagonal matrix with 1's along the diagonal
p  A symmetric matrix A satisfies the condition A=AT

$$A = \begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix} \qquad A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

8

Page 9:

Exercise

p  Is a diagonal matrix symmetric?
p  Give an example of a symmetric matrix
p  Give an example of a 2x3 symmetric matrix

9

Page 10:

Exercise

p  Is a diagonal matrix symmetric?
n  YES: if it is diagonal then aij = 0 for all i≠j, hence aij = aji for all i≠j
p  Give an example of a symmetric matrix:

$$\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$$

p  Give an example of a 2x3 symmetric matrix
n  Impossible: a symmetric matrix is a square matrix

10

Page 11:

Vectors

p  A vector v is a one-dimensional array of numbers (an n x 1 matrix - a column vector)
p  Example:
p  The standard form of a vector is a column vector
p  The transpose of a column vector vT = (3 5 7) is a row vector.

$$v = \begin{pmatrix} 3 \\ 5 \\ 7 \end{pmatrix}$$

11

Page 12:

Operations on matrices

p  Addition: A=(aij), B=(bij), C=(cij)=A+B
n  cij = aij + bij
p  Scalar multiplication: λ is a number, λA = (λaij)
p  Multiplication: if A and B are compatible, i.e., the number of columns of A is equal to the number of rows of B, then
n  C=(cij)=AB
n  cij = Σk aik bkj

12
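The three operations above can be sketched in pure Python, using lists of lists as matrices; a minimal illustration, not library code (mat_add, scalar_mul and mat_mul are hypothetical helper names):

```python
# Hypothetical helpers illustrating the three operations; matrices are
# plain lists of lists.
def mat_add(A, B):
    # c_ij = a_ij + b_ij
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scalar_mul(lam, A):
    # (lambda*A)_ij = lambda * a_ij
    return [[lam * a for a in row] for row in A]

def mat_mul(A, B):
    # c_ij = sum_k a_ik * b_kj; requires cols(A) == rows(B)
    assert len(A[0]) == len(B), "incompatible shapes"
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2, 3], [4, 5, 6]]
At = [[1, 4], [2, 5], [3, 6]]
print(mat_mul(A, At))    # [[14, 32], [32, 77]]
```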

Page 13:

Examples

p  If AB=1, then B is said to be the inverse of A and is denoted by A-1
p  If a matrix has an inverse, it is called invertible or nonsingular

$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} = \begin{pmatrix} 1{\cdot}1+2{\cdot}2+3{\cdot}3 & 1{\cdot}4+2{\cdot}5+3{\cdot}6 \\ 4{\cdot}1+5{\cdot}2+6{\cdot}3 & 4{\cdot}4+5{\cdot}5+6{\cdot}6 \end{pmatrix} = \begin{pmatrix} 14 & 32 \\ 32 & 77 \end{pmatrix}$$

$$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

13

The result is symmetric. Is it a general fact? Is AAT always symmetric?
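The answer to the question on the slide is yes: (AAT)T = (AT)T AT = AAT, so the product is always symmetric. A quick numeric check in pure Python (transpose and mat_mul are hypothetical helpers):

```python
# (A A^T)^T = (A^T)^T A^T = A A^T, so the product is always symmetric.
def transpose(A):
    return [list(col) for col in zip(*A)]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2, 3], [4, 5, 6]]
P = mat_mul(A, transpose(A))
print(P)                    # [[14, 32], [32, 77]]
print(P == transpose(P))    # True
```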

Page 14:

Exercise

p  Compute the following operations

14

Page 15:

Exercise

p  Compute the following operations

15

Page 16:

Rank of a Matrix

p  The row (column) rank of a matrix is the maximum number of rows (columns) that are linearly independent
p  The vectors v1, ..., vn are linearly independent iff no linear combination a1v1 + ... + anvn (with coefficients ai not all 0) is equal to 0
p  Example 1: (1 2 3), (1 4 6), and (0 2 3) are linearly dependent: show it
p  Example 2: (1 2 3) and (1 4 6) are not linearly dependent: show it
p  The kernel of a matrix A is the subspace of vectors v such that Av=0

16

Page 17:

Exercise solution

p  1*(1 2 3)T - 1*(1 4 6)T + 1*(0 2 3)T = (0 0 0)T
p  (1 -1 1)T is in the kernel of the matrix:
p  a*(1 2 3) + b*(1 4 6) = (0 0 0)
n  Then a=-b and also a=-2b, absurd.

17

$$\begin{pmatrix} 1 & 1 & 0 \\ 2 & 4 & 2 \\ 3 & 6 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$
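The kernel claim can be verified mechanically; a small pure-Python check (mat_vec is a hypothetical helper):

```python
# Verify that (1, -1, 1)t is in the kernel: the product must be the zero vector.
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

M = [[1, 1, 0],
     [2, 4, 2],
     [3, 6, 3]]
print(mat_vec(M, [1, -1, 1]))   # [0, 0, 0]
```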

Page 18:

Rank and Determinant

p  Theorem. An n x n square matrix is nonsingular iff it has full rank (i.e., n).
p  Theorem. A matrix has full column rank iff it does not have a null vector
p  Theorem. An n x n matrix A is singular iff det(A)=0
p  A[ij] is the ij minor, i.e., the matrix obtained by deleting the i-th row and the j-th column from A.

$$\det(A) = \begin{cases} a_{11} & \text{if } n = 1 \\ \sum_{j=1}^{n} (-1)^{1+j}\, a_{1j} \det(A[1j]) & \text{if } n > 1 \end{cases}$$

18

Page 19:

Exercise

p  Compute the determinant of the following matrices

19

$$\begin{pmatrix} 1 & 1 & 0 \\ 2 & 4 & 2 \\ 3 & 6 & 3 \end{pmatrix} \qquad \begin{pmatrix} 1 & 1 \\ 2 & 4 \end{pmatrix}$$

Page 20:

Exercise

p  Compute the determinant of the following matrices

20

$$\det\begin{pmatrix} 1 & 1 \\ 2 & 4 \end{pmatrix} = 1 \cdot 4 - 1 \cdot 2 = 2$$

$$\det\begin{pmatrix} 1 & 1 & 0 \\ 2 & 4 & 2 \\ 3 & 6 & 3 \end{pmatrix} = 1 \cdot (4 \cdot 3 - 2 \cdot 6) - 1 \cdot (2 \cdot 3 - 2 \cdot 3) = 0$$

http://www.bluebit.gr/matrix-calculator/
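The recursive definition from the "Rank and Determinant" slide can be implemented directly; a pure-Python sketch via cofactor expansion along the first row (with 0-based index j, the sign is (-1)^j rather than (-1)^(1+j)):

```python
# Cofactor expansion along the first row, matching the recursive definition.
def minor(A, i, j):
    # A[ij]: delete row i and column j
    return [row[:j] + row[j+1:] for k, row in enumerate(A) if k != i]

def det(A):
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j))
               for j in range(len(A)))

print(det([[1, 1], [2, 4]]))                      # 2
print(det([[1, 1, 0], [2, 4, 2], [3, 6, 3]]))     # 0
```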

Page 21:

Eigenvectors and Eigenvalues

p  Definition. If M is a square matrix, v is a nonzero vector, and λ is a number such that
n  M v = λ v
p  then v is said to be a (right) eigenvector of M with eigenvalue λ
p  If v is an eigenvector of M with eigenvalue λ, then so is any nonzero multiple of v
p  Only the direction matters.

21

Page 22:

Example

p  The matrix

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix}$$

p  has two (right) eigenvectors:
n  v1 = (1 1)t and v2 = (3 1)t

Prove that. Is it singular?

22

Page 23:

Example

p  The matrix

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix}$$

p  has two eigenvectors:
n  v1 = (1 1)t and v2 = (3 1)t
p  Mv1 = (-1 -1)t = -1 v1
n  The eigenvalue is -1
p  Mv2 = (3 1)t = 1 v2
n  The eigenvalue is 1

Is it singular?

23

Page 24:

Transformation

p  There is a lot of distortion in these directions: (1 0)t, (1 1)t, (0 1)t

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix}$$

24

Page 25:

Transformation along eigenvectors

p  There are two independent directions which are not twisted at all by the matrix M: (1 1) and (3 1)
p  one of them, (1 1), is flipped
p  We see less distortion if our box is oriented in the two special directions.

25

Page 26:

Results

p  Theorem: every square matrix has at least one eigenvector (over the complex numbers)
p  The usual situation is that an n x n matrix has n linearly independent eigenvectors
p  If there are n of them, they are a useful basis for R^n.
p  Unfortunately, it can happen that there are fewer than n of them.

26

Page 27:

Finding Eigenvectors

p  M v = λ v
n  v is an eigenvector and λ is an eigenvalue
p  If λ = 0, then finding eigenvectors is the same as finding nonzero vectors in the null space - possible iff det(M) = 0, i.e., the matrix is singular
p  If λ != 0, then finding the eigenvectors is equivalent to finding the null space of the matrix M – λ1 (1 is the identity matrix)
p  The matrix M – λ1 has a nonzero vector in its null space iff det(M – λ1) = 0
p  det(M – λ1) = 0 is called the characteristic equation.

27

Page 28:

Exercise

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix}$$

Find the eigenvalues and the eigenvectors of this matrix:

1)  Find the solutions λ of the characteristic equation (eigenvalues)
2)  Find the eigenvectors corresponding to the found eigenvalues.

28

Page 29:

Exercise Solution

p  det(M – λ1) = 0
n  (2 - λ)(-2 - λ) + 3 = λ² - 1
p  The solutions are +1 and -1

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix}$$

Find the eigenvalues and the eigenvectors of this matrix.

29

Page 30:

Exercise Solution

p  det(M – λ1) = 0
n  (2 - λ)(-2 - λ) + 3 = λ² - 1
p  The solutions are +1 and -1
p  Now we have to solve the set of linear equations
n  Mv=v (for the first eigenvalue)

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix}$$

Find the eigenvalues and the eigenvectors of this matrix.

30

Page 31:

Exercise Solution

p  det(M – λ1) = 0
n  (2 - λ)(-2 - λ) + 3 = λ² - 1
p  The solutions are +1 and -1
p  Now we have to solve the set of linear equations
n  Mv=v (for the first eigenvalue)

$$M = \begin{pmatrix} 2 & -3 \\ 1 & -2 \end{pmatrix} \qquad \begin{aligned} 2x - 3y &= x \\ x - 2y &= y \end{aligned}$$

n  Has solution x=3y, i.e., (3 1)t - and all vectors obtained multiplying this by a scalar.

Find the eigenvalues and the eigenvectors of this matrix.

31
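Assuming M = ((2, -3), (1, -2)), the matrix consistent with the characteristic polynomial λ² - 1 and the eigenpairs worked out above, the two eigenvector claims can be checked numerically:

```python
# Check, assuming M = ((2, -3), (1, -2)) as reconstructed from the slide:
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

M = [[2, -3], [1, -2]]
print(mat_vec(M, [1, 1]))   # [-1, -1] = -1 * (1 1)t  -> eigenvalue -1
print(mat_vec(M, [3, 1]))   # [3, 1]   = +1 * (3 1)t  -> eigenvalue +1
```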

Page 32:

Algorithm

p  To find the eigenvalues and eigenvectors of M:
n  First find the eigenvalues by solving the characteristic equation. Call the solutions λ1, ..., λn. (There is always at least one eigenvalue, and there are at most n of them.)
n  For each λk, consider the null space of M – λk1: the existence of a nonzero vector in this null space is guaranteed. Any such vector is an eigenvector.

32

Page 33:

Graphs

p  A directed graph G is a pair (V,E), where V is a finite set and E is a binary relation on V
n  V is the vertex set of G: it contains the vertices
n  E is the edge set of G: it contains the edges
p  In an undirected graph G=(V,E) the edges consist of unordered pairs of vertices
p  The in-degree of a vertex v (directed graph) is the number of edges entering v
p  The out-degree of a vertex v (directed graph) is the number of edges leaving v.

33

Page 34:

The Web as a Directed Graph

Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)

Assumption 2: The anchor of the hyperlink describes the target page (textual context)

[Figure: Page A links via a hyperlink, with anchor text, to Page B]

34

Page 35:

35

Ranking web pages

p  To count inlinks: enter link:www.mydomain.com in the Google search form
p  Web pages are not equally "important"
n  www.unibz.it vs. www.stanford.edu
n  Inlinks as votes
p  www.stanford.edu has 3200 inlinks
p  www.unibz.it has 352 inlinks (Feb 2013)
p  Are all inlinks equal?
n  Recursive question!

Page 36:

36

Simple recursive formulation

p  Each link’s vote is proportional to the importance of its source page

p  If page P with importance x has n outlinks, each link gets x/n votes

[Figure: a page with importance 1000 and three outlinks, each link carrying 333 votes]

Page 37:

37

Simple “flow” model

The web in 1839

[Figure: Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to Microsoft; Microsoft (m) links to Amazon; each page splits its importance evenly among its outlinks]

y = y /2 + a /2
a = y /2 + m
m = a /2

a, m, and y are the importance of these pages

Page 38:

38

Solving the flow equations

p  3 equations, 3 unknowns, no constants
n  No unique solution
n  If you multiply a solution by a constant (λ) you obtain another solution - try with (2 2 1)
p  Additional constraint forces uniqueness
n  y+a+m = 1 (normalization)
n  y = 2/5, a = 2/5, m = 1/5
n  These are the scores of the pages under the assumption of the flow model
p  Gaussian elimination works for small examples, but we need a better method for large graphs.

Page 39:

39

Matrix formulation

p  Matrix M has one row and one column for each web page (square matrix)
p  Suppose page i has n outlinks
n  If i links to j, then Mij = 1/n
n  Else Mij = 0
p  M is a row stochastic matrix
n  Rows sum to 1
p  Suppose r is a vector with one entry per web page
n  ri is the importance score of page i
n  Call it the rank vector

Page 40:

40

Example

$$(y\ a\ m) = (y\ a\ m)M \qquad M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{pmatrix}$$ (rows and columns ordered y, a, m)

y = y /2 + a /2
a = y /2 + m
m = a /2

Page 41:

41

Power Iteration Solution

With M as on the previous slide (rows and columns ordered y, a, m), repeatedly multiply by M:

(1/3 1/3 1/3)
(1/3 1/3 1/3)M = (1/3 1/2 1/6)
(1/3 1/2 1/6)M = (5/12 1/3 1/4)
(5/12 1/3 1/4)M = (3/8 11/24 1/6)
…
→ (2/5 2/5 1/5)

$$(y\ a\ m) = (y\ a\ m)M \qquad M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{pmatrix}$$
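The iteration above can be reproduced in a few lines of pure Python; a sketch, with M ordered (y, a, m) as on the slide:

```python
# Pure-Python power iteration on the three-page example (order: y, a, m).
M = [[0.5, 0.5, 0.0],   # Yahoo links to itself and Amazon
     [0.5, 0.0, 0.5],   # Amazon links to Yahoo and Microsoft
     [0.0, 1.0, 0.0]]   # Microsoft links to Amazon

r = [1/3, 1/3, 1/3]
for _ in range(50):
    # one step of r = rM (row vector times row-stochastic matrix)
    r = [sum(r[i] * M[i][j] for i in range(3)) for j in range(3)]

print([round(x, 3) for x in r])   # [0.4, 0.4, 0.2], i.e. (2/5, 2/5, 1/5)
```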

Page 42:

Example

42

Page 43:

States and probabilities

[Figure: two states, DRY and RAIN, with transition probabilities P(DRY|DRY)=0.85, P(RAIN|DRY)=0.15, P(DRY|RAIN)=0.38, P(RAIN|RAIN)=0.62]

43

Page 44:

Composing transitions

0.44 = 0.38*0.15 + 0.62*0.62. What kind of operation on the matrix is this?

44

Page 45:

Composing transitions

p  The probabilities of the 12-hour transitions are given by squaring the matrix of the 6-hour transition probabilities
n  P(rain-in-12hours|rain-now) = P(rain-in-12hours|rain-in-6hours)*P(rain-in-6hours|rain-now) + P(rain-in-12hours|dry-in-6hours)*P(dry-in-6hours|rain-now) = .62*.62 + .15*.38 = .44
n  P(dry-in-12hours|rain-now) = P(dry-in-12hours|rain-in-6hours)*P(rain-in-6hours|rain-now) + P(dry-in-12hours|dry-in-6hours)*P(dry-in-6hours|rain-now) = .38*.62 + .85*.38 = .56

$$\begin{pmatrix} .85 & .15 \\ .38 & .62 \end{pmatrix}\begin{pmatrix} .85 & .15 \\ .38 & .62 \end{pmatrix} = A^2 = \begin{pmatrix} .78 & .22 \\ .56 & .44 \end{pmatrix}$$ (rows and columns ordered dry, rain)

45

Page 46:

Behavior in the limit

$$A = \begin{pmatrix} .85 & .15 \\ .38 & .62 \end{pmatrix} \quad A^2 = \begin{pmatrix} .78 & .22 \\ .56 & .44 \end{pmatrix} \quad A^3 = \begin{pmatrix} .75 & .25 \\ .64 & .36 \end{pmatrix} \quad A^4 = \begin{pmatrix} .73 & .27 \\ .68 & .32 \end{pmatrix}$$

$$A^5 = \begin{pmatrix} .72 & .28 \\ .70 & .30 \end{pmatrix} \quad A^6 = \begin{pmatrix} .72 & .28 \\ .71 & .29 \end{pmatrix} \quad A^7 = \begin{pmatrix} .72 & .28 \\ .71 & .29 \end{pmatrix} \quad A^8 = \begin{pmatrix} .72 & .28 \\ .72 & .28 \end{pmatrix}$$

$$A^9 = \begin{pmatrix} .72 & .28 \\ .72 & .28 \end{pmatrix} \quad A^\infty = \begin{pmatrix} .72 & .28 \\ .72 & .28 \end{pmatrix}$$

46

Page 47:

Behavior in the limit

p  If a,b <= 1 and a+b = 1, i.e., (a b) is a generic state with probability a of being dry and b = 1-a of being rain, then

$$(a\ b)\,A^\infty = (a\ b)\begin{pmatrix} .72 & .28 \\ .72 & .28 \end{pmatrix} = (.72\ .28)$$

p  In particular (.72 .28)A = (.72 .28), i.e., it is a (left) eigenvector with eigenvalue 1
p  The eigenvector (.72 .28) represents the limit situation starting from a generic situation (a b): it is called the stationary distribution.

47
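The convergence of the matrix powers can be reproduced by repeated squaring; a pure-Python sketch (mat_mul is a hypothetical helper):

```python
# Repeated squaring of the weather matrix: the rows of A^(2^k) converge
# to the stationary distribution (.72, .28).
def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[0.85, 0.15],
     [0.38, 0.62]]

P = A
for _ in range(5):               # P = A^2, A^4, A^8, A^16, A^32
    P = mat_mul(P, P)

print([[round(x, 2) for x in row] for row in P])   # [[0.72, 0.28], [0.72, 0.28]]
```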

Page 48:

Exercise

p  Find one (left) eigenvector of the matrix below:
n  Solve first the characteristic equation (to find the eigenvalues)
n  and then find the left eigenvector corresponding to the largest eigenvalue

$$\begin{pmatrix} .85 & .15 \\ .38 & .62 \end{pmatrix}$$

48

Page 49:

Exercise Solution

p  Characteristic equation

$$\det\begin{pmatrix} .85-\lambda & .15 \\ .38 & .62-\lambda \end{pmatrix} = (0.85-\lambda)(0.62-\lambda) - 0.15 \cdot 0.38 = \lambda^2 - 1.47\lambda + 0.47$$

$$\lambda = \frac{1.47 \pm \sqrt{1.47^2 - 4 \cdot 0.47}}{2}$$

Solutions: λ = 1 and λ = 0.47

p  For λ = 1, the left eigenvector (x y) satisfies:

$$(x\ y)\begin{pmatrix} .85 & .15 \\ .38 & .62 \end{pmatrix} = (x\ y)$$

0.85x + 0.38y = x, with x + y = 1
0.85x + 0.38(1-x) = x, i.e., -0.53x + 0.38 = 0
x = 0.38/0.53 = 0.72, y = 1 - 0.72 = 0.28

49

Page 50:

Markov Chain

p  A Markov chain is a sequence X1, X2, X3, ... of random variables (with Σv P(X=v) = 1 over all possible values v) with the property:
p  Markov property: the conditional probability distribution of the next state Xn+1 given the present and past states is a function of the present state Xn alone
p  If the state space is finite then the transition probabilities can be described with a matrix Pij = P(Xn+1 = j | Xn = i), i,j = 1, ..., m

$$\begin{pmatrix} .85 & .15 \\ .38 & .62 \end{pmatrix} = \begin{pmatrix} P(X_{n+1}=1 \mid X_n=1) & P(X_{n+1}=2 \mid X_n=1) \\ P(X_{n+1}=1 \mid X_n=2) & P(X_{n+1}=2 \mid X_n=2) \end{pmatrix}$$

50

Page 51:

Example: Web

p  Xt is the page visited by a user (random surfer) at time t
p  At every time t the user can be in one among m pages (states)
p  We assume that when a user is on page i at time t, the probability of being on page j at time t+1 depends only on the fact that the user is on page i, and not on the pages previously visited.

51

Page 52:

Probabilities

[Figure: five pages/states P0, P1, P2, P3, P4 (Goal), with transition probabilities: P(P1|P0) = 1.0; P(P0|P1) = 0.05, P(P1|P1) = 0.1, P(P2|P1) = 0.4, P(P3|P1) = 0.3, P(P4|P1) = 0.15; P(P1|P2) = 1.0; P(P1|P3) = 0.5, P(P4|P3) = 0.5]

In this example there are 5 states, and the probability to jump from a page/state to another is not constant (it is not 1/(# of outlinks of the node)), as we had assumed before in the simple web graph.

This is not a Markov chain! (why?)

52

Page 53:

Examples

p  Pij = P(Xn+1 = j | Xn = i), i,j = 1, ..., m
p  (1, 0, 0, ..., 0) P = (P11, P12, P13, ..., P1n)
n  if at time n it is in state 1, then at time n+1 it is in state j with probability P1j, i.e., the first row of P gives the probabilities of being in the other states
p  (0.5, 0.5, 0, ..., 0) P = (P11·0.5 + P21·0.5, ..., P1n·0.5 + P2n·0.5)
n  this is the linear combination of the first two rows.

53

Page 54:

Stationary distribution

p  A stationary distribution is an m-dimensional vector (with entries summing to 1) which satisfies the equation:

$$\pi^T P = \pi^T$$

p  where π is a (column) vector and πT (row vector) is the transpose of π
p  A stationary distribution always exists, but is not guaranteed to be unique (can you make an example of a Markov chain with more than one stationary distribution?)
p  If there is only one stationary distribution then

$$\pi^T = \lim_{n\to\infty} x^T P^n$$

p  where x is a generic distribution over the m states (i.e., it is an m-dimensional vector whose entries are <= 1 and sum to 1)

54

Page 55:

Random Walk Interpretation

p  Imagine a random web surfer
n  At any time t, the surfer is on some page P
n  At time t+1, the surfer follows an outlink from P uniformly at random
n  Ends up on some page Q linked from P
n  Process repeats indefinitely
p  Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
n  p(t) is a probability distribution over pages

55

Page 56:

The stationary distribution

p  Where is the surfer at time t+1?
n  Follows a link uniformly at random
n  p(t+1) = p(t)M
p  Suppose the random walk reaches a state such that p(t+1) = p(t)M = p(t)
n  Then p(t) is a stationary distribution for the random walk
p  Our rank vector r = p(t) satisfies r = rM.

56

Page 57:

Ergodic Markov chains

p  A Markov chain is ergodic if:
n  Informally: there is a path from any state to any other, and the states are not partitioned into sets such that all state transitions occur cyclically from one set to another.
n  Formally: for any start state, after a finite transient time T0, the probability of being in any state at any fixed time T > T0 is nonzero.

[Figure: a chain that alternates between two groups of states is not ergodic (even/odd): the probability to be in a given state at a fixed time, e.g., after 500 transitions, is always either 0 or 1 according to the initial state.]

57

Page 58:

Ergodic Markov chains

p  For any ergodic Markov chain, there is a unique long-term visit rate for each state
n  Steady-state probability distribution
p  Over a long time period, we visit each state in proportion to this rate
p  It doesn't matter where we start.
p  Note: non-ergodic Markov chains may still have a steady state.

58

Page 59:

Non Ergodic Example

p  It is easy to show that the steady state (left eigenvector) is πT = (0 0 1), πTP = πT, i.e., it is state 3
p  The user will always reach state 3 and stay there (spider trap)
p  This is a non-ergodic Markov chain (with a steady state).

$$P = \begin{pmatrix} 0 & .5 & .5 \\ .2 & 0 & .8 \\ 0 & 0 & 1 \end{pmatrix}$$

[Figure: states 1, 2, 3 with edges 1→2 (0.5), 1→3 (0.5), 2→1 (0.2), 2→3 (0.8), 3→3 (1.0)]

59

Page 60:

Random teleports

p  The Google solution for spider traps (not for dead ends)
p  At each time step, the random surfer has two options:
n  With probability β, follow a link at random
n  With probability 1-β, jump to some page uniformly at random
n  Common values for β are in the range 0.8 to 0.9
p  The surfer will teleport out of a spider trap within a few time steps

60

Page 61:

Matrix formulation

p  Suppose there are N pages
n  Consider a page i, with set of outlinks O(i)
n  We have
p  Mij = 1/|O(i)| when i links to j
p  and Mij = 0 otherwise
n  The random teleport is equivalent to
p  adding a teleport link from i to every other page with probability (1-β)/N
p  reducing the probability of following each outlink from 1/|O(i)| to β/|O(i)|
p  Equivalent: tax each page a fraction (1-β) of its score and redistribute evenly.

61

Page 62:

Example

p  Simple example with 6 pages

[Figure: a web graph over pages 1-6; page 1 links to pages 2, 3, 4, 5]

p  P(5|1) = P(4|1) = P(3|1) = P(2|1) = β/4 + (1-β)/6
p  P(1|1) = P(6|1) = (1-β)/6
p  P(*|1) = 4[β/4 + (1-β)/6] + 2(1-β)/6 = 1

62

Page 63:

Google Page Rank

p  Construct the NxN matrix A as follows
n  Aij = βMij + (1-β)/N
p  Verify that A is a stochastic matrix
p  The PageRank vector r is the principal eigenvector of this matrix
n  satisfying r = rA
n  The score of each page ri satisfies:

$$r_i = \sum_{k \in I(i)} \beta \frac{r_k}{|O(k)|} + \frac{1-\beta}{N}$$

p  I(i) is the set of nodes that have a link to page i
p  O(k) is the set of links exiting from k
p  r is the stationary distribution of the random walk with teleports.

63

Page 64:

Example

β = 0.85

$$A = \begin{pmatrix} 0.03 & 0.24 & 0.24 & 0.24 & 0.24 & 0.03 \\ 0.03 & 0.03 & 0.45 & 0.03 & 0.03 & 0.45 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.88 & 0.03 \\ 0.03 & 0.88 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 & 0.88 \\ 0.03 & 0.03 & 0.03 & 0.88 & 0.03 & 0.03 \end{pmatrix}$$

P(4|1) = 0.24 = 0.85/4 + 0.15/6
P(6|1) = 0.03 = 0.15/6
P(4|6) = 0.88 = 0.85/1 + 0.15/6

$$A^{30} = \begin{pmatrix} 0.03 & 0.23 & 0.13 & 0.24 & 0.14 & 0.24 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0.03 & 0.23 & 0.13 & 0.24 & 0.14 & 0.24 \end{pmatrix}$$ (all six rows identical)

Stationary distribution = (0.03 0.23 0.13 0.24 0.14 0.24)

64

Page 65

Dead ends

p  Pages with no outlinks are "dead ends" for the random surfer (dangling nodes)
n  nowhere to go on the next step
p  When there are dead ends the matrix is no longer stochastic (the sum of the row elements is not 1)
p  This is true even if we add the teleport
n  because each teleport link is followed with probability only (1-β)/N, and there are just N of these teleports, so a dead-end row sums to only (1-β)

Page 66

Dealing with dead-ends

p  1) Teleport
n  Follow random teleport links with probability 1.0 from dead ends (i.e., for those pages set β = 0)
n  Adjust the matrix accordingly
p  2) Prune and propagate
n  Preprocess the graph to eliminate dead ends
n  Might require multiple passes (why? removing a dead end can turn its predecessors into new dead ends)
n  Compute page rank on the reduced graph
n  Approximate values for dead ends by propagating values from the reduced graph
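Fix 1 can be sketched as follows. The three-page graph with a dangling page is hypothetical, chosen only to show the adjustment:

```python
import numpy as np

# Hypothetical 3-page graph: page 2 has no outlinks (a dead end)
outlinks = {0: [1, 2], 1: [2], 2: []}
N, beta = 3, 0.85

A = np.zeros((N, N))
for i, dests in outlinks.items():
    if dests:
        # ordinary page: beta-weighted outlinks plus the usual teleport
        for j in dests:
            A[i, j] += beta / len(dests)
        A[i] += (1 - beta) / N
    else:
        # dead end: follow a teleport link with probability 1 (i.e. beta = 0)
        A[i] = 1.0 / N

assert np.allclose(A.sum(axis=1), 1.0)   # A is stochastic again
```

Without the dead-end branch, row 2 would sum to only (1-β) = 0.15, exactly as the previous slide describes.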

Page 67

Computing page rank

p  Key step is a matrix-vector multiply
n  rnew = rold A
p  Easy if we have enough main memory to hold A, rold, rnew
p  Say N = 1 billion pages
n  We need 4 bytes (32 bits) for each entry (say)
n  2 billion entries for the vectors rnew and rold, approx. 8 GB
n  Matrix A has N^2 entries, i.e., 10^18
p  it is a large number!
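The arithmetic behind these estimates, spelled out:

```python
# Back-of-the-envelope check of the numbers above:
# N = 1 billion pages, 4 bytes per entry.
N = 10**9
vector_bytes = 2 * N * 4      # r_new and r_old together: 8 * 10^9 bytes
matrix_entries = N ** 2       # dense A: 10^18 entries

print(vector_bytes / 2**30)   # roughly 7.5 GiB ("approx 8 GB")
print(matrix_entries)         # 1000000000000000000
```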

Page 68

Sparse matrix formulation

p  Although A is a dense matrix, it is obtained from a sparse matrix M
n  ~10 links per node, approx. 10N nonzero entries
p  We can restate the page rank equation
n  r = βrM + [(1-β)/N]N (see slide 63)
n  [(1-β)/N]N is an N-vector with all entries (1-β)/N
p  So in each iteration, we need to:
n  Compute rnew = βroldM
n  Add a constant value (1-β)/N to each entry in rnew
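The two steps of each iteration can be sketched directly on the adjacency lists, never materializing the dense N×N matrix A. The graph is again the six-page example (the outlink lists are an assumption):

```python
import numpy as np

# Six-page example graph, 0-indexed (an assumption based on slide 64)
outlinks = {0: [1, 2, 3, 4], 1: [2, 5], 2: [4], 3: [1], 4: [5], 5: [3]}
N, beta = 6, 0.85

def iterate(r_old):
    # step 2 first: the constant teleport term (1-beta)/N in every entry
    r_new = np.full(N, (1 - beta) / N)
    # step 1: r_new += beta * r_old * M, touching only the ~10N sparse entries
    for i, dests in outlinks.items():
        for j in dests:
            r_new[j] += beta * r_old[i] / len(dests)
    return r_new

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = iterate(r)
```

With no dead ends the total score stays exactly 1, which is what justifies writing the teleport part as the constant (1-β)/N.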

Page 69

Sparse matrix encoding

p  Encode the sparse matrix using only its nonzero entries
n  Space roughly proportional to the number of links
n  say 10N entries, or 4 * 10 * 1 billion = 40 GB
n  still won't fit in memory, but will fit on disk

source node | degree | destination nodes
     0      |   3    | 1, 5, 7
     1      |   5    | 17, 64, 113, 117, 245
     2      |   2    | 13, 23

Page 70

Basic Algorithm

p  Assume we have enough RAM to fit rnew, plus some working memory
n  Store rold and matrix M on disk

Basic Algorithm:
p  Initialize: rold = [1/N]N
p  Iterate:
n  Update: perform a sequential scan of M and rold and update rnew
n  Write out rnew to disk as rold for the next iteration
n  Every few iterations, compute |rnew - rold| and stop if it is below a threshold
p  Computing this difference needs to read both vectors into memory

Page 71

Update step

src | degree | destinations
 0  |   3    | 1, 5, 6
 1  |   4    | 17, 64, 113, 117
 2  |   2    | 13, 23

[Figure: the rold and rnew vectors, indexed 0…6]

Initialize all entries of rnew to (1-β)/N
For each page p (out-degree n):
    Read into memory: p, n, dest1, …, destn, rold(p)
    for j = 1…n: rnew(destj) += β·rold(p)/n

The old value in 0 contributes to updating only the new values in 1, 5, and 6.
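The update pass above can be sketched as follows. Page 0's record matches the slide (out-degree 3, destinations 1, 5, 6); the remaining records are made up so that the 7-page example is self-contained:

```python
import numpy as np

N, beta = 7, 0.85
records = [                 # (source page p, out-degree n, destinations)
    (0, 3, [1, 5, 6]),      # as on the slide
    (1, 2, [0, 2]),         # the rest are hypothetical
    (2, 2, [3, 4]),
    (3, 1, [0]),
    (4, 1, [6]),
    (5, 1, [0]),
    (6, 1, [5]),
]

r_old = np.full(N, 1.0 / N)            # would live on disk at full scale
r_new = np.full(N, (1 - beta) / N)     # initialize all entries to (1-beta)/N
for p, n, dests in records:            # sequential scan of M and r_old
    for dest in dests:                 # j = 1..n
        r_new[dest] += beta * r_old[p] / n

assert np.isclose(r_new.sum(), 1.0)    # no dead ends: total mass preserved
```

Each record distributes β·rold(p) across its n destinations, so after the pass the total score is (1-β) + β·Σ rold = 1.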