
Sta306b May 11, 2012 PageRank

Web search before Google

(Taken from Page et al. (1999), “The PageRank Citation Ranking:

Bringing Order to the Web”.)


Web search and Google’s PageRank algorithm

• Idea is to rank webpages by “importance”: webpages that have

many pointers from other pages are more important

• Suppose we have n webpages. The PageRank of webpage i is based on its linking webpages (webpages j that link to i). But we don’t just count the number of linking webpages, i.e., not all linking webpages are treated equally. Instead, we weight the links from different webpages. There are two main ideas:

– Webpages that link to i, and have high PageRank scores

themselves, should be given more weight.

– Webpages that link to i, but link to a lot of other webpages in

general, should be given less weight.


FlawedRank (almost PageRank)

Let $L_{ij} = 1$ if webpage j links to webpage i (written j → i), and $L_{ij} = 0$ otherwise. Also let $c_j = \sum_{k=1}^{n} L_{kj}$, the total number of webpages that j links to.

We’re going to define something that’s almost PageRank, but not quite, because it’s flawed. The FlawedRank $p_i$ of webpage i satisfies

$$p_i = \sum_{j \to i} \frac{p_j}{c_j} = \sum_{j=1}^{n} \frac{L_{ij}}{c_j}\, p_j .$$

Does this match our ideas from the last slide? Yes: for j → i, the weight is $p_j/c_j$. This increases with $p_j$, but decreases with $c_j$.

In matrix notation: the FlawedRank vector p is defined by $p = L D_c^{-1} p$, where $D_c$ is the diagonal matrix with diagonal elements $c_j$.
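As a minimal sketch of the matrix form, the snippet below builds $A = L D_c^{-1}$ with NumPy for a small three-page web and checks a candidate FlawedRank vector. The graph, the variable names, and the helper `flawedrank_residual` are invented for illustration, and the code assumes every page has at least one outgoing link.

```python
import numpy as np

# Hypothetical 3-page web: L[i, j] = 1 if page j links to page i.
# Links (pages numbered from 1): 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
L = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
], dtype=float)

# c[j] = number of pages that page j links to (column sums of L).
c = L.sum(axis=0)

# A = L D_c^{-1}: divide column j of L by c[j].
# This assumes every page has at least one outgoing link (c[j] > 0).
A = L / c

def flawedrank_residual(p):
    """Return the largest violation of p = A p."""
    return np.max(np.abs(A @ p - p))

# Candidate FlawedRank vector for this graph, found by hand.
p = np.array([0.4, 0.2, 0.4])
print(flawedrank_residual(p))
```

Each column of `A` sums to one here, which is what lets the next slides read it as a transition matrix.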


FlawedRank as a Markov chain

You can think of a Markov chain as a random process that moves between states numbered 1, . . . , n (each step of the process is one move). Recall that a Markov chain with an n × n transition matrix P satisfies P(go from i to j) = $P_{ij}$.

Suppose $p^{(0)}$ is an n-dimensional vector giving us the probability of being in each state to begin with. After one step, the probability of being in each state is given by $p^{(1)} = P^T p^{(0)}$.

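Now consider a Markov chain, with the states as webpages, and with transition matrix $A^T$, where $A = L D_c^{-1}$. Note that $(A^T)_{ij} = A_{ji} = L_{ji}/c_i$, so we can describe the chain as

$$P(\text{go from } i \text{ to } j) = \begin{cases} 1/c_i & \text{if } i \to j \\ 0 & \text{otherwise.} \end{cases}$$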


This is like a random surfer, i.e., a person surfing the web by

clicking on links uniformly at random.


Stationary distributions

A stationary distribution of our Markov chain is a probability

vector p (i.e., its entries are ≥ 0 and sum to 1) with p = Ap. This

means that the distribution after one step of the Markov chain is

unchanged. Note that this is exactly what we’re looking for: an

eigenvector of A corresponding to eigenvalue 1!

If the Markov chain is strongly connected, meaning that any state

can be reached from any other state, then the stationary

distribution p exists and is unique. Furthermore, we can think of

the stationary distribution as the proportions of visits the chain

pays to each state after a very long time (the ergodic theorem):

$$p_i = \lim_{t \to \infty} \frac{\#\text{ of visits to state } i \text{ in } t \text{ steps}}{t}.$$


Our interpretation: the FlawedRank $p_i$ of webpage i is the proportion of time our random surfer spends on webpage i if we let him go forever.
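The random-surfer interpretation can be checked by simulation. The sketch below is only an illustration: the three-page graph (0-based indices), the function name `surf`, and the step count are assumptions, and the graph is chosen to be strongly connected so that the visit proportions settle down.

```python
import random

# Hypothetical strongly connected 3-page web (0-based page indices).
# out_links[i] lists the pages that page i links to.
out_links = {0: [1, 2], 1: [2], 2: [0]}

def surf(n_steps, start=0, seed=0):
    """Follow uniformly random links and return the fraction of visits per page."""
    rng = random.Random(seed)
    visits = [0] * len(out_links)
    page = start
    for _ in range(n_steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return [v / n_steps for v in visits]

proportions = surf(200_000)
print(proportions)
```

For this graph the equation p = Ap has the solution (0.4, 0.2, 0.4), so the simulated proportions should land close to those values, up to sampling noise.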


Why is FlawedRank flawed?

There’s a problem here. Our Markov chain is not strongly connected, in three cases (at least):

[Figure: three example graphs, labeled “Disconnected components”, “Dangling links”, and “Loops”.]

Actually, even for Markov chains that are not strongly connected, a stationary distribution always exists, but may be nonunique.


In other words, the FlawedRank vector p exists but is ambiguously

defined.


FlawedRank example

Here

$$A = L D_c^{-1} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}.$$


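Here there are two eigenvectors of A with eigenvalue 1:

$$p = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}, 0, 0\right)^T \quad \text{and} \quad p = \left(0, 0, 0, \tfrac{1}{2}, \tfrac{1}{2}\right)^T.$$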

These are totally opposite rankings!
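A quick numerical check of this ambiguity, as a sketch with NumPy: both vectors satisfy p = Ap, and so does any mixture of them. The names `p_a`, `p_b`, `p_mix` and the weight `w` are illustrative choices.

```python
import numpy as np

# A = L D_c^{-1} from the FlawedRank example: a 3-cycle and a separate 2-cycle.
A = np.array([
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

p_a = np.array([1/3, 1/3, 1/3, 0, 0])
p_b = np.array([0, 0, 0, 1/2, 1/2])

# Any mixture w * p_a + (1 - w) * p_b with 0 <= w <= 1 is also a
# probability vector satisfying p = A p.
w = 0.7
p_mix = w * p_a + (1 - w) * p_b

for p in (p_a, p_b, p_mix):
    print(np.allclose(A @ p, p), p.sum())
```

So the two disconnected components give a whole family of valid FlawedRank vectors, not just two.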


PageRank

• The Google PageRanks $p_i$ are defined by the recursive relationship

$$p_i = (1-d) + d \sum_{j=1}^{n} \frac{L_{ij}}{c_j}\, p_j \qquad (1)$$

where d is a positive constant (apparently set to 0.85).

• The first term fixes the problem in FlawedRank.


Google’s PageRank algorithm- ctd

• Idea is that the importance of page i is the sum of the importances of pages that point to i. The sums are weighted by $1/c_j$, i.e. each page distributes a total vote of 1 to other pages.

• The constant d ensures that each page gets a PageRank of at least 1 − d.

• In matrix notation

$$p = (1-d)\,e + d \cdot L D_c^{-1} p \qquad (2)$$

where e is a vector of n ones and $D_c$ is a diagonal matrix with diagonal elements $c_j$.

• Now from (2) we have $e^T p = n$ (i.e. the average PageRank is 1), so we can write (2) as $p = \left[(1-d)\,e e^T/n + d\,L D_c^{-1}\right] p = Ap$, where the matrix A is the expression in square brackets.
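The step from (2) to p = Ap can be spelled out as follows. This is a sketch of the reasoning under the assumption that every page has at least one outgoing link, so that each column of $L D_c^{-1}$ sums to one:

```latex
\begin{align*}
e^T p &= (1-d)\, e^T e + d\, e^T L D_c^{-1} p \\
      &= (1-d)\, n + d\, e^T p
      && \text{since } e^T L D_c^{-1} = e^T \text{ when all } c_j > 0, \\
\Rightarrow\quad (1-d)\, e^T p &= (1-d)\, n
      \quad\Rightarrow\quad e^T p = n
      && \text{for } d \neq 1, \\
(1-d)\, e &= (1-d)\, e\, \frac{e^T p}{n} = \frac{(1-d)\, e e^T}{n}\, p .
\end{align*}
```

Substituting the last line into (2) gives the bracketed matrix A.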


Google’s PageRank algorithm- ctd

• It turns out that A has a real eigenvalue equal to 1, so that we can find p̂ by the power method: starting with some $p = p_0$ we iterate

$$p_k \leftarrow A p_{k-1}; \qquad p_k \leftarrow n\, p_k / (e^T p_k).$$

The fixed points p̂ are the desired PageRanks.
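The iteration above can be sketched in a few lines of NumPy. The function name `pagerank_power`, the starting vector, the tolerance, and the iteration cap are assumptions made for illustration; the code forms A densely and assumes every page has at least one outgoing link. It is run on the five-page FlawedRank example from earlier.

```python
import numpy as np

def pagerank_power(L, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for p = A p with A = (1 - d) e e^T / n + d L D_c^{-1}.

    Assumes every page has at least one outgoing link (all column sums > 0).
    """
    n = L.shape[0]
    c = L.sum(axis=0)
    A = (1 - d) * np.ones((n, n)) / n + d * (L / c)
    p = np.arange(1, n + 1, dtype=float)   # arbitrary positive start
    p *= n / p.sum()
    for _ in range(max_iter):
        p_new = A @ p
        p_new *= n / p_new.sum()           # p_k <- n p_k / (e^T p_k)
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Link matrix of the earlier five-page FlawedRank example (all c_j = 1).
L = np.array([
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

p_hat = pagerank_power(L)
print(p_hat / p_hat.sum())
```

On this example the iteration should settle on equal PageRanks for all five pages, whatever positive starting vector is used.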


The Random Surfer Model

• The original paper of Page and Brin considered PageRank as a model of user behaviour, where a surfer clicks on links at random with no regard towards content.

• The surfer does a random walk on the web, choosing among available outgoing links at random. The factor 1 − d is the probability that he does not click on a link but jumps instead to a random webpage.

• Some descriptions of PageRank have (1 − d)/n as the first term in definition (1), which would better coincide with the random surfer interpretation. Then the PageRank solution (normalized) is the stationary distribution of an irreducible, aperiodic Markov chain over the n webpages.


The Random Surfer Model- continued

• Definition (1) also corresponds to a Markov chain, with different transition probabilities than those from the (1 − d)/n version.

• Viewing PageRank as a Markov chain makes clear why the matrix A has a real eigenvalue of 1. Since A has positive entries with columns summing to one, it has a unique eigenvector with eigenvalue 1, corresponding to the stationary distribution of the chain (see text pages 577-578).
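To see the uniqueness claim numerically, the sketch below compares eigenvalue moduli for the five-page FlawedRank matrix and for the corresponding PageRank matrix A with d = 0.85. The variable names are illustrative, and `np.linalg.eigvals` is used only as a convenient check, not as a method for web-scale problems.

```python
import numpy as np

d = 0.85
# Column-stochastic L D_c^{-1} from the five-page FlawedRank example.
M = np.array([
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
n = M.shape[0]

# A = (1 - d) e e^T / n + d L D_c^{-1}: every entry is positive.
A = (1 - d) * np.ones((n, n)) / n + d * M

# Moduli of the eigenvalues, largest first.
moduli_flawed = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
moduli_pagerank = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
print(moduli_flawed)
print(moduli_pagerank)
```

For the FlawedRank matrix several eigenvalues have modulus 1, while for A only one does and the rest are shrunk by the factor d.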


Google’s PageRank- example

[Figure: directed graph of a small network of four webpages, labeled Page 1 to Page 4.]


$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad c = (2, 1, 1, 1)$$

Solution: p̂ = (1.49, .78, 1.58, .15)

Notice that page 4 has no incoming links, and hence gets the

minimum PageRank of 0.15.
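For a network this small, the solution can be reproduced by solving the linear system directly instead of iterating. The sketch below rearranges the matrix form of the PageRank definition into $(I - d\,L D_c^{-1})\,p = (1-d)\,e$ and assumes d = 0.85; the variable names are illustrative.

```python
import numpy as np

d = 0.85
L = np.array([
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
], dtype=float)
n = L.shape[0]
c = L.sum(axis=0)            # number of outgoing links per page: (2, 1, 1, 1)

# Rearranged matrix form: (I - d L D_c^{-1}) p = (1 - d) e.
p_hat = np.linalg.solve(np.eye(n) - d * (L / c), (1 - d) * np.ones(n))
print(np.round(p_hat, 2))
```

The entries sum to n = 4, so the average PageRank is 1, in line with the normalization used earlier.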


Exercise

Check that PageRank fixes the problem in the earlier FlawedRank

example, producing the normalized solution

p̂ = c(.20, .20, .20, .20, .20)


Using PageRank for web search

For a basic web search, given a query, we could do the following:

1. Compute the PageRank vector p once. (Google recomputes this

from time to time, to stay current.)

2. Find the documents containing all words in the query.

3. Sort these documents by PageRank, and return the top k (e.g. k =

50).

This is a little too simple ... but we can use similarity scores, changing

the above to:

3. Sort these documents by PageRank, and keep only the top K (e.g. K

= 5000).

4. Sort by similarity to the query and return the top k (e.g. k = 50).

Google uses a combination of PageRank, similarity scores, and other

techniques (it’s proprietary!)
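The refined procedure (filter, shortlist by PageRank, then sort by similarity) can be sketched as below. Everything here is a toy stand-in: the function `search`, the word-overlap similarity, the documents, and their PageRank scores are invented for illustration and say nothing about how Google actually ranks results.

```python
def search(query, documents, pagerank, K=5000, k=50):
    """Two-stage ranking: filter by query words, shortlist by PageRank, sort by similarity.

    documents: dict mapping doc id -> text; pagerank: dict mapping doc id -> score.
    """
    words = set(query.lower().split())

    # Step 2: keep documents containing all words in the query.
    matches = [doc for doc, text in documents.items()
               if words <= set(text.lower().split())]

    # Step 3: keep only the top K matches by PageRank.
    shortlist = sorted(matches, key=lambda doc: pagerank[doc], reverse=True)[:K]

    # Step 4: sort by similarity to the query. The similarity here is a
    # stand-in (fraction of the document's words that are query words).
    def similarity(doc):
        tokens = documents[doc].lower().split()
        return sum(t in words for t in tokens) / max(len(tokens), 1)

    return sorted(shortlist, key=similarity, reverse=True)[:k]

# Toy data, invented for illustration.
documents = {
    "a": "pagerank ranks web pages",
    "b": "web pages about cooking and web recipes for busy people",
    "c": "web pages",
    "d": "markov chains",
}
pagerank = {"a": 1.5, "b": 0.9, "c": 0.4, "d": 1.2}
results = search("web pages", documents, pagerank, K=2, k=2)
print(results)
```

With K = 2, document "c" is cut at the PageRank stage even though it is the closest textual match, which shows how the choice of K trades importance against relevance.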


More recent work

Following the invention of PageRank, there has been a huge amount of work to improve/extend PageRank, and not only at Google! There are many, many academic papers too. Here are a few:

• Intelligent surfing: pointing the surfer towards webpages that

are textually relevant. (Richardson and Domingos (2002), “The

Intelligent Surfer: Probabilistic Combination of Link and

Content Information in PageRank”.)

• TrustRank: pointing the surfer away from spam. (Gyongyi et

al. (2004), “Combating Web Spam with TrustRank”.)

• PigeonRank: pigeons, the real reason for Google’s success.

(http://www.google.com/onceuponatime/technology/pigeonrank.html.)


Computational issues

• How can we perform each iteration quickly (multiply by A quickly)?

Use the sparsity of the web graph.

• How many iterations does it take (generally) to get a reasonable answer?

Not very many if A has a large spectral gap (difference between its first and second largest absolute eigenvalues).
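One way to exploit sparsity, sketched below, is to never form A at all: store only the list of links and compute Ap in time proportional to the number of pages plus the number of links. The function name `multiply_by_A` and the edge-list layout are assumptions for illustration, and the code assumes every page has at least one outgoing link. The links are those of the four-page example, with 0-based indices.

```python
import numpy as np

def multiply_by_A(src, dst, p, d=0.85):
    """Compute A p for A = (1 - d) e e^T / n + d L D_c^{-1} without forming A.

    src[k] -> dst[k] is the k-th link (0-based page indices).
    Assumes every page has at least one outgoing link.
    """
    n = p.size
    c = np.bincount(src, minlength=n)                   # out-degree of each page
    # (L D_c^{-1} p)_i = sum over links j -> i of p_j / c_j
    link_part = np.bincount(dst, weights=p[src] / c[src], minlength=n)
    # (e e^T / n) p is the constant vector with entries mean(p)
    return (1 - d) * p.mean() + d * link_part

# Links of the four-page example: 1->2, 1->3, 2->3, 3->1, 4->3 (0-based below).
src = np.array([0, 0, 1, 2, 3])
dst = np.array([1, 2, 2, 0, 2])
p = np.ones(4)
for _ in range(200):
    p = multiply_by_A(src, dst, p)
print(np.round(p, 2))
```

Because the columns of A sum to one, the iterates keep the sum n, so no renormalization step is needed in this sketch.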


Software

See igraph package in R