Sta306b, May 11, 2012: PageRank
Web search before Google
(Taken from Page et al. (1999), “The PageRank Citation Ranking:
Bringing Order to the Web”.)
Web search and Google’s PageRank algorithm
• Idea is to rank webpages by “importance”: webpages that have
many pointers from other pages are more important
• Suppose we have n webpages. The PageRank of webpage i is based on
its linking webpages (webpages j that link to i). But we don’t just
count the number of linking webpages; not all linking webpages
are treated equally. Instead, we weight the links from different
webpages. There are two main ideas:
– Webpages that link to i, and have high PageRank scores
themselves, should be given more weight.
– Webpages that link to i, but link to a lot of other webpages in
general, should be given less weight.
FlawedRank (almost PageRank)
Let L_{ij} = 1 if webpage j links to webpage i (written j → i), and L_{ij} = 0
otherwise. Also let c_j = \sum_{k=1}^n L_{kj}, the total number of webpages that j
links to.
We’re going to define something that’s almost PageRank, but not quite,
because it’s flawed. The FlawedRank p_i of webpage i satisfies

    p_i = \sum_{j → i} p_j / c_j = \sum_{j=1}^n (L_{ij}/c_j) p_j.

Does this match our ideas from the last slide? Yes: for j → i, the weight
is p_j/c_j. This increases with p_j, but decreases with c_j.
In matrix notation: the FlawedRank vector p is defined by p = L D_c^{-1} p.
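The fixed point p = L D_c^{-1} p can be found by simple iteration. Here is a minimal Python sketch (not from the slides); the 3-page link graph is a made-up example:

```python
# FlawedRank sketch: p = L D_c^{-1} p, found by repeated multiplication.
# Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
# L[i][j] = 1 if page j links to page i.
L = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
n = len(L)
c = [sum(L[k][j] for k in range(n)) for j in range(n)]  # number of pages j links to

p = [1.0 / n] * n                       # start from the uniform distribution
for _ in range(200):                    # iterate p <- L D_c^{-1} p
    p = [sum(L[i][j] / c[j] * p[j] for j in range(n)) for i in range(n)]

print([round(x, 3) for x in p])  # → [0.4, 0.2, 0.4]
```

Each column of L D_c^{-1} sums to one, so p stays a probability vector throughout; for this (strongly connected, aperiodic) example the iteration converges to a unique answer.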
FlawedRank as a Markov chain
You can think of a Markov chain as a random process that moves
between states numbered 1, ..., n (each step of the process is one
move). Recall that a Markov chain with an n × n transition
matrix P satisfies P(go from i to j) = P_{ij}.
Suppose p^{(0)} is an n-dimensional vector giving the probability of
being in each state to begin with. After one step, the probability of
being in each state is given by p^{(1)} = P^T p^{(0)}.
Now consider a Markov chain with the states as webpages and
with transition matrix A^T, where A = L D_c^{-1}. Note that
(A^T)_{ij} = A_{ji} = L_{ji}/c_i, so we can describe the chain as

    P(go from i to j) = 1/c_i if i → j, and 0 otherwise.
This is like a random surfer, i.e., a person surfing the web by
clicking on links uniformly at random.
Stationary distributions
A stationary distribution of our Markov chain is a probability
vector p (i.e., its entries are ≥ 0 and sum to 1) with p = Ap. This
means that the distribution after one step of the Markov chain is
unchanged. Note that this is exactly what we’re looking for: an
eigenvector of A corresponding to eigenvalue 1!
If the Markov chain is strongly connected, meaning that any state
can be reached from any other state, then the stationary
distribution p exists and is unique. Furthermore, we can think of
the stationary distribution as the proportions of visits the chain
pays to each state after a very long time (the ergodic theorem):
    p_i = \lim_{t → ∞} (number of visits to state i in t steps) / t.
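The ergodic theorem can be illustrated by simulation. A minimal sketch (not from the slides), on a hypothetical 3-page graph whose stationary distribution works out to (0.4, 0.2, 0.4):

```python
import random

# Simulate the random surfer on a made-up 3-page graph
# (1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1) and compare the proportion of visits
# to each page with the stationary distribution (0.4, 0.2, 0.4).
links = {0: [1, 2], 1: [2], 2: [0]}   # outgoing links of each page

random.seed(0)
t = 200_000
visits = [0, 0, 0]
state = 0
for _ in range(t):
    visits[state] += 1
    state = random.choice(links[state])  # click an outgoing link uniformly at random

proportions = [v / t for v in visits]
print([round(x, 2) for x in proportions])
```

With 200,000 steps the empirical proportions land within about a hundredth of the stationary distribution.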
Our interpretation: the FlawedRank p_i of webpage i is the proportion of
time our random surfer spends on webpage i if we let him go forever.
Why is FlawedRank flawed?
There’s a problem here. Our Markov chain is not strongly
connected, in three cases (at least):
[Figure: disconnected components, dangling links, loops]
Actually, even for Markov chains that are not strongly connected, a
stationary distribution always exists, but it may be nonunique.
In other words, the FlawedRank vector p exists but is ambiguously
defined.
FlawedRank example
Here

    A = L D_c^{-1} =
    [ 0 0 1 0 0
      1 0 0 0 0
      0 1 0 0 0
      0 0 0 0 1
      0 0 0 1 0 ].
Here there are two eigenvectors of A with eigenvalue 1:

    p = (1/3, 1/3, 1/3, 0, 0)    and    p = (0, 0, 0, 1/2, 1/2).

These are totally opposite rankings!
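A quick check (not from the slides) that both vectors really are fixed points p = Ap:

```python
# Check that both vectors from the example are fixed points p = A p
# of the 5-page FlawedRank matrix.
A = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

def apply(A, p):
    # Matrix-vector product A p.
    return [sum(A[i][j] * p[j] for j in range(len(p))) for i in range(len(p))]

p1 = [1/3, 1/3, 1/3, 0, 0]
p2 = [0, 0, 0, 1/2, 1/2]
print(apply(A, p1) == p1, apply(A, p2) == p2)  # → True True
```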
PageRank
• The Google PageRanks p_i are defined by the recursive relationship

    p_i = (1 - d) + d \sum_{j=1}^n (L_{ij}/c_j) p_j     (1)

where d is a positive constant (apparently set to 0.85).
• The first term fixes the problem in FlawedRank.
Google’s PageRank algorithm, ctd.
• Idea is that the importance of page i is the sum of the importances
of pages that point to i. The sums are weighted by 1/c_j, i.e., each
page distributes a total vote of 1 to other pages.
• The constant d ensures that each page gets a PageRank of at least
1 - d.
• In matrix notation,

    p = (1 - d) e + d L D_c^{-1} p     (2)

where e is a vector of n ones and D_c is a diagonal matrix with
diagonal elements c_j.
• Now from (2) we have e^T p = n (i.e., the average PageRank is 1), so
we can write (2) as p = [(1 - d) e e^T / n + d L D_c^{-1}] p = A p, where
the matrix A is the expression in square brackets.
Google’s PageRank algorithm, ctd.
• It turns out that A has a real eigenvalue equal to 1, so we
can find p̂ by the power method: starting with some p = p_0 we
iterate

    p_k ← A p_{k-1};   p_k ← n p_k / (e^T p_k).

The fixed point p̂ gives the desired PageRanks.
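The power iteration above can be sketched as follows (not from the slides); the 3-page link matrix is a made-up example:

```python
# Power method for PageRank: iterate p <- A p, then rescale so the entries
# average to 1 (e^T p = n). Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
d = 0.85
L = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
n = len(L)
c = [sum(L[k][j] for k in range(n)) for j in range(n)]

p = [1.0] * n
for _ in range(100):
    # A p = (1 - d) e e^T p / n + d L D_c^{-1} p; since e^T p = n after
    # rescaling, the first term is just (1 - d) in each coordinate.
    p_new = [(1 - d) + d * sum(L[i][j] / c[j] * p[j] for j in range(n))
             for i in range(n)]
    s = sum(p_new)
    p = [n * x / s for x in p_new]      # rescale so that e^T p = n

print([round(x, 3) for x in p])
```

For this graph the iteration settles to roughly (1.163, 0.644, 1.192), which averages to 1 as the slide's normalization requires.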
The Random Surfer Model
• Original paper of Page and Brin considered PageRank as a
model of user behaviour, where a surfer clicks on links at
random with no regard towards content.
• Surfer does a random walk on the web, choosing among
available outgoing links at random. The factor 1 - d is the
probability that he does not click on a link but instead jumps to
a random webpage.
• Some descriptions of PageRank have (1 - d)/n as the first term
in definition (1), which would better coincide with the random
surfer interpretation. Then the PageRank solution (normalized)
is the stationary distribution of an irreducible, aperiodic Markov
chain over the n webpages.
The Random Surfer Model, continued
• Definition (1) also corresponds to a Markov chain, with
different transition probabilities than those from the (1 - d)/n
version.
• Viewing PageRank as a Markov chain makes clear why the
matrix A has a real eigenvalue of 1. Since A has positive
entries with columns summing to one, it has a unique
eigenvector with eigenvalue 1, corresponding to the stationary
distribution of the chain (see text pages 577-578).
Google’s PageRank, example
[Figure: a web graph on Pages 1-4]
L =
    [ 0 0 1 0
      1 0 0 0
      1 1 0 1
      0 0 0 0 ],    c = (2, 1, 1, 1)

Solution: p̂ = (1.49, 0.78, 1.58, 0.15)
Notice that page 4 has no incoming links, and hence gets the
minimum PageRank of 0.15.
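As a sanity check (not from the slides), iterating definition (1) on this example reproduces the quoted solution:

```python
# Iterate p_i = (1 - d) + d * sum_j (L_ij / c_j) p_j on the 4-page example,
# with d = 0.85, until convergence.
d = 0.85
L = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
c = [2, 1, 1, 1]
n = 4

p = [1.0] * n
for _ in range(100):
    p = [(1 - d) + d * sum(L[i][j] / c[j] * p[j] for j in range(n))
         for i in range(n)]

print([round(x, 2) for x in p])  # → [1.49, 0.78, 1.58, 0.15]
```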
Exercise
Check that PageRank fixes the problem in the earlier FlawedRank
example, producing the normalized solution
p̂ = (0.20, 0.20, 0.20, 0.20, 0.20).
Using PageRank for web search
For a basic web search, given a query, we could do the following:
1. Compute the PageRank vector p once. (Google recomputes this
from time to time, to stay current.)
2. Find the documents containing all words in the query.
3. Sort these documents by PageRank, and return the top k (e.g. k =
50).
This is a little too simple, so we can bring in similarity scores, changing
the above to:
3. Sort these documents by PageRank, and keep only the top K (e.g. K
= 5000).
4. Sort by similarity to the query and return the top k (e.g. k = 50).
Google uses a combination of PageRank, similarity scores, and other
techniques (it’s proprietary!)
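The steps above can be sketched as follows; the documents, PageRank scores, and the similarity() function are all made-up placeholders, not Google's method:

```python
# Toy search pipeline: filter by query words, cut to top K by PageRank,
# then rank the survivors by similarity to the query.
pagerank = {"doc_a": 1.49, "doc_b": 0.78, "doc_c": 1.58, "doc_d": 0.15}
docs = {
    "doc_a": "web search before pagerank",
    "doc_b": "random surfer clicks links at random",
    "doc_c": "web search with pagerank and similarity scores",
    "doc_d": "unrelated page about pigeons",
}

def similarity(query, doc):
    # Stand-in score: fraction of query words appearing in the document text.
    qwords = query.lower().split()
    return sum(w in docs[doc].lower() for w in qwords) / len(qwords)

query = "pagerank web search"
# Step 2: find the documents containing all words in the query.
matches = [d for d in docs if all(w in docs[d].lower() for w in query.lower().split())]
# Step 3: keep only the top K by PageRank (K = 2 here, for illustration).
top_k = sorted(matches, key=lambda d: pagerank[d], reverse=True)[:2]
# Step 4: sort by similarity to the query and return the top k.
results = sorted(top_k, key=lambda d: similarity(query, d), reverse=True)
print(results)  # → ['doc_c', 'doc_a']
```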
More recent work
Following the invention of PageRank, there has been a huge amount of
work to improve and extend it, and not only at Google! There are
many, many academic papers too; here are a few:
• Intelligent surfing: pointing the surfer towards webpages that
are textually relevant. (Richardson and Domingos (2002), “The
Intelligent Surfer: Probabilistic Combination of Link and
Content Information in PageRank”.)
• TrustRank: pointing the surfer away from spam. (Gyongyi et
al. (2004), “Combating Web Spam with TrustRank”.)
• PigeonRank: pigeons, the real reason for Google’s success.
(http://www.google.com/onceuponatime/technology/pigeonrank.html.)
Computational issues
• How can we perform each iteration quickly (multiply by A
quickly)?
Use the sparsity of web graph.
• How many iterations does it take (generally) to get a
reasonable answer?
Not very many, if A has a large spectral gap (the difference between
its first and second largest absolute eigenvalues).
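A sketch of a sparsity-aware iteration (not from the slides), storing only the outgoing links; the graph is the 4-page example from the earlier slide, with pages numbered from 0:

```python
# One power-method step exploiting graph sparsity: only outgoing links are
# stored, so multiplying by A costs O(number of links) rather than O(n^2).
d = 0.85
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # the 4-page example graph
n = len(out_links)

def step(p):
    # One iteration of p_i = (1 - d) + d * sum over links j -> i of p_j / c_j.
    new = [1 - d] * n
    for j, targets in out_links.items():
        share = d * p[j] / len(targets)   # page j splits its vote among its links
        for i in targets:
            new[i] += share
    return new

p = [1.0] * n
for _ in range(100):
    p = step(p)
print([round(x, 2) for x in p])  # → [1.49, 0.78, 1.58, 0.15]
```

On the real web graph the average out-degree is small, so each iteration touches only a tiny fraction of the n^2 possible entries of A.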
Software
See the igraph package in R.