Link Analysis: PageRank
Ranking Nodes on the Graph
• Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu
• Since there is large diversity in the connectivity of the web graph, we can rank the pages by their link structure
Slides by Jure Leskovec: Mining Massive Datasets 2
Link Analysis Algorithms
• We will cover the following link analysis approaches to computing the importance of nodes in a graph:
  – PageRank
  – Hubs and Authorities (HITS)
  – Topic-Specific (Personalized) PageRank
  – Web Spam Detection Algorithms
Links as Votes
• Idea: Links as votes
  – A page is more important if it has more links
• In-coming links? Out-going links?
• Think of in-links as votes:
  – www.stanford.edu has 23,400 in-links
  – www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  – Links from important pages count more
  – Recursive question!
Simple Recursive Formulation
• Each link’s vote is proportional to the importance of its source page
• If page p with importance x has n out-links, each link gets x/n votes
• Page p’s own importance is the sum of the votes on its in-links
PageRank: The “Flow” Model
• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” rj for node j:

    rj = Σi→j ri / di        di … out-degree of node i

• Example graph (“the web in 1839”): three nodes y, a, m with links y→y, y→a, a→y, a→m, m→a

Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Solving the Flow Equations
• 3 equations, 3 unknowns, no constants
  – No unique solution
• Additional constraint forces uniqueness
  – ry + ra + rm = 1
  – ry = 2/5, ra = 2/5, rm = 1/5
• Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
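For a system this small, the constrained solve can be done directly; a minimal NumPy sketch (the node order y, a, m is assumed):

```python
import numpy as np

# Flow equations as (M - I) r = 0 for the y/a/m example, with one equation
# replaced by the normalization constraint ry + ra + rm = 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = M - np.eye(3)
A[2] = 1.0                      # last row becomes: ry + ra + rm = 1
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)                        # ≈ [0.4, 0.4, 0.2], i.e. ry = 2/5, ra = 2/5, rm = 1/5
```

Replacing one redundant flow equation with the normalization row is what makes the system nonsingular; without it, any scalar multiple of r would solve the equations.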
PageRank: Matrix Formulation
• Stochastic adjacency matrix M
  – Let page j have dj out-links
  – If j → i, then Mij = 1/dj, else Mij = 0
• M is a column stochastic matrix
  – Columns sum to 1
• Rank vector r: vector with an entry per page
  – ri is the importance score of page i
  – Σi ri = 1
• The flow equations can be written r = M ∙ r
Example
• Suppose page j links to 3 pages, including i
• Then Mij = 1/3: column j of M has three entries equal to 1/3, and in r = M ∙ r page i receives rj /3 from j
Eigenvector Formulation
• The flow equations can be written r = M ∙ r
• So the rank vector r is an eigenvector of the stochastic web matrix M
  – In fact, it is M’s first (principal) eigenvector, with corresponding eigenvalue 1
Example: Flow Equations & M

r = M ∙ r:

    ry      ½  ½  0      ry
    ra  =   ½  0  1  ∙   ra
    rm      0  ½  0      rm

Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Power Iteration Method
• Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
  – Initialize: r(0) = [1/N, …, 1/N]T
  – Iterate: r(t+1) = M ∙ r(t)
  – Stop when |r(t+1) – r(t)|1 < ε
• |x|1 = Σ1≤i≤N |xi| is the L1 norm
  – Can use any other vector norm, e.g., Euclidean
• Equivalently, entry-wise:

    rj(t+1) = Σi→j ri(t) / di        di … out-degree of node i
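The iterative scheme above is easy to run directly; a minimal NumPy sketch on the three-node y/a/m example (the function name and tolerance are illustrative):

```python
import numpy as np

def power_iteration(M, eps=1e-10):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)             # r(0) = [1/N, ..., 1/N]
    while True:
        r_next = M @ r                  # r(t+1) = M r(t)
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# Column-stochastic matrix for the y/a/m example
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))               # ≈ [0.4, 0.4, 0.2]
```

Since each column of M sums to 1, every iterate remains a probability distribution, so no renormalization step is needed.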
PageRank: How to Solve?
• Power Iteration:
  – Set r(0) = [1/N, …, 1/N]T
  – And iterate: ri(t+1) = Σj Mij ∙ rj(t)
• Example:

        y  a  m
    y   ½  ½  0
    a   ½  0  1
    m   0  ½  0

    ry      1/3   1/3   5/12    9/24       6/15
    ra  =   1/3   3/6   1/3    11/24   …   6/15
    rm      1/3   1/6   3/12    1/6        3/15

    Iteration 0, 1, 2, …
Random Walk Interpretation
• Imagine a random web surfer:
  – At any time t, the surfer is on some page u
  – At time t+1, the surfer follows an out-link from u uniformly at random
  – Ends up on some page v linked from u
  – Process repeats indefinitely
• Let p(t) … vector whose ith coordinate is the probability that the surfer is at page i at time t
  – p(t) is a probability distribution over pages
The Stationary Distribution
• Where is the surfer at time t+1?
  – Follows a link uniformly at random: p(t+1) = M ∙ p(t)
• Suppose the random walk reaches a state where p(t+1) = M ∙ p(t) = p(t)
  – Then p(t) is a stationary distribution of the random walk
• Our rank vector r satisfies r = M ∙ r
  – So, it is a stationary distribution for the random walk
PageRank: Three Questions

    rj(t+1) = Σi→j ri(t) / di     or equivalently     r = M ∙ r

• Does this converge?
• Does it converge to what we want?
• Are results reasonable?
Does This Converge?
• Example: two nodes a and b with links a→b and b→a

    ra  =   1  0  1  0  …
    rb      0  1  0  1  …

    Iteration 0, 1, 2, …
Does It Converge to What We Want?
• Example: two nodes a and b with the single link a→b (b is a dead end)

    ra  =   1  0  0  0  …
    rb      0  1  0  0  …

    Iteration 0, 1, 2, …
Problems with the “Flow” Model
2 problems:
• Some pages are “dead ends” (have no out-links)
  – Such pages cause importance to “leak out”
• Spider traps (all out-links are within the group)
  – Eventually spider traps absorb all importance
Problem: Spider Traps
• Power Iteration:
  – Set r(0) = [1/N, …, 1/N]T
  – And iterate: r(t+1) = M ∙ r(t)
• Example (m now links only to itself):

        y  a  m
    y   ½  ½  0
    a   ½  0  0
    m   0  ½  1

    ry      1/3   2/6   3/12    5/24       0
    ra  =   1/3   1/6   2/12    3/24   …   0
    rm      1/3   3/6   7/12   16/24       1

    Iteration 0, 1, 2, …

Flow equations:
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
Solution: Random Teleports
• The Google solution for spider traps: At each time step, the random surfer has two options:
  – With probability β, follow a link at random
  – With probability 1-β, jump to some page uniformly at random
  – Common values for β are in the range 0.8 to 0.9
• Surfer will teleport out of the spider trap within a few time steps
Problem: Dead Ends
• Power Iteration:
  – Set r(0) = [1/N, …, 1/N]T
  – And iterate: r(t+1) = M ∙ r(t)
• Example (m now has no out-links):

        y  a  m
    y   ½  ½  0
    a   ½  0  0
    m   0  ½  0

    ry      1/3   2/6   3/12   5/24       0
    ra  =   1/3   1/6   2/12   3/24   …   0
    rm      1/3   1/6   1/12   2/24       0

    Iteration 0, 1, 2, …

Flow equations:
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
Solution: Dead Ends
• Teleports: Follow random teleport links with probability 1.0 from dead ends
  – Adjust matrix accordingly

  Before:                After:
        y  a  m                y  a  m
    y   ½  ½  0            y   ½  ½  ⅓
    a   ½  0  0            a   ½  0  ⅓
    m   0  ½  0            m   0  ½  ⅓
Why Do Teleports Solve the Problem?
Markov chains:
• Set of states X
• Transition matrix P, where Pij = P(Xt = i | Xt-1 = j)
• π specifying the probability of being at each state x ∈ X
• Goal is to find π such that π = P ∙ π
• Compare with power iteration: r(t+1) = M ∙ r(t)
Why is This Analogy Useful?
• Theory of Markov chains
• Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.
Make M Stochastic
• Stochastic: every column sums to 1
• A possible solution: add teleport links (shown in green on the slide) from each dead end to all nodes:

    S = M + (1/n) ∙ (1 ∙ aT)

  – ai = 1 if node i has out-degree 0, else ai = 0
  – 1 … vector of all 1s

  Example (m is a dead end):

        y  a  m
    y   ½  ½  ⅓
    a   ½  0  ⅓
    m   0  ½  ⅓

  ry = ry /2 + ra /2 + rm /3
  ra = ry /2 + rm /3
  rm = ra /2 + rm /3
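The dead-end fix S = M + (1/n)·(1·aT) amounts to replacing each all-zero column with the uniform column; a minimal NumPy sketch on the y/a/m example with m as a dead end (the function name is illustrative):

```python
import numpy as np

def make_stochastic(M):
    """S = M + (1/n) * 1 * a^T: replace each all-zero (dead-end) column with 1/n."""
    n = M.shape[0]
    a = (M.sum(axis=0) == 0).astype(float)     # a[j] = 1 iff node j has out-degree 0
    return M + np.outer(np.ones(n), a) / n

# y/a/m example where m is a dead end (its column is all zeros)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
S = make_stochastic(M)
print(S[:, 2])                                 # m's column becomes [1/3, 1/3, 1/3]
```

After the fix, every column of S sums to 1, so S is a valid column-stochastic transition matrix.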
Make M Aperiodic
• A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k
• A possible solution: add extra (teleport) links, shown in green on the original slide
Make M Irreducible
• Irreducible: from any state, there is a non-zero probability of going to any other state
• A possible solution: add extra (teleport) links, shown in green on the original slide
Solution: Random Jumps
• Google’s solution that does it all:
  – Makes M stochastic, aperiodic, irreducible
• At each step, the random surfer has two options:
  – With probability β, follow a link at random
  – With probability 1-β, jump to some random page
• PageRank equation [Brin-Page, 98]:

    rj = Σi→j β ∙ ri / di + (1-β) ∙ 1/N        di … out-degree of node i

• From now on: We assume M has no dead ends
  – That is, we follow random teleport links with probability 1.0 from dead ends
The Google Matrix
• PageRank equation [Brin-Page, 98]:

    rj = Σi→j β ∙ ri / di + (1-β) ∙ 1/N

• The Google Matrix A:

    A = β ∙ M + (1-β) ∙ [1/N]N×N

  ([1/N]N×N … N×N matrix with all entries 1/N)
• A is stochastic, aperiodic and irreducible, so power iteration r(t+1) = A ∙ r(t) converges
• What is β?
  – In practice β = 0.85 (make about 5 steps and then jump)

Random Teleports (β = 0.8)
• Example (spider-trap graph from before):

            ½  ½  0           ⅓  ⅓  ⅓        7/15   7/15   1/15
  A = 0.8 ∙ ½  0  0   + 0.2 ∙ ⅓  ⅓  ⅓   =    7/15   1/15   1/15
            0  ½  1           ⅓  ⅓  ⅓        1/15   7/15  13/15

    ry      1/3   0.33   0.24   0.26        7/33
    ra  =   1/3   0.20   0.20   0.18   …    5/33
    rm      1/3   0.46   0.52   0.56       21/33
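Putting the pieces together, a compact NumPy sketch of PageRank with random teleports, run on the spider-trap example above (the function name is illustrative):

```python
import numpy as np

def pagerank(M, beta=0.85, eps=1e-10):
    """Power iteration on A = beta*M + (1-beta)*[1/N] (M assumed free of dead ends)."""
    n = M.shape[0]
    A = beta * M + (1 - beta) / n * np.ones((n, n))   # the Google matrix
    r = np.full(n, 1.0 / n)
    while True:
        r_next = A @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# Spider-trap example: y -> {y, a}, a -> {y, m}, m -> {m}
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M, beta=0.8))   # ≈ [7/33, 5/33, 21/33] ≈ [0.212, 0.152, 0.636]
```

With teleports the spider trap m no longer absorbs all importance: it still ranks highest, but y and a keep nonzero scores.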
Computing PageRank
• Key step is matrix-vector multiplication
  – rnew = A ∙ rold
• Easy if we have enough main memory to hold A, rold, rnew
• Say N = 1 billion pages
  – We need 4 bytes for each entry (say)
  – 2 billion entries for the vectors, approx 8 GB
  – Matrix A has N² entries
    • 10¹⁸ is a large number!

    A = β ∙ M + (1-β) [1/N]N×N

            ½  ½  0           1/3  1/3  1/3       7/15   7/15   1/15
  A = 0.8 ∙ ½  0  0   + 0.2 ∙ 1/3  1/3  1/3   =   7/15   1/15   1/15
            0  ½  1           1/3  1/3  1/3       1/15   7/15  13/15
Matrix Formulation
• Suppose there are N pages
• Consider a page j, with set of out-links dj
• We have Mij = 1/|dj| when j → i, and Mij = 0 otherwise
• The random teleport is equivalent to:
  – Adding a teleport link from j to every other page with probability (1-β)/N
  – Reducing the probability of following each out-link from 1/|dj| to β/|dj|
  – Equivalently: taxing each page a fraction (1-β) of its score and redistributing it evenly
Rearranging the Equation
• r = A ∙ r, where A = β ∙ M + (1-β) ∙ [1/N]N×N
• So r = β ∙ M ∙ r + (1-β) ∙ [1/N]N×N ∙ r
       = β ∙ M ∙ r + [(1-β)/N]N        since Σi ri = 1
• So we get: r = β ∙ M ∙ r + [(1-β)/N]N

  [x]N … a vector of length N with all entries x
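The identity above is easy to check numerically; a small sketch comparing the dense Google-matrix product with the “tax and redistribute” form (the 3-node matrix and test vector are illustrative):

```python
import numpy as np

beta, n = 0.8, 3
M = np.array([[0.5, 0.5, 0.0],     # spider-trap example matrix (no dead ends)
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
r = np.array([0.2, 0.3, 0.5])      # any probability vector (entries sum to 1)

# Dense form: multiply by the full Google matrix A
A = beta * M + (1 - beta) / n * np.ones((n, n))
dense = A @ r

# Sparse form: beta * M r plus the constant teleport term (1-beta)/N
sparse = beta * (M @ r) + (1 - beta) / n

print(np.allclose(dense, sparse))  # True, because r sums to 1
```

The equality relies on Σi ri = 1: the dense teleport term contributes (1-β)/N · Σi ri to each entry, which collapses to the constant (1-β)/N.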
Sparse Matrix Formulation
• We just rearranged the PageRank equation:

    r = β ∙ M ∙ r + [(1-β)/N]N

• where [(1-β)/N]N is a vector with all N entries (1-β)/N
• M is a sparse matrix!
  – 10 links per node, approx 10N entries
• So in each iteration, we need to:
  – Compute rnew = β ∙ M ∙ rold
  – Add a constant value (1-β)/N to each entry in rnew
Sparse Matrix Encoding
• Encode sparse matrix using only nonzero entries
  – Space proportional roughly to number of links
  – Say 10N, or 4*10*1 billion = 40 GB
  – Still won’t fit in memory, but will fit on disk

    source node   degree   destination nodes
         0           3     1, 5, 7
         1           5     17, 64, 113, 117, 245
         2           2     13, 23
Basic Algorithm: Update Step
• Assume enough RAM to fit rnew into memory
  – Store rold and matrix M on disk
• Then 1 step of power-iteration is:

    Initialize all entries of rnew to (1-β)/N
    For each page p (of out-degree n):
      Read into memory: p, n, dest1, …, destn, rold(p)
      for j = 1…n:
        rnew(destj) += β ∙ rold(p) / n

    src   degree   destination
     0       3     1, 5, 6
     1       4     17, 64, 113, 117
     2       2     13, 23
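The update step above translates directly into code; a sketch where the on-disk table is modeled as a list of (src, degree, destinations) rows for the spider-trap example (the names and the tiny graph are illustrative):

```python
import numpy as np

beta, N = 0.8, 3

# Sparse encoding, one row per source page: (src, out-degree, destination nodes).
# Graph: y -> {y, a}, a -> {y, m}, m -> {m} (nodes y=0, a=1, m=2).
edges = [(0, 2, [0, 1]),
         (1, 2, [0, 2]),
         (2, 1, [2])]

def update_step(edges, r_old):
    """One power-iteration step: rnew = beta * M * rold + (1-beta)/N."""
    r_new = np.full(N, (1 - beta) / N)      # initialize every entry to (1-beta)/N
    for src, deg, dests in edges:           # stream M one source page at a time
        for d in dests:
            r_new[d] += beta * r_old[src] / deg
    return r_new

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = update_step(edges, r)
print(r)                                    # ≈ [7/33, 5/33, 21/33]
```

Note that M is never materialized as a matrix: each row of the encoding is read once per iteration, which is exactly what makes the disk-based scheme work.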
Analysis
• Assume enough RAM to fit rnew into memory
  – Store rold and matrix M on disk
• In each iteration, we have to:
  – Read rold and M
  – Write rnew back to disk
  – IO cost = 2|r| + |M|
• Question:
  – What if we could not even fit rnew in memory?
Block-based Update Algorithm
• Break rnew into blocks that fit in memory (e.g. {0, 1}, {2, 3}, {4, 5}); keep rold and M on disk

    src   degree   destination
     0       4     0, 1, 3, 5
     1       2     0, 5
     2       2     3, 4
Analysis of Block Update
• Similar to nested-loop join in databases
  – Break rnew into k blocks that fit in memory
  – Scan M and rold once for each block
• Total cost: k scans of M and rold
  – k(|M| + |r|) + |r| = k|M| + (k+1)|r|
• Can we do better?
  – Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
Block-Stripe Update Algorithm
• Break M into stripes, one per block of rnew:

  Stripe for block {0, 1}:      Stripe for block {2, 3}:      Stripe for block {4, 5}:
    src  degree  destination      src  degree  destination      src  degree  destination
     0      4    0, 1               0      4    3                 0      4    5
     1      3    0                  2      2    3                 1      3    5
     2      2    1                                                2      2    4
Block-Stripe Analysis
• Break M into stripes
  – Each stripe contains only the destination nodes in the corresponding block of rnew
• Some additional overhead per stripe
  – But it is usually worth it
• Cost per iteration: |M|(1+ε) + (k+1)|r|
Some Problems with PageRank
• Measures generic popularity of a page
  – Biased against topic-specific authorities
  – Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
  – Other models exist, e.g., hubs-and-authorities
  – Solution: Hubs-and-Authorities (next)
• Susceptible to link spam
  – Artificial link topologies created in order to boost page rank
  – Solution: TrustRank (next)