Lecture 2
-
Algorithms and Economics of the Internet (CSCI-GA.3033-003)
Vahab Mirrokni, Google Research, New York; Richard Cole, Courant Institute, NYU
-
In this lecture
Structure and modeling of social networks: power law graphs; small world phenomenon; random generative models
Mining & clustering large networks: ranking through link analysis (HITS, PageRank); web spam; local and spectral clustering; MapReduce and message-passing-based algorithms
Viral marketing over social networks: advertising, influence maximization, and revenue maximization over social networks
-
Real Networks vs. Random Networks
Are real networks like random graphs?
Average path length, disruption: YES
Clustering coefficient, degree distribution, attack: NO
Problems with the random network model:
The degree distribution differs from that of real networks.
The giant component in most real networks does NOT emerge through a phase transition.
No local structure: the clustering coefficient is too low.
Most important: are real networks random? The answer is simply NO, but random graphs remain useful in analyzing other generative models.
-
Models of Network Growth
Preferential attachment:
Price's model
Barabasi-Albert model
The LCD model [Bollobas-Riordan]
The copying model
-
Preferential attachment
The main idea is that the rich get richer:
first studied by Yule for the size of biological genera
revisited by Simon
reinvented multiple times
Also known as: the Gibrat principle, cumulative advantage, the Matthew effect
-
Preferential Attachment in Networks
First considered by [Price 65] as a model for citation networks: each new paper is generated with m citations, and new papers cite previous papers with probability proportional to their in-degree (citation count).
What about papers without any citations? Each paper is considered to have one default citation: the probability of citing a paper with in-degree k is proportional to k + 1.
Result: a power law with exponent 2 + 1/m.
-
Barabasi-Albert model
Undirected model: the process starts with some initial subgraph, each new node comes with m edges, and it connects to existing nodes with probability proportional to their degree.
Results in a power law with exponent 3.
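The process above can be sketched in a few lines of Python. The repeated-node list is a standard implementation trick for degree-proportional sampling; the function name, the clique seed graph, and the parameters are our assumptions, not part of the slides:

```python
import random

def barabasi_albert(n, m, seed=None):
    """Sketch of the Barabasi-Albert process: start from a small
    clique, then attach each new node to m existing nodes chosen
    with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = []
    # node i appears deg(i) times in this list, so a uniform choice
    # from it is exactly degree-proportional sampling
    targets = []
    for i in range(m + 1):          # initial subgraph: clique on m+1 nodes
        for j in range(i):
            edges.append((i, j))
            targets += [i, j]
    for v in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:      # m distinct degree-proportional picks
            chosen.add(rng.choice(targets))
        for u in chosen:
            edges.append((v, u))
            targets += [v, u]
    return edges

edges = barabasi_albert(1000, 3, seed=0)
```

For large n, the empirical degree distribution of the resulting graph approaches the power law with exponent 3 stated above.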
-
Weaknesses of the BA model
It is not directed (not good as a model for the Web).
It focuses mainly on the (in-)degree and does not take into account other parameters (out-degree distribution, components, clustering coefficient).
It correlates age with degree, which is not always the case: older vertices have higher mean degree.
Many variations have been considered, some in order to address the above problems: edge rewiring, appearance and disappearance; fitness parameters; variable mean degree; non-linear preferential attachment.
-
The LCD model [Bollobas-Riordan]
Self-loops and multiple edges are allowed. A new vertex v connects to a vertex u with probability proportional to the degree of u, counting the new edge.
The m edges are inserted sequentially, so the problem reduces to studying the single-edge (m = 1) case.
-
Preferential attachment graphs
Expected diameter: if m = 1, the diameter is Θ(log n); if m > 1, the diameter is Θ(log n / log log n)
Expected clustering coefficient:
E[C] = ((m - 1)/8) · (log n)² / n
-
Copying model
Each node has constant out-degree d. A new node selects uniformly one of the existing nodes as a prototype. For the i-th outgoing link:
with probability α it copies the i-th link of the prototype node
with probability 1 - α it selects the target of the link uniformly at random
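The copying step translates directly into code. A minimal sketch, in which the seed graph (a small complete digraph) and all names are our assumptions:

```python
import random

def copying_model(n, d, alpha, seed=None):
    """Sketch of the copying model: each new node picks a uniformly
    random prototype among existing nodes; its i-th out-link either
    copies the prototype's i-th link (with probability alpha) or
    points to a uniformly random existing node (prob. 1 - alpha)."""
    rng = random.Random(seed)
    # seed graph (our choice): d+1 nodes, each linking to all others
    out = {v: [u for u in range(d + 1) if u != v] for v in range(d + 1)}
    for v in range(d + 1, n):
        proto = rng.randrange(v)              # uniform prototype
        out[v] = [out[proto][i] if rng.random() < alpha
                  else rng.randrange(v)       # uniform existing target
                  for i in range(d)]
    return out

g = copying_model(500, 3, alpha=0.5, seed=1)
```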
-
An example
-
Copying model properties
Power law degree distribution with exponent (2 - α)/(1 - α)
The number of bipartite cliques of size i × d is on the order of n·e^(-i)
The model was meant to capture the topical nature of the Web
It has also found applications in biological networks
-
Other graph models
Cooper-Frieze model: multiple parameters that allow for adding vertices, edges, preferential attachment, and uniform linking.
Directed graphs [Bollobas et al.]: allow for preferential selection of both the source and the destination; allow for edges from both new and old vertices.
-
Small world network models: Watts & Strogatz (clustering & short paths); Kleinberg (geographical); Watts, Dodds & Newman (hierarchical)
-
Reconciling two observations:
High clustering: my friends' friends tend to be my friends
Short average paths
Small world phenomenon: Watts/Strogatz model
Source: Watts, D.J., Strogatz, S.H.(1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.
-
Watts-Strogatz model: generating small world graphs
As in many network generating algorithms: disallow self-edges and multiple edges.
Select a fraction p of the edges and reposition one of their endpoints.
Alternatively, add a fraction p of additional edges, leaving the underlying grid intact.
-
Each node has K ≥ 4 nearest neighbors (local).
Tunable: vary the probability p of rewiring any given edge.
Small p: regular grid; large p: classical random graph.
Watts-Strogatz model: Generating small world graphs
-
Watts/Strogatz model: what happens in between? Does a small shortest path imply small clustering? Does a large shortest path imply large clustering?
Through numerical simulation: as we increase p from 0 to 1, the mean distance decreases quickly, while the clustering decreases only slowly.
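The rewiring procedure can be sketched as follows. This is a simplification under our own conventions: we rewire the far endpoint of each lattice edge with probability p and resample on collisions (which could loop on very dense graphs, but is fine for sparse ones):

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Sketch of Watts-Strogatz rewiring: build a ring lattice where
    each node links to its k nearest neighbors (k even), then with
    probability p move the far endpoint of each edge to a random
    node, disallowing self-loops and multiple edges."""
    rng = random.Random(seed)
    edges = [(v, (v + j) % n) for v in range(n)
             for j in range(1, k // 2 + 1)]
    present = {frozenset(e) for e in edges}   # undirected edge set
    for idx, (u, v) in enumerate(edges):
        if rng.random() >= p:
            continue                          # keep this edge as is
        w = rng.randrange(n)
        while w == u or frozenset((u, w)) in present:
            w = rng.randrange(n)              # resample on collision
        present.discard(frozenset((u, v)))
        present.add(frozenset((u, w)))
        edges[idx] = (u, w)
    return edges

es = watts_strogatz(100, 4, 0.1, seed=0)
```

The edge count stays at n·k/2, so only the wiring (not the density) changes with p, which is what makes the clustering-vs-path-length comparison above meaningful.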
-
Watts/Strogatz model: Change in clustering coefficient and average path length as a function of the proportion of rewired edges
[Figure: normalized path length l(p)/l(0) and clustering C(p)/C(0) as functions of the rewiring fraction p, with markers at 1% and 10% of links rewired]
-
What features of real social networks are missing from the small world model?
Long-range links are not as likely as short-range ones
Hierarchical structure / groups
Hubs
-
Geographical small world models
"The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain." (S. Milgram, The small world problem, Psychology Today 1, 61, 1967)
-
Kleinberg's geographical small world model
Nodes are placed on a grid and connect to their nearest neighbors; additional links are placed with
p(link between u and v) ∝ (distance(u, v))^(-r)
where r is the exponent that will determine navigability.
Source: Kleinberg, Navigation in a small world
-
Decentralized Algorithm
Node s must send message m to node t. At any moment, the current message holder u must pass m to a neighbor, given only:
the set of local contacts of all nodes (the grid structure)
the location on the grid of the destination node t
the locations and long-range contacts of all nodes that have seen m (but not the long-range contacts of nodes that have not seen m)
-
Delivery Time
Definition: The expected delivery time is the expectation, over the choice of long-range contacts and a uniformly random source and destination, of the number of steps taken to deliver the message.
-
Results [Kleinberg, 2000]
Theorem 1: There is a decentralized algorithm A so that when r = 2 and p = q = 1, the expected delivery time of A is O(log² n).
Theorem 2: (a) For 0 ≤ r < 2, the expected delivery time of any decentralized algorithm is Ω(n^((2-r)/3)). (b) For r > 2, the expected delivery time of any decentralized algorithm is Ω(n^((r-2)/(r-1))). (Constants depend on p, q, and r.)
-
Proof of Theorem 1
Algorithm: in each step, u sends m to its neighbor v that is closest (in grid distance) to t.
Proof sketch: define phases based on how close m is to t: the algorithm is in phase j if 2^j ≤ dist(m, t) < 2^(j+1).
Prove we don't spend much time in any phase: the expected time in phase j is at most O(log n), for all j.
Conclude: since there are at most log n + 1 phases, the expected delivery time is O(log² n).
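The greedy algorithm can be simulated on a small grid. One simplification in this sketch (ours, not Kleinberg's): each node's long-range contact is resampled fresh at every visit rather than fixed in advance, which keeps the code short:

```python
import random

def dist(a, b):
    """Manhattan (lattice) distance between grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def long_range_contact(u, n, r, rng):
    """Sample one long-range contact of u with Pr ~ dist(u, v)^(-r)."""
    others = [(x, y) for x in range(n) for y in range(n) if (x, y) != u]
    weights = [dist(u, v) ** (-r) for v in others]
    return rng.choices(others, weights=weights)[0]

def greedy_route(n, r, s, t, seed=None):
    """Greedy routing on an n x n grid: forward to whichever contact
    (the 4 lattice neighbors plus one long-range contact) is closest
    to the target t; return the number of steps taken."""
    rng = random.Random(seed)
    u, steps = s, 0
    while u != t:
        x, y = u
        cands = [(x + dx, y + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < n and 0 <= y + dy < n]
        cands.append(long_range_contact(u, n, r, rng))
        u = min(cands, key=lambda v: dist(v, t))   # greedy step
        steps += 1
    return steps

steps = greedy_route(20, 2.0, (0, 0), (19, 19), seed=0)
```

Since some lattice neighbor is always strictly closer to t, each step makes progress, so the walk terminates in at most dist(s, t) steps; long-range contacts only shorten it.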
-
When r = 0, long-range links are uniformly distributed and the average shortest path is ~ log n (n = grid size), yet any decentralized algorithm needs at least a0·n^(2/3) expected steps: geographical search fails when the network lacks locality.
When r < 2, similar lower bounds apply (Theorem 2a).
-
When r > 2, links are overly localized on the grid: the expected search time is ~ N^((r-2)/(r-1))
-
When r = 2, the expected delivery time of a decentralized algorithm is at most C·(log N)².
With p ~ d^(-2), the geographical small world model balances links between long and short range.
-
In this lecture
Structure and modeling of social networks: power law graphs; small world phenomenon; random generative models
Mining & clustering large networks: ranking through link analysis (HITS, PageRank); web spam; local and spectral clustering; MapReduce and message-passing-based algorithms
Viral marketing over social networks: advertising, influence maximization, and revenue maximization over social networks
-
Components of a search engine
Crawler: how to handle different types of URLs; how often to crawl each page; how to detect duplicates.
Indexer: data structures (to minimize the number of disk accesses).
Query handler: find the set of pages that contain the query word, then sort the results.
-
Difficulties
Too many hits (e.g., for "ad auctions" the number of indexed pages is 330,000,000): often too many pages contain the query.
Sometimes pages are not sufficiently self-descriptive.
Brin & Page: as of Nov. '97, only one of the top four commercial search engines finds itself (in its own results)!
We need to find popular pages.
-
Link analysis
Instead of using text analysis, we analyze the structure of hyperlinks to extract information about the popularity of a page.
Advantages:
Less need for complicated text analysis.
Less manipulable, and independent of one person's point of view (think of it as a voting system).
-
Link-based ranking of search results
Compute the importance of nodes in the web graph. Various notions of node centrality:
Degree centrality = degree of u
Betweenness centrality = # of shortest paths passing through u
Closeness centrality = avg. length of shortest paths from u to all other nodes
HITS (Hypertext Induced Topic Selection)
PageRank: Google's first link-based ranking
-
HITS and PageRank
HITS (Hypertext Induced Topic Selection) J. Kleinberg, Authoritative sources in a hyperlinked environment, SODA 1998.
PageRank
S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine, WWW 1998.
L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: bringing order to the web.
-
Relevance vs. popularity
Balance between relevance and popularity? Either construct a focused subgraph based on relevance and return the most popular page in this subgraph, or compute a measure of relevance (considering how many times and in what form [title/URL/font size/anchor] the query appears in the page) and multiply it by a popularity measure, e.g. PageRank.
-
Constructing a focused subgraph
Given a query, start with the set R of the top ~200 text-based hits for it.
Add to this set:
the set of pages that have a link from a page in R
the set of pages that have a link to a page p in R, with an upper limit of ~50 pages per p ∈ R
Call the resulting set S. Find the most authoritative page in G[S].
-
Constructing a focused subgraph
Desired properties: relatively small; rich in relevant pages; contains most of the strongest authorities on the subject.
-
Finding authorities
Approach 1: vertices with the largest in-degrees. This approach is used to evaluate scientific citations (the impact factor).
Deficiencies:
a page might have a large in-degree from low-quality pages
universally popular pages often dominate the results
easy to manipulate
-
Finding authorities: Mutually Recursive Definition
Approach 2: define the sets of hubs and authorities recursively.
A good hub links to many good authorities: the best hubs on a subject give links to the best authorities on the subject.
A good authority is linked from many good hubs: the best authorities on a subject have a large in-degree from the best hubs on the subject.
Model using two scores for each node, a hub score and an authority score, represented as vectors h and a.
-
Authority and Hubness
[Figure: node 1 receives links from nodes 2, 3, 4 and links to nodes 5, 6, 7]
a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)
-
Finding authorities
Initialize authority and hub weights, a0 and h0
while (not converged):
  for each vertex i:
    a_{k+1}(i) = Σ_{j ∈ B_i} h_k(j)   (B_i: in-neighbors of i)
    h_{k+1}(i) = Σ_{j ∈ F_i} a_k(j)   (F_i: out-neighbors of i)
Does it converge? What does it converge to?
-
Rewrite in matrix form:
h = A·a,  a = Aᵀ·h   (recall Aᵀ is the transpose of A)
Substituting: h = A·Aᵀ·h and a = Aᵀ·A·a.
Fact: h is an eigenvector of A·Aᵀ and a is an eigenvector of Aᵀ·A. (Proof on the board.)
Further, this algorithm is a particular, known algorithm for computing eigenvectors, the power iteration method, and it is guaranteed to converge.
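The power iteration for h and a can be sketched directly from the update rules; the normalization keeps the scores bounded, and the adjacency-dict representation and names are our assumptions:

```python
def hits(adj, iters=50):
    """Sketch of HITS by power iteration. adj[u] lists the pages u
    links to. Returns (authority, hub) score dicts, L2-normalized."""
    nodes = sorted(set(adj) | {v for outs in adj.values() for v in outs})
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # authority = sum of hub scores over in-links (a = A^T h)
        a = {v: sum(h[u] for u in nodes if v in adj.get(u, []))
             for v in nodes}
        # hub = sum of authority scores over out-links (h = A a)
        h = {u: sum(a[v] for v in adj.get(u, [])) for u in nodes}
        for s in (a, h):                     # normalize to unit length
            norm = sum(x * x for x in s.values()) ** 0.5 or 1.0
            for v in s:
                s[v] /= norm
    return a, h

# toy graph: pages 1, 4, 5 are hubs; pages 2, 3 are authorities
adj = {1: [2, 3], 4: [2, 3], 5: [3]}
auth, hub = hits(adj)
```

On this toy graph, page 3 (cited by all three hubs) gets the top authority score and pages 1 and 4 tie as the best hubs, matching the mutually recursive definition above.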
-
HITS Example Results
[Figure: authority and hubness weights for nodes 1-15]
-
PageRank
Again, the idea is a recursive definition of importance: an important page is a page that has many links from other important pages.
Problems with the naïve definition: it is not always well-defined, and pages with no out-degree form rank sinks.
Solution: add a random restart everywhere to get rid of sinks and make the definition well-defined.
-
PageRank: Random Walk View
Fix: consider a random surfer who at each step either clicks on a random link, or, with probability a, gets bored and starts again from a random page.
PageRank takes a = 15%, and uses a non-uniform distribution for starting again.
-
PageRank Algorithm
Let the restart probability be a, e.g. a = 15%.
Initialize ranks P0
while (not converged):
  for each vertex i:
    P_{k+1}(i) = a/n + (1 - a) Σ_{j ∈ B_i} P_k(j)/N_j
(B_i: pages linking to i; N_j: out-degree of j)
-
Markov Chain Notation
Random surfer model: a description of a random walk through the Web graph, interpreted via a transition matrix; a page's rank is the asymptotic probability that the surfer is currently browsing that page.
Does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?
r_t = M·r_{t-1}, where M is the transition matrix of a first-order Markov chain (stochastic)
-
Problem
Rank sink problem: in general, many Web pages have no in-links/out-links, which results in dangling edges in the graph. E.g.:
no parent: rank 0; Mᵗ converges to a matrix whose last column is all zero
no children: no solution; Mᵗ converges to the zero matrix
-
Modification
The surfer will restart browsing by picking a new Web page at random:
M = (1 - a)·B + a·E
E: escape matrix; B: adjacency matrix; M: stochastic matrix
r_t = M·r_{t-1}
-
PageRank Algorithm
Let the restart probability be a, e.g. a = 15%.
Initialize ranks P0
while (not converged):
  for each vertex i:
    P_{k+1}(i) = a/n + (1 - a) Σ_{j ∈ Incoming(i)} P_k(j)/Deg(j)
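The iteration on this slide translates almost line for line into code. A minimal sketch assuming, as the formula does, that every page has at least one out-link; the representation and names are ours:

```python
def pagerank(incoming, out_deg, n, a=0.15, iters=100):
    """Sketch of the slide's iteration:
    P(i) = a/n + (1 - a) * sum over j linking to i of P(j)/Deg(j).
    incoming[i] lists the pages linking to i; out_deg[j] is j's
    out-degree (assumed > 0, i.e. no dangling pages)."""
    p = {i: 1.0 / n for i in range(n)}       # start uniform
    for _ in range(iters):
        p = {i: a / n + (1 - a) * sum(p[j] / out_deg[j]
                                      for j in incoming.get(i, []))
             for i in range(n)}
    return p

# 3-page cycle 0 -> 1 -> 2 -> 0: by symmetry each rank must be 1/3
incoming = {0: [2], 1: [0], 2: [1]}
out_deg = {0: 1, 1: 1, 2: 1}
ranks = pagerank(incoming, out_deg, 3)
```

The ranks sum to 1 at every iteration, which is the stochasticity of M from the previous slide in disguise.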
-
Pagerank: Issues and Variants
How realistic is the random surfer model?
What if we modeled the back button? [Fagi00]
Surfer behavior is sharply skewed towards short paths [Hube98]
Search engines, bookmarks & directories make jumps non-random.
Biased surfer models: weight edge traversal probabilities based on the match with the topic/query (non-uniform edge selection); bias jumps towards pages on topic (e.g., based on personal bookmarks & categories of interest).
-
Topic Specific (Personalized) Pagerank [Have02]
Conceptually, we use a random surfer who teleports, with say 15% probability, using the following rule:
select a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories,
then restart at a page chosen uniformly at random within the chosen category.
Sounds hard to implement: we can't compute PageRank at query time!
-
Topic Specific (Personalized) Pagerank [Have02]
Offline: compute PageRank distributions with respect to individual categories. A query-independent model as before, but each page has multiple PageRank scores, one for each ODP category, with teleportation only to that category.
Online: the distribution of weights over the categories is computed by query-context classification; generate a dynamic PageRank score for each page as the weighted sum of its category-specific PageRanks.
-
Influencing PageRank (Personalization)
Input:
Web graph W
influence vector v: (page → degree of influence)
Output:
rank vector r: (page → importance of the page w.r.t. v)
r = PR(W, v)
-
Non-uniform Teleportation
Teleport with 10% probability to a Sports page
-
Interpretation of Composite Score
For a set of personalization vectors {v_j}:
Σ_j [w_j · PR(W, v_j)] = PR(W, Σ_j [w_j · v_j])
A weighted sum of rank vectors itself forms a valid rank vector, because PR(·) is linear w.r.t. v_j.
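The linearity identity can be checked numerically on a toy graph. The `ppr` routine below is a plain power iteration with teleport vector v; the graph, vectors, and names are our assumptions:

```python
def ppr(out, v, a=0.15, iters=200):
    """Power-iteration sketch of PageRank with teleport vector v.
    out[u] lists u's out-links; no dead ends assumed."""
    p = dict(v)
    for _ in range(iters):
        nxt = {i: a * v[i] for i in out}     # teleport mass
        for u in out:
            share = (1 - a) * p[u] / len(out[u])
            for w in out[u]:
                nxt[w] += share              # spread u's rank
        p = nxt
    return p

out = {0: [1], 1: [2, 0], 2: [0]}
v1 = {0: 1.0, 1: 0.0, 2: 0.0}                # teleport only to page 0
v2 = {0: 0.0, 1: 1.0, 2: 0.0}                # teleport only to page 1
mix = {i: 0.9 * v1[i] + 0.1 * v2[i] for i in out}
lhs = ppr(out, mix)                          # PR(W, 0.9 v1 + 0.1 v2)
r1, r2 = ppr(out, v1), ppr(out, v2)
rhs = {i: 0.9 * r1[i] + 0.1 * r2[i] for i in out}
```

The two sides agree because every update step is linear in both p and v, which is exactly what makes the offline/online split of topic-sensitive PageRank work.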
-
Interpretation
10% Sports teleportation
-
Interpretation
10% Health teleportation
-
Interpretation
pr = 0.9·PR_sports + 0.1·PR_health gives you: 9% sports teleportation, 1% health teleportation
-
Pairwise Similarity: Personalized PageRank
Personalized PageRank (PPR) of u → v: the probability of visiting v in the following random walk: at each step,
with probability a, go back to u
with probability 1 - a, go to a neighbor chosen uniformly at random
PPR is a similarity measure: it captures both distance and the number of disjoint paths.
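The restart walk can be simulated directly. This Monte Carlo sketch (ours) estimates PPR(u → v) as the long-run fraction of steps the walk spends at v:

```python
import random

def ppr_estimate(out, u, v, a=0.15, steps=20000, seed=0):
    """Monte Carlo sketch of personalized PageRank from u to v:
    run the restart walk (with probability a jump back to u, else
    move to a uniformly random out-neighbor) and return the
    fraction of steps spent at v."""
    rng = random.Random(seed)
    cur, hits = u, 0
    for _ in range(steps):
        if rng.random() < a:
            cur = u                          # restart at u
        else:
            cur = rng.choice(out[cur])       # random out-neighbor
        hits += (cur == v)
    return hits / steps

# 3-node cycle 0 -> 1 -> 2 -> 0, walk personalized to node 0
out = {0: [1], 1: [2], 2: [0]}
est = ppr_estimate(out, 0, 1)
```

On this cycle the exact value is (1 - a)·a / (1 - (1 - a)³) ≈ 0.33 for a = 0.15, so the estimate should land nearby; such walk-based estimators are one route to the approximate PPR vectors discussed next.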
-
Approximate PPR vector
PPR vector for u: the vector of PPR values from u.
Contribution PageRank (CPR) vector for u: the vector of PPR values to u.
Goal: compute approximate PPR or CPR vectors with an additive error of ε
-
Next Lecture:
Link Analysis
Link Spam Detection
Local Algorithms for computing approximate PPR
Local and Spectral Graph Partitioning
-
Further Reading
M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.
-
Credits/References
Some material used in preparing this lecture:
Newman's course on Networks, U. Michigan
Nicole Immorlica and Mohammad Mahdian's course at U. of Washington, 2006
Jure Leskovec's course on Information and Social Networks, Stanford, 2011
Lada Adamic's course on Networks: Theory and Applications, U. Michigan
Thanks