Vahab Lecture 2

  • Algorithms and Economics of the Internet (CSCI-GA.3033-003)

    Vahab Mirrokni, Google Research, New York
    Richard Cole, Courant Institute, NYU

  • In this lecture

    Structure and modeling of social networks: power law graphs; small world phenomenon; random generative models

    Mining & clustering large networks: ranking through link analysis (HITS, PageRank); web spam; local and spectral clustering; MapReduce and message-passing-based algorithms

    Viral marketing over social networks: advertising, influence maximization, and revenue maximization over social networks

  • Real Networks vs. Random Networks

    Are real networks like random graphs?
    - Average path length, disruption (random failures): YES
    - Clustering coefficient, degree distribution, attack (targeted failures): NO

    Problems with the random network model:
    - Degree distribution differs from that of real networks
    - The giant component in most real networks does NOT emerge through a phase transition
    - No local structure: the clustering coefficient is too low

    Most important: are real networks random? The answer is simply NO, which motivates the other random generative models below. But random graphs are still useful in analyzing those models.

  • Models of Network Growth

    - Preferential attachment
    - Price's model
    - The Barabási-Albert model
    - The LCD model [Bollobás-Riordan]
    - The copying model

  • Preferential attachment

    The main idea is that the rich get richer:
    - first studied by Yule for the sizes of biological genera
    - revisited by Simon
    - reinvented multiple times

    Also known as: the Gibrat principle, cumulative advantage, the Matthew effect

  • Preferential Attachment in Networks

    First considered by [Price 65] as a model for citation networks:
    - each new paper is generated with m citations
    - new papers cite previous papers with probability proportional to their in-degree (citations)

    What about papers without any citations? Each paper is considered to have a default citation: the probability of citing a paper with in-degree k is proportional to k+1.

    Result: power law with exponent 2 + 1/m
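    As a concrete illustration, a minimal Python sketch of Price's process (function name, seed graph, and parameters are illustrative assumptions, not from the lecture). The "ticket urn" holds one entry per unit of (in-degree + 1), so a uniform draw from it is exactly the proportional choice:

        import random

        def price_model(n, m=3, seed=0):
            """Sketch of Price's citation model: paper t cites m earlier papers,
            each chosen with probability proportional to (in-degree + 1)."""
            rng = random.Random(seed)
            indeg = [0] * n
            urn = list(range(m))          # m seed papers, one ticket each
            edges = []
            for t in range(m, n):
                targets = set()
                while len(targets) < m:   # m distinct citations
                    targets.add(rng.choice(urn))
                for j in targets:
                    edges.append((t, j))
                    indeg[j] += 1
                    urn.append(j)         # one more ticket per citation received
                urn.append(t)             # the "+1" ticket for the new paper
            return edges, indeg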

  • Barabási-Albert model

    Undirected model: each node connects to other nodes with probability proportional to their degree.
    - the process starts with some initial subgraph
    - each new node comes with m edges

    Results in a power law with exponent 3
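    A matching sketch for the undirected BA process, assuming a small initial clique (an arbitrary choice; the slide only says "some initial subgraph"). Each node appears in the stubs list once per incident edge, so a uniform draw from it is a degree-proportional draw:

        import random

        def barabasi_albert(n, m=2, seed=0):
            """Sketch of the BA model: each new node attaches m edges to
            existing nodes chosen with probability proportional to degree."""
            rng = random.Random(seed)
            # Start from a clique on m+1 nodes so every node has degree >= m.
            edges = [(i, j) for i in range(m + 1) for j in range(i)]
            stubs = [v for e in edges for v in e]
            for t in range(m + 1, n):
                targets = set()
                while len(targets) < m:   # m distinct neighbors
                    targets.add(rng.choice(stubs))
                for u in targets:
                    edges.append((t, u))
                    stubs.extend((t, u))
            return edges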

  • Weaknesses of the BA model

    It is not directed (not good as a model for the Web).

    It focuses mainly on the (in-)degree and does not take into account other parameters (out-degree distribution, components, clustering coefficient).

    It correlates age with degree, which is not always the case: older vertices have higher mean degree.

    Many variations have been considered, some in order to address the above problems:
    - edge rewiring, appearance and disappearance
    - fitness parameters
    - variable mean degree
    - non-linear preferential attachment

  • The LCD model [Bollobás-Riordan]

    Self-loops and multiple edges are allowed.

    A new vertex v connects to a vertex u with probability proportional to the degree of u, counting the new edge.

    The m edges are inserted sequentially, so the problem reduces to studying the single-edge (m = 1) problem.

  • Preferential attachment graphs

    Expected diameter:
    - if m = 1, the diameter is Θ(log n)
    - if m > 1, the diameter is Θ(log n / log log n)

    Expected clustering coefficient:

    E[C] = ((m - 1)/8) * (log n)^2 / n

  • Copying model

    Each node has constant out-degree d.

    A new node selects uniformly one of the existing nodes as a prototype.

    For the i-th outgoing link:
    - with probability α it copies the i-th link of the prototype node
    - with probability 1-α it selects the target of the link uniformly at random
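    A minimal sketch of one growth step in Python (the seed graph and parameter values are illustrative assumptions):

        import random

        def copying_model(n, d=3, alpha=0.5, seed=0):
            """Sketch of the copying model: each new node picks a uniform
            prototype; link i is copied w.p. alpha, else chosen uniformly."""
            rng = random.Random(seed)
            # Seed: d+1 nodes, each pointing to the d others (arbitrary choice).
            out = {v: [u for u in range(d + 1) if u != v] for v in range(d + 1)}
            for t in range(d + 1, n):
                proto = rng.randrange(t)              # uniform prototype
                links = []
                for i in range(d):
                    if rng.random() < alpha:
                        links.append(out[proto][i])   # copy prototype's i-th link
                    else:
                        links.append(rng.randrange(t))  # uniform random target
                out[t] = links
            return out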

  • An example

    [Figure: one step of the copying process; not reproduced in this transcript]

  • Copying model properties

    Power law degree distribution with exponent (2-α)/(1-α)

    The number of bipartite cliques of size i x d is about n·e^(-i)

    The model was meant to capture the topical nature of the Web.

    It has also found applications in biological networks.

  • Other graph models

    Cooper-Frieze model: multiple parameters that allow for adding vertices and edges, preferential attachment, and uniform linking.

    Directed graphs [Bollobás et al.]: allow for preferential selection of both the source and the destination; allow for edges from both new and old vertices.

  • Small world network models:

    - Watts & Strogatz (clustering & short paths)
    - Kleinberg (geographical)
    - Watts, Dodds & Newman (hierarchical)

  • Small world phenomenon: Watts/Strogatz model

    Reconciling two observations:
    - High clustering: my friends' friends tend to be my friends
    - Short average paths

    Source: Watts, D.J., Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

  • Watts-Strogatz model: Generating small world graphs

    As in many network generating algorithms:
    - Disallow self-edges
    - Disallow multiple edges

    Either select a fraction p of edges and reposition one of their endpoints, or add a fraction p of additional edges, leaving the underlying grid intact.

    Source: Watts, D.J., Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

  • Watts-Strogatz model: Generating small world graphs (continued)

    Each node has K >= 4 nearest neighbors (local).

    Tunable: vary the probability p of rewiring any given edge:
    - small p: regular grid
    - large p: classical random graph
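    A sketch of the rewiring construction, on a ring lattice for brevity (the slides' underlying grid works the same way; names and default values are illustrative):

        import random

        def watts_strogatz(n, k=4, p=0.1, seed=0):
            """Sketch of the Watts-Strogatz model: ring lattice where each node
            links to its k nearest neighbors; each edge is rewired w.p. p."""
            rng = random.Random(seed)
            # Ring lattice: connect each node to k/2 neighbors on each side.
            edges = {(v, (v + j) % n) for v in range(n) for j in range(1, k // 2 + 1)}

            def present(a, b):
                return (a, b) in edges or (b, a) in edges

            for (u, v) in sorted(edges):        # frozen snapshot of lattice edges
                if rng.random() < p:
                    w = rng.randrange(n)
                    while w == u or present(u, w):  # no self-loops or multi-edges
                        w = rng.randrange(n)
                    edges.remove((u, v))
                    edges.add((u, w))
            return edges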

  • Watts/Strogatz model: What happens in between?

    Does a small shortest path imply small clustering? Does a large shortest path imply large clustering?

    Through numerical simulation: as we increase p from 0 to 1, we see a fast decrease of the mean distance, but only a slow decrease in clustering.

  • Watts/Strogatz model: Change in clustering coefficient and average path length as a function of the proportion of rewired edges

    [Plot: l(p)/l(0) and C(p)/C(0) versus p, with markers at 1% and 10% of links rewired; path length falls much faster than clustering.]

    Source: Watts, D.J., Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

  • What features of real social networks are missing from the small world model?

    - Long-range links are not as likely as short-range ones
    - Hierarchical structure / groups
    - Hubs

  • Geographical small world models

    "The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain." S. Milgram, The small world problem, Psychology Today 1, 61 (1967)

    [Map: the chain's progress from Nebraska (NE) to Massachusetts (MA)]

  • Kleinberg's geographical small world model

    Nodes are placed on a grid and connect to their nearest neighbors; additional links are placed with

    p(link between u and v) ∝ (distance(u,v))^(-r)

    where r is the exponent that will determine navigability.

    Source: Kleinberg, Navigation in a small world

  • Decentralized Algorithm

    Node s must send message m to node t. At any moment, the current message holder u must pass m to a neighbor, given only:
    - the set of local contacts of all nodes (grid structure)
    - the location on the grid of destination node t
    - the locations and long-range contacts of all nodes that have seen m (but not the long-range contacts of nodes that have not seen m)

  • Delivery Time

    Definition: The expected delivery time is the expectation, over the choice of long-range contacts and a uniformly random source and destination, of the number of steps taken to deliver the message.

  • Results [Kleinberg, 2000]

    Theorem 1: There is a decentralized algorithm A so that when r = 2 and p = q = 1, the expected delivery time of A is O(log^2 n).

    Theorem 2: (a) For 0 <= r < 2, the expected delivery time of any decentralized algorithm is Ω(n^((2-r)/3)). (b) For r > 2, the expected delivery time of any decentralized algorithm is Ω(n^((r-2)/(r-1))). (Constants depend on p, q, and r.)

  • Proof of Theorem 1

    Algorithm: In each step, u sends m to its neighbor v which is closest (in grid distance) to t.

    Proof sketch: Define phases based on how close m is to t:
    - the algorithm is in phase j if 2^j <= dist(m,t) < 2^(j+1)
    - prove we don't spend much time in any phase: the expected time in phase j is at most O(log n), for all j
    - conclude: since there are at most log n + 1 phases, the expected delivery time is O(log^2 n)
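    A small simulation sketch of this greedy rule, on a 1D cycle with one long-range contact per node to keep it short (the lecture's model is a 2D grid with p = q = 1; this is an illustrative variant, not Kleinberg's exact setup, and on a ring the navigable exponent is r = 1 rather than 2, matching r = dimension):

        import random

        def greedy_delivery_time(n, r, seed=0):
            """Sketch: build a Kleinberg-style ring (one long-range contact per
            node, chosen w.p. proportional to distance^(-r)) and greedily route
            a message from a random s to a random t, counting steps."""
            rng = random.Random(seed)
            dist = lambda a, b: min((a - b) % n, (b - a) % n)
            long_range = []
            for u in range(n):
                others = [v for v in range(n) if v != u]
                weights = [dist(u, v) ** (-r) for v in others]
                long_range.append(rng.choices(others, weights)[0])
            s, t = rng.randrange(n), rng.randrange(n)
            cur, steps = s, 0
            while cur != t:
                # Greedy rule from the proof: pass m to the neighbor closest to t.
                nbrs = [(cur - 1) % n, (cur + 1) % n, long_range[cur]]
                cur = min(nbrs, key=lambda v: dist(v, t))
                steps += 1
            return steps

        # Average delivery time over a few trials at the navigable exponent.
        print(sum(greedy_delivery_time(500, 1, seed) for seed in range(10)) / 10)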

  • Geographical search when the network lacks locality

    When r = 0, links are randomly distributed; the average shortest path is ~ log(n), with n the size of the grid.

    Yet when r = 0, any decentralized algorithm takes at least a0·n^(2/3) steps.

    More generally, when r < 2 the expected search time of any decentralized algorithm is at least proportional to n^((2-r)/3) (Theorem 2a above).

  • Overly localized links on a grid

    When r > 2, the expected search time is ~ N^((r-2)/(r-1)).

  • Geographical small world model: links balanced between long and short range

    When r = 2, i.e. p(u,v) ~ d(u,v)^(-2), the expected delivery time of a decentralized algorithm is at most C·(log N)^2.

  • In this lecture

    Structure and modeling of social networks: power law graphs; small world phenomenon; random generative models

    Mining & clustering large networks: ranking through link analysis (HITS, PageRank); web spam; local and spectral clustering; MapReduce and message-passing-based algorithms

    Viral marketing over social networks: advertising, influence maximization, and revenue maximization over social networks

  • Components of a search engine

    Crawler: how to handle different types of URLs; how often to crawl each page; how to detect duplicates

    Indexer: data structures (to minimize the number of disk accesses)

    Query handler: find the set of pages that contain the query word; sort the results

  • Difficulties

    Too many hits (e.g., for "ad auctions" the number of indexed pages is 330,000,000).

    Often too many pages contain the query.

    Sometimes pages are not sufficiently self-descriptive.

    Brin & Page: as of Nov '97, only one of the top four commercial search engines finds itself (in response to a query for its own name)!

    Need to find popular pages.

  • Link analysis

    Instead of using text analysis, we analyze the structure of hyperlinks to extract information about the popularity of a page.

    Advantages:
    - Less need for complicated text analysis
    - Less manipulable, and independent of one person's point of view (think of it as a voting system)

  • Link-based ranking of search results

    Compute the importance of nodes (in the web graph). Various notions of node centrality:
    - Degree centrality = degree of u
    - Betweenness centrality = # of shortest paths passing through u
    - Closeness centrality = avg. length of shortest paths from u to all other nodes

    HITS (Hypertext Induced Topic Selection)
    PageRank: Google's first link-based ranking

    (A quick sketch of the three centrality notions follows below.)
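    All three centrality notions are available off the shelf; a minimal sketch using networkx (the toy graph is an illustrative assumption):

        import networkx as nx

        # Toy graph: a hub (node 0) plus a short tail, to contrast the notions.
        G = nx.Graph([(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)])

        print(nx.degree_centrality(G))       # degree of u (normalized)
        print(nx.betweenness_centrality(G))  # fraction of shortest paths through u
        print(nx.closeness_centrality(G))    # reciprocal of avg. distance from u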

  • HITS and PageRank

    HITS (Hypertext Induced Topic Selection): J. Kleinberg, Authoritative sources in a hyperlinked environment, SODA 1998.

    PageRank:
    - S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine, WWW 1998.
    - L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: bringing order to the web.

  • Relevance vs. popularity

    How to balance relevance and popularity?
    - Construct a focused subgraph based on relevance, and return the most popular page in this subgraph.
    - Or compute a measure of relevance (considering how many times and in what form [title/url/font size/anchor] the query appears in the page), and multiply it with a popularity measure, e.g. PageRank.

  • Constructing a focused subgraph

    Given query σ, start with the set R of the top ~200 text-based hits for σ.

    Add to this set:
    - the set of pages that have a link from a page in R;
    - the set of pages that have a link to a page p in R, with an upper limit of ~50 pages per p ∈ R.

    Call the resulting set S. Find the most authoritative page in G[S].

  • Constructing a focused subgraph

    Desired properties:
    - Relatively small
    - Rich in relevant pages
    - Contains most of the strongest authorities on the subject

  • Finding authorities

    Approach 1: vertices with the largest in-degrees. This approach is used to evaluate scientific citations (the impact factor).

    Deficiencies:
    - A page might have a large in-degree from low-quality pages.
    - Universally popular pages often dominate the result.
    - Easy to manipulate.

  • Finding authorities: Mutually Recursive Definition

    Approach 2: define the set of authorities recursively.

    A good hub links to many good authorities: the best hubs on a subject give links to the best authorities on the subject.

    A good authority is linked from many good hubs: the best authorities on a subject have a large in-degree from the best hubs on the subject.

    Model using two scores for each node: a hub score and an authority score, represented as vectors h and a.

  • Authority and Hubness

    [Diagram: node 1 receives links from nodes 2, 3, 4 and points to nodes 5, 6, 7]

    a(1) = h(2) + h(3) + h(4)
    h(1) = a(5) + a(6) + a(7)

  • Finding authorities

    Initialize authority and hub weights, a0 and h0.
    while (not converged):
      for each vertex i:
        a_{k+1}(i) = sum_{j in B_i} h_k(j)
        h_{k+1}(i) = sum_{j in F_i} a_k(j)

    (B_i: in-neighbors of i; F_i: out-neighbors of i.)

    Does it converge? What does it converge to?

  • Rewrite in matrix form

    h = A a and a = A^T h, where A^T is the transpose of the adjacency matrix A.

    The iteration is guaranteed to converge.

    Substituting: h = A A^T h and a = A^T A a.

    Fact: h is an eigenvector of A A^T and a is an eigenvector of A^T A. (Proof on the board.)

    Further, this algorithm is a particular, known algorithm for computing eigenvectors: the power iteration method.
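    A sketch of that power iteration in numpy, with per-step normalization (a standard detail the slides leave implicit), run on a toy adjacency matrix matching the a(1)/h(1) example above; all names are illustrative:

        import numpy as np

        def hits(A, iters=50):
            """Sketch of HITS power iteration: a <- A^T h, h <- A a, normalizing
            each step; converges to principal eigenvectors of A^T A and A A^T."""
            n = A.shape[0]
            a = np.ones(n)
            h = np.ones(n)
            for _ in range(iters):
                a = A.T @ h
                a /= np.linalg.norm(a)
                h = A @ a
                h /= np.linalg.norm(h)
            return a, h

        # A[i, j] = 1 iff i links to j. Toy graph echoing the earlier slide:
        # node 0 is pointed to by 1, 2, 3 and points to 4, 5, 6.
        A = np.zeros((7, 7))
        for j in (1, 2, 3):
            A[j, 0] = 1
        for j in (4, 5, 6):
            A[0, j] = 1
        a, h = hits(A)
        print(np.round(a, 3), np.round(h, 3))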

  • HITS Example Results

    [Chart: authority and hubness weights for 15 example pages]

  • PageRank

    Again, the idea is a recursive definition of importance: an important page is a page that has many links from other important pages.

    Problems with the naive definition:
    - Not always well-defined.
    - Pages with no out-degree form rank sinks.

    Solution: add a random restart everywhere to get rid of sinks and make the definition well-defined.

  • PageRank: Random Walk View

    Fix: consider a random surfer, who at each step either clicks on a random link, or, with probability a, gets bored and starts again from a random page.

    PageRank takes a = 15%, and uses a non-uniform distribution for starting again.

  • PageRank Algorithm

    Let the restart probability be a, e.g. a = 15%.
    Initialize ranks P0.
    while (not converged):
      for each vertex i:
        P_{k+1}(i) = a/n + (1-a) * sum_{j in B_i} P_k(j)/N_j

    (B_i: in-neighbors of i; N_j: out-degree of j.)
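    A plain-Python sketch of this iteration (adjacency given as out-link lists; spreading a dangling node's rank uniformly is one common convention, assumed here rather than taken from the slides):

        def pagerank(out_links, a=0.15, iters=100):
            """Sketch of P(i) <- a/n + (1-a) * sum_{j -> i} P(j)/outdeg(j)."""
            n = len(out_links)
            P = [1.0 / n] * n
            for _ in range(iters):
                nxt = [a / n] * n
                for j, outs in enumerate(out_links):
                    if outs:
                        share = (1 - a) * P[j] / len(outs)
                        for i in outs:
                            nxt[i] += share
                    else:
                        # Dangling node: spread its rank uniformly (a convention).
                        for i in range(n):
                            nxt[i] += (1 - a) * P[j] / n
                P = nxt
            return P

        # Tiny example: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
        print(pagerank([[1], [2], [0, 1]]))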

  • Markov Chain Notation

    Random surfer model: a description of a random walk through the Web graph, interpreted via a transition matrix; a page's rank is the asymptotic probability that the surfer is currently browsing that page.

    Does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?

    r_t = M r_{t-1}, where M is the transition matrix of a first-order Markov chain (stochastic).

  • Problem: Rank Sinks

    In general, many Web pages have no inlinks/outlinks, which results in dangling edges in the graph. E.g.:
    - a page with no parent gets rank 0, and M^t converges to a matrix whose corresponding column is all zero;
    - a page with no children gives no solution, and M^t converges to the zero matrix.

  • Modification

    The surfer will restart browsing by picking a new Web page at random:

    M = (1-a) B + a E

    where B is the (normalized) adjacency matrix, E is the escape matrix, and the resulting M is a stochastic matrix.

    As before, r_t = M r_{t-1}, with M the transition matrix of a first-order Markov chain.

  • PageRank Algorithm

    Let the restart probability be a, e.g. a = 15%.
    Initialize ranks P0.
    while (not converged):
      for each vertex i:
        P_{k+1}(i) = a/n + (1-a) * sum_{j in Incoming(i)} P_k(j)/Deg(j)

  • PageRank: Issues and Variants

    How realistic is the random surfer model?
    - What if we modeled the back button? [Fagi00]
    - Surfer behavior is sharply skewed towards short paths. [Hube98]
    - Search engines, bookmarks & directories make jumps non-random.

    Biased surfer models:
    - Weight edge traversal probabilities based on the match with the topic/query (non-uniform edge selection)
    - Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)

  • Topic Specific (Personalized) PageRank [Have02]

    Conceptually, we use a random surfer who teleports, with say 15% probability, using the following rule:
    - Select a category (say, one of the 16 top-level ODP categories) based on a query- & user-specific distribution over the categories
    - Restart to a page uniformly at random within the chosen category

    Sounds hard to implement: we can't compute PageRank at query time!

  • Topic Specific (Personalized) PageRank [Have02]

    Implementation:

    Offline: compute PageRank distributions with respect to individual categories. Query-independent model as before; each page has multiple PageRank scores, one for each ODP category, with teleportation only to that category.

    Online: the distribution of weights over categories is computed by query context classification. Generate a dynamic PageRank score for each page: a weighted sum of the category-specific PageRanks.

  • Influencing PageRank (Personalization)

    Input:
    - Web graph W
    - influence vector v: (page → degree of influence)

    Output:
    - rank vector r: (page → page importance with respect to v)

    r = PR(W, v)

  • Non-uniform Teleportation

    Teleport with 10% probability to a Sports page.

    [Figure: web graph with the Sports pages highlighted]

  • Interpretation of Composite Score

    For a set of personalization vectors {v_j}:

    sum_j [w_j PR(W, v_j)] = PR(W, sum_j [w_j v_j])

    A weighted sum of rank vectors itself forms a valid rank vector, because PR( ) is linear with respect to v_j.
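    A quick numeric check of this linearity, assuming the standard linear-system form p = a·v + (1-a)·W·p for PR(W, v) (an assumption about the exact definition, consistent with the slides):

        import numpy as np

        def ppr(W, v, a=0.15):
            """PR(W, v) as the solution of p = a*v + (1-a)*W p, W column-stochastic."""
            n = W.shape[0]
            return a * np.linalg.solve(np.eye(n) - (1 - a) * W, v)

        # Column-stochastic transition matrix of the 3-cycle 0 -> 1 -> 2 -> 0.
        W = np.array([[0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
        v_sports = np.array([1.0, 0.0, 0.0])   # illustrative restart vectors
        v_health = np.array([0.0, 0.0, 1.0])
        lhs = 0.9 * ppr(W, v_sports) + 0.1 * ppr(W, v_health)
        rhs = ppr(W, 0.9 * v_sports + 0.1 * v_health)
        assert np.allclose(lhs, rhs)           # the weighted sum is a valid rank vector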

  • Interpretation

    10% Sports teleportation. [Figure: rank mass concentrated on the Sports region]

  • Interpretation

    10% Health teleportation. [Figure: rank mass concentrated on the Health region]

  • Interpretation

    Mixing the Sports and Health vectors: pr = 0.9 PR_sports + 0.1 PR_health gives you 9% sports teleportation and 1% health teleportation.

  • Pairwise Similarity: Personalized PageRank

    Personalized PageRank (PPR) of the pair u, v: the probability of visiting v in the following random walk: at each step,
    - with probability a, go back to u;
    - with probability 1-a, go to a neighbor uniformly at random.

    PPR is a similarity measure: it captures both distance and the number of disjoint paths.
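    A Monte-Carlo sketch of this definition: simulate the restart walk and estimate PPR(u, ·) by visit frequencies (the graph and parameter values are illustrative):

        import random

        def ppr_estimate(nbrs, u, a=0.15, steps=200_000, seed=0):
            """Estimate PPR(u, .) as the fraction of time the restart walk
            spends at each node."""
            rng = random.Random(seed)
            visits = [0] * len(nbrs)
            cur = u
            for _ in range(steps):
                visits[cur] += 1
                if rng.random() < a:
                    cur = u                      # restart at u
                else:
                    cur = rng.choice(nbrs[cur])  # uniform random neighbor
            return [c / steps for c in visits]

        # Toy undirected graph as adjacency lists.
        nbrs = [[1, 2], [0, 2], [0, 1, 3], [2]]
        print(ppr_estimate(nbrs, u=0))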

  • Approximate PPR vectors

    PPR vector for u: the vector of PPR values from u.
    Contribution PR (CPR) vector for u: the vector of PPR values to u.

    Goal: compute approximate PPR or CPR vectors with an additive error of ε.

  • Next Lecture

    - Link analysis: link spam detection
    - Local algorithms for computing approximate PPR
    - Local and spectral graph partitioning

  • Further Reading

    M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.

  • Credits/References

    Some material used in preparing this lecture:
    - M. Newman's course on Networks, U. Michigan
    - Nicole Immorlica and Mohammad Mahdian's course at U. of Washington, 2006
    - Jure Leskovec's course on Information and Social Networks, Stanford, 2011
    - Lada Adamic's course on Networks: Theory and Applications at U. Michigan

    Thanks!