Vahab Lecture 2

  • Algorithms and Economics of the Internet (CSCI-GA.3033-003)

    Vahab Mirrokni, Google Research, New York
    Richard Cole, Courant Institute, NYU

  • In this lecture

    Structure and modeling of social networks: power law graphs; small world phenomenon; random generative models

    Mining & clustering large networks: ranking through link analysis (HITS, PageRank); web spam; local and spectral clustering; MapReduce and message-passing-based algorithms

    Viral marketing over social networks: advertising, influence maximization, and revenue maximization over social networks

  • Real Networks vs. Random Networks

    Are real networks like random graphs?
    - Average path length, disruption (random failures): YES
    - Clustering coefficient, degree distribution, attack (targeted failures): NO

    Problems with the random network model:
    - Degree distribution differs from that of real networks
    - The giant component in most real networks does NOT emerge through a phase transition
    - No local structure: the clustering coefficient is too low

    Most important: are real networks random? The answer is simply NO, which motivates the other random generative models below. But random graphs are still useful in analyzing those models.

  • Models of Network Growth

    - Preferential attachment
    - Price's model
    - The Barabási-Albert model
    - The LCD model [Bollobás-Riordan]
    - The copying model

  • Preferential attachment

    The main idea is that the rich get richer:
    - first studied by Yule for the sizes of biological genera
    - revisited by Simon
    - reinvented multiple times

    Also known as: the Gibrat principle, cumulative advantage, the Matthew effect

  • Preferential Attachment in Networks

    First considered by [Price 65] as a model for citation networks:
    - each new paper is generated with m citations
    - new papers cite previous papers with probability proportional to their in-degree (citations)

    What about papers without any citations? Each paper is considered to have a default citation: the probability of citing a paper with in-degree k is proportional to k+1.

    Result: power law with exponent 2 + 1/m
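    As a concrete illustration, a minimal Python sketch of Price's process (function name, seed graph, and parameters are illustrative assumptions, not from the lecture). The "ticket urn" holds one entry per unit of (in-degree + 1), so a uniform draw from it is exactly the proportional choice:

        import random

        def price_model(n, m=3, seed=0):
            """Sketch of Price's citation model: paper t cites m earlier papers,
            each chosen with probability proportional to (in-degree + 1)."""
            rng = random.Random(seed)
            indeg = [0] * n
            urn = list(range(m))          # m seed papers, one ticket each
            edges = []
            for t in range(m, n):
                targets = set()
                while len(targets) < m:   # m distinct citations
                    targets.add(rng.choice(urn))
                for j in targets:
                    edges.append((t, j))
                    indeg[j] += 1
                    urn.append(j)         # one more ticket per citation received
                urn.append(t)             # the "+1" ticket for the new paper
            return edges, indeg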

  • Barabási-Albert model

    Undirected model: each node connects to other nodes with probability proportional to their degree.
    - the process starts with some initial subgraph
    - each new node comes with m edges

    Results in a power law with exponent 3
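    A matching sketch for the undirected BA process, assuming a small initial clique (an arbitrary choice; the slide only says "some initial subgraph"). Each node appears in the stubs list once per incident edge, so a uniform draw from it is a degree-proportional draw:

        import random

        def barabasi_albert(n, m=2, seed=0):
            """Sketch of the BA model: each new node attaches m edges to
            existing nodes chosen with probability proportional to degree."""
            rng = random.Random(seed)
            # Start from a clique on m+1 nodes so every node has degree >= m.
            edges = [(i, j) for i in range(m + 1) for j in range(i)]
            stubs = [v for e in edges for v in e]
            for t in range(m + 1, n):
                targets = set()
                while len(targets) < m:   # m distinct neighbors
                    targets.add(rng.choice(stubs))
                for u in targets:
                    edges.append((t, u))
                    stubs.extend((t, u))
            return edges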

  • Weaknesses of the BA model

    It is not directed (not good as a model for the Web).

    It focuses mainly on the (in-)degree and does not take into account other parameters (out-degree distribution, components, clustering coefficient).

    It correlates age with degree, which is not always the case: older vertices have higher mean degree.

    Many variations have been considered, some in order to address the above problems:
    - edge rewiring, appearance and disappearance
    - fitness parameters
    - variable mean degree
    - non-linear preferential attachment

  • The LCD model [Bollobás-Riordan]

    Self-loops and multiple edges are allowed.

    A new vertex v connects to a vertex u with probability proportional to the degree of u, counting the new edge.

    The m edges are inserted sequentially, so the problem reduces to studying the single-edge (m = 1) problem.

  • Preferential attachment graphs

    Expected diameter:
    - if m = 1, the diameter is Θ(log n)
    - if m > 1, the diameter is Θ(log n / log log n)

    Expected clustering coefficient:

    E[C] = ((m - 1)/8) * (log n)^2 / n

  • Copying model

    Each node has constant out-degree d.

    A new node selects uniformly one of the existing nodes as a prototype.

    For the i-th outgoing link:
    - with probability α it copies the i-th link of the prototype node
    - with probability 1-α it selects the target of the link uniformly at random
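    A minimal sketch of one growth step in Python (the seed graph and parameter values are illustrative assumptions):

        import random

        def copying_model(n, d=3, alpha=0.5, seed=0):
            """Sketch of the copying model: each new node picks a uniform
            prototype; link i is copied w.p. alpha, else chosen uniformly."""
            rng = random.Random(seed)
            # Seed: d+1 nodes, each pointing to the d others (arbitrary choice).
            out = {v: [u for u in range(d + 1) if u != v] for v in range(d + 1)}
            for t in range(d + 1, n):
                proto = rng.randrange(t)              # uniform prototype
                links = []
                for i in range(d):
                    if rng.random() < alpha:
                        links.append(out[proto][i])   # copy prototype's i-th link
                    else:
                        links.append(rng.randrange(t))  # uniform random target
                out[t] = links
            return out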

  • An example

    [Figure: one step of the copying process; not reproduced in this transcript]

  • Copying model properties

    Power law degree distribution with exponent (2-α)/(1-α)

    The number of bipartite cliques of size i x d is about n·e^(-i)

    The model was meant to capture the topical nature of the Web.

    It has also found applications in biological networks.

  • Other graph models

    Cooper-Frieze model: multiple parameters that allow for adding vertices and edges, preferential attachment, and uniform linking.

    Directed graphs [Bollobás et al.]: allow for preferential selection of both the source and the destination; allow for edges from both new and old vertices.

  • Small world network models:

    - Watts & Strogatz (clustering & short paths)
    - Kleinberg (geographical)
    - Watts, Dodds & Newman (hierarchical)

  • Small world phenomenon: Watts/Strogatz model

    Reconciling two observations:
    - High clustering: my friends' friends tend to be my friends
    - Short average paths

    Source: Watts, D.J., Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

  • Watts-Strogatz model: Generating small world graphs

    As in many network generating algorithms:
    - Disallow self-edges
    - Disallow multiple edges

    Either select a fraction p of edges and reposition one of their endpoints, or add a fraction p of additional edges, leaving the underlying grid intact.

    Source: Watts, D.J., Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

  • Watts-Strogatz model: Generating small world graphs (continued)

    Each node has K >= 4 nearest neighbors (local).

    Tunable: vary the probability p of rewiring any given edge:
    - small p: regular grid
    - large p: classical random graph
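    A sketch of the rewiring construction, on a ring lattice for brevity (the slides' underlying grid works the same way; names and default values are illustrative):

        import random

        def watts_strogatz(n, k=4, p=0.1, seed=0):
            """Sketch of the Watts-Strogatz model: ring lattice where each node
            links to its k nearest neighbors; each edge is rewired w.p. p."""
            rng = random.Random(seed)
            # Ring lattice: connect each node to k/2 neighbors on each side.
            edges = {(v, (v + j) % n) for v in range(n) for j in range(1, k // 2 + 1)}

            def present(a, b):
                return (a, b) in edges or (b, a) in edges

            for (u, v) in sorted(edges):        # frozen snapshot of lattice edges
                if rng.random() < p:
                    w = rng.randrange(n)
                    while w == u or present(u, w):  # no self-loops or multi-edges
                        w = rng.randrange(n)
                    edges.remove((u, v))
                    edges.add((u, w))
            return edges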

  • Watts/Strogatz model: What happens in between?

    Does a small shortest path imply small clustering? Does a large shortest path imply large clustering?

    Through numerical simulation: as we increase p from 0 to 1, we see a fast decrease of the mean distance, but only a slow decrease in clustering.

  • Watts/Strogatz model: Change in clustering coefficient and average path length as a function of the proportion of rewired edges

    [Plot: l(p)/l(0) and C(p)/C(0) versus p, with markers at 1% and 10% of links rewired; path length falls much faster than clustering.]

    Source: Watts, D.J., Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

  • What features of real social networks are missing from the small world model?

    - Long-range links are not as likely as short-range ones
    - Hierarchical structure / groups
    - Hubs

  • Geographical small world models

    "The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain." S. Milgram, The small world problem, Psychology Today 1, 61 (1967)

    [Map: the chain's progress from Nebraska (NE) to Massachusetts (MA)]

  • Kleinberg's geographical small world model

    Nodes are placed on a grid and connect to their nearest neighbors; additional links are placed with

    p(link between u and v) ∝ (distance(u,v))^(-r)

    where r is the exponent that will determine navigability.

    Source: Kleinberg, Navigation in a small world

  • Decentralized Algorithm

    Node s must send message m to node t. At any moment, the current message holder u must pass m to a neighbor, given only:
    - the set of local contacts of all nodes (grid structure)
    - the location on the grid of destination node t
    - the locations and long-range contacts of all nodes that have seen m (but not the long-range contacts of nodes that have not seen m)

  • Delivery Time

    Definition: The expected delivery time is the expectation, over the choice of long-range contacts and a uniformly random source and destination, of the number of steps taken to deliver the message.

  • Results [Kleinberg, 2000]

    Theorem 1: There is a decentralized algorithm A so that when r = 2 and p = q = 1, the expected delivery time of A is O(log^2 n).

    Theorem 2: (a) For 0 <= r < 2, the expected delivery time of any decentralized algorithm is Ω(n^((2-r)/3)). (b) For r > 2, the expected delivery time of any decentralized algorithm is Ω(n^((r-2)/(r-1))). (Constants depend on p, q, and r.)

  • Proof of Theorem 1

    Algorithm: In each step, u sends m to its neighbor v which is closest (in grid distance) to t.

    Proof sketch: Define phases based on how close m is to t:
    - the algorithm is in phase j if 2^j <= dist(m,t) < 2^(j+1)
    - prove we don't spend much time in any phase: the expected time in phase j is at most O(log n), for all j
    - conclude: since there are at most log n + 1 phases, the expected delivery time is O(log^2 n)
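    A small simulation sketch of this greedy rule, on a 1D cycle with one long-range contact per node to keep it short (the lecture's model is a 2D grid with p = q = 1; this is an illustrative variant, not Kleinberg's exact setup, and on a ring the navigable exponent is r = 1 rather than 2, matching r = dimension):

        import random

        def greedy_delivery_time(n, r, seed=0):
            """Sketch: build a Kleinberg-style ring (one long-range contact per
            node, chosen w.p. proportional to distance^(-r)) and greedily route
            a message from a random s to a random t, counting steps."""
            rng = random.Random(seed)
            dist = lambda a, b: min((a - b) % n, (b - a) % n)
            long_range = []
            for u in range(n):
                others = [v for v in range(n) if v != u]
                weights = [dist(u, v) ** (-r) for v in others]
                long_range.append(rng.choices(others, weights)[0])
            s, t = rng.randrange(n), rng.randrange(n)
            cur, steps = s, 0
            while cur != t:
                # Greedy rule from the proof: pass m to the neighbor closest to t.
                nbrs = [(cur - 1) % n, (cur + 1) % n, long_range[cur]]
                cur = min(nbrs, key=lambda v: dist(v, t))
                steps += 1
            return steps

        # Average delivery time over a few trials at the navigable exponent.
        print(sum(greedy_delivery_time(500, 1, seed) for seed in range(10)) / 10)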

  • Geographical search when the network lacks locality

    When r = 0, links are randomly distributed; the average shortest path is ~ log(n), with n the size of the grid.

    Yet when r = 0, any decentralized algorithm takes at least a0·n^(2/3) steps.

    More generally, when r < 2 the expected search time of any decentralized algorithm is at least proportional to n^((2-r)/3) (Theorem 2a above).

  • Overly localized links on a grid

    When r > 2, the expected search time is ~ N^((r-2)/(r-1)).

  • Geographical small world model: links balanced between long and short range

    When r = 2, i.e. p(u,v) ~ d(u,v)^(-2), the expected delivery time of a decentralized algorithm is at most C·(log N)^2.

  • In this lecture

    Structure and modeling of social networks: power law graphs; small world phenomenon; random generative models

    Mining & clustering large networks: ranking through link analysis (HITS, PageRank); web spam; local and spectral clustering; MapReduce and message-passing-based algorithms

    Viral marketing over social networks: advertising, influence maximization, and revenue maximization over social networks

  • Components of a search engine

    Crawler: how to handle different types of URLs; how often to crawl each page; how to detect duplicates

    Indexer: data structures (to minimize the number of disk accesses)

    Query handler: find the set of pages that contain the query word; sort the results

  • Difficulties

    Too many hits (e.g., for "ad auctions" the number of indexed pages is 330,000,000).

    Often too many pages contain the query.

    Sometimes pages are not sufficiently self-descriptive.

    Brin & Page: as of Nov '97, only one of the top four commercial search engines finds itself (in response to a query for its own name)!

    Need to find popular pages.

  • Link analysis

    Instead of using text analysis, we analyze the structure of hyperlinks to extract information about the popularity of a page.

    Advantages:
    - Less need for complicated text analysis
    - Less manipulable, and independent of one person's point of view (think of it as a voting system)

  • Link-based ranking of search results

    Compute the importance of nodes (in the web graph). Various notions of node centrality:
    - Degree centrality = degree of u
    - Betweenness centrality = # of shortest paths passing through u
    - Closeness centrality = avg. length of shortest paths from u to all other nodes

    HITS (Hypertext Induced Topic Selection)
    PageRank: Google's first link-based ranking

    (A quick sketch of the three centrality notions follows below.)
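    All three centrality notions are available off the shelf; a minimal sketch using networkx (the toy graph is an illustrative assumption):

        import networkx as nx

        # Toy graph: a hub (node 0) plus a short tail, to contrast the notions.
        G = nx.Graph([(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)])

        print(nx.degree_centrality(G))       # degree of u (normalized)
        print(nx.betweenness_centrality(G))  # fraction of shortest paths through u
        print(nx.closeness_centrality(G))    # reciprocal of avg. distance from u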

  • HITS and PageRank

    HITS (Hypertext Induced Topic Selection): J. Kleinberg, Authoritative sources in a hyperlinked environment, SODA 1998.

    PageRank:
    - S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine, WWW 1998.
    - L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: bringing order to the web.

  • Relevance vs. popularity

    How to balance relevance and popularity?
    - Construct a focused subgraph based on relevance, and return the most popular page in this subgraph.
    - Or compute a measure of relevance (considering how many times and in what form [title/url/font size/anchor] the query appears in the page), and multiply it with a popularity measure, e.g. PageRank.

  • Constructing a focused subgraph

    Given query σ, start with the set R of the top ~200 text-based hits for σ.

    Add to this set:
    - the set of pages that have a link from a page in R;
    - the set of pages that have a link to a page p in R, with an upper limit of ~50 pages per p ∈ R.

    Call the resulting set S. Find the most authoritative page in G[S].

  • Constructing a focused subgraph

    Desired properties:
    - Relatively small
    - Rich in relevant pages
    - Contains most of the strongest authorities on the subject

  • Finding authorities

    Approach 1: vertices with the largest in-degrees. This approach is used to evaluate scientific citations (the impact factor).

    Deficiencies:
    - A page might have a large in-degree from low-quality pages.
    - Universally popular pages often dominate the result.
    - Easy to manipulate.

  • Finding authorities: Mutually Recursive Definition

    Approach 2: define the set of authorities recursively.

    A good hub links to many good authorities: the best hubs on a subject give links to the best authorities on the subject.

    A good authority is linked from many good hubs: the best authorities on a subject have a large in-degree from the best hubs on the subject.

    Model using two scores for each node: a hub score and an authority score, represented as vectors h and a.

  • Authority and Hubness

    [Diagram: node 1 receives links from nodes 2, 3, 4 and points to nodes 5, 6, 7]

    a(1) = h(2) + h(3) + h(4)
    h(1) = a(5) + a(6) + a(7)

  • Finding authorities

    Initialize authority and hub weights, a0 and h0.
    while (not converged):
      for each vertex i:
        a_{k+1}(i) = sum_{j in B_i} h_k(j)
        h_{k+1}(i) = sum_{j in F_i} a_k(j)

    (B_i: in-neighbors of i; F_i: out-neighbors of i.)

    Does it converge? What does it converge to?

  • Rewrite in matrix form

    h = A a and a = A^T h, where A^T is the transpose of the adjacency matrix A.

    The iteration is guaranteed to converge.

    Substituting: h = A A^T h and a = A^T A a.

    Fact: h is an eigenvector of A A^T and a is an eigenvector of A^T A. (Proof on the board.)

    Further, this algorithm is a particular, known algorithm for computing eigenvectors: the power iteration method.
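    A sketch of that power iteration in numpy, with per-step normalization (a standard detail the slides leave implicit), run on a toy adjacency matrix matching the a(1)/h(1) example above; all names are illustrative:

        import numpy as np

        def hits(A, iters=50):
            """Sketch of HITS power iteration: a <- A^T h, h <- A a, normalizing
            each step; converges to principal eigenvectors of A^T A and A A^T."""
            n = A.shape[0]
            a = np.ones(n)
            h = np.ones(n)
            for _ in range(iters):
                a = A.T @ h
                a /= np.linalg.norm(a)
                h = A @ a
                h /= np.linalg.norm(h)
            return a, h

        # A[i, j] = 1 iff i links to j. Toy graph echoing the earlier slide:
        # node 0 is pointed to by 1, 2, 3 and points to 4, 5, 6.
        A = np.zeros((7, 7))
        for j in (1, 2, 3):
            A[j, 0] = 1
        for j in (4, 5, 6):
            A[0, j] = 1
        a, h = hits(A)
        print(np.round(a, 3), np.round(h, 3))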

  • HITS Example Results

    [Chart: authority and hubness weights for 15 example pages]

  • PageRank

    Again, the idea is a recursive definition of importance: an important page is a page that has many links from other important pages.

    Problems with the naive definition:
    - Not always well-defined.
    - Pages with no out-degree form rank sinks.

    Solution: add a random restart everywhere to get rid of sinks and make the definition well-defined.

  • PageRank: Random Walk View

    Fix: consider a random surfer, who at each step either clicks on a random link, or, with probability a, gets bored and starts again from a random page.

    PageRank takes a = 15%, and uses a non-uniform distribution for starting again.

  • PageRank Algorithm

    Let the restart probability be a, e.g. a = 15%.
    Initialize ranks P0.
    while (not converged):
      for each vertex i:
        P_{k+1}(i) = a/n + (1-a) * sum_{j in B_i} P_k(j)/N_j

    (B_i: in-neighbors of i; N_j: out-degree of j.)
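    A plain-Python sketch of this iteration (adjacency given as out-link lists; spreading a dangling node's rank uniformly is one common convention, assumed here rather than taken from the slides):

        def pagerank(out_links, a=0.15, iters=100):
            """Sketch of P(i) <- a/n + (1-a) * sum_{j -> i} P(j)/outdeg(j)."""
            n = len(out_links)
            P = [1.0 / n] * n
            for _ in range(iters):
                nxt = [a / n] * n
                for j, outs in enumerate(out_links):
                    if outs:
                        share = (1 - a) * P[j] / len(outs)
                        for i in outs:
                            nxt[i] += share
                    else:
                        # Dangling node: spread its rank uniformly (a convention).
                        for i in range(n):
                            nxt[i] += (1 - a) * P[j] / n
                P = nxt
            return P

        # Tiny example: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
        print(pagerank([[1], [2], [0, 1]]))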

  • Markov Chain Notation

    Random surfer model: a description of a random walk through the Web graph, interpreted via a transition matrix; a page's rank is the asymptotic probability that the surfer is currently browsing that page.

    Does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?

    r_t = M r_{t-1}, where M is the transition matrix of a first-order Markov chain (stochastic).

  • Problem: Rank Sinks

    In general, many Web pages have no inlinks/outlinks, which results in dangling edges in the graph. E.g.:
    - a page with no parent gets rank 0, and M^t converges to a matrix whose corresponding column is all zero;
    - a page with no children gives no solution, and M^t converges to the zero matrix.

  • Modification

    The surfer will restart browsing by picking a new Web page at random:

    M = (1-a) B + a E

    where B is the (normalized) adjacency matrix, E is the escape matrix, and the resulting M is a stochastic matrix.

    As before, r_t = M r_{t-1}, with M the transition matrix of a first-order Markov chain.

  • PageRank Algorithm

    Let the restart probability be a, e.g. a = 15%.
    Initialize ranks P0.
    while (not converged):
      for each vertex i:
        P_{k+1}(i) = a/n + (1-a) * sum_{j in Incoming(i)} P_k(j)/Deg(j)

  • PageRank: Issues and Variants

    How realistic is the random surfer model?
    - What if we modeled the back button? [Fagi00]
    - Surfer behavior is sharply skewed towards short paths. [Hube98]
    - Search engines, bookmarks & directories make jumps non-random.

    Biased surfer models:
    - Weight edge traversal probabilities based on the match with the topic/query (non-uniform edge selection)
    - Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)

  • Topic Specific (Personalized) PageRank [Have02]

    Conceptually, we use a random surfer who teleports, with say 15% probability, using the following rule:
    - Select a category (say, one of the 16 top-level ODP categories) based on a query- & user-specific distribution over the categories
    - Restart to a page uniformly at random within the chosen category

    Sounds hard to implement: we can't compute PageRank at query time!

  • Topic Specific (Personalized) PageRank [Have02]

    Implementation:

    Offline: compute PageRank distributions with respect to individual categories. Query-independent model as before; each page has multiple PageRank scores, one for each ODP category, with teleportation only to that category.

    Online: the distribution of weights over categories is computed by query context classification. Generate a dynamic PageRank score for each page: a weighted sum of the category-specific PageRanks.

  • Influencing PageRank (Personalization)

    Input:
    - Web graph W
    - influence vector v: (page → degree of influence)

    Output:
    - rank vector r: (page → page importance with respect to v)

    r = PR(W, v)

  • Non-uniform Teleportation

    Teleport with 10% probability to a Sports page.

    [Figure: web graph with the Sports pages highlighted]

  • Interpretation of Composite Score

    For a set of personalization vectors {v_j}:

    sum_j [w_j PR(W, v_j)] = PR(W, sum_j [w_j v_j])

    A weighted sum of rank vectors itself forms a valid rank vector, because PR( ) is linear with respect to v_j.
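    A quick numeric check of this linearity, assuming the standard linear-system form p = a·v + (1-a)·W·p for PR(W, v) (an assumption about the exact definition, consistent with the slides):

        import numpy as np

        def ppr(W, v, a=0.15):
            """PR(W, v) as the solution of p = a*v + (1-a)*W p, W column-stochastic."""
            n = W.shape[0]
            return a * np.linalg.solve(np.eye(n) - (1 - a) * W, v)

        # Column-stochastic transition matrix of the 3-cycle 0 -> 1 -> 2 -> 0.
        W = np.array([[0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
        v_sports = np.array([1.0, 0.0, 0.0])   # illustrative restart vectors
        v_health = np.array([0.0, 0.0, 1.0])
        lhs = 0.9 * ppr(W, v_sports) + 0.1 * ppr(W, v_health)
        rhs = ppr(W, 0.9 * v_sports + 0.1 * v_health)
        assert np.allclose(lhs, rhs)           # the weighted sum is a valid rank vector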

  • Interpretation

    10% Sports teleportation. [Figure: rank mass concentrated on the Sports region]

  • Interpretation

    10% Health teleportation. [Figure: rank mass concentrated on the Health region]

  • Interpretation

    Mixing the Sports and Health vectors: pr = 0.9 PR_sports + 0.1 PR_health gives you 9% sports teleportation and 1% health teleportation.

  • Pairwise Similarity: Personalized PageRank

    Personalized PageRank (PPR) of the pair u, v: the probability of visiting v in the following random walk: at each step,
    - with probability a, go back to u;
    - with probability 1-a, go to a neighbor uniformly at random.

    PPR is a similarity measure: it captures both distance and the number of disjoint paths.
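    A Monte-Carlo sketch of this definition: simulate the restart walk and estimate PPR(u, ·) by visit frequencies (the graph and parameter values are illustrative):

        import random

        def ppr_estimate(nbrs, u, a=0.15, steps=200_000, seed=0):
            """Estimate PPR(u, .) as the fraction of time the restart walk
            spends at each node."""
            rng = random.Random(seed)
            visits = [0] * len(nbrs)
            cur = u
            for _ in range(steps):
                visits[cur] += 1
                if rng.random() < a:
                    cur = u                      # restart at u
                else:
                    cur = rng.choice(nbrs[cur])  # uniform random neighbor
            return [c / steps for c in visits]

        # Toy undirected graph as adjacency lists.
        nbrs = [[1, 2], [0, 2], [0, 1, 3], [2]]
        print(ppr_estimate(nbrs, u=0))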

  • Approximate PPR vectors

    PPR vector for u: the vector of PPR values from u.
    Contribution PR (CPR) vector for u: the vector of PPR values to u.

    Goal: compute approximate PPR or CPR vectors with an additive error of ε.

  • Next Lecture

    - Link analysis: link spam detection
    - Local algorithms for computing approximate PPR
    - Local and spectral graph partitioning

  • Further Reading

    M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.

  • Credits/References

    Some material used in preparing this lecture:
    - M. Newman's course on Networks, U. Michigan
    - Nicole Immorlica and Mohammad Mahdian's course at U. of Washington, 2006
    - Jure Leskovec's course on Information and Social Networks, Stanford, 2011
    - Lada Adamic's course on Networks: Theory and Applications at U. Michigan

    Thanks!