Ch 14. Link Analysis Padmini Srinivasan Computer Science Department http://cs.uiowa.edu/~psriniva padmini-srinivasan@uiowa.edu

Post on 05-Jan-2016

Page 1: Ch 14. Link Analysis Padmini Srinivasan Computer Science Department psriniva padmini-srinivasan@uiowa.edu

Ch 14. Link Analysis

Padmini Srinivasan

Computer Science Department

http://cs.uiowa.edu/~psriniva

padmini-srinivasan@uiowa.edu

Web Search

• Hard problem
– Hats off to ‘information retrieval’
– Complex information needs
• Keywords
• Synonyms, polysemy (multiple meanings)
– True homonyms: row (oar), row (argue); delta (Greek letter, and of a river)
– Polysemous homonyms: mouth (of a river), mouth (of an animal); right ‘hand’ person, ‘hand’ it to me
– The age of intermediaries (BRS After Dark)
– Diversity in writing + diversity in queries + diversity in indexing + diversity in motivations
– Controlled vocabularies vs. free text
– Majority rule? ‘Cornell’

Web Search Peculiarities

• Compared to the good old days
• Needle in a haystack problem; many needles in many haystacks! Which ones to look for?
– How distinct is this from the “traditional” methods for IR? Libraries etc.
– Can we do without libraries?
• Quality – a serious question?
– Does redundancy promote quality?
– Does collaboration promote quality?
• Scale
– Retrieve and FILTER/ORGANIZE
– Satisfying versus satisficing

Link Analysis

• In-links and out-links; in-degree and out-degree
– A matter of endorsement! (directional)
– Akin to citations. What are the differences? Must one out-link?
– Power laws all the way through!
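
In-degree and out-degree can be counted directly from a list of directed links. The following is a minimal sketch; the function name `degrees` and the four-page example graph are hypothetical, introduced only for illustration.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) Counters for directed (source, target) links."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


# Hypothetical four-page web: A links to B and C, B links to C, D links to C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
in_degree, out_degree = degrees(edges)
```

Here page C collects three in-links (three endorsements), while A has the largest out-degree. A `Counter` reports 0 for pages that never appear as a target or source.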

Some studies

• (Kumar et al. 99): Alexa web crawl from 1997, over 40 million nodes. Trawling the Web for cyber communities, Proc. 8th WWW, Apr 1999
• Probability that a page has in-degree k is approximately proportional to 1/k^2
• Probability that a page has in-degree at least k is approximately proportional to 1/k
• Actual exponent slightly larger than 2.
• Barabasi and Albert 1999 – studied the U. Notre Dame web site, with some extensions
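
The two probabilities reported for the Kumar et al. crawl are consistent with each other. A short worked step, assuming the in-degree distribution follows the 1/k^2 form exactly with a normalizing constant c (both are assumptions made here for illustration), shows why the "at least k" probability falls off as 1/k:

```latex
\Pr[\text{in-degree} \ge k]
  = \sum_{j \ge k} \Pr[\text{in-degree} = j]
  \approx \sum_{j \ge k} \frac{c}{j^{2}}
  \approx \int_{k}^{\infty} \frac{c}{x^{2}}\,dx
  = \frac{c}{k}
```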

Broder et al. Graph Structure of the Web

Note that the exponent is different. Note also the deviation at the low end of the out-degree distribution.

Fractals?

• Broder et al.: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.”

• Graph structure in the web

Similar Studies

• Donato et al. ACM TOIT, 2007. The Web as a Graph: How Far We Are
– In-degree: power law; exponent 2.1 (Fig. 4)
– Out-degree: not so good (Fig. 5)
– Check out Fig. 8: SCC distribution (number of SCCs versus size of SCC). Power law; exponent 2.09
• Webbase, 200 Million Stanford crawl (2001)
– 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million); next SCC: 10 thousand!

Hubs & Authorities

• In-links: votes
• HITS algorithm: Hyperlink-Induced Topic Search.
– A good hub is one that points to good authorities [lists; directories]
– A good authority is one that is pointed to by good hubs
– A good hub need not be an authority and vice versa.
– Those who have knowledge; those who know well about those who have knowledge
– Dynamic estimation; repeated application of update rules. Converges!

Algorithm

• First conduct retrieval. Compute hubs and authorities on the relevant set
– Rank the retrieved set by a list of hubs and a list of authorities
• Initialize hub and authority scores (say to all 1, or some other positive number)
– Apply authority score update rule
– Apply hub score update rule

• Example: fig 14.15 and 14.18 (problem 3)
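
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the function name `hits`, the fixed iteration count, the per-round normalization (dividing by the sum so scores stay comparable), and the small example graph are all assumptions made here.

```python
def hits(edges, iterations=20):
    """Return (hub, authority) score dicts for directed (source, target) links."""
    nodes = {page for edge in edges for page in edge}
    # Initialize all hub and authority scores to 1.
    hub = {page: 1.0 for page in nodes}
    authority = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that point to it.
        authority = {page: 0.0 for page in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        # Hub update: sum the authority scores of the pages it points to.
        hub = {page: 0.0 for page in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        # Normalize so each score list sums to 1 (an assumption of this sketch).
        authority_total = sum(authority.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        authority = {page: score / authority_total for page, score in authority.items()}
        hub = {page: score / hub_total for page, score in hub.items()}
    return hub, authority


# Hypothetical retrieved set: h1 points to a1 and a2, h2 points to a1 only.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, authority = hits(edges)
```

In this example a1 ends up as the top authority because both hubs point to it, and h1 is the top hub because it points to both authorities.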

It's all about convergence

• First show how the update rule works with matrices M and M^T
• Then show the same using eigenvectors
• Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number
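
As a sketch of the matrix form, assume M is the adjacency matrix with M_ij = 1 when page i links to page j (a convention assumed here; the slide does not fix one), and let h^(k) and a^(k) denote the hub and authority vectors after k rounds (notation introduced for illustration). One round applies the authority rule and then the hub rule:

```latex
a^{(k)} = M^{T} h^{(k-1)}, \qquad h^{(k)} = M a^{(k)}
\;\Longrightarrow\;
h^{(k)} = (M M^{T})^{k} h^{(0)}, \qquad
a^{(k)} = (M^{T} M)^{k-1} M^{T} h^{(0)}
```

After normalization, repeated multiplication by the symmetric matrix M M^T pulls the hub vector toward a principal eigenvector of M M^T. This is why any positive starting vector gives the same direction in the limit, provided the largest eigenvalue is not repeated.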

PageRank

• Endorsements repeatedly move through out-links. A → B
• Principle of repeated improvement:
– Weight of ‘current’ endorsement depends on ‘current’ estimate of A’s PageRank.
– More important nodes convey higher endorsements.
– Stabilize ~ till the network changes

Calculation

• Initialize: each node has a PageRank = 1/n where n is the number of nodes

• Basic PageRank Update Rule:
– A node divides its PageRank equally over its out-links. If no out-links, it keeps its PageRank.
– The PageRank of a node = sum of PageRanks it receives in that iteration.
– Total PageRank stays constant, so no need for normalizing.
• Iterate till convergence OR a number of iterations.
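
The basic update rule can be sketched as follows. The function name `basic_pagerank`, the dictionary representation (each page maps to the list of pages it links to, and every linked page must itself be a key), the fixed iteration count, and the three-page example are assumptions made for illustration.

```python
def basic_pagerank(out_links, iterations=50):
    """Apply the basic PageRank update rule a fixed number of times."""
    nodes = list(out_links)
    n = len(nodes)
    # Initialize: every node starts with PageRank 1/n.
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in nodes}
        for page in nodes:
            targets = out_links[page]
            if targets:
                # Divide the current PageRank equally over the out-links.
                share = rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links keeps its PageRank.
                new_rank[page] += rank[page]
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = basic_pagerank(out_links)
```

For this small graph the values settle near 0.4 for A, 0.2 for B, and 0.4 for C, and the total stays at 1 throughout, so no normalization step is needed.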

Equilibrium

• No further changes in PageRanks
• Degenerate cases exist (Scaled PageRank Updates)
• Values need not be unique except where the network is strongly connected.

Slow leaks?

Scaled PageRank Update Rule

• Scaling factor s: between 0 and 1, generally between 0.8 and 0.9
– Apply basic PageRank update rule. For each page:
– Scale down all by some value s (say 0.9), so each gets 0.9 * PageRank.
– Total PageRank = s
– Divide remaining PageRank (1-s) equitably over all nodes.

• Get a unique set of values for each setting of s. [shown later in proofs]

• Random walk model [Browsing not Searching]: probability of reaching a page is equal to prob(coming across an in-link) + prob(getting there at random)
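
The scaled rule can be sketched by adding one step to the basic update. The function name `scaled_pagerank`, the default s = 0.85 (chosen from the 0.8 to 0.9 range mentioned above), the iteration count, and the example graph are assumptions made for illustration. The example is built so that, under the basic rule, all PageRank would drain into the pair C and D.

```python
def scaled_pagerank(out_links, s=0.85, iterations=100):
    """Apply the scaled PageRank update rule a fixed number of times."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        # Step 1: basic PageRank update.
        new_rank = {page: 0.0 for page in nodes}
        for page in nodes:
            targets = out_links[page]
            if targets:
                share = rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                new_rank[page] += rank[page]
        # Step 2: scale everything down by s, then spread the remaining
        # (1 - s) equally over all n nodes.
        rank = {page: s * new_rank[page] + (1.0 - s) / n for page in nodes}
    return rank


# Hypothetical web: A -> B -> C, and C and D link only to each other.
out_links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
rank = scaled_pagerank(out_links)
```

Page A has no in-links, so it receives only the random-jump share (1 - s)/n, while C and D still hold most of the PageRank without absorbing all of it.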

Summary

• Link based analysis
– Power laws: in-links, out-links etc.
• Hubs and Authorities
– convergence
• PageRank
– convergence