


Ch 14. Link Analysis

Padmini Srinivasan

Computer Science Department

http://cs.uiowa.edu/~psriniva
padmini-srinivasan@uiowa.edu

Web Search

• Hard problem
– Hats off to ‘information retrieval’
– Complex information needs
• Keywords
• Synonyms, polysemy (multiple meanings)
– True homonyms: row (oar), row (argue); delta (Greek letter, and of a river)
– Polysemous homonyms: mouth (of a river), mouth (of an animal); right ‘hand’ person, ‘hand’ it to me
– The age of intermediaries (BRS After Dark)
– Diversity in writing + diversity in queries + diversity in indexing + diversity in motivations
– Controlled vocabularies vs. free text
– Majority rule? ‘Cornell’

Web Search Peculiarities

• Compared to the good old days
• Needle-in-a-haystack problem; many needles in many haystacks! Which ones to look for?
– How distinct is this from the “traditional” methods for IR? Libraries etc.
– Can we do without libraries?
• Quality: a serious question?
– Does redundancy promote quality?
– Does collaboration promote quality?
• Scale
– Retrieve and FILTER/ORGANIZE
– Satisfying versus satisficing

Link Analysis

• In-links and out-links; in-degree and out-degree
– A matter of endorsement! (directional)
– Akin to citations
– What are the differences? Must one out-link?
– Power laws all the way through!

Some studies

• Kumar et al. 1999: Alexa web crawl from 1997, over 40 million nodes. Trawling the Web for cyber communities, Proc. 8th WWW, Apr 1999
• Probability that a page has in-degree k is approximately proportional to 1/k^2
• Probability that a page has in-degree at least k is approximately proportional to 1/k
• Actual exponent is slightly larger than 2.
• Barabasi and Albert 1999: studied the U. Notre Dame web site with some extensions
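The step from 1/k^2 (exactly k in-links) to 1/k (at least k in-links) follows from summing the tail. A minimal numeric sketch, assuming an exponent of exactly 2, ignoring the normalizing constant, and cutting the sum off at a hypothetical maximum degree `k_max`:

```python
def tail_of_inverse_square(k, k_max=1_000_000):
    """Unnormalized tail sum: 1/k^2 + 1/(k+1)^2 + ... + 1/k_max^2.

    The sum is close to the integral of 1/x^2 from k to infinity, which is 1/k.
    """
    return sum(1.0 / (j * j) for j in range(k, k_max + 1))


# If the tail behaves like 1/k, then k * tail should stay close to 1.
tails = {k: tail_of_inverse_square(k) for k in (10, 100, 1000)}
for k, tail in tails.items():
    print(k, tail, k * tail)
```

The product k * tail stays near 1 across three orders of magnitude, which is what "proportional to 1/k" means here.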

Broder et al. Graph Structure of the Web

Note that the exponent is different. Note also the deviation at the low end of the out-degree distribution.

Fractals?

• Broder et al.: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.”

• Graph structure in the web

Similar Studies

• Donato et al., ACM TOIT, 2007. The Web as a Graph: How Far We Are
– In-degree: power law; exponent 2.1 (Fig. 4)
– Out-degree: not so good (Fig. 5)
– Check out Fig. 8: SCC distribution (number of SCCs versus size of SCC). Power law; exponent 2.09
• Webbase, 200 million Stanford crawl (2001)
– 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million)
– Next SCC: 10 thousand!

Hubs & Authorities

• In-links: votes
• HITS algorithm: Hyperlink-Induced Topic Search
– A good hub is one that points to good authorities [lists; directories]
– A good authority is one that is pointed to by good hubs
– A good hub need not be an authority, and vice versa.
– Those who have knowledge; those who know well about those who have knowledge
– Dynamic estimation; repeated application of update rules. Converges!

Algorithm

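• First conduct retrieval. Compute hubs and authorities on the relevant set
– Rank the retrieved set by a list of hubs and a list of authorities
• Initialize hub and authority scores (say, all to 1, or some other positive number)
– Apply the authority score update rule: a page's authority score becomes the sum of the hub scores of the pages that point to it
– Apply the hub score update rule: a page's hub score becomes the sum of the authority scores of the pages it points to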

• Example: fig 14.15 and 14.18 (problem 3)
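The initialize-then-update loop can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' own code: the function name `hits`, the toy graph, the fixed iteration count, and the per-round normalization (scores rescaled to sum to 1 so they stay bounded) are all assumptions made for the example.

```python
def hits(out_links, iterations=50):
    """Repeated hub/authority updates on a dict: node -> list of nodes it links to."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}   # initialize all scores to 1
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages pointing to the node.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum of (new) authority scores of the pages the node points to.
        new_hub = {n: sum(new_auth[d] for d in out_links.get(n, [])) for n in nodes}
        # Normalize (assumption of this sketch) so each score list sums to 1.
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {n: v / a_total for n, v in new_auth.items()}
        hub = {n: v / h_total for n, v in new_hub.items()}
    return hub, auth


# Hypothetical toy graph: three list-like pages pointing at three content pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2", "a3"]}
hub, auth = hits(graph)
print(sorted(auth.items(), key=lambda kv: -kv[1]))
print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

In this toy graph the page linked by every hub ends up as the top authority, and the page linking to the most authorities ends up as the top hub.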

It's all about convergence

• First show how the update rule works with the matrices M and M^T
• Then show the same using eigenvectors
• Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number
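The three steps can be written out compactly. The notation here is an assumption of this sketch rather than the slides' own: M is the adjacency matrix with M_{ij} = 1 when page i links to page j, h and a are the hub and authority vectors, and c_i, z_i are the eigenvalues and orthonormal eigenvectors of the symmetric matrix M M^T. The argument also assumes the largest eigenvalue is strictly larger than the rest.

```latex
\begin{align*}
  &\text{Update rules:} && a \leftarrow M^{T} h, \qquad h \leftarrow M a \\
  &\text{After } k \text{ rounds:} && h^{(k)} = (M M^{T})^{k} h^{(0)}, \qquad
      a^{(k)} = (M^{T} M)^{k-1} M^{T} h^{(0)} \\
  &\text{Expand the start vector:} && h^{(0)} = \sum_{i} q_i z_i
      \;\Longrightarrow\; (M M^{T})^{k} h^{(0)} = \sum_{i} c_i^{k}\, q_i z_i \\
  &\text{Rescale by } c_1^{k}: && \frac{h^{(k)}}{c_1^{k}}
      = q_1 z_1 + \sum_{i \ge 2} \left(\frac{c_i}{c_1}\right)^{k} q_i z_i
      \;\longrightarrow\; q_1 z_1 \quad (k \to \infty,\; c_1 > c_2 \ge \dots \ge 0)
\end{align*}
```

The limit direction z_1 does not depend on h^{(0)}; only the scale factor q_1 does. Since M M^T has non-negative entries, z_1 can be chosen with non-negative entries, so any strictly positive h^{(0)} has q_1 = z_1 \cdot h^{(0)} > 0, and the limit is never the zero vector.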

PageRank

• Endorsements repeatedly move through out-links: A → B
• Principle of repeated improvement:
– Weight of the ‘current’ endorsement depends on the ‘current’ estimate of A’s PageRank.
– More important nodes convey higher endorsements.
– Values stabilize, until the network changes

Calculation

• Initialize: each node has PageRank = 1/n, where n is the number of nodes
• Basic PageRank Update Rule:
– A node divides its PageRank equally over its out-links. If it has no out-links, it keeps its PageRank.
– The PageRank of a node = sum of the PageRanks it receives in that iteration.
– Total PageRank stays constant, so no need for normalizing.
• Iterate till convergence OR for a fixed number of iterations.
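The basic update rule translates almost line for line into code. The following is a minimal sketch under stated assumptions: the function name `basic_pagerank`, the three-node graph, and the fixed iteration count are made up for illustration, and every node is assumed to appear as a key of the input dict.

```python
def basic_pagerank(out_links, iterations=50):
    """Basic PageRank updates on a dict: node -> list of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}          # initialize to 1/n

    for _ in range(iterations):
        new_rank = {v: 0.0 for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = rank[v] / len(targets)  # split equally over out-links
                for t in targets:
                    new_rank[t] += share
            else:
                new_rank[v] += rank[v]          # no out-links: keep own PageRank
        rank = new_rank
    return rank


# Hypothetical strongly connected graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = basic_pagerank(graph)
print(rank)
```

For this graph the equilibrium can be checked by hand: B receives half of A's PageRank, C receives the other half plus all of B's, and A receives all of C's, which gives A = C = 0.4 and B = 0.2.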

Equilibrium

• No further changes in PageRanks
• Degenerate cases exist (Scaled PageRank Updates)
• Values need not be unique except where the network is strongly connected.

Slow leaks?

Scaled PageRank Update Rule

• Scaling factor s: between 0 and 1, generally between 0.8 and 0.9
– Apply the basic PageRank update rule. Then, for each page:
– Scale down all PageRanks by the factor s (say 0.9), so each page gets 0.9 * its PageRank.
– Total PageRank is now s
– Divide the remaining PageRank (1-s) equally over all nodes.
• Get a unique set of values for each setting of s. [shown later in proofs]
• Random walk model [Browsing, not Searching]: probability of reaching a page is equal to prob(coming across an in-link) + prob(getting there at random)
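A short sketch can show what the scaling buys. The function name `scaled_pagerank`, the five-node graph, and the choice s = 0.85 are assumptions for illustration only. In the graph, F and G link only to each other, so under the basic rule (s = 1) they slowly drain all PageRank from the other nodes.

```python
def scaled_pagerank(out_links, s=0.85, iterations=100):
    """Scaled PageRank updates; s = 1.0 reduces to the basic update rule."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Step 1: basic PageRank update.
        new_rank = {v: 0.0 for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                new_rank[v] += rank[v]
        # Step 2: scale everything by s, then spread the remaining (1 - s) equally.
        rank = {v: s * new_rank[v] + (1.0 - s) / n for v in nodes}
    return rank


# Hypothetical graph in which F and G form a sink that traps PageRank.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["F"], "F": ["G"], "G": ["F"]}
basic = scaled_pagerank(graph, s=1.0)    # basic rule: A, B, C leak to near zero
scaled = scaled_pagerank(graph, s=0.85)  # scaled rule: every node keeps a share
print(basic)
print(scaled)
```

With s = 1 the PageRank of A, B, and C decays toward zero, while with s = 0.85 every node retains at least (1 - s)/n. In random-walk terms, s plays the role of the probability of following a link and 1 - s the probability of jumping to a page chosen at random.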

Summary

• Link-based analysis
– Power laws: in-links, out-links, etc.
• Hubs and Authorities
– Convergence
• PageRank
– Convergence
