Ch 14. Link Analysis
Padmini Srinivasan
Computer Science Department
http://cs.uiowa.edu/[email protected]
Web Search
• Hard problem
– Hats off to ‘information retrieval’
– Complex information needs
• Keywords
• Synonyms, polysemy (multiple meanings)
– True homonyms: row (oar), row (argue); delta (Greek letter, and of a river)
– Polysemous homonyms: mouth (of a river), mouth (of an animal); right ‘hand’ person, ‘hand’ it to me
– The age of intermediaries (BRS After Dark)
– Diversity in writing + Diversity in queries + Diversity in indexing + Diversity in motivations
– Controlled vocabularies vs free text
– Majority rule? ‘Cornell’
Web Search Peculiarities
• Compared to the good old days
• Needle in a haystack problem; many needles in many haystacks! Which ones to look for?
– How distinct is this from the “traditional” methods for IR? Libraries etc.
– Can we do without libraries?
• Quality – a serious question?
– Does redundancy promote quality?
– Does collaboration promote quality?
• Scale
– Retrieve and FILTER/ORGANIZE
– Satisfying versus satisficing
Link Analysis
• In-links and out-links; in-degree and out-degree
– A matter of endorsement! (directional)
– Akin to citations
– What are differences? Must one out-link?
– Power laws all the way through!
Some studies
• Kumar et al. (1999): Alexa web crawl from 1997, over 40 million nodes. Trawling the Web for cyber communities, Proc. 8th WWW, Apr 1999
• Probability that a page has in-degree k is proportional to 1/k^2
• Probability that a page has in-degree at least k is proportional to 1/k
• Actual exponent slightly larger than 2.
• Barabasi and Albert 1999 – studied the U. Notre Dame web site with some extensions
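A quick numeric check of the two in-degree statements above, as a sketch rather than anything from the cited studies: if the fraction of pages with in-degree exactly k falls off as 1/k^2, the fraction with in-degree at least k falls off roughly as 1/k. The truncation point `K_MAX` and the helper name `tail_probability` are assumptions made only for this illustration.

```python
# Sketch: a 1/k^2 in-degree distribution has a tail that decays like 1/k.
K_MAX = 1_000_000  # assumed truncation point, only for this illustration

# Unnormalized weights proportional to 1/k^2 for k = 1..K_MAX.
weights = [1.0 / (k * k) for k in range(1, K_MAX + 1)]
total = sum(weights)


def tail_probability(k):
    """P(in-degree >= k) under the truncated 1/k^2 distribution."""
    return sum(weights[k - 1:]) / total


for k in (10, 100, 1000):
    # If the tail behaves like 1/k, then k * P(in-degree >= k) stays
    # roughly constant as k grows.
    print(k, tail_probability(k), k * tail_probability(k))
```

Multiplying k by ten should shrink the tail probability by about a factor of ten, which is what "at least in-degree k is proportional to 1/k" says.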
Broder et al. Graph Structure of the Web
• Note that the exponent is different. Note also the deviation at the low end of the out-degree distribution.
Fractals?
• Broder et al.: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.”
• Graph structure in the web
Similar Studies
• Donato et al., ACM TOIT, 2007. The Web as a Graph: How Far We Are
– In-degree: power law; exponent 2.1 (Fig. 4)
– Out-degree: power-law fit not so good (Fig. 5)
– Check out Fig. 8: SCC distribution (number of SCCs versus size of SCC). Power law; exponent 2.09
• Webbase, 200 million Stanford crawl (2001)
– 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million); next SCC: 10 thousand!
Hubs & Authorities
• In-links: votes
• HITS algorithm: Hyperlink-Induced Topic Search.
– A good hub is one that points to good authorities [lists; directories]
– A good authority is one that is pointed to by good hubs
– A good hub need not be an authority and vice versa.
– Those who have knowledge; those who know well about those who have knowledge
– Dynamic estimation; repeated application of update rules. Converges!
Algorithm
• First conduct retrieval. Compute Hubs and Authorities on relevant set
– Rank the retrieved set by a list of hubs and a list of authorities
• Initialize hub and authority scores (say to all 1, or some other positive number)
– Apply authority score update rule
– Apply hub score update rule
• Example: fig 14.15 and 14.18 (problem 3)
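The steps above can be sketched in a few lines. This is a minimal illustration, assuming the usual reading of the two rules: a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. The graph `links`, the page names, the iteration count and the normalization step are all assumptions made for this example; they are not figures 14.15 or 14.18.

```python
# Sketch of the hub/authority update rules on a small, made-up graph.
# `links` maps each page to the pages it points to (hypothetical data).
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a2"],
    "a1": [],
    "a2": [],
    "a3": [],
}


def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}   # initialize all hub scores to 1
    auth = {p: 1.0 for p in pages}  # initialize all authority scores to 1
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the authority scores of pages p points to.
        hub = {p: sum(auth[dst] for dst in links[p]) for p in pages}
        # Normalize so the numbers stay bounded; only relative sizes matter.
        auth_total = sum(auth.values())
        hub_total = sum(hub.values())
        auth = {p: v / auth_total for p, v in auth.items()}
        hub = {p: v / hub_total for p, v in hub.items()}
    return hub, auth


hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))  # authorities, best first
print(sorted(hub, key=hub.get, reverse=True))    # hubs, best first
```

In this toy graph the page with the most in-links from strong hubs ends up as the top authority, and the page that points to all the authorities ends up as the top hub.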
It’s all about convergence
• First show how the update rule works with matrices M and M^T
• Then show the same using eigenvectors
• Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number
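A small sketch of the matrix view, assuming M is the adjacency matrix with M[i][j] = 1 when page i links to page j. One authority update followed by one hub update then amounts to multiplying the hub vector by M M^T, and repeating this from two different positive starting vectors should lead to the same normalized hub scores. The 4-page matrix and the helper names (`mat_vec`, `transpose`, `normalize`, `hub_scores`) are made up for this illustration.

```python
# Sketch: hub scores as repeated multiplication by M * M^T (pure Python).
# M[i][j] = 1 if page i links to page j; this 4-page graph is made up.
M = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 0],
]


def mat_vec(A, v):
    """Multiply matrix A by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def normalize(v):
    """Rescale v so its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def hub_scores(M, h0, iterations=100):
    MT = transpose(M)
    h = normalize(h0)
    for _ in range(iterations):
        a = mat_vec(MT, h)            # authority update: a = M^T h
        h = normalize(mat_vec(M, a))  # hub update: h = M a = (M M^T) h
    return h


h_from_ones = hub_scores(M, [1, 1, 1, 1])
h_from_other = hub_scores(M, [5, 0.1, 2, 7])  # another positive start
print(h_from_ones)
print(h_from_other)
```

Both runs should print essentially the same vector, which is the behaviour the eigenvector argument predicts for a positive starting vector.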
PageRank
• Endorsements repeatedly move through out-links. A → B
• Principle of repeated improvement:
– Weight of ‘current’ endorsement depends on ‘current’ estimate of A’s PageRank.
– More important nodes convey higher endorsements.
– Scores stabilize, until the network changes
Calculation
• Initialize: each node has a PageRank = 1/n where n is the number of nodes
• Basic PageRank Update Rule:
– A node divides its PageRank equally over its out-links. If no out-links, it keeps its PageRank.
– The PageRank of a node = sum of PageRanks it receives in that iteration.
– Total PageRank stays constant, so no need for normalizing.
• Iterate till convergence OR a number of iterations.
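The calculation above can be sketched as follows. This is a minimal illustration on a made-up 4-node graph; the node names, the `graph` dictionary and the fixed iteration count are assumptions for the example.

```python
# Sketch of the Basic PageRank Update Rule on a made-up 4-node graph.
# `graph` maps each node to the nodes it links to (hypothetical data).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def basic_pagerank(graph, iterations=100):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}  # initialize to 1/n
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in graph}
        for node, out_links in graph.items():
            if out_links:
                # Divide current PageRank equally over the out-links.
                share = rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # No out-links: the node keeps its PageRank.
                new_rank[node] += rank[node]
        rank = new_rank
    return rank


ranks = basic_pagerank(graph)
print(ranks)
print(sum(ranks.values()))  # total PageRank stays at 1
```

In this graph D has no in-links, so its PageRank drains away after the first step, and the remaining PageRank settles at about 0.4 for A, 0.2 for B and 0.4 for C.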
Equilibrium
• No further changes in PageRanks
• Degenerate cases exist (Scaled PageRank Updates)
• Values need not be unique except where the network is strongly connected.
Slow leaks?
Scaled PageRank Update Rule
• Scaling factor s: between 0 and 1, generally between 0.8 and 0.9
– Apply the basic PageRank update rule.
– For each page: scale down its PageRank by s (say 0.9), so each gets 0.9 * PageRank.
– Total PageRank is now s.
– Divide the remaining PageRank (1 - s) equally over all nodes.
• Get a unique set of values for each setting of s. [shown later in proofs]
• Random walk model [Browsing not Searching]: probability of reaching a page is equal to prob(coming across an in-link) + prob(getting there at random)
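A sketch of the scaled rule, under the same conventions as the basic rule. The graph is made up so that nodes C and D only link to each other: under the basic rule such a pair would gradually absorb all the PageRank, while the scaled rule keeps handing a share of (1 - s) back to every node. The default `s=0.85` is an assumption chosen from the 0.8 to 0.9 range noted above, and the node names are hypothetical.

```python
# Sketch of the Scaled PageRank Update Rule on a made-up graph in which
# C and D only link to each other (a trap for the basic rule).
graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["D"],
    "D": ["C"],
}


def scaled_pagerank(graph, s=0.85, iterations=100):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}  # initialize to 1/n
    for _ in range(iterations):
        # Step 1: apply the basic PageRank update rule.
        new_rank = {node: 0.0 for node in graph}
        for node, out_links in graph.items():
            if out_links:
                share = rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                new_rank[node] += rank[node]
        # Step 2: scale every value by s, then spread the remaining
        # (1 - s) equally over all n nodes.
        rank = {node: s * value + (1.0 - s) / n
                for node, value in new_rank.items()}
    return rank


ranks = scaled_pagerank(graph)
print(ranks)
print(sum(ranks.values()))  # total PageRank is still 1
```

C and D still rank highest here, but A and B keep a non-zero share, which is the point of the scaling step. Rerunning with a different s gives a different set of values.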
Summary
• Link based analysis
– Power laws: in-links, out-links etc.
• Hubs and Authorities
– convergence
• PageRank
– convergence