Ch 14. Link Analysis

Padmini Srinivasan

Computer Science Department

http://cs.uiowa.edu/~psriniva

padmini-srinivasan@uiowa.edu

Web Search

• Hard problem
  – Hats off to ‘information retrieval’
  – Complex information needs
• Keywords
• Synonyms, polysemy (multiple meanings)
  – True homonyms: row (oar), row (argue); delta (Greek letter, and of a river)
  – Polysemous homonyms: mouth (of a river), mouth (of an animal); right ‘hand’ person, ‘hand’ it to me
  – The age of intermediaries (BRS After Dark)
  – Diversity in writing + diversity in queries + diversity in indexing + diversity in motivations
  – Controlled vocabularies vs. free text
  – Majority rule? ‘Cornell’

Web Search Peculiarities

• Compared to the good old days
• Needle in a haystack problem; many needles in many haystacks! Which ones to look for?
  – How distinct is this from the “traditional” methods for IR? Libraries etc.
  – Can we do without libraries?
• Quality – a serious question?
  – Does redundancy promote quality?
  – Does collaboration promote quality?
• Scale
  – Retrieve and FILTER/ORGANIZE
  – Satisfying versus satisficing

Link Analysis

• In-links and out-links; in-degree and out-degree
  – A matter of endorsement! (directional)
  – Akin to citations. What are the differences? Must one out-link?
  – Power laws all the way through!
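
In-degree and out-degree can be counted directly from a list of directed links. The sketch below uses a small made-up edge list (the page names are illustrative assumptions, not data from the studies cited in these slides):

```python
from collections import Counter

# Hypothetical directed links (source page, target page); illustrative only.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]

# Out-degree: number of links leaving a page.
out_degree = Counter(src for src, _ in edges)
# In-degree: number of links pointing at a page (its "endorsements").
in_degree = Counter(dst for _, dst in edges)

for page in sorted({p for edge in edges for p in edge}):
    print(page, "in:", in_degree[page], "out:", out_degree[page])
```

A Counter returns 0 for pages that never appear as a source or target, so pages with no in-links or no out-links need no special handling.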

Some studies

• Kumar et al. (1999): Alexa web crawl from 1997, over 40 million nodes. “Trawling the Web for cyber communities,” Proc. 8th WWW, Apr 1999
• Probability that a page has in-degree k is approximately proportional to 1/k^2
• Probability that a page has in-degree at least k is approximately proportional to 1/k
• Actual exponent slightly larger than 2.
• Barabasi and Albert (1999): studied the U. Notre Dame web site, with some extensions
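
The two in-degree statements above are consistent with each other: summing 1/j^2 over all j ≥ k gives roughly 1/k. A rough numerical check, assuming an idealized exponent of exactly 2 and an arbitrary cutoff `upper` for the infinite sum (both are assumptions of this sketch, not figures from the studies):

```python
def tail_probability(k, upper=200_000):
    """Sum 1/j^2 for j = k .. upper-1: unnormalized P(in-degree >= k)."""
    return sum(1.0 / (j * j) for j in range(k, upper))

# If the tail behaves like 1/k, then tail * k should stay close to 1.
for k in (10, 100, 1000):
    print(k, tail_probability(k) * k)
```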

Broder et al. Graph Structure of the Web

Note that the exponent is different. Note also the deviation at the low end of the out-degree distribution.

Fractals?

• Broder et al.: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.”

• Graph structure in the web

Similar Studies

• Donato et al., ACM TOIT, 2007. The Web as a Graph: How Far We Are
  – In-degree: power law; exponent 2.1 (Fig. 4)
  – Out-degree: not a good power-law fit (Fig. 5)
  – Check out Fig. 8: SCC distribution (number of SCCs versus size of SCC). Power law; exponent 2.09
• WebBase, 200 million pages, Stanford crawl (2001)
  – 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million)
  – Next-largest SCC: 10 thousand!
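
The SCC / IN / OUT split reported above can be illustrated on a toy graph. A minimal sketch, assuming we already know one node (`core_node`) that lies in the large SCC; the function names and the toy edges are hypothetical, and everything that is neither reachable from the core nor able to reach it is lumped into "OTHER" (tendrils and disconnected pieces):

```python
from collections import deque

def reachable(start, adjacency):
    """Breadth-first search: all nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, core_node):
    """Classify nodes relative to the SCC that contains core_node."""
    forward, backward, nodes = {}, {}, set()
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
        nodes.update((src, dst))
    downstream = reachable(core_node, forward)   # the core can reach these
    upstream = reachable(core_node, backward)    # these can reach the core
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "OTHER": nodes - downstream - upstream,
    }

# Toy graph: i -> a <-> b -> o, plus a separate link t -> x.
toy_edges = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("t", "x")]
parts = bow_tie(toy_edges, core_node="a")
print(parts)
```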

Hubs & Authorities

• In-links: votes
• HITS algorithm: Hyperlink-Induced Topic Search
  – A good hub is one that points to good authorities [lists; directories]
  – A good authority is one that is pointed to by good hubs
  – A good hub need not be an authority and vice versa.
  – Those who have knowledge; those who know well about those who have knowledge
  – Dynamic estimation; repeated application of update rules. Converges!

Algorithm

• First conduct retrieval. Compute hubs and authorities on the relevant set
  – Rank the retrieved set by a list of hubs and a list of authorities
• Initialize hub and authority scores (say all to 1, or some other positive number)
  – Apply authority score update rule
  – Apply hub score update rule
• Example: Fig. 14.15 and 14.18 (problem 3)
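
The steps above can be sketched in a few lines. This is a minimal illustration, not the textbook's worked example: the edge list is made up, the function name `hits` is hypothetical, and dividing scores by their sum after each round is an assumption added here only to keep the numbers from growing.

```python
def hits(edges, iterations=50):
    """Repeatedly apply the authority and hub update rules."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}    # initialize all scores to 1
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum of authority scores of pages it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize so each score list sums to 1 (assumption of this sketch).
        auth_total = sum(new_auth.values()) or 1.0
        hub_total = sum(new_hub.values()) or 1.0
        auth = {n: v / auth_total for n, v in new_auth.items()}
        hub = {n: v / hub_total for n, v in new_hub.items()}
    return hub, auth

# Hypothetical retrieved set: three list pages pointing at two content pages.
example_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(example_edges)
print(sorted(auth.items(), key=lambda item: -item[1]))
print(sorted(hub.items(), key=lambda item: -item[1]))
```

In this toy set, a1 is pointed to by every hub and so ranks above a2 as an authority, while h1 points to both authorities and so ranks above h2 and h3 as a hub.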

It's all about convergence

• First show how the update rule works with matrices M and M^T
• Then show the same using eigenvectors
• Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number
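
A small numerical illustration of the last point. With adjacency matrix M (M[i][j] = 1 when page i links to page j), one round of updates is a = M^T h followed by h = M a. The sketch below runs this from two different positive starting vectors on a made-up 4-page graph and compares the normalized results; the matrix, the function name, and the normalization by the sum are assumptions of this sketch, and it relies on M M^T having a single largest eigenvalue for this graph.

```python
def hub_iteration(M, h0, steps=100):
    """Apply a = M^T h, then h = M a, normalizing h after each round."""
    n = len(M)
    h = list(h0)
    for _ in range(steps):
        a = [sum(M[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = M^T h
        h = [sum(M[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = M a
        total = sum(h)
        h = [x / total for x in h]
    return h

# Hypothetical graph: page 0 links to pages 2 and 3; page 1 links to page 2.
M = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

from_ones = hub_iteration(M, [1.0, 1.0, 1.0, 1.0])
from_other = hub_iteration(M, [5.0, 0.1, 3.0, 2.0])
print(from_ones)
print(from_other)
```

Both runs settle on the same hub vector, which is the direction of the leading eigenvector of M M^T for this small graph.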

PageRank

• Endorsements repeatedly move through out-links. A → B
• Principle of repeated improvement:
  – Weight of ‘current’ endorsement depends on ‘current’ estimate of A’s PageRank.
  – More important nodes convey higher endorsements.
  – Scores stabilize, until the network changes

Calculation

• Initialize: each node has a PageRank = 1/n, where n is the number of nodes
• Basic PageRank Update Rule:
  – A node divides its PageRank equally over its out-links. If no out-links, it keeps its PageRank.
  – The PageRank of a node = sum of PageRanks it receives in that iteration.
  – Total PageRank stays constant, so no need for normalizing.
• Iterate till convergence OR for a fixed number of iterations.
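
The basic update rule above can be sketched as follows. The three-page graph and the function name `basic_pagerank` are illustrative assumptions; the graph is given as a dictionary mapping each page to the list of pages it links to, and every linked page must also appear as a key.

```python
def basic_pagerank(out_links, steps=100):
    """Basic PageRank update, run for a fixed number of iterations."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}          # initialize to 1/n
    for _ in range(steps):
        new_rank = {v: 0.0 for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # Divide current PageRank equally over out-links.
                share = rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # No out-links: the node keeps its PageRank.
                new_rank[v] += rank[v]
        rank = new_rank
    return rank

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = basic_pagerank(graph)
print(ranks)
```

For this small strongly connected graph the values settle at an equilibrium in which A and C hold equal PageRank and B holds half as much, and the total stays at 1 throughout.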

Equilibrium

• No further changes in PageRanks
• Degenerate cases exist (addressed by Scaled PageRank Updates)
• Values need not be unique except where the network is strongly connected.

Slow leaks?

Scaled PageRank Update Rule

• Scaling factor s: between 0 and 1, generally between 0.8 and 0.9
  – Apply the basic PageRank update rule.
  – Then, for each page, scale its PageRank down by s (say 0.9), so each keeps 0.9 * PageRank.
  – Total PageRank is now s
  – Divide the remaining PageRank (1-s) equally over all nodes.
• Get a unique set of values for each setting of s. [shown later in proofs]
• Random walk model [Browsing not Searching]: probability of reaching a page is equal to prob(coming across an in-link) + prob(getting there at random)
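
A minimal sketch of the scaled update, under stated assumptions: the four-page graph is made up so that PageRank would otherwise drain into the C/D pair, the function name `scaled_pagerank` is hypothetical, and s = 0.85 is just one value in the range quoted above.

```python
def scaled_pagerank(out_links, s=0.85, steps=100):
    """Basic update, then scale by s and spread (1 - s) equally over all nodes."""
    nodes = list(out_links)          # every linked page must also be a key
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(steps):
        # Step 1: basic PageRank update.
        received = {v: 0.0 for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    received[t] += share
            else:
                received[v] += rank[v]
        # Step 2: scale down by s, then add an equal share of the remaining 1 - s.
        rank = {v: s * received[v] + (1.0 - s) / n for v in nodes}
    return rank

# Hypothetical graph: A -> B -> C <-> D. C and D only link to each other.
leaky_graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
scaled = scaled_pagerank(leaky_graph, s=0.85)
print(scaled)
```

Under the basic rule, all PageRank in this graph would eventually collect in C and D. With scaling, every page keeps at least (1 - s)/n, so A and B retain a small nonzero value while the total still sums to 1.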

Summary

• Link based analysis
  – Power laws: in-links, out-links etc.
• Hubs and Authorities
  – convergence
• PageRank
  – convergence
