web structure & content - indiana university...
TRANSCRIPT
![Page 1: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/1.jpg)
WAW 2002
Web Structure & Content(not a theory talk)
Filippo MenczerThe University of Iowa
http://dollar.biz.uiowa.edu/~fil/
Research supported by NSF CAREER Award IIS-0133124
![Page 2: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/2.jpg)
WAW 2002
Exploiting the Web’stext and link cues
• Pages close in word vector space tend to be related– Cluster hypothesis (van Rijsbergen 1979)– The WebCrawler (Pinkerton 1994)– The whole first generation of search engines
• Pages that link to each other tend to be related– Link-cluster conjecture (Menczer 1997)
• Many formulations: Gibson & al 1998, Bharat & Henzinger 1998,Chakrabarti & al 1998, Dean & Henzinger 1999, Davison 2000, etc
– Link eigenvalue analysis: HITS, hubs and authorities• (Kleinberg & al 1998 segg. @ Almaden etc.)
– Google’s PageRank analysis• (Brin & Page 1998)
– The whole second generation of search engines
![Page 3: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/3.jpg)
WAW 2002
Three topologiesWhat about the relationship between lexical / link cues and page meaning?
Text Links Meaning
![Page 4: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/4.jpg)
WAW 2002
Talk outline• The topologies of the Web• Correlations, distributions, projections• Power laws and Web growth models• Navigating optimal paths• Semantic maps (?)
![Page 5: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/5.jpg)
WAW 2002
G = 5/15C = 2R = 3/6 = 2/4
The “link-cluster” conjecture• Connection between semantic topology
(relevance) and linkage topology (hypertext)– G = Pr[rel(p)] ~ fraction of relevant pages (generality)– R = Pr[rel(p) | rel(q) AND link(q,p)]
• Related nodes are clustered if R > G– Necessary and sufficient condition for a random
crawler to find pages related to start points
![Page 6: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/6.jpg)
WAW 2002
• Preservationof semantics(meaning)across links
Link-clusterconjecture
†
L(q,d) ≡
path(q, p){ p: path(q,p ) £d }
Â
{p : path(q, p) £ d}
†
R(q,d)G(q)
≡Pr[rel( p) rel(q)Ÿ path(q, p) £ d]
Pr[rel( p)]
![Page 7: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/7.jpg)
WAW 2002
• Correlation of lexical andlinkage topology
• L(d): average linkdistance
• S(d): averagesimilarity tostart (topic)page from pagesup to distance d
• Correlationr(L,S) = –0.76
The “link-content” conjecture
†
S(q,d) ≡
sim(q, p){ p: path(q,p ) £d }
Â
{p : path(q, p) £ d}
![Page 8: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/8.jpg)
WAW 2002
Heterogeneity oflexical-linkage correlation
†
S = c + (1- c)eaLb edu net
gov
org
comsignif. diff. a only (a=0.05)
signif. diff. a & b (a=0.05)
![Page 9: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/9.jpg)
WAW 2002
Mapping the relationshipbetween topologies
• Any pair of pages rather than linked pages from crawl• Data: Open Directory Project (dmoz.org)
– RDF Snapshot: 2002-02-14 04:01:50 GMT– After cleanup: 896,233 URLs in 97,614 topics– After sampling: 150,000 URLs in 47,174 topics
• 10,000 from each of 15 top-level branches
• Need ‘similarity’ or ‘proximity’ metric for each topology,given a pair of pages:– Content: textual/lexical (cosine) similarity– Link: co-citation/bibliographic coupling– Semantic: relatedness inferred from manual classification
![Page 10: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/10.jpg)
WAW 2002
Content similarity†
s c (p1, p2) =
fkp1fkp2
k Œ p1 « p2
Â
fkp1
2
k Œ p1
 fkp2
2
k Œ p2
Âterm i
term j
term k
p1p2
p1 p2
†
s l (p1, p2) =Up1
«Up2
Up1»Up2
Link similarity
![Page 11: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/11.jpg)
WAW 2002
Semantic similarity
• Information-theoreticmeasure based on classification tree (Lin 1998)
• Classic path distance in special case of balanced tree
†
s s(c1,c2) =2logPr[lca(c1,c2)]
logPr[c1]+ logPr[c2]
top
lca
c1
c2
![Page 12: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/12.jpg)
WAW 2002
Correlationsbetween
similarities
0 0.1 0.2 0.3 0.4 0.5
Games
Arts
Business
Society
All pairs
Science
Shopping
Adult
Health
Recreation
Kids and Teens
Computers
Sports
Reference
Home
News
r
link-semanticcontent-semanticcontent-link
3.84 x 109
pairs
![Page 13: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/13.jpg)
WAW 2002
Joint distribution cube
sc
sl
100
>104
![Page 14: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/14.jpg)
WAW 2002
Link probabilityvs lexical distance
†
r =1
s c
-1
Pr(l | r) =(p,q) : r = r Ÿs l > l
(p,q) : r = r
Phasetransition
Power law tail
†
Pr(l | r) ~ r-a(l)
†
r*
Proc. Natl.Acad. Sci. USA99(22): 14014-14019, 2002
![Page 15: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/15.jpg)
WAW 2002
• Preferential attachment “BA”– At each step t add page pt
– Create m new links from pt to pi<t(Barabasi & Albert 1999, de Solla Price 1976)
• Modified BA(Bianconi & Barabasi 2001, Adamic & Huberman 2000)
• Mixture(Pennock & al. 2002,
Cooper & Frieze 2001, Dorogovtsev & al 2000)
• Web copying(Kleinberg, Kumar & al 1999, 2000)
• Mixture with Euclidean distance in graphs(Fabrikant, Koutsoupias & Papadimitriou 2002)
Web growth models
†
Pr(i) µ k(i)
†
Pr(i) µh(i)k(i)
†
Pr(i) µy ⋅ k(i) + (1-y) ⋅ c
†
i = argmin(frit + gi)
†
Pr(i) µy ⋅ Pr( j Æ i) + (1-y) ⋅ c
![Page 16: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/16.jpg)
WAW 2002
Local content-based growth model
†
Pr(pt Æ pi< t ) =k(i)mt
if r(pi, pt ) < r*
c[r(pi, pt )]-a otherwise
Ï Ì Ô
Ó Ô
g=4.3±0.1
g=4.26±0.07
• Similar to preferentialattachment (BA)
• At each step t add page pt
• Create m new links frompt to existing pages
• Use degree (k) info onlyfor nearby pages(popularity/importance of
similar/related pages)
![Page 17: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/17.jpg)
WAW 2002
Efficient crawling algorithms?• Theory: since the Web is a small world network, or
has a scale free degree distribution, there exist shortpaths between any two pages:– ~ log N (Barabasi & Albert 1999)
– ~ log N / log log N (Bollobas 2001)
• Practice: can’t find them!– Greedy algorithms based on location in geographical
small world networks: ~ poly(N) (Kleinberg 2000)
– Greedy algorithms based on degree in power lawnetworks: ~ N (Adamic, Huberman et al 2001)
![Page 18: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/18.jpg)
WAW 2002
Exception # 1
• Geographical networks(Kleinberg 2000)– Local links to all lattice neighbors
– Long-range link probability distribution:power law Pr ~ r–a
• r: lattice (Manhattan) distance• a: constant clustering exponent
†
t ~ log2 N ¤a = D
![Page 19: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/19.jpg)
WAW 2002
Exception # 2
• Hierarchical networks(Kleinberg 2002, Watts & al. 2002)– Nodes are classified at the leaves of tree
– Link probability distribution: exponential tailPr ~ e–h
• h: tree distance (height of lowest common ancestor)
h=1h=2
†
t ~ loge N,e ≥1
![Page 20: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/20.jpg)
WAW 2002
Is the Web one of these exceptions?Geographical model
– Replace lattice distanceby lexical distance
r = (1 / sc) – 1
Hierarchical model– Replace tree distance
by semantic distanceh = 1 – ss exponential
tail
local links
long range links (power law tail)
![Page 21: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/21.jpg)
WAW 2002
Talk outline• The topologies of the Web• Correlations, distributions, projections• Power laws and Web growth models• Navigating optimal paths• Semantic maps
![Page 22: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/22.jpg)
WAW 2002
Semantic maps: define“local” Precision and Recall
†
P(sc,sl ) =
s s(p,q){ p,q:s c = sc ,s l = sl }
Â
{p,q :s c = sc,s l = sl}
R(sc,sl ) =
s s(p,q){ p,q:s c = sc ,s l = sl }
Â
s s(p,q){ p,q}Â
Averagingsemantic similarity
Summingsemantic similarity
![Page 23: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/23.jpg)
WAW 2002
Semantic maps: Business
sc
sllog Recall Precision
![Page 24: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/24.jpg)
WAW 2002
Semantic maps: Adult
sc
sllog Recall Precision
![Page 25: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/25.jpg)
WAW 2002
Semantic maps: Computers
sc
sllog Recall Precision
![Page 26: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/26.jpg)
WAW 2002
Semantic maps: Home
sc
sllog Recall Precision
![Page 27: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/27.jpg)
WAW 2002
Semantic maps: News
sc
sllog Recall Precision
![Page 28: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/28.jpg)
WAW 2002
Semantic maps: all pairs
sc
sllog Recall Precision
![Page 29: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/29.jpg)
WAW 2002
So what?
• Interpret performance of search engines• Understand “ranking optimization”
– vs filtering– vs combinations
• Topical signatures for topical/community portals• Design “better” (and more scalable) crawlers
– Topic driven– Query driven– User/community/peer driven
• Competitive intelligence, security applications
![Page 30: Web Structure & Content - Indiana University Bloomingtoncarl.cs.indiana.edu/fil/Web/WAW2002.pdf · 2014-09-04 · WAW 2002 Exploiting the Web’s text and link cues •Pages close](https://reader033.vdocuments.us/reader033/viewer/2022050223/5f6898475a6bff2de27bd427/html5/thumbnails/30.jpg)
WAW 2002
Talk outline• The topologies of the Web• Correlations, distributions, projections• Power laws and Web growth models• Navigating optimal paths• Semantic maps• Questions?
http://dollar.biz.uiowa.edu/~fil/
Research supported by NSF CAREER Award IIS-0133124