![Page 1: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/1.jpg)
Navigating the Web graph
Workshop on Networks and Navigation
Santa Fe Institute, August 2008
Filippo Menczer
Informatics & Computer Science
Indiana University, Bloomington
![Page 2: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/2.jpg)
Outline
Topical locality: Content, link, and semantic topologies
Implications for growth models and navigation
Applications
Topical Web crawlers
Distributed collaborative peer search
![Page 3: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/3.jpg)
The Web as a text corpus
Pages close in word vector space tend to be related
Cluster hypothesis (van Rijsbergen 1979)
The WebCrawler (Pinkerton 1994)
The whole first generation of search engines
[Figure: pages p1 and p2 as vectors in a term space with axes "weapons", "mass", and "destruction"]
![Page 4: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/4.jpg)
Enter the Web’s link structure
Broder & al. 2000
$$p(i) = \frac{\alpha}{N} + (1-\alpha) \sum_{j:\, j \to i} \frac{p(j)}{|\{\ell : j \to \ell\}|}$$
Brin & Page 1998
Barabasi & Albert 1999
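The PageRank recurrence on this slide can be iterated directly. The following is a minimal sketch, assuming a toy four-page graph and α = 0.15; the graph, the function name `pagerank`, and the iteration count are illustrative assumptions and do not come from the talk. Pages without out-links are not handled.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Iterate p(i) = alpha/N + (1 - alpha) * sum over j->i of p(j) / outdeg(j).

    out_links maps each page to the list of pages it links to.
    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(out_links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        nxt = {page: alpha / n for page in pages}
        for j, targets in out_links.items():
            # Page j splits its (damped) rank evenly among its out-links.
            share = (1 - alpha) * p[j] / len(targets)
            for i in targets:
                nxt[i] += share
        p = nxt
    return p


# Hypothetical toy graph: page "d" has no in-links.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
```

Because every page here has out-links, the ranks stay normalized to 1 at each iteration, and a page with no in-links keeps only the uniform α/N term.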
![Page 5: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/5.jpg)
Three network topologies
Text Links
![Page 6: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/6.jpg)
Three network topologies
Text, Links, Meaning
![Page 7: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/7.jpg)
Connection between semantic topology (topicality or relevance) and link topology (hypertext)
G = Pr[rel(p)] ~ fraction of relevant pages (generality)
R = Pr[rel(p) | rel(q) AND link(q,p)]
Related nodes are “clustered” if R > G (modularity)
Necessary and sufficient condition for a random crawler to find pages related to start points
G = 5/15, C = 2, R = 3/6 = 2/4
The “link-cluster” conjecture
ICML 1997
![Page 8: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/8.jpg)
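Link-cluster conjecture
• Stationary hit rate for a random crawler:
$$\eta(t+1) = \eta(t) \cdot R + (1 - \eta(t)) \cdot G \ge \eta(t)$$
$$\eta(t) \xrightarrow{t \to \infty} \eta^* = \frac{G}{1 - (R - G)}$$
Conjecture:
$$\eta^* > G \iff R > G$$
Value added:
$$\frac{\eta^*}{G} - 1 = \frac{R - G}{1 - (R - G)}$$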
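As a quick numerical check of the stationary hit rate, the recurrence can be iterated until it settles. This is a small sketch using the example values G = 5/15 and R = 3/6 from the earlier slide; the function name, the starting value η(0) = G, and the step count are assumptions made for illustration.

```python
def stationary_hit_rate(G, R, eta0=None, steps=200):
    """Iterate eta(t+1) = eta(t) * R + (1 - eta(t)) * G from eta0 (default: G)."""
    eta = G if eta0 is None else eta0
    for _ in range(steps):
        eta = eta * R + (1 - eta) * G
    return eta


G, R = 5 / 15, 3 / 6
eta_star = stationary_hit_rate(G, R)
closed_form = G / (1 - (R - G))   # fixed point of the recurrence
value_added = eta_star / G - 1    # relative gain over the generality G
```

With these values the fixed point is 0.4 and the value added is 0.2, which matches (R − G) / (1 − (R − G)).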
![Page 9: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/9.jpg)
Pages that link to each other tend to be related
Preservation of semantics (meaning)
A.k.a. topic drift
Link-cluster conjecture
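$$L(q,\delta) \equiv \frac{\sum_{\{p:\ \mathrm{path}(q,p) \le \delta\}} \mathrm{path}(q,p)}{|\{p:\ \mathrm{path}(q,p) \le \delta\}|}$$
$$\frac{R(q,\delta)}{G(q)} \equiv \frac{\Pr[\mathrm{rel}(p) \mid \mathrm{rel}(q) \wedge \mathrm{path}(q,p) \le \delta]}{\Pr[\mathrm{rel}(p)]}$$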
JASIST 2004
![Page 10: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/10.jpg)
Correlation of lexical and linkage topology
L(δ): average link distance
S(δ): average similarity to start (topic) page from pages up to distance δ
Correlation ρ(L,S) = –0.76
The “link-content” conjecture
$$S(q,\delta) \equiv \frac{\sum_{\{p:\ \mathrm{path}(q,p) \le \delta\}} \mathrm{sim}(q,p)}{|\{p:\ \mathrm{path}(q,p) \le \delta\}|}$$
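The averages L(q,δ) and S(q,δ) can be computed with a breadth-first search from the start page. The sketch below assumes a toy link graph and precomputed similarities to q; all names and values are hypothetical, and excluding q itself from the averaged set is an assumption, since the slide does not say whether q counts.

```python
from collections import deque


def bfs_distances(out_links, start, max_depth):
    """Shortest link distance from start to every page within max_depth hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if dist[page] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in out_links.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist


def link_distance_and_similarity(out_links, sim_to_q, q, delta):
    """Return (L, S): mean link distance and mean similarity to q up to delta."""
    dist = bfs_distances(out_links, q, delta)
    reached = [p for p in dist if p != q]  # assumption: q itself is excluded
    L = sum(dist[p] for p in reached) / len(reached)
    S = sum(sim_to_q[p] for p in reached) / len(reached)
    return L, S


# Hypothetical toy data: similarity to q decays with link distance.
toy_links = {"q": ["a", "b"], "a": ["c"]}
toy_sim = {"a": 0.8, "b": 0.6, "c": 0.1}
L1, S1 = link_distance_and_similarity(toy_links, toy_sim, "q", 1)
L2, S2 = link_distance_and_similarity(toy_links, toy_sim, "q", 2)
```

In this toy case L grows and S shrinks as δ increases, which is the direction of the negative correlation reported on the slide.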
![Page 11: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/11.jpg)
Heterogeneity of link-content correlation
$$S = c + (1 - c)\, e^{a L^b}$$
[Figure: fits by top-level domain (edu, net, gov, com, org); legend: signif. diff. a only (p<0.05); signif. diff. a & b (p<0.05)]
![Page 12: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/12.jpg)
Mapping the relationship between links, content, and semantic topologies
• Given any pair of pages, need ‘similarity’ or ‘proximity’ metric for each topology:
– Content: textual/lexical (cosine) similarity
– Link: co-citation/bibliographic coupling
– Semantic: relatedness inferred from manual classification
• Data: Open Directory Project (dmoz.org)
– ~ 1 M pages after cleanup
– ~ 1.3 × 10^12 page pairs!
![Page 13: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/13.jpg)
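Content similarity
$$\sigma_c(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \cdot \|p_2\|}$$
[Figure: pages p1 and p2 as vectors in a term space with axes term i, term j, term k]
Link similarity
$$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$$
[Figure: pages p1 and p2 with overlapping link neighborhoods]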
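Both measures are short to compute. The sketch below assumes raw term-frequency vectors built by whitespace tokenization, and treats U_p as a plain set of URLs in the link neighborhood of p; the slide does not define U_p or the term weighting, so both choices are assumptions made for illustration.

```python
import math
from collections import Counter


def content_similarity(text1, text2):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    v1 = Counter(text1.lower().split())
    v2 = Counter(text2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def link_similarity(urls1, urls2):
    """Jaccard coefficient |U1 & U2| / |U1 | U2| of two URL neighborhoods."""
    u1, u2 = set(urls1), set(urls2)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0


# Hypothetical inputs.
same_text = content_similarity("weapons mass destruction", "mass destruction weapons")
no_overlap = content_similarity("weapons mass", "web graph")
shared_links = link_similarity({"x", "y", "z"}, {"y", "z", "w"})
```

Both functions return values in [0, 1], so they can be compared cell by cell when the two topologies are mapped against each other.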
![Page 14: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/14.jpg)
Semantic similarity
• Information-theoretic measure based on classification tree (Lin 1998)
• Classic path distance in special case of balanced tree
$$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[\mathrm{lca}(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$$
[Figure: classification tree showing top, lca, c1, c2]
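A minimal sketch of the tree-based measure follows. It assumes that Pr[c] is estimated as the fraction of pages classified in the subtree rooted at c, and that topics are slash-separated paths sharing a common root; the page counts are made up and the helper names are hypothetical.

```python
import math


def ancestors(topic):
    """All prefixes of a slash-separated topic path, including the topic itself."""
    parts = topic.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def subtree_probabilities(page_counts):
    """Pr[c]: fraction of all pages filed under c or any of its descendants."""
    total = sum(page_counts.values())
    mass = {}
    for topic, count in page_counts.items():
        for anc in ancestors(topic):
            mass[anc] = mass.get(anc, 0) + count
    return {topic: m / total for topic, m in mass.items()}


def lowest_common_ancestor(c1, c2):
    """Longest shared path prefix; assumes both topics share the same root."""
    common = []
    for a, b in zip(c1.split("/"), c2.split("/")):
        if a != b:
            break
        common.append(a)
    return "/".join(common)


def semantic_similarity(c1, c2, prob):
    """Tree-based similarity: 2 log Pr[lca] / (log Pr[c1] + log Pr[c2])."""
    denom = math.log(prob[c1]) + math.log(prob[c2])
    if denom == 0:
        return 1.0  # both topics are the root
    return 2 * math.log(prob[lowest_common_ancestor(c1, c2)]) / denom


# Hypothetical page counts per leaf topic.
counts = {
    "Top/Shopping/Clothing/Footwear": 10,
    "Top/Shopping/Clothing/Uniforms": 10,
    "Top/Sports/Cycling/Racing": 20,
}
prob = subtree_probabilities(counts)
siblings = semantic_similarity(
    "Top/Shopping/Clothing/Footwear", "Top/Shopping/Clothing/Uniforms", prob)
unrelated = semantic_similarity(
    "Top/Shopping/Clothing/Footwear", "Top/Sports/Cycling/Racing", prob)
```

Topics whose only common ancestor is the root get similarity 0, because Pr[top] = 1 and its logarithm vanishes.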
![Page 15: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/15.jpg)
Individual metric distributions
[Figure: distributions of semantic, content, and link similarity]
![Page 16: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/16.jpg)
Precision = |Retrieved & Relevant| / |Retrieved|
Recall = |Retrieved & Relevant| / |Relevant|
![Page 17: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/17.jpg)
Precision = |Retrieved & Relevant| / |Retrieved|
Recall = |Retrieved & Relevant| / |Relevant|
Averaging semantic similarity:
$$P(s_c, s_l) = \frac{\sum_{\{p,q:\ \sigma_c = s_c,\ \sigma_l = s_l\}} \sigma_s(p,q)}{|\{p,q:\ \sigma_c = s_c,\ \sigma_l = s_l\}|}$$
Summing semantic similarity:
$$R(s_c, s_l) = \frac{\sum_{\{p,q:\ \sigma_c = s_c,\ \sigma_l = s_l\}} \sigma_s(p,q)}{\sum_{\{p,q\}} \sigma_s(p,q)}$$
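The two semantic measures can be sketched as one pass over page pairs. The code below assumes that each pair is given as a (σc, σl, σs) tuple and that "σc = sc, σl = sl" is approximated by equal-width bins; the binning, the function name, and the sample pairs are assumptions, not details from the talk.

```python
from collections import defaultdict


def semantic_precision_recall(pairs, bins=4):
    """pairs: iterable of (sigma_c, sigma_l, sigma_s) tuples, one per page pair.

    Returns two dicts keyed by (content_bin, link_bin):
    precision = mean sigma_s in the cell, recall = cell's share of total sigma_s.
    Assumes the total semantic similarity is greater than zero.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    for sc, sl, ss in pairs:
        key = (min(int(sc * bins), bins - 1), min(int(sl * bins), bins - 1))
        sums[key] += ss
        counts[key] += 1
        total += ss
    precision = {k: sums[k] / counts[k] for k in sums}
    recall = {k: sums[k] / total for k in sums}
    return precision, recall


# Hypothetical page pairs.
toy_pairs = [(0.9, 0.9, 0.8), (0.9, 0.8, 0.6), (0.1, 0.1, 0.2), (0.1, 0.2, 0.4)]
precision, recall = semantic_precision_recall(toy_pairs, bins=2)
```

Recall values sum to 1 over all cells, while precision is an average and carries no such constraint.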
![Page 18: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/18.jpg)
Science
[Figure: log Recall and Precision maps over (σc, σl)]
![Page 19: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/19.jpg)
Adult
[Figure: log Recall and Precision maps over (σc, σl)]
![Page 20: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/20.jpg)
News
[Figure: log Recall and Precision maps over (σc, σl)]
![Page 21: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/21.jpg)
All pairs
[Figure: log Recall and Precision maps over (σc, σl)]
![Page 22: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/22.jpg)
Outline
Topical locality: Content, link, and semantic topologies
Implications for growth models and navigation
Applications
Topical Web crawlers
Distributed collaborative peer search
![Page 23: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/23.jpg)
Link probability vs lexical distance
$$r = \frac{1}{\sigma_c} - 1$$
$$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$$
![Page 24: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/24.jpg)
Link probability vs lexical distance
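$$r = \frac{1}{\sigma_c} - 1$$
$$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$$
Phase transition
Power law tail:
$$\Pr(\lambda \mid \rho) \sim \rho^{-\alpha(\lambda)}$$
[Figure annotation: $\rho^*$]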
Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002
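The conditional link probability can be estimated empirically by grouping page pairs by lexical distance. This is a small sketch assuming pairs are given as (σc, σl) tuples and that "r = ρ" is approximated by a half-open interval of distances; the interval bounds and sample pairs are hypothetical.

```python
def lexical_distance(sigma_c):
    """r = 1 / sigma_c - 1, defined for sigma_c > 0."""
    return 1.0 / sigma_c - 1.0


def link_probability(pairs, rho_lo, rho_hi, lam):
    """Fraction of pairs with rho_lo <= r < rho_hi whose sigma_l exceeds lam.

    pairs: iterable of (sigma_c, sigma_l). Returns None if the bin is empty.
    """
    in_bin = [sl for sc, sl in pairs
              if sc > 0 and rho_lo <= lexical_distance(sc) < rho_hi]
    if not in_bin:
        return None
    return sum(1 for sl in in_bin if sl > lam) / len(in_bin)


# Hypothetical page pairs: (content similarity, link similarity).
toy_pairs = [(0.5, 0.3), (0.5, 0.0), (0.25, 0.1), (0.25, 0.2), (0.1, 0.0)]
near = link_probability(toy_pairs, 0.5, 1.5, 0.05)   # pairs with r = 1
far = link_probability(toy_pairs, 8.0, 10.0, 0.05)   # pairs with r = 9
```

Pairs with σc = 0 have infinite lexical distance and are skipped here.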
![Page 25: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/25.jpg)
Local content-based growth model
• Similar to preferential attachment (BA)
• Use degree info (popularity/importance) only for nearby (similar/related) pages
$$\Pr(p_t \to p_{i<t}) = \begin{cases} \dfrac{k(i)}{mt} & \text{if } r(p_i, p_t) < \rho^* \\ c\,[r(p_i, p_t)]^{-\alpha} & \text{otherwise} \end{cases}$$
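The piecewise rule reads directly as a function. In this sketch every parameter value (m, t, ρ*, α, c) is a made-up placeholder chosen only to exercise both branches; none of them come from the talk.

```python
def local_link_probability(k_i, r, m, t, rho_star, alpha, c):
    """Probability that new page p_t links to existing page p_i.

    Nearby pages (r < rho_star) are chosen by degree, as in preferential
    attachment; distant pages are chosen with a power-law decay in r.
    """
    if r < rho_star:
        return k_i / (m * t)
    return c * r ** (-alpha)


# Hypothetical values: a degree-4 page at two different lexical distances.
nearby = local_link_probability(k_i=4, r=0.5, m=2, t=10, rho_star=1.0, alpha=1.7, c=0.01)
distant = local_link_probability(k_i=4, r=2.0, m=2, t=10, rho_star=1.0, alpha=1.7, c=0.01)
```

Degree matters only in the first branch, so a popular but lexically distant page gains nothing from its popularity.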
![Page 26: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/26.jpg)
So, many models can predict degree distributions...
Which is “right”?
Need an independent observation (other than degree) to validate models
Distribution of content similarity across linked pairs
![Page 27: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/27.jpg)
None of these models is right!
![Page 28: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/28.jpg)
The mixture model
$$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$$
degree-uniform mixture
[Figure: panels a, b, c, each showing new node t and existing nodes i1, i2, i3]
![Page 29: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/29.jpg)
The mixture model
Bias choice by content similarity instead of uniform distribution
$$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$$
degree-uniform mixture
[Figure: panels a, b, c, each showing new node t and existing nodes i1, i2, i3]
![Page 30: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/30.jpg)
Degree-similarity mixture model
$$\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot \frac{k(i)}{mt}$$
![Page 31: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/31.jpg)
Degree-similarity mixture model
$$\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot \frac{k(i)}{mt}$$
$$\hat{\Pr}(i) \propto [r(i, t)]^{-\alpha}$$
ψ = 0.2, α = 1.7
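A minimal sketch of the attachment probabilities in the degree-similarity mixture, using ψ = 0.2 and α = 1.7 from the slide. Normalizing the degree term by the sum of current degrees (standing in for mt) and the similarity term by the sum of the power-law weights are assumptions made so that the probabilities add to 1; the degrees and distances are hypothetical.

```python
import random


def mixture_probabilities(degrees, distances, psi=0.2, alpha=1.7):
    """Pr(i) = psi * Pr_hat(i) + (1 - psi) * k(i) / sum of degrees.

    degrees[i]   : current degree k(i) of existing page i
    distances[i] : lexical distance r(i, t) > 0 from page i to the new page t
    Pr_hat(i) is proportional to r(i, t) ** -alpha.
    """
    sim_weights = [r ** (-alpha) for r in distances]
    sim_total = sum(sim_weights)
    deg_total = sum(degrees)
    return [psi * w / sim_total + (1 - psi) * k / deg_total
            for w, k in zip(sim_weights, degrees)]


# Hypothetical state: three existing pages.
degrees = [1, 1, 2]
distances = [1.0, 2.0, 4.0]
probs = mixture_probabilities(degrees, distances)

# Draw the target of one new link with a seeded generator.
target = random.Random(0).choices(range(len(probs)), weights=probs, k=1)[0]
```

With equal degrees, the lexically closer page gets the higher probability, which is the bias that distinguishes this model from the degree-uniform mixture.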
![Page 32: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/32.jpg)
Both mixture models get the degree distribution right…
![Page 33: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/33.jpg)
…but the degree-similarity mixture model predicts the similarity distribution better
Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
![Page 34: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/34.jpg)
Citation networks
15,785 articles published in PNAS between 1997 and 2002
[Figure: two panels, DMOZ and PNAS, plotted over σc and σl, each axis from 0 to 1]
![Page 35: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/35.jpg)
Citation networks
![Page 36: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/36.jpg)
Citation networks
![Page 37: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/37.jpg)
Open Questions
Understand distribution of content similarity across all pairs of pages
Growth model to explain co-evolution of both link topology and content similarity
The role of search engines
![Page 38: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/38.jpg)
Efficient crawling algorithms?
Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:
~ log N (Barabasi & Albert 1999)
~ log N / log log N (Bollobas 2001)
![Page 39: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/39.jpg)
Efficient crawling algorithms?
Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:
~ log N (Barabasi & Albert 1999)
~ log N / log log N (Bollobas 2001)
Practice: can’t find them!
• Greedy algorithms based on location in geographical small world networks: ~ poly(N) (Kleinberg 2000)
• Greedy algorithms based on degree in power law networks: ~ N (Adamic, Huberman & al. 2001)
![Page 40: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/40.jpg)
Exception # 1
• Geographical networks (Kleinberg 2000)
– Local links to all lattice neighbors
– Long-range link probability distribution: power law Pr ~ r^(–α)
• r: lattice (Manhattan) distance
• α: constant clustering exponent
$$t \sim \log^2 N \iff \alpha = D$$
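Greedy routing on such a lattice can be sketched in a few lines. The code below assumes a two-dimensional n × n grid (D = 2), one long-range contact per node drawn with probability proportional to r^(–α), and contacts sampled lazily when a node is first visited; the grid size, seed, and function names are illustrative assumptions.

```python
import random


def manhattan(a, b):
    """Lattice (Manhattan) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lattice_neighbors(node, n):
    """Local links: the up-to-four grid neighbors inside an n x n lattice."""
    x, y = node
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n and 0 <= j < n]


def long_range_contact(node, n, alpha, rng):
    """One long-range link, drawn with probability proportional to r ** -alpha."""
    others = [(i, j) for i in range(n) for j in range(n) if (i, j) != node]
    weights = [manhattan(node, o) ** (-alpha) for o in others]
    return rng.choices(others, weights=weights, k=1)[0]


def greedy_route(source, target, n, alpha, rng):
    """Always forward to the known contact closest to the target; count hops."""
    current, steps = source, 0
    while current != target:
        options = lattice_neighbors(current, n)
        options.append(long_range_contact(current, n, alpha, rng))
        current = min(options, key=lambda v: manhattan(v, target))
        steps += 1
    return steps


rng = random.Random(0)
steps = greedy_route((0, 0), (19, 19), n=20, alpha=2.0, rng=rng)
```

Some lattice neighbor is always one step closer to the target, so the route never needs more hops than the initial Manhattan distance; long-range contacts can only shorten it.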
![Page 41: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/41.jpg)
Is the Web a geographical network?
local links
long range links (power law tail)
Replace lattice distance by lexical distance
r = (1 / σc) – 1
![Page 42: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/42.jpg)
Exception # 2
• Hierarchical networks (Kleinberg 2002, Watts & al. 2002)
– Nodes are classified at the leaves of tree
– Link probability distribution: exponential tail Pr ~ e^(–h)
• h: tree distance (height of lowest common ancestor)
[Figure: tree with examples h=1 and h=2]
$$t \sim \log^{\varepsilon} N,\ \varepsilon \ge 1$$
![Page 43: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/43.jpg)
Is the Web a hierarchical network?
Replace tree distance by semantic distance
h = 1 – σs
[Figure: link probability with exponential tail; classification tree showing top, lca, c1, c2]
![Page 44: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/44.jpg)
Take home message: the Web is a “friendly” place!
![Page 45: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/45.jpg)
Outline
Topical locality: Content, link, and semantic topologies
Implications for growth models and navigation
Applications
Topical Web crawlers
Distributed collaborative peer search
![Page 46: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/46.jpg)
Crawler applications
• Universal Crawlers
– Search engines!
• Topical crawlers
– Live search (e.g., myspiders.informatics.indiana.edu)
– Topical search engines & portals
– Business intelligence (find competitors/partners)
– Distributed, collaborative search
![Page 47: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/47.jpg)
Topical crawlers
[Figure annotation: “spears” [sic]]
![Page 48: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/48.jpg)
Evaluating topical crawlers
• Goal: build “better” crawlers to support applications
• Build an unbiased evaluation framework
– Define common tasks of measurable difficulty
– Identify topics, relevant targets
– Identify appropriate performance measures
• Effectiveness: quality of crawler pages, order, etc.
• Efficiency: separate CPU & memory of crawler algorithms from bandwidth & common utilities
Information Retrieval 2005
![Page 49: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/49.jpg)
Evaluating topical crawlers: Topics
• Automate evaluation using edited directories
• Different sources of relevance assessments
Keywords
Description
Targets
![Page 50: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/50.jpg)
Evaluating topical crawlers: Tasks
Start from seeds, find targets and/or pages similar to target descriptions
[Figure: seeds at link distance d=2 and d=3 from targets]
![Page 51: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/51.jpg)
Examples of crawling algorithms
• Breadth-First
– Visit links in order encountered
• Best-First
– Priority queue sorted by similarity
– Variants:
  – explore top N at a time
  – tag tree context
  – hub scores
• SharkSearch
– Priority queue sorted by combination of similarity, anchor text, similarity of parent, etc.
• InfoSpiders
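The Best-First entry in the list above can be sketched with a priority queue. In this minimal version, `fetch_links` and `score` are hypothetical callables supplied by the caller, and each link is scored directly; a real crawler would typically estimate a link's score from the page it was found on, so treat the scoring scheme as an assumption.

```python
import heapq


def best_first_crawl(seeds, fetch_links, score, max_pages):
    """Visit pages in order of decreasing score using a priority queue.

    seeds       : iterable of starting URLs
    fetch_links : callable url -> list of URLs linked from that page
    score       : callable url -> estimated similarity to the topic
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Hypothetical toy Web and topic scores.
toy_web = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
toy_scores = {"seed": 0.5, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}
order = best_first_crawl(["seed"], toy_web.__getitem__, toy_scores.__getitem__, max_pages=4)
```

A breadth-first crawler would visit "a" before "d" here; the priority queue instead follows the high-scoring branch through "b" first.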
![Page 52: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/52.jpg)
![Page 53: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/53.jpg)
Exploration vs. Exploitation
[Figure: average target recall vs. pages crawled]
![Page 54: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/54.jpg)
Co-citation: hub scores
Link score_hub = linear combination between link and hub score
![Page 55: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/55.jpg)
Recall (159 ODP topics)
Split ODP URLs between seeds and targets
Add 10 best hubs to seeds for 94 topics
[Figure: average target recall@N (%) vs. N (pages crawled, 0 to 10000) for Breadth-First, Naive Best-First, DOM, and Hub-Seeker]
ECDL 2003
![Page 56: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/56.jpg)
InfoSpiders adaptive distributed algorithm using an evolving population of learning agents
![Page 57: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/57.jpg)
InfoSpiders adaptive distributed algorithm using an evolving population of learning agents
[Figure labels: keyword vector, neural net, local frontier]
![Page 58: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/58.jpg)
InfoSpiders adaptive distributed algorithm using an evolving population of learning agents
[Figure labels: keyword vector, neural net, local frontier, offspring]
![Page 59: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/59.jpg)
![Page 60: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/60.jpg)
![Page 61: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/61.jpg)
![Page 62: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/62.jpg)
Evolutionary Local Selection Algorithm (ELSA)
Foreach agent thread:
    Pick & follow link from local frontier
    Evaluate new links, merge frontier
    Adjust link estimator
    E := E + payoff - cost
    If E < 0:
        Die
    Elsif E > Selection_Threshold:
        Clone offspring
        Split energy with offspring
        Split frontier with offspring
        Mutate offspring
[Figure annotations: selective query expansion, match resource bias, reinforcement learning]
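The energy bookkeeping of the loop above can be sketched in a simplified, single-threaded form. This version picks links uniformly at random and leaves out the link estimator and the mutation step, so it only illustrates the die/clone dynamics; the toy graph, relevance values, cost, and threshold are all hypothetical.

```python
import random


def elsa_crawl(graph, relevance, seeds, steps=50, cost=0.2,
               threshold=1.0, initial_energy=0.5, rng=None):
    """Simplified local-selection loop (no link estimator, no mutation)."""
    rng = rng or random.Random(0)
    agents = [{"energy": initial_energy, "frontier": [s]} for s in seeds]
    visited = set()
    for _ in range(steps):
        survivors = []
        for agent in agents:
            if not agent["frontier"]:
                continue  # nothing left to follow: the agent drops out
            # Pick & follow a link from the local frontier (uniformly here).
            url = agent["frontier"].pop(rng.randrange(len(agent["frontier"])))
            # Payoff is earned only for pages no agent has visited before.
            payoff = 0.0 if url in visited else relevance.get(url, 0.0)
            visited.add(url)
            # Merge newly discovered links into the local frontier.
            agent["frontier"].extend(
                link for link in graph.get(url, []) if link not in visited)
            agent["energy"] += payoff - cost
            if agent["energy"] < 0:
                continue  # die
            if agent["energy"] > threshold:
                # Clone: split energy and frontier with the offspring.
                half = len(agent["frontier"]) // 2
                child = {"energy": agent["energy"] / 2,
                         "frontier": agent["frontier"][half:]}
                agent["energy"] /= 2
                agent["frontier"] = agent["frontier"][:half]
                survivors.append(child)
            survivors.append(agent)
        agents = survivors
        if not agents:
            break
    return visited, agents


# Hypothetical toy Web with per-page relevance.
toy_graph = {"s": ["a", "b"], "a": ["c"], "b": [], "c": []}
toy_relevance = {"s": 1.0, "a": 0.8, "b": 0.0, "c": 0.9}
visited, agents = elsa_crawl(toy_graph, toy_relevance, ["s"])
```

Because selection compares each agent's energy with a fixed threshold rather than with other agents, the population grows where relevant pages are found and shrinks elsewhere without any global ranking step.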
![Page 63: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/63.jpg)
Action selection
![Page 64: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/64.jpg)
Q-learning
Compare estimated relevance of visited document with estimated relevance of link followed from previous page
Teaching input: $E(D) + \mu \max_{l(D)} \lambda_l$
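The teaching input is a one-line computation. The sketch below assumes μ acts as a discount factor, that λ_l are the agent's current estimates for the links on document D, and that a page with no links contributes zero future value; the numeric values are hypothetical.

```python
def teaching_input(doc_relevance, link_estimates, mu=0.5):
    """Target value E(D) + mu * max over links l on D of lambda_l.

    doc_relevance  : estimated relevance E(D) of the visited document
    link_estimates : lambda_l for each link on D (may be empty)
    mu             : discount factor (assumed value)
    """
    best_next = max(link_estimates) if link_estimates else 0.0
    return doc_relevance + mu * best_next


# Hypothetical values: a moderately relevant page with two outgoing links.
target = teaching_input(0.4, [0.1, 0.6], mu=0.5)

# Error against the earlier estimate for the link that led to D.
previous_estimate = 0.5
error = target - previous_estimate
```

The error term is what would be fed back to train the agent's link estimator.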
![Page 65: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/65.jpg)
Performance
ACM Trans. Internet Technology 2003
[Figure: average target recall vs. pages crawled]
![Page 66: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/66.jpg)
![Page 67: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/67.jpg)
![Page 68: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/68.jpg)
http://sixearch.org
![Page 69: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/69.jpg)
http://sixearch.org
![Page 70: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/70.jpg)
http://sixearch.org
![Page 71: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/71.jpg)
6S: Collaborative Peer Search
[Figure: peers A, B, and C exchanging queries; each peer combines bookmarks, local storage, an index, and a crawler connected to the WWW]
WWW2004, WWW2005, WTAS2005, P2PIR2006
![Page 72: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/72.jpg)
6S: Collaborative Peer Search
[Figure: peers A, B, and C exchanging queries and hits; each peer combines bookmarks, local storage, an index, and a crawler connected to the WWW]
Data mining & referral opportunities
WWW2004, WWW2005, WTAS2005, P2PIR2006
![Page 73: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/73.jpg)
6S: Collaborative Peer Search
[Figure: peers A, B, and C exchanging queries and hits; each peer combines bookmarks, local storage, an index, and a crawler connected to the WWW]
Data mining & referral opportunities
Emerging communities
WWW2004, WWW2005, WTAS2005, P2PIR2006
![Page 74: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/74.jpg)
Reinforcement Learning
![Page 75: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/75.jpg)
Query Routing
![Page 76: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/76.jpg)
Simulating 500 Users
ODP (dmoz.org)
Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006
![Page 77: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/77.jpg)
Simulating 500 Users
ODP (dmoz.org)
Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006
![Page 78: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/78.jpg)
Simulating 500 Users
ODP (dmoz.org)
Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006
![Page 79: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/79.jpg)
P@10
![Page 80: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/80.jpg)
Distributed vs Centralized
[Figure: precision (0 to 0.7) vs. recall (0 to 0.25) for 6S and a centralized search engine]
![Page 81: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/81.jpg)
Small-world
[Figure: average change (%) in cluster coefficient and diameter vs. queries per peer; network visualizations drawn with Pajek]
![Page 82: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/82.jpg)
Semantic Similarity
![Page 83: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/83.jpg)
Semantic Similarity
Arts/Movies/Filmmaking
Business/Arts_and_Entertainment/Fashion
Business/E-Commerce/Developers
Business/Telecommunications/Call_Centers
Computers/Programming/Graphics
Health/Conditions_and_Diseases/Cancer
Health/Mental_Health/Grief,_Loss_and_Bereavement
Health/Professions/Midwifery
Health/Reproductive_Health/Birth_Control
Home/Family/Pregnancy
Shopping/Clothing/Accessories
Shopping/Clothing/Footwear
Shopping/Clothing/Uniforms
Shopping/Sports/Cycling
Shopping/Visual_Arts/Artist_Created_Prints
Society/Issues/Abortion
Society/People/Women
Sports/Cycling/Racing
![Page 84: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/84.jpg)
Ongoing Work
• Improve coverage/diversity in query routing algorithm
• Spam protection: trust/reputation subsystem
• User study with 6S application
![Page 85: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/85.jpg)
User study
![Page 86: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/86.jpg)
Query network
![Page 87: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/87.jpg)
Result network
![Page 88: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/88.jpg)
http://sixearch.org
Questions?
![Page 89: Navigating the Web graph - caida.org](https://reader033.vdocuments.us/reader033/viewer/2022052722/628ee3d856816123a51a71bb/html5/thumbnails/89.jpg)
Thank you!
Questions?
informatics.indiana.edu/fil
Research supported by NSF CAREER Award IIS-0348940