WebBase: Building a Web Warehouse
Hector Garcia-Molina, Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
2
The Web
• A universal information resource
– Model weak, strong agreement
• How to exploit it?
3
WebBase
[screenshot: WebBase web page]
4
WebBase Goals
• Manage very large collections of Web pages
– Today: 1500GB HTML, 200M pages
• Enable large-scale Web-related research
• Locally provide a significant portion of the Web
• Efficient wide-area Web data distribution
5
WebBase Architecture
6
WebBase Remote Users
• Berkeley
• Columbia
• U. Washington
• Harvey Mudd
• Università degli Studi di Milano
• U. of Arizona
• California Digital Library
• Cornell
• U. of Houston
• Learning Lab Lower Saxony (L3S)
• France Telecom
• U. Texas
7
Outline
• Technical Challenges
• WebBase Use
• The Future
8
Challenges
• Scalability
– crawling
– archive distribution
– index construction
– storage
• Consistency
– freshness
– versions
• Dissemination
• Archiving
– “units”
– coordination
• IP Management
– copy access
– link access
– access control
• Hidden Web
• Topic-Specific Collection Building
9
What is a Crawler?
[diagram: crawler loop. init loads the initial URLs into a to-visit queue; the crawler then repeatedly gets the next URL, gets the page from the web, stores it in the web-pages store, and extracts its URLs, adding unseen ones to the to-visit queue while recording visited URLs]
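The loop in the diagram can be sketched in Python. This is a minimal illustration only; `fetch` and `extract_links` are caller-supplied stand-ins for real HTTP fetching and HTML link extraction:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Minimal crawler loop over the structures in the diagram:
    a queue of to-visit URLs, a set of visited URLs, and a page store."""
    to_visit = deque(seed_urls)           # initial urls -> to-visit queue
    visited = set()                       # visited urls
    pages = {}                            # web pages store
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()          # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                 # get page
        pages[url] = page
        for link in extract_links(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages
```

A real crawler adds politeness delays, robots.txt handling, and per-site queues, as the later slides discuss.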
10
Parallel Crawling
[diagram: several crawler processes C, ..., fetching from the web in parallel]
11
Independent Crawlers
[diagram: two independent crawlers C fetching from the web; site 1 holds pages a–e, site 2 holds pages f–i, with links crossing between the sites]
12
Partition: Firewall
[diagram: the two sites split by a partition line; each crawler C stays strictly on its own side (firewall mode)]
Partition functions: URL hash, site hash, hierarchical.
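The URL-hash and site-hash partition functions can be sketched as follows. `assign_crawler` is a hypothetical name, and WebBase's actual hashing details may differ; the point is that hashing the site name keeps a whole site on one crawler (fewer cross-partition links), while hashing the full URL balances load more evenly:

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url, num_crawlers, by_site=True):
    """Map a URL to one of num_crawlers partitions.
    Site hash: all pages of a site go to the same crawler.
    URL hash: pages are spread evenly, but sites are split."""
    key = urlparse(url).netloc if by_site else url
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers
```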
13
Partition: Cross-Over
[diagram: in cross-over mode, a crawler may follow links across the partition and download pages on the other side, causing overlap]
15
Partition: Exchange
[diagram: in exchange mode, crawlers stay on their own side of the partition but forward cross-partition URLs to the responsible crawler]
17
Coverage vs Overlap
[chart: coverage vs. overlap for a cross-over crawler; 5 random seeds per C-proc]
18
WebBase Parallel Crawling
[diagram: a coordinator dispatches work to crawler processes, each draining per-site queues, on this computer and other computers, all fetching from the web]
19
WebBase Parallel Crawling
[chart: pages/sec (0–2500), CPU utilization (0–200% on 2 CPUs), and sites-being-crawled vs. number of crawler processes (1–17)]
20
Challenges
(Outline slide repeated: scalability/crawling marked done; next up, consistency and freshness.)
21
How to Refresh?
[diagram: web pages a and b mirrored into the repository]
Page a changes daily; page b changes once a week; we can visit one page per week.
• How should we visit pages?
– a a a a a a a ...
– b b b b b b b ...
– a b a b a b a b ... [uniform]
– a a a a a a b a a a ... [proportional]
– ?
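One way to compare such schedules is a small Monte Carlo sketch. This is illustrative only: `simulate_freshness` and its parameters are hypothetical, and which schedule wins depends on the actual change rates. Cho and Garcia-Molina's surprising result is that the proportional policy is not optimal in general, and uniform often does better.

```python
import random

def simulate_freshness(change_probs, schedule, steps=20000, seed=1):
    """Monte Carlo estimate of average repository freshness.
    change_probs[i]: probability page i changes in one time step.
    schedule: repeating list naming which page to refresh each step.
    Returns the fraction of (page, step) samples where the local copy
    still matches the live page."""
    rng = random.Random(seed)
    fresh = [True] * len(change_probs)
    hits = 0
    for t in range(steps):
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False                  # live page changed
        fresh[schedule[t % len(schedule)]] = True  # refresh one copy
        hits += sum(fresh)
    return hits / (steps * len(change_probs))

# a changes often, b rarely; one refresh per step
uniform = simulate_freshness([0.5, 0.05], [0, 1])
proportional = simulate_freshness([0.5, 0.05], [0] * 9 + [1])
```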
22
Using WebBase
• Fast Page Rank
• Complex Queries
23
Structure of the Web
Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
24
Structure of the Web
[visualization: web graph nodes clustered by domain into stanford.edu, berkeley.edu, and mit.edu regions]
25
Nested Block Structure of the Web
[plot: from/to link adjacency matrix of the Web, showing nested blocks for Stanford and Berkeley]
26
Personalized Page Rank
[diagram: random jump biased toward personal pages a and b]
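Personalized PageRank replaces the uniform random jump with a user-specific distribution over preferred pages. A minimal power-iteration sketch (`personalized_pagerank` is a hypothetical name; it assumes every page appears as a key of `links` and the personalization vector sums to 1):

```python
def personalized_pagerank(links, personalize, alpha=0.85, iters=50):
    """Power iteration where the (1 - alpha) random jump goes to the
    personalization distribution instead of uniformly to all pages."""
    nodes = list(links)
    rank = {n: personalize.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - alpha) * personalize.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = links[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # dangling node: redistribute its mass via the jump vector
                for m in nodes:
                    new[m] += alpha * rank[n] * personalize.get(m, 0.0)
        rank = new
    return rank
```

Because the jump vector is personal, ranks concentrate around the chosen pages and what they link to, rather than around globally popular pages.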
27
Complex Queries
Stanford WebBase Repository
• Text search, e.g., search for “SARS Symptoms”
• Bulk/streaming access: large-scale mining & indexing, e.g., compute PageRank, extract communities
• Complex queries: declarative analysis interface
Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking.
1. Start from the entire Web.
2. Compute S = stanford.edu pages containing the phrase “mobile networking”.
3. Rank pages in S by PageRank.
4. Compute R = set of all “.edu” domains pointed to by pages in S.
5. Rank domains in R by the sum of incoming ranks.
6. List the top 10 domains in R.
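A simplified sketch of this pipeline (hypothetical names throughout; it ranks domains by raw in-link counts, whereas the real query would sum incoming PageRank):

```python
from collections import Counter
from urllib.parse import urlparse

def collaborating_domains(pages, links, phrase, top_k=10):
    """pages: {url: text}; links: {url: [outgoing urls]}.
    S = stanford.edu pages containing phrase;
    R = .edu domains pointed to by pages in S, ranked by in-links."""
    S = [u for u, text in pages.items()
         if urlparse(u).netloc.endswith("stanford.edu")
         and phrase.lower() in text.lower()]
    counts = Counter()
    for u in S:
        for v in links.get(u, []):
            domain = urlparse(v).netloc
            if domain.endswith(".edu") and not domain.endswith("stanford.edu"):
                counts[domain] += 1
    return counts.most_common(top_k)
```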
29
Supernodes
[diagram: web graph with pages P1–P5 grouped into supernodes {N1, N2, N3}; the supernode graph stores per-node intra-node edge lists (e.g., IntraNode1 holds P1→P2, IntraNode3 holds P4→P5) and per-pair super-edge lists E1,2, E1,3, E3,1, E3,2 for edges crossing supernodes (e.g., P2→P5 in SEdgePos1,3, P5→P3 in SEdgeNeg3,2)]
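A toy construction of the supernode graph (hypothetical names; grouping by a supplied page-to-node map stands in for the URL clustering the S-Node representation actually uses, and real S-Nodes also compress the edge lists):

```python
from collections import defaultdict

def build_snode_graph(page_node, links):
    """Split a page graph into supernodes.
    page_node: {page: supernode it belongs to}.
    Returns (members, intra, super_edges):
      members[n]     -> pages in supernode n
      intra[n]       -> edges staying inside n (IntraNode lists)
      super_edges[(i, j)] -> edges from supernode i to supernode j."""
    members = defaultdict(list)
    for page, node in page_node.items():
        members[node].append(page)
    intra = defaultdict(list)
    super_edges = defaultdict(list)
    for src, targets in links.items():
        for dst in targets:
            i, j = page_node[src], page_node[dst]
            if i == j:
                intra[i].append((src, dst))
            else:
                super_edges[(i, j)].append((src, dst))
    return members, intra, super_edges
```

Navigation queries that stay within one supernode only touch that node's small intra-edge list, which is the source of the query speedups on the next slides.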
30
Growth of Supernode Graph
[chart: size of supernode graph (MB, 20–100) vs. number of pages (millions, 0–120); 115M pages (830GB of raw HTML) yield an 82MB supernode graph]
31
Query Execution Times
[chart: time for navigation operation (secs, 0–600) per query (Query 1–Query 6), comparing the S-Node representation, a relational DB, the Connectivity Server, and files of adjacency lists]
32
Query Optimization
[diagram: alternative query plans pushing selection predicates down, combining conditions on pDepth, LIKE patterns on pURL, and a result limit]
33
Impact of cluster-based optimization
[chart: query execution time (secs, 0–600) for sample queries Q1–Q9, no optimization vs. optimization enabled]
Dataset: 35 million pages, 600 million links, 300GB of HTML.
Result: 40–45% reduction in query execution times.
34
Conclusion (So Far)
• Web is a universal information resource
• WebBase exploits this resource
• WebBase challenges:
– scalability, consistency, complex queries...
• The future for WebBase (and clones)??
35
Will WebBase Scale?
[chart: indexable web content vs. WebBase capacity over time (pessimistic and optimistic curves), starting from today]
36
Pessimistic Scenario
• Specialized WebBases
– sports
– shopping
– ...
[chart: indexable web content outgrows the pessimistic WebBase capacity curve]
37
Optimistic Scenario
• Web in a Box
– web delivered on “CD” monthly
– search engine handles updates
[chart: the optimistic WebBase capacity curve keeps pace with indexable web content]
38
Legal Issues?
• Is WebBase legal?
– copies
– links, deep linking
• International regulations
39
Biasing Results
• How long will Google, Altavista, etc. resist “temptations”?
• Biasing Crawler
• Link and Content Spam
40
Access Data
• WebBase does not capture access patterns
[diagram: users access the web directly; can WebBase capture those access patterns?]
41
Semantic Web?
• Will tags be generated?
• By whom?
• Agreement?
[diagram: semantic tags attached to the web; will WebBase capture them?]
42
Future Technical Challenges
• Incremental Updates
• Query Optimization
• Crawling Deep Web
43
Final Conclusion
• Many challenges ahead...
• Additional information: Google “Stanford WebBase”