WebBase: Building a Web Warehouse
Hector Garcia-Molina, Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
2
The Web
• A universal information resource
– Model weak, strong agreement
• How to exploit it?
3
WebBase
[screenshot: WebBase web page]
4
WebBase Goals
• Manage very large collections of Web pages
– Today: 1500GB HTML, 200M pages
• Enable large-scale Web-related research
• Locally provide a significant portion of the Web
• Efficient wide-area Web data distribution
5
WebBase Architecture
6
WebBase Remote Users
• Berkeley
• Columbia
• U. Washington
• Harvey Mudd
• Università degli Studi di Milano
• U. of Arizona
• California Digital Library
• Cornell
• U. of Houston
• Learning Lab Lower Saxony (L3S)
• France Telecom
• U. Texas
7
Outline
• Technical Challenges
• WebBase Use
• The Future
8
Challenges
• Scalability
– crawling
– archive distribution
– index construction
– storage
• Consistency
– freshness
– versions
• Dissemination
• Archiving
– “units”
– coordination
• IP Management
– copy access
– link access
– access control
• Hidden Web
• Topic-Specific Collection Building
9
What is a Crawler?
[diagram: crawler loop. init loads the initial URLs into a to-visit queue; the crawler then repeatedly gets the next URL, gets the page from the web, stores it in the web-pages store, and extracts its URLs, adding unseen ones to the to-visit queue while recording visited URLs]
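The loop in the diagram can be sketched in Python. This is a minimal illustration only; `fetch` and `extract_links` are caller-supplied stand-ins for real HTTP fetching and HTML link extraction:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Minimal crawler loop over the structures in the diagram:
    a queue of to-visit URLs, a set of visited URLs, and a page store."""
    to_visit = deque(seed_urls)           # initial urls -> to-visit queue
    visited = set()                       # visited urls
    pages = {}                            # web pages store
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()          # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                 # get page
        pages[url] = page
        for link in extract_links(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages
```

A real crawler adds politeness delays, robots.txt handling, and per-site queues, as the later slides discuss.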
10
Parallel Crawling
[diagram: several crawler processes C, ..., fetching from the web in parallel]
11
Independent Crawlers
[diagram: two independent crawlers C fetching from the web; site 1 holds pages a–e, site 2 holds pages f–i, with links crossing between the sites]
12
Partition: Firewall
[diagram: the two sites split by a partition line; each crawler C stays strictly on its own side (firewall mode)]
Partition functions: URL hash, site hash, hierarchical.
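The URL-hash and site-hash partition functions can be sketched as follows. `assign_crawler` is a hypothetical name, and WebBase's actual hashing details may differ; the point is that hashing the site name keeps a whole site on one crawler (fewer cross-partition links), while hashing the full URL balances load more evenly:

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url, num_crawlers, by_site=True):
    """Map a URL to one of num_crawlers partitions.
    Site hash: all pages of a site go to the same crawler.
    URL hash: pages are spread evenly, but sites are split."""
    key = urlparse(url).netloc if by_site else url
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers
```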
13
Partition: Cross-Over
[diagram: in cross-over mode, a crawler may follow links across the partition and download pages on the other side, causing overlap]
15
Partition: Exchange
[diagram: in exchange mode, crawlers stay on their own side of the partition but forward cross-partition URLs to the responsible crawler]
17
Coverage vs Overlap
[chart: coverage vs. overlap for a cross-over crawler; 5 random seeds per C-proc]
18
WebBase Parallel Crawling
[diagram: a coordinator dispatches work to crawler processes, each draining per-site queues, on this computer and other computers, all fetching from the web]
19
WebBase Parallel Crawling
[chart: pages/sec (0–2500), CPU utilization (0–200% on 2 CPUs), and sites-being-crawled vs. number of crawler processes (1–17)]
20
Challenges
(Outline slide repeated: scalability/crawling marked done; next up, consistency and freshness.)
21
How to Refresh?
[diagram: web pages a and b mirrored into the repository]
Page a changes daily; page b changes once a week; we can visit one page per week.
• How should we visit pages?
– a a a a a a a ...
– b b b b b b b ...
– a b a b a b a b ... [uniform]
– a a a a a a b a a a ... [proportional]
– ?
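One way to compare such schedules is a small Monte Carlo sketch. This is illustrative only: `simulate_freshness` and its parameters are hypothetical, and which schedule wins depends on the actual change rates. Cho and Garcia-Molina's surprising result is that the proportional policy is not optimal in general, and uniform often does better.

```python
import random

def simulate_freshness(change_probs, schedule, steps=20000, seed=1):
    """Monte Carlo estimate of average repository freshness.
    change_probs[i]: probability page i changes in one time step.
    schedule: repeating list naming which page to refresh each step.
    Returns the fraction of (page, step) samples where the local copy
    still matches the live page."""
    rng = random.Random(seed)
    fresh = [True] * len(change_probs)
    hits = 0
    for t in range(steps):
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False                  # live page changed
        fresh[schedule[t % len(schedule)]] = True  # refresh one copy
        hits += sum(fresh)
    return hits / (steps * len(change_probs))

# a changes often, b rarely; one refresh per step
uniform = simulate_freshness([0.5, 0.05], [0, 1])
proportional = simulate_freshness([0.5, 0.05], [0] * 9 + [1])
```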
22
Using WebBase
• Fast Page Rank
• Complex Queries
23
Structure of the Web
Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
24
Structure of the Web
[visualization: web graph nodes clustered by domain into stanford.edu, berkeley.edu, and mit.edu regions]
25
Nested Block Structure of the Web
[plot: from/to link adjacency matrix of the Web, showing nested blocks for Stanford and Berkeley]
26
Personalized Page Rank
[diagram: random jump biased toward personal pages a and b]
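Personalized PageRank replaces the uniform random jump with a user-specific distribution over preferred pages. A minimal power-iteration sketch (`personalized_pagerank` is a hypothetical name; it assumes every page appears as a key of `links` and the personalization vector sums to 1):

```python
def personalized_pagerank(links, personalize, alpha=0.85, iters=50):
    """Power iteration where the (1 - alpha) random jump goes to the
    personalization distribution instead of uniformly to all pages."""
    nodes = list(links)
    rank = {n: personalize.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - alpha) * personalize.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = links[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # dangling node: redistribute its mass via the jump vector
                for m in nodes:
                    new[m] += alpha * rank[n] * personalize.get(m, 0.0)
        rank = new
    return rank
```

Because the jump vector is personal, ranks concentrate around the chosen pages and what they link to, rather than around globally popular pages.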
27
Complex Queries
Stanford WebBase Repository
• Text search, e.g., search for “SARS Symptoms”
• Bulk/streaming access: large-scale mining & indexing, e.g., compute PageRank, extract communities
• Complex queries: declarative analysis interface
Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking.
1. Start from the entire Web.
2. Compute S = stanford.edu pages containing the phrase “mobile networking”.
3. Rank pages in S by PageRank.
4. Compute R = set of all “.edu” domains pointed to by pages in S.
5. Rank domains in R by the sum of incoming ranks.
6. List the top 10 domains in R.
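A simplified sketch of this pipeline (hypothetical names throughout; it ranks domains by raw in-link counts, whereas the real query would sum incoming PageRank):

```python
from collections import Counter
from urllib.parse import urlparse

def collaborating_domains(pages, links, phrase, top_k=10):
    """pages: {url: text}; links: {url: [outgoing urls]}.
    S = stanford.edu pages containing phrase;
    R = .edu domains pointed to by pages in S, ranked by in-links."""
    S = [u for u, text in pages.items()
         if urlparse(u).netloc.endswith("stanford.edu")
         and phrase.lower() in text.lower()]
    counts = Counter()
    for u in S:
        for v in links.get(u, []):
            domain = urlparse(v).netloc
            if domain.endswith(".edu") and not domain.endswith("stanford.edu"):
                counts[domain] += 1
    return counts.most_common(top_k)
```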
29
Supernodes
[diagram: web graph with pages P1–P5 grouped into supernodes {N1, N2, N3}; the supernode graph stores per-node intra-node edge lists (e.g., IntraNode1 holds P1→P2, IntraNode3 holds P4→P5) and per-pair super-edge lists E1,2, E1,3, E3,1, E3,2 for edges crossing supernodes (e.g., P2→P5 in SEdgePos1,3, P5→P3 in SEdgeNeg3,2)]
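A toy construction of the supernode graph (hypothetical names; grouping by a supplied page-to-node map stands in for the URL clustering the S-Node representation actually uses, and real S-Nodes also compress the edge lists):

```python
from collections import defaultdict

def build_snode_graph(page_node, links):
    """Split a page graph into supernodes.
    page_node: {page: supernode it belongs to}.
    Returns (members, intra, super_edges):
      members[n]     -> pages in supernode n
      intra[n]       -> edges staying inside n (IntraNode lists)
      super_edges[(i, j)] -> edges from supernode i to supernode j."""
    members = defaultdict(list)
    for page, node in page_node.items():
        members[node].append(page)
    intra = defaultdict(list)
    super_edges = defaultdict(list)
    for src, targets in links.items():
        for dst in targets:
            i, j = page_node[src], page_node[dst]
            if i == j:
                intra[i].append((src, dst))
            else:
                super_edges[(i, j)].append((src, dst))
    return members, intra, super_edges
```

Navigation queries that stay within one supernode only touch that node's small intra-edge list, which is the source of the query speedups on the next slides.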
30
Growth of Supernode Graph
[chart: size of supernode graph (MB, 20–100) vs. number of pages (millions, 0–120); 115M pages (830GB of raw HTML) yield an 82MB supernode graph]
31
Query Execution Times
[chart: time for navigation operation (secs, 0–600) per query (Query 1–Query 6), comparing the S-Node representation, a relational DB, the Connectivity Server, and files of adjacency lists]
32
Query Optimization
[diagram: alternative query plans pushing selection predicates down, combining conditions on pDepth, LIKE patterns on pURL, and a result limit]
33
Impact of cluster-based optimization
[chart: query execution time (secs, 0–600) for sample queries Q1–Q9, no optimization vs. optimization enabled]
Dataset: 35 million pages, 600 million links, 300GB of HTML.
Result: 40–45% reduction in query execution times.
34
Conclusion (So Far)
• Web is a universal information resource
• WebBase exploits this resource
• WebBase challenges:
– scalability, consistency, complex queries...
• The future for WebBase (and clones)??
35
Will WebBase Scale?
[chart: indexable web content vs. WebBase capacity over time (pessimistic and optimistic curves), starting from today]
36
Pessimistic Scenario
• Specialized WebBases
– sports
– shopping
– ...
[chart: indexable web content outgrows the pessimistic WebBase capacity curve]
37
Optimistic Scenario
• Web in a Box
– web delivered on “CD” monthly
– search engine handles updates
[chart: the optimistic WebBase capacity curve keeps pace with indexable web content]
38
Legal Issues?
• Is WebBase legal?
– copies
– links, deep linking
• International regulations
39
Biasing Results
• How long will Google, Altavista, etc. resist “temptations”?
• Biasing Crawler
• Link and Content Spam
40
Access Data
• WebBase does not capture access patterns
[diagram: users access the web directly; can WebBase capture those access patterns?]
41
Semantic Web?
• Will tags be generated?
• By whom?
• Agreement?
[diagram: semantic tags attached to the web; will WebBase capture them?]
42
Future Technical Challenges
• Incremental Updates
• Query Optimization
• Crawling Deep Web
43
Final Conclusion
• Many challenges ahead...
• Additional information: Google “Stanford WebBase”