measuring the size of the web

Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State



Page 1: Measuring the Size of the Web

Measuring the Size of the Web

Dongwon Lee, Ph.D.

IST 501, Fall 2014

Penn State

Page 2: Measuring the Size of the Web

Studying the Web

To study the characteristics of the Web:
- Statistics
- Topology
- Behavior
- …

Why?
- Scientific curiosity
- Practical value, e.g., search engine coverage

Nature 1999

Page 3: Measuring the Size of the Web

Web as Platform

The Web becomes a new computation platform

It poses new challenges:
- Scale
- Efficiency
- Heterogeneity
- Impact on people's lives

Page 4: Measuring the Size of the Web

Eg, How Big is the Web?

Q1: How many web sites?

Q2: How many web pages?

Q3: How many surface/deep web pages?

Research method: mostly the experimental method, used to validate novel solutions

Page 5: Measuring the Size of the Web

Q1: How Many Web Sites?

DNS registrars: list of domain names

Issues:
- Not every domain is a web site
- A domain may contain more than one web site
- Registrars are under no obligation to keep their records correct
- There are so many of them …

Page 6: Measuring the Size of the Web

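How Many Web Sites?

Brute force: poll every IP address
- IPv4: four octets of 256 values each, 2^32 ≈ 4.3 billion addresses
- IPv6: 2^128 addresses
- At 10 sec/IP with 1,000 simultaneous connections: 2^32*10/(1000*24*60*60) ≈ 497 days (about 460 days if 2^32 is rounded down to 4 billion)

Not going to work!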
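The back-of-the-envelope figure above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name and its default parameters are illustrative. With the exact value of 2^32 the result is about 497 days, and rounding 2^32 down to 4 billion gives roughly 463 days.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def brute_force_days(address_bits, seconds_per_probe=10, parallel_connections=1000):
    """Days needed to probe every address of an address_bits-wide space once."""
    total_addresses = 2 ** address_bits
    total_seconds = total_addresses * seconds_per_probe / parallel_connections
    return total_seconds / SECONDS_PER_DAY


ipv4_days = brute_force_days(32)
ipv6_days = brute_force_days(128)

print(f"IPv4: {ipv4_days:.0f} days")
print(f"IPv6: {ipv6_days:.2e} days")
```

Even for IPv4 the scan takes well over a year under these assumptions, and the IPv6 figure is astronomically larger, which is why the next slides turn to sampling.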

Page 7: Measuring the Size of the Web

How Many Web Sites? 2nd attempt: Sampling

T: All 4 billion IPs
S: Sampled IPs
V: Valid replies

Estimated # of web sites ≈ |T| × |V| / |S|

Page 8: Measuring the Size of the Web

How Many Web Sites?

Estimated # of web sites ≈ |T| × |V| / |S|

1. Select |S| random IPs
2. Send HTTP requests to port 80 at the selected IPs
3. Count valid replies: “HTTP 200 OK” = |V|
4. |T| = 2^32
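The four steps can be sketched as a small simulation. No network traffic is sent here: `fake_probe` is a hypothetical stand-in for "send an HTTP request to port 80 and check for 200 OK", and it simply pretends that every address divisible by 1000 hosts a web server. The estimator itself is |T| × |V| / |S|.

```python
import random

TOTAL_IPV4 = 2 ** 32  # |T|: size of the IPv4 address space


def estimate_web_sites(probe, sample_size, rng):
    """Estimate the number of responding servers as |T| * |V| / |S|."""
    valid = 0
    for _ in range(sample_size):
        ip = rng.randrange(TOTAL_IPV4)  # uniformly random 32-bit address (with replacement)
        if probe(ip):
            valid += 1
    return TOTAL_IPV4 * valid / sample_size


def fake_probe(ip):
    """Hypothetical probe: pretend addresses divisible by 1000 answer '200 OK'."""
    return ip % 1000 == 0


rng = random.Random(0)
estimate = estimate_web_sites(fake_probe, 200_000, rng)
true_count = (TOTAL_IPV4 - 1) // 1000 + 1  # multiples of 1000 in [0, 2^32)

print(f"estimated servers: {estimate:,.0f}")
print(f"actual servers:    {true_count:,}")
```

Because only about 1 in 1000 probes succeeds in this toy setup, the estimate is noisy even with 200,000 samples; a real measurement would face the same variance on top of the issues raised next.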

Q: What are the issues here?

Page 9: Measuring the Size of the Web

Issues

- Virtual hosting
- Ports other than 80
- Temporarily unavailable sites
- …

Page 10: Measuring the Size of the Web

OCLC Survey (2002)

OCLC (Online Computer Library Center)

Results: http://wcp.oclc.org/

Still room for growth (at least for Web sites)?

Page 11: Measuring the Size of the Web

NetCraft Web Server Survey (2010)
- Goal is to measure web server market share
- Also records the # of sites their crawlers visited
- August 2010: 213,458,815 distinct sites

http://news.netcraft.com/archives/category/web-server-survey/

Page 12: Measuring the Size of the Web

NetCraft Web Server Survey (2013)
- Goal is to measure web server market share
- Also records the # of sites their crawlers visited
- August 2013: 716,822,317 distinct sites

http://news.netcraft.com/archives/category/web-server-survey/

Page 13: Measuring the Size of the Web

NetCraft Web Server Survey (2014)
- Goal is to measure web server market share
- Also records the # of sites their crawlers visited
- August 2014: 992,177,228 distinct sites

http://news.netcraft.com/archives/category/web-server-survey/

Page 14: Measuring the Size of the Web

Q2: How Many Web Pages?

Sampling based?

T: All URLs
S: Sampled URLs
V: Valid replies

Estimated # of web pages ≈ |T| × |V| / |S|

Issue here?

Page 15: Measuring the Size of the Web

How Many Web Pages?

Method #1:
- For each site with a valid reply, download all pages
- Measure the average # of pages per site
- (Avg # of pages per site) × (total # of sites)

Result [Lawrence & Giles, 1999]
- 289 pages per site, 2.8M sites
- 289 × 2.8M ≈ 800M web pages

Page 16: Measuring the Size of the Web

Further Issues

- A small # of sites with TONS of pages: sampling could miss these sites
- A majority of sites with a small # of pages: lots of samples necessary

[Figure: no. of pages (y-axis, 0 to 1,000,000) vs. no. of sites (x-axis, 0 to 1000); annotation: 99.99% of the sites]
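A toy simulation makes the sampling problem concrete. The population below is entirely synthetic (10 giant sites with a million pages each among 100,000 sites, all others with 10 pages); these numbers are assumptions chosen for illustration, not measurements. The estimator is Method #1: average pages per sampled site times the total number of sites.

```python
import random


def make_site_sizes(num_sites=100_000, giant_sites=10,
                    giant_pages=1_000_000, small_pages=10):
    """Synthetic population: a handful of giant sites, everything else small."""
    return [giant_pages] * giant_sites + [small_pages] * (num_sites - giant_sites)


def estimate_total_pages(site_sizes, sample_size, rng):
    """Method #1: (avg pages per sampled site) x (total number of sites)."""
    sample = rng.sample(site_sizes, sample_size)
    return sum(sample) / sample_size * len(site_sizes)


sizes = make_site_sizes()
true_total = sum(sizes)

rng = random.Random(1)
estimates = [estimate_total_pages(sizes, 1_000, rng) for _ in range(20)]

print(f"true total pages: {true_total:,}")
for e in estimates:
    print(f"estimate: {e:,.0f}")
```

In this setup a sample of 1,000 sites either misses every giant site and underestimates the total by about a factor of 11, or happens to include one and overestimates it by about a factor of 9. No single run lands near the true value, which is the point of the slide.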

Page 17: Measuring the Size of the Web

How Many Web Pages?

Method #2: Random sampling

T: All pages
B: Base set
S: Random samples

Assume: |B| / |T| ≈ |B ∩ S| / |S|

Page 18: Measuring the Size of the Web

Random Page?

Idea: Random walk
- Start from a portal home page (e.g., Yahoo)
- Estimate the size of the portal: B
- Follow random links, say 10,000 times
- Select the pages
- At the end, a set of random web pages S is gathered

Page 19: Measuring the Size of the Web

Straightforward Random Walk

Follow a random out-link at each step

[Figure: a random walk over a small Web graph that includes google.com, amazon.com, and pike.psu.edu, with the steps numbered 1 to 9]

Issues?

Page 20: Measuring the Size of the Web

Straightforward Random Walk

Follow a random out-link at each step

[Figure: a random walk over a small Web graph that includes google.com, amazon.com, and pike.psu.edu, with the steps numbered 1 to 9]

Issues?
1. Gets stuck in sinks and in dense Web communities
2. Biased towards popular pages
3. Converges slowly, if at all
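The first issue is easy to reproduce on a toy graph. The graph below is hypothetical (page names and links are made up); the walk follows a random out-link at each step and has nowhere to go once it reaches a page without out-links.

```python
import random
from collections import Counter

# Hypothetical toy Web graph: page -> list of out-links.
GRAPH = {
    "portal": ["a", "b", "hub"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "hub": ["portal", "sink"],
    "sink": [],  # no out-links: the walk gets stuck here
}


def naive_random_walk(graph, start, steps, rng):
    """Follow a random out-link at each step; stop early at a sink."""
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        out_links = graph[page]
        if not out_links:
            break  # stuck: nothing to follow
        page = rng.choice(out_links)
    return visits, page


rng = random.Random(42)
visits, final_page = naive_random_walk(GRAPH, "portal", 10_000, rng)

print("walk ended at:", final_page)
print("visit counts: ", dict(visits))
```

Although 10,000 steps were requested, the walk stops after a handful of them, and the pages it did visit are mostly the ones with many in-links, which hints at the second issue as well.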

Page 21: Measuring the Size of the Web

Going to Converge?

Random walks on a regular, undirected graph yield a uniformly distributed sample

Theorem [Markov chain folklore]: After O((1/ε) log N) steps, a random walk reaches the stationary distribution
- ε: depends on the graph structure
- N: number of nodes

Idea:
- Transform the Web graph to a regular, undirected graph
- Perform a random walk

Problem: the Web is neither regular nor undirected

Page 22: Measuring the Size of the Web

Intuition

Random walk on the undirected Web graph (not regular)
- High chance of being at a “popular” node at a particular time
- Increase the chance of being at an “unpopular” node by staying there longer through self-loops

[Figure: unpopular nodes and a popular node]

Page 23: Measuring the Size of the Web

WebWalker: Undirected Regular Random Walk on the Web

Fact: A random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.

- Follow a random out-link or a random in-link at each step
- Use weighted self-loops to even out pages' degrees: w(v) = deg_max - deg(v)

[Figure: Web graph including google.com, amazon.com, and pike.psu.edu, with numeric labels on the nodes]
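The effect of the self-loop weights w(v) = deg_max - deg(v) can be checked numerically on a small example. The sketch below is not the WebWalker implementation; it uses a hypothetical five-node undirected graph (the nodes "x" and "y" and all edges are made up) and computes the long-run distribution of the walk by repeatedly pushing probability mass along the edges.

```python
# Hypothetical undirected toy graph as symmetric adjacency lists.
GRAPH = {
    "google.com": ["amazon.com", "pike.psu.edu", "x", "y"],
    "amazon.com": ["google.com", "x"],
    "pike.psu.edu": ["google.com"],
    "x": ["google.com", "amazon.com"],
    "y": ["google.com"],
}


def stationary_distribution(graph, self_loops, iterations=2000):
    """Long-run visit probabilities of a random walk, by power iteration."""
    nodes = list(graph)
    deg_max = max(len(nbrs) for nbrs in graph.values())
    # Start with all probability mass on the first node.
    dist = {v: 1.0 if i == 0 else 0.0 for i, v in enumerate(nodes)}
    for _ in range(iterations):
        nxt = dict.fromkeys(nodes, 0.0)
        for v, mass in dist.items():
            degree = len(graph[v])
            total = deg_max if self_loops else degree
            for u in graph[v]:
                nxt[u] += mass / total
            if self_loops:
                # Self-loop of weight w(v) = deg_max - deg(v) keeps the walk at v.
                nxt[v] += mass * (deg_max - degree) / total
        dist = nxt
    return dist


plain = stationary_distribution(GRAPH, self_loops=False)
regular = stationary_distribution(GRAPH, self_loops=True)

for v in GRAPH:
    print(f"{v:14s} plain={plain[v]:.3f}  with self-loops={regular[v]:.3f}")
```

Without self-loops the walk spends time at each node in proportion to its degree, so the best-connected node dominates. With the self-loops every node has the same effective degree and the long-run distribution becomes uniform.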

Page 24: Measuring the Size of the Web

Ideal Random Walk

Generate the regular, undirected graph:
- Make edges undirected
- Decide d, the maximum # of edges per page: say, 300,000
- If edge(n) < 300,000, then add self-loops

Perform random walks on the graph
- Graph-structure parameter ε ≈ 10^-5 for the 1996 Web; N ≈ 10^9 nodes
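To get a feel for the cost, the figures quoted for the 1996 Web (a graph-dependent parameter of about 10^-5, which is the eigenvalue gap in the standard statement of this result, and about 10^9 nodes) can be plugged into a (1/ε) log N bound. The sketch assumes a natural logarithm and ignores the constant hidden by the O-notation, so it gives an order of magnitude only.

```python
import math


def mixing_steps(eigenvalue_gap, num_nodes):
    """(1/eps) * ln(N): order-of-magnitude walk length, constants ignored."""
    return math.log(num_nodes) / eigenvalue_gap


steps = mixing_steps(1e-5, 1e9)
print(f"about {steps:,.0f} steps per sample")
```

Under these assumptions a single near-uniform sample costs on the order of two million steps, which is why convergence speed is a practical concern and not just a theoretical one.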

Page 25: Measuring the Size of the Web

WebWalker Results (2000)

Size of the Web:
- Altavista: |B| = 250M
- |B ∩ S| / |S| = 35%
- Estimated |T| ≈ 720M

Avg page size: 12K
Avg # of out-links: 10

Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000.
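The size estimate follows from the overlap between the random sample S and the base set B: if about 35% of the sampled pages are found in a base set of 250M pages, the whole Web is roughly |B| divided by that fraction. A minimal sketch of the arithmetic, with an illustrative function name:

```python
def estimate_total_pages(base_size, overlap_fraction):
    """|T| ~= |B| / (|B ∩ S| / |S|), assuming S is a uniform sample of T."""
    if not 0 < overlap_fraction <= 1:
        raise ValueError("overlap_fraction must be in (0, 1]")
    return base_size / overlap_fraction


total = estimate_total_pages(250e6, 0.35)
print(f"estimated |T|: {total / 1e6:.0f}M pages")
```

The result is a little over 700M pages, in line with the roughly 720M quoted on the slide; the small difference presumably comes from rounding of the 35% figure.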

Page 26: Measuring the Size of the Web

How large is SE’s Index?

- Prepare a representative corpus (e.g., DMOZ)
- Draw a word W with known frequency percentage F
  - E.g., “the” is present in 60% of all documents within the corpus
- Submit W to a search engine E
- If E reports that X documents contain W, one can extrapolate the total size of E's index as ≈ X / F
- Repeat multiple times and compute the average
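The extrapolation can be sketched as follows. The hit counts and all frequencies other than the 60% quoted for “the” are invented for illustration; a real run would take F from the reference corpus and X from the search engine's reported result counts.

```python
from statistics import mean


def estimate_index_size(observations):
    """Average of X / F over (word, reported_hits X, corpus_frequency F) triples."""
    return mean(hits / freq for _word, hits, freq in observations)


# Hypothetical observations: (word W, hits X reported by engine E, frequency F in corpus).
observations = [
    ("the", 6.1e9, 0.60),
    ("and", 4.9e9, 0.50),
    ("web", 1.0e9, 0.10),
]

index_size = estimate_index_size(observations)
print(f"estimated index size: {index_size / 1e9:.1f} billion documents")
```

Averaging over several words dampens the error from any single word whose frequency on the Web differs from its frequency in the reference corpus.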

Page 27: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2010)

28 billion

Page 28: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2011)

46 billion

Page 29: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2013)

46 billion

Page 30: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2013)

10 billion

Page 31: Measuring the Size of the Web

Google Reveals Itself (2008)
- 1998: 26 million URLs
- 2000: 1 billion URLs
- 2008: 1 trillion URLs

Not all of them are indexed:
- Duplicates
- Auto-generated pages (e.g., calendars)
- Spam

Experts suspect (2010) that the Google index holds at least 40 billion pages

Page 32: Measuring the Size of the Web

Deep Web (aka Hidden Web)

[Figure: a query is submitted through an HTML FORM interface, and answers are returned]

Page 33: Measuring the Size of the Web

Q3: Size of Deep Web?

Deep Web: information reachable only through a query interface (e.g., HTML FORM)
- Often backed by a DBMS

How to estimate? By sampling

Estimation: (Avg size of a record) × (Avg # of records per site) × (Total # of Deep Web sites)

Page 34: Measuring the Size of the Web

Size of Deep Web?

Total # of Deep Web sites:
- |B ∩ S| / |S|

Avg size of a record:
- Issue random queries
- Estimate reply size

Avg # of records per site:
- Permute all possible queries for the FORM
- Issue all queries and count valid returns

Page 35: Measuring the Size of the Web

Size of Deep Web (2005)

BrightPlanet report estimates:
- Avg size of a record: 14 KB
- Avg # of records per site: 5M
- Total # of Deep Web sites: 200,000
- Size of the Deep Web: about 10^16 bytes (10 petabytes)
- 1,000 times larger than the “Surface Web”

How to access it?
- Wrapper/Mediator (aka Web scraping)

http://brightplanet.com/the-deep-web/deep-web-faqs/ (obsolete now)
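The headline figure is just the product of the three estimates. The sketch below assumes that the per-site figure means 5 million records (a count, not a size in megabytes) and that 14 KB means 14,000 bytes; both readings are assumptions made so that the units work out.

```python
AVG_RECORD_BYTES = 14_000        # 14 KB per record (decimal kilobytes assumed)
AVG_RECORDS_PER_SITE = 5_000_000 # assumed to mean 5 million records per site
DEEP_WEB_SITES = 200_000

deep_web_bytes = AVG_RECORD_BYTES * AVG_RECORDS_PER_SITE * DEEP_WEB_SITES
deep_web_petabytes = deep_web_bytes / 1e15

print(f"{deep_web_bytes:.1e} bytes, about {deep_web_petabytes:.0f} PB")
```

The product is 1.4 × 10^16 bytes, which matches the order of magnitude (10^16 bytes, roughly 10 petabytes) quoted on the slide.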