Page 1: The Fragmented Web

The Fragmented Web

Notes on Chapter 12

For In765

Judith Molka-Danielsen

Page 2: The Fragmented Web

1. Virtual robots

• Virtual robots read and index web pages.
• It would be hard to navigate the Web without them.
• But some pages are never mapped.
• Simple search engines can return too many hits.
• Meta-search engines select hits across engines.
• www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html

Page 3: The Fragmented Web

Steve Lawrence and C. Lee Giles: attempt to measure the Web in 1999
http://www.neci.nj.nec.com/homepages/lawrence/websize.html

Page 4: The Fragmented Web

2. Relevancy
• Finding the “best” page is more important than finding the “most” pages.
• Notes on Searching the Web: http://home.himolde.no/~molka/in350/week9y01.htm

Precision - proportion of retrieved documents that are relevant.
P = W2 / N2
W2 = number retrieved that are relevant
N2 = total number retrieved

Recall - proportion of relevant documents that are retrieved.
R = W1 / N1
W1 = number relevant that are retrieved
N1 = total number relevant
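A small worked example may make the two ratios concrete. This is only a sketch: the function names and the query counts are made up for illustration. Note that W1 and W2 count the same documents (those that are both relevant and retrieved); only the denominators differ.

```python
def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """P = W2 / N2: share of the retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """R = W1 / N1: share of the relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant


# Hypothetical query: the engine returns 20 pages, 5 of them are relevant,
# and 50 relevant pages exist on the Web in total.
p = precision(relevant_retrieved=5, total_retrieved=20)
r = recall(relevant_retrieved=5, total_relevant=50)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

In this made-up case a quarter of the hits are useful, but the engine found only a tenth of what is out there.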

Page 5: The Fragmented Web

Determining PageRank
http://www.whitelines.nl/html/google-page-rank.html#example

• According to Sergey Brin and Lawrence (Larry) Page, co-founders of Google, the PR of a web page is calculated using this formula:

• PR(A) = (1 - d) + d * SUM (PR(I->A)/C(I))

• Where:

– PR(A) is the PageRank of your page A.
– d is the damping factor, usually set to 0.85.
– PR(I->A) is the PageRank of page I containing a link to page A.
– C(I) is the number of links off page I.
– PR(I->A)/C(I) is the PR-value page A receives from page I.
– SUM (PR(I->A)/C(I)) is the sum of all PR-values page A receives from pages with links to page A.

• In other words: The PR of a page is determined by the PR of every page I that has a link to page A. For every page I that points to page A, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85. Finally 0.15 is added to this result, and this number represents the PR of page A.

• What is your PageRank? http://www.klid.dk/pagerank.php?url=
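The formula is circular (the PR of A depends on the PR of the pages linking to it), so in practice it is applied repeatedly until the values settle. The sketch below assumes a tiny, made-up three-page web, a starting value of 1.0 for every page, and a fixed number of rounds; it illustrates the slide's formula and is not Google's actual implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1 - d) + d * SUM(PR(I)/C(I)).

    links maps each page to the list of pages it links to
    (assumed here to contain no duplicate links).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(I)/C(I) over every page I that links to this page.
            received = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * received
        pr = new_pr
    return pr


# Hypothetical web: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, rank in pagerank(links).items():
    print(page, round(rank, 3))
```

Page C ends up ranked above page B because it receives PR from both A and B, while B receives only half of A's.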

Page 6: The Fragmented Web

by Greg R. Notess.

http://www.searchengineshowdown.com/stats/sizeest.shtml

Search Engine    Showdown Total Size Estimate (millions)    Claim (millions)
Google           3,033                                      3,083
AlltheWeb        2,106                                      2,112
AltaVista        1,689                                      1,000
WiseNut          1,453                                      1,500
Hotbot           1,147                                      3,000
MSN Search       1,018                                      3,000
Teoma            1,015                                      500
NLResearch       733                                        125
Gigablast        275                                        150

Data from: Dec. 31, 2002

Relative size:

AlltheWeb reported size and percentages from relative size showdown

AlltheWeb: 2,106,156,957 reported; Total Size reports are below.

Page 7: The Fragmented Web

Older Reports with Largest Three at that Time

March 2002: Google, WiseNut, AlltheWeb

August 2001: Google, Fast, WiseNut

April 2001: Google, Fast, MSN (Inktomi)

Oct. 2000: Fast, Google, Northern Light

July 2000: iWon, Google, AltaVista

April 2000: Fast, AltaVista, Northern Light

Feb. 2000: Fast, Northern Light, AltaVista

Jan. 2000: Fast, Northern Light, AltaVista

Nov. 1999: Northern Light, Fast, AltaVista

Sept. 1999: Fast, Northern Light, AltaVista

Aug. 1999: Fast, Northern Light, AltaVista

May 1999: Northern Light, AltaVista, Anzwers

March 1999: Northern Light, AltaVista, HotBot

January 1999: Northern Light, AltaVista, HotBot

August 1998: AltaVista, Northern Light, HotBot

May 1998: AltaVista, HotBot, Northern Light

February 1998: HotBot, AltaVista, Northern Light

October 1997: AltaVista, HotBot, Northern Light

September 1997: Northern Light, Excite, HotBot

June 1997: HotBot, AltaVista, Infoseek

October 1996: HotBot, Excite, AltaVista

Page 8: The Fragmented Web

Freshness

Search Engine    Newest Page Found    Rough Average    Oldest Page Found
MSN (Ink.)       1 day                4 weeks          51 days
HotBot (Ink.)    1 day                4 weeks          51 days
Google           2 days               1 month          165 days
AlltheWeb        1 day                1 month          599 days
AltaVista        0 days               3 months         108 days
Gigablast        45 days              7 months         381 days
Teoma            41 days              2.5 months       81 days
WiseNut          133 days             6 months         183 days

Page 9: The Fragmented Web

Billions Of Textual Documents Indexed
December 1995-September 2003

http://searchenginewatch.com/reports/article.php/2156481

Page 10: The Fragmented Web

3. URLs are directed links.

Andrei Broder (2000)
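One consequence of links being directed is that "A links to B" does not let a surfer (or a robot) get from B back to A. The following is a minimal sketch on a made-up three-page web; the page names and the `reachable` helper are hypothetical and only illustrate the idea of following links in one direction.

```python
from collections import deque


def reachable(links, start):
    """Return every page that can be reached from start by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical web: home -> news -> archive, with no links pointing back.
links = {"home": ["news"], "news": ["archive"], "archive": []}

print(sorted(reachable(links, "home")))
print(sorted(reachable(links, "archive")))
```

Starting from "home" every page can be visited, but starting from "archive" nothing else can, even though the same three links are involved.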

Page 11: The Fragmented Web

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf

[Figure labels: database-driven/on-demand pages; static HTML pages]

Page 15: The Fragmented Web

4. Defining Web-based communities

• 15% of web pages have links to opposing views.
• 60% of web pages have links to like views.
• Social segmentation is self-reinforcing.
• Beliefs and affiliations have become public information, represented in links and visits.

Web-based communities are hard to identify.
• No boundaries; different sizes; differently organized.
• Pages with more internal links than outside links may be identified as a community. But there is no efficient algorithm.
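The "more internal links than outside links" rule is easy to check for one given group of pages, as the sketch below suggests. The link data and the `is_community` helper are made up for illustration, and the rule is applied page by page as one possible reading of the slide. The hard part is the search: with no boundaries given in advance, the number of candidate groups grows explosively with the number of pages.

```python
def is_community(links, group):
    """True if every page in group has more links inside the group than outside it."""
    group = set(group)
    for page in group:
        targets = links.get(page, [])
        internal = sum(1 for target in targets if target in group)
        external = len(targets) - internal
        if internal <= external:
            return False
    return True


# Hypothetical web: a, b and c link mostly to each other; x and y are outsiders.
links = {
    "a": ["b", "c", "x"],
    "b": ["a", "c"],
    "c": ["a", "b", "y"],
    "x": ["y"],
    "y": [],
}

print(is_community(links, {"a", "b", "c"}))
print(is_community(links, {"a", "x"}))
```

The group {a, b, c} passes the test, while {a, x} does not, because most of a's links point outside that pair.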

Page 16: The Fragmented Web

Other points…

• 5. Technology can allow more control over individuals: ID them, track them.

• Web topology (architecture by self-selecting where to link) limits our actions (browsing, some pages are invisible), more than the code (attempts at control, laws).

• 6. Internet Archive maintained since 1996 by Brewster Kahle. Some data will never go away.

• http://www.archive.org/ (Try the WayBack Machine.)

• 7. The Web is complex and self-organized. They started by looking at the macrostructure. The last chapters will look at the smaller groupings.