web search engines

Post on 03-Jan-2016



TRANSCRIPT

Web Search Engines
by Greg R. Notess

notess@imt.net

imt.net/~notess/search

Overview:
Comparing the database content
• Change
• Comparative Size
• Overlap
Looking towards future developments
• Portal or Destination
• Output sorting

Results are limited by
Database content
• The Web sites included
• The depth to which they are indexed
If it’s not in the database, the best search engine will not be able to find the Web page

So what’re they like?
Very large databases
Most index all words on page
• None index words in images
Let’s see how the databases compare to the real Web

Change over time?

Overall Size Change
Is the Web in general
• Growing?
• Shrinking?
• Remaining the same?

[Chart: Excite, 6 Searches, 10/96-8/98]

What about the rest?

Who’s the biggest?
How to measure?
• Actual search results
• Verified hits

[Bar chart: Total Hits from 15 Searches, August 29, 1998. Engines shown: AltaVista, Northern Light, HotBot, Infoseek, Excite, Lycos, WebCrawler. Axis: 0 to 4500 hits.]
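The "total hits" measure behind this chart can be sketched in a few lines. This is a minimal illustration, not the method used for the slides: the engine names and hit counts below are hypothetical placeholders, and `rank_by_total_hits` is a name made up for this example.

```python
from typing import Dict, List, Tuple


def rank_by_total_hits(hits: Dict[str, List[int]]) -> List[Tuple[str, int]]:
    """Sum each engine's hit counts over all test searches, largest total first."""
    totals = {engine: sum(counts) for engine, counts in hits.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical hit counts for three test searches (placeholder data, not from the talk).
sample_hits = {
    "EngineA": [120, 45, 300],
    "EngineB": [200, 10, 150],
    "EngineC": [90, 60, 500],
}

ranking = rank_by_total_hits(sample_hits)
for engine, total in ranking:
    print(f"{engine}: {total}")
```

Summing over several searches smooths out the cases where one engine happens to do well on a single term. The slides also distinguish raw result counts from verified hits; this sketch does not attempt that verification step.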

And over time?
8/98 -- AltaVista, Northern Light, HotBot
5/98 -- AltaVista, HotBot, Northern Light
2/98 -- HotBot, AltaVista, Northern Light
10/97 -- AltaVista, HotBot, Northern Light
9/97 -- Northern Light, Excite, HotBot
6/97 -- HotBot, AltaVista, Infoseek
10/96 -- HotBot, Excite, AltaVista

Back to change in size
Let’s look at six search engines
Over the course of two years

[Chart: Database Size Changes, Five Terms: Oct. 96 - Aug. 98. Engines shown: Northern Light, HotBot, AltaVista, Infoseek, Excite, Lycos. Dates: Oct 96, June 97, Sept 97, Oct 97, Feb 98, May 98, Aug 98. Axis: 0 to 1250.]

But at least
They have a high degree of duplication between them
Right?

Try 4 small searches
Using five search engines
How many pages are found by all five or at least by four of them?

ZERO
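The overlap test described above amounts to counting, for each page, how many engines returned it. The sketch below assumes each engine's results are available as a set of URLs; the engine names, URLs, and the function name `pages_found_by_at_least` are hypothetical and chosen only to illustrate the counting.

```python
from collections import Counter
from typing import Dict, Set


def pages_found_by_at_least(results: Dict[str, Set[str]], min_engines: int) -> Set[str]:
    """Return the URLs that appear in the result sets of at least min_engines engines."""
    counts = Counter(url for urls in results.values() for url in urls)
    return {url for url, n in counts.items() if n >= min_engines}


# Hypothetical result sets for one search on five engines (placeholder data).
sample_results = {
    "EngineA": {"http://example.com/a", "http://example.com/b"},
    "EngineB": {"http://example.com/b", "http://example.com/c"},
    "EngineC": {"http://example.com/c", "http://example.com/d"},
    "EngineD": {"http://example.com/d", "http://example.com/e"},
    "EngineE": {"http://example.com/e", "http://example.com/a"},
}

shared = pages_found_by_at_least(sample_results, 4)
print(len(shared))
```

In this made-up data every page is found by exactly two engines, so nothing reaches the four-engine threshold. A real comparison would also need to normalize URLs (for example trailing slashes and host case) before counting, which this sketch skips.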

Overlap

And they exclude most:
• Content of Adobe PDF and formatted files
• The content in most sites requiring a log in
• CGI output: data requested by a form
• Other dynamically produced data
• Pages protected by a robots.txt file
• Intranets, pages not linked from anywhere else
• Commercial resources with domain limitations
• Non-Web resources
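The robots.txt exclusion in the list above can be illustrated with Python's standard `urllib.robotparser` module. This is only a sketch of how a well-behaved crawler might check a rule: the robots.txt lines, the crawler name "ExampleCrawler", and the URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that closes one directory to every crawler.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules directly instead of fetching them

# "ExampleCrawler" is a made-up user agent name.
allowed = parser.can_fetch("ExampleCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://www.example.com/private/report.html")

print(allowed, blocked)
```

A crawler that honors these rules never requests the pages under /private/, so they cannot end up in the search engine's database.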

Scope Summary:
• Inconsistent growth
• Not full coverage
• Surprisingly low duplication

Positive Side?
Essential for searching the Net
Can be used effectively
• Phrase search
• Use more than one
• Smart searching
Incredibly popular
• Even when they fail
– But then, since when is finding information always easy?

Overview:
Comparing the database content
• Change
• Comparative Size
• Overlap
Looking towards future developments
• Portal or Destination
• Output sorting

What is a search engine?
• Portal?
• Gateway?
• Destination?

Search Engine
the software that searches a database

Development
Database of Web pages
adds Supplementary Database
• Phone numbers, reference, businesses, news
then adds Subject directory
then Services
• email, ISP, shopping, travel agent
now Communities

Portal to Destination?
Driving force
• advertising revenue
Keep users longer for more
Conflicts with portal and gateway principle

Future possibilities?
• Smaller databases
• Less pointing to external pages
• Paid advertising or sponsorship for visibility
• Rise of search only sites?

Output Development
Initially, “Relevance” ranking
• Crude
• Not site or URL based
Some site sorting from Excite
No date sorting

Site Sorting
Infoseek, then Lycos, now HotBot
Group together by site
• More relevant than prior algorithms
Northern Light includes it in
• Custom Folders
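Grouping results by site, as described above, can be sketched as follows. This is an assumption about the general idea, not a description of how Infoseek, Lycos, or HotBot implemented it: results are taken as a relevance-ordered list of URLs, and the host name stands in for "site". The function name `group_by_site` and the URLs are hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def group_by_site(ranked_urls: List[str]) -> Dict[str, List[str]]:
    """Group a relevance-ordered list of URLs by host, keeping first-seen order."""
    groups: Dict[str, List[str]] = {}
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        groups.setdefault(host, []).append(url)
    return groups


# Hypothetical ranked results (placeholder URLs).
sample_results = [
    "http://www.example.com/a",
    "http://www.example.org/x",
    "http://www.example.com/b",
]

grouped = group_by_site(sample_results)
for host, urls in grouped.items():
    print(host, urls)
```

Because dictionaries keep insertion order, each site appears at the position of its best-ranked page, so one site with many matching pages no longer fills the whole first screen of results.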

Other Output
• RealName on AltaVista
• Direct Hit on HotBot
• Subject Directory Categories
• News
• Books, CDs, etc. “about search term”

Search Engine Showdown
imt.net/~notess/search
Search engine features
See also
• www.searchenginewatch.com
See also
• Rich Wiggins, Coming up next . . .

