Web Search Engines
by Greg R. Notess
[email protected]
imt.net/~notess/search
Overview:
Comparing the database content
• Change
• Comparative Size
• Overlap
Looking towards future developments
• Portal or Destination
• Output sorting
Results are limited by
Database content
• The Web sites included
• The depth to which they are indexed
If it’s not in the database, the best search engine will not be able to find the Web page
So what’re they like?
Very large databases
Most index all words on page
• None index words in images
Let’s see how the databases compare to the real Web
Change over time?
Overall Size Change
Is the Web in general
Growing?
Shrinking?
Remaining the same?
[Chart: Excite, 6 Searches, 10/96-8/98]
What about the rest?
Who’s the biggest?
How to measure?
• Actual search results
• Verified hits
[Chart: Total Hits from 15 Searches, August 29, 1998. Engines shown: AltaVista, Northern Light, HotBot, Infoseek, Excite, Lycos, WebCrawler. Horizontal axis: 0 to 4500 hits.]
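The "verified hits" measurement above can be sketched as a simple tally: run the same searches on every engine, count the hits that check out, and sum per engine. This is only a minimal illustration of that idea, not the procedure actually used for the chart. The function name `total_hits`, the engine labels, and all counts below are made up for the example.

```python
from collections import defaultdict


def total_hits(results):
    """Sum verified hit counts per engine across all test searches.

    `results` maps a search term to a dict of {engine: verified_hits}.
    Returns (engine, total) pairs, largest total first.
    """
    totals = defaultdict(int)
    for per_engine in results.values():
        for engine, hits in per_engine.items():
            totals[engine] += hits
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts for illustration only; NOT figures from the talk.
sample = {
    "term one": {"EngineA": 120, "EngineB": 95, "EngineC": 40},
    "term two": {"EngineA": 30, "EngineB": 60, "EngineC": 10},
}
ranking = total_hits(sample)
for engine, hits in ranking:
    print(engine, hits)
```

Summing over several searches rather than relying on one smooths out the case where a single term happens to favor one database.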
And over time?
8/98 -- AltaVista, Northern Light, HotBot
5/98 -- AltaVista, HotBot, Northern Light
2/98 -- HotBot, AltaVista, Northern Light
10/97 -- AltaVista, HotBot, Northern Light
9/97 -- Northern Light, Excite, HotBot
6/97 -- HotBot, AltaVista, Infoseek
10/96 -- HotBot, Excite, AltaVista
Back to change in size
Let’s look at six search engines
Over the course of two years
[Chart: Database Size Changes, Five Terms: Oct. 96 - Aug. 98. Engines shown: Northern Light, HotBot, AltaVista, Infoseek, Excite, Lycos. Dates: Oct 96, June 97, Sept 97, Oct 97, Feb 98, May 98, Aug 98. Vertical axis: 0 to 1250.]
But at least
They have a high degree of duplication between them
Right?
Try 4 small searches
Using five search engines
How many pages are found by all five or at least by four of them?
ZERO
Overlap
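The overlap question above (how many pages are found by at least four of five engines) can be sketched as a count of engines per URL. This is a minimal illustration under stated assumptions: URLs are compared as exact strings with no normalization, and the engine labels, URLs, and helper names are hypothetical, not data or methods from the talk.

```python
from collections import Counter


def overlap_counts(results_by_engine):
    """Count, for each URL, how many engines returned it."""
    counts = Counter()
    for urls in results_by_engine.values():
        counts.update(set(urls))  # set(): each engine counts once per URL
    return counts


def found_by_at_least(results_by_engine, k):
    """Return the URLs returned by at least k engines, sorted."""
    counts = overlap_counts(results_by_engine)
    return sorted(url for url, n in counts.items() if n >= k)


# Hypothetical result lists for one search on five engines.
sample = {
    "EngineA": ["http://example.com/a", "http://example.com/b"],
    "EngineB": ["http://example.com/b", "http://example.com/c"],
    "EngineC": ["http://example.com/d"],
    "EngineD": ["http://example.com/b", "http://example.com/e"],
    "EngineE": ["http://example.com/f"],
}
shared_by_four = found_by_at_least(sample, 4)
shared_by_three = found_by_at_least(sample, 3)
print(shared_by_four, shared_by_three)
```

In this made-up sample, one page is shared by three engines and none by four, which mirrors the kind of low overlap the slide reports.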
And they exclude most:
Content of Adobe PDF and formatted files
The content in most sites requiring a log in
CGI output: data requested by a form
Other dynamically produced data
Pages protected by a robots.txt file
Intranets, pages not linked from anywhere else
Commercial resources with domain limitations
Non-Web resources
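As one concrete case from the list above, a crawler that honors robots.txt skips any page the file disallows, so those pages never reach the database. The sketch below uses Python's standard `urllib.robotparser` on an inline robots.txt. The file contents, the `ExampleBot` user agent, and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, user_agent="ExampleBot"):
    """Return True if the parsed robots.txt lets this user agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_index("http://example.com/public/page.html"))
print(may_index("http://example.com/private/data.html"))
```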
Scope Summary:
Inconsistent growth
Not full coverage
Surprisingly low duplication
Positive Side?
Essential for searching the Net
Can be used effectively
• Phrase search
• Use more than one
• Smart searching
Incredibly popular
• Even when they fail
– But then, since when is finding information always easy?
Overview:
Comparing the database content
• Change
• Comparative Size
• Overlap
Looking towards future developments
• Portal or Destination
• Output sorting
What is a search engine?
Portal?
Gateway?
Destination?
Search Engine
the software that searches a database
Development
Database of Web pages
adds Supplementary Database
• Phone numbers, reference, businesses, news
then adds Subject directory
then Services
• email, ISP, shopping, travel agent
now Communities
Portal to Destination?
Driving force
• advertising revenue
Keep users longer for more
Conflicts with portal and gateway principle
Future possibilities?
Smaller databases
Less pointing to external pages
Paid advertising or sponsorship for visibility
Rise of search only sites?
Output Development
Initially, “Relevance” ranking
• Crude
• Not site or URL based
Some site sorting from Excite
No date sorting
Site Sorting
Infoseek, then Lycos, now HotBot
Group together by site
• More relevant than prior algorithms
Northern Light includes it in
• Custom Folders
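The "group together by site" idea above can be sketched as clustering a ranked result list by host name, so that several pages from one site appear together under that site's best-ranked position. This is a minimal sketch, not how Infoseek, Lycos, or HotBot implement it. The function name `group_by_site` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group a relevance-ranked URL list by host.

    Sites keep the order of their best-ranked page, and pages keep
    their relative order within each site (dicts preserve insertion order).
    """
    groups = {}
    for url in ranked_urls:
        host = urlsplit(url).netloc.lower()
        groups.setdefault(host, []).append(url)
    return groups


# Hypothetical ranked results, best match first.
ranked = [
    "http://www.example.com/a.html",
    "http://other.example.org/x.html",
    "http://www.example.com/b.html",
]
grouped = group_by_site(ranked)
for site, pages in grouped.items():
    print(site, pages)
```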
Other Output
RealNames on AltaVista
Direct Hit on HotBot
Subject Directory Categories
News
Books, CDs, etc. “about search term”
Search Engine Showdown
imt.net/~notess/search
Search engine features
See also
• www.searchenginewatch.com
See also
• Rich Wiggins, Coming up next . . .