Download - 20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Lecture 5: Search Engines
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Lecture 5:Search Engines
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Outline
• Search engines: key tools for ecommerce– Buyers and sellers must find each other
• How do they work?• How much do they index?• How are hits ordered?• Can the order be changed?
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engines
• Tools for finding information on the Web– Problem: “hidden” databases, e.g. New York Times
• Directory– A hand-constructed hierarchy of topics (e.g. Yahoo)
• Search engine– A machine-constructed index (usually by keyword)
• So many search engines, we now need search engines to find them. Searchenginecollosus.com
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Indexing
• Arrangement of data (data structure) to permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak• Sorting helps. Why?
– Permits binary search. About log2n probes into list
• log2(1 billion) ~ 30
– Permits interpolation search. About log2(log2n) probes
• log2 log2(1 billion) ~ 5
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Inverted Files
A file is a list of words by position
– First entry is the word in position 1 (first word)– Entry 4562 is the word in position 4562 (4562nd word)– Last entry is the last word
An inverted file is a list of positions by word!
POS1
10
20
30
36
FILE
a (1, 4, 40)entry (11, 20, 31)file (2, 38)list (5, 41)position (9, 16, 26)positions (44)word (14, 19, 24, 29, 35, 45)words (7)4562 (21, 27)
INVERTED FILE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Inverted Files for Multiple Documents
107 4 322 354 381 405232 6 15 195 248 1897 1951 2192677 1 481713 3 42 312 802
WORD NDOCS PTR
jezebel 20
jezer 3
jezerit 1
jeziah 1
jeziel 1
jezliah 1
jezoar 1
jezrahliah 1
jezreel 39jezoar
34 6 1 118 2087 3922 3981 500244 3 215 2291 301056 4 5 22 134 992
DOCID OCCUR POS 1 POS 2 . . .
566 3 203 245 287
67 1 132. . .
“jezebel” occurs6 times in document 34,3 times in document 44,4 times in document 56 . . .
LEXICON
WORD INDEX
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Architecture
• Spider– Crawls the web to find pages. Follows hyperlinks.
Never stops
• Indexer– Produces data structures for fast searching of all words in
the pages
• Retriever– Query interface– Database lookup to find hits
• 2 billion documents
• 4 TB RAM, many terabytes of disk
– Ranking
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Crawlers (Spiders, Bots)
• Retrieve web pages for indexing by search engines
• Start with an initial page P0. Find URLs on P0 and add them to a
queue
• When done with P0, pass it to an indexing program, get a page
P1 from the queue and repeat
• Can be specialized (e.g. only look for email addresses)• Issues
– Which page to look at next? (Special subjects, recency)– Avoid overloading a site– How deep within a site to go (drill-down)?– How frequently to visit pages?
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Query Specification
• Boolean– AND , OR, NOT, PHRASE “ ”, NEAR ~ – But keyword query is artificial
• Question-answering (simulated)– “Who offers a master’s degree in ecommerce?
• Date range• Relevance specification
– In Altavista, can specify terms by importance (separate from query specification)
• Content– multimedia, MP3, .PPT files
• Stemming: eat, eats, eaten, eating, eater, (ate!)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
“Advanced” Query Specification
• Multimedia, e.g. Google• Date range• Relevance specification
– In Altavista, can specify terms by importance (separate from query specification)
• Content– multimedia, MP3, .PPT files
• Stemming• Language• Search depth (from site’s front page)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Ranking (Scoring) Hits• Hits must be presented in some order• What order?
– Relevance, recency, popularity, reliability?• Some ranking methods
– Presence of keywords in title of document– Closeness of keywords to start of document– Frequency of keyword in document– Link popularity (how many pages point to this one)
• Can the user control? Can the page owner control?• Can you find out what order is used?• Spamdexing: influencing retrieval ranking by altering
a web page. (Puts “spam” in the index)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Google’s PageRank Algorithm
• Assumption: A link in page A to page B is a recommendation of page B by the author of A(we say B is successor of A)
The “quality” of a page is related to the number of links that point to it (its in-degree)
• Apply recursively: Quality of a page is related to– its in-degree, and to
– the quality of pages linking to it
PageRank Algorithm (Brinn & Page, 1998)SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Definition of PageRank
• Consider the following infinite random walk (surfing):– Initially the surfer is at a random page
– At each step, the surfer proceeds
• to a randomly chosen web page with probability d
• to a randomly chosen successor of the current page with
probability 1-d
• The PageRank of a page p is the fraction of steps
the surfer spends at p as the number of steps
approaches infinity
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
PageRank Formula
where n is the total number of nodes in the graph
• Google uses d 0.85
• PageRank is a probability distribution over web pages
• The sum of all PageRanks of all Pages is 1
Epq
qoutdegreeqPageRankdn
dpPageRank
),(
)(/)()1()(
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
PageRank Example
P
A B
PageRank of P is
(1-d)[(PageRank of A)/4 + (PageRank of B)/3)] + d/n
dd
PAGERANK CALCULATORSOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Link Popularity
• How many pages link to this page?– on the whole Web– in our database?
• www.linkpopularity.com• Link popularity is used for ranking
– Many measures– Number of links in– Weighted number of links in (by weight of referring page)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Sizes (Sept. 2, 2003)
SOURCE: SEARCHENGINEWATCH.COM
ATW AllTheWebAV AltavistaGG GoogleINK InktomiTMA Teoma
SEARCHES/DAY(MILLIONS) 250 80 18
2900 per second!
BILLIONS OF PAGES
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Usage
SHARE BY SEARCH SITE SHARE BY ENGINE
SOURCE: SEARCHENGINEWATCH.COM
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engines Disjointness
SOURCE: SEARCHENGINESHOWDOWN
Four searches, 10 engines, total of 141 hits on March 6, 2002
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine EKG
SOURCE: SEARCHENGINEWATCH.COM
Shows activity of the Lycos crawlerat one sample site, calafia.com, bynumber of pages visited during each crawl
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine EKG Comparison
SO
UR
CE
: S
EA
RC
HE
NG
INE
WA
TC
H.C
OM
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Engine Differences
• Coverage (number of documents)• Spidering algorithms (visit SpiderCatcher)
– Frequency, depth of visits
• Inexing policies• Search interfaces• Ranking• One solution: use a metasearcher (search agent)
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Metasearchers
• All the engines operate differently. Different– sizes– query languages– crawling algorithms– storage policies (stop words, punctuation, fonts)– freshness– ranking
• Submit the same query to many engines and collect the results
• Metacrawler
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Clustering
• Viewing large numbers of unstructured hits is not useful
• Answer: cluster them• Vivisimo• Kartoo• iBoogie• SurfWax
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Search Spying
• Peeking at queries as they are being submitted• AllTheWeb• Metaspy. Spies on Metacrawler• AskJeeves• Epicurious (recipes)• StockCharts.com• Yahoo buzz index• Kanoodle• IQSeek
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Time Spent Per Visitor (minutes)by Search Engine, Jan. 2003
SOURCE: SEARCHENGINEWATCH.COM
AJ Ask JeevesAOL America OnlineAV AltavistaELNK EarthLinkGG GoogleISP InfoSpaceLS LookSmartLY LycosMSN MicrosoftNS NetscapeOVR OVERTUREYH Yahoo
Up 58% in ONE YEAR!
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Audience Reachby Search Site, Jan, 2003
AJ Ask JeevesAOL America OnlineAV AltavistaELNK EarthLinkGG GoogleISP InfoSpaceLS LookSmartLY LycosMSN MicrosoftNS NetscapeOVR OVERTUREYH Yahoo
Audience Reach = % of active surfers visiting during month.
Totals exceed 100% because of overlap
SOURCE: SEARCHENGINEWATCH.COM
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Robot Exclusion
• You may not want certain pages indexed but still viewable by browsers. Can’t protect directory.
• Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall
• They look for file robots.txt at highest directory level in domain. If domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS” CONTENT="NOINDEX">
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Robots Exclusion Protocol
• Format of robots.txt– Two fields. User-agent to specify a robot– Disallow to tell the agent what to ignore
• To exclude all robots from a server:User-agent: *Disallow: /
• To exclude one robot from two directories:User-agent: WebCrawler
Disallow: /news/Disallow: /tmp/
• View the robots.txt specification.
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Key Takeaways
• Engines are a critical Web resource• Very sophisticated, high technology• They don’t cover the Web completely• Spamdexing is a problem• New paradigms needed as Web grows• What about images, music, video?
– www.corbis.com, Google images
20-751 ECOMMERCE TECHNOLOGY
FALL 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
QA&