Download - 20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Lecture 5: Search Engines

20-751 ECOMMERCE TECHNOLOGY

FALL 2003

COPYRIGHT © 2003 MICHAEL I. SHAMOS

Lecture 5:Search Engines


FALL 2003


Outline

• Search engines: key tools for ecommerce– Buyers and sellers must find each other

• How do they work?• How much do they index?• How are hits ordered?• Can the order be changed?


FALL 2003


Search Engines

• Tools for finding information on the Web– Problem: “hidden” databases, e.g. New York Times

• Directory– A hand-constructed hierarchy of topics (e.g. Yahoo)

• Search engine– A machine-constructed index (usually by keyword)

• So many search engines, we now need search engines to find them. Searchenginecollosus.com


FALL 2003


Indexing

• Arrangement of data (data structure) to permit fast searching

• Which list is easier to search?

sow fox pig eel yak hen ant cat dog hog

ant cat dog eel fox hen hog pig sow yak• Sorting helps. Why?

– Permits binary search. About log2n probes into list

• log2(1 billion) ~ 30

– Permits interpolation search. About log2(log2n) probes

• log2 log2(1 billion) ~ 5


FALL 2003


Inverted Files

A file is a list of words by position

– First entry is the word in position 1 (first word)– Entry 4562 is the word in position 4562 (4562nd word)– Last entry is the last word

An inverted file is a list of positions by word!

POS1

10

20

30

36

FILE

a (1, 4, 40)entry (11, 20, 31)file (2, 38)list (5, 41)position (9, 16, 26)positions (44)word (14, 19, 24, 29, 35, 45)words (7)4562 (21, 27)

INVERTED FILE


FALL 2003


Inverted Files for Multiple Documents

107 4 322 354 381 405232 6 15 195 248 1897 1951 2192677 1 481713 3 42 312 802

WORD NDOCS PTR

jezebel 20

jezer 3

jezerit 1

jeziah 1

jeziel 1

jezliah 1

jezoar 1

jezrahliah 1

jezreel 39jezoar

34 6 1 118 2087 3922 3981 500244 3 215 2291 301056 4 5 22 134 992

DOCID OCCUR POS 1 POS 2 . . .

566 3 203 245 287

67 1 132. . .

“jezebel” occurs6 times in document 34,3 times in document 44,4 times in document 56 . . .

LEXICON

WORD INDEX


FALL 2003


Search Engine Architecture

• Spider– Crawls the web to find pages. Follows hyperlinks.

Never stops

• Indexer– Produces data structures for fast searching of all words in

the pages

• Retriever– Query interface– Database lookup to find hits

• 2 billion documents

• 4 TB RAM, many terabytes of disk

– Ranking


FALL 2003


Crawlers (Spiders, Bots)

• Retrieve web pages for indexing by search engines

• Start with an initial page P0. Find URLs on P0 and add them to a

queue

• When done with P0, pass it to an indexing program, get a page

P1 from the queue and repeat

• Can be specialized (e.g. only look for email addresses)• Issues

– Which page to look at next? (Special subjects, recency)– Avoid overloading a site– How deep within a site to go (drill-down)?– How frequently to visit pages?


FALL 2003


Query Specification

• Boolean– AND , OR, NOT, PHRASE “ ”, NEAR ~ – But keyword query is artificial

• Question-answering (simulated)– “Who offers a master’s degree in ecommerce?

• Date range• Relevance specification

– In Altavista, can specify terms by importance (separate from query specification)

• Content– multimedia, MP3, .PPT files

• Stemming: eat, eats, eaten, eating, eater, (ate!)


FALL 2003


“Advanced” Query Specification

• Multimedia, e.g. Google• Date range• Relevance specification

– In Altavista, can specify terms by importance (separate from query specification)

• Content– multimedia, MP3, .PPT files

• Stemming• Language• Search depth (from site’s front page)


FALL 2003


Ranking (Scoring) Hits• Hits must be presented in some order• What order?

– Relevance, recency, popularity, reliability?• Some ranking methods

– Presence of keywords in title of document– Closeness of keywords to start of document– Frequency of keyword in document– Link popularity (how many pages point to this one)

• Can the user control? Can the page owner control?• Can you find out what order is used?• Spamdexing: influencing retrieval ranking by altering

a web page. (Puts “spam” in the index)


FALL 2003


Google’s PageRank Algorithm

• Assumption: A link in page A to page B is a recommendation of page B by the author of A(we say B is successor of A)

The “quality” of a page is related to the number of links that point to it (its in-degree)

• Apply recursively: Quality of a page is related to– its in-degree, and to

– the quality of pages linking to it

PageRank Algorithm (Brinn & Page, 1998)SOURCE: GOOGLE


FALL 2003


Definition of PageRank

• Consider the following infinite random walk (surfing):– Initially the surfer is at a random page

– At each step, the surfer proceeds

• to a randomly chosen web page with probability d

• to a randomly chosen successor of the current page with

probability 1-d

• The PageRank of a page p is the fraction of steps

the surfer spends at p as the number of steps

approaches infinity

SOURCE: GOOGLE


FALL 2003


PageRank Formula

where n is the total number of nodes in the graph

• Google uses d 0.85

• PageRank is a probability distribution over web pages

• The sum of all PageRanks of all Pages is 1

Epq

qoutdegreeqPageRankdn

dpPageRank

),(

)(/)()1()(

SOURCE: GOOGLE


FALL 2003


PageRank Example

P

A B

PageRank of P is

(1-d)[(PageRank of A)/4 + (PageRank of B)/3)] + d/n

dd

PAGERANK CALCULATORSOURCE: GOOGLE


FALL 2003


Link Popularity

• How many pages link to this page?– on the whole Web– in our database?

• www.linkpopularity.com• Link popularity is used for ranking

– Many measures– Number of links in– Weighted number of links in (by weight of referring page)


FALL 2003


Search Engine Sizes (Sept. 2, 2003)

SOURCE: SEARCHENGINEWATCH.COM

ATW AllTheWebAV AltavistaGG GoogleINK InktomiTMA Teoma

SEARCHES/DAY(MILLIONS) 250 80 18

2900 per second!

BILLIONS OF PAGES


FALL 2003


Search Engine Usage

SHARE BY SEARCH SITE SHARE BY ENGINE



FALL 2003


Search Engines Disjointness

SOURCE: SEARCHENGINESHOWDOWN

Four searches, 10 engines, total of 141 hits on March 6, 2002


FALL 2003


Search Engine EKG


Shows activity of the Lycos crawlerat one sample site, calafia.com, bynumber of pages visited during each crawl


FALL 2003


Search Engine EKG Comparison

SO

UR

CE

: S

EA

RC

HE

NG

INE

WA

TC

H.C

OM


FALL 2003


Search Engine Differences

• Coverage (number of documents)• Spidering algorithms (visit SpiderCatcher)

– Frequency, depth of visits

• Inexing policies• Search interfaces• Ranking• One solution: use a metasearcher (search agent)


FALL 2003


Metasearchers

• All the engines operate differently. Different– sizes– query languages– crawling algorithms– storage policies (stop words, punctuation, fonts)– freshness– ranking

• Submit the same query to many engines and collect the results

• Metacrawler


FALL 2003


Clustering

• Viewing large numbers of unstructured hits is not useful

• Answer: cluster them• Vivisimo• Kartoo• iBoogie• SurfWax


FALL 2003


Search Spying

• Peeking at queries as they are being submitted• AllTheWeb• Metaspy. Spies on Metacrawler• AskJeeves• Epicurious (recipes)• StockCharts.com• Yahoo buzz index• Kanoodle• IQSeek


FALL 2003


Time Spent Per Visitor (minutes)by Search Engine, Jan. 2003


AJ Ask JeevesAOL America OnlineAV AltavistaELNK EarthLinkGG GoogleISP InfoSpaceLS LookSmartLY LycosMSN MicrosoftNS NetscapeOVR OVERTUREYH Yahoo

Up 58% in ONE YEAR!


FALL 2003


Audience Reachby Search Site, Jan, 2003

AJ Ask JeevesAOL America OnlineAV AltavistaELNK EarthLinkGG GoogleISP InfoSpaceLS LookSmartLY LycosMSN MicrosoftNS NetscapeOVR OVERTUREYH Yahoo

Audience Reach = % of active surfers visiting during month.

Totals exceed 100% because of overlap



FALL 2003


Robot Exclusion

• You may not want certain pages indexed but still viewable by browsers. Can’t protect directory.

• Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall

• They look for file robots.txt at highest directory level in domain. If domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt

• A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS” CONTENT="NOINDEX">


FALL 2003


Robots Exclusion Protocol

• Format of robots.txt– Two fields. User-agent to specify a robot– Disallow to tell the agent what to ignore

• To exclude all robots from a server:User-agent: *Disallow: /

• To exclude one robot from two directories:User-agent: WebCrawler

Disallow: /news/Disallow: /tmp/

• View the robots.txt specification.


FALL 2003


Key Takeaways

• Engines are a critical Web resource• Very sophisticated, high technology• They don’t cover the Web completely• Spamdexing is a problem• New paradigms needed as Web grows• What about images, music, video?

– www.corbis.com, Google images


FALL 2003


QA&

Download - 20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Lecture 5: Search Engines

Top Related