information organization - how search engines work


1/41

SIMS 202

Information Organization and Retrieval

Prof. Marti Hearst and Prof. Ray Larson

UC Berkeley SIMS

Tues/Thurs 9:30-11:00am, Fall 2000



2/41

Last Time

Web search
  Directories vs. search engines
  How web search differs from other search
    Type of data searched over
    Type of searches done
    Type of searchers doing search
  Web queries are short
    This probably means people are often using search engines to find starting points
    Once at a useful site, they must follow links or use site search
  Web search ranking combines many features


3/41

What about Ranking?

Lots of variation here
  Pretty messy in many cases
  Details usually proprietary and fluctuating
Combining subsets of:
  Term frequencies
  Term proximities
  Term position (title, top of page, etc.)
  Term characteristics (boldface, capitalized, etc.)
  Link analysis information
  Category information
  Popularity information
Most use a variant of vector space ranking to combine these
Here's how it might work:
  Make a vector of weights for each feature
  Multiply this by the counts for each feature
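The weighted combination described above can be sketched in a few lines of Python. This is only an illustration: the feature names and weight values are made up for the example, and real engines keep their actual features and weights proprietary.

```python
# Hypothetical feature names and weights, chosen only for illustration.
WEIGHTS = {
    "term_frequency": 1.0,
    "term_proximity": 2.0,
    "title_match": 3.0,
    "boldface": 0.5,
    "inlinks": 1.5,
}


def score(feature_counts, weights=WEIGHTS):
    """Weighted sum (dot product) of feature counts and feature weights.

    Features that have no weight are ignored.
    """
    return sum(
        weights[name] * count
        for name, count in feature_counts.items()
        if name in weights
    )


def rank(pages, weights=WEIGHTS):
    """Return page ids sorted by descending score."""
    return sorted(pages, key=lambda page_id: score(pages[page_id], weights), reverse=True)


# Made-up feature counts for two pages.
pages = {
    "a.html": {"term_frequency": 4, "title_match": 0, "inlinks": 1},
    "b.html": {"term_frequency": 2, "title_match": 1, "inlinks": 3},
}
print(rank(pages))  # ['b.html', 'a.html']
```

With these assumed weights, b.html scores 9.5 and a.html scores 5.5, so a title match and more inlinks outweigh the higher raw term frequency.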


4/41

From description of the NorthernLight search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm


5/41

High-Precision Ranking

Proximity search can help get high-precision results if the query has more than one term
Hearst 96 paper:
  Combine Boolean and passage-level proximity
  Shows significant improvements when retrieving top 5, 10, 20, 30 documents
  Results reproduced by Mitra et al. 98
Google uses something similar
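As a rough illustration of combining a Boolean AND with a proximity constraint, the sketch below accepts a document only if all query terms occur inside some window of consecutive tokens. It is not the method from the Hearst 96 paper; the function name, the whitespace tokenization and the window sizes are assumptions made for the example.

```python
def terms_within_window(tokens, terms, window):
    """True if every query term occurs inside some run of `window` consecutive tokens."""
    wanted = set(terms)
    # Slide a fixed-size window over the token list; a document shorter
    # than the window is checked once as a whole.
    for start in range(max(1, len(tokens) - window + 1)):
        if wanted <= set(tokens[start:start + window]):
            return True
    return False


doc = "cheap used cars for sale in berkeley".split()
print(terms_within_window(doc, ["used", "cars"], window=3))       # True
print(terms_within_window(doc, ["cheap", "berkeley"], window=3))  # False
```

A plain Boolean AND would accept the document for both queries; the window check rejects the second one because the two terms are far apart.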


6/41

Boolean Formulations, Hearst 96

Results


7/41

Spam

Email spam: undesired content
Web spam: content is disguised as something it is not, in order to
  Be retrieved more often than it otherwise would
  Be retrieved in contexts that it otherwise would not be retrieved in


8/41

Web Spam

What are the types of Web spam?
  Add extra terms to get a higher ranking
    Repeat "cars" thousands of times
  Add irrelevant terms to get more hits
    Put a dictionary in the comments field
    Put extra terms in the same color as the background of the web page
  Add irrelevant terms to get different types of hits
    Put "sex" in the title field in sites that are selling cars
  Add irrelevant links to boost your link analysis ranking
There is a constant arms race between web search companies and spammers


9/41

Commercial Issues

General internet search is often commercially driven
  Commercial sector sometimes hides things; harder to track than research
  On the other hand, most CTOs for search engine companies used to be researchers, and so help us out
  Commercial search engine information changes monthly
  Sometimes motivations are commercial rather than technical
    Goto.com uses payments to determine ranking order
    iwon.com gives out prizes


10/41

Web Search Architecture


11/41

Web Search Architecture

Preprocessing
  Collection gathering phase
    Web crawling
  Collection indexing phase
Online
  Query servers
  This part is not talked about in the readings


12/41

From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm


13/41

Standard Web Search Engine Architecture

[Diagram: crawl the web; check for duplicates, store the documents; create an inverted index; search engine servers take the user query, get DocIds from the inverted index, and show results to the user]


14/41

More detailed architecture, from Brin & Page 98.

Only covers the preprocessing in detail, not the query serving.


15/41

Inverted Indexes for Web Search Engines

Inverted indexes are still used, even though the web is so huge
Some systems partition the indexes across different machines; each machine handles different parts of the data
Other systems duplicate the data across many machines; queries are distributed among the machines
Most do a combination of these
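A small Python sketch of a partitioned inverted index follows. It assumes documents are assigned to partitions by doc id modulo the number of partitions and that a query is sent to every partition and the results merged; both choices, and all function names, are illustrative assumptions rather than a description of any particular engine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def partition(docs, n_partitions):
    """Split documents by doc id so each partition indexes a different part of the data."""
    parts = [{} for _ in range(n_partitions)]
    for doc_id, text in docs.items():
        parts[doc_id % n_partitions][doc_id] = text
    return [build_index(part) for part in parts]


def search(partitions, term):
    """Send the term to every partition and merge the matching doc ids."""
    hits = []
    for index in partitions:
        hits.extend(index.get(term, []))
    return sorted(hits)


docs = {0: "web search engines", 1: "web crawlers", 2: "inverted index", 3: "search index"}
partitions = partition(docs, 2)
print(search(partitions, "search"))  # [0, 3]
```

Duplicating the data would correspond to keeping several copies of each partition's index and sending each query to only one copy.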


16/41

From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

Each row can handle 120 queries per second

Each column can handle 7M pages

To handle more queries, add another row.
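The row and column figures above lend themselves to a quick sizing calculation. The sketch below uses the two numbers from the slide (120 queries per second per row, 7M pages per column) and assumes one machine per grid cell; the collection size and query load in the example are hypothetical inputs.

```python
import math

QUERIES_PER_ROW = 120          # queries per second per row, from the slide
PAGES_PER_COLUMN = 7_000_000   # pages per column, from the slide


def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines), assuming one machine per grid cell."""
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    rows = math.ceil(peak_qps / QUERIES_PER_ROW)
    return rows, columns, rows * columns


# Hypothetical load: 350M pages and 600 queries per second at peak.
print(grid_size(350_000_000, 600))  # (5, 50, 250)
```

Under these assumptions, doubling the query load doubles the number of rows, while growing the collection adds columns.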


17/41

Cascading Allocation of CPUs

A variation on this that produces a cost savings:
  Put high-quality/common pages on many machines
  Put lower-quality/less common pages on fewer machines
  Query goes to high-quality machines first
  If no hits found there, go to other machines
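The cascade can be sketched as a loop over index tiers, stopping at the first tier that returns hits. The tier contents and names below are invented for the example, and each tier is modeled as a plain dictionary from term to doc ids rather than a group of machines.

```python
def cascade_search(term, tiers):
    """Query each tier in order and return (tier_number, hits) for the first tier with hits.

    Returns (None, []) when no tier has a hit.
    """
    for tier_number, index in enumerate(tiers):
        hits = index.get(term, [])
        if hits:
            return tier_number, hits
    return None, []


# Hypothetical tiers: tier 0 holds high-quality/common pages, tier 1 the rest.
high_quality = {"berkeley": ["doc1", "doc2"]}
lower_quality = {"berkeley": ["doc9"], "obscure": ["doc7"]}
tiers = [high_quality, lower_quality]

print(cascade_search("berkeley", tiers))  # (0, ['doc1', 'doc2'])
print(cascade_search("obscure", tiers))   # (1, ['doc7'])
```

Most queries are answered by tier 0, so only that tier needs enough machines for the full query load.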


18/41

Web Crawlers

How do the web search engines get all of the items they index?

Main idea: Start with known sites

Record information for these sites

Follow the links from each site

Record information found at new sites

Repeat


19/41

Web Crawlers

How do the web search engines get all of the items they index?

More precisely:
  Put a set of known sites on a queue
  Repeat the following until the queue is empty:
    Take the first page off of the queue
    If this page has not yet been processed:
      Record the information found on this page
        Positions of words, links going out, etc.
      Add each link on the current page to the queue
      Record that this page has been processed

In what order should the links be followed?
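The queue-based procedure above translates almost line for line into Python. To keep the sketch self-contained, it crawls a small in-memory link graph instead of fetching real pages; the `get_links` callback and the page names are assumptions made for the example.

```python
from collections import deque


def crawl(seed_pages, get_links):
    """Process pages from a queue until it is empty; return pages in the order processed."""
    queue = deque(seed_pages)
    processed = set()
    order = []
    while queue:
        page = queue.popleft()          # take the first page off of the queue
        if page in processed:
            continue
        order.append(page)              # stands in for "record the information found on this page"
        for link in get_links(page):    # add each link on the current page to the queue
            queue.append(link)
        processed.add(page)             # record that this page has been processed
    return order


# Hypothetical link graph: page name -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}
print(crawl(["A"], lambda page: LINKS.get(page, [])))  # ['A', 'B', 'C', 'D']
```

Because links are taken from the front of the queue in the order they were added, this version visits pages breadth-first; the `processed` set is what stops the cycle from C back to A.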


20/41

Page Visit Order

Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

Structure to be traversed


21/41

Page Visit Order

Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

Breadth-first search (must be in presentation mode to see this animation)


22/41

Page Visit Order


Depth-first search (must be in presentation mode to see this animation)
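Since the animations do not survive outside presentation mode, the difference between the two visit orders can be reproduced with a short Python sketch. The only change between breadth-first and depth-first is whether the next node is taken from the front of the frontier (a queue) or from the back (a stack). The tree and its node labels are made up for the example.

```python
from collections import deque


def visit_order(tree, root, depth_first=False):
    """Return the order in which nodes are visited, breadth-first by default."""
    frontier = deque([root])
    seen = set()
    order = []
    while frontier:
        # Stack behavior (pop from the right) gives depth-first;
        # queue behavior (pop from the left) gives breadth-first.
        node = frontier.pop() if depth_first else frontier.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        children = tree.get(node, [])
        # For depth-first, push children reversed so the leftmost child is visited first.
        frontier.extend(reversed(children) if depth_first else children)
    return order


# Hypothetical tree: node -> list of children.
TREE = {"1": ["2", "3"], "2": ["4", "5"], "3": ["6", "7"]}
print(visit_order(TREE, "1"))                    # ['1', '2', '3', '4', '5', '6', '7']
print(visit_order(TREE, "1", depth_first=True))  # ['1', '2', '4', '5', '3', '6', '7']
```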


23/41

Page Visit Order


Depth-First Crawling


24/41

Depth-First Crawling (more complex graphs & sites)

[Figure: link graph of 13 pages grouped into Site 1, Site 2, Site 3, Site 5 and Site 6]

Visit order:

Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3


25/41

Breadth-First Crawling (more complex graphs & sites)

[Figure: link graph of 13 pages grouped into Site 1, Site 2, Site 3, Site 5 and Site 6]

Visit order:

Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1


26/41

Web Crawling Issues

"Keep out" signs
  A file called robots.txt tells the crawler which directories are off limits
Freshness
  Figure out which pages change often
  Recrawl these often
Duplicates, virtual hosts, etc.
  Convert page contents with a hash function
  Compare new pages to the hash table
Lots of problems
  Server unavailable
  Incorrect HTML
  Missing links
  Infinite loops
Web crawling is difficult to do robustly!
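The hash-based duplicate check mentioned above might look like the following Python sketch. It assumes that collapsing whitespace and lowercasing is an acceptable normalization and uses SHA-1 from the standard library; the class name, the normalization and the example URLs are illustrative choices. An exact hash of this kind only catches pages whose normalized contents are identical, not near-duplicates.

```python
import hashlib


def content_fingerprint(html):
    """Hash the page contents after collapsing whitespace and lowercasing."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Hash table from content fingerprint to the first URL seen with that content."""

    def __init__(self):
        self.seen = {}

    def check(self, url, html):
        """Return the URL of an earlier page with the same content, or None if the page is new."""
        key = content_fingerprint(html)
        if key in self.seen:
            return self.seen[key]
        self.seen[key] = url
        return None


# Hypothetical URLs for two virtual hosts serving the same page.
dupes = DuplicateFilter()
print(dupes.check("http://host-a.example/index.html", "<p>Hello   world</p>"))  # None
print(dupes.check("http://host-b.example/index.html", "<p>hello world</p>"))
# http://host-a.example/index.html
```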


27/41

Cha-Cha

Cha-Cha searches an intranet
  Sites associated with an organization
Instead of hand-edited categories
  Computes shortest path from the root for each hit
  Organizes search results according to which subdomain the pages are found in
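Computing a shortest path from the root to a hit can be done with a breadth-first search that remembers each page's parent. The sketch below is a generic version of that idea, not Cha-Cha's actual code; apart from www.berkeley.edu, the pages in the link graph are hypothetical.

```python
from collections import deque


def shortest_path(links, root, target):
    """Return the shortest chain of links from root to target, or None if unreachable."""
    parents = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if page == target:
            # Walk the parent pointers back to the root, then reverse.
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for linked in links.get(page, []):
            if linked not in parents:
                parents[linked] = page
                queue.append(linked)
    return None


# Hypothetical intranet link graph rooted at www.berkeley.edu.
LINKS = {
    "www.berkeley.edu": ["dept-a", "dept-b"],
    "dept-a": ["dept-a/courses"],
    "dept-b": ["dept-a/courses", "dept-b/people"],
}
print(shortest_path(LINKS, "www.berkeley.edu", "dept-a/courses"))
# ['www.berkeley.edu', 'dept-a', 'dept-a/courses']
```

Grouping hits by the first step or two of this path is one way to get categories without hand editing.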


28/41

Cha-Cha Web Crawling Algorithm

Start with a list of servers to crawl
  For UCB, simply start with www.berkeley.edu
Restrict crawl to certain domain(s)
  *.berkeley.edu
Obey the No Robots standard
Follow hyperlinks only
  Do not read local filesystems
  Links are placed on a queue
  Traversal is breadth-first
See first lecture or the technical papers for more information
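Two of the checks above, the domain restriction and the robots exclusion rules, can be sketched with Python's standard library. This is not Cha-Cha's implementation: the crawler name, the robots.txt contents and the URL paths are made up for the example, and a real crawler would fetch and apply a separate robots.txt for each host rather than one shared rule set.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def in_allowed_domain(url, suffix="berkeley.edu"):
    """True if the URL's host is the given domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == suffix or host.endswith("." + suffix)


def make_robot_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(url, rules, agent="example-crawler"):
    """A link is followed only if it is in the domain and not disallowed by the rules."""
    return in_allowed_domain(url) and rules.can_fetch(agent, url)


# Hypothetical robots.txt contents.
ROBOTS = """User-agent: *
Disallow: /private/
"""
rules = make_robot_rules(ROBOTS)
print(may_crawl("http://www.berkeley.edu/index.html", rules))      # True
print(may_crawl("http://www.berkeley.edu/private/x.html", rules))  # False
print(may_crawl("http://www.example.com/index.html", rules))       # False
```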


29/41

Summary

Web search differs from traditional IR systems
  Different kind of collection
  Different kinds of users/queries
  Different economic motivations
Ranking combines many features in a difficult-to-specify manner
  Link analysis and proximity of terms seem especially important
  This is in contrast to the term-frequency orientation of standard search
    Why?


30/41

Summary (cont.)

Web search engine architecture
  Similar in many ways to standard IR
  Indexes usually duplicated across machines to handle many queries quickly
Web crawling
  Used to create the collection
  Can be guided by quality metrics
  Is very difficult to do robustly


31/41

Web Search Statistics


32/41

Information from searchenginewatch.com

Searches per day

Info missing for fast.com, Excite, Northernlight, etc.


33/41


Web search engine visits


34/41


Percentage of web users who visit the site shown


35/41


Search engine size (July 2000)


36/41


Does size matter?

You can't access many hits anyhow.


37/41


Increasing numbers of indexed pages, self-reported


38/41


Increasing numbers of indexed pages (more recent), self-reported


39/41


Web coverage


40/41

From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm


41/41

Directory sizes
