information organization - how search engines work

Upload: stefanos-anastasiadis

Post on 04-Apr-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/30/2019 Information Organization - How Search Engines Work

    1/41

    SIMS 202

    Information Organizationand Retrieval

    Prof. Marti Hearst and Prof. Ray Larson

    UC Berkeley SIMS

    Tues/Thurs 9:30-11:00amFall 2000

    Uploaded by: CarAutoDriver

    http://www.carautodriver.co.uk/http://www.carautodriver.co.uk/
  • 7/30/2019 Information Organization - How Search Engines Work

    2/41

    Last Time

    Web Search Directories vs. Search engines How web search differs from other search

    Type of data searched over Type of searches done Type of searchers doing search

    Web queries are short This probably means people are often using search

    engines to find starting points Once at a useful site, they must follow links or usesite search

    Web search ranking combines many features

  • 7/30/2019 Information Organization - How Search Engines Work

    3/41

    What about Ranking? Lots of variation here

    Pretty messy in many cases Details usually proprietary and fluctuating

    Combining subsets of: Term frequencies Term proximities

    Term position (title, top of page, etc) Term characteristics (boldface, capitalized, etc) Link analysis information Category information Popularity information

    Most use a variant of vector space ranking tocombine these

    Heres how it might work: Make a vector of weights for each feature

    Multiply this by the counts for each feature

  • 7/30/2019 Information Organization - How Search Engines Work

    4/41

    From description of the NorthernLight search engine, by Mark Krellensteinhttp://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm

  • 7/30/2019 Information Organization - How Search Engines Work

    5/41

    High-Precision Ranking

    Proximity search can help get high-precision results if > 1 term Hearst 96 paper:

    Combine Boolean and passage-level proximity

    Proves significant improvements whenretrieving top 5, 10, 20, 30 documents

    Results reproduced by Mitra et al. 98 Google uses something similar

  • 7/30/2019 Information Organization - How Search Engines Work

    6/41

    Boolean Formulations, Hearst 96

    Results

  • 7/30/2019 Information Organization - How Search Engines Work

    7/41

    Spam

    Email Spam: Undesired content

    Web Spam: Content is disguised as something it isnot, in order to Be retrieved more often than it otherwise

    would Be retrieved in contexts that it otherwise

    would not be retrieved in

  • 7/30/2019 Information Organization - How Search Engines Work

    8/41

    Web Spam

    What are the types of Web spam? Add extra terms to get a higher ranking Repeat cars thousands of times

    Add irrelevant terms to get more hits Put a dictionary in the comments field

    Put extra terms in the same color as the backgroundof the web page

    Add irrelevant terms to get different types ofhits

    Put sex in the title field in sites that are sellingcars

    Add irrelevant links to boost your link analysisranking

    There is a constant arms race between

    web search companies and spammers

  • 7/30/2019 Information Organization - How Search Engines Work

    9/41

    Commercial Issues

    General internet search is oftencommercially driven Commercial sector sometimes hides things

    harder to track than research On the other hand, most CTOs for search

    engine companies used to be researchers, andso help us out

    Commercial search engine information changes

    monthly Sometimes motivations are commercial rather

    than technical Goto.com uses payments to determine ranking order iwon.com gives out prizes

  • 7/30/2019 Information Organization - How Search Engines Work

    10/41

    Web Search Architecture

  • 7/30/2019 Information Organization - How Search Engines Work

    11/41

    Web Search Architecture

    Preprocessing Collection gathering phase

    Web crawling

    Collection indexing phase

    Online Query servers

    This part not talked about in thereadings

  • 7/30/2019 Information Organization - How Search Engines Work

    12/41

    From description of the FAST search engine, by Knut Risvikhttp://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

  • 7/30/2019 Information Organization - How Search Engines Work

    13/41

    Standard Web Search Engine Architecture

    crawl theweb

    create aninvertedindex

    Check for duplicates,store thedocuments

    Inverted

    index

    Search

    engine

    servers

    user

    query

    Show resultsTo user

    DocIds

  • 7/30/2019 Information Organization - How Search Engines Work

    14/41

    More detailedarchitecture,from Brin &Page 98.

    Only covers thepreprocessing indetail, not thequery serving.

  • 7/30/2019 Information Organization - How Search Engines Work

    15/41

    Inverted Indexes for Web Search Engines

    Inverted indexes are still used, eventhough the web is so huge

    Some systems partition the indexes across

    different machines; each machine handlesdifferent parts of the data

    Other systems duplicate the data acrossmany machines; queries are distributedamong the machines

    Most do a combination of these

  • 7/30/2019 Information Organization - How Search Engines Work

    16/41

    From description of the FAST search engine, by Knut Risvik

    http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

    In this example, the data

    for the pages is

    partitioned across

    machines. Additionally,each partition is allocated

    multiple machines to

    handle the queries.

    Each row can handle 120

    queries per second

    Each column can handle

    7M pages

    To handle more queries,

    add another row.

  • 7/30/2019 Information Organization - How Search Engines Work

    17/41

    Cascading Allocation of CPUs

    A variation on this that produces acost-savings: Put high-quality/common pages on many

    machines Put lower quality/less common pages on

    fewer machines Query goes to high quality machines

    first If no hits found there, go to other

    machines

  • 7/30/2019 Information Organization - How Search Engines Work

    18/41

    Web Crawlers

    How do the web search engines get allof the items they index?

    Main idea: Start with known sites

    Record information for these sites

    Follow the links from each site

    Record information found at new sites

    Repeat

  • 7/30/2019 Information Organization - How Search Engines Work

    19/41

    Web Crawlers

    How do the web search engines get all ofthe items they index?

    More precisely:

    Put a set of known sites on a queue Repeat the following until the queue is empty: Take the first page off of the queue If this page has not yet been processed:

    Record the information found on this page

    Positions of words, links going out, etc Add each link on the current page to the queue Record that this page has been processed

    In what order should the links be followed?

  • 7/30/2019 Information Organization - How Search Engines Work

    20/41

    Page Visit Order

    Animated examples of breadth-first vs depth-first search on trees:

    http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

    Structure to be traversed

    http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.htmlhttp://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
  • 7/30/2019 Information Organization - How Search Engines Work

    21/41

    Page Visit Order

    Animated examples of breadth-first vs depth-first search on trees:http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

    Breadth-first search(must be in presentation mode to see this animation)

    http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.htmlhttp://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
  • 7/30/2019 Information Organization - How Search Engines Work

    22/41

    Page Visit Order

    Animated examples of breadth-first vs depth-first search on trees:http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

    Depth-first search(must be in presentation mode to see this animation)

    http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.htmlhttp://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
  • 7/30/2019 Information Organization - How Search Engines Work

    23/41

    Page Visit Order

    Animated examples of breadth-first vs depth-first search on trees:http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

    D th Fi t C li

    http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.htmlhttp://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
  • 7/30/2019 Information Organization - How Search Engines Work

    24/41

    Depth-First Crawling(more complex graphs & sites)

    Page 1

    Page 3Page 2

    Page 1

    Page 2

    Page 1

    Page 5

    Page 6

    Page 4

    Page 1

    Page 2

    Page 1

    Page 3

    Site 6

    Site 5

    Site 3

    Site 1 Site 2

    Site Page

    1 1

    1 2

    1 4

    1 6

    1 3

    1 5

    3 1

    5 1

    6 1

    5 2

    2 1

    2 2

    2 3

  • 7/30/2019 Information Organization - How Search Engines Work

    25/41

    Breadth First Crawling(more complex graphs & sites)

    Page 1

    Page 3Page 2

    Page 1

    Page 2

    Page 1

    Page 5

    Page 6

    Page 4

    Page 1

    Page 2

    Page 1

    Page 3

    Site 6

    Site 5

    Site 3

    Site 1 Site 2

    Site Page

    1 1

    2 1

    1 2

    1 6

    1 3

    2 2

    2 3

    1 4

    3 1

    1 5

    5 1

    5 26 1

  • 7/30/2019 Information Organization - How Search Engines Work

    26/41

    Web Crawling Issues Keep out signs

    A file called norobots.txt tells the crawler whichdirectories are off limits

    Freshness Figure out which pages change often

    Recrawl these often Duplicates, virtual hosts, etc

    Convert page contents with a hash function Compare new pages to the hash table

    Lots of problems Server unavailable Incorrect html Missing links Infinite loops

    Web crawling isdifficult

    to do robustly!

  • 7/30/2019 Information Organization - How Search Engines Work

    27/41

    Cha-Cha

    Cha-cha searches an intranet Sites associated with an organization

    Instead of hand-edited categories Computes shortest path from the root

    for each hit

    Organizes search results according to

    which subdomain the pages are found in

  • 7/30/2019 Information Organization - How Search Engines Work

    28/41

    Cha-Cha Web Crawling Algorithm

    Start with a list of servers to crawl for UCB, simply start with www.berkeley.edu

    Restrict crawl to certain domain(s) *.berkeley.edu

    Obey No Robots standard Follow hyperlinks only

    do not read local filesystems links are placed on a queue

    traversal is breadth-first

    See first lecture or the technical papers formore information

  • 7/30/2019 Information Organization - How Search Engines Work

    29/41

    Summary

    Web search differs from traditional IRsystems Different kind of collection Different kinds of users/queries Different economic motivations

    Ranking combines many features in adifficult-to-specify manner

    Link analysis and proximity of terms seemsespecially important This is in contrast to the term-frequency

    orientation of standard search Why?

  • 7/30/2019 Information Organization - How Search Engines Work

    30/41

    Summary (cont.)

    Web search engine archicture Similar in many ways to standard IR

    Indexes usually duplicated across

    machines to handle many queries quickly

    Web crawling Used to create the collection

    Can be guided by quality metrics Is very difficult to do robustly

  • 7/30/2019 Information Organization - How Search Engines Work

    31/41

    Web Search Statistics

  • 7/30/2019 Information Organization - How Search Engines Work

    32/41

    Information from searchenginewatch.com

    Searchesper Day

    Info missing

    For fast.com,

    Excite,

    Northernlight,

    etc.

  • 7/30/2019 Information Organization - How Search Engines Work

    33/41

    Information from searchenginewatch.com

    Web

    SearchEngineVisits

  • 7/30/2019 Information Organization - How Search Engines Work

    34/41

    Information from searchenginewatch.com

    Percentageof web

    users whovisit the

    site shown

  • 7/30/2019 Information Organization - How Search Engines Work

    35/41

    Information from searchenginewatch.com

    SearchEngine

    Size

    (July2000)

  • 7/30/2019 Information Organization - How Search Engines Work

    36/41

    Information from searchenginewatch.com

    Does sizematter?

    You cant

    accessmany hitsanyhow.

  • 7/30/2019 Information Organization - How Search Engines Work

    37/41

    Information from searchenginewatch.com

    Increasingnumbers

    of indexed

    pages,self-reported

  • 7/30/2019 Information Organization - How Search Engines Work

    38/41

    Information from searchenginewatch.com

    Increasingnumbers

    of indexedpages

    (morerecent)self-

    reported

  • 7/30/2019 Information Organization - How Search Engines Work

    39/41

    Information from searchenginewatch.com

    Web

    Coverage

  • 7/30/2019 Information Organization - How Search Engines Work

    40/41

    From description of the FAST search engine, by Knut Risvikhttp://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

  • 7/30/2019 Information Organization - How Search Engines Work

    41/41

    Directory

    sizes