how search engines work

How Search Engines Work Presentation by Chinna

Upload: chinna-botla

Post on 12-Nov-2014


Page 1: How search engines work

How Search Engines Work

Presentation by

Chinna

Page 2: How search engines work

What Is a Search Engine

A search engine is a software program that searches for sites based on the words that you designate as search terms.

"Search engine" is the popular term for an Information Retrieval (IR) system.

Page 3: How search engines work

Motto of search engines

A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list, often referred to as SERPs, or "search engine results pages". The results may consist of web pages, images, and other types of files.

Page 4: How search engines work

Purpose of Search Engines

Helping people find what they’re looking for
• Starts with an "information need"
• Convert to a query
• Gets results

In the materials available
• Web pages
• Other formats
• Deep Web

Page 5: How search engines work

HISTORY

Archie – First search tool for the Internet

Gopher – indexed plain text documents

Jughead – searched the files stored in Gopher index systems

Wandex – First Web search engine

Page 6: How search engines work

How web search engines work

A search engine operates in the following order:

• Web crawling
• Indexing
• Searching
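The three stages can be sketched end to end on a toy example. This is a minimal illustration, not how any production engine is built: the `PAGES` dictionary is a made-up in-memory stand-in for the web, and the function names are assumptions chosen for this sketch.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.html": ("search engines crawl the web", ["b.html", "c.html"]),
    "b.html": ("an index maps words to pages", ["a.html"]),
    "c.html": ("users search the index", []),
}

def crawl(start):
    """Stage 1: follow links breadth-first and collect page text."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

def build_index(fetched):
    """Stage 2: map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Stage 3: look the query word up in the index."""
    return sorted(index.get(word.lower(), set()))

index = build_index(crawl("a.html"))
print(search(index, "index"))   # ['b.html', 'c.html']
```

Searching only touches the index, never the pages themselves, which is why crawling and indexing have to happen first.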

Page 7: How search engines work

How Do Search Engines Work

Spiders

Robots

Page 8: How search engines work

Search is Not a Panacea

Search can’t find what’s not there
• The content is hugely important

Information Architecture is vital
Usable sites have good navigation and structure

Page 9: How search engines work

Search Engine Modules

A query processor
A search and matching function
A ranking capability
Summarizing and presenting documents

Page 10: How search engines work

Search Engines' Mode of Working in Earlier Days

From 1990-1998 (1st generation of search tools):
• Looked at title of web pages
• Ranking was based on page content
• Looked at number of times the search term appeared on the page
• Looked at metatags
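A first-generation ranking of this kind can be sketched with on-page signals only. The weights below are arbitrary assumptions for illustration, and `first_generation_score` is a hypothetical name, not a reconstruction of any real engine's formula.

```python
def first_generation_score(term, title, body, meta_keywords):
    """Score one page for one search term using only on-page signals."""
    term = term.lower()
    score = 0.0
    if term in title.lower().split():
        score += 3.0                            # assumed weight: title match
    score += body.lower().split().count(term)   # occurrences in the page text
    if term in (k.lower() for k in meta_keywords):
        score += 2.0                            # assumed weight: metatag match
    return score

print(first_generation_score(
    "cricket",
    title="Cricket news",
    body="cricket scores and cricket stats",
    meta_keywords=["sports", "cricket"],
))   # 7.0
```

Because every signal comes from the page itself, a page author can raise the score simply by repeating the term, which is one reason later engines added link analysis.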

Page 11: How search engines work

SEO (Search Engine Optimization)

Used by companies to get a higher ranking in search engine results

White hat: Using legitimate techniques

Black hat: Using techniques that violate search engine guidelines to trick the search engine, like paying sites to link to you.

Page 12: How search engines work

Search Processing

Page 13: How search engines work

Search is Only as Good as the Content

Users blame the search engine
• Even when the content is unavailable

Understand the scope of site or intranet
• Kinds of information
• Divided sites: products / corporate info
• Dates
• Languages
• Sources and data silos: databases...
• Update processes

Page 14: How search engines work

Making a Searchable Index

Store text to search it later
Many ways to gather text
• Crawl (spider) via HTTP
• Read files on file servers
• Access databases (HTTP or API)
• Data silos via local APIs
• Applications, CMSs, via Web Services

Security and Access Control
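Whichever way the text is gathered, it has to be separated from markup before it is stored. The sketch below shows one way to do that for a crawled HTML page using Python's standard `html.parser` module. The class and function names are assumptions for this example, and fetching, access control, and error handling are left out.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect visible words and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.words, self.links = [], []
        self._skip = 0   # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)   # candidates for the crawl queue

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.split())

def extract(html):
    """Return (text to store in the index, links to crawl next)."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.words), parser.links

page = ('<html><head><title>Home</title><style>p{}</style></head>'
        '<body><p>Hello <a href="/about">about us</a></p></body></html>')
print(extract(page))   # ('Home Hello about us', ['/about'])
```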

Page 15: How search engines work

Robot Indexing Diagram

Source: James Ghaphery, VCU

Page 16: How search engines work

What the Index Needs

Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type

Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
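One way to picture such an index entry is as a small record type. The field names and sample values below are illustrative assumptions, not a schema from any particular search product.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRecord:
    """One entry in a search index (field names are illustrative)."""
    record_id: str        # file name, URL, or record ID
    title: str            # title or equivalent
    size_bytes: int
    modified: str         # date, here as an ISO 8601 string
    mime_type: str
    full_text: str        # the text that will be searched
    metadata: dict = field(default_factory=dict)   # extra attributes

record = IndexRecord(
    record_id="https://example.com/products/42",
    title="Trail Running Shoe",
    size_bytes=18432,
    modified="2014-11-12",
    mime_type="text/html",
    full_text="Lightweight trail running shoe with a grippy sole.",
    metadata={"product_name": "Trail Runner", "category": "Footwear"},
)
print(record.title, record.metadata["category"])
```

Keeping the extra attributes in a separate `metadata` mapping lets different content sources contribute different fields without changing the basic record.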

Page 17: How search engines work

Simple Index Diagram

Page 18: How search engines work

Index Issues

Stopwords
Stemming
Metadata
• Explicit (tags)
• Implicit (context)

Semantics
• CMS and Database fields
• XML tags and attributes
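Stopword removal and stemming can be illustrated in a few lines. The stopword list and the suffix-stripping rule here are deliberately crude assumptions. Real indexers typically use a fuller stopword list and an established stemming algorithm such as Porter's.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}   # tiny sample list
SUFFIXES = ("ing", "ed", "es", "s")                             # naive rule set

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, split into words, drop stopwords, stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

print(index_terms("The spiders crawled and indexed the links"))
# ['spider', 'crawl', 'index', 'link']
```

The same function has to be applied to queries as well as documents, otherwise a search for "crawling" would never meet the stored term "crawl".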

Page 19: How search engines work

Search Query Processing

What happens after you click the search button, and before retrieval starts.

Usually in this order
• Handle character set, maybe language
• Look for operators and organize the query
• Look for field names or metadata
• Extract words (just like the indexer)
• Deal with letter casing
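The steps above can be sketched as a small query parser. The query syntax assumed here (uppercase `AND`/`OR`/`NOT` operators, `field:value` pairs, double-quoted phrases) is a common convention but is an assumption of this example, as is the function name.

```python
import re
import unicodedata

OPERATORS = {"AND", "OR", "NOT"}   # assumed operator syntax

def process_query(raw):
    """Turn a raw query string into phrases, fields, operators, and terms."""
    # 1. Character set: normalize Unicode so equivalent forms compare equal.
    text = unicodedata.normalize("NFKC", raw)
    parsed = {"phrases": [], "fields": {}, "operators": [], "terms": []}
    # 2. Operators: pull out quoted phrases first, then scan the remaining tokens.
    parsed["phrases"] = [p.lower() for p in re.findall(r'"([^"]+)"', text)]
    text = re.sub(r'"[^"]*"', " ", text)
    for token in text.split():
        if token in OPERATORS:
            parsed["operators"].append(token)
        elif ":" in token and not token.startswith(":"):
            # 3. Field names: "title:engines" restricts the match to one field.
            name, _, value = token.partition(":")
            parsed["fields"][name.lower()] = value.lower()
        else:
            # 4 and 5. Extract words and fold letter case, like the indexer.
            parsed["terms"].extend(re.findall(r"\w+", token.lower()))
    return parsed

print(process_query('title:Engines "web crawler" AND Index'))
```

Operators are checked before lowercasing so that the word "and" in ordinary text is not mistaken for the `AND` operator.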

Page 20: How search engines work

Search and Retrieval

Retrieval: find files with query terms
Not the same as relevance ranking

Recall: find all relevant items

Precision: find only relevant items

Increasing one usually decreases the other
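Recall and precision can be computed directly once the set of relevant items is known. The document IDs below are made up. In practice the relevant set comes from human judgments.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}

# A narrow search returns few items, all of them relevant.
print(precision_recall(["d1", "d2"], relevant))                          # (1.0, 0.5)
# A broad search finds more relevant items but also more noise.
print(precision_recall(["d1", "d2", "d3", "d5", "d6", "d7"], relevant))  # (0.5, 0.75)
```

The two calls show the trade-off: broadening the search raised recall from 0.5 to 0.75 and lowered precision from 1.0 to 0.5.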

Page 21: How search engines work

Retrieval = Matching

Single-word queries
• Find items containing that word

Multi-word queries: combine lists
• Any: every item with any query word
• All: only items with every word
• Phrases: find only items with all words in order

Boolean and complex queries
• Use algorithm to combine lists
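Combining lists can be shown with a small positional index, where each word maps to the documents that contain it and the positions where it occurs. The three documents and the function names are assumptions for this sketch.

```python
DOCS = {
    1: "search engines index the web",
    2: "the web search results page",
    3: "engines power cars",
}

def build_positions(docs):
    """word -> {doc_id: [positions of the word in that document]}"""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

INDEX = build_positions(DOCS)

def match_any(words):
    """Any: union of the per-word document lists."""
    return set().union(*(INDEX.get(w, {}).keys() for w in words))

def match_all(words):
    """All: intersection of the per-word document lists."""
    lists = [set(INDEX.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(words):
    """Phrase: all words present, at consecutive positions."""
    hits = set()
    for doc_id in match_all(words):
        starts = INDEX[words[0]][doc_id]
        if any(all(p + i in INDEX[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.add(doc_id)
    return hits

print(match_any(["search", "engines"]))    # {1, 2, 3}
print(match_all(["search", "engines"]))    # {1}
print(match_phrase(["web", "search"]))     # {2}
```

Documents 1 and 2 both contain "web" and "search", but only document 2 has them next to each other in that order, so only it matches the phrase.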

Page 22: How search engines work

Why Searches Fail

Empty search
Nothing on the site on that topic (scope)
Misspelling or typing mistakes
Vocabulary differences
Restrictive search defaults
Restrictive search choices
Software failure
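Two of these failures, misspellings and vocabulary differences, can be softened before the search runs. The sketch below uses Python's standard `difflib.get_close_matches` for spelling and a hand-made synonym table for vocabulary. The word lists, the 0.8 cutoff, and the function name are assumptions for illustration.

```python
from difflib import get_close_matches

VOCABULARY = ["search", "engine", "index", "crawler", "ranking"]   # words in the index
SYNONYMS = {"spider": "crawler", "robot": "crawler"}              # hand-made table

def rescue_query_word(word):
    """Map a query word onto an indexed word, or return None if nothing fits."""
    word = word.lower()
    if word in VOCABULARY:
        return word                      # already searchable
    if word in SYNONYMS:
        return SYNONYMS[word]            # vocabulary difference
    close = get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else None   # likely misspelling, or no match

print(rescue_query_word("serch"))    # search
print(rescue_query_word("spider"))   # crawler
print(rescue_query_word("zebra"))    # None
```

A `None` result corresponds to the scope failure: nothing in the index is close to what the user asked for.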

Page 23: How search engines work

Relevance Ranking

Theory: sort the matching items, so the most relevant ones appear first

Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
• What do people mean when they type “bank”?
First 10 results are the most important

Page 24: How search engines work

Relevance Processing

Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
• Like old library Citation Indexes
• Not only hypertext, but the links
• Google PageRank
  • Incoming links
  • Authority of linkers

Taxonomies and external metadata
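The idea behind link analysis, that a page matters more when many authoritative pages link to it, can be sketched as a simplified PageRank-style iteration. This is a textbook-style approximation, not Google's production algorithm. The link graph is made up, and the damping factor of 0.85 is the commonly cited value.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to (all targets must be keys)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with equal rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages           # dead end: spread rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share         # authority flows along links
        rank = new_rank
    return rank

LINKS = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],    # links out, but nothing links to it
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page:7s} {score:.3f}")
```

In this graph "home" ends up with the highest score because every other page links to it, while "orphan" has the lowest because it has no incoming links.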

Page 25: How search engines work

Search Results Interface

What users see after they click the Search button

The most visible part of search
Elements of the results page
• Page layout and navigation
• Results header
• List of results items
• Results footer

Page 26: How search engines work

Search Suggestions

Human judgment beats algorithms
Great for frequent, ambiguous searches
• Use search log to identify best candidates
Recommend good starting pages
• Product information, FAQs, etc.

Requires human resources
• That means money and time

More static than algorithmic search

Page 27: How search engines work

Search Metrics

Number of searches
Number of matches searches
Traffic from search to high-value pages
Relate search changes to other metrics

Page 28: How search engines work

Query Example

Consider the query "Mahendra Singh Dhoni"

A good answer contains all three words, and the more frequently they appear, the better. We call this Term Frequency (TF).

Some query terms are more important than others because they have better discriminating power.

For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra". We call this Inverse Document Frequency (IDF).
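The query example can be worked through in code. The sketch uses one common TF-IDF variant (raw term count multiplied by log(N / document frequency)). The four-document corpus is invented so that "mahendra" is common and "dhoni" is rare.

```python
import math

DOCS = {
    "d1": "mahendra singh dhoni led india dhoni kept wickets",
    "d2": "mahendra is a common first name",
    "d3": "singh and mahendra attended the match",
    "d4": "the weather in ranchi was clear",
}

def tf_idf_score(query, doc_id, docs):
    """Sum of TF * IDF over the query terms for one document."""
    words = docs[doc_id].split()
    score = 0.0
    for term in query.lower().split():
        tf = words.count(term)                                      # term frequency
        df = sum(1 for text in docs.values() if term in text.split())
        if tf and df:
            score += tf * math.log(len(docs) / df)                  # rarer terms weigh more
    return score

query = "Mahendra Singh Dhoni"
ranking = sorted(DOCS, key=lambda d: -tf_idf_score(query, d, DOCS))
print(ranking)   # ['d1', 'd3', 'd2', 'd4']
```

Here "dhoni" appears in one document out of four and "mahendra" in three, so a match on "dhoni" contributes log(4/1), about 1.39, while a match on "mahendra" contributes only log(4/3), about 0.29.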

Page 29: How search engines work

Search Will Never Be Perfect

Search engines can’t read minds
• User queries are short and ambiguous

Some things will help
• Design a usable interface
• Show match words in context
• Keep index current and complete
• Adjust heuristic weighting
• Maintain suggestions and synonyms
• Consider faceted metadata search
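Showing match words in context usually means cutting a short excerpt around the first hit and marking the matched word. The function below is a minimal sketch. The square-bracket highlighting and the `width` parameter are assumptions of this example.

```python
import re

def snippet(text, term, width=30):
    """Return an excerpt around the first match of term, or None if absent."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if not match:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""          # excerpt starts mid-text
    suffix = "..." if end < len(text) else ""    # excerpt ends mid-text
    hit = text[match.start():match.end()]
    return f"{prefix}{text[start:match.start()]}[{hit}]{text[match.end():end]}{suffix}"

print(snippet("Search engines can't read minds", "read", width=6))
# ...can't [read] minds
```

A results page would normally replace the brackets with bold text or another visual highlight.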