different components of a crawlable search engine

Different Components of A Crawlable Search Engine BY PROMPTCLOUD

Upload: promptcloud

Post on 13-Jan-2017

WHY SEARCH ENGINES?

Search engines act as a powerful magnet for finding a tiny needle in a haystack.

IMPORTANCE OF SEARCH ENGINES

Over the years, search engines have dramatically increased the usability of the Web.

DIFFERENT TYPES OF SEARCH ENGINES

TYPE ➢ EXAMPLE

❏ CRAWLER-BASED SEARCH ENGINES ➢ GOOGLE

❏ HUMAN-POWERED DIRECTORIES ➢ YAHOO

❏ HYBRID SEARCH ENGINES ➢ GOOGLE AND YAHOO

❏ META SEARCH ENGINES ➢ DOGPILE

REAL FACTS ABOUT CRAWLABLE SEARCH ENGINES

1. Before September 1993, the World Wide Web was indexed entirely by hand.

2. The first list of web servers was edited by Tim Berners-Lee and hosted on the CERN web server.

3. In 1993, Matthew Gray produced what was probably the first web robot, the World Wide Web Wanderer, and used it to generate an index called ‘Wandex’.

Image Credit: Agronet

PHYSICAL ARCHITECTURAL COMPONENTS


❏ URL SERVER: Provides the crawler with a list of URLs to fetch.

❏ CRAWLER: Automatically traverses the web, downloading pages and following links from page to page.

❏ STORE SERVER: Stores the downloaded web pages.

❏ BARREL: Stores the documents processed by the indexer, in minute detail.

❏ SORTER: Re-sorts the contents of the barrels by Word ID to generate the inverted index.

❏ ANCHOR FILE: Holds each link's source, destination and text.
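
How the URL server, crawler, store server and anchor file hand work to one another can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FAKE_WEB`, `LinkParser` and `crawl` are hypothetical names, the "web" is an in-memory dictionary rather than real HTTP, and politeness rules such as robots.txt are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. Stands in for real HTTP fetches.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">Page A</a> <a href="/b">Page B</a>',
    "http://example.test/a": '<a href="/b">See B</a>',
    "http://example.test/b": '<a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed_urls, fetch=FAKE_WEB.get):
    url_server = deque(seed_urls)   # URL server: queue of URLs to fetch
    seen = set(seed_urls)
    store = {}                      # store server: URL -> downloaded page
    anchors = []                    # anchor file: (source, destination, text)
    while url_server:
        url = url_server.popleft()
        html = fetch(url)           # crawler: download the page
        if html is None:            # unknown URL in the fake web: skip it
            continue
        store[url] = html
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            dest = urljoin(url, href)
            anchors.append((url, dest, text))
            if dest not in seen:    # follow links from page to page
                seen.add(dest)
                url_server.append(dest)
    return store, anchors


store, anchors = crawl(["http://example.test/"])
```

The queue makes the traversal breadth-first, and the `seen` set keeps the crawler from fetching the same page twice. Barrels and the sorter are not modelled here.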

MAJOR DATA STRUCTURAL COMPONENTS - 1

❏ BIG FILES: Virtual files that span multiple file systems.

❏ REPOSITORY: Contains the full HTML of every page, in compressed form.

❏ DOCUMENT INDEX: A simple index sorted by Doc ID; it helps to create the forward index and the anchor file.

❏ LEXICON: The search engine's dictionary; it contains the word list.
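
A minimal sketch of the repository, document index and lexicon, assuming two made-up pages and using `zlib` as a stand-in compressor. The names `pages`, `repository`, `document_index`, `lexicon` and `load_page` are hypothetical, the metadata kept per document is an assumption, and big files are not modelled.

```python
import re
import zlib

# Hypothetical downloaded pages: URL -> HTML.
pages = {
    "http://example.test/a": "<html><body>web crawler basics</body></html>",
    "http://example.test/b": "<html><body>crawler and index</body></html>",
}

repository = {}      # doc ID -> compressed full HTML
document_index = {}  # doc ID -> metadata (here: URL and stored size)
lexicon = {}         # word -> word ID

for doc_id, (url, html) in enumerate(sorted(pages.items())):
    compressed = zlib.compress(html.encode("utf-8"))
    repository[doc_id] = compressed
    document_index[doc_id] = {"url": url, "stored_bytes": len(compressed)}
    text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        lexicon.setdefault(word, len(lexicon))    # assign next free word ID


def load_page(doc_id):
    """Return the original HTML for a doc ID from the repository."""
    return zlib.decompress(repository[doc_id]).decode("utf-8")
```

Looking a page up by Doc ID goes through `document_index` for its metadata and through `repository` for its content; the lexicon only maps words to compact integer IDs, which the indexes can then use in place of strings.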

MAJOR DATA STRUCTURAL COMPONENTS - 2

❏ HIT LIST: Records the occurrences of a particular word in a document, including its position.

❏ FORWARD INDEX: Stores partially sorted words for each document and holds the anchor text for the corresponding Doc ID.

❏ INVERTED INDEX: Produced by the sorter, which rearranges the entries from Doc ID order into Word ID order.
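
The relationship between hit lists, the forward index and the inverted index can be sketched as follows. The two documents and the names `docs`, `forward_index`, `postings`, `inverted_index` and `search` are hypothetical; a hit here is just a word position, and anchor text is left out for brevity.

```python
import re
from collections import defaultdict

# Hypothetical documents, already stripped of markup: doc ID -> text.
docs = {
    0: "web crawler basics",
    1: "crawler and index and crawler",
}

lexicon = {}        # word -> word ID
forward_index = {}  # doc ID -> {word ID: hit list of positions}

for doc_id in sorted(docs):
    hits = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", docs[doc_id].lower())
    for position, word in enumerate(words):
        word_id = lexicon.setdefault(word, len(lexicon))
        hits[word_id].append(position)   # hit list: where the word occurs
    forward_index[doc_id] = dict(hits)

# Sorter step: flatten to (word ID, doc ID, hit list) and sort by word ID.
postings = sorted(
    (word_id, doc_id, hit_list)
    for doc_id, hits in forward_index.items()
    for word_id, hit_list in hits.items()
)

inverted_index = defaultdict(list)  # word ID -> [(doc ID, hit list), ...]
for word_id, doc_id, hit_list in postings:
    inverted_index[word_id].append((doc_id, hit_list))


def search(word):
    """Return [(doc ID, positions)] for a single word, or [] if unknown."""
    word_id = lexicon.get(word.lower())
    if word_id is None:
        return []
    return inverted_index.get(word_id, [])
```

With this data, `search("crawler")` finds the word at position 1 of document 0 and at positions 0 and 4 of document 1. The forward index answers "which words are in this document", while the re-sorted inverted index answers "which documents contain this word", which is the question a query asks.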
