
Chapter-2 : Literature survey and scope of research

Designing Model for Meta-Search Engine

CHAPTER - 2

Literature survey and scope of research

2.1 INTRODUCTION

This chapter reviews prior studies covering early initiatives in web search, search engines, search engine optimization techniques, limitations of existing search engines, meta-search engines, meta-search engine optimization techniques, the difference between search engines and meta-search engines, limitations of existing meta-search engines, the need for a new model of meta-search engine, and the scope of research in meta-search engines for efficient retrieval of specific information.

2.2 LITERATURE SURVEY

2.2.1 History of the web surfing for web search

The roots of web search engine technology are in Information Retrieval (IR)

systems, which can be traced back to the work of Luhn at IBM during the late

1950s. IR has been an active field within information science, and has been given a

big boost since the 1990s with the new requirements that the Web has brought.

[11]

Many methods used by current search engines can be traced back to the

developments in IR during the 1970s and 1980s. Especially influential is the

SMART (System for the Mechanical Analysis and Retrieval of Text) retrieval

system, initially developed by Gerard Salton and his collaborators at Cornell

University during the early 1970s. [11]

Prior to 1990, there was no tool for searching the Web. At that time there were only a small number of sites, and most of them contained collections of files that users could download. A user had no systematic way to find out that a file was on a specific site. Then came a tool called Archie. It was the first program to search the contents of sites all over the world. It was not actually a search engine in the modern sense; rather, like Yahoo, it searched a list, in this case a list of files. An information seeker needed to know the exact name of the file he/she was looking for. Given that information, Archie would advise from which site the file could be downloaded.

2.2.2 Initiative of search engine development

The students Alan Emtage, Peter J. Deutsch, and Bill Heelan at McGill University in Montreal, Canada, produced the first search engine in 1990. The tool was called Archie, short for archive. The program searched the names of files rather than the contents of individual pages.

If Archie was the grandfather of all search engines, then Veronica was the grandmother. Developed by the University of Nevada Computing Services, it searched Gopher servers for files. A Gopher server stores plain-text documents, while an FTP server also stores other kinds of files (images, programs, etc.). Jughead performed functions similar to Veronica. [59]

By 1993, the Web was beginning to change. Rather than being populated mainly by FTP sites, Gopher sites, and e-mail servers, it saw web sites begin to grow in number. In response to this change, Matthew Gray introduced the World Wide Web Wanderer. The program was a series of robots that hunted down web URLs and listed them in a database called Wandex. [59]

Again around 1993, ALIWEB was developed as the web page equivalent of Archie and Veronica. Instead of files or text documents being cataloged automatically, webmasters would submit a special index file with site information. [59]

The next development in cataloging the web came late in 1993 with spiders.

Like robots, spiders scoured the web for web page information. These early

versions looked at the titles of the web pages, the header information, and

the URL as a source for key words. The database techniques used by these

early search engines were primitive.

For instance, a search would return hits (lists of URLs / links) in the order in which they were stored in the database. Only one of these search engines made an effort to rank the hits according to the website’s relationship to the key words.

The first popular search engine, Excite, has its roots in these early days of

web cataloging. The Excite project was begun by a group of Stanford

undergraduates. It was released for general use in 1994. [59]

One of the earliest search engines to be built was Lycos, founded in January 1994, operational in June 1994, and a publicly traded company from April 1996. Lycos was born from a research project at Carnegie Mellon University led by Dr. Michael Mauldin. [59]

Again in 1994, two Stanford Ph.D. students posted web pages with links on

them. They called these pages Yahoo!. As the number of links began to grow,

they developed a hierarchical listing. As the pages became more popular,

they developed a way to search through all of the links. Yahoo! became the

first popular searchable directory. It was not considered a search engine

because all the links on the pages were updated manually rather than

automatically by spider or robot and the search feature searched only those

links. [59]

Another search engine, WebCrawler, went online in spring 1994. It was also

started as a research project, at the University of Washington, by Brian

Pinkerton. [19]

The first full-text search was WebCrawler. WebCrawler began as an

undergraduate seminar project at the University of Washington. It became

so popular that it virtually shut down the University of Washington's

network because of the amount of traffic it generated. Eventually, AOL

bought it and operated it on their own network. Later, Excite bought

WebCrawler from AOL but AOL still uses it in their NetFind feature. At

Home Corp. currently owns Webcrawler (as well as Excite and Blue

Mountain Cards). [59]



The next search engine to appear on the web was Lycos. It was named for

the wolf spider (Lycosidae lycosa) because the wolf spider pursues its prey.

According to Michael Mauldin in "Lycos: Design choices in an Internet search

service" (1997), by 1997, Lycos had indexed more than 60,000,000 web

pages and ranked 1st on Netscape's list of search engines. [59]

The next major player in the search engine wars was

Infoseek. The Infoseek search engine itself was unremarkable and showed

little innovation beyond Webcrawler and Lycos. What made this search

engine stand out was its deal with Netscape to become the browser's default

search engine replacing Yahoo. [59]

By 1995, Digital Equipment Corporation (DEC) introduced AltaVista. This

search engine contained some innovations that set it apart from the others.

First, it ran on a group of DEC Alpha-based computers. At the time, these

were among the most powerful processors in existence. This meant that the

search engine could run even with very high traffic, hardly slowing down.

(The DEC Alpha processor ran a version of UNIX. From its inception, UNIX

had been designed for such heavy multi-use loads.) It also featured the

ability for the user to ask a question rather than enter key words. This

innovation made it easier for the average user to find the results needed. It

was also the first to implement the use of Boolean operators (and, or, but,

not) to help in refining searches. [59]

Next came HotBot, a project from the University of California at Berkeley,

designed as the most powerful search engine. [59] Hotbot was owned by

Wired, had funky colors, fast results, and a cool name that sounded geeky,

but died off not long after Lycos bought it and ignored it. [60] Its current

owner, Wired Magazine, claims that it can index more than 10,000,000 pages

a day. Wired claims that HotBot should be able to update its entire index

daily making it contain the most up-to-date information of any major search

engine. [59]

http://www.hotbot.com/



Google, developed at Stanford University around 1998, used the concept of link popularity and PageRank as its main ranking algorithm. Yahoo, launched in 1994 by Stanford students, started out as a listing of personal favourite websites with the URL and a description of each page. MSN Search is a search engine owned by Microsoft; launched in 1999, it was powered by results from Looksmart and Inktomi until 2004, after which it used its own crawler-based index. [59]

Since the birth of the modern Internet in the early 1990s, the need for IR has led to the growth, dominance and decline of various search engines like Wandex, Aliweb, Excite, Webcrawler, Lycos, AltaVista, Inktomi, AskJeeves and Northern Light. [15]

However, the majority (80%) of Internet users are hooked on to three search engines: Google, Yahoo and MSN Search. [15]

2.2.3 Brief about working of search engine

Search engines are tools designed to search for information on the Web. Search results are generally displayed in a vertical sequence on what are often referred to as search results pages. The links / URLs available on those pages are referred to as hits.

A search engine basically works in steps such as web crawling, indexing and searching. It works by storing information about web pages, which it retrieves from the Web, in databases.

A search engine takes advantage of the hyperlinks that connect Web sites on

the Internet. A software program called a Web crawler automatically browses

the Web in a systematic way and sends out inquiries that “crawl” from site

to site. [2]

Since a crawler is a software program, it can be given different instructions on different computers. For instance, WebCrawler, a program launched in 1994, was the first software to index entire web sites rather than just page titles. Search engine crawlers operate within different sets of instructions or parameters, such as searching titles and first paragraphs only, or searching entire documents, including metadata. [2]

The information the crawler software collects is automatically put into an index, and a query submitted to a search engine is run against that search engine’s index. Each search engine has its own index. Thus the index searched by Google is not the same as the one searched by Yahoo or MSN (Microsoft Network). [2]
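The crawling, indexing and searching steps described above can be illustrated with a minimal sketch. It is not the implementation of any real search engine: the in-memory TOY_WEB dictionary, the example URLs and the function names are all assumptions made for illustration, and the "crawler" only follows links inside that dictionary instead of fetching pages over the network.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("meta search engine survey", ["a.example/rank", "b.example/ir"]),
    "a.example/rank": ("ranking of search results", ["a.example/home"]),
    "b.example/ir": ("information retrieval and search", []),
}

def crawl(start_url, web):
    """Follow hyperlinks from start_url; return {url: text} for every reachable page."""
    seen, frontier, pages = set(), [start_url], {}
    while frontier:
        url = frontier.pop(0)
        if url in seen or url not in web:
            continue
        seen.add(url)
        text, links = web[url]
        pages[url] = text
        frontier.extend(links)
    return pages

def build_index(pages):
    """Inverted index: keyword -> set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs (hits) that contain every keyword of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

index = build_index(crawl("a.example/home", TOY_WEB))
print(search(index, "search results"))
```

Because each engine builds its index from its own crawl, two engines running this kind of pipeline from different starting points or with different crawl rules would answer the same query from different indexes.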

2.2.4 Optimization techniques used by existing search engines

The aim of SEO (Search Engine Optimization) is to achieve a higher position for links in organic listings. A set of techniques is used to move up to the top of search engine listings. SEO has conceptually expanded to include all likely ways of promoting web traffic. There are mainly two approaches to listing results on screen: first, organic (natural listing), and second, pay per click (paid listing).

Following are some examples of search engines that use Pay Per Click

Strategy [76]:

i. Google

The Google AdWords program places paid listings within Google's

search results, as well as on some other sites that carry its listings.

ii. Overture

Overture is the oldest major paid placement search engine. It

distributes its listings to a wide range of search engines, including

that of its owner, Yahoo. Overture launched as GoTo in 1997 and

incorporated the former University of Colorado-based World Wide Web

Worm. In February 1998, it shifted to its pay-for-placement model.

The company changed its name from GoTo to Overture in October

2001. It was purchased by Yahoo in October 2003.



Some more of the most popular paid search engines are listed below [77]:

i. Yahoo! Search Marketing is the oldest pay per click search engine

(formerly Overture), which produces relevant results. It incorporates

its paid listings into some of the major search engines.

ii. FindWhat.com is an important pay-per-click search engine with

results incorporated into many metacrawlers search results. Bids

start at $0.01.

iii. Kanoodle offers paid search listings with distribution to a large

network of other search engines and search box providers. Bids start

at $0.01.

iv. Sprinks: Pay-per-click searching service provided by About.com that

sends links to some meta search engines, and the Sprinks site itself.

v. Search123: Pay-per-click search engine that incorporates its paid

listings on the sites of some traffic partners.

vi. Xuppa is a paid search placement service with distribution on some

metacrawlers. Previously named as Bay9.

vii. Ah-ha.com powered by FAST Search, allows paid listings to appear at

the top of its results.

viii. ePilot.com is a pay-per-click search engine which distributes its

results on many search partners. Bids start at $0.01.

ix. ValleyAlley: Pay-per-click searching service that sends links to some

meta crawlers. Bids start at $0.01.

x. Win4Win: A paid search engine, which provides top listing for

advertisers in its results.

xi. theInfoDepot: Pay-per-click search engine that uses the Open Directory

Project database.

xii. eFind.com allows advertisers to bid for the top of listing with search

results.

http://searchmarketing.yahoo.com/

http://www.findwhat.com/

http://www.kanoodle.com/

http://www.sprinks.com/

http://search123.com/

http://www.xuppa.com/

http://www.ah-ha.com/

http://www.epilot.com/

http://www.valleyalley.com/

http://www.win4win.com/

http://theinfodepot.com/

http://www.efind.com/



Organic search result listings appear as search results without the payment of a special charge to the search engine provider. The pay per click strategy, in contrast, exists to generate revenue for the search engine company.

On and Off Page Components of Search Engine Optimization

SEO is mainly achieved by the combination of two main factors, namely on page and off page factors. [75]

1. On page factors

On page optimization refers to the text and content on web site pages. It acts as the foundation for the ongoing SEO process. It works on the website and its content so that a search engine can find the web page when searching web sites for a particular keyword. This has a significant impact on search engine results. [75]

Some of the on page factors include:

i. Search engine friendly web page URLs in the site. The URLs of the inner pages of the site should consist of the domain followed by text describing the content. [75]

ii. Optimization of all meta tags which mainly includes title, keywords

and description. [75]

iii. Internal linking between the pages in the site. Internal linking must be done wherever required, in such a way that Google does not treat it as spam. The major pages of the site must be linked to the homepage. [75]

iv. Creation of sitemap is important so that all web pages are indexed by

search engines. [75]

v. Good quality content is liked by most of the search engines. Content should be information rich, relevant and inspiring, yet should not forget the spiders. [75]

vi. Make sure that html code is free of errors and warnings. [75]



Some of the on page factors to be avoided:

i. Hidden link or text: the use of hidden or invisible text to get listed on search engines, for example by using a font color similar to the background color, must be avoided, as it will affect page rankings. [75]

ii. Cloaking mechanisms: never show two different versions of the site, one for the search engines and a completely different one for real users. This puts the site at risk of being penalized. [75]

iii. Duplicate content is to be avoided. There is no substitute for unique, original and useful content. [75]

2. Off page factors

As the name indicates off page optimization is the work that needs to be

done off the pages of the website. [75]

Some of the off page factors include:

i. Use of anchor text in the links wherever required according to

relevancy. Also the text surrounding the links should not be

ignored.[75]

ii. Building quality links for link building purposes which include

relevance, page rank and authority sites. [75]

iii. Link popularity can be attained by using social networking sites, blog

commenting, forum postings, article/press release promotions,

directory submissions, link baiting, posting classified ads, link

exchange with relevant sites and so on.[75]

On page and off page factors are two different aspects of SEO efforts which

work towards getting qualified traffic which provides the path that leads

towards conversion. [75] Compared to off page factors, on page optimization

is relatively easy to achieve. [75]



Elements of Search Engine Optimization

The major elements of SEO are:

i. Keyword-rich text: Design matters to the search engines, so that they can access the keyword-rich text, and it matters to human visitors, so that they can easily find that keyword-rich text once they arrive at the site. [74]

ii. Site and page architecture: A robust site and page architecture plays a vital role in obtaining optimized results from search engines. [74]

iii. Link development: This is one of the most overlooked components of successful SEO. Link development can be defined as the collection of links to the site from other web sites in order to improve the search engine ranking of the web site. [74]

2.2.5 Limitations of existing search engines

When discussing web search engines, in most cases one arrives quickly at a discussion of Google. In fact, Google is often seen as synonymous with web search.

[68] It may be irritating to see that many search engines claiming to search the

‘whole of the web’ are available on the market; however, only a few of them have

their own, web-scale index. Outside of these few, most search engines license

search results from other search engines, the most famous example being Yahoo

using results from Microsoft’s Bing search engine (Microsoft, 2009) [25]

Another point to consider is the market shares of different search engines. While there may be at least a small variety of web search engines, users’ acceptance of these choices differs greatly. When discussing the search engine market, it is often forgotten that while search engines are surely commercial enterprises, they also serve as facilitators of information, and therefore serve the interest of the public. Considering that mainly one search engine is used, one has to ask whether this search engine does indeed serve these interests. [25]

The size of the Internet in terms of data continues to grow exponentially. No single search engine indexes more than about one-third of the ‘indexable web’, and combining the results of 6 search engines yields about 3.5 times as many documents on average as the results from only one engine. Search engines do not index sites equally, and no engine indexes more than about 16% of the web. Major engines index less than half of the web, and the average overlap between engines is very small. The indexable web is approximately 11.5 billion pages. The intersection of the Google, MSN, Ask and Yahoo indexes is 28.85%, or about 2.7 billion pages, and their union is about 9.36 billion pages. Even if two search engines use the same databases, search results may vary, because each search engine uses its own ranking algorithm. The non-indexable web often contains a large amount of data, the major part of which is not available through traditional search engines. A combination of retrieval paradigms brings improvements in information retrieval results. Coverage limitations, non-uniform user interfaces, query limitations and duplicates lower the effectiveness of search engines. This has led to the development of meta-search engines.


It is true that no single search tool indexes the entire web. In the late 90’s,

the web had between 6 and 8 billion web pages. At that time, Google indexed

2.4 billion, AllTheWeb 2.1 billion and AltaVista about 1 billion. Meta-search

engines were designed to fill in the gaps by searching many search engines

simultaneously. [58]

2.2.6 Initiative of meta-search engine development

During 1995, a novel type of search engine was introduced, called the meta-search engine. The idea was simple. The meta-search engine would take the key words entered by the user and forward them to all of the most important search engines. These search engines would send the hits (URLs / links) back to the meta-search engine, and the meta-search engine would arrange the hits on a single page for concise viewing.

The first of these meta-search engines was Metacrawler. Metacrawler took the output of the search engines but not the advertising banners that users of the search engines see, which reduced the advertising revenues of the search engine companies. Metacrawler finally relented and began including the banner ads with each set of search results. [59]

Besides Metacrawler, other major meta-search engines exist including

ProFusion, Dogpile, Ask Jeeves, and C-Net's Search.com. [59]

Examples of traditional meta-search engines are: Mamma.com, Dogpile.com,

Metasearch, Ixquick.com, Clusty, Hotbot, etc. [58]

The first meta-search engine focusing on travel, which is a domain-specific application, launched in 2000. The earliest versions were not web based; users had to download special software on to their desktop. These sites moved to the web and became more mainstream in late 2003 and throughout 2004. Several sites now play in this space, like Kayak, Mobissimo, Travelzoo, etc. [58]

It is a known fact that most meta-search engines draw their search results from multiple other search engines, then combine and re-rank those results. This was a useful feature back when search engines were less capable at crawling the web and each engine had a significantly unique index. [60]

Unlike most meta-search engines, Hotbot only pulls results from one search

engine at a time from the web. Currently Dogpile, owned by Infospace, is

probably the most popular meta-search engine on the market, but like all

other meta-search engines, it has limited market share. [60]

2.2.7 Optimization techniques used by existing meta-search engines

Like those of search engines, the optimization techniques used by meta-search engines are advertisement driven. Another strategy is pay per click, which is based on paid listings.

http://www.dogpile.com/

http://www.infospace.com/



2.2.8 Difference between meta-search engines and search engines

The major differences between meta-search engines and search engines are:

i. Meta-search engines do not have their own database of web pages as search engines do, and thus do not need indexing.

ii. Meta-search engines provide search results by forwarding the user's search text to other search engines and merging the results returned by those search engines.
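The forward-and-merge behaviour described in this list can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any existing meta-search engine: engine_a and engine_b are hypothetical stand-ins that return fixed lists of URLs, whereas a real system would call remote search engines.

```python
def engine_a(query):
    # Hypothetical stand-in for a real search engine; returns ranked hits.
    return ["x.example/1", "x.example/2", "y.example/1"]

def engine_b(query):
    # Second hypothetical engine with a partly overlapping result list.
    return ["y.example/1", "z.example/1"]

def meta_search(query, engines):
    """Forward the query to every engine and merge the hits.

    No local index is built: each hit is kept once, in first-seen order,
    together with the name of the engine that returned it first.
    """
    merged, seen = [], set()
    for name, engine in engines.items():
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append((url, name))
    return merged

results = meta_search("meta search", {"A": engine_a, "B": engine_b})
for url, source in results:
    print(source, url)
```

The sketch simply appends results in arrival order; combining them into a better order is a separate ranking question.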

In the area of search engines and meta-search engines, the following types of work have been done:

i. Well-known web search engines: whenever the user enters text in the search box to get information, the web search engine provides information about many web pages, which are retrieved from the Web itself.

ii. Search optimization techniques have been developed.

iii. Meta-search engine tools have been developed which provide information by searching a fixed number of search engines and finally returning an aggregate result. Some of them are domain specific. However, there are some problems related to timeouts.

2.2.9 Existing meta-search engines

Through the literature survey, information about various existing meta-search engines was collected and is summarized below:

1. Mamma: It is known as the mother of meta-search engines and has a time out problem.

2. Blingo: Retrieves search results from a single search engine, namely Yahoo. It makes its revenue like any other search engine, when the user clicks on a sponsored link on the result page.

3. Yippi: Retrieves search results for conservative values.

4. DeeperWeb: It offers integration with Google search engine and

retrieves results from Google only.



5. Dogpile: It fetches search results from a fixed set of three search engines: Google, Yahoo and Yandex.

6. Excite: It fetches search results from a fixed set of three search engines: Google, Yahoo and Yandex.

7. Harvester42: It is for information related to genes and proteins from

several species.

8. HotBot: Not functioning properly through user interface.

9. Info.com: It fetches search results from a fixed number of search

engines: Google, Yahoo, Bing and Yandex. It uses pay per click

strategy.

10. Kayak: It is a travel meta-search engine. It cannot be used for other

information related search.

11. Metacrawler: It fetches search results from search engines like,

Google, Yahoo and Yandex.

12. Mobissimo: It is a travel meta-search engine. It cannot be used for

searching other information.

13. Otalo: It is also a travel meta-search engine and it cannot be used for

other information search.

14. Ixquick: It returns the search results from multiple search engines. It

uses a star to rank its results by giving one star for every search

result that has been returned from a search engine.

15. PCH Search and Win: Retrieves search results from Google and

Yahoo.

16. SideStep: Meta-search engine for travel.

17. WebCrawler: Retrieves search results from Google and Yahoo.

2.2.10 Difference between existing meta-search engines

Meta-Search Engines differ on the basis of their functionalities featured, in

particular for the way they transmit the user query to the search engines and for

the way they collect and present the obtained results. For instance some Meta-

Search Engines simply append the obtained results without performing any



processing on those results. Some of them directly present parts of the pages

returned by the search engines.

Existing meta-search engines use a fixed number of search engines. Some of them retrieve results from only one or two search engines, like Blingo and WebCrawler. Some of them are domain specific; in particular, Kayak, Mobissimo, Otalo and SideStep are for travel-based search. The mother of meta-search engines, popularly known as Mamma, faces problems related to time outs during the search process. Most meta-search engines use various marketing strategies to gain their revenue.

2.2.11 Limitations of existing meta-search engines

One of the major problems with meta-search in general is that most meta-

search engines tend to mix pay per click ads in their organic search results, and for

some commercial queries 70% or more of the search results may be paid results.

[60]

It is known that meta-search engines group results from different search engines and display them on screen based on their own ranking algorithm. Meta-search engine optimization therefore requires a proper ranking algorithm, which can list results in an organic (natural) way.

The problem of meta-search is known as the rank aggregation problem, where a meta-search engine submits a query to multiple search engines, and then has to combine the individual ranked lists returned into a single ranked list, which is to be presented to the user. One of the problems that a meta-search engine has to solve, when combining results, is that of detecting and removing duplicate web pages that are returned by several search engines.

Moreover, meta-search tools have no database of their own, but send the

same enquiry to a variety of search engines. [26]
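To make the rank aggregation problem concrete, the sketch below merges two ranked lists and removes duplicates. The surveyed sources do not prescribe a particular method, so two assumptions are made here: a Borda count is used as one simple, well-known aggregation rule, and duplicates are detected with a crude URL normalization (host lower-cased, scheme, "www." prefix and trailing slash dropped). The URLs are hypothetical.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Crude duplicate key: drop scheme, 'www.' prefix and trailing slash."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def borda_merge(ranked_lists):
    """Borda count: a hit at position p (0-based) in a list of n hits earns n - p points."""
    scores = {}
    for hits in ranked_lists:
        n = len(hits)
        seen = set()
        for pos, url in enumerate(hits):
            key = normalize(url)
            if key in seen:          # duplicate inside one engine's list
                continue
            seen.add(key)
            scores[key] = scores.get(key, 0) + (n - pos)
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda k: (-scores[k], k))

lists = [
    ["http://www.a.example/", "http://b.example/page", "http://c.example"],
    ["http://b.example/page/", "http://a.example", "http://d.example"],
]
merged = borda_merge(lists)
print(merged)
```

Here the two spellings of the a.example and b.example/page addresses collapse into one entry each, so pages returned by both engines accumulate points from both lists and rise above pages returned by only one.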



The main limitations of meta-search engines can be summarized as below:

i. Existing meta-search engines are subject to time outs when search processing takes too long; in that case they retrieve only a few of the required hits from each search engine, and the total number of hits retrieved may be considerably smaller than the total obtained by searching one of the search engines directly. The mother of meta-search engines, popularly known as “mamma”, has this problem.

ii. Existing meta-search engines work on a fixed number of search engines. Some of them use only one or two search engines.

iii. Some existing meta-search engines are domain specific.

iv. The optimization technique used by some meta-search engines is the pay per click strategy.
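The time out limitation in item i can be illustrated with a small sketch in which engines are queried in parallel and only the answers that arrive before a deadline are kept. The two engines, the sleep duration and the 0.2 second deadline are hypothetical values chosen for illustration; they do not describe how Mamma or any other surveyed system is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fast_engine(query):
    # Hypothetical engine that answers immediately.
    return ["fast.example/1", "fast.example/2"]

def slow_engine(query):
    time.sleep(1.0)  # simulates an engine that answers too late
    return ["slow.example/1"]

def query_with_deadline(query, engines, deadline_s):
    """Query all engines in parallel; keep only answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(engines))
    futures = {pool.submit(fn, query): name for name, fn in engines.items()}
    done, not_done = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # do not block on the stragglers
    answered = {futures[f]: f.result() for f in done if f.exception() is None}
    timed_out = sorted(futures[f] for f in not_done)
    return answered, timed_out

answered, timed_out = query_with_deadline(
    "meta search", {"fast": fast_engine, "slow": slow_engine}, deadline_s=0.2
)
print(answered, timed_out)
```

The hits of the slow engine are lost entirely, which is why the total number of hits from a meta-search can end up smaller than from a direct search on a single engine.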

2.2.12 Need of ranking method in meta-search engine

A meta-search engine has the advantage of being lightweight, since there is no need for crawling and large-scale indexing. [11]

Meta-search engines often have only sparse information about the relevance of web pages returned for a search query. In many cases all that the meta-search engine has to go on is a ranked ordering of the returned results, and a summary of each of the web pages included in the results. Despite this, some meta-search engines rely on relevance scores to combine the results, which means that they need to infer the scores in some way, while other meta-search engines combine the results based solely on the ranked results obtained from the search engines queried. [11] Meta-search engine ranking algorithms may differ in how they generate the aggregate list of search results; they may require training data to update the position of a search result based on its rank, and they may also need to learn about the search engines they are querying. [11]

A meta-search engine, which uses relevance scores, can store a

representative of each search engine, giving an indication of the contents of

the search engine’s index. The index of representatives could be built as the

meta-engine is queried, so that it is compact and represents user queries



rather than the full set of keywords in the underlying search engine’s index.

The meta-index enables a standard normalization of relevance scores across

all the search engines deployed. In order to get the relevance information

about the web pages returned, the meta-search engine can simply download

these pages before merging the results, but this will, obviously, slow down

the response time for the query. [11] Hence, there is a need for a good ranking

method in meta-search engines.
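As an illustration of score-based merging, the sketch below normalizes the relevance scores of each engine to a common scale before combining them. The source does not specify the normalization, so two assumptions are made: min-max scaling to the range [0, 1] stands in for the "standard normalization", and normalized scores are simply summed per URL (often called CombSUM). The scores and URLs are invented for the example.

```python
def min_max(scores):
    """Rescale one engine's raw relevance scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {url: 1.0 for url in scores}
    return {url: (s - lo) / (hi - lo) for url, s in scores.items()}

def merge_by_score(per_engine_scores):
    """Sum the normalized scores per URL and sort best first."""
    total = {}
    for scores in per_engine_scores:
        for url, s in min_max(scores).items():
            total[url] = total.get(url, 0.0) + s
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical engines reporting relevance on different scales.
engine_a = {"p.example": 90.0, "q.example": 50.0, "r.example": 10.0}  # 0-100
engine_b = {"q.example": 0.8, "s.example": 0.4, "p.example": 0.2}     # 0-1
merged = merge_by_score([engine_a, engine_b])
for url, score in merged:
    print(url, round(score, 3))
```

Without the normalization step the 0-100 scores of the first engine would swamp the 0-1 scores of the second; after it, q.example, which both engines rate well, ranks first.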

2.2.13 Ranking methods

2.2.13.1 About ranking

A ranking is a relationship between a set of items such that, for any two

items, the first is either ‘ranked higher than’, ‘ranked lower than’ or ‘ranked

equal to’ the second. The web search engine may rank the pages it finds

according to an estimation of their relevance, making it possible for the user

quickly to select the pages they are likely to want to see. [86]

2.2.13.2 Strategies for assigning rankings

It is not always possible to assign rankings uniquely. For example, in a race

or competition two (or more) entrants might tie for a place in the ranking. When

computing an ordinal measurement, two (or more) of the quantities being ranked

might measure equal. In these cases, one of the strategies shown below for

assigning the rankings may be adopted. [86]

A common shorthand way to distinguish these ranking strategies is by the

ranking numbers that would be produced for four items, with the first item

ranked ahead of the second and third (which compare equal) which are both

ranked ahead of the fourth. [86]

Ranking strategies

Standard competition ranking ("1224" ranking)

In competition ranking, items that compare equal receive the same ranking

number, and then a gap is left in the ranking numbers. The number of

ranking numbers that are left out in this gap is one less than the number of



items that compared equal. Equivalently, each item's ranking number is 1

plus the number of items ranked above it. This ranking strategy is

frequently adopted for competitions, as it means that if two (or more)

competitors tie for a position in the ranking, the position of all those ranked

below them is unaffected (i.e., a competitor only comes second if exactly one

person scores better than them, third if exactly two people score better than

them, fourth if exactly three people score better than them, etc.). [86]

Thus if A ranks ahead of B and C (which compare equal) which are both

ranked ahead of D, then A gets ranking number 1 ("first"), B gets ranking

number 2 ("joint second"), C also gets ranking number 2 ("joint second") and

D gets ranking number 4 ("fourth"). [86]
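The rule "1 plus the number of items ranked above it" translates directly into code. In this minimal sketch the numeric scores for A, B, C and D are hypothetical, and a higher score is assumed to mean a better rank.

```python
def competition_rank(scores):
    """Standard competition ("1224") ranking: 1 + number of strictly better items."""
    return {name: 1 + sum(other > s for other in scores.values())
            for name, s in scores.items()}

ranks = competition_rank({"A": 9, "B": 7, "C": 7, "D": 5})
print(ranks)
```

For these scores the result is 1, 2, 2, 4, matching the example above.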

Modified competition ranking ("1334" ranking)

Sometimes, competition ranking is done by leaving the gaps in the ranking

numbers before the sets of equal ranking items (rather than after them as in

standard competition ranking). The number of ranking numbers that are left

out in this gap remains one less than the number of items that compared

equal. Equivalently, each item's ranking number is equal to the number of

items ranked equal to it or above it. This ranking ensures that a competitor

only comes second if they score higher than all but one of their opponents.

[86]



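Thus if A ranks ahead of B and C (which compare equal) which are both

ranked ahead of D, then A gets ranking number 1 ("first"), B gets ranking

number 3 ("joint third"), C also gets ranking number 3 ("joint third") and D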

gets ranking number 4 ("fourth"). In this case, nobody would get ranking

number 2 ("second") and that would be left as a gap. [86]
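As a minimal sketch of modified competition ranking, the rank of an item can be computed as the number of items that score at least as well as it does (the item itself included). The function name and scores are illustrative assumptions; a higher score is assumed to rank first.

```python
def modified_competition_rank(scores):
    """Modified competition ("1334") ranking; a higher score ranks first."""
    # Rank = number of items ranked equal to or above the item (itself included).
    return [sum(other >= s for other in scores) for s in scores]

# Hypothetical scores for A, B, C and D, with B and C tied.
print(modified_competition_rank([30, 20, 20, 10]))  # [1, 3, 3, 4]
```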

Dense ranking ("1223" ranking)

In dense ranking, items that compare equal receive the same ranking

number, and the next item(s) receive the immediately following ranking

number. Equivalently, each item's ranking number is 1 plus the number of



items ranked above it that are distinct with respect to the ranking order.

[86]



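Thus if A ranks ahead of B and C (which compare equal) which are both

ranked ahead of D, then A gets ranking number 1 ("first"), B gets ranking

number 2 ("joint second"), C also gets ranking number 2 ("joint second") and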

D gets ranking number 3 ("third"). [86]
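Dense ranking can be sketched in the same way by counting only the distinct scores that rank above an item. The function name and scores below are illustrative assumptions; a higher score is assumed to rank first.

```python
def dense_rank(scores):
    """Dense ("1223") ranking; a higher score ranks first."""
    # Each distinct score gets the next consecutive rank, so no gaps appear.
    distinct = sorted(set(scores), reverse=True)
    position = {score: i for i, score in enumerate(distinct, start=1)}
    return [position[s] for s in scores]

# Hypothetical scores for A, B, C and D, with B and C tied.
print(dense_rank([30, 20, 20, 10]))  # [1, 2, 2, 3]
```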

Ordinal ranking ("1234" ranking)

In ordinal ranking, all items receive distinct ordinal numbers, including

items that compare equal. The assignment of distinct ordinal numbers to

items that compare equal can be done at random, or arbitrarily, but it is

generally preferable to use a system that is arbitrary but consistent, as this

gives stable results if the ranking is done multiple times. An example of an

arbitrary but consistent system would be to incorporate other attributes into

the ranking order (such as alphabetical ordering of the competitor's name)

to ensure that no two items exactly match. [86]

With this strategy, if A ranks ahead of B and C (which compare equal) which

are both ranked ahead of D, then A gets ranking number 1 ("first") and D

gets ranking number 4 ("fourth"), and either B gets ranking number 2

("second") and C gets ranking number 3 ("third") or C gets ranking number 2

("second") and B gets ranking number 3 ("third"). [86]
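An "arbitrary but consistent" tie-break can be sketched by sorting on the score first and on the competitor's name second. The function name and scores are illustrative assumptions; alphabetical order is only one possible tie-break attribute.

```python
def ordinal_rank(scores):
    """Ordinal ("1234") ranking; ties are broken alphabetically by name."""
    # Sort by descending score, then by name, so repeated runs give the same order.
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}

# Hypothetical scores; B and C tie, and B wins the tie alphabetically.
scores = {"A": 30, "C": 20, "B": 20, "D": 10}
print(ordinal_rank(scores))  # {'A': 1, 'B': 2, 'C': 3, 'D': 4}
```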

Fractional ranking ("1 2.5 2.5 4" ranking)

Items that compare equal receive the same ranking number, which is

the mean of what they would have under ordinal ranking. Equivalently, each

item's ranking number is 1 plus the number of items ranked above it plus half

the number of other items equal to it. This strategy has the property that the

sum of the ranking numbers is the same as under ordinal ranking. [86]


Thus if A ranks ahead of B and C (which compare equal) which are both

ranked ahead of D, then A gets ranking number 1 ("first"), B and C each get



ranking number 2.5 (average of "joint second/third") and D gets ranking

number 4 ("fourth"). [86]

For example, suppose the available data set is 1 1 2 3 3 4 5 5 5. It contains

five distinct values, so there are five distinct ranks. If the two 1's were

different numbers, they would occupy ranks 1 and 2. Since they are equal,

their rank is the average of those positions: (1 + 2) / 2 = 1.5. The next

number in the data set, 2, is assigned rank 3, because ranks 1 and 2 are

already taken up by the two 1's. The two 3's would occupy ranks 4 and 5 if

they were different numbers, so their average rank is (4 + 5) / 2 = 4.5. The

number 4 then gets rank 6, because ranks 4 and 5 are already accounted for.

There are three 5's in the data set, which would occupy ranks 7, 8 and 9, so

their average rank is (7 + 8 + 9) / 3 = 8. [86]

Resultant ranks would be: 1.5 1.5 3 4.5 4.5 6 8 8 8 [86]
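The worked example can be checked with a short Python sketch. The function name is an illustrative assumption, and smaller values are assumed to rank first, as in the data set above. A group of equal values occupying ordinal positions k+1 to k+n receives the mean of those positions, which is k + (n + 1) / 2.

```python
def fractional_rank(values):
    """Fractional ("1 2.5 2.5 4") ranking; a smaller value ranks first."""
    ranks = []
    for v in values:
        smaller = sum(other < v for other in values)  # items ranked above v
        equal = sum(other == v for other in values)   # tied items, v included
        # Mean of the ordinal positions smaller+1 .. smaller+equal.
        ranks.append(smaller + (equal + 1) / 2)
    return ranks

data = [1, 1, 2, 3, 3, 4, 5, 5, 5]
print(fractional_rank(data))
# [1.5, 1.5, 3.0, 4.5, 4.5, 6.0, 8.0, 8.0, 8.0]
```

The ranks sum to 45, the same total as the ordinal ranks 1 to 9.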

2.2.13.3 Ranking methods in search engines

Search engine ranking methods are closely guarded secrets, for at least two

reasons: search engine companies want to protect their methods from their

competitors, and they also want to make it difficult for web site owners to

manipulate their rankings.

A specific page's relevance ranking for a specific query currently depends on

three factors:

i. Its relevance to the words and concepts in the query.

ii. Its overall link popularity.

iii. Whether or not it is being penalized for excessive search engine

optimization (SEO).

2.2.13.4 Ranking methods in meta-search engines

Meta-search engines are tools that receive user queries and dispatch them

to multiple search engines (also called the component engines of the

meta-search engine). The meta-search engine then collects the returned

results, reorders them and displays the ranked result list to the user. The

ranking methods that meta-search engines utilize are based on a variety of

parameters, such as the ranking a result receives and the number of its

appearances in the component engines' result lists. These parameters are

used to compute a rank (also called a score) for each received result.

Better organization of results can be achieved by employing ranking

methods that take into consideration additional information about a web

page. Another core step is to implicitly collect some data about the user

who submits the query. This helps the engine decide which results

better suit his / her informational needs.
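A score built from the two parameters named above (the position of a result and the number of its appearances) can be sketched as follows. This is a minimal sketch assuming a Borda-style positional score; it is not the method of any particular meta-search engine, and the function name, the `list_length` parameter and the URLs are hypothetical.

```python
from collections import defaultdict

def merge_results(result_lists, list_length=10):
    """Fuse ranked URL lists from several component engines.

    Assumption: a Borda-style score. A URL at position p (1-based) earns
    list_length - p + 1 points; points are summed across engines, so a URL
    returned by several engines accumulates a higher score.
    """
    scores = defaultdict(int)
    appearances = defaultdict(int)
    for results in result_lists:
        for position, url in enumerate(results[:list_length], start=1):
            scores[url] += list_length - position + 1
            appearances[url] += 1
    # Order by score, then by number of appearances, then by URL for stability.
    return sorted(scores, key=lambda u: (-scores[u], -appearances[u], u))

# Hypothetical result lists from three component engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
engine_c = ["u2", "u1", "u5"]
merged = merge_results([engine_a, engine_b, engine_c], list_length=3)
print(merged)  # ['u2', 'u1', 'u4', 'u3', 'u5']
```

Such a score depends only on objective inputs: every user who submits the same query receives the same merged list.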

However, none of these studies proposes a ranking method that is suitable

for meta-search engines. The existing methods assign scores according to

objective criteria, such as the rank a result receives from the component

engines. None of them can accept input from different users

(subjective data) and produce correspondingly different results.

In other words, the current methods lack a ranking method that offers a

competitive advantage to a URL's position on the result page and outputs

similar results for similar queries submitted by different

users.

2.2.13.5 Integrating site into search engines using ranking method

The following steps need to be followed to integrate a site into a

search engine [27]:

i. Choosing the right keywords, which will bring the most hits to the

web site.

ii. Using the right title tags on the web site.

iii. Ensuring appropriate content writing on the web site.

iv. Choosing the right search engines to which to submit the web site and

understanding the free and paid listing service options available.

The base case is that spiders crawl the entire Web, starting from known

pages and following all links, and also crawl pages that are hand-submitted

to engines such as Google. If a site has a high PageRank, it is spidered more

often and more deeply.

However, search engines are trying to encourage site owners to pay for the

privilege of having their pages spidered. Teoma's index is very hard to get

into without paying, and Inktomi's is not easy either. Even if site owners

do get into Inktomi for free, it takes a long time to respider their pages,

whereas the pages of paying owners are respidered constantly.

The advantage of being respidered often is that site owners can tweak their

pages and page contents to come up higher in the relevancy rankings.

Site owners can also pay to appear on a search page. That is, the owner's

link will appear when someone searches on a specific keyword or key phrase.

Google makes it fairly clear which results at the top or on the

right of the page are paid.

Paid search results are typically all pay-per-click, based on keyword. The

advertiser pays the search engine vendor a specific amount of money each

time a link is clicked.

Use of meta tag

Meta tags are a key part of the overall search engine optimization program

that needs to be implemented for a web site. Meta tags have never guaranteed top

rankings on crawler-based search engines, but they may offer a degree of

control and the ability to impact how web pages are indexed within search

engines. [27]



Meta tags give search engines more information about a web page. This

information is implicit, which means it is not visible to visitors of the web

page itself. [87]

Meta tags belong in the <head> element of a web page, because some

browsers may not recognize meta tags placed in the <body> part.

[87]

Often, meta tags contain a name attribute, which sets the type of metadata.

The value of this metadata is expressed through a content attribute. The

meta description tag is the most useful tag; as the name suggests, it gives

search engines a short description of the web page. [87] An example is given

below:

<meta name="description" content="about search engine optimization"/>
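How an indexing program might read this tag can be sketched with Python's standard html.parser module. The class name and the sample page are illustrative assumptions; real crawlers are considerably more elaborate.

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content of the <meta name="description"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags also arrive here.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")

# Hypothetical page containing the tag shown above.
page = (
    "<html><head>"
    '<meta name="description" content="about search engine optimization"/>'
    "</head><body>Visible text</body></html>"
)
parser = MetaDescriptionParser()
parser.feed(page)
print(parser.description)  # about search engine optimization
```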

2.2.14 Meta-Search Engine perspective

As noted earlier, a meta-search engine presents results combined from

multiple search engines, which allows it to provide better

performance than any individual search engine. One advantage of

meta-search engines is that the results can be presented using different

ranking formulas and their attributes. The output can therefore be more

specific than that of an individual search engine, which should make

retrieval of the results simpler. In most cases, the search result does not

contain all the web pages matching the user's search query, as the number of

results per search engine retrieved by the meta-search engine is limited.

Pages returned by more than one search engine need to be aggregated by the

meta-search engine.
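Aggregating pages returned by more than one search engine requires recognizing that two result URLs point to the same page. The following is a minimal sketch, assuming a simple normalization (lower-case host, no trailing slash, no fragment); the function names, engine names and URLs are hypothetical, and real engines may apply different rules.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Reduce a URL to a canonical form so that duplicates compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Lower-case the scheme and host, keep the query, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def aggregate(results_by_engine):
    """Map each distinct page to the list of engines that returned it."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            merged.setdefault(normalize(url), []).append(engine)
    return merged

# Hypothetical results from two component engines.
results = {
    "engine_a": ["http://Example.com/page/", "http://example.org/"],
    "engine_b": ["http://example.com/page#top"],
}
print(aggregate(results))
# {'http://example.com/page': ['engine_a', 'engine_b'], 'http://example.org/': ['engine_a']}
```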

The volume of information on the web is vast, and each search engine covers

only part of it for a given search text. Major search engines are known to

cover only a relatively small portion of the entire web, so using a

meta-search engine to draw on the large databases of several search engines

is very important.



A meta-search engine uses different search engines as its sources and has to

provide better, more specific and improved search results. To achieve

higher quality through the combination process, the input modules must not

merely retrieve different forms of information; they should

provide different relevant information together with its rank. Different

retrieval algorithms often retrieve much of the same relevant information. A

good combination technique is therefore required to obtain reliable search

results.

Reliable behaviour is another important and desirable quality of a

meta-search engine. It has been shown that the same search engine

often returns different results for the same search text over

time, which may be due to the evolution of its database and changes in its

ranking algorithm. It is also observed that each search engine

has its strengths and weaknesses, performing well on some search texts

and inadequately on others. If the meta-search engine has its own database,

this problem can be minimized.

A meta-search engine is a solution that logically incorporates the

information of all the search engines in such a way that it takes

advantage of each.

2.3 SCOPE OF RESEARCH

There is great scope for research in designing a new model of Meta-Search

Engine that improves the efficiency and effectiveness of results through

optimization techniques based on the following strategies:

i. Change in ranking formulas

ii. Use of databases for indexing purpose

iii. More normalization of databases

iv. New strategy to improve response time

v. Proper design of page to increase load speed



2.4 SUMMARY

This chapter presents the history of web surfing. It also presents the

initiatives in search engine development, the working of search engines, the

optimization techniques used by search engines, the limitations of search

engines and the differences between search engines and meta-search engines.

Moreover, it provides a list of existing meta-search engines with their

functionalities, differences and limitations, gives an overview of different

ranking methods and outlines the scope of research in this area.
