Result Merging in a Peer-to-Peer Web Search
Engine
Sergey Chernov
UNIVERSITÄT DES SAARLANDES
January, 2005
Result Merging in a Peer-to-Peer Web Search
Engine
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Science
Submitted by
Sergey Chernov
under the guidance of
Prof. Dr.-Ing. Gerhard Weikum
Christian Zimmer
UNIVERSITÄT DES SAARLANDES
January, 2005
Abstract
The tremendous amount of information on the Internet requires powerful search
engines. Currently, only commercial centralized search engines like Google
can process terabytes of Web documents. Such approaches fail to index
the “Hidden Web” located in intranets and local databases, and with
the exponential growth of information volume the situation becomes even
worse. Peer-to-Peer (P2P) systems can be pursued to extend the current
search capabilities. The Minerva project is a Web search engine based on a
P2P architecture. In this thesis, we investigate the effectiveness of the dif-
ferent result merging methods for the Minerva system. Each peer provides
an efficient search engine for its own focused Web crawl. Each peer can pose
a query against a number of selected peers; the selection is based on a data-
base ranking algorithm. The best top-k results from several highly ranked
peers are collected by the query initiator and merged into a single list. We
address the problem of result merging. We select several merging methods,
which are feasible for use in a heterogeneous, dynamic, distributed environ-
ment. The experimental framework for these methods was implemented and
the effectiveness of the merging techniques was studied with the TREC Web
data. The language modeling based ranking method produced the most ro-
bust and accurate results under different conditions. We also propose
a new merging method, which incorporates the preference-based language
model. The novelty of the method is that the preference-based language
model is obtained from the pseudo-relevance feedback on the best peer in
the database ranking. In every tested setup, the new method was at least as
effective as the baseline or slightly better.
I hereby declare that this thesis is entirely my own work and that I have
not used any other media than the ones mentioned in the thesis.
Saarbrücken, 12th January, 2005
Sergey Chernov
Acknowledgements
I would like to thank my academic advisor Professor Gerhard Weikum for
his guidance and encouragement throughout my master's thesis
project. I wish to express my sincere gratitude to my supervisors Chris-
tian Zimmer, Sebastian Michel, and Matthias Bender for their invaluable
assistance and feedback. I would like to thank Kerstin Meyer-Ross for her
continuous support in everything. I am very grateful to the members of the
Databases and Information Systems group AG5, fellow students from the
IMPRS program and all my friends from the Max-Planck Institute who pro-
vided me with a friendly and stimulating environment. I would like to extend
special thanks to Pavel Serdyukov and Natalie Kozlova for the numerous dis-
cussions and helpful ideas. It is difficult to explain how grateful I am to my
mother, Galina Nikolaevna, and my father, Alevtin Petrovich; their wisdom
and care made it possible for me to study. Finally, I want to thank the one
person who was most supportive and patient during this process, my dear
wife Olga. I would never have accomplished this work without her love.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Description of the remaining chapters . . . . . . . . . . . . . . 5
2 Web search and Peer-to-Peer Systems 6
2.1 Information retrieval basics . . . . . . . . . . . . . . . . . . . 6
2.2 Web search engines . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Peer-to-Peer architecture . . . . . . . . . . . . . . . . . . . . . 11
2.4 P2P Web search engines . . . . . . . . . . . . . . . . . . . . . 12
2.5 Minerva project . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Result merging in distributed information retrieval 15
3.1 Distributed information retrieval in general . . . . . . . . . . . 15
3.2 Result merging problem . . . . . . . . . . . . . . . . . . . . . 19
3.3 Prior work on collection fusion . . . . . . . . . . . . . . . . . . 21
3.3.1 Collection fusion properties . . . . . . . . . . . . . . . 21
3.3.2 Cooperative environment . . . . . . . . . . . . . . . . . 22
3.3.3 Uncooperative environment . . . . . . . . . . . . . . . 23
3.3.4 Learning methods . . . . . . . . . . . . . . . . . . . . . 25
3.3.5 Probabilistic methods . . . . . . . . . . . . . . . . . . . 26
3.4 Prior work on the data fusion . . . . . . . . . . . . . . . . . . 26
3.4.1 Data fusion properties . . . . . . . . . . . . . . . . . . 27
3.4.2 Basic methods . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.3 Mixture methods . . . . . . . . . . . . . . . . . . . . . 29
3.4.4 Metasearch approved methods . . . . . . . . . . . . . . 30
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Selected result merging strategies 31
4.1 Target properties for result merging methods . . . . . . . . . . 31
4.2 Score normalization with global IDF . . . . . . . . . . . . . 33
4.3 Score normalization with ICF . . . . . . . . . . . . . . . . . . 35
4.4 Score normalization with CORI . . . . . . . . . . . . . . . . . 36
4.5 Score normalization with language modeling . . . . . . . . . . 37
4.6 Score normalization with raw TF scores . . . . . . . . . . . . 39
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Our approach 40
5.1 Result merging with the preference-based language model . . . 40
5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Implementation 44
6.1 Global statistics classes . . . . . . . . . . . . . . . . . . . . . . 44
6.2 Testing components . . . . . . . . . . . . . . . . . . . . . . . . 46
7 Experiments 49
7.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 49
7.1.1 Collections and queries . . . . . . . . . . . . . . . . . . 49
7.1.2 Database selection algorithm . . . . . . . . . . . . . . . 51
7.1.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . 51
7.2 Experiments with selected result merging methods . . . . . . . 52
7.2.1 Result merging methods . . . . . . . . . . . . . . . . . 52
7.2.2 Merging results . . . . . . . . . . . . . . . . . . . . . . 53
7.2.3 Effect of limited statistics on the result merging . . . . 62
7.3 Experiments with our approach . . . . . . . . . . . . . . . . . 64
7.3.1 Optimal size of the top-n . . . . . . . . . . . . . . . . . 65
7.3.2 Optimal smoothing parameter β . . . . . . . . . . . . . 65
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8 Conclusions and future work 69
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Bibliography 71
A Test queries 77
List of Figures
2.1 The Minerva system architecture . . . . . . . . . . . . . . . . 13
3.1 Simple metasearch architecture . . . . . . . . . . . . . . . . . 17
3.2 A query processing scheme in the distributed search system . . 18
3.3 Collection fusion vs. data fusion . . . . . . . . . . . . . . . . . 20
3.4 An overlapping in the collection fusion problem . . . . . . . . 22
3.5 Statistics propagation for the collection fusion . . . . . . . . . 24
3.6 Data fusion on a single search engine . . . . . . . . . . . . . . 27
6.1 Main classes involved in merging . . . . . . . . . . . . . . . . 45
6.2 A general view on the experiments implementation . . . . . . 47
7.1 The macro-average precision with the database ranking RANDOM 55
7.2 The macro-average recall with the database ranking RANDOM 55
7.3 The macro-average precision with the database ranking CORI 57
7.4 The macro-average recall with the database ranking CORI . . 57
7.5 The macro-average precision with the database ranking IDEAL 58
7.6 The macro-average recall with the database ranking IDEAL . 58
7.7 The macro-average precision of the LM04 result merging method
with the different database rankings . . . . . . . . . . . . . . . 59
7.8 The macro-average precision with the database ranking CORI
with the global statistics collected over the 10 selected databases 63
7.9 The macro-average precision with the database ranking IDEAL
with the global statistics collected over the 10 selected databases 63
7.10 The macro-average precision with the database ranking CORI
with the different size of top-n for the preference-based model
estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.11 The macro-average precision with the database ranking IDEAL
with the different size of top-n for the preference-based model
estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.12 The macro-average precision with the database ranking CORI
with the top-10 documents for the preference-based model es-
timation with β = 0.6 and LM04 result merging method . . . 67
7.13 The macro-average precision with the database ranking IDEAL
with the top-10 documents for the preference-based model es-
timation with β = 0.6 and LM04 result merging method . . . 67
A.1 Relevant documents distribution for the TF method with the
IDEAL ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2 Relevant documents distribution for the LM04PB06 method
with the IDEAL ranking . . . . . . . . . . . . . . . . . . . . . 80
A.3 Residual between the number of relevant documents of the
CE06LM04 and TF methods with the IDEAL database ranking 81
List of Tables
4.1 The target properties of the result merging methods . . . . . . 33
7.1 Topic-oriented experimental collections . . . . . . . . . . . . . 50
7.2 The difference in percent of the average precision between
the result merging strategies and corresponding baselines with
the RANDOM ranking. The LM04 technique is compared
with the SingleLM method; all others are compared with the
SingleTFIDF approach . . . . . . . . . . . . . . . . . . . . . 61
7.3 The difference in percent of the average precision between
the result merging strategies and corresponding baselines with
the CORI ranking. The LM04 technique is compared with the
SingleLM method; all others are compared with the SingleTFIDF
approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.4 The difference in percent of the average precision between the
result merging strategies and corresponding baselines with the
IDEAL ranking. The LM04 technique is compared with the
SingleLM method; all others are compared with the SingleTFIDF
approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.1 The topic-oriented set of the 25 experimental queries (topics
are coded as “HM” for the Health and Medicine, and “NE”
for the Nature and Ecology) . . . . . . . . . . . . . . . . . . . 78
A.2 The number of relevant documents for the TF and LM04PB06
methods with the IDEAL database ranking (LM04PB06 name
is shortened to LMPB for convenience) . . . . . . . . . . . . . 79
Chapter 1
Introduction
1.1 Motivation
Millions of new documents are created every day across the Internet and
even more are changed. The huge amount of information increases expo-
nentially and a search engine is the only hope to find the documents which
are relevant to a particular user’s need. Routine access to information is
now based on full-text information retrieval, instead of controlled vocabulary
indices. Currently, a huge number of people use the Web for text and image
search on a regular basis. In this thesis, we consider the problem of searching
text data.
The need for effective search tools becomes more important every day, but
currently only a few centralized search engines like Google (www.google.com)
can cope with this task and they are only partially effective. The so-called
Hidden Web consists of all intranets and local databases behind portal pages.
According to an estimate from [SP01], it is 2 to 50 times larger than the Visible
Web, which can be crawled by the search robots. Taking into account that
even the largest Google crawl of more than 8 billion pages encompasses only
a part of the Visible Web, we can imagine how many potentially relevant pages
a centralized search engine does not consider during a search. This problem
comes from the technical limitations of a single search engine.
The desire to overcome the limitations of a single search engine established
a new scientific direction: distributed information retrieval, or metasearch;
we will use both terms as synonyms. The main technique that was developed
in this field is an intermediate broker called a metasearch engine. It has
access to the query interfaces of the individual search engines and text data-
bases. Briefly, when a metasearch engine receives a user’s query, it passes
the query to a set of appropriate individual search engines and databases.
Then it collects the partial results and combines them to improve the overall
result. Numerous examples of metasearch engines are available on the Web
(www.search.com, www.profusion.com, www.inquirus.com).
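The query flow just described (route the query to selected engines, collect the partial results, combine them) can be sketched as follows. This is a toy illustration: the engine callables and the naive selection step are hypothetical placeholders, and the final sort assumes that scores from different engines are directly comparable, which is exactly the result merging problem examined later in this thesis.

```python
# Toy sketch of a metasearch broker: select engines, fan the query out,
# collect the partial result lists, and combine them into one ranking.
# The engines and the naive "first few" selection are placeholders.

def metasearch(query, engines, select_top=2):
    """engines: {name: callable(query) -> [(doc_id, score), ...]}."""
    selected = list(engines)[:select_top]  # placeholder for database selection
    merged = []
    for name in selected:
        for doc_id, score in engines[name](query):
            merged.append((doc_id, score, name))
    # Naively assumes scores are comparable across engines.
    return sorted(merged, key=lambda t: -t[1])

engines = {
    "engine1": lambda q: [("d1", 0.9), ("d2", 0.4)],
    "engine2": lambda q: [("d3", 0.7)],
}
print(metasearch("web search", engines))
```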
The metasearch approach contains several significant sub-problems, which
arise in the query execution process. The database selection problem arises
when a query is routed from a metasearch engine to the individual search
engines. A naive routing approach is to propagate a query to all available
engines. The scalability of such a strategy is unsatisfactory since it is in-
efficient to ask more than several dozen servers. The database selec-
tion process helps to discover a small number of the most useful databases
for a current query and to ask only this limited subset without significant
loss in recall. Many of the database selection methods were developed to
tackle this issue [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02]. The
result merging problem is another important sub-problem of the metasearch
technique. In information retrieval, the output result is a ranked list of
documents, which are sorted by their similarity score values. In distrib-
uted information retrieval, the aforementioned list is obtained from several re-
sult lists, which are merged into one. The result merging problem is not
trivial; numerous merging techniques have been studied in the literature
[CLC95, TVGJL95, Bau99, Cra01, SJCO02]. The issue of automatic
database discovery has not been fully addressed, so adding new data sources to
a metasearch engine largely remains a manual task.
The major drawback of metasearch is that large search engines are not
interested in cooperation. A search result is a commercial product which
they want to “sell” on their own. For example, the STARTS proposal
[GCGMP97] is a quite effective communication protocol designed especially
for metasearch, but it is not widely used because of the reason above. The
new Peer-to-Peer (P2P) technologies can help us to remove the limitations
caused by the uncooperativeness of search engine vendors. The computational
power of processors increases every year, and so does the network bandwidth.
Millions of personal computers have enough storage and computational re-
sources to index their own documents and perform small crawls of the inter-
esting fragments of the Web. They can provide a search on their local index,
but do not have to uncover the data itself unless they want to. This is the
way to incorporate the Hidden Web pages into a global search mechanism.
Collaborative crawling can span a larger portion of the Web, since every
peer can contribute its own focused crawl into the system. This method is
cheap and provides us with topic-oriented search opportunities; we can also
use intellectual inputs from other users to improve our own search. Such
considerations launched the Minerva project, a new P2P Web search engine.
The metasearch field has many common properties with search in a P2P
system, but some important distinctions should be taken into account. A
P2P environment is much more dynamic than traditional metasearch:
• Queries are processed on millions of small indices instead of dozens of
large indices;
• Global query execution might require resource sharing and collabora-
tion among different peers and cannot be fully performed on one peer;
• Limited statistics is a necessary requirement for a scalable P2P system,
while in distributed information retrieval rich statistics can be
provided by a centralized source;
• Cooperativeness of peers in a P2P system, in contrast to a metasearch
setting, helps to reduce heterogeneity in such parameters as represen-
tative statistics or index update time.
Distributed information retrieval accommodates features from two re-
search areas: information retrieval and distributed systems. The goal of
effectiveness is inherent to the former: it aims at high relevance of the re-
turned documents, and the collaboration of users in a P2P setting gives us
additional opportunities to refine the search results.
The main goals of the Minerva project include the traditional metasearch
goals and new issues:
1. Increased search coverage of the Web;
2. Retrieval effectiveness comparable to that of a centralized search engine;
3. Scalable architecture to combine millions of small search engines.
For this purpose, we want to exploit existing solutions from distributed infor-
mation retrieval and adapt them to our new setup, with the aforementioned
distinctive properties in mind. We also want to find novel methods, which
are suitable for P2P architecture and can improve our system. The practi-
cal goal is to create a prototype of a highly scalable, effective, and efficient
distributed P2P Web search engine.
1.2 Our contribution
The main purpose of this thesis is to develop an effective result merging
method for the Minerva system. We analyze the major sub-problems of result
merging and review several existing techniques. The selected methods have
been implemented and evaluated in the Minerva prototype. In addition, a
new preference-based language model method for result merging is proposed.
Our approach combines the preference-based and the result merging rankings.
The novelty of the method is that the preference-based language model is
obtained from the pseudo-relevance feedback on the best peer in the database
ranking.
We address the issue of effectiveness. It is determined by the underlying
result merging scheme. As in distributed information retrieval, in a P2P
system the similarity scores for each document are computed on the basis
of local database statistics. This makes the scores incomparable due to
the differences in statistics across databases. A score computation
based on the global statistics is the most accurate solution in our case. For
the cooperative data sources, as we have in the Minerva system, we can
collect the local database-dependent statistics and replace them with globally
estimated ones, which are fair to all databases. We elaborated on this issue
by testing several global score computation techniques and discovering the
most effective scoring function.
We also exploited additional information about the user’s preferences in
order to improve the quality of the final ranking. Our method combines two
rankings. The first ranking is the language modeling result merging scheme.
The second one is based on the language model from pseudo-relevance feed-
back. The user preferences are inferred using pseudo-relevance feedback:
the top-k results from the best-ranked database are assumed relevant. The
novelty of our method is that pseudo-relevant feedback is obtained on the
top-ranked peer before the global query execution.
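As a purely illustrative sketch of combining two rankings: one simple way is a linear interpolation of their scores. The interpolation weight and the toy score dictionaries below are assumptions for illustration, not the thesis's formulas; the actual combination is defined in Chapter 5.

```python
# Illustrative only: combine a result merging ranking with a
# preference-based ranking via linear interpolation. The weight beta
# and the toy scores are hypothetical stand-ins.

merge_scores = {"d1": 0.8, "d2": 0.5, "d3": 0.3}  # result merging ranking
pref_scores = {"d1": 0.2, "d2": 0.7, "d3": 0.4}   # preference-based model

def combined(doc, beta=0.6):
    # Weighted sum of the two per-document scores.
    return beta * merge_scores[doc] + (1 - beta) * pref_scores[doc]

ranking = sorted(merge_scores, key=combined, reverse=True)
print(ranking)
```

Here d2 overtakes d1 because the preference-based model favors it, illustrating how the second ranking can reorder the merged list.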
1.3 Description of the remaining chapters
Background information about information retrieval and P2P systems is
presented in Chapter 2. An overview of distributed information retrieval and
recent work on result merging is introduced in Chapter 3. Chapter 4 presents
details of the merging techniques that we select for our experimental stud-
ies. Chapter 5 contains the new approach that uses the preference-based
language model. In Chapter 6, we present implementation details. The ex-
perimental setup, evaluation methodology, and our results are presented in
Chapter 7. Chapter 8 closes the thesis with conclusions and sugges-
tions for future work.
Chapter 2
Web search and Peer-to-Peer
Systems
In this chapter, we give a short description of Web search, P2P systems,
and their potential as a platform for distributed information retrieval. Sec-
tion 2.1 contains introductory information about information retrieval. In
Section 2.2, we review the Web search engines. In Section 2.3, some general
properties of P2P systems are discussed. Section 2.4 presents recent ap-
proaches for combining search mechanisms with P2P architecture. Section
2.5 describes our approach, the Minerva project.
2.1 Information retrieval basics
Information retrieval deals with search engine architectures, algorithms,
and methods that are concerned with information search on the Internet,
digital libraries, and text databases. The main goal is to find the relevant
documents for a query from a collection of documents. The documents are
preprocessed and placed into an index, which provides the base for retrieval.
A typical search engine is based on the single-database model of text
retrieval. In this model, the documents from the Web and local databases
are collected into a centralized repository and indexed. The whole model is
effective if the index is large enough to satisfy most of the users' information
needs and the search engine uses an appropriate retrieval system. A retrieval
system is the set of retrieval algorithms for the different purposes: ranking,
stemming, index processing, relevance feedback and so on.
The widely used bag-of-words model assumes that every document may
be represented by the words it contains. The most frequent
words like “the”, “and” or “is” do not have rich semantics. They are called
stopwords, and we remove them from the document representation. The
full set of stopwords is stored in a stopword list. Word variations with
the same stem, like “run”, “runner” and “running”, are mapped onto one
term corresponding to that stem; a stemming algorithm performs
this mapping. In the current example, the term is “run”.
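The preprocessing steps just described can be sketched as follows; the stopword set and the suffix-stripping rules are tiny illustrative stand-ins for a full stopword list and a real stemming algorithm such as Porter's.

```python
# Sketch of bag-of-words preprocessing: tokenize, drop stopwords, stem.
# The stopword set and suffix rules are toy stand-ins for the real lists
# and algorithms a search engine would use.

STOPWORDS = {"the", "and", "is", "a", "of", "in"}

def naive_stem(word):
    # Strip a few common English suffixes; a real stemmer is more careful.
    for suffix in ("ning", "ner", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("the runner is running"))
```

Both "runner" and "running" map onto the single term "run", while the stopwords disappear.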
An important characteristic of a retrieval system is its underlying model
of retrieval process. This model specifies the procedure of the probability
estimation that a document will be judged relevant. The final document
ranking is based on this estimation. The ranking is presented to the user after
a query execution. The simple retrieval process models include a probabilistic
model and a vector space model; the latter is the most widely used in
search engines. In the vector space model, the document D is represented
by the vector ~d = (w1, w2, . . . , wm), where wi is the weight indicating the
importance of term ti in representing the semantics of the document and
m is the number of distinct terms. For all terms that do not occur in the
document, the corresponding entries are equal to zero, so the full document
vector is very sparse.
When a term occurs in a document, two factors are of importance
in weight assignment. The first factor is the term frequency TF: the
number of the term's occurrences in the document. The weight of the term in
the document's vector is proportional to TF. The more often a term occurs,
the more important it is in representing the document's semantics. The second
factor, which affects the weight, is the document frequency DF: the
number of documents containing the particular term. The term weight is multiplied
by the inverse document frequency IDF. The more frequently a term appears
across documents, the less useful it is for discriminating the documents
having the term from the documents not having it. The de facto
standard for term weighting is the TF · IDF product and its modifications.
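A minimal sketch of the plain TF · IDF weight described above; real engines use damped variants, e.g. logarithmic TF scaling, and the counts below are made-up toy numbers.

```python
import math

def tf_idf(tf, df, num_docs):
    """Plain TF*IDF weight: term frequency times inverse document
    frequency. Real systems use damped variants, e.g. log-scaled TF."""
    idf = math.log(num_docs / df)
    return tf * idf

# A term occurring 3 times in a document, appearing in 10 of 1000 documents:
w = tf_idf(tf=3, df=10, num_docs=1000)
print(round(w, 3))
```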
A simple query Q is a set of keywords. It is also transformed into an
m-dimensional vector ~q = (w1, w2, . . . , wm) using all preprocessing steps like
stopword elimination, stemming, and term weighting. After the creation of
~q and ~d, the similarity between the document’s vector and the query’s vector
is estimated. This estimation is based on a similarity function, which can be a
distance or angle measure. The most popular similarity function is the cosine
measure, which is computed as a scalar product between ~q and ~d.
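The cosine measure can be sketched as the scalar product of the two vectors divided by their lengths; the sparse dictionaries below stand in for the (mostly zero) term-weight vectors ~q and ~d, with made-up weights.

```python
import math

def cosine(q, d):
    """Cosine similarity between sparse term-weight vectors
    represented as {term: weight} dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"run": 1.0, "marathon": 1.0}
d = {"run": 2.0, "marathon": 1.0, "shoe": 1.0}
print(round(cosine(q, d), 3))
```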
Another popular approach, which tries to overcome the heuristic nature of
term weight estimation, comes from the probabilistic model. The language
modeling approach [PC98] to information retrieval attempts to predict the
probability of query generation given a document. Although details may
differ, the main idea is the following: every document is viewed as a sample
generated from a special language. A language model for each document can
be estimated during indexing. The relevance of a document for a particular
query is formulated as how likely the query was generated from the language
model for that document. The likelihood for the query Q to be generated
from the language model of the document D is computed as follows [SJCO02]:
P(Q|D) = ∏_{i=1}^{|Q|} [ λ · P(ti|D) + (1 − λ) · P(ti|G) ]   (2.1)
where:
ti is the i-th term of the query Q;
P(ti|D) is the probability for ti to appear in the document D;
P(ti|G) is the probability for the term ti to be used in the common
language model, e.g. in English;
λ is the smoothing parameter between zero and one.
The role of P(ti|G) is to smooth the probability of the document D to
generate the query term ti, particularly when P(ti|D) is equal to zero.
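Equation 2.1 can be evaluated directly; in this sketch the toy dictionaries stand in for maximum-likelihood estimates of P(t|D) and P(t|G), which a real system would derive from term counts.

```python
def query_likelihood(query_terms, p_doc, p_global, lam=0.5):
    """Smoothed query likelihood P(Q|D) as in Equation 2.1.
    p_doc and p_global map terms to P(t|D) and P(t|G); lam is
    the smoothing parameter lambda in (0, 1)."""
    score = 1.0
    for t in query_terms:
        score *= lam * p_doc.get(t, 0.0) + (1 - lam) * p_global.get(t, 0.0)
    return score

p_doc = {"web": 0.1, "search": 0.05}            # toy estimates of P(t|D)
p_global = {"web": 0.01, "search": 0.02, "the": 0.06}  # toy P(t|G)
print(query_likelihood(["web", "search"], p_doc, p_global, lam=0.5))
```

Note how smoothing keeps the product nonzero even for query terms absent from the document, as long as P(t|G) is positive.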
The usual measures for retrieval effectiveness evaluation are recall
and precision; they are defined as follows [MYL02]:

recall = NumberOfRetrievedRelevantDocuments / NumberOfRelevantDocuments   (2.2)

precision = NumberOfRetrievedRelevantDocuments / NumberOfRetrievedDocuments   (2.3)
The effectiveness of a text retrieval system is evaluated using a set of test
queries. The relevant document set is identified beforehand. For every test
query a precision value is computed at different levels of recall; these
values are averaged over the whole query set and an average recall-precision
curve is produced. In the ideal case, when a system retrieves exactly the full set of
relevant results every time, the recall and precision values are both equal to
1. In practice, we cannot achieve such effectiveness due to query ambi-
guity, each user’s specific understanding of the relevance notion, and other factors.
Incorporating explicit user feedback and user pref-
erences implicitly inferred from previous search sessions can improve the retrieval quality.
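Equations 2.2 and 2.3 translate directly into code; the retrieved and relevant document sets below are made-up toy data.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision as defined in Equations 2.2 and 2.3."""
    retrieved_relevant = len(set(retrieved) & set(relevant))
    recall = retrieved_relevant / len(relevant)
    precision = retrieved_relevant / len(retrieved)
    return recall, precision

# 10 documents retrieved, 4 of them among the 8 relevant ones overall:
r, p = recall_precision(retrieved=list(range(10)),
                        relevant=[0, 2, 5, 7, 20, 21, 22, 23])
print(r, p)
```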
2.2 Web search engines
An information retrieval system for Web pages is called a Web search engine.
The capabilities of these systems are very broad; the modern techniques
allow queries on text, image, and sound files. In our work, we consider the
problem of text data retrieval. Web search engines are also differentiated
by their application area. The general-purpose search engines can search
across the whole Web, while the special-purpose engines focus on
specific information sources or specific subjects. We are interested in the
general-purpose Web search engines.
The Web search engines inherited many properties from traditional infor-
mation retrieval. Every Web search engine has a text database or, equally, a
document collection that consists of all documents searchable by this engine.
An index for these documents is created before query time; every term in it
represents a single keyword or phrase. For each term one inverted index
list is constructed; this list contains document identifiers for every document
containing the term, along with the corresponding similarity values. During
query execution, a search engine takes a union over the inverted index lists
corresponding to the query terms. Then the search engine sorts all found docu-
ments in descending order of their similarity score and presents the resulting
ranking to the user.
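The index structure and query execution just described can be sketched as follows; the postings and weights are toy values, standing in for precomputed similarity contributions such as TF · IDF scores.

```python
from collections import defaultdict

# Sketch of an inverted index: each posting maps a document id to a
# precomputed similarity contribution for that term.

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: weight}

    def add(self, term, doc_id, weight):
        self.postings[term][doc_id] = weight

    def search(self, query_terms):
        # Union of the posting lists, summing per-document scores,
        # then sort by descending score.
        scores = defaultdict(float)
        for t in query_terms:
            for doc_id, w in self.postings[t].items():
                scores[doc_id] += w
        return sorted(scores.items(), key=lambda kv: -kv[1])

idx = InvertedIndex()
idx.add("web", 1, 0.8)
idx.add("web", 2, 0.3)
idx.add("search", 2, 0.9)
print(idx.search(["web", "search"]))
```

Document 2 ranks first because its contributions from both query terms are summed.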
There are also distinctive features of the Web retrieval that were not
used in traditional information retrieval. The most prominent examples are
additional hyperlink relationships between the documents and intensive
document tagging. These differences can serve as sources of additional
information for a search refinement and they are exploited in the different
retrieval algorithms.
Web developers created a significant portion of the hyperlinks on
the Web manually, and this is an implicit intellectual input. The linkage
structure can be used as expert evidence that two pages connected by a
hyperlink are also semantically related. It can also be an indication that
the Web designer, who placed the hyperlinks on some pages, assesses their
content as valuable. Several algorithms are based on these considerations.
The PageRank algorithm computes the global importance of the Web page
in a large Web graph, which is inferred from the set of crawled pages [BP98].
The advantage of this algorithm is that a global importance of the page can
be precomputed before query execution time. The HITS algorithm [Kle99]
uses only a small subset of the Web graph; this subset is constructed at
query time. Such online computation is inconvenient for search engines
with a high query workload, but it allows a topic-oriented page authority
computation.
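A minimal power-iteration sketch of PageRank over a toy Web graph; the damping factor 0.85 is the value commonly cited in the literature, and this simple version does not handle dangling pages (pages without out-links), which a production implementation must.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns page -> rank score.
    Assumes every page has at least one out-link (no dangling pages)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

Page c ends up with the highest rank, since both a and b link to it; the ranks sum to 1 and can be precomputed offline, as the text notes.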
The HTML tagging of documents can also be of use in the Web search
engines. Rich information about the importance of terms is inferred from
their position in a document. The terms in the title section are more impor-
tant than in the body of the document. Emphasizing with a font size and
style also indicates additional importance of the term. Sophisticated
term weighting schemes based on these observations improve the
retrieval quality.
There are several important limitations of the existing Web search en-
gines. The first restriction is imposed on the size of the searchable index. Accord-
ing to the Google statistics (www.google.com), this search engine has the
largest crawled index on the Web, its current size is about 8 billion pages.
At the same time the Hidden Web or Deep Web, which embraces the pages
that were excluded from the crawling for commercial or technical reasons,
is about 2 to 50 times larger than the Visible Web [SP01]. Even now
it is unrealistic for a single search engine to maintain an index of this size
and the information volume increases even faster than the computational power
of the centralized Web search engines. The second problem is the outdating
of the crawled information. News pages change daily, and it is impossible
to update the whole index at this rate. Some updating strategies
help track changes on the most popular sites on the Internet, but many index
entries are completely outdated. The novel opportunities provided by the
peer-to-peer systems help to solve these problems.
2.3 Peer-to-Peer architecture
A distributed system is a collection of autonomous computers that co-
operate in order to achieve a common goal [Cra01]. In the ideal case, a user of
such a system does not explicitly notice other computers, their location, stor-
age replication, load balancing, reliability or functionality. A P2P system is an
instance of a distributed system; it is a decentralized, self-organized, highly
dynamic, loose coupling of many autonomous computers.
P2P systems became famous several years ago with the Napster
(http://www.napster.com/) and Gnutella (http://www.gnutella.com/) file-
sharing systems. In the file-sharing P2P communities, every computer can
join as a peer using the client program. Other peers can access all resources
shared by the peers in this environment. The main feature of such systems
is that the peer who is looking for a file can directly contact the peer that
is sharing this file. The only information that has to be propagated is the
peer’s address and a short description of the shared data.
The first systems like Napster used a centralized server with all peer
addresses and the names of the shared files. Other approaches avoided this single
point of failure and used the Gnutella-style flooding protocol, which
broadcasts a request for a particular file through a small number of closest
neighbors until the message expires. Modern P2P applications like
e-Donkey (http://www.edonkey2000.com/) are extremely popular now; they
have numerous improvements over their predecessors.
Thus, we can harness the power of thousands of autonomous personal
computers all over the world to create a temporary community for collab-
orative work. P2P technology aims to make systems scalable, self-
organizing, fault-tolerant, publicly available, and load-balanced. This list of desir-
able P2P properties is not exhaustive, and there are also issues like anonymity,
security, etc., but the selected properties are fundamental for our task. For
example, modern P2P systems are often based on a hybrid topology in which
some "super-peers" establish different levels of hierarchy, but we are in-
terested in a pure, flat P2P structure. It gives equal rights to all peers and
makes a system more scalable.
Limited search capability is a considerable drawback of most
P2P systems: sometimes the user has to know the exact filename of the
data of interest or relevant results will be missed. The combination of
search engine mechanisms for effective retrieval with the powerful paradigm
of a P2P community is a promising research direction.
2.4 P2P Web search engines
The idea of a peer-to-peer Web search engine is being extensively investigated
nowadays. Interesting combinations of search services with P2P
platforms are described in the following approaches.
ODISSEA [SMW+03] differs from many other P2P search approaches.
It assumes a two-layered search engine architecture and a global index struc-
ture distributed over the nodes of the system. Under a global index orga-
nization, in contrast to a local one, a single node holds the entire inverted
index for a particular term. A distributed version of Fagin's threshold al-
gorithm is used for result aggregation over the inverted lists; it is efficient
only for very short queries of about 2-3 words. For the distributed hash table
(DHT) implementation, this system incorporates the Pastry protocol.
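The threshold algorithm mentioned above can be sketched as follows. This is a minimal, centralized sketch over small in-memory posting lists; the function name, list contents, and scores are illustrative, and the actual ODISSEA version distributes the lists over the network.

```python
def threshold_algorithm(lists, k):
    """Fagin's threshold algorithm (TA): scan score-sorted lists in parallel,
    aggregate scores by summation, and stop once the current top-k can no
    longer be overtaken by any unseen document."""
    lookup = [dict(l) for l in lists]   # simulated random access per list
    seen = {}                           # doc -> aggregated score
    top = []
    for depth in range(max(len(l) for l in lists)):
        threshold = 0.0                 # best possible score of an unseen doc
        for l in lists:
            if depth >= len(l):
                continue
            doc, score = l[depth]       # sorted access at the current depth
            threshold += score
            if doc not in seen:
                seen[doc] = sum(lu.get(doc, 0.0) for lu in lookup)
        top = sorted(seen.items(), key=lambda x: -x[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break                       # early termination
    return top

# Two per-term posting lists, sorted by descending score
lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)],
]
top2 = threshold_algorithm(lists, k=2)  # d2 and d1 have the highest sums
```

The early-termination test explains why TA degrades on long queries: the threshold is a sum over all query terms, so with many lists it stays high for a long time and deep scans are needed.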
PlanetP [CAPMN03] is another content search infrastructure. Each node
maintains an index of its content and summarizes the set of terms in its index
using a Bloom filter. The global index is the set of all summaries. Summaries
are propagated and kept synchronized using a gossiping algorithm. This
approach is effective for several thousand peers, but it does not scale beyond that,
and its retrieval quality is rather low for top-k queries with a small k.
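The Bloom-filter summaries can be illustrated with a small sketch. The parameters (1024 bits, 3 hash functions) and the class design are illustrative choices, not PlanetP's actual configuration.

```python
import hashlib

class BloomFilter:
    """Compact term-set summary of a peer's index: no false negatives,
    but false positives are possible."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, term):
        # Derive k bit positions from k salted hashes of the term
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{term}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, term):
        for p in self._positions(term):
            self.bits |= 1 << p

    def __contains__(self, term):
        return all(self.bits >> p & 1 for p in self._positions(term))

summary = BloomFilter()
for t in ["peer", "search", "merge"]:
    summary.add(t)
```

A peer gossips only the bit vector, so other peers can test "does this peer possibly have term t?" without contacting it; a positive answer may occasionally be wrong, which is acceptable for query routing.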
GALANX [WGDW03] is a P2P system implemented on top of
BerkeleyDB. Similar to the Minerva system, it maintains a local peer
index on every node and distributes information about term presence on a
peer with a DHT. Different query routing strategies are evaluated in
simulations. Most of them are based on the Chord protocol, and the proposed
strategies improve the basic effectiveness by enlarging the index size.
The presented query routing approaches are not highly scalable, since the
index volume continuously increases with the number of peers in the system.
2.5 Minerva project
The Minerva project [BMWZ04] is another Web search engine that is
based on a P2P architecture; see Figure 2.1. In this system, every peer Pi
provides an efficient search engine for its own focused Web crawl Ci. The
documents Dij are indexed locally, and the result is posted into a global
directory as a set of index statistics Si. The posting process and all other com-
munication between the peers are based on the Chord protocol [SMK+01].
Every peer contains a set of peerlists Li for a disjoint subset of terms Ti,
where T1 ∪ … ∪ T|P| = T. A peerlist l is a mapping t → P′, where t is a particular
term and P′ is the subset of peers that contain at least one document
with this term. The terms are hashed, and their corresponding peerlists are
distributed fairly across the peers by the Chord protocol. During query exe-
cution, all necessary peerlists, one for each query keyword, are obtained and
merged into one.
Figure 2.1: The Minerva system architecture
Every peer can pose a query against a number of selected peers that are
most likely to contain the relevant documents. The selection is based
on a query routing strategy; this issue is known in the literature as the
database selection problem. The search engine on every selected peer processes
its inverted index until it obtains the top-k highest ranked documents for the
current query. Then the best top-k results from these peers are collected
by the query initiator and merged into one top-k list; this task is known as
the result merging problem. The quality of the final top-k list depends heavily
on the peers' term weighting schemes and on the merging algorithm, whereas speed
depends mostly on the local index processing scheme.
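The directory lookup and peerlist merging described above can be sketched as follows. This is a simplified single-process sketch in which a plain dictionary stands in for the Chord-partitioned global directory; all names and contents are illustrative.

```python
# Global directory mapping terms to peerlists; in Minerva this mapping is
# partitioned over the peers via Chord, here it is a plain dictionary.
directory = {
    "peer":   {"P1", "P2"},
    "search": {"P2", "P3"},
}

def candidate_peers(query_terms):
    """Fetch one peerlist per query keyword and merge them into one set
    of candidate peers for query routing."""
    peerlists = [directory.get(t, set()) for t in query_terms]
    return set().union(*peerlists)

print(sorted(candidate_peers(["peer", "search"])))  # ['P1', 'P2', 'P3']
```

The merged set is then the input to database selection: the query initiator ranks these candidate peers and forwards the query only to the most promising ones.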
2.6 Summary
In this chapter, we introduced several basic concepts from information re-
trieval and Web search. We described some key ideas of P2P systems and
reviewed several combinations of Web search engines with P2P platforms.
We also gave a short description of the new P2P Web search engine Minerva.
The scalability issue was recognized as extremely important, and the P2P
architecture seems valuable in terms of effective and efficient retrieval.
Chapter 3
Result merging in distributed
information retrieval
In this chapter, we review recent work on distributed information re-
trieval. In Section 3.1, we give a short overview of the general metasearch
issues. Section 3.2 contains a comprehensive description of the result merging
task. In Section 3.3, we elaborate on the collection fusion task. In Section
3.4, we address the problems of the data fusion task.
3.1 Distributed information retrieval in general
During the past ten years, a new research direction has emerged: distributed
information retrieval, or metasearch. Metasearch is the task of collecting and
combining the search results from a set of different sources. A typical scenario
includes several search engines that execute a query and one metasearch
engine that merges the results and creates a single ranked document list.
Several interesting surveys of distributed information retrieval problems and
solutions are presented in [Cal00, Cra01, Cro00, MYL02].
The distributed information retrieval task arises when the documents of in-
terest are spread across many sources. In such a situation, one can either
collect all documents on one server or establish multiple search engines, one
for each collection of documents. The search process is then performed across the
network, with communication between many servers; this is the distinctive
feature of distributed information retrieval.
Search in a distributed environment has several attractive proper-
ties that make it preferable to single-engine search. Several of these
important features are listed in [MYL02]:
• Increased coverage of the Web: the indices from many sources are
used in one search;
• A solution to the search scalability problem: a combination is
cheaper than a centralized solution;
• Automation of result preprocessing and combination: a user does
not have to compare and combine the results from different sources
manually;
• Improved retrieval effectiveness: a combination of different search
engines can produce a better ranking than any single ranking algorithm.
Metasearch is based on the multi-database model, where several text
databases are modeled explicitly. The multi-database model for information
retrieval has many tasks in common with the single-database model but also
poses some additional problems [Cro00]:
• The resource description task;
• The database selection task;
• The result merging task.
These issues are essentially the core of distributed information retrieval re-
search; we briefly describe them below.
The main unit in metasearch is an intermediate broker called
a metasearch engine. It obtains and stores a limited summary about every
database participating in the search process and decides which databases are
most appropriate for a query. A metasearch engine also propagates a query
to the selected search engines, then collects and reorganizes the results. A simple
metasearch architecture is presented in Figure 3.1. A user poses a query Q
against a metasearch engine, which in turn propagates it to several search
engines. Then the result rankings Ri are returned to the broker, merged,
and presented to the user as a single document ranking Rm.
Figure 3.1: Simple metasearch architecture
The summary statistics of a search engine are called a resource description
or database representative. A full-text database provides information about
its contents as a set of statistics, which may include the number
of occurrences of specific terms in particular documents or in the whole
collection, the number of indexed documents, etc. The information for building
a resource description is obtained during the index creation step. The richness
of the database representatives depends on the level of cooperation in the
system. For example, the STARTS standard [GCGMP97] is a good choice
for a cooperative environment, where all search engines present their results
in a unified, informative format. On the other hand, when they are unwilling
to cooperate, we can infer their statistics by query-based sampling [SC03].
The collected resource descriptions are used for the database selection,
or query routing, task. In practice, we are not interested in databases
that are unlikely to contain relevant documents. Therefore, from all data
sources we can select only those that are probably relevant to our query
according to their resource descriptions. For each database, we calculate a
usefulness measure, usually based on the vector space model. Creating
an effective and robust usefulness measure for database ranking is the
most prominent task of database selection. Several attempts to address this
problem are described in [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02].
The result merging problem arises when a query is executed on several
selected databases and we want to create one single ranking out of these
results. This problem is not trivial, since the computation of the similarity
score between documents and the query uses local collection statistics; there-
fore, the scores are not directly comparable. The most accurate solution
is global score normalization, which requires cooperation from the
sources. We are especially interested in this latter problem: a
carefully designed result merging algorithm can provide high-quality
results and an opportunity to speed up local index processing.
More information about the result merging methods can be found in
[CLC95, TVGJL95, Bau99, Cra01, SJCO02, SC03].
Figure 3.2: A query processing scheme in the distributed search system
The query processing scheme is presented in more detail in Figure 3.2
[Cra01]. A query Q is posed against a set of search engines represented
by their resource descriptions Si. The metasearch engine selects a subset of
servers S' that are most likely to contain the relevant documents. The
size of this subset usually does not exceed 10 databases. The broker routes
Q to the selected search engines Si' and obtains a set of document rankings
Ri from the selected servers. In the real world, a user is interested only in the
top-k best results, where k can vary from 5 to 30. All rankings Ri are merged
into one ranking Rm, and its top-k results are presented to the user.
Text retrieval aims at high relevance of the results at minimum
response time. These two components translate into the general issues
of effectiveness (quality) and efficiency (speed) of query processing.
This thesis is concerned with the effectiveness aspect of the result merging problem.
3.2 Result merging problem
A common issue in metasearch is how to combine several ranked lists
of relevant documents from different search engines into one ranked
list: the so-called result merging problem. The following sections review
some modern merging methods.
Result merging is divided into two main sub-problems. The first is
collection fusion, where results from disjoint or nearly disjoint
document sets are merged. The second is data fusion, which arises
when we merge different rankings obtained over identical document
sets.
The main difference between collection fusion and data fusion is that
in the first case we want to approximate the result of a single search system
whose document set is the union of all document subsets involved in the
merging. Therefore, the optimal solution achieves the very same retrieval
effectiveness as a search engine with the united database. In
the data fusion problem, however, the task is to merge the different rankings in such
a way that the final ranking is better than every participating ranking. The
maximum achievable quality here is undefined, but it should be no less than
the quality of the best single ranking. A simple intuition for these two problems
is presented in Figure 3.3. A comprehensive description of the differences
between collection fusion and data fusion can be found in [VC99b, Mon02].
In metasearch, we often do not know beforehand what kind of merg-
ing problem we have, because it depends on the level of overlap between the
documents of the combined databases. If the overlap is very high, the situation
is closer to data fusion; otherwise it is a collection fusion task.
Metasearch on the Web has been addressed mainly as a collection fusion prob-
lem; in fact, the overlap of search results from different search engines is
surprisingly low. However, some approaches also take the data
fusion methods into account, and sometimes both types are evaluated in mixed
setups.
Figure 3.3: Collection fusion vs. data fusion
Another important property is the level of search engine cooperation. We
divide all merging methods by environment type into two categories:
• Cooperative (integrated) environment;
• Uncooperative (isolated) environment.
The uncooperative or isolated merging methods have no other access to the
individual databases than a ranked list of documents in response to a
query [Voo95]. The cooperative or integrated merging techniques assume
access to database statistics such as TF, DF, etc. In general, both
types of merging methods can produce a more effective result than the single
collection with the full set of documents if the data fusion strategy is used
[TVGJL95]. In practice, the merged results produced by the uncooperative
strategies have been less effective than the single collection run.
Our primary goal is to find a subset of effective merging methods
that we can apply and evaluate in the P2P Web search engine Minerva.
We assume here that all peers in the Minerva system are cooperative and
provide all necessary statistics.
3.3 Prior work on collection fusion
A formal definition of the collection fusion problem was stated in [TVGJL95].
There it is mixed with the data fusion definition; therefore, we modified it. Assume
a set of document collections C associated with the search engines. With
respect to a query Q, each collection Ci contains a number of relevant
documents. After the query Q is posed against the collection Ci, the search
engine returns a ranked list Ri of documents Dij in decreasing order of
their similarity Sij to the query. The top-k result is the merged ranked list
of length k containing the documents Dij with the highest similarity values
Sij in decreasing order. Consider the united document collection Cg = ⋃i Ci and
its top-k result Rg, which contains the documents Dgj with similarity values
Sgj. The collection fusion task is: given Q, C, and k, find from ⋃i Ri the top-k
result Rc of documents Dcj such that Scj = Sgj.
3.3.1 Collection fusion properties
An ideal collection fusion method combines the documents from the local search
results into one ranked list in descending order of their global similarity
scores, the scores that would be produced by a single global search
system over the united database containing all local documents. In a coop-
erative environment, where all search engines provide the necessary statistics, we
can achieve merging consistent with a non-distributed system;
this is also known as perfect merging or merging with normalized scores
[Cra01]. In practice, no efficient collection fusion technique can guarantee
exactly the same ranking as on a centralized database holding all documents
from all involved databases. Three main factors affect collection fusion:
1. Only the documents returned by the selected servers can participate in
the merging. Some relevant documents will be missed after the database
selection step.
2. Different statistics and retrieval algorithms cause the separate
problem of incomparable scores. A document might be missed
when the top-k results are merged and the necessary document is
locally ranked (k+1)th or lower. This problem can be solved by
global statistics normalization methods in the cooperative environment.
3. Overlap between the databases (see Figure 3.4). The pure col-
lection fusion approaches [VF95, Kir97, CLC95] do not consider over-
lap. It is quite difficult to accurately estimate the actual level of
document overlap between datasets. Our assumption is that the
degradation of result quality due to overlap is small, whereas the
effort required for statistics correction would be significant.
Figure 3.4: An overlapping in the collection fusion problem
3.3.2 Cooperative environment
In [SP99, SR00] it was claimed that simple raw score merging can show
good retrieval performance. The raw-score approach might
be a valid first attempt for merging result lists that are provided
by the same retrieval model. In [CLC95] it was suggested that collection
fusion based on raw TF values is a valuable approach when the
involved databases are more or less homogeneous; retrieval quality then
degrades only by 10%. However, we assume topically organized collections,
and these have highly skewed statistics.
The most effective collection fusion methods are the score normaliza-
tion techniques, which are based on consistent global collection statistics.
All search engines must produce the document relevance scores using the
same retrieval algorithms, including the document ranking algorithm, stemming
method, and stopword list. A metasearch engine collects all required local sta-
tistics from the selected databases before or at query time. Notice that
under the common TF · IDF scheme the TF component is document-
dependent and therefore fair across all databases. In contrast, the IDF compo-
nent is collection-dependent, so we should normalize it globally. Analogously,
in language modeling the P(ti|D) component remains unchanged and
P(ti|G) should be recomputed. The communications for such aggregation
are presented in Figure 3.5 [Cra01].
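A minimal sketch of this normalization under the TF · IDF scheme: the TF component stays as computed locally, while the IDF is recomputed from document frequencies and document counts aggregated over all participating databases. All statistics and names shown are illustrative.

```python
import math

# Statistics exported by each selected database: per-term document
# frequencies (DF) and the number of indexed documents
db_stats = [
    {"df": {"merge": 10, "peer": 40}, "ndocs": 1000},
    {"df": {"merge": 5,  "peer": 1},  "ndocs": 500},
]

def global_idf(term):
    """IDF recomputed from DF and document counts summed over all databases."""
    df = sum(s["df"].get(term, 0) for s in db_stats)
    n = sum(s["ndocs"] for s in db_stats)
    return math.log(n / df)

def normalized_score(tf_vector, query_terms):
    # The TF component is document-dependent and stays as computed locally;
    # only the collection-dependent IDF component is replaced globally.
    return sum(tf_vector.get(t, 0) * global_idf(t) for t in query_terms)

score = normalized_score({"merge": 3, "peer": 1}, ["merge", "peer"])
```

With every database scoring against the same global IDF, the returned scores become directly comparable, and merging reduces to a sort.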
In scheme A from [VF95], the search engines exchange their DF sta-
tistics among themselves before query time, so comparable similarity scores
can be computed during query execution. Under scheme B [CLC95], the
databases also return comparable scores, but the document frequency
statistics are collated at the metasearch engine and sent along with the query. In
scheme C [Kir97], instead of communication before query time, all search
engines return the TF and DF statistics together with the document rank-
ings, and this full information is used for fair fusion. The distinction between
the first two schemes and the last one is that A and B are based on DF sta-
tistics from all search engines, whereas scheme C is based only on statistics
from the selected search engines. All three schemes return statistics sufficient
for comparable scores, and the metasearch engine performs the fusion by
sorting the result documents in descending order of their similarity scores.
In [LC04b] a method for merging results in a hierarchical
peer-to-peer network was proposed. SESS is a cooperative algorithm that requires the
neighbors to provide summary statistics for each of their top-ranked docu-
ments. It is an extended version of Kirsch's algorithm [Kir97] and allows very
accurate normalized document scores to be determined before any document
is downloaded. However, its limitation is that a hierarchical system
is assumed.
3.3.3 Uncooperative environment
When the environment is uncooperative, or the search engines are inclined to
cheat, it is still possible to obtain a good approximation of the globally com-
puted scores. In [CLC95, Cal00, SJCO02] a merging strategy was proposed
that is based on both resource and document scores. The
database selection algorithm CORI assigns a score to each database that
reflects its ability to provide relevant documents to a query.
The local document score is then weighted by the database score and some
heuristically set constants. This method can work in a semi-cooperative en-
vironment, when the document scores are available, or in an uncooperative
setup with slightly degraded accuracy. A similar merging strategy was
proposed in [RAS01] with different formulas for the database rank estima-
tion; the final score is again the product of the database score and the local
score, computed in a heuristic way.
Figure 3.5: Statistics propagation for the collection fusion
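The CORI-style weighting described above can be sketched as follows. The constants 0.4 and 1.4 follow the commonly cited CORI merging heuristic; the function name, database scores, and document scores are illustrative.

```python
def cori_merge(results, db_scores):
    """Weight each local document score by its database's selection score.

    results:   {db: [(doc, local_score), ...]}
    db_scores: {db: normalized database selection score in [0, 1]}
    Uses the common CORI merging heuristic  (d + 0.4 * d * c) / 1.4.
    """
    merged = []
    for db, ranking in results.items():
        c = db_scores[db]
        for doc, d in ranking:
            merged.append((doc, (d + 0.4 * d * c) / 1.4))
    return sorted(merged, key=lambda x: -x[1])

results = {"A": [("a1", 0.9), ("a2", 0.5)], "B": [("b1", 0.8)]}
ranking = cori_merge(results, {"A": 1.0, "B": 0.2})  # a1, b1, a2
```

Note how b1's strong local score is dampened by its database's low selection score: documents from poorly ranked databases are pushed down without global statistics ever being exchanged.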
In [PFC+00] it was claimed that when database selection is employed,
it is not necessary to maintain collection-wide information such as the global
IDF; local information can be used to achieve superior performance.
This means that distributed systems can be engineered with more autonomy
and less cooperation. In contrast, in [LCC00] it was discovered that it is better
to organize the collections topically, and that for result merging, the top-
ically organized collections require the global IDF for the best performance;
normalized scores are not as good as the global IDF for merging when
the collections are topically organized. This ongoing polemic is a good indi-
cator that a comprehensive experimental evaluation of the collection fusion
methods is still open research.
3.3.4 Learning methods
Another merging strategy uses logistic regression for score transfor-
mation [CS00]. This method requires training queries for learning the model.
The presented experiments show that the logistic regression approach is sig-
nificantly better than the Round-Robin, raw-score, and normalized raw-score
approaches. In [SC03] a linear regression model was used for collection
fusion with a small overlap between collections. A centralized sample database
is assumed, which stores a certain number of documents from each resource.
The metasearch engine runs the query on the sample
database at the same time it is propagated to the search engines. Then the
central broker finds the duplicate documents between the results from the sample
database and from each resource. The document scores in
all results are normalized by linear regression analysis, with the document
scores from the sample database taken as a baseline. The experimental re-
sults showed that this method performs slightly better than CORI.
However, all learning-based approaches assume some kind of training set,
which is unaffordable in a highly dynamic environment: it is hard to maintain such
information for thousands of databases that freely join and leave the
system.
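The regression step of the sample-database method can be sketched as follows: a least-squares line is fitted over the (local score, sample score) pairs of the duplicate documents and then applied to the remaining local scores. All scores shown are illustrative.

```python
def fit_linear(pairs):
    """Least-squares fit y ≈ a * x + b over (local_score, sample_score)
    pairs collected from the duplicate documents."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Duplicates found between one resource's results and the sample database:
# (score assigned by the resource, score assigned by the sample database)
duplicates = [(0.9, 0.70), (0.6, 0.46), (0.3, 0.22)]
a, b = fit_linear(duplicates)
# Map the remaining local scores onto the sample database's score scale
normalized = [a * s + b for s in (0.8, 0.5)]
```

Because every resource is calibrated against the same sample database, the transformed scores of all resources live on one common scale and can be merged by simple sorting.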
3.3.5 Probabilistic methods
Several collection fusion methods were developed for the probabilistic re-
trieval model. The approach in [Bau99] was designed for the cooperative
environment: a probabilistic ranking algorithm is based on the statistics
exported by the search engines, and consistent merging in the metasearch
is achieved with respect to the probability ranking principle. Another ap-
proach with a probabilistic principle is investigated in [NF03]. This paper
explores a family of linear and logistic mapping functions for dif-
ferent retrieval methods. The retrieval quality of distributed retrieval is
only slightly improved by using the logistic function.
Language modeling for collection fusion [SJCO02] is another prob-
abilistic method, based on the same assumptions as Equation 2.1. The merg-
ing of results from different text databases is performed under a single
probabilistic retrieval model. This approach is designed primarily for
intranet environments, where it is reasonable to assume that the resource
providers are relatively homogeneous and can adopt the same kind of search
engine. The language model based merging approach is applied to inte-
grate the results. Compared with heuristic methods like the CORI algorithm,
this framework is better justified by probability theory.
3.4 Prior work on the data fusion
A formal definition of data fusion for distributed retrieval is given
here. Assume a set of identical document collections C associated with
different search engines and retrieval algorithms. With respect to a query
Q, each collection Ci contains the same set of relevant documents.
After Q is posed against Ci, the search engine returns a ranked list Ri of
documents Dij in decreasing order of their similarity Sij to the query. The
top-k result is a merged ranked list of length k containing the documents
Dij with the highest similarity values Sij in decreasing order. The data
fusion task is: given Q, C, and k, find from ⋃i Ri the top-k result Rd of
documents Ddj such that ∑(j=1..k) Sdj is maximized. In our setup, the
rankings from the different local search engines should be combined so that they
collect the most relevant documents from all rankings and put them into the
merged top-k result. See Figure 3.6.
Figure 3.6: Data fusion on a single search engine
3.4.1 Data fusion properties
Data fusion attempts to exploit the three effects described by Diamond [Dia96]. They
can occur during a combination of different rankings over a single document
collection [VC99b]:
• The skimming effect happens when the retrieval approaches represent
the documents differently and thus retrieve different relevant docu-
ments. A combination model that takes the highly ranked items from
every retrieval approach could outperform the effectiveness of any single
combined ranking.
• The chorus effect occurs when several retrieval approaches suggest that
an item is relevant to a query. This is used as evidence of the higher
relevance of that document.
• The dark horse effect occurs because a retrieval approach may produce
unusually accurate estimates of relevance for some documents, in com-
parison with the other retrieval approaches. A combination model may
exploit this fact by using the most accurate document score.
Obviously, the three effects are inversely correlated. For example,
if we pay more attention to the chorus effect, we decrease our chance of
benefiting from the dark horse effect. Finding the optimal tradeoff between these
three effects is essentially the data fusion, or retrieval expert combination,
task.
In some sense, the data fusion problem may be defined as a voting pro-
cedure in which a set of ranking algorithms selects the best k documents. The
most effective data fusion schemes are linear combinations of the similarity
scores produced by the different search engines, and the problem is to find the
optimal weights for such a combination. Two factors influence the performance
of any data fusion approach [Mon02]:
• Effective algorithms: each system participating in the fusion should
have an acceptable effectiveness, comparable with the others;
• Uncorrelated rankings: the rankings produced by the
different algorithms should be independent of each other.
Previous experiments confirmed that rankings which do not sat-
isfy these requirements reduce the quality of the fused ranking.
3.4.2 Basic methods
In [SF94] a number of combination techniques was proposed, including oper-
ators like Min, Max, CombSum, and CombMNZ. CombSum sets the score of
each document in the combination to the sum of the scores obtained from the
individual resources, while in CombMNZ the score of each document is ob-
tained by multiplying this sum by the number of resources that returned non-zero
scores for it. CombSum is equivalent to averaging, while CombMNZ is equiva-
lent to weighted averaging. In [Lee97] these methods were studied further
with six different search servers. The main contribution was to normalize
each information retrieval algorithm on a per-query basis, which substantially
improves the results. It was shown that the CombMNZ algorithm is the
best, followed by CombSum, while operators like Min and Max were
the worst. Three newer modifications of these algorithms, differing in the
weight estimation mechanism, can be found in [WC02]. Another method
[VC99a] is based on a linear combination of scores: the rele-
vance of a document to a query is computed by combining a score that
captures the quality of each resource with a score that captures the quality
of the document with respect to the query.
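CombSum and CombMNZ can be sketched in a few lines. The rankings here are assumed to be already normalized per query, as [Lee97] recommends, and all scores are illustrative.

```python
def comb_sum(rankings):
    """CombSum: score of a document = sum of its scores over all result lists."""
    scores = {}
    for ranking in rankings:
        for doc, s in ranking.items():
            scores[doc] = scores.get(doc, 0.0) + s
    return scores

def comb_mnz(rankings):
    """CombMNZ: CombSum multiplied by the number of lists that returned
    a non-zero score for the document (the chorus effect)."""
    summed = comb_sum(rankings)
    hits = {doc: sum(1 for r in rankings if r.get(doc, 0) > 0)
            for doc in summed}
    return {doc: summed[doc] * hits[doc] for doc in summed}

rankings = [{"d1": 0.8, "d2": 0.5}, {"d2": 0.6, "d3": 0.9}]
fused_sum = comb_sum(rankings)   # d2 accumulates evidence from both lists
fused_mnz = comb_mnz(rankings)   # ...and is boosted further by CombMNZ
```

In this toy example d2 overtakes d3 under CombMNZ although d3 has the single highest score, illustrating how the multiplier rewards agreement between rankings.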
3.4.3 Mixture methods
Some methods were developed for the mixture of the collection fusion and
data fusion problems. A simple but ineffective merging method is the Round-
Robin, which takes one document in turn from each of the available result
sets. The quality of such method depends on the performance of the compo-
nent search engines. Only if all the results have a similar effectiveness, then
the Round-Robin performs well, but if some result lists are irrelevant then
the whole result becomes poor. In [TVGJL95, VGJL94] was demonstrated
a way of improving the Round-Robin method. A probabilistic mechanism is
used to determine each of the documents in a merged list. It is based on the
length of returned document lists or the estimated usefulness of a database.
In particular, using a random experiment, one selects one of the contributing
ranked lists and then selects the top available element from that list and
places it in the next position in a result list. This procedure repeats until all
contributing lists are depleted. Later in [YR98] it was proposed a determinis-
tic version of this method. Two new techniques for the merging search results
are introduced in [CHT99, Cra01]: the feature distance ranking algorithm
and the reference statistics method. They are reasonably effective in the iso-
lated environment. It was shown that the feature distance algorithm is also
effective in the integrated environment. The problem of merging results by
exploiting document overlap was addressed in [WCG03]. This case lies
between the disjoint and the identical database settings. The task is to
merge the documents that appear in only one result list with those that
appear in several of them. New result merging algorithms are proposed that
exploit duplicate documents in two ways: one correlates the scores from
different result lists; the other treats duplicates as increased evidence of
relevance to a query.
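The probabilistic Round-Robin described above can be sketched as follows. This is a minimal illustration, not the Minerva code: the class name and the use of java.util collections are our own choices, and the list weights, uniform here, would in practice come from the list lengths or the estimated database usefulness.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Illustrative sketch of probabilistic Round-Robin merging: a random
// experiment picks a contributing list with probability proportional to
// its weight, takes that list's top available element, and appends it to
// the merged result until all lists are depleted.
public class ProbabilisticRoundRobin {

    public static List<String> merge(List<Deque<String>> lists,
                                     double[] weights, Random rnd) {
        List<String> merged = new ArrayList<>();
        while (true) {
            double total = 0.0;
            for (int i = 0; i < lists.size(); i++)
                if (!lists.get(i).isEmpty()) total += weights[i];
            if (total == 0.0) break;              // all lists depleted
            double draw = rnd.nextDouble() * total;
            for (int i = 0; i < lists.size(); i++) {
                if (lists.get(i).isEmpty()) continue;
                draw -= weights[i];
                if (draw <= 0.0) {                // list i was selected
                    merged.add(lists.get(i).pollFirst());
                    break;
                }
            }
        }
        return merged;
    }
}
```

The deterministic variant of [YR98] replaces the random draw with a fixed interleaving derived from the same weights.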
3.4.4 Metasearch approved methods
The result merging policies are also used by metasearch engines. They often
cannot obtain additional statistics from the individual search engines and
therefore use less effective fusion strategies. Most of their merging schemes
are based on the data fusion methods. The Metacrawler [SE97] is one of
the first metasearch engines developed for the Web. It uses the CombSum
data fusion method after eliminating duplicate URLs. In Profusion [GWG96],
another metasearch engine, all duplicate URLs are merged using the Max
function. The Inquirus metasearch engine [LG98] downloads the documents
and analyzes their content; a combination of similarity and proximity matches
is included in its ranking formula. Several merging methods based on the
similarity to result clusters were introduced in Mearf [OKK02]. The clusters
are obtained from the top-k document summaries provided by a search engine.
Whether the user clicked on a presented link was used as an implicit relevance
judgement for evaluation. The re-ranking is based on analyzing both the
contents and the links of the returned Web page summaries.
3.5 Summary
In this chapter, the common metasearch issues were described; in particular,
we elaborated on the result merging task. We provided details on the two
main problems of result merging: collection fusion and data fusion. Following
our taxonomy, we reviewed related work in both fields. This chapter
establishes a basis for the subsequent selection and evaluation of the result
merging methods.
Chapter 4
Selected result merging
strategies
This chapter contains the detailed descriptions of the result merging
strategies selected for the evaluation in the Minerva system. In Section 4.1
we investigate the properties of the available result merging methods and
define their target values. We describe the score normalization techniques:
with the global IDF values in Section 4.2, with the ICF values in Section
4.3, the CORI merging in Section 4.4, the language modeling-based merging
in Section 4.5, and with the TF values in Section 4.6.
4.1 Target properties for result merging methods
A set of properties specific to the Minerva system imposes restrictions on the
result merging methods. We identified the most distinctive environment
properties and summarized them in Table 4.1. All properties are graded as
desirable “++”, acceptable “+−”, or undesirable “−−”.
1. Document overlap (Overlapping: ++, Disjoint: +−). The effectiveness of
the methods differs on disjoint and on overlapping collections of documents.
In a P2P environment we assume some overlap between the collections on
the peers.

2. Inputs (Scores: ++, Ranks: −−). Some methods use only the ranks of the
documents to perform merging, while others use the similarity scores and
additional information about the collections. In general, the score-based
result merging methods work better than the rank-based methods when
additional information about the collection is available.

3. Database selection (Used: ++, Not used: +−). The result merging methods
are often effectively combined with the database selection step. They include
information about the database and cannot be performed efficiently without
a particular database selection method. Some methods are also sensitive to
differences between the information used in the database selection step and
in the result merging step.

4. Training data (Used: −−, Not used: ++). The most effective results can
be achieved with models learned from the current data, but learning methods
imply a relatively static environment with a limited number of nodes.

5. Scalability (High: ++, Low: −−). It was discovered that some particularly
good methods perform poorly with a large number of queried databases or
an increasing number of top-k results.

6. Content distribution (Skewed: ++, Uniform: +−). Another feature of a
merging method is its ability to deal with different types of document
distributions across the databases. A uniform distribution assumes that all
collections have similar proportions of the documents relevant to a query.
The collections in Minerva are topic-oriented, which is traditionally a
difficult testbed for the merging algorithms.

7. Integration (Cooperative: ++, Non-cooperative: +−). The result merging
techniques are designed with a certain degree of search engine cooperation
in mind. A search engine may provide us with all necessary statistics about
its database, or we have to obtain them with additional effort or discard
some parameters.

Table 4.1: The target properties of the result merging methods
4.2 Score normalization with global IDF
Several methods based on globally computed IDF values were proposed
in [CLC95, VF95, Kir97]; they differ in the particular algorithms for
collecting the necessary statistics. The rationale is that, in the case of a
disjoint partitioning of the documents, this method is expected to be the
most effective and equal to a centralized search engine using the TF · IDF
scoring function. With globally computed IDF values, we eliminate the
differences in statistics estimation for the scores from different databases.
Since our environment is cooperative, we can collect the required statistics
by posting the local DF_i values and the number of documents |C_i| in each
collection into a global directory; the global IDF (GIDF) values and the
scores are computed as follows:

GIDF_k = \log \left( \frac{\sum_{i=1}^{|C|} |C_i|}{\sum_{i=1}^{|C|} DF_{ik}} \right)    (4.1)

s = \sum_{k=1}^{|q|} TF_{ijk} \cdot GIDF_k    (4.2)

Where:
s — the similarity score for a query and a document under the vector space
model.
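As a sketch, Equations 4.1 and 4.2 can be computed directly from the statistics that the peers post to the directory. The class and method names below are illustrative, not the Minerva implementation; we assume the per-term local DF values and the collection sizes have already been collected.

```java
// Illustrative sketch of score normalization with a globally computed IDF:
// GIDF aggregates the posted per-collection statistics (Eq. 4.1), and the
// document score sums TF times GIDF over the query terms (Eq. 4.2).
public class GlobalIdfScoring {

    // Eq. 4.1: GIDF = log( sum of collection sizes / sum of local DFs )
    public static double gidf(long[] collectionSizes, long[] localDf) {
        long totalDocs = 0, totalDf = 0;
        for (long size : collectionSizes) totalDocs += size;
        for (long df : localDf) totalDf += df;
        return Math.log((double) totalDocs / totalDf);
    }

    // Eq. 4.2: s = sum over query terms of TF_ijk * GIDF_k
    public static double score(double[] tf, double[] gidfPerTerm) {
        double s = 0.0;
        for (int k = 0; k < tf.length; k++) s += tf[k] * gidfPerTerm[k];
        return s;
    }
}
```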
If the GIDF values are computed over every disjoint collection, this score
in the distributed environment should be exactly the same as for a single
database containing all documents D_ij. However, in practice, overlap
between the documents in different collections will affect the GIDF values.
If a document was crawled and indexed by several databases, then all terms
in this document will have higher DF values than they should. In theory,
we could account for this by providing a mechanism for finding duplicate
documents among all peers. However, in practice, such an effort to eliminate
the skew in the scores is unaffordable in a distributed system. A score
correction with respect to overlap would cost too much and would probably
give a negligible improvement. We assume that, on a very large collection,
the overlap affects all terms to approximately the same degree.
Another assumption is that such skewed term GIDF values may reflect
some latent tendencies. For example, the terms in the highly replicated
documents may be deemed less important than terms that are not so popular,
because the replicated documents may be easily found due to their wide
dissemination.
A problem may also occur if the GIDF′ score is computed only over the
subset of collections C′ that were chosen for a query by the database selection
algorithm. This corresponds to the non-distributed case where only the
documents from the selected databases are placed into the single collection:

GIDF'_k = \log \left( \frac{\sum_{i=1}^{|C'|} |C_i|}{\sum_{i=1}^{|C'|} DF_{ik}} \right)    (4.3)

s = \sum_{k=1}^{|q|} TF_{ijk} \cdot GIDF'_k    (4.4)

The effectiveness of such GIDF′ values will differ from that of the fully
computed GIDF, but they may still perform well. We want to investigate
how the database selection and the GIDF′ estimation influence the retrieval
effectiveness.
4.3 Score normalization with ICF
In distributed information retrieval, a measure emerged that does not
exist in traditional information retrieval: the inverted collection frequency,
or ICF:

ICF_k = \log \left( \frac{|C|}{CF_k} \right)    (4.5)

Where:
CF_k — the collection frequency, equal to the number of collections in which
the term occurs at least once;
|C| — the overall number of collections.

ICF is analogous to the IDF measure but one level higher: instead of the
notion of a document, we use the notion of a collection. It can replace the
IDF part in the score computation, since a term that occurs in many
collections is deemed less important than a rare one. The ICF measure is
fair for all collections and can be used in a scoring function:

s = \sum_{k=1}^{|q|} TF_{ijk} \cdot ICF_k    (4.6)

The advantage of this measure is that it is easy to compute: we only need
to know whether a term occurs in each collection and the number of nodes
in the system. But this approximation may perform worse than GIDF
because it gives a more “averaged” view of term importance.
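Equations 4.5 and 4.6 can be sketched as follows; the names are illustrative, and only the number of collections and the per-term collection frequency from the directory are assumed to be known.

```java
// Illustrative sketch of score normalization with the inverted collection
// frequency: ICF = log(|C| / CF) (Eq. 4.5), used in place of IDF (Eq. 4.6).
public class IcfScoring {

    // Eq. 4.5: numCollections = |C|, collectionFrequency = CF_k
    public static double icf(int numCollections, int collectionFrequency) {
        return Math.log((double) numCollections / collectionFrequency);
    }

    // Eq. 4.6: s = sum over query terms of TF_ijk * ICF_k
    public static double score(double[] tf, double[] icfPerTerm) {
        double s = 0.0;
        for (int k = 0; k < tf.length; k++) s += tf[k] * icfPerTerm[k];
        return s;
    }
}
```

With 50 peers, a term present on 5 of them gets ICF = log(10), while a term present on every peer gets ICF = 0 and contributes nothing to the score.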
4.4 Score normalization with CORI
The CORI method for result merging was proposed in [CLC95] and
extensively tested and improved in [Cal00, LCC00, SJCO02, LC04b, SC03].
It is a de-facto standard for the result merging problem. This approach
heuristically combines the local scores with the server rank obtained during
the server selection step. In our general analysis, it represents, as the most
effective one, all other heuristic approaches of this kind. The normalized
score suitable for merging is calculated in several steps.

Assume a query q of m terms is posed against database C_i. First, the
database selection step is performed and the rank r_{ik} of each database
for one term is computed as follows [Cal00]:

r_{ik} = b + (1 - b) \cdot T \cdot I    (4.7)

T = \frac{DF_i}{DF_i + 50 + 150 \cdot cw_i / \overline{cw}}    (4.8)

I = \frac{\log \left( \frac{|C| + 0.5}{CF} \right)}{\log(|C| + 1.0)}    (4.9)

Where:
cw_i — the vocabulary size of C_i;
\overline{cw} — the average vocabulary size over all C_i;
b, the “default belief” — a heuristically set constant, usually 0.4.
The final database rank Ri is computed as follows:
R_i = \sum_{k=1}^{|q|} r_{ik}    (4.10)
After all databases are ranked and a number of them are selected for the
query execution, the local document scores on every database are computed
and preliminarily normalized:

s^{norm}_{ijk} = \frac{s^{local}_{ijk} - s^{min}_{ik}}{s^{max}_{ik} - s^{min}_{ik}}    (4.11)

s^{min}_{ik} = \min_j(TF_{ijk}) \cdot IDF_{ik}    (4.12)

s^{max}_{ik} = \max_j(TF_{ijk}) \cdot IDF_{ik}    (4.13)

Where:
s^{norm}_{ijk} — the preliminarily normalized local score;
s^{local}_{ijk} — the locally computed TF · IDF score for the k-th term in D_ij;
s^{min}_{ik} — the minimum possible term score among all D_ij in database C_i;
s^{max}_{ik} — the maximum possible term score among all D_ij in database C_i.
The preliminarily normalized scores s^{norm}_{ijk} should reduce the statistics
differences caused by the different local IDF values. However, for effective
merging the database rank is also normalized, so that low-ranked databases
still have an opportunity to contribute documents to the final ranking. With
respect to the maximum and minimum values that the algorithm can
potentially assign to a database, the rank is normalized as follows:

R^{norm}_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}    (4.14)

Where:
R_{min} — the database rank estimated with component T set to zero;
R_{max} — the database rank estimated with component T set to one.
The globally normalized score s is composed from the locally normalized
score s^{norm}_{ijk} and the normalized database rank R^{norm}_i in a
heuristic way:

s_{ijk} = \frac{s^{norm}_{ijk} + 0.4 \cdot s^{norm}_{ijk} \cdot R^{norm}_i}{1.4}    (4.15)

s = \sum_{k=1}^{|q|} s_{ijk}    (4.16)

The first version of the CORI method did not use the intermediate
normalization steps for the rank and the score; they were added later for
better accuracy. This method is superior in an uncooperative environment,
but it is also competitive in the cooperative collections case.
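The normalization pipeline of Equations 4.11 to 4.15 can be sketched as follows; this is a minimal illustration with invented names, assuming the score extremes and the rank extremes are already known.

```java
// Illustrative sketch of the CORI merging normalization: a min-max
// normalized local score (Eq. 4.11) is heuristically combined with the
// min-max normalized database rank (Eqs. 4.14 and 4.15).
public class CoriMerging {

    // Eq. 4.11 and Eq. 4.14 share the same generic min-max normalization
    public static double minMax(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    // Eq. 4.15: s_ijk = (s_norm + 0.4 * s_norm * R_norm) / 1.4
    public static double combine(double sNorm, double rNorm) {
        return (sNorm + 0.4 * sNorm * rNorm) / 1.4;
    }
}
```

A document from the top-ranked database (R^{norm} = 1) keeps its normalized score unchanged, while one from the lowest-ranked database (R^{norm} = 0) is damped by the factor 1/1.4.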
4.5 Score normalization with language modeling
The language modeling approach [PC98] to information retrieval estimates
the probability that a query was generated from the language model
underlying a particular document. We suppose that each document is a
sample generated from a particular language model. The relevance of a
document to a particular query is estimated by how probable it is that the
query was generated from the language model of that document. The
language modeling local scoring function on every peer is similar to
Equation 2.1 and is defined as the likelihood of a query q being generated
from a document D_ij [SJCO02]:

P(q|D_{ij}) = \prod_{k=1}^{|q|} \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|C_i) \right)    (4.17)
Where:
t_k — a query term;
P(t_k|D_ij) — the probability of the query term t_k appearing in the
document D_ij;
P(t_k|C_i) — the probability of the term t_k appearing in the collection
C_i to which document D_ij belongs;
λ — a weighting parameter between zero and one.
The role of the term P(t_k|C_i) is to smooth the probability of the
document D_ij generating the query term t_k, especially when that
probability is zero. The idea of smoothing is similar to the TF · IDF term
weighting scheme used in the vector space model, where “popular” words
are discouraged and “rare” words are emphasized by the IDF component.
Just as the IDF value is collection-dependent in TF · IDF scoring, the
P(t_k|C_i) component is collection-dependent in language modeling. We
replace C_i with a global collection model G estimated over all available
peers. The formula with P(t_k|G) looks as follows:

P(q|D_{ij}) = \prod_{k=1}^{|q|} \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G) \right)    (4.18)

P(t_k|D_{ij}) = \frac{TF_{ijk}}{|D_{ij}|}    (4.19)

P(t_k|G) = \frac{\sum_{i=1}^{|C|} \sum_{j=1}^{|D_i|} TF_{ijk}}{\sum_{i=1}^{|C|} \sum_{j=1}^{|D_i|} \sum_{l=1}^{|cw_i|} TF_{ijl}}    (4.20)
This language-modeling-based score is collection-independent and fair for
all documents on all peers. The necessary information, such as the sums of
the TF values over the document collections, is posted into a distributed
directory. A sum of term probabilities is more convenient than a
multiplication; that is why an order-preserving logarithmic transformation
is applied to P(t_k|D_ij):

s = \sum_{k=1}^{|q|} \log \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G) \right)    (4.21)
We also investigate the effect of partially available information, when only
the selected databases contribute to G′ and the scores are computed as:

s = \sum_{k=1}^{|q|} \log \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G') \right)    (4.22)

Both the full and the partial score types are used for the result merging and
are expected to be reasonably effective.
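Equation 4.21 translates directly into code; this is an illustrative sketch with invented names, assuming the per-term probabilities P(t_k|D_ij) and P(t_k|G) are precomputed from the TF sums in the directory.

```java
// Illustrative sketch of the smoothed language modeling score (Eq. 4.21):
// s = sum over query terms of log( lambda*P(t|D) + (1-lambda)*P(t|G) ).
// The global component keeps the logarithm finite even when a query term
// never occurs in the document (P(t|D) = 0).
public class LmScoring {

    public static double score(double[] pTermDoc, double[] pTermGlobal,
                               double lambda) {
        double s = 0.0;
        for (int k = 0; k < pTermDoc.length; k++)
            s += Math.log(lambda * pTermDoc[k]
                          + (1.0 - lambda) * pTermGlobal[k]);
        return s;
    }
}
```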
4.6 Score normalization with raw TF scores
For short popular queries, TF-based scoring is a simple but still
reasonably good solution:

s = \sum_{k=1}^{|q|} TF_{ijk}    (4.23)

Result merging experiments often include fusion by raw TF values as an
indicator of the importance of the IDF component for the current query. If
the importance values associated with the different query terms are all more
or less the same, this method shows a very competitive result. We have also
evaluated it in the Minerva system.
4.7 Summary
In this chapter, we provided descriptions of the result merging strategies
selected for evaluation in the Minerva system. In Section 4.1 we investigated
the properties of the available result merging methods and defined their
target values. We described the score normalization techniques: with the
global IDF values in Section 4.2, with the ICF values in Section 4.3, the
CORI merging in Section 4.4, the language modeling-based merging in
Section 4.5, and with the TF values in Section 4.6.
Chapter 5
Our approach
In this chapter we present our approach for combining the result
merging with a preference-based language model. The latter is obtained
with pseudo-relevance feedback on the best peer in the database ranking.
5.1 Result merging with the preference-based
language model
An important source of user-specific information is the user's collection
of documents. In the Minerva system, the Web pages are crawled with
respect to the user's bookmarks and are therefore assumed to reflect some of
his specific interests. We can exploit this fact by using pseudo-relevance
feedback to derive a preference-based language model from the most relevant
database. The description of our approach is presented below.

When a user poses a topic-oriented query Q, we first collect the necessary
statistics and build a peer ranking P^Q. The probability distribution for the
whole set of documents G from the peers in P^Q is estimated. Then Q is
executed on the best database in the ranking, P^Q_1. According to our query
routing strategy, this database should have more relevant documents than
any other database; therefore, it is the best choice for estimating our
preference-based language model. The concatenation of the top-n results
from the best peer represents a user-specific preference set U, from which we
estimate the preference-based language model. This model is a mixture with
the general language model and is composed of two parts:

P(t_k|U) = \lambda \cdot P_{ML}(t_k|U) + (1 - \lambda) \cdot P_{ML}(t_k|G)    (5.1)
Where:
P_{ML}(t_k|U) — the maximum likelihood estimate of term t_k in the top-n
results from P^Q_1;
P_{ML}(t_k|G) — the maximum likelihood estimate of term t_k across all
selected peers P^Q;
λ — an empirically set smoothing parameter.
Equation 5.1 is based on the Jelinek-Mercer smoothing. P_{ML}(t_k|G)
and P_{ML}(t_k|U) are defined as:

P_{ML}(t_k|G) = \frac{\sum_{i=1}^{|P^Q|} \sum_{j=1}^{|D_i|} TF_{ijk}}{\sum_{i=1}^{|P^Q|} \sum_{j=1}^{|D_i|} \sum_{l=1}^{|pw_i|} TF_{ijl}}    (5.2)

P_{ML}(t_k|U) = \frac{\sum_{j=1}^{top\text{-}n} TF_{ijk}}{\sum_{j=1}^{top\text{-}n} |D_{ij}|}    (5.3)

Where:
D_ij — a document j on peer P^Q_i;
TF_ijk — the term frequency of the term t_k in document D_ij;
pw_i — the vocabulary size on P^Q_i.
When both the P_{ML}(t_k|U) and P_{ML}(t_k|G) components are obtained,
we apply an adapted version of the EM algorithm from [TZ04] to compute
P(t_k|U). Pavel Serdyukov implemented this algorithm for the Minerva
project [Ser05]. The probabilities P(Q|G) and P(Q|U) and the query Q are
sent to every peer P^Q_i in the ranking. We compute the similarity scores
for the result merging in three steps. First, the globally normalized
similarity score s^{LMgn} is computed with Equation 5.4. Then, the
preference-based similarity scores are computed with the cross-entropy
function (Equation 5.5). The documents with a higher dissimilarity between
the preference-based and the document language models get a lower score.
Both the s^{LMgn} and s^{LMpb} scores are combined as in Equation 5.6
with the empirically set parameter β, which lies in the interval from zero
to one.
Algorithm for our approach
1. query Q is posed
2. statistics S^Q for the terms in Q is collected from the peers
3. a peer ranking P^Q is created for Q
4. the probability P(Q|G) is estimated for the whole set of documents G
on the peers in P^Q
5. Q is executed on P^Q_1, the top-n result documents are concatenated
into a set U, and the probability P(Q|U) is estimated
6. Q with P(Q|G) and P(Q|U) is propagated to every P^Q_i
7. for each document D_ij on each P^Q_i:
7.1 the globally normalized similarity score s^{LMgn}_k is computed:

s^{LMgn}_k = \log \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G) \right)    (5.4)

7.2 the preference-based similarity score s^{LMpb}_k is computed:

s^{LMpb}_k = -P(t_k|U) \cdot \log(P(t_k|D_{ij}))    (5.5)

7.3 both scores are combined into the result merging score s^{LMrm}:

s^{LMrm} = \sum_{k=1}^{|Q|} \left( \beta \cdot s^{LMgn}_k + (1 - \beta) \cdot s^{LMpb}_k \right)    (5.6)

8. the top-k URLs with the highest s^{LMrm} scores are returned from
each P^Q_i
9. the returned results are sorted in descending order of s^{LMrm} and the
best top-k URLs are presented to the user
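Steps 7.1 to 7.3 of the algorithm above can be sketched as follows; this is an illustrative fragment with invented names, assuming the probabilities P(t_k|D_ij), P(t_k|G), and P(t_k|U) are already available on the peer.

```java
// Illustrative sketch of steps 7.1-7.3: the globally normalized LM score
// (Eq. 5.4) and the cross-entropy preference score (Eq. 5.5) are combined
// per query term with the weight beta (Eq. 5.6).
public class PreferenceMerging {

    public static double mergeScore(double[] pDoc, double[] pGlobal,
                                    double[] pPref,
                                    double lambda, double beta) {
        double s = 0.0;
        for (int k = 0; k < pDoc.length; k++) {
            double sGn = Math.log(lambda * pDoc[k]
                                  + (1.0 - lambda) * pGlobal[k]);  // Eq. 5.4
            double sPb = -pPref[k] * Math.log(pDoc[k]);            // Eq. 5.5
            s += beta * sGn + (1.0 - beta) * sPb;                  // Eq. 5.6
        }
        return s;
    }
}
```

With beta = 1 the score reduces to the plain language modeling merging of Equation 4.21; with beta = 0 only the preference-based cross-entropy ranking remains.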
5.2 Discussion
The merging with the preference-based language model is close to the
query re-weighting technique described in [ZL01]. That approach tries
to refine the initial estimate of the query language model with additional
pseudo-relevant documents. Our approach is designed for the distributed
setup. While executing the query on one peer, we select the top-n results
for our preference-based model from another peer, which was ranked best
by the database selection algorithm. One user with a highly specialized
collection of documents implicitly helps another user to refine the final
document ranking. The estimation of the preference-based set U is performed
by analogy with the cluster-based retrieval approach from [LC04a]: the
preference set is treated as a cluster of relevant documents.
A simple intuition for using the preference-based language model is the
following. Assume that the user is mostly interested in documents written
with the same specific subset of the general language in mind as our best
top-n results. Say we are looking for specific medical information and some
peer has many documents from the medical Internet domain. After executing
the query “symptoms of diabetes” on this peer, we infer the preference-based
set of documents U from the top-n results. This model has a term
distribution that is typical for medical articles. Now we want to find the
documents that were generated with the language model of U in mind; we
treat it like a “relevance model”.
The proposed scheme for combining the result merging ranking with the
preference-based model ranking has limited potential. Both fused rankings
are correlated, since both are term-frequency based. This constraint is
typical for information retrieval in general. The most prominent gain could
come from additional independent features like PageRank or the explicit
structure of the text. The important question of the appropriate size of the
top-n set should be answered empirically.
5.3 Summary
In this chapter we presented our method for combining the result merging
with the preference-based language model. This approach exploits
pseudo-relevance feedback on the best peer in the database ranking to build
a preference document set. We provided the method's details and discussed
its properties.
Chapter 6
Implementation
In this chapter, we provide a brief description of the implementation of the
merging methods in the Minerva system. We include a short summary of
several essential classes from the result merging package.
6.1 Global statistics classes
The Minerva system and all merging methods in it are implemented in
Java 1.4.2. The document databases associated with the peers are maintained
on Oracle 9.2.0. In Figure 6.1, we present a diagram of the main
implemented classes.
For collecting statistics across many peers, three main classes are used:
• RMICFscoring;
• RMGIDFscoring;
• RMGLMscoring.
The RMICFscoring class constructor takes SimpleGlobalQueryProcessor and
RMQuery input objects and computes the ICF value for each query term.
The calculated quantities are put into the termICF hash and are accessible
with the getTermICF(String term) method. In the same manner, the
RMGIDFscoring and RMGLMscoring classes produce the global GIDF
values and the global language model, respectively.
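The precompute-then-look-up pattern of these classes can be sketched as follows. This is a hypothetical, simplified fragment in the spirit of RMICFscoring, not the actual Minerva code: the real class takes query-processor objects, while here the collection frequencies are passed in directly, and the class name and default value are our own assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a term-statistics class: the constructor
// precomputes one value per query term, and a getter exposes it,
// mirroring the termICF hash / getTermICF(String) design.
public class TermIcfCache {
    private final Map<String, Double> termICF = new HashMap<>();

    // collectionFreq maps each query term to the number of collections
    // containing it; numCollections is the total number of peers.
    public TermIcfCache(Map<String, Integer> collectionFreq,
                        int numCollections) {
        for (Map.Entry<String, Integer> e : collectionFreq.entrySet())
            termICF.put(e.getKey(),
                        Math.log((double) numCollections / e.getValue()));
    }

    public double getTermICF(String term) {
        // terms absent from the directory get a neutral value of 0.0
        return termICF.getOrDefault(term, 0.0);
    }
}
```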
Figure 6.1: Main classes involved in merging
Three objects of the classes described above are wrapped into one
RMGlobalStatistics object. The RMQuery class represents the query, and the
RMGlobalStatistics object is placed inside it. Therefore, all global statistics
required for the query execution are propagated with the query. When the
RMTopKLocalQueryProcessor executes the query, the RMSwitcherTFIDF
class is invoked to re-weight the result scores. From RMQuery, the switcher
takes the name of the fusion method and the necessary global statistics, and
it returns the new scores.
The experiments with the limited global statistics are performed with
the RMGIDF10peersScoring and RMGLM10peersScoring classes. For the
experiments with the top-k language model, we reused the RMGLMscoring
class.
6.2 Testing components
A simplified view of the general components of the experiments
implementation is shown in Figure 6.2. We skip the description of many
native classes of the Minerva project as unimportant for the merging problem
and mention only several specific classes that are helpful for a general
overview of our experiments. The selected classes of the three main
components are highlighted with colored borders. The order of execution is
the following:

Start peers (Green) → Test methods (Blue) → Evaluate results (Red).

The “Green box” is executed from the RMDotGov50peersStart class. It
starts 50 peers; each of them is represented by a Peer object and associated
with an instance of the RMTopKLocalQueryProcessor class. The latter is
connected to the Oracle database server. The SocketBasedPeerDescriptor
object allows communicating with the peer through the network. Every
query received by a peer is executed with the local query processor, and the
returned results are wrapped into a QueryResult object.
Figure 6.2: A general view of the experiments implementation

Next, the “Blue box” is executed from the RMDotGovExperiments50peers
class. The RMPeersRanking class is intended to build the database rankings
RANDOM, CORI, and IDEAL. The first one is constructed inside the
RMPeersRanking object, while the two others are obtained with the
RMCORI2 and RMIdealRankingReader objects. The queries are wrapped
into RMQuery objects. They are read from files with RMQueries, and the
necessary global statistics are added with the classes from Figure 6.1. The
SocketBasedPeerDescriptor objects are created for communication with the
already running peers from the “Green box”. The query results from
different peers are merged into a QueryResultList object.
The last component is the “Red box”. It is started from the
RMDotGovRelevanceEvaluation10of50peers class. Its goal is to compute the
precision and recall measures for the merged lists. The query results are
taken from the text files produced by the QueryResultList class in the
“Blue” component.
Chapter 7
Experiments
This chapter contains a detailed description of our experiments with
the result merging strategies selected for evaluation in the Minerva system.
In Section 7.1, the system configuration and the dataset parameters are
described. Section 7.2 contains the experiments with the existing result
merging methods, followed by a discussion of the results. In Section 7.3, we
present the results of the experiments with our approach with a
preference-based language model.
7.1 Experimental setup
7.1.1 Collections and queries
The previous experiments provided different pros and cons for the result
merging algorithms. We want to examine the existing methods once again,
because the following combination of features was not tested in the known
experiments:
• Minerva works with real Web data;
• There is an overlap between the documents on different peers;
• Collections are topically organized;
• Database selection algorithm is executed before the result merging step;
• Queries are topically oriented.
We conducted new experiments with 50 databases, which were created
from the TREC 2002, 2003, and 2004 Web Track datasets from the “.GOV”
domain. For these three volumes, four topics were selected. The relevant
documents from each topic were taken as a training set for the classification
algorithm, and 50 collections were created. The non-classified documents
were randomly distributed among all databases. Each classified document
was assigned to two collections of the same topic. For example, for the
topic “American music” we have a subset of 15 small collections, and every
classified document is replicated twice within it. The topics with the numbers
of corresponding collections are summarized in Table 7.1; each collection
is placed on one peer.
N Topic Number of collections
1 Health and medicine 15
2 Nature and ecology 10
3 Historic preservation 10
4 American music 15
Table 7.1: Topic-oriented experimental collections
Assuming that the search is topic-oriented, we selected a set of 25 out of
the 100 title queries from the topic distillation tasks of the TREC 2002 and
2003 Web Tracks. We used the relevance judgements that are available on
the NIST site (http://trec.nist.gov). The queries were selected with respect
to two requirements:
• at least 10 relevant documents exist;
• query is related to the “Health and Medicine” or “Nature and Ecology”
topics.
The full table of the selected queries is presented in Appendix A.
The database selection algorithm chooses 10 peers out of 50 for each
query. To simulate the merging retrieval algorithm, we obtain 500 documents
on every peer, with local scores computed as:

s^{local} = \sum_{k=1}^{|Q|} \frac{TF_k}{|D_{ij}|} \cdot \log \frac{|C|}{DF_{ik}}    (7.1)
Then we recompute the scores of these documents with the current merging
method. The top-30 documents with the best merging scores are sent to
the peer that issued the query. Then all 10 sets of 30 documents are combined
into one ranking in descending order of their similarity scores. The top-30
documents from this ranking are evaluated.
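The local score of Equation 7.1 can be sketched as follows; the names are illustrative, and the peer is assumed to know its collection size |C| and the per-term DF values.

```java
// Illustrative sketch of the local retrieval score (Eq. 7.1):
// s_local = sum over query terms of (TF_k / |D_ij|) * log(|C| / DF_ik),
// i.e. a length-normalized TF weighted by a local IDF.
public class LocalScoring {

    public static double localScore(double[] tf, double docLength,
                                    double[] df, double numDocs) {
        double s = 0.0;
        for (int k = 0; k < tf.length; k++)
            s += (tf[k] / docLength) * Math.log(numDocs / df[k]);
        return s;
    }
}
```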
7.1.2 Database selection algorithm
The database selection step adds an important dimension to the result
merging experiments. Only the 10 selected databases participate in a query
execution; therefore, the effectiveness of the query routing algorithm
influences the quality of the result. CORI merging explicitly uses the results
of the database selection step for the merging. We evaluated the result
merging methods under the following database rankings:
methods under the following database rankings:
• RANDOM;
• CORI;
• IDEAL.
RANDOM ranking is the “weakest” algorithm; it just randomly selects
10 peers without using any information about their collection statistics. It
is the lower bound for the effectiveness of the database selection algorithm.
The CORI database selection algorithm [Cal00] is described in Chapter 4.
The IDEAL ranking is a manually created ranking, where the collections
are sorted in descending order of the number of documents relevant to the
query.
7.1.3 Evaluation metrics
For the evaluation, we utilized the framework from [SJCO02]. For all tested
algorithms, the average precision measure is computed over the 25 queries at
the levels of the top-5, 10, 15, 20, 25, and 30 documents. For example, using
the relevance judgements for the topic distillation tasks of the TREC 2002
and 2003 Web Tracks, we compute the precision at the level of the top-5
documents separately for each query. Then we average the precision value
over all 25 queries. Most users of Web search engines do not look beyond
the first 10 or 20 results; therefore, the difference in the effectiveness of the
algorithms after the top-30 results is not significant. When we compute
the precision at these fixed levels, the micro- and macro-average precision
measures are equal.
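The macro-averaged precision at a fixed cutoff can be sketched as follows; the names are illustrative, and each query's merged result list is reduced to a boolean relevance vector derived from the TREC judgements.

```java
// Illustrative sketch of macro-averaged precision at cutoff k:
// per-query precision@k is computed first, then averaged with equal
// weight per query (macro-averaging).
public class MacroPrecision {

    // relevant[r] is true if the document at rank r (0-based) is relevant
    public static double precisionAtK(boolean[] relevant, int k) {
        int hits = 0;
        for (int r = 0; r < k && r < relevant.length; r++)
            if (relevant[r]) hits++;
        return (double) hits / k;
    }

    public static double macroAverage(boolean[][] perQuery, int k) {
        double sum = 0.0;
        for (boolean[] q : perQuery) sum += precisionAtK(q, k);
        return sum / perQuery.length;
    }
}
```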
We also exploited another baseline: the effectiveness on the single
database. The single database contains all the documents from the 50 peers
and uses two retrieval algorithms:
• TF · IDF
• Language modeling
We included the two term weighting functions because the merging with
language modeling should be compared with a language modeling baseline.
This is fair since, in general, language modeling is more effective than the
TF · IDF-based term weighting schemes. For both baselines, the
collection-dependent components IDF and P(t_k|C) are computed on the
single database. We will use the notion of the single database in the sense
of a “united database” in the rest of the thesis.
7.2 Experiments with selected result merging methods
7.2.1 Result merging methods
For the experiments, we used the six methods described in Chapter 4.
As a lower bound, we took the merging by local TF · IDF values, which
is expected to be the most ineffective merging algorithm. We also used
the two single-database retrieval algorithms as an upper bound. The
language modeling scores for the result merging were tested with different
values of the λ parameter. It was found that the value 0.4 gives the most
stable results; however, all other values are almost equally effective. Instead
of the simple term frequency component TF, in all methods we used the
normalized term frequency TF^{norm}:

TF^{norm}_{ijk} = \frac{TF_{ijk}}{|D_{ij}|}    (7.2)

Where:
TF_ijk — the term frequency, i.e., the number of occurrences of a term in
the document;
|D_ij| — the document length in terms.
To keep the notation simple, we continue using TF in the text instead
of TF^{norm}. The CORI method was tested in two variations: with an
additional normalization by the maximum TF value and without it. The
second variant was consistently better, so we kept only that one. Finally,
the result merging methods and baselines were coded as follows:
• TF — merging by raw TF scores;
• TFIDF — lower bound, merging by TF · IDF scores with local IDF ;
• TFGIDF — merging by TF ·GIDF scores with global GIDF ;
• TFICF — merging by TF · ICF scores;
• CORI — merging by CORI method;
• LM04 — merging with global language model and λ = 0.4;
• SingleTFIDF — single database baseline with the TF · IDF scores;
• SingleLM — single database baseline with single collection language
model.
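The per-term scores behind these codes can be sketched as follows. This is an illustrative sketch, not the thesis implementation (which was written in Java with Oracle); all function names and signatures are hypothetical, and the collection statistics are passed in as plain numbers.

```python
import math

def tf_norm(tf: int, doc_len: int) -> float:
    """Normalized term frequency TFnorm = TF / |D| (Equation 7.2)."""
    return tf / doc_len

def score_tf(tf, doc_len):
    # TF: merging by raw (length-normalized) term frequency scores.
    return tf_norm(tf, doc_len)

def score_tf_idf(tf, doc_len, n_docs_local, df_local):
    # TFIDF: IDF computed locally from the peer's own collection.
    return tf_norm(tf, doc_len) * math.log(n_docs_local / df_local)

def score_tf_icf(tf, doc_len, n_peers, cf):
    # TFICF: inverse collection frequency; cf = number of peers whose
    # collection contains the term.
    return tf_norm(tf, doc_len) * math.log(n_peers / cf)

def score_lm(tf, doc_len, p_global, lam=0.4):
    # LM04: document model smoothed with the global language model
    # (Equation 7.3 below), with lambda = 0.4.
    return math.log(lam * tf_norm(tf, doc_len) + (1 - lam) * p_global)
```

A document's merging score is then the sum of its per-term scores over the query terms.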
7.2.2 Merging results
On Figures 7.1-7.6 we summarize the results from the result merging exper-
iments with six methods and two baselines over three ranking algorithms.
For each ranking algorithm, we placed the average precision and recall plots.
Both measures are macro-averaged.
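Macro-averaging computes precision and recall at a cutoff k per query and then averages over queries. A minimal sketch of this evaluation measure (illustrative names, hypothetical document ids):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of all relevant documents found in the top-k.
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def macro_average(per_query_values):
    # Average the per-query values with equal weight for every query.
    return sum(per_query_values) / len(per_query_values)

# Example with two queries and cutoff k = 5.
runs = [(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}),
        (["d6", "d7", "d8", "d9", "d0"], {"d7"})]
p5 = macro_average([precision_at_k(r, rel, 5) for r, rel in runs])
```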
Figures 7.1 and 7.2 present the results with the RANDOM ranking algo-
rithm. The performance of all merging algorithms is similar and signif-
icantly worse than that of the single database algorithms. The degradation
in comparison with the baseline is expected: since databases are chosen
randomly, many relevant documents are excluded from merging at the
database selection step. The LM04 method is significantly better than the
other result merging methods in the top-5...15 interval. Surprisingly,
TFIDF shows a very competitive result; it is slightly better than the other
merging strategies in the top-15...30 interval. The explanation is that the
database statistics are not as highly skewed as assumed.
The next experiment, with the CORI database ranking, is summarized in
Figures 7.3 and 7.4. This ranking is more realistic, and comparable
performance is expected from the final database selection algorithm of the
Minerva system. All result merging methods do better than with the RAN-
DOM algorithm. The TFICF strategy is the least effective. Since the query
terms are quite popular, many of them occur on every peer, which indicates
that the ICF measure of term importance is too coarse. Another possible
reason is that such an approximation does not work with relatively small
peer lists, since we have only 50 peers in the system. The TFGIDF method
works worse than we expected and does not even outperform the local
TFIDF scores. It seems that the fair GIDF values, which are "averaged"
over all databases, are more influenced by noise, while the local IDF values
are more topic-specific. For example, the TFIDF scheme works even better
than the single database at the very top of the results. The CORI merging
method shows a mediocre result.
The database selection algorithm plays the role of a variance reduction
technique: it eliminates from the search a large portion of the non-relevant
documents, namely those in the databases that were not selected. That is
why some methods perform better than the two single collection baselines.
Another important observation is that TF is one of the two best
strategies. The normalized TF value, divided by the document length, is
equal to the maximum likelihood estimate of P (tk|D). In other words,
ranking by the TF scores is equal to language modeling when λ
equals one. This is evidence that, for our testbed with the CORI database
ranking, smoothing by the general language model is only slightly effective.

              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.208   0.176   0.160   0.158   0.142   0.135
TF            0.192   0.132   0.125   0.112   0.101   0.093
TFIDF         0.160   0.140   0.131   0.122   0.110   0.099
TFGIDF        0.168   0.120   0.131   0.118   0.107   0.099
TFICF         0.168   0.120   0.123   0.110   0.106   0.096
CORI          0.160   0.136   0.128   0.122   0.107   0.097
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.224   0.160   0.139   0.114   0.106   0.099

Figure 7.1: The macro-average precision with the database ranking RANDOM

              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.048   0.079   0.102   0.133   0.145   0.161
TF            0.040   0.053   0.075   0.089   0.098   0.106
TFIDF         0.032   0.057   0.074   0.092   0.103   0.108
TFGIDF        0.035   0.048   0.077   0.091   0.103   0.108
TFICF         0.036   0.049   0.068   0.079   0.098   0.104
CORI          0.032   0.057   0.074   0.092   0.101   0.107
SingleLM      0.047   0.087   0.117   0.132   0.149   0.168
LM04          0.049   0.066   0.080   0.085   0.101   0.109

Figure 7.2: The macro-average recall with the database ranking RANDOM
The LM04 method is the second best strategy. It has almost the same
absolute effectiveness as the TF method, but since we compare LM04 with
the stronger SingleLM baseline, its relative effectiveness in Table 7.3 is
lower. Continuing the comparison of TF and LM04, we see that the benefit
of smoothing is reduced by the database selection step. We suggest that
λ should be tuned separately for every database ranking algorithm.
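The equivalence noted above, that ranking by normalized TF coincides with language modeling at λ = 1, can be checked numerically. A small sketch with made-up probabilities (not thesis code): with λ = 1 the LM04 score degenerates to log P (tk|D), a monotone transform of the normalized TF, so both produce the same document order.

```python
import math

def lm_score(p_doc, p_global, lam):
    # LM04-style smoothed score for a single term (Equation 7.3).
    return math.log(lam * p_doc + (1 - lam) * p_global)

docs = {"d1": 0.030, "d2": 0.010, "d3": 0.022}   # P(t|D) = TF/|D| per document
p_global = 0.005

rank_by_tf = sorted(docs, key=docs.get, reverse=True)
rank_by_lm1 = sorted(docs, key=lambda d: lm_score(docs[d], p_global, 1.0),
                     reverse=True)
assert rank_by_tf == rank_by_lm1   # identical ordering when lambda = 1
```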
The third experiment was carried out with the manually created IDEAL
database ranking. The results are presented in Figures 7.5 and 7.6.
It is hard to achieve such an accurate automatic database ranking, one
that ranks all databases in decreasing order of the number of relevant docu-
ments. We used the information about the number of relevant documents in
the databases from the TREC 2002 and 2003 Web Track topic distillation
relevance judgements and built the IDEAL rank in a semi-automatic manner
for every query.
There is no absolute winner here; both the TF and TFGIDF methods
perform worse than the TFIDF and CORI methods. For TF , the explanation
is that when all databases have a comparable number of relevant documents,
the difference in the IDF values becomes more important. The GIDF values
are "smoothed" too much and reflect term importance only in a very general
sense.
On the one hand, the local IDF values are computed over a reasonably
large number of documents inside a collection, so they are not too "over-
fitted". On the other hand, they correspond to a specific situation in which
the collection is topically oriented. When all 10 selected collections are close
to each other under the IDEAL ranking, the local IDF values are both
comparable and topic-specific. Therefore, the TFIDF method turns out to
be a good one with the IDEAL ranking, and the CORI merging method shows
almost the same effectiveness. The LM04 method is again the best in the
top-5...10 interval and quite good in the remaining categories.
The main observations so far:
• All result merging methods are quite close to each other;
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.208   0.176   0.160   0.158   0.142   0.135
TF            0.264   0.196   0.173   0.160   0.150   0.136
TFIDF         0.248   0.184   0.157   0.146   0.138   0.129
TFGIDF        0.224   0.176   0.152   0.144   0.136   0.133
TFICF         0.208   0.172   0.152   0.146   0.138   0.133
CORI          0.240   0.176   0.157   0.146   0.138   0.129
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.272   0.196   0.173   0.158   0.149   0.135

Figure 7.3: The macro-average precision with the database ranking CORI
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.048   0.079   0.102   0.133   0.145   0.161
TF            0.056   0.085   0.106   0.125   0.147   0.158
TFIDF         0.055   0.081   0.101   0.120   0.136   0.149
TFGIDF        0.049   0.078   0.096   0.118   0.138   0.160
TFICF         0.044   0.074   0.099   0.120   0.136   0.159
CORI          0.054   0.078   0.101   0.120   0.137   0.149
SingleLM      0.047   0.087   0.117   0.132   0.149   0.168
LM04          0.057   0.085   0.106   0.123   0.145   0.157

Figure 7.4: The macro-average recall with the database ranking CORI
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.208   0.176   0.160   0.158   0.142   0.135
TF            0.264   0.184   0.155   0.142   0.125   0.120
TFIDF         0.248   0.204   0.192   0.180   0.165   0.141
TFGIDF        0.240   0.204   0.163   0.164   0.150   0.141
TFICF         0.224   0.188   0.168   0.166   0.150   0.144
CORI          0.240   0.204   0.189   0.178   0.163   0.144
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.264   0.220   0.184   0.168   0.157   0.145

Figure 7.5: The macro-average precision with the database ranking IDEAL
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.048   0.079   0.102   0.133   0.145   0.161
TF            0.049   0.074   0.091   0.114   0.124   0.138
TFIDF         0.054   0.084   0.118   0.144   0.161   0.167
TFGIDF        0.052   0.085   0.101   0.131   0.151   0.170
TFICF         0.047   0.082   0.106   0.132   0.150   0.169
CORI          0.053   0.084   0.117   0.143   0.160   0.170
SingleLM      0.047   0.087   0.117   0.132   0.149   0.168
LM04          0.055   0.091   0.116   0.131   0.151   0.166

Figure 7.6: The macro-average recall with the database ranking IDEAL
              top-5   top-10  top-15  top-20  top-25  top-30
LM04-RANDOM   0.224   0.160   0.139   0.114   0.106   0.099
LM04-CORI     0.272   0.196   0.173   0.158   0.149   0.135
LM04-IDEAL    0.264   0.220   0.184   0.168   0.157   0.145
SingleDBLM    0.224   0.200   0.181   0.158   0.149   0.141

Figure 7.7: The macro-average precision of the LM04 result merging method
with the different database rankings
• The LM04 method shows the best performance and is robust under every
ranking;
• The TFICF method does not work well;
• Surprisingly, the TFIDF method is more effective than the TFGIDF
technique;
• The database selection has a significant influence on merging; a good
database ranking allows merging to outperform the single database
baseline (see Figure 7.7).
Tables 7.2-7.4 contain the differences in average precision in percent,
computed between each method and the corresponding single database
algorithm. For the LM04 technique, the baseline is SingleLM; for all
others, the baseline is the SingleTFIDF method.
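The relative differences in these tables can be computed as a simple percentage residual against the baseline. A minimal sketch (illustrative function name, values taken from Table 7.2):

```python
def relative_difference(method_ap: float, baseline_ap: float) -> float:
    # Percent difference of a method's average precision against its
    # single database baseline.
    return 100.0 * (method_ap - baseline_ap) / baseline_ap

# Example from Table 7.2 (RANDOM ranking, top-5): TF achieves 0.192
# against the SingleTFIDF baseline of 0.208, giving roughly -7.69%.
diff = relative_difference(0.192, 0.208)
```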
It is not clear why the local IDF -based methods are relatively good in
different setups while the GIDF -based result merging methods are not so
effective. The only observation we can make is that the simplest IDF
computation, without additional tuning, gives a very unreliable model. It is
possible to enhance it with additional heuristics, but the simple version
behaves very unstably. Another fact is that the language models are both
effective and robust; they outperform all other tested algorithms.
TOP   TF        TFIDF     TFGIDF    TFICF     CORI      LM04
5     -7.69%    -23.08%   -19.23%   -19.23%   -23.08%   0.00%
10    -25.00%   -20.45%   -31.82%   -31.82%   -22.73%   -20.00%
15    -21.67%   -18.33%   -18.33%   -23.33%   -20.00%   -23.53%
20    -29.11%   -22.78%   -25.32%   -30.38%   -22.78%   -27.85%
25    -29.21%   -22.47%   -24.72%   -25.84%   -24.72%   -29.03%
30    -30.69%   -26.73%   -26.73%   -28.71%   -27.72%   -30.19%

Table 7.2: The difference in percent of the average precision between the
result merging strategies and corresponding baselines with the RANDOM
ranking. The LM04 technique is compared with the SingleLM method; all
others are compared with the SingleTFIDF approach

TOP   TF        TFIDF     TFGIDF    TFICF     CORI      LM04
5     +26.92%   +19.23%   +7.69%    0.00%     +15.38%   +21.43%
10    +11.36%   +4.55%    0.00%     -2.27%    0.00%     -2.00%
15    +8.33%    -1.67%    -5.00%    -5.00%    -1.67%    -4.41%
20    +1.27%    -7.59%    -8.86%    -7.59%    -7.59%    0.00%
25    +5.62%    -3.37%    -4.49%    -3.37%    -3.37%    0.00%
30    +0.99%    -3.96%    -0.99%    -0.99%    -3.96%    -4.72%

Table 7.3: The difference in percent of the average precision between the
result merging strategies and corresponding baselines with the CORI ranking.
The LM04 technique is compared with the SingleLM method; all others are
compared with the SingleTFIDF approach

TOP   TF        TFIDF     TFGIDF    TFICF     CORI      LM04
5     +26.92%   +19.23%   +15.38%   +7.69%    +15.38%   +17.86%
10    +4.55%    +15.91%   +15.91%   +6.82%    +15.91%   +10.00%
15    -3.33%    +20.00%   +1.67%    +5.00%    +18.33%   +1.47%
20    -10.13%   +13.92%   +3.80%    +5.06%    +12.66%   +6.33%
25    -12.36%   +15.73%   +5.62%    +5.62%    +14.61%   +5.38%
30    -10.89%   +4.95%    +4.95%    +6.93%    +6.93%    +2.83%

Table 7.4: The difference in percent of the average precision between the re-
sult merging strategies and corresponding baselines with the IDEAL ranking.
The LM04 technique is compared with the SingleLM method; all others are
compared with the SingleTFIDF approach
7.2.3 Effect of limited statistics on the result merging
In practice, it is inefficient to collect the full statistics for the global language
model or the GIDF value from thousands of peers. Instead, we can use the
limited statistics from the 10 selected databases that participate in merging.
Figures 7.8 and 7.9 present the results of the experiments with limited
statistics. We tested the LM04 method as the most effective one. The
10LM04 method is a variation of LM04 that uses only the statistics from
the 10 merged databases. We did not use the RANDOM ranking in the
remaining experiments, since they make sense only for a reasonably good
database selection algorithm.
With the CORI database selection algorithm, LM04 performs better with
the general language model estimated over all peers. However, with the
IDEAL ranking, the results of the 10LM04 technique are almost equal to
those of the LM04 method. Our conclusion is that, given an effective
database ranking algorithm, we can merge results using only the statistics
from the databases involved in merging. This gives us a merging method
that is both scalable and effective.
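The difference between the two variants amounts to which peers contribute to the collection-wide estimate. A sketch under the assumption that each peer exposes a pair (occurrences of the term, total terms in its collection); names and the statistics format are hypothetical:

```python
def global_model(peer_stats, selected=None):
    """Estimate P(t|G) from per-peer (term_count, total_count) pairs.

    With selected=None the estimate uses every peer (as in LM04);
    with a list of peer indices it uses only the merged databases
    (as in 10LM04).
    """
    peers = peer_stats if selected is None else [peer_stats[i] for i in selected]
    term_total = sum(tc for tc, _ in peers)
    all_total = sum(n for _, n in peers)
    return term_total / all_total

stats = [(40, 10_000), (5, 8_000), (90, 20_000), (2, 5_000)]
p_all = global_model(stats)                   # statistics from every peer
p_sel = global_model(stats, selected=[0, 2])  # only the merged databases
```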
              top-5   top-10  top-15  top-20  top-25  top-30
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.272   0.196   0.173   0.158   0.149   0.135
10LM04        0.248   0.180   0.155   0.144   0.131   0.124

Figure 7.8: The macro-average precision with the database ranking CORI
with the global statistics collected over the 10 selected databases
              top-5   top-10  top-15  top-20  top-25  top-30
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.264   0.220   0.184   0.168   0.157   0.145
10LM04        0.256   0.216   0.181   0.168   0.157   0.147

Figure 7.9: The macro-average precision with the database ranking IDEAL
with the global statistics collected over the 10 selected databases
7.3 Experiments with our approach
In the second series of experiments, we evaluated our technique. We tested
the best result merging method LM04 in combination with the ranking from
the preference-based language model. The detailed description of our ap-
proach is provided in Chapter 5. Here we repeat the main equations for the
merging score computation. The globally normalized similarity score sLMgn_k
in the method LM04 is computed as:

sLMgn_k = log(λ · P (tk|Dij) + (1 − λ) · P (tk|G))     (7.3)

The preference-based similarity score sLMpb_k is computed as:

sLMpb_k = −P (tk|U) · log(P (tk|Dij))     (7.4)

Finally, both scores are combined into the result merging score sLMrm:

sLMrm = Σ(k=1..|Q|) [β · sLMgn_k + (1 − β) · sLMpb_k]     (7.5)
The influence of two main parameters was investigated:
• n — the number of top documents from which the preference model
U is composed;
• β — the smoothing parameter between the sLMgn_k and sLMpb_k scores.
The value of β is explicitly included in the last two digits of the method's
name. For example, LM04PB02 is the combination of the LM04 score
and the preference-based score with β = 0.2. Notice that the codes 04 and
02 in LM04PB02 refer to different parameters. The first is λ, the smoothing
parameter between the document and global language models in the language
modeling result merging method LM04. The second is β, which defines the
trade-off between the combined scores in our approach. If we substitute both
combined scores into the formula for the final merging score, we obtain:
sLMrm = Σ(k=1..|Q|) [β · log(λ · P (tk|Dij) + (1 − λ) · P (tk|G)) − (1 − β) · P (tk|U) · log(P (tk|Dij))]     (7.6)
When β = 0, the method reduces to ranking by the cross-entropy between
the preference-based and document language models; we call it PB. When
β = 1, we obtain a pure LM04 ranking. For the retrieval of the pseudo-
relevant top-n documents, we used the TFIDF retrieval algorithm.
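The combined score of Equations 7.3-7.5 can be sketched as follows. This is an illustration, not the thesis implementation: the per-term probabilities are passed in directly, whereas in the real system they would come from the peers' statistics.

```python
import math

def combined_score(query_terms, p_doc, p_global, p_pref, lam=0.4, beta=0.6):
    """sLMrm = sum over query terms of beta*sLMgn_k + (1-beta)*sLMpb_k."""
    total = 0.0
    for t in query_terms:
        s_gn = math.log(lam * p_doc[t] + (1 - lam) * p_global[t])  # (7.3)
        s_pb = -p_pref[t] * math.log(p_doc[t])                     # (7.4)
        total += beta * s_gn + (1 - beta) * s_pb                   # (7.5)
    return total
```

As noted above, beta = 1 reduces the score to pure LM04 and beta = 0 to the cross-entropy ranking PB.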
7.3.1 Optimal size of the top-n
First, we conducted experiments with the separate PB ranking in order to
find the optimal n for estimating our preference-based model. A reasonable
assumption is that n should lie in the [5...30] interval. The lower bound
was set to avoid overfitting, and the upper bound was set with respect to
the average number of relevant documents in the databases. Figures 7.10 and
7.11 present the results of the experiments with different n for the
preference-based language model estimation.
The large variance in the results is in accord with our expectations. So far,
there is no method that guarantees an accurate estimate of the best n for
every database ranking. The number of relevant documents in the top
is crucial for tuning n. The best database for the CORI ranking gives us
the best average precision for the model with n = 30, while for the IDEAL
ranking the best choice is n = 10.
We concluded that in the Minerva system the appropriate choice for the
preference-based model estimation is n = 10. It shows the best performance
with the IDEAL ranking and reasonably good performance with the CORI
ranking as well. For some queries we have databases with more than 10
relevant documents, but for others we have fewer than three relevant
documents in the best database. Therefore, it is dangerous to take a large n,
since we can introduce many irrelevant documents into the preference-based
language model estimation.
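The preference model itself can be estimated as a maximum likelihood model over the concatenated text of the top-n pseudo-relevant documents from the best-ranked peer. A minimal sketch with illustrative names and toy documents (the thesis settled on n = 10):

```python
from collections import Counter

def preference_model(top_docs, n=10):
    """Estimate P(t|U) from the top-n pseudo-relevant documents.

    top_docs is a ranked list of documents, each given as a list of
    (stemmed) terms; the model is the term frequency distribution over
    the first n of them.
    """
    counts = Counter()
    for doc_terms in top_docs[:n]:
        counts.update(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

docs = [["whale", "dolphin", "protect"], ["whale", "ocean"]]
p_u = preference_model(docs, n=10)
```

A "lucky" top-n concentrates the probability mass on relevant terms; an "unlucky" one dilutes it, which is the instability discussed in Section 7.3.2.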
7.3.2 Optimal smoothing parameter β
After fixing the n parameter for the preference-based language model es-
timation, we conducted experiments with different values of the β parameter:
β = 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99. The best
combination was obtained with β = 0.6. Figures 7.12 and 7.13 present the
results for the LM04PB06 method and show the separate
              top-5   top-10  top-15  top-20  top-25  top-30
PBn5          0.288   0.196   0.171   0.154   0.141   0.129
PBn10         0.272   0.196   0.173   0.158   0.149   0.135
PBn15         0.232   0.168   0.149   0.138   0.128   0.117
PBn20         0.272   0.200   0.173   0.154   0.146   0.135
PBn25         0.272   0.196   0.173   0.158   0.149   0.135
PBn30         0.288   0.200   0.179   0.158   0.144   0.131

Figure 7.10: The macro-average precision with the database ranking CORI
with the different size of top-n for the preference-based model estimation
              top-5   top-10  top-15  top-20  top-25  top-30
PBn5          0.256   0.184   0.157   0.142   0.131   0.119
PBn10         0.248   0.200   0.168   0.146   0.130   0.119
PBn15         0.256   0.184   0.157   0.142   0.131   0.119
PBn20         0.208   0.168   0.133   0.128   0.114   0.097
PBn25         0.248   0.180   0.155   0.146   0.131   0.119
PBn30         0.256   0.176   0.157   0.146   0.133   0.120

Figure 7.11: The macro-average precision with the database ranking IDEAL
with the different size of top-n for the preference-based model estimation
              top-5   top-10  top-15  top-20  top-25  top-30
PB            0.272   0.196   0.173   0.158   0.149   0.135
LM04PB06      0.288   0.196   0.176   0.158   0.149   0.135
LM04          0.272   0.196   0.173   0.158   0.149   0.135
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141

Figure 7.12: The macro-average precision with the database ranking CORI
with the top-10 documents for the preference-based model estimation with
β = 0.6 and LM04 result merging method
              top-5   top-10  top-15  top-20  top-25  top-30
PB            0.248   0.192   0.165   0.146   0.130   0.119
LM04PB06      0.272   0.220   0.187   0.170   0.157   0.144
LM04          0.264   0.220   0.184   0.168   0.157   0.145
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141

Figure 7.13: The macro-average precision with the database ranking IDEAL
with the top-10 documents for the preference-based model estimation with
β = 0.6 and LM04 result merging method
performance of the combined methods for comparison.
The single PB ranking, which is purely based on pseudo-relevance
feedback, shows unstable performance under the different database rank-
ings. It is effective with the CORI ranking and poor with the IDEAL ranking.
This is an inherent property of pseudo-relevance feedback: with a "lucky"
choice of the top-n documents for the model estimation it increases the per-
formance, and with an "unlucky" choice it decreases it. The performance of
the PB method depends entirely on the first database in the ranking, from
which we obtain the preference-based language model. The average precision
of LM04PB06 is slightly better than that of LM04 in the small top-5 and
top-15 categories. In the other cases, it shows the same performance.
We conclude that LM04PB06, the combination of the cross-entropy rank-
ing PB with the LM04 language model at β = 0.6, is slightly more effective
than the single LM04 method.
7.4 Summary
In this chapter, we gave a detailed description of our experiments
with the result merging strategies selected for evaluation in the Minerva
system. In Section 7.1, we provided the testbed description. The results of
testing several known result merging techniques are presented in Section 7.2.
We found that the LM04 result merging method is the most effective and
robust. In Section 7.3, we described the experimental results with the
proposed approach. We found that the combination of the pseudo-relevance
feedback based PB method with the best result merging method LM04 gives
a small improvement on some intervals and is at least as effective as the
better of the combined methods.
Chapter 8
Conclusions and future work
8.1 Conclusions
In this thesis, we investigated the effectiveness of different result merg-
ing methods for the P2P Web search engine Minerva.
We selected several merging methods that are feasible to use in a hetero-
geneous, dynamic, distributed environment. The experimental framework
for these methods was implemented with Java 1.4.2 and Oracle 9.2.0. We
carried out experiments with different database rankings and studied the
effectiveness of the result merging methods for different sizes of the top-k.
The language modeling ranking method LM04 produced the most robust
and accurate results under the different conditions.
We proposed a new result merging method that combines two types of
similarity scores. The first score type is computed with the language model-
ing merging method LM04. The second score type is the cross-entropy
between the preference-based language model and the document language
model. The novelty of our approach is that the preference-based language
model is obtained from pseudo-relevance feedback on the best peer in the
peer ranking. The combination is tuned with a heuristically set parame-
ter. In every tested setup, the new method was at least as effective as the
best of the individual merging methods or slightly better.
The main observations are the following:
• All merging algorithms are very close in absolute retrieval effectiveness;
• Language modeling methods are more effective than TF · IDF based
methods;
• The effectiveness of the database selection step influences the quality
of result merging;
• The pseudo-relevance feedback information from the topically orga-
nized collections improves the retrieval quality.
8.2 Future work
There are several ways to enhance the result merging in Minerva.
Effectiveness and efficiency are the two important dimensions for improve-
ment.
The effectiveness of the many text-statistics-based methods is similar
and does not significantly improve the final ranking. We can exploit other
sources of evidence and incorporate them into the fused document score.
Linkage-based algorithms (e.g., PageRank) can be added to the retrieval
algorithm. The problem here is how to compute such scores in a completely
distributed environment.
The efficiency mainly depends on a smart top-k result computation al-
gorithm. We can improve it by introducing additional communication
between the peers during query processing.
Bibliography
[Bau99] Christoph Baumgarten. A probabilistic solution to the selec-
tion and fusion problem in distributed information retrieval. In
SIGIR ’99: Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in information
retrieval, pages 246–253. ACM Press, 1999.
[BMWZ04] M. Bender, S. Michel, G. Weikum, and C. Zimmer. The min-
erva project: Database selection in the context of p2p search.
2004.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-
scale hypertextual Web search engine. Computer Networks and
ISDN Systems, 30(1–7):107–117, 1998.
[Cal00] J. Callan. W.B. Croft, editor, Advances in information re-
trieval,, chapter Distributed information retrieval, pages 127–
150. 2000.
[CAPMN03] Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P.
Martin, and Thu D. Nguyen. Planetp: Using gossiping to build
content addressable peer-to-peer information sharing commu-
nities. In HPDC, pages 236–249, 2003.
[CHT99] Nick Craswell, David Hawking, and Paul B. Thistlewaite.
Merging results from isolated search engines. In Australasian
Database Conference, pages 189–200, 1999.
[CLC95] J. P. Callan, Z. Lu, and W. Bruce Croft. Searching Distributed
Collections with Inference Networks . In E. A. Fox, P. Ing-
71
wersen, and R. Fidel, editors, Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval, pages 21–28, Seattle, Wash-
ington, 1995. ACM Press.
[Cra01] Nicholas Eric Craswell. Methods for Distributed Information
Retrieval. PhD thesis, January 01 2001.
[Cro00] W. Bruce Croft. Combining approaches to ir (invited talk). In
DELOS Workshop: Information Seeking, Searching and Query-
ing in Digital Libraries, 2000.
[CS00] Anne Le Calve and Jacques Savoy. Database merging strategy
based on logistic regression. Inf. Process. Manage., 36(3):341–
359, 2000.
[Dia96] Ted Diamond. Information retrieval using dynamic evidence
combination. PhD thesis, 1996.
[GCGMP97] Luis Gravano, Kevin Chen-Chuan Chang, Hector Garcia-
Molina, and Andreas Paepcke. Starts: Stanford proposal for
internet meta-searching (experience paper). In SIGMOD Con-
ference, pages 207–218, 1997.
[GGMT99] Luis Gravano, Hector Garcıa-Molina, and Anthony Tomasic.
GlOSS: text-source discovery over the Internet. ACM Trans-
actions on Database Systems, 24(2):229–264, 1999.
[GWG96] Susan Gauch, Guijun Wang, and Mario Gomez. ProFusion:
Intelligent Fusion from Multiple, Distributed Seach Engines.
Journal of Universal Computing, Springer-Verlag, 2(9), Sept.
1996.
[Kir97] S. T. Kirsch. Distributed search patent. u.s. patent 5,659,732,
1997.
[Kle99] Jon M. Kleinberg. Authoritative sources in a hyperlinked en-
vironment. Journal of the ACM, 46(5):604–632, 1999.
72
[LC04a] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval us-
ing language models. In SIGIR ’04: Proceedings of the 27th
annual international conference on Research and development
in information retrieval, pages 186–193. ACM Press, 2004.
[LC04b] Jie Lu and Jamie Callan. Merging retrieval results in hierar-
chical peer-to-peer networks. In SIGIR ’04: Proceedings of the
27th annual international conference on Research and devel-
opment in information retrieval, pages 472–473. ACM Press,
2004.
[LCC00] Leah S. Larkey, Margaret E. Connell, and James P. Callan. Col-
lection selection and results merging with topically organized
u.s. patents and trec data. In CIKM, pages 282–289, 2000.
[Lee97] Jong-Hak Lee. Analyses of multiple evidence combination. In
SIGIR, pages 267–276, 1997.
[LG98] Steve Lawrence and C. Lee Giles. Inquirus, the neci meta search
engine. In WWW7: Proceedings of the seventh international
conference on World Wide Web 7, pages 95–105. Elsevier Sci-
ence Publishers B. V., 1998.
[Mon02] Mark Montague. Metasearch: Data Fusion for Document Re-
trieval. PhD thesis, 2002.
[MYL02] Weiyi Meng, Clement T. Yu, and King-Lup Liu. Building ef-
ficient and effective metasearch engines. ACM Comput. Surv.,
34(1):48–89, 2002.
[NF03] H. Nottelmann and N. Fuhr. From retrieval status values to
probabilities of relevance for advanced IR applications. Infor-
mation Retrieval, 6(4), 2003.
[OKK02] B. Uygar Oztekin, George Karypis, and Vipin Kumar. Ex-
pert agreement and content based reranking in a meta search
environment using mearf. In WWW, pages 333–344, 2002.
73
[PC98] Jay M. Ponte and W. Bruce Croft. A language modeling ap-
proach to information retrieval. In Research and Development
in Information Retrieval, pages 275–281, 1998.
[PFC+00] Allison L. Powell, James C. French, James P. Callan, Mar-
garet E. Connell, and Charles L. Viles. The impact of database
selection on distributed searching. In SIGIR, pages 232–239,
2000.
[RAS01] Yves Rasolofo, Faiza Abbaci, and Jacques Savoy. Approaches
to collection selection and results merging for distributed infor-
mation retrieval. In CIKM, pages 191–198, 2001.
[SC03] Luo Si and Jamie Callan. A semisupervised learning method to
merge search engine results. ACM Trans. Inf. Syst., 21(4):457–
491, 2003.
[SE97] E. Selberg and O. Etzioni. The MetaCrawler architecture for
resource aggregation on the Web. IEEE Expert, (January–
February):11–14, 1997.
[Ser05] Pavel Serdyukov. Query routing in a peer-to-peer web search
engine. Master’s thesis, 2005.
[SF94] Joseph A. Shaw and Edward A. Fox. Combination of multiple
searches. In Proceedings of the 3th Text REtrieval Conference
(TREC-3), pages 105–109, 1994.
[SJCO02] Luo Si, Rong Jin, James P. Callan, and Paul Ogilvie. A lan-
guage modeling framework for resource selection and results
merging. In CIKM, pages 391–397, 2002.
[SMK+01] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and
Hari Balakrishnan. Chord: A scalable Peer-To-Peer lookup
service for internet applications. In Proceedings of the 2001
ACM SIGCOMM Conference, pages 149–160, 2001.
74
[SMW+03] Torsten Suel, Chandan Mathur, Jowen Wu, Jiangong Zhang,
Alex Delis, Mehdi Kharrazi, Xiaohui Long, and Kulesh Shan-
mugasundaram. Odissea: A peer-to-peer architecture for scal-
able web search and information retrieval. In WebDB, pages
67–72, 2003.
[SP99] Jacques Savoy and Justin Picard. Report on the trec-8 experi-
ment: Searching on the web and in distributed collections. In
Proceedings of the 8th Text REtrieval Conference (TREC-8),
1999.
[SP01] Chris Sherman and Gary Price. The invisible web: Uncovering
information sources search engines can’t see, 2001.
[SR00] Jacques Savoy and Yves Rasolofo. Report on the trec-9 ex-
periment: Link-based retrieval and distributed collections. In
Proceedings of the 9th Text REtrieval Conference (TREC-9),
2000.
[TVGJL95] Geoffrey G. Towell, Ellen M. Voorhees, Narendra Kumar
Gupta, and Ben Johnson-Laird. Learning collection FUsion
strategies for information retrieval. In International Confer-
ence on Machine Learning, pages 540–548, 1995.
[TZ04] Tao Tao and ChengXiang Zhai. A mixture clustering model for
pseudo feedback in information retrieval. 2004.
[VC99a] Christopher C. Vogt and Garrison W. Cottrell. Fusion via a
linear combination of scores. Inf. Retr., 1(3):151–173, 1999.
[VC99b] Christopher Charles Vogt and Garrison W. Cottrell. Adaptive
combination of evidence for information retrieval. PhD thesis,
1999.
[VF95] Charles L. Viles and James C. French. Dissemination of col-
lection wide information in a distributed information retrieval
system. In Proceedings of the 18th Annual International ACM
75
SIGIR Conference on Research and Development in Informa-
tion Retrieval, pages 12–20, 1995.
[VGJL94] Ellen M. Voorhees, Narendra Kumar Gupta, and Ben Johnson-
Laird. The collection fusion problem. In Text REtrieval Con-
ference, pages 0–, 1994.
[Voo95] E. Voorhees. Siemens trec-4 report: Further experiments with
database merging, 1995.
[WC02] Shengli Wu and Fabio Crestani. Data fusion with estimated
weights. In CIKM, pages 648–651, 2002.
[WCG03] Shengli Wu, Fabio Crestani, and Forbes Gibb. New methods
of results merging for distributed information retrieval. In Dis-
tributed Multimedia Information Retrieval, pages 84–100, 2003.
[WGDW03] Y. Wang, L. Galanis, and D.J. De Witt. Galanx: An efficient
peer-to-peer search engine system. 2003.
[YL97] Budi Yuwono and Dik Lun Lee. Server ranking for distributed
text retrieval systems on the internet. In Database Systems for
Advanced Applications, pages 41–50, 1997.
[YR98] Ronald R. Yager and Alexander Rybalov. On the fusion of doc-
uments from multiple collection information retrieval systems.
J. Am. Soc. Inf. Sci., 49(13):1177–1184, 1998.
[ZL01] Chengxiang Zhai and John Lafferty. Model-based feedback in
the language modeling approach to information retrieval. In
CIKM ’01: Proceedings of the tenth international conference on
Information and knowledge management, pages 403–410. ACM
Press, 2001.
Appendix A
Test queries
In Table A.1 we summarize the test queries from the topic distillation
task of the TREC 2002 and 2003 Web Track datasets (http://trec.nist.gov).
The queries were selected with respect to two requirements: each query has
at least 10 relevant documents, and each belongs to either the “Health and
Medicine” or the “Nature and Ecology” topic.
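The selection criteria above amount to a simple filter over the candidate topics. The sketch below is illustrative only: the tuple layout is an assumption, and the sample entries are taken from Table A.1 (plus one hypothetical topic that fails the filter).

```python
# Candidate topics as (TREC number, stemmed query, topic code, relevant docs).
# The first three entries are from Table A.1; the last is hypothetical.
topics = [
    ("552", "food cancer patient", "HM", 26),
    ("557", "clean air clean water", "NE", 48),
    ("TD5", "pest control safeti", "NE", 13),
    ("999", "unrelated topic", "XX", 5),   # hypothetical, gets filtered out
]

# Keep topics with at least 10 relevant documents whose subject is
# "Health and Medicine" (HM) or "Nature and Ecology" (NE).
selected = [
    (num, query)
    for (num, query, topic, relevant) in topics
    if relevant >= 10 and topic in ("HM", "NE")
]
print(selected)
```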
In Table A.2 we present the distributions of relevant documents for the
TF and LM04PB06 merging methods. The pattern is typical for the tested
merging methods: results for some specific queries are improved, while
results for others are degraded. Figures A.1 and A.2 give a graphical
interpretation of the same data. The residual in performance between the
two methods is presented in Figure A.3.
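The residual is simple per-query arithmetic: at each cutoff, the difference between the relevant-document counts of the two methods. A minimal sketch, using the counts for query 1 (“food cancer patient”) from Table A.2; the variable names are illustrative assumptions, not code from the Minerva system.

```python
# Relevant-document counts for query 1 ("food cancer patient") from
# Table A.2: TF vs. LM04PB06 under the IDEAL database ranking.
cutoffs = [5, 10, 15, 20, 25, 30]   # top-k cutoffs
tf_counts = [1, 1, 2, 3, 3, 5]      # TF method
lmpb_counts = [5, 5, 5, 6, 6, 6]    # LM04PB06 method

# Residual per cutoff: positive values mean the language-model-based
# method returned more relevant documents at that cutoff.
residual = [b - a for a, b in zip(tf_counts, lmpb_counts)]
print(dict(zip(cutoffs, residual)))  # {5: 4, 10: 4, 15: 3, 20: 3, 25: 3, 30: 1}
```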
 N | Query number | Stemmed query                  | Topic | Total number of
   | in TREC      |                                |       | relevant documents
 1 | 552          | food cancer patient            | HM    | 26
 2 | 557          | clean air clean water          | NE    | 48
 3 | 560          | symptom diabet                 | HM    | 28
 4 | 561          | erad boll weevil               | NE    | 17
 5 | 563          | smoke drink pregnanc           | HM    | 23
 6 | 564          | mother infant nutrit           | HM    | 24
 7 | 569          | invas anim plant               | NE    | 18
 8 | 574          | whale dolphin protect          | NE    | 27
 9 | 575          | nuclear wast storag transport  | NE    | 46
10 | 576          | chesapeak bay ecolog           | NE    | 14
11 | 578          | regul zoo                      | NE    | 20
12 | 584          | birth defect                   | HM    | 50
13 | 583          | florida endang speci           | NE    | 29
14 | 586          | women health cancer            | HM    | 38
15 | 589          | mental ill adolesc             | HM    | 112
16 | 594          | food prevent high cholesterol  | HM    | 27
17 | TD5          | pest control safeti            | NE    | 13
18 | TD14         | agricultur biotechnolog        | NE    | 12
19 | TD31         | deaf children                  | HM    | 13
20 | TD32         | wildlif conserv                | NE    | 86
21 | TD33         | food safeti                    | HM    | 28
22 | TD35         | arctic explor                  | NE    | 45
23 | TD36         | global warm                    | NE    | 12
24 | TD43         | forest fire                    | NE    | 25
25 | TD44         | ozon layer                     | NE    | 12
Table A.1: The topic-oriented set of the 25 experimental queries (topics are
coded as “HM” for Health and Medicine and “NE” for Nature and Ecology)
Query N | top-5     | top-10    | top-15    | top-20    | top-25    | top-30
        | TF | LMPB | TF | LMPB | TF | LMPB | TF | LMPB | TF | LMPB | TF | LMPB
1 1 | 5 1 | 5 2 | 5 3 | 6 3 | 6 5 | 6
2 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0
3 5 | 2 5 | 2 5 | 4 8 | 5 10 | 6 10 | 8
4 4 | 4 7 | 7 8 | 9 9 | 9 10 | 10 10 | 10
5 1 | 1 1 | 2 1 | 2 2 | 2 2 | 2 2 | 2
6 1 | 1 2 | 3 2 | 6 4 | 6 6 | 6 7 | 7
7 1 | 0 1 | 1 1 | 2 1 | 2 3 | 2 3 | 2
8 0 | 0 0 | 1 0 | 1 1 | 1 1 | 1 2 | 1
9 2 | 2 3 | 3 4 | 3 4 | 4 4 | 5 6 | 7
10 0 | 0 1 | 0 1 | 1 1 | 2 2 | 3 2 | 3
11 2 | 3 3 | 3 4 | 4 5 | 4 6 | 4 8 | 5
12 1 | 3 2 | 4 5 | 4 8 | 6 8 | 9 9 | 9
13 0 | 0 0 | 1 0 | 1 1 | 3 1 | 3 1 | 3
14 3 | 3 5 | 5 8 | 8 10 | 11 12 | 12 12 | 12
15 4 | 3 4 | 5 5 | 6 5 | 6 5 | 8 5 | 8
16 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1
17 0 | 2 2 | 3 3 | 3 3 | 3 3 | 3 3 | 3
18 2 | 0 2 | 2 3 | 3 3 | 3 3 | 3 3 | 3
19 0 | 0 0 | 1 1 | 1 1 | 2 1 | 3 1 | 3
20 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0
21 0 | 0 0 | 0 0 | 0 0 | 1 0 | 1 0 | 2
22 0 | 0 1 | 0 1 | 0 1 | 0 1 | 1 2 | 2
23 0 | 1 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1
24 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1
25 2 | 2 4 | 4 4 | 4 6 | 6 8 | 7 9 | 9
Table A.2: The number of relevant documents for the TF and LM04PB06
methods with the IDEAL database ranking (LM04PB06 is shortened
to LMPB for convenience)
[Plot: one curve per query N1–N25 (queries as listed in Table A.1); x-axis:
number of documents in top-k (5 to 30); y-axis: number of relevant
documents (0 to 14).]
Figure A.1: Relevant documents distribution for the TF method with the
IDEAL ranking
[Plot: one curve per query N1–N25 (queries as listed in Table A.1); x-axis:
number of documents in top-k (5 to 30); y-axis: number of relevant
documents (0 to 14).]
Figure A.2: Relevant documents distribution for the LM04PB06 method
with the IDEAL ranking
[Plot: one series per query N1–N25 (queries as listed in Table A.1); x-axis:
top-5 through top-30; y-axis: residual number of relevant documents
(−5 to 5).]
Figure A.3: Residual between the number of relevant documents of the
CE06LM04 and TF methods with the IDEAL database ranking