Result Merging in a Peer-to-Peer Web Search
Engine
Sergey Chernov
UNIVERSITÄT DES SAARLANDES
January, 2005
Result Merging in a Peer-to-Peer Web Search
Engine
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Science
Submitted by
Sergey Chernov
under the guidance of
Prof. Dr.-Ing. Gerhard Weikum
Christian Zimmer
UNIVERSITÄT DES SAARLANDES
January, 2005
Abstract
The tremendous amount of information on the Internet requires powerful search
engines. Currently, only commercial centralized search engines like Google
can process terabytes of Web documents. Such approaches fail to index
the “Hidden Web” located in intranets and local databases, and with
the exponential growth of information volume the situation becomes even
worse. Peer-to-Peer (P2P) systems can be pursued to extend the current
search capabilities. The Minerva project is a Web search engine based on a
P2P architecture. In this thesis, we investigate the effectiveness of the dif-
ferent result merging methods for the Minerva system. Each peer provides
an efficient search engine for its own focused Web crawl. Each peer can pose
a query against a number of selected peers; the selection is based on a data-
base ranking algorithm. The best top-k results from several highly ranked
peers are collected by the query initiator and merged into a single list. We
address the problem of result merging. We select several merging methods,
which are feasible for use in a heterogeneous, dynamic, distributed environ-
ment. The experimental framework for these methods was implemented and
the effectiveness of the merging techniques was studied with the TREC Web
data. The language modeling based ranking method produced the most ro-
bust and accurate results under different conditions. We also propose
a new merging method, which incorporates the preference-based language
model. The novelty of the method is that the preference-based language
model is obtained from the pseudo-relevance feedback on the best peer in
the database ranking. In every tested setup, the new method was at least as
effective as the baseline or slightly better.
I hereby declare that this thesis is entirely my own work and that I have
not used any other media than the ones mentioned in the thesis.
Saarbrücken, 12th January, 2005
Sergey Chernov
Acknowledgements
I would like to thank my academic advisor Professor Gerhard Weikum for
his guidance and encouragement throughout my master's thesis
project. I wish to express my sincere gratitude to my supervisors Chris-
tian Zimmer, Sebastian Michel, and Matthias Bender for their invaluable
assistance and feedback. I would like to thank Kerstin Meyer-Ross for her
continuous support in everything. I am very grateful to the members of the
Databases and Information Systems group AG5, fellow students from the
IMPRS program and all my friends from the Max-Planck Institute who pro-
vided me with a friendly and stimulating environment. I would like to extend
special thanks to Pavel Serdyukov and Natalie Kozlova for the numerous dis-
cussions and helpful ideas. It is difficult to explain how grateful I am to my
mother, Galina Nikolaevna, and my father, Alevtin Petrovich; their wisdom
and care made it possible for me to study. Finally, I want to thank the one
person who was most supportive and patient during this process, my dear
wife Olga. I would never have accomplished this work without her love.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Description of the remaining chapters . . . . . . . . . . . . . . 5
2 Web search and Peer-to-Peer Systems 6
2.1 Information retrieval basics . . . . . . . . . . . . . . . . . . . 6
2.2 Web search engines . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Peer-to-Peer architecture . . . . . . . . . . . . . . . . . . . . . 11
2.4 P2P Web search engines . . . . . . . . . . . . . . . . . . . . . 12
2.5 Minerva project . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Result merging in distributed information retrieval 15
3.1 Distributed information retrieval in general . . . . . . . . . . . 15
3.2 Result merging problem . . . . . . . . . . . . . . . . . . . . . 19
3.3 Prior work on collection fusion . . . . . . . . . . . . . . . . . . 21
3.3.1 Collection fusion properties . . . . . . . . . . . . . . . 21
3.3.2 Cooperative environment . . . . . . . . . . . . . . . . . 22
3.3.3 Uncooperative environment . . . . . . . . . . . . . . . 23
3.3.4 Learning methods . . . . . . . . . . . . . . . . . . . . . 25
3.3.5 Probabilistic methods . . . . . . . . . . . . . . . . . . . 26
3.4 Prior work on the data fusion . . . . . . . . . . . . . . . . . . 26
3.4.1 Data fusion properties . . . . . . . . . . . . . . . . . . 27
3.4.2 Basic methods . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.3 Mixture methods . . . . . . . . . . . . . . . . . . . . . 29
3.4.4 Metasearch approved methods . . . . . . . . . . . . . . 30
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Selected result merging strategies 31
4.1 Target properties for result merging methods . . . . . . . . . . 31
4.2 Score normalization with global IDF . . . . . . . . . . . . . 33
4.3 Score normalization with ICF . . . . . . . . . . . . . . . . . . 35
4.4 Score normalization with CORI . . . . . . . . . . . . . . . . . 36
4.5 Score normalization with language modeling . . . . . . . . . . 37
4.6 Score normalization with raw TF scores . . . . . . . . . . . . 39
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Our approach 40
5.1 Result merging with the preference-based language model . . . 40
5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Implementation 44
6.1 Global statistics classes . . . . . . . . . . . . . . . . . . . . . . 44
6.2 Testing components . . . . . . . . . . . . . . . . . . . . . . . . 46
7 Experiments 49
7.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 49
7.1.1 Collections and queries . . . . . . . . . . . . . . . . . . 49
7.1.2 Database selection algorithm . . . . . . . . . . . . . . . 51
7.1.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . 51
7.2 Experiments with selected result merging methods . . . . . . . 52
7.2.1 Result merging methods . . . . . . . . . . . . . . . . . 52
7.2.2 Merging results . . . . . . . . . . . . . . . . . . . . . . 53
7.2.3 Effect of limited statistics on the result merging . . . . 62
7.3 Experiments with our approach . . . . . . . . . . . . . . . . . 64
7.3.1 Optimal size of the top-n . . . . . . . . . . . . . . . . . 65
7.3.2 Optimal smoothing parameter β . . . . . . . . . . . . . 65
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8 Conclusions and future work 69
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Bibliography 71
A Test queries 77
List of Figures
2.1 The Minerva system architecture . . . . . . . . . . . . . . . . 13
3.1 Simple metasearch architecture . . . . . . . . . . . . . . . . . 17
3.2 A query processing scheme in the distributed search system . . 18
3.3 Collection fusion vs. data fusion . . . . . . . . . . . . . . . . . 20
3.4 An overlapping in the collection fusion problem . . . . . . . . 22
3.5 Statistics propagation for the collection fusion . . . . . . . . . 24
3.6 Data fusion on a single search engine . . . . . . . . . . . . . . 27
6.1 Main classes involved in merging . . . . . . . . . . . . . . . . 45
6.2 A general view on the experiments implementation . . . . . . 47
7.1 The macro-average precision with the database ranking RANDOM 55
7.2 The macro-average recall with the database ranking RANDOM 55
7.3 The macro-average precision with the database ranking CORI 57
7.4 The macro-average recall with the database ranking CORI . . 57
7.5 The macro-average precision with the database ranking IDEAL 58
7.6 The macro-average recall with the database ranking IDEAL . 58
7.7 The macro-average precision of the LM04 result merging method
with the different database rankings . . . . . . . . . . . . . . . 59
7.8 The macro-average precision with the database ranking CORI
with the global statistics collected over the 10 selected databases 63
7.9 The macro-average precision with the database ranking IDEAL
with the global statistics collected over the 10 selected databases 63
7.10 The macro-average precision with the database ranking CORI
with the different size of top-n for the preference-based model
estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.11 The macro-average precision with the database ranking IDEAL
with the different size of top-n for the preference-based model
estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.12 The macro-average precision with the database ranking CORI
with the top-10 documents for the preference-based model es-
timation with β = 0.6 and LM04 result merging method . . . 67
7.13 The macro-average precision with the database ranking IDEAL
with the top-10 documents for the preference-based model es-
timation with β = 0.6 and LM04 result merging method . . . 67
A.1 Relevant documents distribution for the TF method with the
IDEAL ranking . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2 Relevant documents distribution for the LM04PB06 method
with the IDEAL ranking . . . . . . . . . . . . . . . . . . . . . 80
A.3 Residual between the number of relevant documents of the
CE06LM04 and TF methods with the IDEAL database ranking 81
List of Tables
4.1 The target properties of the result merging methods . . . . . . 33
7.1 Topic-oriented experimental collections . . . . . . . . . . . . . 50
7.2 The difference in percent of the average precision between
the result merging strategies and corresponding baselines with
the RANDOM ranking. The LM04 technique is compared
with the SingleLM method; all others are compared with the
SingleTFIDF approach . . . . . . . . . . . . . . . . . . . . . 61
7.3 The difference in percent of the average precision between
the result merging strategies and corresponding baselines with
the CORI ranking. The LM04 technique is compared with the
SingleLM method; all others are compared with the SingleTFIDF
approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.4 The difference in percent of the average precision between the
result merging strategies and corresponding baselines with the
IDEAL ranking. The LM04 technique is compared with the
SingleLM method; all others are compared with the SingleTFIDF
approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.1 The topic-oriented set of the 25 experimental queries (topics
are coded as “HM” for the Health and Medicine, and “NE”
for the Nature and Ecology) . . . . . . . . . . . . . . . . . . . 78
A.2 The number of relevant documents for the TF and LM04PB06
methods with the IDEAL database ranking (LM04PB06 name
is shortened to LMPB for convenience) . . . . . . . . . . . . . 79
Chapter 1
Introduction
1.1 Motivation
Millions of new documents are created every day across the Internet and
even more are changed. The huge amount of information increases expo-
nentially and a search engine is the only hope to find the documents which
are relevant to a particular user’s need. Routine access to information is
now based on full-text information retrieval, instead of controlled vocabulary
indices. Currently, a huge number of people use the Web for text and image
search on a regular basis. In this thesis, we consider the problem of searching
text data.
The need for effective search tools becomes more important every day, but
currently only a few centralized search engines like Google (www.google.com)
can cope with this task and they are only partially effective. The so-called
Hidden Web consists of all intranets and local databases behind portal pages.
According to an estimate from [SP01], it is 2 to 50 times larger than the Visible
Web, which can be crawled by the search robots. Taking into account that
even the largest Google crawl of more than 8 billion pages encompasses only
a part of the Visible Web, we can imagine how many potentially relevant pages
a centralized search engine does not consider during a search. This problem
comes from the technical limitations of a single search engine.
The desire to overcome the limitations of a single search engine established
a new scientific direction: distributed information retrieval, or metasearch;
we will use both terms as synonyms. The main technique that was developed
in this field is an intermediate broker called a metasearch engine. It has
access to the query interfaces of the individual search engines and text data-
bases. Briefly, when a metasearch engine receives a user’s query, it passes
the query to a set of appropriate individual search engines and databases.
Then it collects the partial results and combines them to improve the overall
result. Numerous examples of metasearch engines are available on the Web
(www.search.com, www.profusion.com, www.inquirus.com).
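The query flow just described (route the query to selected engines, collect the partial results, combine them) can be sketched as follows. This is a toy illustration: the engine callables and the naive selection step are hypothetical placeholders, and the final sort assumes that scores from different engines are directly comparable, which is exactly the result merging problem examined later in this thesis.

```python
# Toy sketch of a metasearch broker: select engines, fan the query out,
# collect the partial result lists, and combine them into one ranking.
# The engines and the naive "first few" selection are placeholders.

def metasearch(query, engines, select_top=2):
    """engines: {name: callable(query) -> [(doc_id, score), ...]}."""
    selected = list(engines)[:select_top]  # placeholder for database selection
    merged = []
    for name in selected:
        for doc_id, score in engines[name](query):
            merged.append((doc_id, score, name))
    # Naively assumes scores are comparable across engines.
    return sorted(merged, key=lambda t: -t[1])

engines = {
    "engine1": lambda q: [("d1", 0.9), ("d2", 0.4)],
    "engine2": lambda q: [("d3", 0.7)],
}
print(metasearch("web search", engines))
```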
The metasearch approach contains several significant sub-problems, which
arise in the query execution process. The database selection problem arises
when a query is routed from a metasearch engine to the individual search
engines. A naive routing approach is to propagate a query to all available
engines. The scalability of such a strategy is unsatisfactory since it is in-
efficient to ask more than several dozen servers. The database selec-
tion process helps to discover a small number of the most useful databases
for a current query and to ask only this limited subset without significant
loss in recall. Many of the database selection methods were developed to
tackle this issue [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02]. The
result merging problem is another important sub-problem of the metasearch
technique. In information retrieval, the output result is a ranked list of
documents, which are sorted by their similarity score values. In distrib-
uted information retrieval, the aforementioned list is obtained from several re-
sult lists, which are merged into one. The result merging problem is not
trivial; numerous merging techniques have been studied in the literature
[CLC95, TVGJL95, Bau99, Cra01, SJCO02]. The issue of automatic
database discovery has not been fully addressed, so adding new data sources to
a metasearch engine largely remains a manual task.
The major drawback of metasearch is that large search engines are not
interested in cooperation. A search result is a commercial product which
they want to “sell” on their own. For example, the STARTS proposal
[GCGMP97] is a quite effective communication protocol designed especially
for metasearch, but it is not widely used because of the reason above. The
new Peer-to-Peer (P2P) technologies can help us to remove the limitations
caused by the uncooperativeness of search engine vendors. The computational
power of processors increases every year, and so does the network bandwidth.
Millions of personal computers have enough storage and computational re-
sources to index their own documents and perform small crawls of the inter-
esting fragments of the Web. They can provide a search on their local index,
but do not have to uncover the data itself unless they want to. This is the
way to incorporate the Hidden Web pages into a global search mechanism.
Collaborative crawling can span a larger portion of the Web, since every
peer can contribute its own focused crawl into the system. This method is
cheap and provides us with topic-oriented search opportunities; we can also
use intellectual inputs from other users to improve our own search. Such
considerations launched the Minerva project, a new P2P Web search engine.
The metasearch field has many common properties with search in a P2P
system, but some important distinctions should be taken into account. A
P2P environment is much more dynamic than traditional metasearch:
• Queries are processed on millions of small indices instead of dozens of
large indices;
• Global query execution might require resource sharing and collabora-
tion among different peers and cannot be fully performed on one peer;
• Limited statistics is a necessary requirement for a scalable P2P system,
while in distributed information retrieval rich statistics can be
provided by a centralized source;
• Cooperativeness of peers in a P2P system, in contrast to a metasearch
setting, helps to reduce heterogeneity in such parameters as represen-
tative statistics or index update time.
Distributed information retrieval accommodates features from two re-
search areas: information retrieval and distributed systems. The goal of
effectiveness is inherent to the former: it aims at high relevance of the re-
turned documents, and the collaboration of users in a P2P setting gives us
additional opportunities to refine the search results.
The main goals of the Minerva project include the traditional metasearch
goals and new issues:
1. Increased search coverage of the Web;
2. Retrieval effectiveness comparable to that of a centralized search engine;
3. Scalable architecture to combine millions of small search engines.
For this purpose, we want to exploit existing solutions from distributed infor-
mation retrieval and adapt them to our new setup, with the aforementioned
distinctive properties in mind. We also want to find novel methods, which
are suitable for P2P architecture and can improve our system. The practi-
cal goal is to create a prototype of a highly scalable, effective, and efficient
distributed P2P Web search engine.
1.2 Our contribution
The main purpose of this thesis is to develop an effective result merging
method for the Minerva system. We analyze the major sub-problems of result
merging and review several existing techniques. The selected methods have
been implemented and evaluated in the Minerva prototype. In addition, a
new preference-based language model method for result merging is proposed.
Our approach combines the preference-based and the result merging rankings.
The novelty of the method is that the preference-based language model is
obtained from the pseudo-relevance feedback on the best peer in the database
ranking.
We address the issue of effectiveness. It is determined by the underlying
result merging scheme. As in distributed information retrieval, in a P2P
system the similarity scores for each document are computed on the basis
of local database statistics. This makes the scores incomparable due to
the differences in statistics across databases. A score computation
based on the global statistics is the most accurate solution in our case. For
the cooperative data sources, as we have in the Minerva system, we can
collect the local database-dependent statistics and replace them with globally
estimated ones, which are fair to all databases. We elaborated on this issue
by testing several global score computation techniques and discovering the
most effective scoring function.
We also exploited additional information about the user’s preferences in
order to improve the quality of the final ranking. Our method combines two
rankings. The first ranking is the language modeling result merging scheme.
The second one is based on the language model from pseudo-relevance feed-
back. The user preferences are inferred using pseudo-relevance feedback:
the top-k results from the best-ranked database are assumed relevant. The
novelty of our method is that pseudo-relevant feedback is obtained on the
top-ranked peer before the global query execution.
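As a purely illustrative sketch of combining two rankings: one simple way is a linear interpolation of their scores. The interpolation weight and the toy score dictionaries below are assumptions for illustration, not the thesis's formulas; the actual combination is defined in Chapter 5.

```python
# Illustrative only: combine a result merging ranking with a
# preference-based ranking via linear interpolation. The weight beta
# and the toy scores are hypothetical stand-ins.

merge_scores = {"d1": 0.8, "d2": 0.5, "d3": 0.3}  # result merging ranking
pref_scores = {"d1": 0.2, "d2": 0.7, "d3": 0.4}   # preference-based model

def combined(doc, beta=0.6):
    # Weighted sum of the two per-document scores.
    return beta * merge_scores[doc] + (1 - beta) * pref_scores[doc]

ranking = sorted(merge_scores, key=combined, reverse=True)
print(ranking)
```

Here d2 overtakes d1 because the preference-based model favors it, illustrating how the second ranking can reorder the merged list.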
1.3 Description of the remaining chapters
Background information about information retrieval and P2P systems is
presented in Chapter 2. An overview of distributed information retrieval and
recent work on result merging is introduced in Chapter 3. Chapter 4 presents
details of the merging techniques that we select for our experimental stud-
ies. Chapter 5 contains the new approach that uses the preference-based
language model. In Chapter 6, we present implementation details. The ex-
perimental setup, evaluation methodology, and our results are presented in
Chapter 7. Chapter 8 closes the thesis with conclusions and sugges-
tions for future work.
Chapter 2
Web search and Peer-to-Peer
Systems
In this chapter, we give a short description of Web search, P2P systems,
and their potential as a platform for distributed information retrieval. Sec-
tion 2.1 contains introductory information about information retrieval. In
Section 2.2, we review the Web search engines. In Section 2.3, some general
properties of P2P systems are discussed. Section 2.4 presents recent ap-
proaches for combining search mechanisms with P2P architecture. Section
2.5 describes our approach, the Minerva project.
2.1 Information retrieval basics
Information retrieval deals with search engine architectures, algorithms,
and methods that are concerned with information search on the Internet,
digital libraries, and text databases. The main goal is to find the relevant
documents for a query from a collection of documents. The documents are
preprocessed and placed into an index, which provides the base for retrieval.
A typical search engine is based on the single-database model of text
retrieval. In this model, the documents from the Web and local databases
are collected into a centralized repository and indexed. The whole model is
effective if the index is large enough to satisfy most of the users' information
needs and the search engine uses an appropriate retrieval system. A retrieval
system is the set of retrieval algorithms for the different purposes: ranking,
stemming, index processing, relevance feedback and so on.
The widely used bag-of-words model assumes that every document may
be represented by the words it contains. The most frequent
words like “the”, “and” or “is” do not have rich semantics. They are called
stopwords, and we remove them from the document representation. The
full set of stopwords is stored in a stopword list. Word variations with
the same stem, like “run”, “runner” and “running”, are mapped onto one
term corresponding to that stem; a stemming algorithm performs
this mapping. In the current example, the term is “run”.
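The preprocessing steps just described can be sketched as follows; the stopword set and the suffix-stripping rules are tiny illustrative stand-ins for a full stopword list and a real stemming algorithm such as Porter's.

```python
# Sketch of bag-of-words preprocessing: tokenize, drop stopwords, stem.
# The stopword set and suffix rules are toy stand-ins for the real lists
# and algorithms a search engine would use.

STOPWORDS = {"the", "and", "is", "a", "of", "in"}

def naive_stem(word):
    # Strip a few common English suffixes; a real stemmer is more careful.
    for suffix in ("ning", "ner", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("the runner is running"))
```

Both "runner" and "running" map onto the single term "run", while the stopwords disappear.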
An important characteristic of a retrieval system is its underlying model
of retrieval process. This model specifies the procedure of the probability
estimation that a document will be judged relevant. The final document
ranking is based on this estimation. The ranking is presented to the user after
a query execution. The simple retrieval process models include a probabilistic
model and a vector space model; the latter is the most widely used in
search engines. In the vector space model, the document D is represented
by the vector ~d = (w1, w2, . . . , wm), where wi is the weight indicating the
importance of term ti in representing the semantics of the document and
m is the number of distinct terms. For all terms that do not occur in the
document, the corresponding entries are equal to zero, so the full document
vector is very sparse.
When a term occurs in a document, two factors are of importance
in weight assignment. The first factor is the term frequency TF: the
number of the term's occurrences in the document. The weight of the term in
the document's vector is proportional to TF. The more often a term occurs,
the more important it is in representing the document's semantics. The second
factor, which affects the weight, is the document frequency DF: the
number of documents containing the particular term. The term weight is multiplied
by the inverse document frequency IDF. The more frequently a term appears
across documents, the less useful it is for discriminating the documents
having the term from the documents not having it. The de facto
standard for term weighting is the TF · IDF product and its modifications.
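A minimal sketch of the plain TF · IDF weight described above; real engines use damped variants, e.g. logarithmic TF scaling, and the counts below are made-up toy numbers.

```python
import math

def tf_idf(tf, df, num_docs):
    """Plain TF*IDF weight: term frequency times inverse document
    frequency. Real systems use damped variants, e.g. log-scaled TF."""
    idf = math.log(num_docs / df)
    return tf * idf

# A term occurring 3 times in a document, appearing in 10 of 1000 documents:
w = tf_idf(tf=3, df=10, num_docs=1000)
print(round(w, 3))
```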
A simple query Q is a set of keywords. It is also transformed into an
m-dimensional vector ~q = (w1, w2, . . . , wm) using all preprocessing steps like
stopword elimination, stemming, and term weighting. After the creation of
~q and ~d, the similarity between the document’s vector and the query’s vector
is estimated. This estimation is based on a similarity function, which can be a
distance or angle measure. The most popular similarity function is the cosine
measure, which is computed as a scalar product between ~q and ~d.
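The cosine measure can be sketched as the scalar product of the two vectors divided by their lengths; the sparse dictionaries below stand in for the (mostly zero) term-weight vectors ~q and ~d, with made-up weights.

```python
import math

def cosine(q, d):
    """Cosine similarity between sparse term-weight vectors
    represented as {term: weight} dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"run": 1.0, "marathon": 1.0}
d = {"run": 2.0, "marathon": 1.0, "shoe": 1.0}
print(round(cosine(q, d), 3))
```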
Another popular approach, which tries to overcome the heuristic nature of
term weight estimation, comes from the probabilistic model. The language
modeling approach [PC98] to information retrieval attempts to predict the
probability of query generation given a document. Although details may
differ, the main idea is the following: every document is viewed as a sample
generated from a special language. A language model for each document can
be estimated during indexing. The relevance of a document for a particular
query is formulated as how likely the query was generated from the language
model for that document. The likelihood for the query Q to be generated
from the language model of the document D is computed as follows [SJCO02]:
P(Q|D) = ∏_{i=1}^{|Q|} [ λ · P(ti|D) + (1 − λ) · P(ti|G) ]   (2.1)
where:
ti is the i-th term of the query Q;
P(ti|D) is the probability for ti to appear in the document D;
P(ti|G) is the probability for the term ti to be used in the common
language model, e.g. in English;
λ is the smoothing parameter between zero and one.
The role of P(ti|G) is to smooth the probability of the document D to
generate the query term ti, particularly when P(ti|D) is equal to zero.
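Equation 2.1 can be evaluated directly; in this sketch the toy dictionaries stand in for maximum-likelihood estimates of P(t|D) and P(t|G), which a real system would derive from term counts.

```python
def query_likelihood(query_terms, p_doc, p_global, lam=0.5):
    """Smoothed query likelihood P(Q|D) as in Equation 2.1.
    p_doc and p_global map terms to P(t|D) and P(t|G); lam is
    the smoothing parameter lambda in (0, 1)."""
    score = 1.0
    for t in query_terms:
        score *= lam * p_doc.get(t, 0.0) + (1 - lam) * p_global.get(t, 0.0)
    return score

p_doc = {"web": 0.1, "search": 0.05}            # toy estimates of P(t|D)
p_global = {"web": 0.01, "search": 0.02, "the": 0.06}  # toy P(t|G)
print(query_likelihood(["web", "search"], p_doc, p_global, lam=0.5))
```

Note how smoothing keeps the product nonzero even for query terms absent from the document, as long as P(t|G) is positive.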
The usual measures for retrieval effectiveness evaluation are recall
and precision; they are defined as follows [MYL02]:

recall = NumberOfRetrievedRelevantDocuments / NumberOfRelevantDocuments   (2.2)

precision = NumberOfRetrievedRelevantDocuments / NumberOfRetrievedDocuments   (2.3)
The effectiveness of a text retrieval system is evaluated using a set of test
queries. The relevant document set is identified beforehand. For every test
query a precision value is computed at different levels of recall; these
values are averaged over the whole query set and an average recall-precision
curve is produced. In the ideal case, when a system retrieves exactly the full set of
relevant results every time, the recall and precision values are both equal to
1. In practice, we cannot achieve such effectiveness due to query ambi-
guity, each user’s specific understanding of the relevance notion, and other factors.
Incorporating explicit user feedback and user pref-
erences implicitly inferred from previous search sessions can improve the retrieval quality.
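Equations 2.2 and 2.3 translate directly into code; the retrieved and relevant document sets below are made-up toy data.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision as defined in Equations 2.2 and 2.3."""
    retrieved_relevant = len(set(retrieved) & set(relevant))
    recall = retrieved_relevant / len(relevant)
    precision = retrieved_relevant / len(retrieved)
    return recall, precision

# 10 documents retrieved, 4 of them among the 8 relevant ones overall:
r, p = recall_precision(retrieved=list(range(10)),
                        relevant=[0, 2, 5, 7, 20, 21, 22, 23])
print(r, p)
```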
2.2 Web search engines
An information retrieval system for Web pages is called a Web search engine.
The capabilities of these systems are very broad; the modern techniques
allow queries on text, image, and sound files. In our work, we consider the
problem of text data retrieval. Web search engines are also differentiated
by their application area. The general-purpose search engines can search
across the whole Web, while the special-purpose engines focus on
specific information sources or specific subjects. We are interested in the
general-purpose Web search engines.
The Web search engines inherited many properties from traditional infor-
mation retrieval. Every Web search engine has a text database or, equally, a
document collection that consists of all documents searchable by this engine.
An index for these documents is created before query time; every term in it
represents a single keyword or phrase. For each term one inverted index
list is constructed; this list contains document identifiers for every document
containing the term, along with the corresponding similarity values. During
query execution, a search engine takes a union over the inverted index lists
corresponding to the query terms. Then the search engine sorts all found docu-
ments in descending order of their similarity score and presents the resulting
ranking to the user.
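The index structure and query execution just described can be sketched as follows; the postings and weights are toy values, standing in for precomputed similarity contributions such as TF · IDF scores.

```python
from collections import defaultdict

# Sketch of an inverted index: each posting maps a document id to a
# precomputed similarity contribution for that term.

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: weight}

    def add(self, term, doc_id, weight):
        self.postings[term][doc_id] = weight

    def search(self, query_terms):
        # Union of the posting lists, summing per-document scores,
        # then sort by descending score.
        scores = defaultdict(float)
        for t in query_terms:
            for doc_id, w in self.postings[t].items():
                scores[doc_id] += w
        return sorted(scores.items(), key=lambda kv: -kv[1])

idx = InvertedIndex()
idx.add("web", 1, 0.8)
idx.add("web", 2, 0.3)
idx.add("search", 2, 0.9)
print(idx.search(["web", "search"]))
```

Document 2 ranks first because its contributions from both query terms are summed.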
There are also distinctive features of the Web retrieval that were not
used in traditional information retrieval. The most prominent examples are
additional hyperlink relationships between the documents and intensive
document tagging. These differences can serve as sources of additional
information for a search refinement and they are exploited in the different
retrieval algorithms.
Web developers created a significant portion of the hyperlinks on
the Web manually, and this is an implicit intellectual input. The linkage
structure can be used as expert evidence that two pages connected by a
hyperlink are also semantically related. It can also be an indication that
the Web designer, who placed the hyperlinks on some pages, assesses their
content as valuable. Several algorithms are based on these considerations.
The PageRank algorithm computes the global importance of the Web page
in a large Web graph, which is inferred from the set of crawled pages [BP98].
The advantage of this algorithm is that a global importance of the page can
be precomputed before query execution time. The HITS algorithm [Kle99]
uses only a small subset of the Web graph; this subset is constructed at
query time. Such online computation is inconvenient for search engines
with a high query workload, but it allows a topic-oriented page authority
computation.
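A minimal power-iteration sketch of PageRank over a toy Web graph; the damping factor 0.85 is the value commonly cited in the literature, and this simple version does not handle dangling pages (pages without out-links), which a production implementation must.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns page -> rank score.
    Assumes every page has at least one out-link (no dangling pages)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

Page c ends up with the highest rank, since both a and b link to it; the ranks sum to 1 and can be precomputed offline, as the text notes.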
The HTML tagging of documents can also be of use in the Web search
engines. Rich information about the importance of terms is inferred from
their position in a document. The terms in the title section are more impor-
tant than in the body of the document. Emphasizing with a font size and
style also indicates additional importance of the term. Sophisticated
term weighting schemes based on these observations improve the
retrieval quality.
There are several important limitations of the existing Web search en-
gines. The first restriction is imposed on the size of the searchable index. Accord-
ing to the Google statistics (www.google.com), this search engine has the
largest crawled index on the Web, its current size is about 8 billion pages.
At the same time the Hidden Web or Deep Web, which embraces the pages
that were excluded from the crawling for commercial or technical reasons,
is about 2 to 50 times larger than the Visible Web [SP01]. Even now
it is unrealistic for a single search engine to maintain an index of this size
and the information volume increases even faster than the computational power
of the centralized Web search engines. The second problem is the outdating
of the crawled information. News pages change daily, and it is impossible
to update the whole index at this rate. Some updating strategies
help track changes on the most popular sites on the Internet, but many index
entries are completely outdated. The novel opportunities provided by the
peer-to-peer systems help to solve these problems.
2.3 Peer-to-Peer architecture
A distributed system is a collection of autonomous computers that co-
operate in order to achieve a common goal [Cra01]. In the ideal case, a user of
such a system does not explicitly notice other computers, their location, stor-
age replication, load balancing, reliability or functionality. A P2P system is an
instance of a distributed system; it is a decentralized, self-organized, highly
dynamic, loose coupling of many autonomous computers.
P2P systems became famous several years ago with the Napster
(http://www.napster.com/) and Gnutella (http://www.gnutella.com/) file-
sharing systems. In the file-sharing P2P communities, every computer can
join as a peer using the client program. Other peers can access all resources
shared by the peers in this environment. The main feature of such systems
is that the peer who is looking for a file can directly contact the peer that
is sharing this file. The only information that has to be propagated is the
peer’s address and a short description of the shared data.
The first systems like Napster used a centralized server with all peer
addresses and the names of the shared files. Other approaches avoided this single
point of failure and used the Gnutella-style flooding protocol, which
broadcasts a request for a particular file through a small number of closest
neighbors until the message expires. Modern P2P applications like
e-Donkey (http://www.edonkey2000.com/) are extremely popular now; they
have numerous improvements over their predecessors.
Thus, we can harness the power of thousands of autonomous personal
computers all over the world to create a temporary community for collab-
orative work. P2P technology aims to make systems scalable, self-
organizing, fault-tolerant, publicly available, and load-balanced. This list of desir-
able P2P properties is not exhaustive, and there are also issues like anonymity,
security, etc., but the selected properties are fundamental for our task. For
example, modern P2P systems are often based on a hybrid topology in which
some "super-peers" establish different levels of hierarchy, but we are in-
terested in a pure, flat P2P structure. It gives equal rights to all peers and
makes a system more scalable.
Limited search capability is a considerable drawback of most
P2P systems: sometimes the user has to know the exact filename of the
data of interest or relevant results will be missed. The combination of
search engine mechanisms for effective retrieval with the powerful paradigm
of a P2P community is a promising research direction.
2.4 P2P Web search engines
The idea of a peer-to-peer Web search engine is being extensively investigated
nowadays. Interesting combinations of search services with P2P
platforms are described in the following approaches.
ODISSEA [SMW+03] differs from many other P2P search approaches.
It assumes a two-layered search engine architecture and a global index struc-
ture distributed over the nodes of the system. Under a global index orga-
nization, in contrast to a local one, a single node holds the entire inverted
index for a particular term. A distributed version of Fagin's threshold al-
gorithm is used for result aggregation over the inverted lists; it is efficient
only for very short queries of about 2-3 words. For the distributed hash table
(DHT) implementation, this system incorporates the Pastry protocol.
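The threshold algorithm mentioned above can be sketched as follows. This is a minimal, centralized sketch over small in-memory posting lists; the function name, list contents, and scores are illustrative, and the actual ODISSEA version distributes the lists over the network.

```python
def threshold_algorithm(lists, k):
    """Fagin's threshold algorithm (TA): scan score-sorted lists in parallel,
    aggregate scores by summation, and stop once the current top-k can no
    longer be overtaken by any unseen document."""
    lookup = [dict(l) for l in lists]   # simulated random access per list
    seen = {}                           # doc -> aggregated score
    top = []
    for depth in range(max(len(l) for l in lists)):
        threshold = 0.0                 # best possible score of an unseen doc
        for l in lists:
            if depth >= len(l):
                continue
            doc, score = l[depth]       # sorted access at the current depth
            threshold += score
            if doc not in seen:
                seen[doc] = sum(lu.get(doc, 0.0) for lu in lookup)
        top = sorted(seen.items(), key=lambda x: -x[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break                       # early termination
    return top

# Two per-term posting lists, sorted by descending score
lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)],
]
top2 = threshold_algorithm(lists, k=2)  # d2 and d1 have the highest sums
```

The early-termination test explains why TA degrades on long queries: the threshold is a sum over all query terms, so with many lists it stays high for a long time and deep scans are needed.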
PlanetP [CAPMN03] is another content search infrastructure. Each node
maintains an index of its content and summarizes the set of terms in its index
using a Bloom filter. The global index is the set of all summaries. Summaries
are propagated and kept synchronized using a gossiping algorithm. This
approach is effective for several thousand peers, but it does not scale beyond that,
and its retrieval quality is rather low for top-k queries with a small k.
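The Bloom-filter summaries can be illustrated with a small sketch. The parameters (1024 bits, 3 hash functions) and the class design are illustrative choices, not PlanetP's actual configuration.

```python
import hashlib

class BloomFilter:
    """Compact term-set summary of a peer's index: no false negatives,
    but false positives are possible."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, term):
        # Derive k bit positions from k salted hashes of the term
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{term}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, term):
        for p in self._positions(term):
            self.bits |= 1 << p

    def __contains__(self, term):
        return all(self.bits >> p & 1 for p in self._positions(term))

summary = BloomFilter()
for t in ["peer", "search", "merge"]:
    summary.add(t)
```

A peer gossips only the bit vector, so other peers can test "does this peer possibly have term t?" without contacting it; a positive answer may occasionally be wrong, which is acceptable for query routing.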
GALANX [WGDW03] is a P2P system implemented on top of
BerkeleyDB. Similar to the Minerva system, it maintains a local peer
index on every node and distributes information about term presence on a
peer with a DHT. Different query routing strategies are evaluated in
simulations. Most of them are based on the Chord protocol, and the proposed
strategies improve the basic effectiveness by enlarging the index size.
The presented query routing approaches are not highly scalable, since the
index volume continuously increases with the number of peers in the system.
2.5 Minerva project
The Minerva project [BMWZ04] is another Web search engine that is
based on a P2P architecture; see Figure 2.1. In this system, every peer Pi
provides an efficient search engine for its own focused Web crawl Ci. The
documents Dij are indexed locally, and the result is posted into a global
directory as a set of index statistics Si. The posting process and all other com-
munication between the peers are based on the Chord protocol [SMK+01].
Every peer contains a set of peerlists Li for a disjoint subset of terms Ti,
where T1 ∪ … ∪ T|P| = T. A peerlist l is a mapping t → P′, where t is a particular
term and P′ is the subset of peers that contain at least one document
with this term. The terms are hashed, and their corresponding peerlists are
distributed fairly across the peers by the Chord protocol. During query exe-
cution, all necessary peerlists, one for each query keyword, are obtained and
merged into one.
Figure 2.1: The Minerva system architecture
Every peer can pose a query against a number of selected peers that are
most likely to contain the relevant documents. The selection is based
on a query routing strategy; this issue is known in the literature as the
database selection problem. The search engine on every selected peer processes
its inverted index until it obtains the top-k highest ranked documents for the
current query. Then the best top-k results from these peers are collected
by the query initiator and merged into one top-k list; this task is known as
the result merging problem. The quality of the final top-k list depends heavily
on the peers' term weighting schemes and on the merging algorithm, whereas speed
depends mostly on the local index processing scheme.
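The directory lookup and peerlist merging described above can be sketched as follows. This is a simplified single-process sketch in which a plain dictionary stands in for the Chord-partitioned global directory; all names and contents are illustrative.

```python
# Global directory mapping terms to peerlists; in Minerva this mapping is
# partitioned over the peers via Chord, here it is a plain dictionary.
directory = {
    "peer":   {"P1", "P2"},
    "search": {"P2", "P3"},
}

def candidate_peers(query_terms):
    """Fetch one peerlist per query keyword and merge them into one set
    of candidate peers for query routing."""
    peerlists = [directory.get(t, set()) for t in query_terms]
    return set().union(*peerlists)

print(sorted(candidate_peers(["peer", "search"])))  # ['P1', 'P2', 'P3']
```

The merged set is then the input to database selection: the query initiator ranks these candidate peers and forwards the query only to the most promising ones.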
2.6 Summary
In this chapter, we introduced several basic concepts from information re-
trieval and Web search. We described some key ideas of P2P systems and
reviewed several combinations of Web search engines with P2P platforms.
We also gave a short description of the new P2P Web search engine Minerva.
The scalability issue was recognized as extremely important, and the P2P
architecture seems valuable in terms of effective and efficient retrieval.
Chapter 3
Result merging in distributed
information retrieval
In this chapter, we review recent work on distributed information re-
trieval. In Section 3.1, we give a short overview of the general metasearch
issues. Section 3.2 contains a comprehensive description of the result merging
task. In Section 3.3, we elaborate on the collection fusion task. In Section
3.4, we address the problems of the data fusion task.
3.1 Distributed information retrieval in general
During the past ten years, a new research direction has emerged: distributed
information retrieval, or metasearch. Metasearch is the task of collecting and
combining the search results from a set of different sources. A typical scenario
includes several search engines that execute a query and one metasearch
engine that merges the results and creates a single ranked document list.
Several interesting surveys of distributed information retrieval problems and
solutions are presented in [Cal00, Cra01, Cro00, MYL02].
The distributed information retrieval task arises when the documents of in-
terest are spread across many sources. In such a situation, one can either
collect all documents on one server or establish multiple search engines, one
for each collection of documents. The search process is then performed across the
network, with communication between many servers; this is the distinctive
feature of distributed information retrieval.
Search in a distributed environment has several attractive proper-
ties that make it preferable to single-engine search. Several of these
important features are listed in [MYL02]:
• Increased coverage of the Web: the indices from many sources are
used in one search;
• A solution to the search scalability problem: a combination is
cheaper than a centralized solution;
• Automation of result preprocessing and combination: a user does
not have to compare and combine the results from different sources
manually;
• Improved retrieval effectiveness: a combination of different search
engines can produce a better ranking than any single ranking algorithm.
Metasearch is based on the multi-database model, where several text
databases are modeled explicitly. The multi-database model for information
retrieval has many tasks in common with the single-database model but also
poses some additional problems [Cro00]:
• The resource description task;
• The database selection task;
• The result merging task.
These issues are essentially the core of distributed information retrieval re-
search; we briefly describe them below.
The main unit in metasearch is an intermediate broker called
a metasearch engine. It obtains and stores a limited summary about every
database participating in the search process and decides which databases are
most appropriate for a query. A metasearch engine also propagates a query
to the selected search engines, then collects and reorganizes the results. A simple
metasearch architecture is presented in Figure 3.1. A user poses a query Q
against a metasearch engine, which in turn propagates it to several search
engines. Then the result rankings Ri are returned to the broker, merged,
and presented to the user as a single document ranking Rm.
Figure 3.1: Simple metasearch architecture
The summary statistics of a search engine are called a resource description
or database representative. A full-text database provides information about
its contents as a set of statistics, which may include the number
of occurrences of specific terms in particular documents or in the whole
collection, the number of indexed documents, etc. The information for building
a resource description is obtained during the index creation step. The richness
of the database representatives depends on the level of cooperation in the
system. For example, the STARTS standard [GCGMP97] is a good choice
for a cooperative environment, where all search engines present their results
in a unified, informative format. On the other hand, when they are unwilling
to cooperate, we can infer their statistics by query-based sampling [SC03].
The collected resource descriptions are used for the database selection,
or query routing, task. In practice, we are not interested in databases
that are unlikely to contain relevant documents. Therefore, from all data
sources we can select only those that are probably relevant to our query
according to their resource descriptions. For each database, we calculate a
usefulness measure, usually based on the vector space model. Creating
an effective and robust usefulness measure for database ranking is the
most prominent task of database selection. Several attempts to address this
problem are described in [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02].
The result merging problem arises when a query is executed on several
selected databases and we want to create one single ranking out of these
results. This problem is not trivial, since the computation of the similarity
score between documents and the query uses local collection statistics; there-
fore, the scores are not directly comparable. The most accurate solution
is global score normalization, which requires cooperation from the
sources. We are especially interested in this latter problem: a
carefully designed result merging algorithm can provide high-quality
results and an opportunity to speed up local index processing.
More information about the result merging methods can be found in
[CLC95, TVGJL95, Bau99, Cra01, SJCO02, SC03].
Figure 3.2: A query processing scheme in the distributed search system
The query processing scheme is presented in more detail in Figure 3.2
[Cra01]. A query Q is posed against a set of search engines represented
by their resource descriptions Si. The metasearch engine selects a subset of
servers S' that are most likely to contain the relevant documents. The
size of this subset usually does not exceed 10 databases. The broker routes
Q to the selected search engines Si' and obtains a set of document rankings
Ri from the selected servers. In the real world, a user is interested only in the
top-k best results, where k can vary from 5 to 30. All rankings Ri are merged
into one ranking Rm, and its top-k results are presented to the user.
Text retrieval aims at high relevance of the results at minimum
response time. These two components translate into the general issues
of effectiveness (quality) and efficiency (speed) of query processing.
This thesis is concerned with the effectiveness aspect of the result merging problem.
3.2 Result merging problem
A common issue in metasearch is how to combine several ranked lists
of relevant documents from different search engines into one ranked
list: the so-called result merging problem. The following sections review
some modern merging methods.
Result merging is divided into two main sub-problems. The first is
collection fusion, where results from disjoint or nearly disjoint
document sets are merged. The second is data fusion, which arises
when we merge different rankings obtained over identical document
sets.
The main difference between collection fusion and data fusion is that
in the first case we want to approximate the result of a single search system
whose document set is the union of all document subsets involved in the
merging. Therefore, the optimal solution achieves the very same retrieval
effectiveness as a search engine with the united database. In
the data fusion problem, however, the task is to merge the different rankings in such
a way that the final ranking is better than every participating ranking. The
maximum achievable quality here is undefined, but it should be no less than
the quality of the best single ranking. A simple intuition for these two problems
is presented in Figure 3.3. A comprehensive description of the differences
between collection fusion and data fusion can be found in [VC99b, Mon02].
In metasearch, we often do not know beforehand what kind of merg-
ing problem we have, because it depends on the level of overlap between the
documents of the combined databases. If the overlap is very high, the situation
is closer to data fusion; otherwise it is a collection fusion task.
Metasearch on the Web has been addressed mainly as a collection fusion prob-
lem; in fact, the overlap of search results from different search engines is
surprisingly low. However, some approaches also take the data
fusion methods into account, and sometimes both types are evaluated in mixed
setups.
Figure 3.3: Collection fusion vs. data fusion
Another important property is the level of search engine cooperation. We
divide all merging methods by environment type into two categories:
• Cooperative (integrated) environment;
• Uncooperative (isolated) environment.
The uncooperative or isolated merging methods have no other access to the
individual databases than a ranked list of documents in response to a
query [Voo95]. The cooperative or integrated merging techniques assume
access to database statistics such as TF, DF, etc. In general, both
types of merging methods can produce a more effective result than the single
collection with the full set of documents if the data fusion strategy is used
[TVGJL95]. In practice, the merged results produced by the uncooperative
strategies have been less effective than the single collection run.
Our primary goal is to find a subset of effective merging methods
that we can apply and evaluate in the P2P Web search engine Minerva.
We assume here that all peers in the Minerva system are cooperative and
provide all necessary statistics.
3.3 Prior work on collection fusion
A formal definition of the collection fusion problem was stated in [TVGJL95].
There it is mixed with the data fusion definition; therefore, we modified it. Assume
a set of document collections C associated with the search engines. With
respect to a query Q, each collection Ci contains a number of relevant
documents. After the query Q is posed against the collection Ci, the search
engine returns a ranked list Ri of documents Dij in decreasing order of
their similarity Sij to the query. The top-k result is the merged ranked list
of length k containing the documents Dij with the highest similarity values
Sij in decreasing order. Consider the united document collection Cg = ⋃i Ci and
its top-k result Rg, which contains the documents Dgj with similarity values
Sgj. The collection fusion task is: given Q, C, and k, find from ⋃i Ri the top-k
result Rc of documents Dcj such that Scj = Sgj.
3.3.1 Collection fusion properties
An ideal collection fusion method combines the documents from the local search
results into one ranked list in descending order of their global similarity
scores, the scores that would be produced by a single global search
system over the united database containing all local documents. In a coop-
erative environment, where all search engines provide the necessary statistics, we
can achieve merging consistent with a non-distributed system;
this is also known as perfect merging or merging with normalized scores
[Cra01]. In practice, no efficient collection fusion technique can guarantee
exactly the same ranking as on a centralized database holding all documents
from all involved databases. Three main factors affect collection fusion:
1. Only the documents returned by the selected servers can participate in
the merging. Some relevant documents will be missed after the database
selection step.
2. Different statistics and retrieval algorithms cause the separate
problem of incomparable scores. A document might be missed
when the top-k results are merged and the necessary document is
locally ranked (k+1)th or lower. This problem can be solved by
global statistics normalization methods in the cooperative environment.
3. Overlap between the databases (see Figure 3.4). The pure col-
lection fusion approaches [VF95, Kir97, CLC95] do not consider over-
lap. It is quite difficult to accurately estimate the actual level of
document overlap between datasets. Our assumption is that the
degradation of result quality due to overlap is small, whereas the
effort required for statistics correction would be significant.
Figure 3.4: An overlapping in the collection fusion problem
3.3.2 Cooperative environment
In [SP99, SR00] it was claimed that simple raw score merging can show
good retrieval performance. The raw-score approach might
be a valid first attempt for merging result lists that are provided
by the same retrieval model. In [CLC95] it was suggested that collection
fusion based on raw TF values is a valuable approach when the
involved databases are more or less homogeneous; retrieval quality then
degrades only by 10%. However, we assume topically organized collections,
and these have highly skewed statistics.
The most effective collection fusion methods are the score normaliza-
tion techniques, which are based on consistent global collection statistics.
All search engines must produce the document relevance scores using the
same retrieval algorithms, including the document ranking algorithm, stemming
method, and stopword list. A metasearch engine collects all required local sta-
tistics from the selected databases before or at query time. Notice that
under the common TF · IDF scheme the TF component is document-
dependent and therefore fair across all databases. In contrast, the IDF compo-
nent is collection-dependent, so we should normalize it globally. Analogously,
in language modeling the P(ti|D) component remains unchanged and
P(ti|G) should be recomputed. The communications for such aggregation
are presented in Figure 3.5 [Cra01].
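A minimal sketch of this normalization under the TF · IDF scheme: the TF component stays as computed locally, while the IDF is recomputed from document frequencies and document counts aggregated over all participating databases. All statistics and names shown are illustrative.

```python
import math

# Statistics exported by each selected database: per-term document
# frequencies (DF) and the number of indexed documents
db_stats = [
    {"df": {"merge": 10, "peer": 40}, "ndocs": 1000},
    {"df": {"merge": 5,  "peer": 1},  "ndocs": 500},
]

def global_idf(term):
    """IDF recomputed from DF and document counts summed over all databases."""
    df = sum(s["df"].get(term, 0) for s in db_stats)
    n = sum(s["ndocs"] for s in db_stats)
    return math.log(n / df)

def normalized_score(tf_vector, query_terms):
    # The TF component is document-dependent and stays as computed locally;
    # only the collection-dependent IDF component is replaced globally.
    return sum(tf_vector.get(t, 0) * global_idf(t) for t in query_terms)

score = normalized_score({"merge": 3, "peer": 1}, ["merge", "peer"])
```

With every database scoring against the same global IDF, the returned scores become directly comparable, and merging reduces to a sort.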
In scheme A from [VF95], the search engines exchange their DF sta-
tistics among themselves before query time, so comparable similarity scores
can be computed during query execution. Under scheme B [CLC95], the
databases also return comparable scores, but the document frequency
statistics are collated at the metasearch engine and sent along with the query. In
scheme C [Kir97], instead of communication before query time, all search
engines return the TF and DF statistics together with the document rank-
ings, and this full information is used for fair fusion. The distinction between
the first two schemes and the last one is that A and B are based on DF sta-
tistics from all search engines, whereas scheme C is based only on statistics
from the selected search engines. All three schemes return statistics sufficient
for comparable scores, and the metasearch engine performs the fusion by
sorting the result documents in descending order of their similarity scores.
In [LC04b] a method for merging results in a hierarchical
peer-to-peer network was proposed. SESS is a cooperative algorithm that requires the
neighbors to provide summary statistics for each of their top-ranked docu-
ments. It is an extended version of Kirsch's algorithm [Kir97] and allows very
accurate normalized document scores to be determined before any document
is downloaded. However, its limitation is that a hierarchical system
is assumed.
3.3.3 Uncooperative environment
When the environment is uncooperative, or the search engines are inclined to
cheat, it is still possible to obtain a good approximation of the globally com-
puted scores. In [CLC95, Cal00, SJCO02] a merging strategy was proposed
that is based on both resource and document scores. The
database selection algorithm CORI assigns a score to each database that
reflects its ability to provide relevant documents to a query.
The local document score is then weighted by the database score and some
heuristically set constants. This method can work in a semi-cooperative en-
vironment, when the document scores are available, or in an uncooperative
setup with slightly degraded accuracy. A similar merging strategy was
proposed in [RAS01] with different formulas for the database rank estima-
tion; the final score is again the product of the database score and the local
score, computed in a heuristic way.
Figure 3.5: Statistics propagation for the collection fusion
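The CORI-style weighting described above can be sketched as follows. The constants 0.4 and 1.4 follow the commonly cited CORI merging heuristic; the function name, database scores, and document scores are illustrative.

```python
def cori_merge(results, db_scores):
    """Weight each local document score by its database's selection score.

    results:   {db: [(doc, local_score), ...]}
    db_scores: {db: normalized database selection score in [0, 1]}
    Uses the common CORI merging heuristic  (d + 0.4 * d * c) / 1.4.
    """
    merged = []
    for db, ranking in results.items():
        c = db_scores[db]
        for doc, d in ranking:
            merged.append((doc, (d + 0.4 * d * c) / 1.4))
    return sorted(merged, key=lambda x: -x[1])

results = {"A": [("a1", 0.9), ("a2", 0.5)], "B": [("b1", 0.8)]}
ranking = cori_merge(results, {"A": 1.0, "B": 0.2})  # a1, b1, a2
```

Note how b1's strong local score is dampened by its database's low selection score: documents from poorly ranked databases are pushed down without global statistics ever being exchanged.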
In [PFC+00] it was claimed that when database selection is employed,
it is not necessary to maintain collection-wide information such as the global
IDF; local information can be used to achieve superior performance.
This means that distributed systems can be engineered with more autonomy
and less cooperation. In contrast, in [LCC00] it was discovered that it is better
to organize the collections topically, and that for result merging, the top-
ically organized collections require the global IDF for the best performance;
normalized scores are not as good as the global IDF for merging when
the collections are topically organized. This ongoing polemic is a good indi-
cator that a comprehensive experimental evaluation of the collection fusion
methods is still open research.
3.3.4 Learning methods
Another merging strategy uses logistic regression for score transfor-
mation [CS00]. This method requires training queries for learning the model.
The presented experiments show that the logistic regression approach is sig-
nificantly better than the Round-Robin, raw-score, and normalized raw-score
approaches. In [SC03] a linear regression model was used for collection
fusion with a small overlap between collections. A centralized sample database
is assumed, which stores a certain number of documents from each resource.
The metasearch engine runs the query on the sample
database at the same time it is propagated to the search engines. Then the
central broker finds the duplicate documents between the results from the sample
database and from each resource. The document scores in
all results are normalized by linear regression analysis, with the document
scores from the sample database taken as a baseline. The experimental re-
sults showed that this method performs slightly better than CORI.
However, all learning-based approaches assume some kind of training set,
which is unaffordable in a highly dynamic environment: it is hard to maintain such
information for thousands of databases that freely join and leave the
system.
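The regression step of the sample-database method can be sketched as follows: a least-squares line is fitted over the (local score, sample score) pairs of the duplicate documents and then applied to the remaining local scores. All scores shown are illustrative.

```python
def fit_linear(pairs):
    """Least-squares fit y ≈ a * x + b over (local_score, sample_score)
    pairs collected from the duplicate documents."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Duplicates found between one resource's results and the sample database:
# (score assigned by the resource, score assigned by the sample database)
duplicates = [(0.9, 0.70), (0.6, 0.46), (0.3, 0.22)]
a, b = fit_linear(duplicates)
# Map the remaining local scores onto the sample database's score scale
normalized = [a * s + b for s in (0.8, 0.5)]
```

Because every resource is calibrated against the same sample database, the transformed scores of all resources live on one common scale and can be merged by simple sorting.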
3.3.5 Probabilistic methods
Several collection fusion methods were developed for the probabilistic re-
trieval model. The approach in [Bau99] was designed for the cooperative
environment: a probabilistic ranking algorithm is based on the statistics
exported by the search engines, and consistent merging in the metasearch
is achieved with respect to the probability ranking principle. Another ap-
proach with a probabilistic principle is investigated in [NF03]. This paper
explores a family of linear and logistic mapping functions for dif-
ferent retrieval methods. The retrieval quality of distributed retrieval is
only slightly improved by using the logistic function.
Language modeling for collection fusion [SJCO02] is another prob-
abilistic method, based on the same assumptions as Equation 2.1. The merg-
ing of results from different text databases is performed under a single
probabilistic retrieval model. This approach is designed primarily for
intranet environments, where it is reasonable to assume that the resource
providers are relatively homogeneous and can adopt the same kind of search
engine. The language model based merging approach is applied to inte-
grate the results. Compared with heuristic methods like the CORI algorithm,
this framework is better justified by probability theory.
3.4 Prior work on the data fusion
A formal definition of data fusion for distributed retrieval is given
here. Assume a set of identical document collections C associated with
different search engines and retrieval algorithms. With respect to a query
Q, each collection Ci contains the same set of relevant documents.
After Q is posed against Ci, the search engine returns a ranked list Ri of
documents Dij in decreasing order of their similarity Sij to the query. The
top-k result is a merged ranked list of length k containing the documents
Dij with the highest similarity values Sij in decreasing order. The data
fusion task is: given Q, C, and k, find from ⋃i Ri the top-k result Rd of
documents Ddj such that ∑(j=1..k) Sdj is maximized. In our setup, the
rankings from the different local search engines should be combined so that they
collect the most relevant documents from all rankings and put them into the
merged top-k result. See Figure 3.6.
Figure 3.6: Data fusion on a single search engine
3.4.1 Data fusion properties
Data fusion attempts to exploit the three effects described by Diamond [Dia96]. They
can occur during a combination of different rankings over a single document
collection [VC99b]:
• The skimming effect happens when the retrieval approaches represent
the documents differently and thus retrieve different relevant docu-
ments. A combination model that takes the highly ranked items from
every retrieval approach could outperform the effectiveness of any single
combined ranking.
• The chorus effect occurs when several retrieval approaches suggest that
an item is relevant to a query. This is used as evidence of the higher
relevance of that document.
• The dark horse effect occurs because a retrieval approach may produce
unusually accurate estimates of relevance for some documents, in com-
parison with the other retrieval approaches. A combination model may
exploit this fact by using the most accurate document score.
Obviously, the three effects are inversely correlated. For example,
if we pay more attention to the chorus effect, we decrease our chance of
benefiting from the dark horse effect. Finding the optimal tradeoff between these
three effects is essentially the data fusion, or retrieval expert combination,
task.
In some sense, the data fusion problem may be defined as a voting pro-
cedure in which a set of ranking algorithms selects the best k documents. The
most effective data fusion schemes are linear combinations of the similarity
scores produced by the different search engines, and the problem is to find the
optimal weights for such a combination. Two factors influence the performance
of any data fusion approach [Mon02]:
• Effective algorithms: each system participating in the fusion should
have an acceptable effectiveness, comparable with the others;
• Uncorrelated rankings: the rankings produced by the
different algorithms should be independent of each other.
Previous experiments confirmed that rankings which do not sat-
isfy these requirements reduce the quality of the fused ranking.
3.4.2 Basic methods
In [SF94] a number of combination techniques was proposed, including oper-
ators like Min, Max, CombSum, and CombMNZ. CombSum sets the score of
each document in the combination to the sum of the scores obtained from the
individual resources, while in CombMNZ the score of each document is ob-
tained by multiplying this sum by the number of resources that returned non-zero
scores for it. CombSum is equivalent to averaging, while CombMNZ is equiva-
lent to weighted averaging. In [Lee97] these methods were studied further
with six different search servers. The main contribution was to normalize
each information retrieval algorithm on a per-query basis, which substantially
improves the results. It was shown that the CombMNZ algorithm is the
best, followed by CombSum, while operators like Min and Max were
the worst. Three newer modifications of these algorithms, differing in the
weight estimation mechanism, can be found in [WC02]. Another method
[VC99a] is based on a linear combination of scores: the rele-
vance of a document to a query is computed by combining a score that
captures the quality of each resource with a score that captures the quality
of the document with respect to the query.
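CombSum and CombMNZ can be sketched in a few lines. The rankings here are assumed to be already normalized per query, as [Lee97] recommends, and all scores are illustrative.

```python
def comb_sum(rankings):
    """CombSum: score of a document = sum of its scores over all result lists."""
    scores = {}
    for ranking in rankings:
        for doc, s in ranking.items():
            scores[doc] = scores.get(doc, 0.0) + s
    return scores

def comb_mnz(rankings):
    """CombMNZ: CombSum multiplied by the number of lists that returned
    a non-zero score for the document (the chorus effect)."""
    summed = comb_sum(rankings)
    hits = {doc: sum(1 for r in rankings if r.get(doc, 0) > 0)
            for doc in summed}
    return {doc: summed[doc] * hits[doc] for doc in summed}

rankings = [{"d1": 0.8, "d2": 0.5}, {"d2": 0.6, "d3": 0.9}]
fused_sum = comb_sum(rankings)   # d2 accumulates evidence from both lists
fused_mnz = comb_mnz(rankings)   # ...and is boosted further by CombMNZ
```

In this toy example d2 overtakes d3 under CombMNZ although d3 has the single highest score, illustrating how the multiplier rewards agreement between rankings.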
3.4.3 Mixture methods
Some methods were developed for the mixture of the collection fusion and
data fusion problems. A simple but ineffective merging method is the Round-
Robin, which takes one document in turn from each of the available result
sets. The quality of such method depends on the performance of the compo-
nent search engines. Only if all the results have a similar effectiveness, then
the Round-Robin performs well, but if some result lists are irrelevant then
the whole result becomes poor. In [TVGJL95, VGJL94] was demonstrated
a way of improving the Round-Robin method. A probabilistic mechanism is
used to determine each of the documents in a merged list. It is based on the
length of returned document lists or the estimated usefulness of a database.
In particular, using a random experiment, one selects one of the contributing
ranked lists and then selects the top available element from that list and
places it in the next position in a result list. This procedure repeats until all
contributing lists are depleted. Later in [YR98] it was proposed a determinis-
tic version of this method. Two new techniques for the merging search results
are introduced in [CHT99, Cra01]: the feature distance ranking algorithm
and the reference statistics method. They are reasonably effective in the iso-
lated environment. It was shown that the feature distance algorithm is also
effective in the integrated environment. The problem of merging results by
exploiting document overlap was addressed in [WCG03]. This case lies
between the disjoint and the identical database settings. The task is to
merge the documents that appear in only one result list with those that
appear in several of them. New result merging algorithms are proposed that
exploit duplicate documents in two ways: one correlates the scores from
different result lists; the other treats duplicates as increased evidence of
relevance to a query.
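The probabilistic Round-Robin described above can be sketched as follows. This is a minimal illustration, not the Minerva code: the class name and the use of java.util collections are our own choices, and the list weights, uniform here, would in practice come from the list lengths or the estimated database usefulness.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Illustrative sketch of probabilistic Round-Robin merging: a random
// experiment picks a contributing list with probability proportional to
// its weight, takes that list's top available element, and appends it to
// the merged result until all lists are depleted.
public class ProbabilisticRoundRobin {

    public static List<String> merge(List<Deque<String>> lists,
                                     double[] weights, Random rnd) {
        List<String> merged = new ArrayList<>();
        while (true) {
            double total = 0.0;
            for (int i = 0; i < lists.size(); i++)
                if (!lists.get(i).isEmpty()) total += weights[i];
            if (total == 0.0) break;              // all lists depleted
            double draw = rnd.nextDouble() * total;
            for (int i = 0; i < lists.size(); i++) {
                if (lists.get(i).isEmpty()) continue;
                draw -= weights[i];
                if (draw <= 0.0) {                // list i was selected
                    merged.add(lists.get(i).pollFirst());
                    break;
                }
            }
        }
        return merged;
    }
}
```

The deterministic variant of [YR98] replaces the random draw with a fixed interleaving derived from the same weights.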
3.4.4 Metasearch approved methods
The result merging policies are also used by metasearch engines. They often
cannot obtain additional statistics from the individual search engines and
therefore use less effective fusion strategies. Most of their merging schemes
are based on the data fusion methods. The Metacrawler [SE97] is one of
the first metasearch engines developed for the Web. It uses the CombSum
data fusion method after eliminating duplicate URLs. In Profusion [GWG96],
another metasearch engine, all duplicate URLs are merged using the Max
function. The Inquirus metasearch engine [LG98] downloads the documents
and analyzes their content; a combination of similarity and proximity matches
is included in its ranking formula. Several merging methods based on the
similarity to result clusters were introduced in Mearf [OKK02]. The clusters
are obtained from the top-k document summaries provided by a search engine.
Whether the user clicked on a presented link was used as an implicit relevance
judgement for evaluation. The re-ranking is based on analyzing both the
contents and the links of the returned Web page summaries.
3.5 Summary
In this chapter, the common metasearch issues were described; in particular,
we elaborated on the result merging task. We provided details on the two
main problems of result merging: collection fusion and data fusion. Following
our taxonomy, we reviewed related work in both fields. This chapter
establishes a basis for the subsequent selection and evaluation of the result
merging methods.
Chapter 4
Selected result merging
strategies
This chapter contains the detailed descriptions of the result merging
strategies selected for the evaluation in the Minerva system. In Section 4.1
we investigate the properties of the available result merging methods and
define their target values. We describe the score normalization techniques:
with the global IDF values in Section 4.2, with the ICF values in Section
4.3, the CORI merging in Section 4.4, the language modeling-based merging
in Section 4.5, and with the TF values in Section 4.6.
4.1 Target properties for result merging methods
A set of properties specific to the Minerva system imposes restrictions on the
result merging methods. We identified the most distinctive environment
properties and summarized them in Table 4.1. All properties are graded as
desirable “++”, acceptable “+−”, or undesirable “−−”.
1. Document overlap (Overlapping: ++, Disjoint: +−). The effectiveness of
the methods differs on disjoint and on overlapping collections of documents.
In a P2P environment we assume some overlap between the collections on
the peers.

2. Inputs (Scores: ++, Ranks: −−). Some methods use only the ranks of the
documents to perform merging, while others use the similarity scores and
additional information about the collections. In general, the score-based
result merging methods work better than the rank-based methods when
additional information about the collection is available.

3. Database selection (Used: ++, Not used: +−). The result merging methods
are often effectively combined with the database selection step. They include
information about the database and cannot be performed efficiently without
a particular database selection method. Some methods are also sensitive to
differences between the information used in the database selection step and
in the result merging step.

4. Training data (Used: −−, Not used: ++). The most effective results can
be achieved with models learned from the current data, but learning methods
imply a relatively static environment with a limited number of nodes.

5. Scalability (High: ++, Low: −−). It was discovered that some particularly
good methods perform poorly with a large number of queried databases or
an increasing number of top-k results.

6. Content distribution (Skewed: ++, Uniform: +−). Another feature of a
merging method is its ability to deal with different types of document
distributions across the databases. A uniform distribution assumes that all
collections have similar proportions of the documents relevant to a query.
The collections in Minerva are topic-oriented, which is traditionally a
difficult testbed for the merging algorithms.

7. Integration (Cooperative: ++, Non-cooperative: +−). The result merging
techniques are designed with a certain degree of search engine cooperation
in mind. A search engine may provide us with all necessary statistics about
its database, or we have to obtain them with additional effort or discard
some parameters.

Table 4.1: The target properties of the result merging methods
4.2 Score normalization with global IDF
Several methods based on globally computed IDF values were proposed
in [CLC95, VF95, Kir97]; they differ in the particular algorithms for
collecting the necessary statistics. The rationale is that, in the case of a
disjoint partitioning of the documents, this method is expected to be the
most effective and equal to a centralized search engine using the TF · IDF
scoring function. With globally computed IDF values, we eliminate the
differences in statistics estimation for the scores from different databases.
Since our environment is cooperative, we can collect the required statistics
by posting the local DF_i values and the number of documents |C_i| in each
collection into a global directory; the global IDF (GIDF) values and the
scores are computed as follows:

GIDF_k = \log \left( \frac{\sum_{i=1}^{|C|} |C_i|}{\sum_{i=1}^{|C|} DF_{ik}} \right)    (4.1)

s = \sum_{k=1}^{|q|} TF_{ijk} \cdot GIDF_k    (4.2)

Where:
s — the similarity score for a query and a document under the vector space
model.
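As a sketch, Equations 4.1 and 4.2 can be computed directly from the statistics that the peers post to the directory. The class and method names below are illustrative, not the Minerva implementation; we assume the per-term local DF values and the collection sizes have already been collected.

```java
// Illustrative sketch of score normalization with a globally computed IDF:
// GIDF aggregates the posted per-collection statistics (Eq. 4.1), and the
// document score sums TF times GIDF over the query terms (Eq. 4.2).
public class GlobalIdfScoring {

    // Eq. 4.1: GIDF = log( sum of collection sizes / sum of local DFs )
    public static double gidf(long[] collectionSizes, long[] localDf) {
        long totalDocs = 0, totalDf = 0;
        for (long size : collectionSizes) totalDocs += size;
        for (long df : localDf) totalDf += df;
        return Math.log((double) totalDocs / totalDf);
    }

    // Eq. 4.2: s = sum over query terms of TF_ijk * GIDF_k
    public static double score(double[] tf, double[] gidfPerTerm) {
        double s = 0.0;
        for (int k = 0; k < tf.length; k++) s += tf[k] * gidfPerTerm[k];
        return s;
    }
}
```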
If the GIDF values are computed over every disjoint collection, this score
in the distributed environment should be exactly the same as for a single
database containing all documents D_ij. However, in practice, overlap
between the documents in different collections will affect the GIDF values.
If a document was crawled and indexed by several databases, then all terms
in this document will have higher DF values than they should. In theory,
we could account for this by providing a mechanism for finding duplicate
documents among all peers. However, in practice, such an effort to eliminate
the skew in the scores is unaffordable in a distributed system. A score
correction with respect to overlap would cost too much and would probably
give a negligible improvement. We assume that, on a very large collection,
the overlap affects all terms to approximately the same degree.
Another assumption is that such skewed term GIDF values may reflect
some latent tendencies. For example, the terms in the highly replicated
documents may be deemed less important than terms that are not so popular,
because the replicated documents may be easily found due to their wide
dissemination.
A problem may also occur if the GIDF′ score is computed only over the
subset of collections C′ that were chosen for a query by the database selection
algorithm. This corresponds to the non-distributed case where only the
documents from the selected databases are placed into the single collection:

GIDF'_k = \log \left( \frac{\sum_{i=1}^{|C'|} |C_i|}{\sum_{i=1}^{|C'|} DF_{ik}} \right)    (4.3)

s = \sum_{k=1}^{|q|} TF_{ijk} \cdot GIDF'_k    (4.4)

The effectiveness of such GIDF′ values will differ from that of the fully
computed GIDF, but they may still perform well. We want to investigate
how the database selection and the GIDF′ estimation influence the retrieval
effectiveness.
4.3 Score normalization with ICF
In distributed information retrieval, a measure emerged that does not
exist in traditional information retrieval: the inverted collection frequency,
or ICF:

ICF_k = \log \left( \frac{|C|}{CF_k} \right)    (4.5)

Where:
CF_k — the collection frequency, equal to the number of collections in which
the term occurs at least once;
|C| — the overall number of collections.

ICF is analogous to the IDF measure but one level higher: instead of the
notion of a document, we use the notion of a collection. It can replace the
IDF part in the score computation, since a term that occurs in many
collections is deemed less important than a rare one. The ICF measure is
fair for all collections and can be used in a scoring function:

s = \sum_{k=1}^{|q|} TF_{ijk} \cdot ICF_k    (4.6)

The advantage of this measure is that it is easy to compute: we only need
to know whether a term occurs in each collection and the number of nodes
in the system. But this approximation may perform worse than GIDF
because it gives a more “averaged” view of term importance.
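Equations 4.5 and 4.6 can be sketched as follows; the names are illustrative, and only the number of collections and the per-term collection frequency from the directory are assumed to be known.

```java
// Illustrative sketch of score normalization with the inverted collection
// frequency: ICF = log(|C| / CF) (Eq. 4.5), used in place of IDF (Eq. 4.6).
public class IcfScoring {

    // Eq. 4.5: numCollections = |C|, collectionFrequency = CF_k
    public static double icf(int numCollections, int collectionFrequency) {
        return Math.log((double) numCollections / collectionFrequency);
    }

    // Eq. 4.6: s = sum over query terms of TF_ijk * ICF_k
    public static double score(double[] tf, double[] icfPerTerm) {
        double s = 0.0;
        for (int k = 0; k < tf.length; k++) s += tf[k] * icfPerTerm[k];
        return s;
    }
}
```

With 50 peers, a term present on 5 of them gets ICF = log(10), while a term present on every peer gets ICF = 0 and contributes nothing to the score.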
4.4 Score normalization with CORI
The CORI method for result merging was proposed in [CLC95] and
extensively tested and improved in [Cal00, LCC00, SJCO02, LC04b, SC03].
It is a de-facto standard for the result merging problem. This approach
heuristically combines the local scores with the server rank obtained during
the server selection step. In our general analysis, it represents, as the most
effective one, all other heuristic approaches of this kind. The normalized
score suitable for merging is calculated in several steps.

Assume a query q of m terms is posed against database C_i. First, the
database selection step is performed and the rank r_{ik} of each database
for one term is computed as follows [Cal00]:

r_{ik} = b + (1 - b) \cdot T \cdot I    (4.7)

T = \frac{DF_i}{DF_i + 50 + 150 \cdot cw_i / \overline{cw}}    (4.8)

I = \frac{\log \left( \frac{|C| + 0.5}{CF} \right)}{\log(|C| + 1.0)}    (4.9)

Where:
cw_i — the vocabulary size of C_i;
\overline{cw} — the average vocabulary size over all C_i;
b, the “default belief” — a heuristically set constant, usually 0.4.
The final database rank Ri is computed as follows:
R_i = \sum_{k=1}^{|q|} r_{ik}    (4.10)
After all databases are ranked and a number of them are selected for the
query execution, the local document scores on every database are computed
and preliminarily normalized:

s^{norm}_{ijk} = \frac{s^{local}_{ijk} - s^{min}_{ik}}{s^{max}_{ik} - s^{min}_{ik}}    (4.11)

s^{min}_{ik} = \min_j(TF_{ijk}) \cdot IDF_{ik}    (4.12)

s^{max}_{ik} = \max_j(TF_{ijk}) \cdot IDF_{ik}    (4.13)

Where:
s^{norm}_{ijk} — the preliminarily normalized local score;
s^{local}_{ijk} — the locally computed TF · IDF score for the k-th term in D_ij;
s^{min}_{ik} — the minimum possible term score among all D_ij in database C_i;
s^{max}_{ik} — the maximum possible term score among all D_ij in database C_i.
The preliminarily normalized scores s^{norm}_{ijk} should reduce the statistics
differences caused by the different local IDF values. However, for effective
merging the database rank is also normalized, so that low-ranked databases
still have an opportunity to contribute documents to the final ranking. With
respect to the maximum and minimum values that the algorithm can
potentially assign to a database, the rank is normalized as follows:

R^{norm}_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}    (4.14)

Where:
R_{min} — the database rank estimated with component T set to zero;
R_{max} — the database rank estimated with component T set to one.
The globally normalized score s is composed from the locally normalized
score s^{norm}_{ijk} and the normalized database rank R^{norm}_i in a
heuristic way:

s_{ijk} = \frac{s^{norm}_{ijk} + 0.4 \cdot s^{norm}_{ijk} \cdot R^{norm}_i}{1.4}    (4.15)

s = \sum_{k=1}^{|q|} s_{ijk}    (4.16)

The first version of the CORI method did not use the intermediate
normalization steps for the rank and the score; they were added later for
better accuracy. This method is superior in an uncooperative environment,
but it is also competitive in the cooperative collections case.
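The normalization pipeline of Equations 4.11 to 4.15 can be sketched as follows; this is a minimal illustration with invented names, assuming the score extremes and the rank extremes are already known.

```java
// Illustrative sketch of the CORI merging normalization: a min-max
// normalized local score (Eq. 4.11) is heuristically combined with the
// min-max normalized database rank (Eqs. 4.14 and 4.15).
public class CoriMerging {

    // Eq. 4.11 and Eq. 4.14 share the same generic min-max normalization
    public static double minMax(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    // Eq. 4.15: s_ijk = (s_norm + 0.4 * s_norm * R_norm) / 1.4
    public static double combine(double sNorm, double rNorm) {
        return (sNorm + 0.4 * sNorm * rNorm) / 1.4;
    }
}
```

A document from the top-ranked database (R^{norm} = 1) keeps its normalized score unchanged, while one from the lowest-ranked database (R^{norm} = 0) is damped by the factor 1/1.4.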
4.5 Score normalization with language modeling
The language modeling approach [PC98] to information retrieval estimates
the probability that a query was generated from the language model
underlying a particular document. We suppose that each document is a
sample generated from a particular language model. The relevance of a
document to a particular query is estimated by how probable it is that the
query was generated from the language model of that document. The
language modeling local scoring function on every peer is similar to
Equation 2.1 and is defined as the likelihood of a query q being generated
from a document D_ij [SJCO02]:

P(q|D_{ij}) = \prod_{k=1}^{|q|} \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|C_i) \right)    (4.17)
Where:
t_k — a query term;
P(t_k|D_ij) — the probability of the query term t_k appearing in the
document D_ij;
P(t_k|C_i) — the probability of the term t_k appearing in the collection
C_i to which document D_ij belongs;
λ — a weighting parameter between zero and one.
The role of the term P(t_k|C_i) is to smooth the probability of the
document D_ij generating the query term t_k, especially when that
probability is zero. The idea of smoothing is similar to the TF · IDF term
weighting scheme used in the vector space model, where “popular” words
are discouraged and “rare” words are emphasized by the IDF component.
Just as the IDF value is collection-dependent in TF · IDF scoring, the
P(t_k|C_i) component is collection-dependent in language modeling. We
replace C_i with a global collection model G estimated over all available
peers. The formula with P(t_k|G) looks as follows:

P(q|D_{ij}) = \prod_{k=1}^{|q|} \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G) \right)    (4.18)

P(t_k|D_{ij}) = \frac{TF_{ijk}}{|D_{ij}|}    (4.19)

P(t_k|G) = \frac{\sum_{i=1}^{|C|} \sum_{j=1}^{|D_i|} TF_{ijk}}{\sum_{i=1}^{|C|} \sum_{j=1}^{|D_i|} \sum_{l=1}^{|cw_i|} TF_{ijl}}    (4.20)
This language-modeling-based score is collection-independent and fair for
all documents on all peers. The necessary information, such as the sums of
the TF values over the document collections, is posted into a distributed
directory. A sum of term probabilities is more convenient than a
multiplication; that is why an order-preserving logarithmic transformation
is applied to P(t_k|D_ij):

s = \sum_{k=1}^{|q|} \log \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G) \right)    (4.21)
We also investigate the effect of partially available information, when only
the selected databases contribute to G′ and the scores are computed as:

s = \sum_{k=1}^{|q|} \log \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G') \right)    (4.22)

Both the full and the partial score types are used for the result merging and
are expected to be reasonably effective.
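Equation 4.21 translates directly into code; this is an illustrative sketch with invented names, assuming the per-term probabilities P(t_k|D_ij) and P(t_k|G) are precomputed from the TF sums in the directory.

```java
// Illustrative sketch of the smoothed language modeling score (Eq. 4.21):
// s = sum over query terms of log( lambda*P(t|D) + (1-lambda)*P(t|G) ).
// The global component keeps the logarithm finite even when a query term
// never occurs in the document (P(t|D) = 0).
public class LmScoring {

    public static double score(double[] pTermDoc, double[] pTermGlobal,
                               double lambda) {
        double s = 0.0;
        for (int k = 0; k < pTermDoc.length; k++)
            s += Math.log(lambda * pTermDoc[k]
                          + (1.0 - lambda) * pTermGlobal[k]);
        return s;
    }
}
```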
4.6 Score normalization with raw TF scores
For short popular queries, TF-based scoring is a simple but still
reasonably good solution:

s = \sum_{k=1}^{|q|} TF_{ijk}    (4.23)

Result merging experiments often include fusion by raw TF values as an
indicator of the importance of the IDF component for the current query. If
the importance values associated with the different query terms are all more
or less the same, this method shows a very competitive result. We have also
evaluated it in the Minerva system.
4.7 Summary
In this chapter, we provided descriptions of the result merging strategies
selected for evaluation in the Minerva system. In Section 4.1 we investigated
the properties of the available result merging methods and defined their
target values. We described the score normalization techniques: with the
global IDF values in Section 4.2, with the ICF values in Section 4.3, the
CORI merging in Section 4.4, the language modeling-based merging in
Section 4.5, and with the TF values in Section 4.6.
Chapter 5
Our approach
In this chapter we present our approach for combining the result
merging with a preference-based language model. The latter is obtained
with pseudo-relevance feedback on the best peer in the database ranking.
5.1 Result merging with the preference-based
language model
An important source of user-specific information is the user's collection
of documents. In the Minerva system, the Web pages are crawled with
respect to the user's bookmarks and are therefore assumed to reflect some of
his specific interests. We can exploit this fact by using pseudo-relevance
feedback to derive a preference-based language model from the most relevant
database. The description of our approach is presented below.

When a user poses a topic-oriented query Q, we first collect the necessary
statistics and build a peer ranking P^Q. The probability distribution for the
whole set of documents G from the peers in P^Q is estimated. Then Q is
executed on the best database in the ranking, P^Q_1. According to our query
routing strategy, this database should have more relevant documents than
any other database; therefore, it is the best choice for estimating our
preference-based language model. The concatenation of the top-n results
from the best peer represents a user-specific preference set U, from which we
estimate the preference-based language model. This model is a mixture with
the general language model and is composed of two parts:

P(t_k|U) = \lambda \cdot P_{ML}(t_k|U) + (1 - \lambda) \cdot P_{ML}(t_k|G)    (5.1)
Where:
P_{ML}(t_k|U) — the maximum likelihood estimate of term t_k in the top-n
results from P^Q_1;
P_{ML}(t_k|G) — the maximum likelihood estimate of term t_k across all
selected peers P^Q;
λ — an empirically set smoothing parameter.
Equation 5.1 is based on the Jelinek-Mercer smoothing. P_{ML}(t_k|G)
and P_{ML}(t_k|U) are defined as:

P_{ML}(t_k|G) = \frac{\sum_{i=1}^{|P^Q|} \sum_{j=1}^{|D_i|} TF_{ijk}}{\sum_{i=1}^{|P^Q|} \sum_{j=1}^{|D_i|} \sum_{l=1}^{|pw_i|} TF_{ijl}}    (5.2)

P_{ML}(t_k|U) = \frac{\sum_{j=1}^{top\text{-}n} TF_{ijk}}{\sum_{j=1}^{top\text{-}n} |D_{ij}|}    (5.3)

Where:
D_ij — a document j on peer P^Q_i;
TF_ijk — the term frequency of the term t_k in document D_ij;
pw_i — the vocabulary size on P^Q_i.
When both the P_{ML}(t_k|U) and P_{ML}(t_k|G) components are obtained,
we apply an adapted version of the EM algorithm from [TZ04] to compute
P(t_k|U). Pavel Serdyukov implemented this algorithm for the Minerva
project [Ser05]. The probabilities P(Q|G) and P(Q|U) and the query Q are
sent to every peer P^Q_i in the ranking. We compute the similarity scores
for the result merging in three steps. First, the globally normalized
similarity score s^{LMgn} is computed with Equation 5.4. Then, the
preference-based similarity scores are computed with the cross-entropy
function (Equation 5.5). The documents with a higher dissimilarity between
the preference-based and the document language models get a lower score.
Both the s^{LMgn} and s^{LMpb} scores are combined as in Equation 5.6
with the empirically set parameter β, which lies in the interval from zero
to one.
Algorithm for our approach
1. query Q is posed
2. statistics S^Q for the terms in Q is collected from the peers
3. a peer ranking P^Q is created for Q
4. the probability P(Q|G) is estimated for the whole set of documents G
on the peers in P^Q
5. Q is executed on P^Q_1, the top-n result documents are concatenated
into a set U, and the probability P(Q|U) is estimated
6. Q with P(Q|G) and P(Q|U) is propagated to every P^Q_i
7. for each document D_ij on each P^Q_i:
7.1 the globally normalized similarity score s^{LMgn}_k is computed:

s^{LMgn}_k = \log \left( \lambda \cdot P(t_k|D_{ij}) + (1 - \lambda) \cdot P(t_k|G) \right)    (5.4)

7.2 the preference-based similarity score s^{LMpb}_k is computed:

s^{LMpb}_k = -P(t_k|U) \cdot \log(P(t_k|D_{ij}))    (5.5)

7.3 both scores are combined into the result merging score s^{LMrm}:

s^{LMrm} = \sum_{k=1}^{|Q|} \left( \beta \cdot s^{LMgn}_k + (1 - \beta) \cdot s^{LMpb}_k \right)    (5.6)

8. the top-k URLs with the highest s^{LMrm} scores are returned from
each P^Q_i
9. the returned results are sorted in descending order of s^{LMrm} and the
best top-k URLs are presented to the user
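Steps 7.1 to 7.3 of the algorithm above can be sketched as follows; this is an illustrative fragment with invented names, assuming the probabilities P(t_k|D_ij), P(t_k|G), and P(t_k|U) are already available on the peer.

```java
// Illustrative sketch of steps 7.1-7.3: the globally normalized LM score
// (Eq. 5.4) and the cross-entropy preference score (Eq. 5.5) are combined
// per query term with the weight beta (Eq. 5.6).
public class PreferenceMerging {

    public static double mergeScore(double[] pDoc, double[] pGlobal,
                                    double[] pPref,
                                    double lambda, double beta) {
        double s = 0.0;
        for (int k = 0; k < pDoc.length; k++) {
            double sGn = Math.log(lambda * pDoc[k]
                                  + (1.0 - lambda) * pGlobal[k]);  // Eq. 5.4
            double sPb = -pPref[k] * Math.log(pDoc[k]);            // Eq. 5.5
            s += beta * sGn + (1.0 - beta) * sPb;                  // Eq. 5.6
        }
        return s;
    }
}
```

With beta = 1 the score reduces to the plain language modeling merging of Equation 4.21; with beta = 0 only the preference-based cross-entropy ranking remains.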
5.2 Discussion
The merging with the preference-based language model is close to the
query re-weighting technique described in [ZL01]. That approach tries
to refine the initial estimate of the query language model with additional
pseudo-relevant documents. Our approach is designed for the distributed
setup. While executing the query on one peer, we select the top-n results
for our preference-based model from another peer, which was ranked best
by the database selection algorithm. One user with a highly specialized
collection of documents implicitly helps another user to refine the final
document ranking. The estimation of the preference-based set U is performed
by analogy with the cluster-based retrieval approach from [LC04a]: the
preference set is treated as a cluster of relevant documents.
A simple intuition for using the preference-based language model is the
following. Assume that the user is mostly interested in documents written
with the same specific subset of the general language in mind as our best
top-n results. Say we are looking for specific medical information and some
peer has many documents from the medical Internet domain. After executing
the query “symptoms of diabetes” on this peer, we infer the preference-based
set of documents U from the top-n results. This model has a term
distribution that is typical for medical articles. Now we want to find the
documents that were generated with the language model of U in mind; we
treat it like a “relevance model”.
The proposed scheme for combining the result merging ranking with the
preference-based model ranking has limited potential. Both fused rankings
are correlated, since both are term-frequency based. This constraint is
typical for information retrieval in general. The most prominent gain could
come from additional independent features like PageRank or the explicit
structure of the text. The important question of the appropriate size of the
top-n set should be answered empirically.
5.3 Summary
In this chapter we presented our method for combining the result merging
with the preference-based language model. This approach exploits
pseudo-relevance feedback on the best peer in the database ranking to build
a preference document set. We provided the method's details and discussed
its properties.
Chapter 6
Implementation
In this chapter, we provide a brief description of the implementation of the
merging methods in the Minerva system. We include a short summary of
several essential classes from the result merging package.
6.1 Global statistics classes
The Minerva system and all merging methods in it are implemented in
Java 1.4.2. The document databases associated with the peers are maintained
on Oracle 9.2.0. In Figure 6.1, we present a diagram of the main
implemented classes.
For collecting statistics across many peers, three main classes are used:
• RMICFscoring;
• RMGIDFscoring;
• RMGLMscoring.
The RMICFscoring class constructor takes SimpleGlobalQueryProcessor and
RMQuery input objects and computes the ICF value for each query term.
The calculated quantities are put into the termICF hash and are accessible
with the getTermICF(String term) method. In the same manner, the
RMGIDFscoring and RMGLMscoring classes produce the global GIDF
values and the global language model, respectively.
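The precompute-then-look-up pattern of these classes can be sketched as follows. This is a hypothetical, simplified fragment in the spirit of RMICFscoring, not the actual Minerva code: the real class takes query-processor objects, while here the collection frequencies are passed in directly, and the class name and default value are our own assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a term-statistics class: the constructor
// precomputes one value per query term, and a getter exposes it,
// mirroring the termICF hash / getTermICF(String) design.
public class TermIcfCache {
    private final Map<String, Double> termICF = new HashMap<>();

    // collectionFreq maps each query term to the number of collections
    // containing it; numCollections is the total number of peers.
    public TermIcfCache(Map<String, Integer> collectionFreq,
                        int numCollections) {
        for (Map.Entry<String, Integer> e : collectionFreq.entrySet())
            termICF.put(e.getKey(),
                        Math.log((double) numCollections / e.getValue()));
    }

    public double getTermICF(String term) {
        // terms absent from the directory get a neutral value of 0.0
        return termICF.getOrDefault(term, 0.0);
    }
}
```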
Figure 6.1: Main classes involved in merging
Three objects of the classes described above are wrapped into one
RMGlobalStatistics object. The RMQuery class represents the query, and the
RMGlobalStatistics object is placed inside it. Therefore, all global statistics
required for the query execution are propagated with the query. When the
RMTopKLocalQueryProcessor executes the query, the RMSwitcherTFIDF
class is invoked to re-weight the result scores. From RMQuery, the switcher
takes the name of the fusion method and the necessary global statistics, and
it returns the new scores.
The experiments with the limited global statistics are performed with
the RMGIDF10peersScoring and RMGLM10peersScoring classes. For the
experiments with the top-k language model, we reused the RMGLMscoring
class.
6.2 Testing components
A simplified view of the general components of the experiments
implementation is shown in Figure 6.2. We skip the description of many
native classes of the Minerva project as unimportant for the merging problem
and mention only several specific classes that are helpful for a general
overview of our experiments. The selected classes of the three main
components are highlighted with colored borders. The order of execution is
the following:

Start peers (Green) → Test methods (Blue) → Evaluate results (Red).

The “Green box” is executed from the RMDotGov50peersStart class. It
starts 50 peers; each of them is represented by a Peer object and associated
with an instance of the RMTopKLocalQueryProcessor class. The latter is
connected to the Oracle database server. The SocketBasedPeerDescriptor
object allows communicating with the peer through the network. Every
query received by a peer is executed with the local query processor, and the
returned results are wrapped into a QueryResult object.
Figure 6.2: A general view of the experiments implementation

Next, the “Blue box” is executed from the RMDotGovExperiments50peers
class. The RMPeersRanking class is intended to build the database rankings
RANDOM, CORI, and IDEAL. The first one is constructed inside the
RMPeersRanking object, while the two others are obtained with the
RMCORI2 and RMIdealRankingReader objects. The queries are wrapped
into RMQuery objects. They are read from files with RMQueries, and the
necessary global statistics are added with the classes from Figure 6.1. The
SocketBasedPeerDescriptor objects are created for communication with the
already running peers from the “Green box”. The query results from
different peers are merged into a QueryResultList object.
The last component is the “Red box”. It is started from the
RMDotGovRelevanceEvaluation10of50peers class. Its goal is to compute the
precision and recall measures for the merged lists. The query results are
taken from the text files produced by the QueryResultList class in the
“Blue” component.
Chapter 7
Experiments
This chapter contains a detailed description of our experiments with
the result merging strategies selected for evaluation in the Minerva system.
In Section 7.1, the system configuration and the dataset parameters are
described. Section 7.2 contains the experiments with the existing result
merging methods, followed by a discussion of the results. In Section 7.3, we
present the results of the experiments with our approach with a
preference-based language model.
7.1 Experimental setup
7.1.1 Collections and queries
The previous experiments provided different pros and cons for the result
merging algorithms. We want to examine the existing methods once again,
because the following combination of features was not tested in the known
experiments:
• Minerva works with real Web data;
• There is an overlap between the documents on different peers;
• Collections are topically organized;
• Database selection algorithm is executed before the result merging step;
• Queries are topically oriented.
We conducted new experiments with 50 databases, which were created
from the TREC 2002, 2003, and 2004 Web Track datasets from the “.GOV”
domain. For these three volumes, four topics were selected. The relevant
documents from each topic were taken as a training set for the classification
algorithm, and 50 collections were created. The non-classified documents
were randomly distributed among all databases. Each classified document
was assigned to two collections of the same topic. For example, for the
topic “American music” we have a subset of 15 small collections, and every
classified document is replicated twice within it. The topics with the numbers
of corresponding collections are summarized in Table 7.1; each collection
is placed on one peer.
N Topic Number of collections
1 Health and medicine 15
2 Nature and ecology 10
3 Historic preservation 10
4 American music 15
Table 7.1: Topic-oriented experimental collections
Assuming that the search is topic-oriented, we selected a set of 25 out of
the 100 title queries from the topic distillation tasks of the TREC 2002 and
2003 Web Tracks. We used the relevance judgements that are available on
the NIST site (http://trec.nist.gov). The queries were selected with respect
to two requirements:
• at least 10 relevant documents exist;
• query is related to the “Health and Medicine” or “Nature and Ecology”
topics.
The full table of the selected queries is presented in Appendix A.
The database selection algorithm chooses 10 peers out of 50 for each
query. To simulate the merging retrieval algorithm, we obtain 500 documents
on every peer, with local scores computed as:

s^{local} = \sum_{k=1}^{|Q|} \frac{TF_k}{|D_{ij}|} \cdot \log \frac{|C|}{DF_{ik}}    (7.1)
Then we recompute the scores of these documents with the current merging
method. The top-30 documents with the best merging scores are sent to
the peer that issued the query. Then all 10 sets of 30 documents are combined
into one ranking in descending order of their similarity scores. The top-30
documents from this ranking are evaluated.
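The local score of Equation 7.1 can be sketched as follows; the names are illustrative, and the peer is assumed to know its collection size |C| and the per-term DF values.

```java
// Illustrative sketch of the local retrieval score (Eq. 7.1):
// s_local = sum over query terms of (TF_k / |D_ij|) * log(|C| / DF_ik),
// i.e. a length-normalized TF weighted by a local IDF.
public class LocalScoring {

    public static double localScore(double[] tf, double docLength,
                                    double[] df, double numDocs) {
        double s = 0.0;
        for (int k = 0; k < tf.length; k++)
            s += (tf[k] / docLength) * Math.log(numDocs / df[k]);
        return s;
    }
}
```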
7.1.2 Database selection algorithm
The database selection step adds an important dimension to the result
merging experiments. Only the 10 selected databases participate in a query
execution; therefore, the effectiveness of the query routing algorithm
influences the quality of the result. CORI merging explicitly uses the results
of the database selection step for the merging. We evaluated the result
merging methods under the following database rankings:
methods under the following database rankings:
• RANDOM;
• CORI;
• IDEAL.
RANDOM ranking is the “weakest” algorithm; it just randomly selects
10 peers without using any information about their collection statistics. It
is the lower bound for the effectiveness of the database selection algorithm.
The CORI database selection algorithm [Cal00] is described in Chapter 4.
The IDEAL ranking is a manually created ranking, where the collections
are sorted in descending order of the number of documents relevant to the
query.
7.1.3 Evaluation metrics
For the evaluation, we utilized the framework from [SJCO02]. For all tested
algorithms, the average precision measure is computed over the 25 queries at
the levels of the top-5, 10, 15, 20, 25, and 30 documents. For example, using
the relevance judgements for the topic distillation tasks of the TREC 2002
and 2003 Web Tracks, we compute the precision at the level of the top-5
documents separately for each query. Then we average the precision value
over all 25 queries. Most users of Web search engines do not look beyond
the first 10 or 20 results; therefore, the difference in the effectiveness of the
algorithms after the top-30 results is not significant. When we compute
the precision at these fixed levels, the micro- and macro-average precision
measures are equal.
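The macro-averaged precision at a fixed cutoff can be sketched as follows; the names are illustrative, and each query's merged result list is reduced to a boolean relevance vector derived from the TREC judgements.

```java
// Illustrative sketch of macro-averaged precision at cutoff k:
// per-query precision@k is computed first, then averaged with equal
// weight per query (macro-averaging).
public class MacroPrecision {

    // relevant[r] is true if the document at rank r (0-based) is relevant
    public static double precisionAtK(boolean[] relevant, int k) {
        int hits = 0;
        for (int r = 0; r < k && r < relevant.length; r++)
            if (relevant[r]) hits++;
        return (double) hits / k;
    }

    public static double macroAverage(boolean[][] perQuery, int k) {
        double sum = 0.0;
        for (boolean[] q : perQuery) sum += precisionAtK(q, k);
        return sum / perQuery.length;
    }
}
```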
We also exploited another baseline: the effectiveness on the single
database. The single database contains all the documents from the 50 peers
and uses two retrieval algorithms:
• TF · IDF
• Language modeling
We included the two term weighting functions because the merging with
language modeling should be compared with a language modeling baseline.
This is fair since, in general, language modeling is more effective than the
TF · IDF-based term weighting schemes. For both baselines, the
collection-dependent components IDF and P(t_k|C) are computed on the
single database. We will use the notion of the single database in the sense
of a “united database” in the rest of the thesis.
7.2 Experiments with selected result merging methods
7.2.1 Result merging methods
For the experiments, we used the six methods described in Chapter 4.
As a lower bound, we took the merging by local TF · IDF values, which
is expected to be the most ineffective merging algorithm. We also used
the two single-database retrieval algorithms as an upper bound. The
language modeling scores for the result merging were tested with different
values of the λ parameter. It was found that the value 0.4 gives the most
stable results; however, all other values are almost equally effective. Instead
of the simple term frequency component TF, in all methods we used the
normalized term frequency TF^{norm}:

TF^{norm}_{ijk} = \frac{TF_{ijk}}{|D_{ij}|}    (7.2)

Where:
TF_ijk — the term frequency, i.e., the number of occurrences of a term in
the document;
|D_ij| — the document length in terms.
To keep the notation simple, we continue using TF in the text instead
of TF^{norm}. The CORI method was tested in two variations: with an
additional normalization by the maximum TF value and without it. The
second variant was consistently better, so we kept only that one. Finally,
the result merging methods and baselines were coded as follows:
• TF — merging by raw TF scores;
• TFIDF — lower bound, merging by TF · IDF scores with local IDF ;
• TFGIDF — merging by TF ·GIDF scores with global GIDF ;
• TFICF — merging by TF · ICF scores;
• CORI — merging by CORI method;
• LM04 — merging with global language model and λ = 0.4;
• SingleTFIDF — single database baseline with the TF · IDF scores;
• SingleLM — single database baseline with single collection language
model.
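The per-term scores behind these codes can be sketched as follows. This is an illustrative sketch, not the thesis implementation (which was written in Java with Oracle); all function names and signatures are hypothetical, and the collection statistics are passed in as plain numbers.

```python
import math

def tf_norm(tf: int, doc_len: int) -> float:
    """Normalized term frequency TFnorm = TF / |D| (Equation 7.2)."""
    return tf / doc_len

def score_tf(tf, doc_len):
    # TF: merging by raw (length-normalized) term frequency scores.
    return tf_norm(tf, doc_len)

def score_tf_idf(tf, doc_len, n_docs_local, df_local):
    # TFIDF: IDF computed locally from the peer's own collection.
    return tf_norm(tf, doc_len) * math.log(n_docs_local / df_local)

def score_tf_icf(tf, doc_len, n_peers, cf):
    # TFICF: inverse collection frequency; cf = number of peers whose
    # collection contains the term.
    return tf_norm(tf, doc_len) * math.log(n_peers / cf)

def score_lm(tf, doc_len, p_global, lam=0.4):
    # LM04: document model smoothed with the global language model
    # (Equation 7.3 below), with lambda = 0.4.
    return math.log(lam * tf_norm(tf, doc_len) + (1 - lam) * p_global)
```

A document's merging score is then the sum of its per-term scores over the query terms.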
7.2.2 Merging results
On Figures 7.1-7.6 we summarize the results from the result merging exper-
iments with six methods and two baselines over three ranking algorithms.
For each ranking algorithm, we placed the average precision and recall plots.
Both measures are macro-averaged.
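Macro-averaging computes precision and recall at a cutoff k per query and then averages over queries. A minimal sketch of this evaluation measure (illustrative names, hypothetical document ids):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of all relevant documents found in the top-k.
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def macro_average(per_query_values):
    # Average the per-query values with equal weight for every query.
    return sum(per_query_values) / len(per_query_values)

# Example with two queries and cutoff k = 5.
runs = [(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}),
        (["d6", "d7", "d8", "d9", "d0"], {"d7"})]
p5 = macro_average([precision_at_k(r, rel, 5) for r, rel in runs])
```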
Figures 7.1 and 7.2 present the results with the RANDOM ranking algo-
rithm. The performance of all merging algorithms is similar and signif-
icantly worse than that of the single database algorithms. The degradation
in comparison with the baseline is expected: since databases are chosen
randomly, many relevant documents are excluded from merging at the
database selection step. The LM04 method is significantly better than the
other result merging methods in the top-5...15 interval. Surprisingly,
TFIDF shows a very competitive result; it is slightly better than the other
merging strategies in the top-15...30 interval. The explanation is that the
database statistics are not as highly skewed as assumed.
The next experiment, with the CORI database ranking, is summarized in
Figures 7.3 and 7.4. This ranking is more realistic, and comparable
performance is expected from the final database selection algorithm of the
Minerva system. All result merging methods do better than with the RAN-
DOM algorithm. The TFICF strategy is the least effective. Since the query
terms are quite popular, many of them occur on every peer, which indicates
that the ICF measure of term importance is too coarse. Another possible
reason is that such an approximation does not work with relatively small
peer lists, since we have only 50 peers in the system. The TFGIDF method
works worse than we expected and does not even outperform the local
TFIDF scores. It seems that the fair GIDF values, which are "averaged"
over all databases, are more influenced by noise, while the local IDF values
are more topic-specific. For example, the TFIDF scheme works even better
than the single database at the very top of the results. The CORI merging
method shows a mediocre result.
The database selection algorithm plays the role of a variance reduction
technique: it eliminates from the search a large portion of the non-relevant
documents, namely those in the databases that were not selected. That is
why some methods perform better than the two single collection baselines.
Another important observation is that TF is one of the two best
strategies. The normalized TF value, divided by the document length, is
equal to the maximum likelihood estimate of P (tk|D). In other words,
ranking by the TF scores is equal to language modeling when λ
equals one. This is evidence that, for our testbed with the CORI database
ranking, smoothing by the general language model is only slightly effective.

              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.208   0.176   0.160   0.158   0.142   0.135
TF            0.192   0.132   0.125   0.112   0.101   0.093
TFIDF         0.160   0.140   0.131   0.122   0.110   0.099
TFGIDF        0.168   0.120   0.131   0.118   0.107   0.099
TFICF         0.168   0.120   0.123   0.110   0.106   0.096
CORI          0.160   0.136   0.128   0.122   0.107   0.097
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.224   0.160   0.139   0.114   0.106   0.099

Figure 7.1: The macro-average precision with the database ranking RANDOM

              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.048   0.079   0.102   0.133   0.145   0.161
TF            0.040   0.053   0.075   0.089   0.098   0.106
TFIDF         0.032   0.057   0.074   0.092   0.103   0.108
TFGIDF        0.035   0.048   0.077   0.091   0.103   0.108
TFICF         0.036   0.049   0.068   0.079   0.098   0.104
CORI          0.032   0.057   0.074   0.092   0.101   0.107
SingleLM      0.047   0.087   0.117   0.132   0.149   0.168
LM04          0.049   0.066   0.080   0.085   0.101   0.109

Figure 7.2: The macro-average recall with the database ranking RANDOM
The LM04 method is the second best strategy. It has almost the same
absolute effectiveness as the TF method, but since we compare LM04 with
the stronger SingleLM baseline, its relative effectiveness in Table 7.3 is
lower. Continuing the comparison of TF and LM04, we see that the benefit
of smoothing is reduced by the database selection step. We suggest that
λ should be tuned separately for every database ranking algorithm.
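The equivalence noted above, that ranking by normalized TF coincides with language modeling at λ = 1, can be checked numerically. A small sketch with made-up probabilities (not thesis code): with λ = 1 the LM04 score degenerates to log P (tk|D), a monotone transform of the normalized TF, so both produce the same document order.

```python
import math

def lm_score(p_doc, p_global, lam):
    # LM04-style smoothed score for a single term (Equation 7.3).
    return math.log(lam * p_doc + (1 - lam) * p_global)

docs = {"d1": 0.030, "d2": 0.010, "d3": 0.022}   # P(t|D) = TF/|D| per document
p_global = 0.005

rank_by_tf = sorted(docs, key=docs.get, reverse=True)
rank_by_lm1 = sorted(docs, key=lambda d: lm_score(docs[d], p_global, 1.0),
                     reverse=True)
assert rank_by_tf == rank_by_lm1   # identical ordering when lambda = 1
```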
The third experiment was carried out with the manually created IDEAL
database ranking. The results are presented in Figures 7.5 and 7.6.
It is hard to achieve such an accurate automatic database ranking, one
that ranks all databases in decreasing order of the number of relevant docu-
ments. We used the information about the number of relevant documents in
the databases from the TREC 2002 and 2003 Web Track topic distillation
relevance judgements and built the IDEAL rank in a semi-automatic manner
for every query.
There is no absolute winner here; both the TF and TFGIDF methods
perform worse than the TFIDF and CORI methods. For TF , the explanation
is that when all databases have a comparable number of relevant documents,
the difference in the IDF values becomes more important. The GIDF values
are "smoothed" too much and reflect term importance only in a very general
sense.
On the one hand, the local IDF values are computed over a reasonably
large number of documents inside a collection, so they are not too "over-
fitted". On the other hand, they correspond to a specific situation in which
the collection is topically oriented. When all 10 selected collections are close
to each other under the IDEAL ranking, the local IDF values are both
comparable and topic-specific. Therefore, the TFIDF method turns out to
be a good one with the IDEAL ranking, and the CORI merging method shows
almost the same effectiveness. The LM04 method is again the best in the
top-5...10 interval and quite good in the remaining categories.
The main observations so far:
• All result merging methods are quite close to each other;
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.208   0.176   0.160   0.158   0.142   0.135
TF            0.264   0.196   0.173   0.160   0.150   0.136
TFIDF         0.248   0.184   0.157   0.146   0.138   0.129
TFGIDF        0.224   0.176   0.152   0.144   0.136   0.133
TFICF         0.208   0.172   0.152   0.146   0.138   0.133
CORI          0.240   0.176   0.157   0.146   0.138   0.129
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.272   0.196   0.173   0.158   0.149   0.135

Figure 7.3: The macro-average precision with the database ranking CORI
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.048   0.079   0.102   0.133   0.145   0.161
TF            0.056   0.085   0.106   0.125   0.147   0.158
TFIDF         0.055   0.081   0.101   0.120   0.136   0.149
TFGIDF        0.049   0.078   0.096   0.118   0.138   0.160
TFICF         0.044   0.074   0.099   0.120   0.136   0.159
CORI          0.054   0.078   0.101   0.120   0.137   0.149
SingleLM      0.047   0.087   0.117   0.132   0.149   0.168
LM04          0.057   0.085   0.106   0.123   0.145   0.157

Figure 7.4: The macro-average recall with the database ranking CORI
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.208   0.176   0.160   0.158   0.142   0.135
TF            0.264   0.184   0.155   0.142   0.125   0.120
TFIDF         0.248   0.204   0.192   0.180   0.165   0.141
TFGIDF        0.240   0.204   0.163   0.164   0.150   0.141
TFICF         0.224   0.188   0.168   0.166   0.150   0.144
CORI          0.240   0.204   0.189   0.178   0.163   0.144
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.264   0.220   0.184   0.168   0.157   0.145

Figure 7.5: The macro-average precision with the database ranking IDEAL
              top-5   top-10  top-15  top-20  top-25  top-30
SingleTFIDF   0.048   0.079   0.102   0.133   0.145   0.161
TF            0.049   0.074   0.091   0.114   0.124   0.138
TFIDF         0.054   0.084   0.118   0.144   0.161   0.167
TFGIDF        0.052   0.085   0.101   0.131   0.151   0.170
TFICF         0.047   0.082   0.106   0.132   0.150   0.169
CORI          0.053   0.084   0.117   0.143   0.160   0.170
SingleLM      0.047   0.087   0.117   0.132   0.149   0.168
LM04          0.055   0.091   0.116   0.131   0.151   0.166

Figure 7.6: The macro-average recall with the database ranking IDEAL
              top-5   top-10  top-15  top-20  top-25  top-30
LM04-RANDOM   0.224   0.160   0.139   0.114   0.106   0.099
LM04-CORI     0.272   0.196   0.173   0.158   0.149   0.135
LM04-IDEAL    0.264   0.220   0.184   0.168   0.157   0.145
SingleDBLM    0.224   0.200   0.181   0.158   0.149   0.141

Figure 7.7: The macro-average precision of the LM04 result merging method
with the different database rankings
• The LM04 method shows the best performance and is robust under every
ranking;
• The TFICF method does not work well;
• Surprisingly, the TFIDF method is more effective than the TFGIDF
technique;
• The database selection has a significant influence on merging; a good
database ranking allows merging to outperform the single database
baseline (see Figure 7.7).
Tables 7.2-7.4 contain the differences in average precision in percent,
computed between each method and the corresponding single database
algorithm. For the LM04 technique, the baseline is SingleLM; for all
others, the baseline is the SingleTFIDF method.
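The relative differences in these tables can be computed as a simple percentage residual against the baseline. A minimal sketch (illustrative function name, values taken from Table 7.2):

```python
def relative_difference(method_ap: float, baseline_ap: float) -> float:
    # Percent difference of a method's average precision against its
    # single database baseline.
    return 100.0 * (method_ap - baseline_ap) / baseline_ap

# Example from Table 7.2 (RANDOM ranking, top-5): TF achieves 0.192
# against the SingleTFIDF baseline of 0.208, giving roughly -7.69%.
diff = relative_difference(0.192, 0.208)
```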
It is not clear why the local IDF -based methods are relatively good in
different setups while the GIDF -based result merging methods are not so
effective. The only observation we can make is that the simplest IDF
computation, without additional tuning, gives a very unreliable model. It is
possible to enhance it with additional heuristics, but the simple version
behaves very unstably. Another fact is that the language models are both
effective and robust; they outperform all other tested algorithms.
TOP   TF        TFIDF     TFGIDF    TFICF     CORI      LM04
5     -7.69%    -23.08%   -19.23%   -19.23%   -23.08%   0.00%
10    -25.00%   -20.45%   -31.82%   -31.82%   -22.73%   -20.00%
15    -21.67%   -18.33%   -18.33%   -23.33%   -20.00%   -23.53%
20    -29.11%   -22.78%   -25.32%   -30.38%   -22.78%   -27.85%
25    -29.21%   -22.47%   -24.72%   -25.84%   -24.72%   -29.03%
30    -30.69%   -26.73%   -26.73%   -28.71%   -27.72%   -30.19%

Table 7.2: The difference in percent of the average precision between the
result merging strategies and corresponding baselines with the RANDOM
ranking. The LM04 technique is compared with the SingleLM method; all
others are compared with the SingleTFIDF approach

TOP   TF        TFIDF     TFGIDF    TFICF     CORI      LM04
5     +26.92%   +19.23%   +7.69%    0.00%     +15.38%   +21.43%
10    +11.36%   +4.55%    0.00%     -2.27%    0.00%     -2.00%
15    +8.33%    -1.67%    -5.00%    -5.00%    -1.67%    -4.41%
20    +1.27%    -7.59%    -8.86%    -7.59%    -7.59%    0.00%
25    +5.62%    -3.37%    -4.49%    -3.37%    -3.37%    0.00%
30    +0.99%    -3.96%    -0.99%    -0.99%    -3.96%    -4.72%

Table 7.3: The difference in percent of the average precision between the
result merging strategies and corresponding baselines with the CORI ranking.
The LM04 technique is compared with the SingleLM method; all others are
compared with the SingleTFIDF approach

TOP   TF        TFIDF     TFGIDF    TFICF     CORI      LM04
5     +26.92%   +19.23%   +15.38%   +7.69%    +15.38%   +17.86%
10    +4.55%    +15.91%   +15.91%   +6.82%    +15.91%   +10.00%
15    -3.33%    +20.00%   +1.67%    +5.00%    +18.33%   +1.47%
20    -10.13%   +13.92%   +3.80%    +5.06%    +12.66%   +6.33%
25    -12.36%   +15.73%   +5.62%    +5.62%    +14.61%   +5.38%
30    -10.89%   +4.95%    +4.95%    +6.93%    +6.93%    +2.83%

Table 7.4: The difference in percent of the average precision between the re-
sult merging strategies and corresponding baselines with the IDEAL ranking.
The LM04 technique is compared with the SingleLM method; all others are
compared with the SingleTFIDF approach
7.2.3 Effect of limited statistics on the result merging
In practice, it is inefficient to collect the full statistics for the global language
model or the GIDF value from thousands of peers. Instead, we can use the
limited statistics from the 10 selected databases that participate in merging.
Figures 7.8 and 7.9 present the results of the experiments with limited
statistics. We tested the LM04 method as the most effective one. The
10LM04 method is a variation of LM04 that uses only the statistics from
the 10 merged databases. We did not use the RANDOM ranking in the
remaining experiments, since they make sense only for a reasonably good
database selection algorithm.
With the CORI database selection algorithm, LM04 performs better with
the general language model estimated over all peers. However, with the
IDEAL ranking, the results of the 10LM04 technique are almost equal to
those of the LM04 method. Our conclusion is that, given an effective
database ranking algorithm, we can merge results using only the statistics
from the databases involved in merging. This gives us a merging method
that is both scalable and effective.
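The difference between the two variants amounts to which peers contribute to the collection-wide estimate. A sketch under the assumption that each peer exposes a pair (occurrences of the term, total terms in its collection); names and the statistics format are hypothetical:

```python
def global_model(peer_stats, selected=None):
    """Estimate P(t|G) from per-peer (term_count, total_count) pairs.

    With selected=None the estimate uses every peer (as in LM04);
    with a list of peer indices it uses only the merged databases
    (as in 10LM04).
    """
    peers = peer_stats if selected is None else [peer_stats[i] for i in selected]
    term_total = sum(tc for tc, _ in peers)
    all_total = sum(n for _, n in peers)
    return term_total / all_total

stats = [(40, 10_000), (5, 8_000), (90, 20_000), (2, 5_000)]
p_all = global_model(stats)                   # statistics from every peer
p_sel = global_model(stats, selected=[0, 2])  # only the merged databases
```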
              top-5   top-10  top-15  top-20  top-25  top-30
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.272   0.196   0.173   0.158   0.149   0.135
10LM04        0.248   0.180   0.155   0.144   0.131   0.124

Figure 7.8: The macro-average precision with the database ranking CORI
with the global statistics collected over the 10 selected databases
              top-5   top-10  top-15  top-20  top-25  top-30
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141
LM04          0.264   0.220   0.184   0.168   0.157   0.145
10LM04        0.256   0.216   0.181   0.168   0.157   0.147

Figure 7.9: The macro-average precision with the database ranking IDEAL
with the global statistics collected over the 10 selected databases
7.3 Experiments with our approach
In the second series of experiments, we evaluated our technique. We tested
the best result merging method LM04 in combination with the ranking from
the preference-based language model. The detailed description of our ap-
proach is provided in Chapter 5. Here we repeat the main equations for the
merging score computation. The globally normalized similarity score sLMgn_k
in the method LM04 is computed as:

sLMgn_k = log(λ · P (tk|Dij) + (1 − λ) · P (tk|G))     (7.3)

The preference-based similarity score sLMpb_k is computed as:

sLMpb_k = −P (tk|U) · log(P (tk|Dij))     (7.4)

Finally, both scores are combined into the result merging score sLMrm:

sLMrm = Σ(k=1..|Q|) [β · sLMgn_k + (1 − β) · sLMpb_k]     (7.5)
The influence of two main parameters was investigated:
• n — the number of top documents from which the preference model
U is composed;
• β — the smoothing parameter between the sLMgn_k and sLMpb_k scores.
The value of β is explicitly included in the last two digits of the method's
name. For example, LM04PB02 is the combination of the LM04 score
and the preference-based score with β = 0.2. Notice that the codes 04 and
02 in LM04PB02 refer to different parameters. The first is λ, the smoothing
parameter between the document and global language models in the language
modeling result merging method LM04. The second is β, which defines the
trade-off between the combined scores in our approach. If we substitute both
combined scores into the formula for the final merging score, we obtain:
sLMrm = Σ(k=1..|Q|) [β · log(λ · P (tk|Dij) + (1 − λ) · P (tk|G)) − (1 − β) · P (tk|U) · log(P (tk|Dij))]     (7.6)
When β = 0, the method reduces to ranking by the cross-entropy between
the preference-based and document language models; we call it PB. When
β = 1, we obtain a pure LM04 ranking. For the retrieval of the pseudo-
relevant top-n documents, we used the TFIDF retrieval algorithm.
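The combined score of Equations 7.3-7.5 can be sketched as follows. This is an illustration, not the thesis implementation: the per-term probabilities are passed in directly, whereas in the real system they would come from the peers' statistics.

```python
import math

def combined_score(query_terms, p_doc, p_global, p_pref, lam=0.4, beta=0.6):
    """sLMrm = sum over query terms of beta*sLMgn_k + (1-beta)*sLMpb_k."""
    total = 0.0
    for t in query_terms:
        s_gn = math.log(lam * p_doc[t] + (1 - lam) * p_global[t])  # (7.3)
        s_pb = -p_pref[t] * math.log(p_doc[t])                     # (7.4)
        total += beta * s_gn + (1 - beta) * s_pb                   # (7.5)
    return total
```

As noted above, beta = 1 reduces the score to pure LM04 and beta = 0 to the cross-entropy ranking PB.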
7.3.1 Optimal size of the top-n
First, we conducted experiments with the separate PB ranking in order to
find the optimal n for estimating our preference-based model. A reasonable
assumption is that n should lie in the [5...30] interval. The lower bound
was set to avoid overfitting, and the upper bound was set with respect to
the average number of relevant documents in the databases. Figures 7.10 and
7.11 present the results of the experiments with different n for the
preference-based language model estimation.
The large variance in the results is in accord with our expectations. So far,
there is no method that guarantees an accurate estimate of the best n for
every database ranking. The number of relevant documents in the top
is crucial for tuning n. The best database for the CORI ranking gives us
the best average precision for the model with n = 30, while for the IDEAL
ranking the best choice is n = 10.
We concluded that in the Minerva system the appropriate choice for the
preference-based model estimation is n = 10. It shows the best performance
with the IDEAL ranking and reasonably good performance with the CORI
ranking as well. For some queries we have databases with more than 10
relevant documents, but for others we have fewer than three relevant
documents in the best database. Therefore, it is dangerous to take a large n,
since we can introduce many irrelevant documents into the preference-based
language model estimation.
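The preference model itself can be estimated as a maximum likelihood model over the concatenated text of the top-n pseudo-relevant documents from the best-ranked peer. A minimal sketch with illustrative names and toy documents (the thesis settled on n = 10):

```python
from collections import Counter

def preference_model(top_docs, n=10):
    """Estimate P(t|U) from the top-n pseudo-relevant documents.

    top_docs is a ranked list of documents, each given as a list of
    (stemmed) terms; the model is the term frequency distribution over
    the first n of them.
    """
    counts = Counter()
    for doc_terms in top_docs[:n]:
        counts.update(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

docs = [["whale", "dolphin", "protect"], ["whale", "ocean"]]
p_u = preference_model(docs, n=10)
```

A "lucky" top-n concentrates the probability mass on relevant terms; an "unlucky" one dilutes it, which is the instability discussed in Section 7.3.2.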
7.3.2 Optimal smoothing parameter β
After fixing the n parameter for the preference-based language model es-
timation, we conducted experiments with different values of the β parameter:
β = 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99. The best
combination was obtained with β = 0.6. Figures 7.12 and 7.13 present the
results for the LM04PB06 method and show the separate
              top-5   top-10  top-15  top-20  top-25  top-30
PBn5          0.288   0.196   0.171   0.154   0.141   0.129
PBn10         0.272   0.196   0.173   0.158   0.149   0.135
PBn15         0.232   0.168   0.149   0.138   0.128   0.117
PBn20         0.272   0.200   0.173   0.154   0.146   0.135
PBn25         0.272   0.196   0.173   0.158   0.149   0.135
PBn30         0.288   0.200   0.179   0.158   0.144   0.131

Figure 7.10: The macro-average precision with the database ranking CORI
with the different size of top-n for the preference-based model estimation
              top-5   top-10  top-15  top-20  top-25  top-30
PBn5          0.256   0.184   0.157   0.142   0.131   0.119
PBn10         0.248   0.200   0.168   0.146   0.130   0.119
PBn15         0.256   0.184   0.157   0.142   0.131   0.119
PBn20         0.208   0.168   0.133   0.128   0.114   0.097
PBn25         0.248   0.180   0.155   0.146   0.131   0.119
PBn30         0.256   0.176   0.157   0.146   0.133   0.120

Figure 7.11: The macro-average precision with the database ranking IDEAL
with the different size of top-n for the preference-based model estimation
              top-5   top-10  top-15  top-20  top-25  top-30
PB            0.272   0.196   0.173   0.158   0.149   0.135
LM04PB06      0.288   0.196   0.176   0.158   0.149   0.135
LM04          0.272   0.196   0.173   0.158   0.149   0.135
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141

Figure 7.12: The macro-average precision with the database ranking CORI
with the top-10 documents for the preference-based model estimation with
β = 0.6 and LM04 result merging method
              top-5   top-10  top-15  top-20  top-25  top-30
PB            0.248   0.192   0.165   0.146   0.130   0.119
LM04PB06      0.272   0.220   0.187   0.170   0.157   0.144
LM04          0.264   0.220   0.184   0.168   0.157   0.145
SingleLM      0.224   0.200   0.181   0.158   0.149   0.141

Figure 7.13: The macro-average precision with the database ranking IDEAL
with the top-10 documents for the preference-based model estimation with
β = 0.6 and LM04 result merging method
performance of the combined methods for comparison.
The single PB ranking, which is purely based on pseudo-relevance
feedback, shows unstable performance under the different database rank-
ings. It is effective with the CORI ranking and poor with the IDEAL ranking.
This is an inherent property of pseudo-relevance feedback: with a "lucky"
choice of the top-n documents for the model estimation it increases the per-
formance, and with an "unlucky" choice it decreases it. The performance of
the PB method depends entirely on the first database in the ranking, from
which we obtain the preference-based language model. The average precision
of LM04PB06 is slightly better than that of LM04 in the small top-5 and
top-15 categories. In the other cases, it shows the same performance.
We conclude that LM04PB06, the combination of the cross-entropy rank-
ing PB with the LM04 language model at β = 0.6, is slightly more effective
than the single LM04 method.
7.4 Summary
In this chapter, we gave a detailed description of our experiments
with the result merging strategies selected for evaluation in the Minerva
system. In Section 7.1, we provided the testbed description. The results of
testing several known result merging techniques are presented in Section 7.2.
We found that the LM04 result merging method is the most effective and
robust. In Section 7.3, we described the experimental results with the
proposed approach. We found that the combination of the pseudo-relevance
feedback based PB method with the best result merging method LM04 gives
a small improvement on some intervals and is at least as effective as the
better of the combined methods.
Chapter 8
Conclusions and future work
8.1 Conclusions
In this thesis, we investigated the effectiveness of different result merg-
ing methods for the P2P Web search engine Minerva.
We selected several merging methods that are feasible to use in a hetero-
geneous, dynamic, distributed environment. The experimental framework
for these methods was implemented with Java 1.4.2 and Oracle 9.2.0. We
carried out experiments with different database rankings and studied the
effectiveness of the result merging methods for different sizes of the top-k.
The language modeling ranking method LM04 produced the most robust
and accurate results under the different conditions.
We proposed a new result merging method that combines two types of
similarity scores. The first score type is computed with the language model-
ing merging method LM04. The second score type is the cross-entropy
between the preference-based language model and the document language
model. The novelty of our approach is that the preference-based language
model is obtained from pseudo-relevance feedback on the best peer in the
peer ranking. The combination is tuned with a heuristically set parame-
ter. In every tested setup, the new method was at least as effective as the
best of the individual merging methods or slightly better.
The main observations are the following:
• All merging algorithms are very close in absolute retrieval effectiveness;
• Language modeling methods are more effective than TF · IDF based
methods;
• The effectiveness of the database selection step influences the quality
of result merging;
• The pseudo-relevance feedback information from the topically orga-
nized collections improves the retrieval quality.
8.2 Future work
There are several ways to enhance the result merging in Minerva.
Effectiveness and efficiency are the two important dimensions for improve-
ment.
The effectiveness of the many text-statistics-based methods is similar
and does not significantly improve the final ranking. We can exploit other
sources of evidence and incorporate them into the fused document score.
Linkage-based algorithms (e.g., PageRank) can be added to the retrieval
algorithm. The problem here is how to compute such scores in a completely
distributed environment.
The efficiency mainly depends on a smart top-k result computation al-
gorithm. We can improve it by introducing additional communication
between the peers during query processing.
Bibliography
[Bau99] Christoph Baumgarten. A probabilistic solution to the selec-
tion and fusion problem in distributed information retrieval. In
SIGIR ’99: Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in information
retrieval, pages 246–253. ACM Press, 1999.
[BMWZ04] M. Bender, S. Michel, G. Weikum, and C. Zimmer. The min-
erva project: Database selection in the context of p2p search.
2004.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-
scale hypertextual Web search engine. Computer Networks and
ISDN Systems, 30(1–7):107–117, 1998.
[Cal00] J. Callan. W.B. Croft, editor, Advances in information re-
trieval,, chapter Distributed information retrieval, pages 127–
150. 2000.
[CAPMN03] Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P.
Martin, and Thu D. Nguyen. Planetp: Using gossiping to build
content addressable peer-to-peer information sharing commu-
nities. In HPDC, pages 236–249, 2003.
[CHT99] Nick Craswell, David Hawking, and Paul B. Thistlewaite.
Merging results from isolated search engines. In Australasian
Database Conference, pages 189–200, 1999.
[CLC95] J. P. Callan, Z. Lu, and W. Bruce Croft. Searching Distributed
Collections with Inference Networks . In E. A. Fox, P. Ing-
71
wersen, and R. Fidel, editors, Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval, pages 21–28, Seattle, Wash-
ington, 1995. ACM Press.
[Cra01] Nicholas Eric Craswell. Methods for Distributed Information
Retrieval. PhD thesis, January 01 2001.
[Cro00] W. Bruce Croft. Combining approaches to ir (invited talk). In
DELOS Workshop: Information Seeking, Searching and Query-
ing in Digital Libraries, 2000.
[CS00] Anne Le Calve and Jacques Savoy. Database merging strategy
based on logistic regression. Inf. Process. Manage., 36(3):341–
359, 2000.
[Dia96] Ted Diamond. Information retrieval using dynamic evidence
combination. PhD thesis, 1996.
[GCGMP97] Luis Gravano, Kevin Chen-Chuan Chang, Hector Garcia-
Molina, and Andreas Paepcke. Starts: Stanford proposal for
internet meta-searching (experience paper). In SIGMOD Con-
ference, pages 207–218, 1997.
[GGMT99] Luis Gravano, Hector Garcıa-Molina, and Anthony Tomasic.
GlOSS: text-source discovery over the Internet. ACM Trans-
actions on Database Systems, 24(2):229–264, 1999.
[GWG96] Susan Gauch, Guijun Wang, and Mario Gomez. ProFusion:
Intelligent Fusion from Multiple, Distributed Seach Engines.
Journal of Universal Computing, Springer-Verlag, 2(9), Sept.
1996.
[Kir97] S. T. Kirsch. Distributed search patent. u.s. patent 5,659,732,
1997.
[Kle99] Jon M. Kleinberg. Authoritative sources in a hyperlinked en-
vironment. Journal of the ACM, 46(5):604–632, 1999.
72
[LC04a] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval us-
ing language models. In SIGIR ’04: Proceedings of the 27th
annual international conference on Research and development
in information retrieval, pages 186–193. ACM Press, 2004.
[LC04b] Jie Lu and Jamie Callan. Merging retrieval results in hierar-
chical peer-to-peer networks. In SIGIR ’04: Proceedings of the
27th annual international conference on Research and devel-
opment in information retrieval, pages 472–473. ACM Press,
2004.
[LCC00] Leah S. Larkey, Margaret E. Connell, and James P. Callan. Col-
lection selection and results merging with topically organized
u.s. patents and trec data. In CIKM, pages 282–289, 2000.
[Lee97] Jong-Hak Lee. Analyses of multiple evidence combination. In
SIGIR, pages 267–276, 1997.
[LG98] Steve Lawrence and C. Lee Giles. Inquirus, the neci meta search
engine. In WWW7: Proceedings of the seventh international
conference on World Wide Web 7, pages 95–105. Elsevier Sci-
ence Publishers B. V., 1998.
[Mon02] Mark Montague. Metasearch: Data Fusion for Document Re-
trieval. PhD thesis, 2002.
[MYL02] Weiyi Meng, Clement T. Yu, and King-Lup Liu. Building ef-
ficient and effective metasearch engines. ACM Comput. Surv.,
34(1):48–89, 2002.
[NF03] H. Nottelmann and N. Fuhr. From retrieval status values to
probabilities of relevance for advanced IR applications. Infor-
mation Retrieval, 6(4), 2003.
[OKK02] B. Uygar Oztekin, George Karypis, and Vipin Kumar. Ex-
pert agreement and content based reranking in a meta search
environment using mearf. In WWW, pages 333–344, 2002.
73
[PC98] Jay M. Ponte and W. Bruce Croft. A language modeling ap-
proach to information retrieval. In Research and Development
in Information Retrieval, pages 275–281, 1998.
[PFC+00] Allison L. Powell, James C. French, James P. Callan, Mar-
garet E. Connell, and Charles L. Viles. The impact of database
selection on distributed searching. In SIGIR, pages 232–239,
2000.
[RAS01] Yves Rasolofo, Faiza Abbaci, and Jacques Savoy. Approaches
to collection selection and results merging for distributed infor-
mation retrieval. In CIKM, pages 191–198, 2001.
[SC03] Luo Si and Jamie Callan. A semisupervised learning method to
merge search engine results. ACM Trans. Inf. Syst., 21(4):457–
491, 2003.
[SE97] E. Selberg and O. Etzioni. The MetaCrawler architecture for
resource aggregation on the Web. IEEE Expert, (January–
February):11–14, 1997.
[Ser05] Pavel Serdyukov. Query routing in a peer-to-peer web search
engine. Master’s thesis, 2005.
[SF94] Joseph A. Shaw and Edward A. Fox. Combination of multiple
searches. In Proceedings of the 3th Text REtrieval Conference
(TREC-3), pages 105–109, 1994.
[SJCO02] Luo Si, Rong Jin, James P. Callan, and Paul Ogilvie. A lan-
guage modeling framework for resource selection and results
merging. In CIKM, pages 391–397, 2002.
[SMK+01] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and
Hari Balakrishnan. Chord: A scalable Peer-To-Peer lookup
service for internet applications. In Proceedings of the 2001
ACM SIGCOMM Conference, pages 149–160, 2001.
74
[SMW+03] Torsten Suel, Chandan Mathur, Jowen Wu, Jiangong Zhang,
Alex Delis, Mehdi Kharrazi, Xiaohui Long, and Kulesh Shan-
mugasundaram. Odissea: A peer-to-peer architecture for scal-
able web search and information retrieval. In WebDB, pages
67–72, 2003.
[SP99] Jacques Savoy and Justin Picard. Report on the trec-8 experi-
ment: Searching on the web and in distributed collections. In
Proceedings of the 8th Text REtrieval Conference (TREC-8),
1999.
[SP01] Chris Sherman and Gary Price. The invisible web: Uncovering
information sources search engines can’t see, 2001.
[SR00] Jacques Savoy and Yves Rasolofo. Report on the trec-9 ex-
periment: Link-based retrieval and distributed collections. In
Proceedings of the 9th Text REtrieval Conference (TREC-9),
2000.
[TVGJL95] Geoffrey G. Towell, Ellen M. Voorhees, Narendra Kumar
Gupta, and Ben Johnson-Laird. Learning collection FUsion
strategies for information retrieval. In International Confer-
ence on Machine Learning, pages 540–548, 1995.
[TZ04] Tao Tao and ChengXiang Zhai. A mixture clustering model for
pseudo feedback in information retrieval. 2004.
[VC99a] Christopher C. Vogt and Garrison W. Cottrell. Fusion via a
linear combination of scores. Inf. Retr., 1(3):151–173, 1999.
[VC99b] Christopher Charles Vogt and Garrison W. Cottrell. Adaptive
combination of evidence for information retrieval. PhD thesis,
1999.
[VF95] Charles L. Viles and James C. French. Dissemination of col-
lection wide information in a distributed information retrieval
system. In Proceedings of the 18th Annual International ACM
75
SIGIR Conference on Research and Development in Informa-
tion Retrieval, pages 12–20, 1995.
[VGJL94] Ellen M. Voorhees, Narendra Kumar Gupta, and Ben Johnson-
Laird. The collection fusion problem. In Text REtrieval Con-
ference, pages 0–, 1994.
[Voo95] E. Voorhees. Siemens trec-4 report: Further experiments with
database merging, 1995.
[WC02] Shengli Wu and Fabio Crestani. Data fusion with estimated
weights. In CIKM, pages 648–651, 2002.
[WCG03] Shengli Wu, Fabio Crestani, and Forbes Gibb. New methods
of results merging for distributed information retrieval. In Dis-
tributed Multimedia Information Retrieval, pages 84–100, 2003.
[WGDW03] Y. Wang, L. Galanis, and D.J. De Witt. Galanx: An efficient
peer-to-peer search engine system. 2003.
[YL97] Budi Yuwono and Dik Lun Lee. Server ranking for distributed
text retrieval systems on the internet. In Database Systems for
Advanced Applications, pages 41–50, 1997.
[YR98] Ronald R. Yager and Alexander Rybalov. On the fusion of doc-
uments from multiple collection information retrieval systems.
J. Am. Soc. Inf. Sci., 49(13):1177–1184, 1998.
[ZL01] Chengxiang Zhai and John Lafferty. Model-based feedback in
the language modeling approach to information retrieval. In
CIKM ’01: Proceedings of the tenth international conference on
Information and knowledge management, pages 403–410. ACM
Press, 2001.
Appendix A
Test queries
In Table A.1 we summarize the test queries from the topic distillation
task of the TREC 2002 and 2003 Web Track datasets (http://trec.nist.gov).
The queries were selected with respect to two requirements: each query has
at least 10 relevant documents, and each belongs to either the “Health and
Medicine” or the “Nature and Ecology” topic.
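The selection criteria above amount to a simple filter over the candidate topics. The sketch below is illustrative only: the tuple layout is an assumption, and the sample entries are taken from Table A.1 (plus one hypothetical topic that fails the filter).

```python
# Candidate topics as (TREC number, stemmed query, topic code, relevant docs).
# The first three entries are from Table A.1; the last is hypothetical.
topics = [
    ("552", "food cancer patient", "HM", 26),
    ("557", "clean air clean water", "NE", 48),
    ("TD5", "pest control safeti", "NE", 13),
    ("999", "unrelated topic", "XX", 5),   # hypothetical, gets filtered out
]

# Keep topics with at least 10 relevant documents whose subject is
# "Health and Medicine" (HM) or "Nature and Ecology" (NE).
selected = [
    (num, query)
    for (num, query, topic, relevant) in topics
    if relevant >= 10 and topic in ("HM", "NE")
]
print(selected)
```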
In Table A.2 we present the distributions of relevant documents for the
TF and LM04PB06 merging methods. The pattern is typical for the tested
merging methods: results for some specific queries are improved, while
results for others are degraded. Figures A.1 and A.2 give a graphical
interpretation of the same data. The residual in performance between the
two methods is presented in Figure A.3.
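The residual is simple per-query arithmetic: at each cutoff, the difference between the relevant-document counts of the two methods. A minimal sketch, using the counts for query 1 (“food cancer patient”) from Table A.2; the variable names are illustrative assumptions, not code from the Minerva system.

```python
# Relevant-document counts for query 1 ("food cancer patient") from
# Table A.2: TF vs. LM04PB06 under the IDEAL database ranking.
cutoffs = [5, 10, 15, 20, 25, 30]   # top-k cutoffs
tf_counts = [1, 1, 2, 3, 3, 5]      # TF method
lmpb_counts = [5, 5, 5, 6, 6, 6]    # LM04PB06 method

# Residual per cutoff: positive values mean the language-model-based
# method returned more relevant documents at that cutoff.
residual = [b - a for a, b in zip(tf_counts, lmpb_counts)]
print(dict(zip(cutoffs, residual)))  # {5: 4, 10: 4, 15: 3, 20: 3, 25: 3, 30: 1}
```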
 N | Query number | Stemmed query                  | Topic | Total number of
   | in TREC      |                                |       | relevant documents
 1 | 552          | food cancer patient            | HM    | 26
 2 | 557          | clean air clean water          | NE    | 48
 3 | 560          | symptom diabet                 | HM    | 28
 4 | 561          | erad boll weevil               | NE    | 17
 5 | 563          | smoke drink pregnanc           | HM    | 23
 6 | 564          | mother infant nutrit           | HM    | 24
 7 | 569          | invas anim plant               | NE    | 18
 8 | 574          | whale dolphin protect          | NE    | 27
 9 | 575          | nuclear wast storag transport  | NE    | 46
10 | 576          | chesapeak bay ecolog           | NE    | 14
11 | 578          | regul zoo                      | NE    | 20
12 | 584          | birth defect                   | HM    | 50
13 | 583          | florida endang speci           | NE    | 29
14 | 586          | women health cancer            | HM    | 38
15 | 589          | mental ill adolesc             | HM    | 112
16 | 594          | food prevent high cholesterol  | HM    | 27
17 | TD5          | pest control safeti            | NE    | 13
18 | TD14         | agricultur biotechnolog        | NE    | 12
19 | TD31         | deaf children                  | HM    | 13
20 | TD32         | wildlif conserv                | NE    | 86
21 | TD33         | food safeti                    | HM    | 28
22 | TD35         | arctic explor                  | NE    | 45
23 | TD36         | global warm                    | NE    | 12
24 | TD43         | forest fire                    | NE    | 25
25 | TD44         | ozon layer                     | NE    | 12
Table A.1: The topic-oriented set of the 25 experimental queries (topics are
coded as “HM” for Health and Medicine and “NE” for Nature and Ecology)
Query N | top-5     | top-10    | top-15    | top-20    | top-25    | top-30
        | TF | LMPB | TF | LMPB | TF | LMPB | TF | LMPB | TF | LMPB | TF | LMPB
1 1 | 5 1 | 5 2 | 5 3 | 6 3 | 6 5 | 6
2 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0
3 5 | 2 5 | 2 5 | 4 8 | 5 10 | 6 10 | 8
4 4 | 4 7 | 7 8 | 9 9 | 9 10 | 10 10 | 10
5 1 | 1 1 | 2 1 | 2 2 | 2 2 | 2 2 | 2
6 1 | 1 2 | 3 2 | 6 4 | 6 6 | 6 7 | 7
7 1 | 0 1 | 1 1 | 2 1 | 2 3 | 2 3 | 2
8 0 | 0 0 | 1 0 | 1 1 | 1 1 | 1 2 | 1
9 2 | 2 3 | 3 4 | 3 4 | 4 4 | 5 6 | 7
10 0 | 0 1 | 0 1 | 1 1 | 2 2 | 3 2 | 3
11 2 | 3 3 | 3 4 | 4 5 | 4 6 | 4 8 | 5
12 1 | 3 2 | 4 5 | 4 8 | 6 8 | 9 9 | 9
13 0 | 0 0 | 1 0 | 1 1 | 3 1 | 3 1 | 3
14 3 | 3 5 | 5 8 | 8 10 | 11 12 | 12 12 | 12
15 4 | 3 4 | 5 5 | 6 5 | 6 5 | 8 5 | 8
16 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1
17 0 | 2 2 | 3 3 | 3 3 | 3 3 | 3 3 | 3
18 2 | 0 2 | 2 3 | 3 3 | 3 3 | 3 3 | 3
19 0 | 0 0 | 1 1 | 1 1 | 2 1 | 3 1 | 3
20 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0
21 0 | 0 0 | 0 0 | 0 0 | 1 0 | 1 0 | 2
22 0 | 0 1 | 0 1 | 0 1 | 0 1 | 1 2 | 2
23 0 | 1 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1
24 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1 1 | 1
25 2 | 2 4 | 4 4 | 4 6 | 6 8 | 7 9 | 9
Table A.2: The number of relevant documents for the TF and LM04PB06
methods with the IDEAL database ranking (LM04PB06 is shortened
to LMPB for convenience)
[Plot: one curve per query N1–N25 (queries as listed in Table A.1); x-axis:
number of documents in top-k (5 to 30); y-axis: number of relevant
documents (0 to 14).]
Figure A.1: Relevant documents distribution for the TF method with the
IDEAL ranking
[Plot: one curve per query N1–N25 (queries as listed in Table A.1); x-axis:
number of documents in top-k (5 to 30); y-axis: number of relevant
documents (0 to 14).]
Figure A.2: Relevant documents distribution for the LM04PB06 method
with the IDEAL ranking
[Plot: one series per query N1–N25 (queries as listed in Table A.1); x-axis:
top-5 through top-30; y-axis: residual number of relevant documents
(−5 to 5).]
Figure A.3: Residual between the number of relevant documents of the
CE06LM04 and TF methods with the IDEAL database ranking