
Page 1: Distributed Information Retrieval

Fabio Crestani and Ilya Markov

University of Lugano, Switzerland

Page 2: Outline

1 Motivations
Deep Web
Federated Search
Metasearch
Aggregated Search

2 Architectures

3 Broker-Based Architecture

4 Applications

5 Advanced DIR

Page 3: Motivations

Why do we need DIR?

There are limits to what a search engine can find on the web.

Not everything that is on the web is, or can be, harvested.
The "one size fits all" approach of web search engines has many limitations.
Often there is more than one type of answer to the same query.

Thus: Deep Web, Federated Search, Metasearch, Aggregated Search.

Page 4: Deep Web

There is a lot of information on the web that cannot be accessed by search engines (the deep or hidden web).

There are many different reasons why this information is not accessible to crawlers.

This is often very valuable information!

Web search engines can only be used to identify the resource (if possible); then the user has to deal directly with it.

Even if this information could be crawled, there are good reasons not to ...

Page 5: Federated Search

Federated Search is another name for DIR.

Federated search systems do not crawl a resource, but pass a user query to the search facilities of the resource itself.

Why would this be better?

Preserves the property rights of the resource owner.
Search facilities are optimised for the specific resource.
The index is always up-to-date.
The resource is curated and of high quality.

Examples of federated search systems: PubMed, FedStats, WestLaw, and Cheshire.

Page 6: Metasearch

Even the largest search engine cannot effectively crawl the entire web.

Different search engines crawl different, disjoint portions of the web.

Different search engines use different ranking functions.

Metasearch engines do not crawl the web, but pass a user query to a number of search engines and then present the fused result set.

Examples of metasearch systems: Dogpile, MetaCrawler, AllInOneNews, and SavvySearch.

Page 7: Aggregated Search

Often there is more than one type of information relevant to a query (e.g., web pages, images, maps, reviews).
These types of information are indexed and ranked by separate sub-systems.
Presenting this information in an aggregated way is more useful to the user.

Page 8: Outline

1 Motivations

2 Architectures
Peer-to-Peer Network
Broker-Based Architecture
Crawling
Metadata Harvesting
Hybrid

3 Broker-Based Architecture

4 Applications

5 Advanced DIR

Page 9: A Taxonomy of DIR Systems

A taxonomy of DIR architectures can be built considering where the indexes are kept.
This suggests four different types of architectures: broker-based, peer-to-peer, crawling, and meta-data harvesting.

[Figure: taxonomy of DIR architectures]

Global collection
- Distributed indexes: P2P, Broker
- Centralised index: Crawling, Meta-data harvesting
- Hybrid

Page 10: Peer-to-Peer Networks

Indexes are located with the resources.

Some parts of the indexes are distributed to other resources.

Queries are distributed across the resources, and results are merged by the peer that originated the query.

Page 11: Broker-Based Architecture

Indexes are located with the resources.

Queries are forwarded to the resources, and results are merged by the broker.

Page 12: Crawling

Resources are crawled and documents are harvested.

Indexes are centralised.

Queries are carried out in a centralised way, and documents are fetched from the resources or from storage.

Page 13: Metadata Harvesting

Indexes are located with the resources, but metadata are harvested according to some protocol (off-line phase), for example OAI-PMH.

Queries are carried out at the broker level (on-line phase) to identify relevant documents by their metadata; the documents are then requested from the resources.

Page 14: Index Harvesting

It is possible to crawl the indexes, instead of the metadata, according to some protocol (off-line phase), for example OAI-PMH.

Queries are carried out at the broker level (on-line phase) to identify relevant documents by the documents' full content; the documents are then requested from the resources.

Page 15: Outline

1 Motivations

2 Architectures

3 Broker-Based Architecture
Resource Description
Resource Selection
Results Merging

4 Applications

5 Advanced DIR

Page 16: Resource Description

Cooperative

Uncooperative

Page 17: Stanford Protocol Proposal for Internet Retrieval and Search (STARTS)

Query Language
Filter expressions
Ranking expressions

Retrieved Documents
Unnormalized score
Source
Statistics (term frequency, term weight, document frequency, document size, document count)

Source metadata and content summary
Vocabulary
Statistics (term frequency, document frequency, number of documents)

Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec., 26(2):207–218, 1997.

Page 18: Query-Based Sampling

Query −→ Resource
Documents (2-10) ←−

Questions:

How to select queries?

What are the stopping criteria?

Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97–130, 2001.
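The sampling loop itself is short. Below is a minimal sketch of query-based sampling, not Callan and Connell's exact procedure: the `search(query, k)` callable is a hypothetical stand-in for the resource's own search interface, and picking the next query from the sampled vocabulary is just one of the policies discussed on the next slide.

```python
import random

def query_based_sample(search, seed_terms, max_docs=300, docs_per_query=4):
    """Build a resource description by sampling documents via queries.

    `search(query, k)` is assumed to return up to k document texts from
    the uncooperative resource; its internals are unknown to the broker.
    """
    sample, vocabulary = [], set()
    query = random.choice(seed_terms)            # initial query: a seed term (ord)
    while len(sample) < max_docs:
        docs = search(query, docs_per_query)     # retrieve a handful of documents
        if not docs:
            query = random.choice(seed_terms)    # dead end: fall back to a seed
            continue
        for doc in docs:
            sample.append(doc)
            vocabulary.update(doc.lower().split())
        # next query: a term learned from the sample so far (lrd policy)
        query = random.choice(sorted(vocabulary))
    return sample, vocabulary
```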

Page 19: Sampling Queries

Other Resource Description (ord)

Learned Resource Description (lrd):
Random
Document Frequency (df)
Collection Frequency (ctf)
Average Term Frequency (ctf/df)

Query-Logs

Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97–130, 2001.

Milad Shokouhi, Justin Zobel, Seyed M. M. Tahaghoghi, and Falk Scholer. Using query logs to establish vocabularies in distributed information retrieval. Inf. Process. Manage., 43(1):169–180, 2007.

Page 20: Stopping Criteria

300-500 documents

n documents out of N resources

Collection size
Vocabulary size

Newly downloaded terms are not likely to appear in future queries

Q - the set of training queries
\theta_k - the language model of the sample at the k-th iteration

p(Q \mid \theta_k) = \prod_{i=1}^{N_q} \prod_{j=1}^{|q_i|} p(t = q_{ij} \mid \theta_k), \qquad \ell(\theta_k, Q) = \log p(Q \mid \theta_k)

\phi_k = \ell(\theta_k, Q) - \ell(\theta_{k-1}, Q) = \log \frac{p(Q \mid \theta_k)}{p(Q \mid \theta_{k-1})}
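Translated into code, the criterion stops sampling once the log-likelihood of the training queries no longer improves. This is a minimal sketch assuming a unigram language model with add-one smoothing over a fixed vocabulary size; the smoothing scheme and the threshold are our illustrative choices, not prescribed by the slide.

```python
import math
from collections import Counter

def log_likelihood(queries, sample_docs, vocab_size=100_000):
    """l(theta_k, Q): log-likelihood of training queries Q under the
    sample language model theta_k (unigram, add-one smoothing)."""
    tf = Counter(t for d in sample_docs for t in d.lower().split())
    n = sum(tf.values())
    return sum(math.log((tf.get(t, 0) + 1) / (n + vocab_size))
               for q in queries for t in q.lower().split())

def should_stop(queries, sample_prev, sample_curr, epsilon=0.01):
    """phi_k = l(theta_k, Q) - l(theta_{k-1}, Q); stop when the gain
    from the newly downloaded documents falls below epsilon."""
    phi = (log_likelihood(queries, sample_curr)
           - log_likelihood(queries, sample_prev))
    return phi < epsilon
```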

Page 21: Stopping Criteria Bibliography

Leif Azzopardi, Mark Baillie, and Fabio Crestani. Adaptive query-based sampling for distributed IR. In Proceedings of the ACM SIGIR, pages 605–606. ACM, 2006.

Mark Baillie, Leif Azzopardi, and Fabio Crestani. An adaptive stopping criteria for query-based sampling of distributed collections. In String Processing and Information Retrieval (SPIRE), 2006.

James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of the ACM SIGIR, pages 340–347. ACM, 2006.

Page 22: Resource Description Summary

Cooperative (STARTS)

Uncooperative (QBS)

Sampling Queries

Stopping Criteria

Page 23: Resource Description References

Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec., 26(2):207–218, 1997.

Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97–130, 2001.

Leif Azzopardi, Mark Baillie, and Fabio Crestani. Adaptive query-based sampling for distributed IR. In Proceedings of the ACM SIGIR, pages 605–606. ACM, 2006.

Mark Baillie, Leif Azzopardi, and Fabio Crestani. An adaptive stopping criteria for query-based sampling of distributed collections. In String Processing and Information Retrieval (SPIRE), 2006.

James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of the ACM SIGIR, pages 340–347. ACM, 2006.

Milad Shokouhi, Justin Zobel, Seyed M. M. Tahaghoghi, and Falk Scholer. Using query logs to establish vocabularies in distributed information retrieval. Inf. Process. Manage., 43(1):169–180, 2007.

Page 24: Resource Selection

Lexicon-Based

Document-Surrogate

Page 25: Lexicon-Based Approaches

Page 26: Collection Retrieval Inference Network (CORI)

Collection =⇒ Super-Document
Bayesian inference network on super-documents
Adapted Okapi:

T = \frac{df_{t,i}}{df_{t,i} + 50 + 150 \cdot cw_i / \overline{cw}}

I = \frac{\log\left(\frac{C + 0.5}{cf_t}\right)}{\log(C + 1.0)}

p(t \mid C_i) = b + (1 - b) \cdot T \cdot I

Collections are ranked according to p(Q \mid C_i).

James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR, pages 21–28. ACM, 1995.
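A direct transcription of these formulas into code might look as follows; b = 0.4 is the default belief commonly used with CORI, and combining the per-term beliefs by averaging is one common operator choice (the inference network supports others).

```python
import math

def cori_score(query_terms, df, cw_i, avg_cw, cf, C, b=0.4):
    """CORI belief for one collection C_i given a query.

    df[t]  - document frequency of term t in this collection
    cw_i   - number of words in this collection
    avg_cw - average number of words over all collections
    cf[t]  - number of collections containing term t
    C      - total number of collections
    """
    beliefs = []
    for t in query_terms:
        T = df.get(t, 0) / (df.get(t, 0) + 50 + 150 * cw_i / avg_cw)
        I = (math.log((C + 0.5) / max(cf.get(t, 1), 1))
             / math.log(C + 1.0))
        beliefs.append(b + (1 - b) * T * I)      # p(t|C_i)
    return sum(beliefs) / len(beliefs) if beliefs else 0.0
```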

Page 27: Glossary-of-Servers Server (GlOSS)

Goodness(q, l, C) = \sum_{d \in Rank(q, l, C)} sim(q, d)

Rank(q, l, C) = \{ d \in C \mid sim(q, d) > l \}

Cooperative
Document and term statistics

Luis Gravano, Héctor García-Molina, and Anthony Tomasic. GlOSS: text-source discovery over the internet. ACM Trans. Database Syst., 24(2):229–264, 1999.
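The goodness definition is easy to state in code. The sketch below computes the ideal goodness directly from documents and an arbitrary similarity function; GlOSS itself only estimates this quantity from the document and term statistics that cooperative resources export.

```python
def goodness(query, collection_docs, sim, l):
    """Goodness(q, l, C): sum of similarities over Rank(q, l, C),
    the documents whose similarity to the query exceeds threshold l."""
    return sum(s for s in (sim(query, d) for d in collection_docs) if s > l)
```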

Page 28: Document-Surrogate Approaches

Page 29: Relevant Document Distribution Estimation (ReDDE)

Idea

One sampled document is relevant to a query ⇐⇒ \frac{|C|}{|S_C|} similar documents in the collection C are relevant to the query.

R(C, Q) \approx \sum_{d \in S_C} p(R \mid d) \, \frac{|C|}{|S_C|}

Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR, pages 298–305. ACM, 2003.

Page 30: Relevant Document Distribution Estimation (ReDDE)

Ranked sampled documents =⇒ ranked documents in a centralized retrieval system

Idea

A document d_j appears before a document d_i in the sample ⇐⇒ \frac{|C_j|}{|S_{C_j}|} documents appear before d_i in a centralized retrieval system.

Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}

Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR, pages 298–305. ACM, 2003.

Page 31: Relevant Document Distribution Estimation (ReDDE)

R(C, Q) \approx \sum_{d \in S_C} p(R \mid d) \, \frac{|C|}{|S_C|}

Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}

p(R \mid d) = \begin{cases} \alpha & \text{if } Rank_{centralized}(d) < \beta \cdot \sum_i |C_i| \\ 0 & \text{otherwise} \end{cases}

Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR, pages 298–305. ACM, 2003.
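Putting the three formulas together, ReDDE reduces to a single pass over the CSI ranking of the sampled documents. The sketch below is a direct transcription; the alpha and beta defaults are illustrative placeholders, not the paper's tuned values.

```python
def redde_scores(csi_ranking, sample_size, coll_size, alpha=1.0, beta=0.003):
    """ReDDE resource scores R(C, Q).

    csi_ranking - collection id of each sampled document, in CSI rank order
    sample_size - collection id -> number of sampled documents |S_C|
    coll_size   - collection id -> (estimated) collection size |C|
    """
    total_docs = sum(coll_size.values())
    centralized_rank = 0.0
    scores = {c: 0.0 for c in coll_size}
    for c in csi_ranking:
        weight = coll_size[c] / sample_size[c]    # one sampled doc stands for |C|/|S_C| docs
        if centralized_rank < beta * total_docs:  # p(R|d) = alpha above the rank cutoff
            scores[c] += alpha * weight
        centralized_rank += weight                # advance the estimated centralized rank
    return scores
```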

Page 32: Centralized-Rank Collection Selection (CRCS)

R(C, Q) \approx \sum_{d \in S_C} p(R \mid d) \, \frac{|C|}{|S_C|}

Linear:

R(d) = \begin{cases} \gamma - Rank_{sample}(d) & \text{if } Rank_{sample}(d) < \gamma \\ 0 & \text{otherwise} \end{cases}

Exponential:

R(d) = \alpha \exp(-\beta \cdot Rank_{sample}(d))

p(R \mid d) = \frac{R(d)}{|C_{max}|}

Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR, pages 160–172, 2007.
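CRCS differs from ReDDE only in how p(R|d) decays with the sampled document's rank. A sketch, with illustrative (not the paper's tuned) gamma, alpha, and beta values:

```python
import math

def crcs_scores(csi_ranking, sample_size, coll_size,
                gamma=50, alpha=1.2, beta=0.3, exponential=True):
    """CRCS resource scores from a CSI ranking of sampled documents."""
    c_max = max(coll_size.values())              # size of the largest collection
    scores = {c: 0.0 for c in coll_size}
    for rank, c in enumerate(csi_ranking):       # rank = Rank_sample(d), 0-based
        if exponential:
            r = alpha * math.exp(-beta * rank)   # CRCS(e): exponential decay
        else:
            r = max(gamma - rank, 0)             # CRCS(l): linear decay, cut at gamma
        scores[c] += (r / c_max) * coll_size[c] / sample_size[c]
    return scores
```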

Page 33: Resource Selection Comparison

[Fig. 3: R values for the CORI, ReDDE and CRCS algorithms on the trec123-FR-DOE-81col (left) and 100-col-gov2 (right) testbeds, plotted against cutoff 0-20. Curves: CORI, ReDDE, CRCS(l), CRCS(e).]

Table 2. Performance of different methods for the Trec4 (trec4-kmeans) testbed. TREC topics 201–250 (long) were used as queries.

          Cutoff=1                              Cutoff=5
          P@5     P@10    P@15    P@20          P@5     P@10    P@15    P@20
CORI      0.3000  0.2380  0.2133† 0.1910†       0.3480  0.2980  0.2587  0.2380
ReDDE     0.2160  0.1620  0.1373  0.1210        0.3480  0.2860  0.2467  0.2190
CRCS(l)   0.2960  0.2260  0.2013  0.1810†       0.3520  0.2920  0.2533  0.2310
CRCS(e)   0.3080  0.2400  0.2173† 0.1910†       0.3880  0.3160  0.2680  0.2510

...relevance judgments. We only report the results for cutoff=1 and cutoff=5. The former shows the system outputs when only the best collection is selected, while for the latter the best five collections are chosen to be searched for the query. We do not report the results for larger cutoff values because cutoff=5 has been shown to be a reasonable threshold for DIR experiments on real web collections [Avrahami et al., 2006]. The P@x values show the calculated precision on the top x results.

We select ReDDE as the baseline as it does not require training queries and its effectiveness has been found to be higher than CORI and older alternatives [Si and Callan, 2003a]. The following tables compare the performance of the discussed methods on different testbeds. We used the t-test to calculate the statistical significance of the difference between approaches. For each table, † and ‡ respectively indicate a significant difference at the 99% and 99.9% confidence intervals between the performance of ReDDE and the other methods.

Results in Table 2 show that on the Trec4 testbed, methods produce similar precision values when five collections are selected per query. The numbers also suggest that ReDDE is not successful in selecting the best collection. It produces poorer results than the other methods, and the difference is usually significant for P@15 and P@20.

On the uniform testbed (Table 3), CRCS(e) significantly outperforms the other alternatives for both cutoff values. CORI, ReDDE, and CRCS(l) show similar performance on this testbed. Comparing the results with the Rk values in Figure 1 confirms our previous statement that selecting collections with a high number of relevant documents does not necessarily lead to an effective retrieval.

Table 3. Performance of collection selection methods for the uniform (trec123-100col-bysource) testbed. TREC topics 51–100 (short) were used as queries.

          Cutoff=1                              Cutoff=5
          P@5     P@10    P@15    P@20          P@5     P@10    P@15    P@20
CORI      0.2520  0.2140  0.1960  0.1710        0.3080  0.3060  0.2867  0.2730
ReDDE     0.1920  0.1660  0.1413  0.1280        0.2960  0.2820  0.2653  0.2510
CRCS(l)   0.2120  0.1760  0.1520  0.1330        0.3440  0.3240  0.3067  0.2860
CRCS(e)   0.3800‡ 0.3060‡ 0.2613‡ 0.2260†       0.3960  0.3700† 0.3480† 0.3310†

For the representative testbed, as shown in Table 4, there is no significant difference between methods for cutoff=5. CRCS(l), CRCS(e) and ReDDE produce comparable performance when only the best collection is selected. The precision values for CORI when cutoff=1 are shown in italic in the original to indicate that they are significantly worse than ReDDE at the 99% confidence interval.

Table 4. Performance of collection selection methods for the representative (trec123-2ldb-60col) testbed. TREC topics 51–100 (short) were used as queries.

          Cutoff=1                              Cutoff=5
          P@5     P@10    P@15    P@20          P@5     P@10    P@15    P@20
CORI      0.2160  0.2040  0.1773  0.1730        0.3520  0.3500  0.3347  0.3070
ReDDE     0.3320  0.3080  0.2960  0.2850        0.3480  0.3220  0.3147  0.3010
CRCS(l)   0.3160  0.2980  0.2867  0.2740        0.3160  0.3160  0.2973  0.2810
CRCS(e)   0.2960  0.2760  0.2467  0.2340        0.3400  0.3500  0.3333  0.3090

On the relevant testbed (Table 5), all precision values for CORI are significantly inferior to those of ReDDE for both cutoff values. ReDDE in general produces higher precision values than the CRCS methods. However, none of the gaps are detected as statistically significant by the t-test at the 99% confidence interval.

Table 5. Performance of collection selection methods for the relevant (trec123-AP-WSJ-60col) testbed. TREC topics 51–100 (short) were used as queries.

          Cutoff=1                              Cutoff=5
          P@5     P@10    P@15    P@20          P@5     P@10    P@15    P@20
CORI      0.1440  0.1280  0.1160  0.1090        0.2440  0.2340  0.2333  0.2210
ReDDE     0.3960  0.3660  0.3360  0.3270        0.3920  0.3900  0.3640  0.3490
CRCS(l)   0.3840  0.3580  0.3293  0.3120        0.3800  0.3640  0.3467  0.3250
CRCS(e)   0.3080  0.2860  0.2813  0.2680        0.3480  0.3420  0.3280  0.3170

Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR, pages 160–172, 2007.

Page 34: Resource Selection Comparison


The results for CRCS(l), ReDDE and CORI are comparable on the non-relevant testbed (Table 6). CRCS(e) significantly outperforms the other methods in most cases. On the gov2 testbed (Table 7), CORI produces the best results when cutoff=1, while in the other scenarios there is no significant difference between the methods.

Table 6. Performance of collection selection methods for the non-relevant (trec123-FR-DOE-81col) testbed. TREC topics 51–100 (short) were used as queries.

          Cutoff=1                              Cutoff=5
          P@5     P@10    P@15    P@20          P@5     P@10    P@15    P@20
CORI      0.2520  0.2100  0.1867  0.1690        0.3200  0.2980  0.2707  0.2670
ReDDE     0.2240  0.1900  0.1813  0.1750        0.3480  0.2980  0.2707  0.2610
CRCS(l)   0.2040  0.1860  0.1813  0.1800        0.3360  0.3220  0.2973  0.2860
CRCS(e)   0.3200† 0.2820† 0.2533† 0.2240        0.3880  0.3600  0.3387† 0.3210†

Table 7. Performance of collection selection methods for the gov2 (100-col-gov2) testbed. TREC topics 701–750 (short) were used as queries.

          Cutoff=1                              Cutoff=5
          P@5     P@10    P@15    P@20          P@5     P@10    P@15    P@20
CORI      0.1592† 0.1347† 0.1143† 0.0969†       0.2735  0.2347  0.2041  0.1827
ReDDE     0.0490  0.0327  0.0286  0.0235        0.2163  0.1837  0.1687  0.1551
CRCS(l)   0.0980  0.0755  0.0667  0.0531        0.1959  0.1510  0.1442  0.1286
CRCS(e)   0.0857  0.0714  0.0748  0.0643        0.2776  0.2469  0.2272  0.2122

Overall, we can conclude that CRCS(e) selects better collections and its high performance remains robust. In none of the reported experiments were the precision values for CRCS(e) significantly poorer than those of any other method at the 99% confidence interval. However, in many cases the performance of CRCS(e) was significantly better than the second best method at the 99% or 99.9% confidence intervals. CORI and ReDDE showed variable performance on different testbeds, each outperforming the other on some datasets.

6 Conclusions

We have introduced a new collection selection method for uncooperative DIR environments. We have shown that our proposed CRCS method can outperform the current state-of-the-art techniques. We investigated the robustness of different collection selection algorithms and showed that the performance of ReDDE and CORI changes significantly on different testbeds, while CRCS produces robust results. We also introduced a new testbed for DIR experiments based on the TREC GOV2 dataset. Our proposed testbed is about 36 times larger than the most well-known DIR testbeds and more than 6 times larger than the largest DIR testbed ever reported. Moreover, unlike traditional DIR testbeds, where documents are assigned artificially into collections, collections in this testbed contain downloaded web documents arranged by server.

Experiments reported in this paper are based on the assumption that all collections are using the same retrieval model with equal effectiveness. However, in practice, collections often use different retrieval models and have different ...

Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR, pages 160–172, 2007.

Page 35: Resource Selection Summary

Lexicon-Based:
CORI
GlOSS

Document-Surrogate:
ReDDE
CRCS

Others

Page 36: Resource Selection References

James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR, pages 21–28. ACM, 1995.

Luis Gravano, Héctor García-Molina, and Anthony Tomasic. GlOSS: text-source discovery over the internet. ACM Trans. Database Syst., 24(2):229–264, 1999.

Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR, pages 298–305. ACM, 2003.

Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR, pages 160–172, 2007.

Page 37: Estimating Collection Size

Page 38: Capture-Recapture

X - the event that a randomly sampled document is already in the sample
X_n - the same event for n randomly sampled documents
Two samples, S_1 and S_2

E[X] = \frac{|S|}{|C|}, \qquad E[X_n] = n \cdot E[X] = n \cdot \frac{|S|}{|C|}

|S_1 \cap S_2| \approx \frac{|S_1| \, |S_2|}{|C|} \;\Longrightarrow\; \widehat{|C|} = \frac{|S_1| \, |S_2|}{|S_1 \cap S_2|}

Take two samples.
Count the number of common documents.

King-Lup Liu, Clement Yu, and Weiyi Meng. Discovering the representative of a search engine. In Proceedings of the ACM CIKM, pages 652–654. ACM, 2002.
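In code, the estimator is one line once the two samples are in hand; the sketch below assumes documents carry comparable identifiers (e.g., URLs) so that the overlap can be counted.

```python
def capture_recapture(sample1_ids, sample2_ids):
    """Estimate |C| from two independent samples:
    |C| ~= |S1| * |S2| / |S1 intersect S2|."""
    s1, s2 = set(sample1_ids), set(sample2_ids)
    overlap = len(s1 & s2)
    if overlap == 0:
        raise ValueError("no common documents: samples too small to estimate |C|")
    return len(s1) * len(s2) / overlap
```

For example, two samples of 300 documents sharing 3 documents give an estimate of 300 * 300 / 3 = 30,000 documents.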

Page 39: Sample-Resample

Randomly pick a term t from the sample.

A - the event that some sampled document contains t
B - the event that some document from the resource contains t

P(A) = \frac{df_{t,S}}{|S|}, \qquad P(B) = \frac{df_{t,C}}{|C|}

P(A) \approx P(B) \;\Longrightarrow\; \widehat{|C|} = \frac{df_{t,C} \cdot |S|}{df_{t,S}}

Send the query t to the resource to estimate df_{t,C}.

Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR, pages 298–305. ACM, 2003.
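A sketch of the estimator follows. The slide describes a single probe term; averaging over several probe terms is an obvious smoothing step we add here for illustration.

```python
def sample_resample(probes, sample_size):
    """Estimate |C| from term document frequencies.

    probes      - list of (df_t_sample, df_t_resource) pairs, where
                  df_t_resource is the hit count reported by the resource
                  when term t is sent as a one-term query
    sample_size - |S|, the number of sampled documents
    """
    estimates = [df_c * sample_size / df_s for df_s, df_c in probes if df_s > 0]
    if not estimates:
        raise ValueError("no usable probe terms")
    return sum(estimates) / len(estimates)       # average over probe terms
```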

Page 40: Results Merging

Page 41: Results Merging

The results merging process:

1 Selected resources return their top-ranked documents to the broker.
2 The broker merges the documents and returns a fused list to the user.

Not to be confused with data fusion, where the results come from a single resource and are then ranked by multiple retrieval models.

Page 42: Results Merging Issues

The results merging process involves a number of issues:

1 Duplicate detection and removal.
2 Normalising and merging relevance scores.

Different solutions have been proposed for these issues, depending on the DIR environment.

Page 43: Results Merging in Cooperative Environments

Results merging in cooperative environments is much simpler and has different solutions:

1 Fetch documents from each resource, reindex, and rank according to the broker's IR model.
2 Get information about the way the document scores are calculated and normalise the scores.

At the highest level of collaboration it is possible to ask the resources to adopt the same retrieval model!

Page 44: Collection Retrieval Inference Network (CORI)

The idea

Linear combination of the score of the database and the score of the document.

Normalised scores

Normalised collection selection score: C'_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}

Normalised document score: D'_j = \frac{D_j - D_{min}}{D_{max} - D_{min}}

Heuristic linear combination: D''_j = \frac{D'_j + 0.4 \cdot D'_j \cdot C'_i}{1.4}

J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR, pages 21–28. ACM, 1995.
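The heuristic combination in code, assuming the broker knows the minimum and maximum collection and document scores needed for normalisation:

```python
def cori_merge_score(d_score, d_min, d_max, r_score, r_min, r_max):
    """CORI heuristic merge: normalise both scores to [0, 1], then
    D'' = (D' + 0.4 * D' * C') / 1.4."""
    c_norm = (r_score - r_min) / (r_max - r_min)   # normalised collection score C'
    d_norm = (d_score - d_min) / (d_max - d_min)   # normalised document score D'
    return (d_norm + 0.4 * d_norm * c_norm) / 1.4
```

Documents from all selected collections are then sorted by D'' to produce the fused list.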

Page 45: Results Merging in Uncooperative Environments

In uncooperative environments, resources might provide scores:

But the broker does not have any information on how these scores are computed.
Score normalisation requires some way of comparing scores.

Alternatively, the resources might provide only rank positions:

But the broker does not have any information on the relevance of each document in the ranked lists.
Merging the ranks requires some way of comparing rank positions.

Page 46: Semi-Supervised Learning (SSL)

The idea

Train a regression model for each collection that maps resource document scores to normalised scores.
Requires that some returned documents are found in the Collection Selection Index (CSI).

Two cases:

1 Resources use identical retrieval models

2 Resources use different retrieval models

L. Si and J. Callan. Using sampled data and regression to merge search engine results. In Proceedings of the ACM SIGIR, pages 19–26. ACM, 2002.

Page 47: SSL with Identical Retrieval Models

The idea

SSL uses documents found in the CSI to train a single regression model that estimates the normalised score D'_{i,j} from the resource document score D_{i,j} and the score of the same document computed from the CSI, E_{i,j}.

Normalised scores

Having:

\begin{bmatrix} D_{1,1} & C_1 D_{1,1} \\ D_{1,2} & C_1 D_{1,2} \\ \vdots & \vdots \\ D_{n,m} & C_n D_{n,m} \end{bmatrix} \times \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}

Train:

D'_{i,j} = a \cdot D_{i,j} + b \cdot D_{i,j} \cdot C_i
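With NumPy, the regression is a single least-squares solve. A minimal sketch, assuming the overlap triples (D, C, E) have already been collected for the documents found in the CSI:

```python
import numpy as np

def ssl_train(D, C, E):
    """Fit [a, b] so that a*D + b*D*C ~= E over the CSI overlap documents.
    D: resource scores, C: collection scores, E: CSI scores (1-D arrays)."""
    D, C, E = map(np.asarray, (D, C, E))
    X = np.column_stack([D, D * C])
    (a, b), *_ = np.linalg.lstsq(X, E, rcond=None)
    return a, b

def ssl_normalise(a, b, d_score, c_score):
    """Map any returned document's resource score onto the CSI scale."""
    return a * d_score + b * d_score * c_score
```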

Page 48: SSL with Different Retrieval Models

The idea

SSL uses documents found in the CSI to train a different regression model for each resource.

Normalised scores

Having:

\begin{bmatrix} D_{1,1} & 1 \\ D_{1,2} & 1 \\ \vdots & \vdots \\ D_{n,m} & 1 \end{bmatrix} \times \begin{bmatrix} a_i \\ b_i \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}

Train:

D'_{i,j} = a_i \cdot D_{i,j} + b_i

Page 49: Sample-Agglomerate Fitting Estimate (SAFE)

The idea

For a given query, the results from the CSI form a subranking of the original collection, so curve fitting to the subranking can be used to estimate the original scores.
It does not require the presence of overlap documents in the CSI.

M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. ACM Transactions on Information Systems, 27(3):129, 2009.

Page 50: Sample-Agglomerate Fitting Estimate (SAFE)

Normalised scores

1 The broker ranks the documents available in the CSI for the query.

2 For each resource, the sampled documents (with non-zero scores) are used to estimate the merging scores, where each sampled document is assumed to be representative of a fraction |S_c|/|c| of the resource.

3 Use regression to fit a curve on the adjusted scores to predict the scores of the documents returned by the resource.
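A minimal sketch of the idea: scale the sample ranks up to collection ranks, fit a simple curve, and read off predicted scores. The log-linear fit below is one plausible choice for illustration; Shokouhi and Zobel evaluate several fitting functions.

```python
import numpy as np

def safe_scores(csi_sample_scores, n_sampled, coll_size, n_returned):
    """Estimate merging scores for a resource's top n_returned documents.

    csi_sample_scores - CSI scores of this resource's sampled documents
                        with non-zero score, sorted in descending order
    """
    scores = np.asarray(csi_sample_scores, dtype=float)
    # the sampled doc at sample rank i stands for rank i * |c| / |S_c|
    ranks = np.arange(1, len(scores) + 1) * (coll_size / n_sampled)
    a, b = np.polyfit(np.log(ranks), scores, 1)   # score ~= a * ln(rank) + b
    target_ranks = np.arange(1, n_returned + 1)
    return a * np.log(target_ranks) + b           # predicted scores by rank
```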

Page 51: More Results Merging in Uncooperative Environments

There are many other approaches to results merging:

STARTS uses the returned term frequency, document frequency, and document weight information to calculate the merging score based on similarities between documents.

CVV calculates the merging score according to the collection score and the position of a document in the returned collection rank list.

Another approach downloads small parts of the top returned documents and uses a reference index of term statistics for reranking and merging the downloaded documents.

L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the ACM SIGMOD, pages 207–218, 1997.
B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Conference on Database Systems for Advanced Applications, pages 41–50, 1997.
N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proceedings of the Australasian Database Conference, pages 189–200, 1999.

Page 52: Data Fusion in Metasearch

In data fusion methods, documents in a single collection are ranked by different search engines.

The goal is to generate a single accurate ranking list from the ranking lists of different retrieval models.

There are no collection samples and no CSI.

The idea

Use the voting principle: a document returned by many search systems should be ranked higher than the other documents.
If available, also take the rank of documents into account.

Page 53: Metasearch Data Fusion Methods

Many methods have been proposed:

Data Fusion

Round Robin.
CombMNZ, CombSUM, CombMAX, CombMIN.
Logistic regression (convert ranks to estimated probabilities of relevance).

A comparison between score-based and rank-based methods suggests that rank-based methods are generally less effective.

E. Fox and J. Shaw. Combination of multiple searches. In Proceedings of TREC, pages 105–108, 1994.
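CombSUM and CombMNZ are simple enough to show in a few lines; the sketch below assumes each engine's scores have already been normalised to a comparable range.

```python
from collections import defaultdict

def comb_sum_mnz(result_lists):
    """result_lists: one dict per engine, mapping doc id -> normalised score.
    Returns (CombSUM, CombMNZ) score dictionaries."""
    total, hits = defaultdict(float), defaultdict(int)
    for results in result_lists:
        for doc, score in results.items():
            total[doc] += score                        # CombSUM: sum of scores
            hits[doc] += 1                             # in how many lists doc appears
    comb_mnz = {d: total[d] * hits[d] for d in total}  # reward inter-engine agreement
    return dict(total), comb_mnz
```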

Page 54: Results Merging in Metasearch

We cannot use data fusion methods when collections are overlapping, but are not the same.

We cannot use data fusion methods when the retrieval models are different.

Web metasearch is the most typical example.

The idea

Normalise the document scores returned by multiple search engines using a regression function that compares the scores of overlapping documents between the returned ranked lists.

In the absence of overlap between the results, most metasearch merging techniques become ineffective.

S. Wu and F. Crestani. Shadow document methods of results merging. In Proceedings of the ACM SAC, pages 1067–1072, 2004.

Page 55: Results Merging References (1)

J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR, pages 21–28. ACM, 1995.

L. Si and J. Callan. Using sampled data and regression to merge search engine results. In Proceedings of the ACM SIGIR, pages 19–26. ACM, 2002.

M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. ACM Transactions on Information Systems, 27(3):129, 2009.

L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the ACM SIGMOD, pages 207–218, 1997.

Page 56: Results Merging References (2)

B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Conference on Database Systems for Advanced Applications, pages 41–50, 1997.

N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proceedings of the Australasian Database Conference, pages 189–200, 1999.

E. Fox and J. Shaw. Combination of multiple searches. In Proceedings of TREC, pages 105–108, 1994.

S. Wu and F. Crestani. Shadow document methods of results merging. In Proceedings of the ACM SAC, pages 1067–1072, 2004.

Page 57: Outline

1 Motivations

2 Architectures

3 Broker-Based Architecture

4 Applications

5 Advanced DIR

Page 58: Vertical

Page 59: Vertical

A specialized subcollection focused on a specific domain (e.g., news, travel, and local search) or a specific media type (e.g., images and video).

Vertical Selection

The task of selecting the relevant verticals, if any, in response to a user's query.

DIR Solution

Resource Selection

0-1 verticals

Page 60: Vertical Selection References

Fernando Diaz. Integration of news content into web results. In Proceedings of the ACM WSDM, pages 182–191. ACM, 2009.

Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR, pages 315–322. ACM, 2009.

Fernando Diaz and Jaime Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. In Proceedings of the ACM SIGIR, pages 323–330. ACM, 2009.

Page 61: Blog Distillation

The task of identifying blogs with a recurring central interest.

Blog Feed ⇐⇒ Posts
Federated Collection ⇐⇒ Documents

Jonathan L. Elsas, Jaime Arguello, Jamie Callan, and Jaime G. Carbonell. Retrieval and feedback models for blog feed search. In Proceedings of the ACM SIGIR, pages 347–354. ACM, 2008.

Jangwon Seo and W. Bruce Croft. Blog site search using resource selection. In Proceedings of the ACM CIKM, pages 1053–1062. ACM, 2008.

Page 62: Personalized Metasearch

Page 63: Personalized Metasearch

The broker provides a single search interface over all of a user's online resources.

Different collections:

individual folders, address books, calendars
the Web

Paul Thomas and David Hawking. Experiences evaluating personal metasearch. In IIiX, pages 136–138, 2008.

Paul Thomas and David Hawking. Server selection methods in personal metasearch: a comparative empirical study. Inf. Retr., 12(5):581–604, 2009.

Page 64: Others

Expert Search

The task of identifying experts with a given expertise

Experts ⇐⇒ Related documents

Resource Selection

Desktop Search

Different file and document types

Resource Selection

Results Fusion

Page 65: Applications Summary

Vertical Selection

Blog Distillation

Personalized Metasearch

Expert Search

Desktop Search

Your application


Outline

1 Motivations

2 Architectures

3 Broker-Based Architecture

4 Applications

5 Advanced DIR


Evolving Collections


Updating Resource Description

Query-Based Sampling

Given

N collections

n documents can be sampled at each time step

Distribution methods

Uniform

Popularity-based

Size-based
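A minimal sketch of how a broker might split the per-step budget of n sampled documents across the N collections under each policy. The collection attributes ('popularity', 'est_size') are assumed inputs for illustration, not something the slides prescribe:

def allocate_budget(n, collections, policy="uniform"):
    # collections: list of dicts with keys 'name', 'popularity', 'est_size'.
    if policy == "uniform":
        weights = [1.0] * len(collections)
    elif policy == "popularity":
        weights = [c["popularity"] for c in collections]
    elif policy == "size":
        weights = [c["est_size"] for c in collections]
    else:
        raise ValueError("unknown policy: " + policy)
    total = sum(weights)
    # Proportional allocation, rounded down; spare documents go to the
    # collections with the largest weights.
    alloc = {c["name"]: int(n * w / total) for c, w in zip(collections, weights)}
    spare = n - sum(alloc.values())
    ranked = sorted(zip(collections, weights), key=lambda cw: -cw[1])
    for c, _ in ranked[:spare]:
        alloc[c["name"]] += 1
    return alloc

collections = [
    {"name": "news", "popularity": 50, "est_size": 10_000},
    {"name": "archive", "popularity": 5, "est_size": 200_000},
]
print(allocate_budget(100, collections, policy="size"))
# -> {'news': 4, 'archive': 96}: the larger collection gets most of the budget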


Updating Methods Comparison

[Figure 5: The P@5 and P@10 values produced by running 448 queries on different collection representation sets, shown for CO = 3 and CO = 5 over crawls 1–8 (the CO values represent the number of collections selected for a query); curves compare the QL, CU, and SS updating policies against the 100-doc baseline.]

…documents sampled after the first crawl, and (2) the stronger baseline, which creates a new representation set at each time step by sampling 100 documents (discarding the previous documents, i.e., D_C^t = {d_1^t, …, d_100^t}).

Vocabulary of representation sets. Figure 4 compares the language models of representation sets with those of their corresponding collections in each crawl. As expected, the second baseline (using fresh representation sets of 100 documents at each crawl) remains consistently close to the actual collections. Conversely, not updating the representation sets results in a gradual deterioration, illustrated by the old-baseline. In comparison, the three updating policies display a consistent estimation close to the 100-doc baseline. This indicates that, by using an updating policy, the representation sets reflect the content changes for each collection. Note that QL, CU and SS do not evict the old sampled documents from collection representation sets. Some of these documents may be deleted or updated in collections over time. Therefore, the language models of representation sets for these methods are noisier than those of the 100-doc baseline. Overall, there appeared to be minimal difference between updating policies in terms of the representation set vocabulary, except after the first crawl, where the SS policy experienced a sharp increase in KL divergence. Across the subsequent crawls, the KL divergence for the SS policy began to converge towards the other policies.

One interesting observation during crawls seven and eight was that the KL divergence increased sharply for all three updating policies. This trend corresponds to the large decrease in the number of documents in the testbed overall (see Table 1). This large deletion of content from a number of collections affected the corresponding representation sets: documents still contained in a representation set but no longer in the collection increased the KL divergence. This suggests that update policies should also consider how to deal with (remove) antiquated documents from the representations in order to be more accurate.

5.1 Retrieval performance

Adopting an updating policy improved performance in comparison to both baselines (Fig. 5). Overall, the SS policy provided the most accurate and consistent performance. When comparing the precision of SS against the 100-doc baseline, the policy was found to provide a significant improvement over time, with the exception of crawl 2 (p < 0.001). This indicates that initially updating does not affect the performance, but over time the representation sets gradually go out-of-date and no longer reflect the underlying collection content. It is not clear to what extent this improvement was a result of the SS algorithm sampling more documents from the larger collections, of the larger collections being more dynamic than the smaller collections, or of a combination of both factors. What it does indicate, though, is that standard uniform sampling across collections is not an optimal strategy.

In comparison to SS, the QL and CU updating policies were not as consistent. QL recorded higher precision values on average than CU, although both approaches performed relatively worse than SS. When comparing QL against the baseline, the policy was found to significantly improve over the baseline during crawls 3 to 5 and crawl 7 (p < 0.05). Interestingly, using a naive policy, CU, resulted …



Evolving Collections References

Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing content changes in text databases. In ICDE, pages 606–617, 2005.

Milad Shokouhi, Mark Baillie, and Leif Azzopardi. Updating collection representations for federated search. In Proceedings of the ACM SIGIR, pages 511–518. ACM, 2007.

Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing changes in text databases. ACM Trans. Database Syst., 32(3):14, 2007.


Query


Query Expansion

Global
Expansion based on all retrieved documents
The same expanded query is sent to each resource

Local
Expansion for each resource, based only on its own documents

Local but general
Get expansion terms from each resource and select the best terms
The same expanded query is sent to each resource (see the sketch below)

Cluster
Cluster resources independently of a query
Use the Global or the Local-but-general approach for each cluster
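As an illustration of the Local-but-general strategy, here is a rough Python sketch: each resource proposes expansion terms from its own sampled documents, the broker keeps the terms proposed by the most resources, and the same expanded query is then sent everywhere. The bag-of-words scoring is a stand-in for a real pseudo-relevance-feedback model, and all data is illustrative:

from collections import Counter

def local_but_general_expansion(query, resource_samples, per_resource=5, final_k=3):
    q_terms = set(query.lower().split())
    votes = Counter()
    for docs in resource_samples.values():
        tf = Counter()
        for doc in docs:
            terms = set(doc.lower().split())
            if q_terms & terms:          # crude pseudo-relevance test
                tf.update(terms - q_terms)
        for term, _ in tf.most_common(per_resource):
            votes[term] += 1             # one vote per proposing resource
    best = [t for t, _ in votes.most_common(final_k)]
    return query + " " + " ".join(best)  # same expanded query for all resources

samples = {
    "news": ["jaguar car sales rise", "new jaguar car model"],
    "wildlife": ["jaguar habitat shrinking", "jaguar car sightings rare"],
}
print(local_but_general_expansion("jaguar", samples))
# 'car' ranks first because both resources propose it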


Query Expansion

Table 3: The impact of sample size on the effectiveness of query expansion methods for TREC topics 51–100. The expansion parameters are tuned using TREC topics 101–150 on 300-document summary sets (best case for 300-document summaries). CRCS is used to select three servers per query.

Testbed      Method    Sample size = 300 docs        Sample size = 2900 docs
                       P@5     P@10    MRR           P@5     P@10    MRR
relevant     BSNE      0.3040  0.2720  0.4687        0.3040  0.2720  0.4687
             Local     0.3120  0.2780  0.4499        0.3160  0.2780  0.4756
             Fuse      0.3160  0.2700  0.4498        0.3240  0.2780  0.4702
             Cluster   0.3040  0.2700  0.4556        0.3000  0.2780  0.4575
             Global    0.2960  0.2740  0.4233        0.3120  0.2800  0.4447
nonrel.      BSNE      0.4760  0.4180  0.6238        0.4760  0.4180  0.6238
             Local     0.4480  0.3940  0.5810        0.4680  0.4180† 0.6153
             Fuse      0.4440  0.3840  0.6186        0.4840‡ 0.4240‡ 0.6440
             Cluster   0.4760  0.4180  0.6346        0.4720  0.4160  0.6344
             Global    0.4880  0.4280  0.6508        0.4680  0.4220  0.6341
trec123      BSNE      0.4280  0.3740  0.5727        0.4280  0.3740  0.5727
             Local     0.3960  0.3500  0.4996        0.4000  0.3520  0.5466
             Fuse      0.4200  0.3540  0.5842        0.4040  0.3640  0.5406
             Cluster   0.4160  0.3860  0.5449        0.4440† 0.3760  0.5814
             Global    0.4120  0.3800  0.5119        0.4640† 0.3840  0.5834†
represent.   BSNE      0.3840  0.3600  0.5671        0.3840  0.3600  0.5671
             Local     0.3560  0.3300  0.4965        0.3840  0.3500  0.5108
             Fuse      0.3880  0.3680  0.5292        0.4160† 0.3660  0.5493
             Cluster   0.3840  0.3580  0.5699        0.4000  0.3580  0.5699
             Global    0.4080  0.3580  0.5331        0.3840  0.3660  0.5299

             BONE      0.4800  0.4400  0.6119        0.4800  0.4400  0.6119
             BOQE      0.5240  0.4960  0.6394        0.5240  0.4960  0.6394

…comparable across different experiments, we use the original 300 samples for server selection and result merging. Therefore, any change in search effectiveness is solely due to the quality of query expansion candidates. It can be seen from the table that increasing the sample size does not always improve effectiveness. Here, † and ‡ represent statistically significant differences between experiments with small and large samples. Contrary to earlier work [15], we find this is particularly the case for the global method, which performs poorly on two of the testbeds despite having more data to estimate query expansions from. However, the performance of the local and cluster methods generally improved with larger samples. This is intuitive, because larger samples provide richer sources of text for pseudo-relevance feedback and are more likely to be representative of servers' holdings. For the local methods this is more important, because this is the only information used for expansion.

Impact of Selection. A broker has a choice of algorithms for selection. We have so far been using CRCS for selection, as it has been shown to be one of the best-performing algorithms [18]; a further set of experiments considered the impact of using CORI [3], another popular algorithm for selection. By doing so, different servers will be selected, and it will be interesting to see whether this will improve or degrade the query expansion methods.

Figure 2 illustrates P@5 for each selection algorithm. What is immediately striking is that using the CORI selection algorithm results in a significant loss in performance: this was the case across almost any combination of parameters, testbeds, and expansion methods.

Table 4: The average performance of query expansion methods across different testbeds for TREC topics 101–150 and TREC topics 51–100.

TREC Topics 101–150
Method    P@5      P@10     MRR
BSNE      0.3220   0.2790   0.4858
Local     0.2810   0.2810   0.4820
Fuse      0.3100†  0.2775   0.4898
Cluster   0.3250†  0.2845   0.4899
Global    0.3140   0.2865†  0.4863

TREC Topics 51–100
Method    P@5      P@10     MRR
BSNE      0.3980   0.3560   0.5580
Local     0.3970   0.3505   0.5418
Fuse      0.4050   0.3530   0.5471
Cluster   0.4030   0.3550   0.5637
Global    0.4100   0.3605   0.5471


Paul Ogilvie and Jamie Callan. The effectiveness of query expansion for distributed information retrieval. In Proceedings of the ACM CIKM, pages 183–190. ACM, 2001.

Milad Shokouhi, Leif Azzopardi, and Paul Thomas. Effective query expansion for federated search. In Proceedings of the ACM SIGIR, pages 427–434. ACM, 2009.


Overlap

Resource Selection
Results Fusion

Remove duplicates from the result list
Give a higher score to a document that appears more than once


Resource Selection (Overlap Estimate)

Given

Collections C1 and C2

K overlap documents between them

Samples S1 and S2

D duplicate documents within them

Estimated number of overlap documents

K = (|C1| · |C2| · D) / (|S1| · |S2|)
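A quick numeric check of the estimator, with made-up numbers: two collections of 10,000 documents each, query-based samples of 300 documents each, and 9 duplicates observed between the samples:

# Capture-recapture style estimate of the overlap K (illustrative numbers).
C1, C2 = 10_000, 10_000   # (estimated) collection sizes
S1, S2 = 300, 300         # query-based sample sizes
D = 9                     # duplicate documents found between the two samples

K = (C1 * C2 * D) / (S1 * S2)
print(K)                  # -> 10000.0 estimated overlapping documents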


Resource Selection (Relax)

[Figure 1: The Relax selection on a sample graph. Each vertex (Cn) represents a federated collection. (A) Graph initialization, where R is the estimated number of relevant documents in each collection. (B) C4 is selected as the most relevant collection according to its R value; the weight we(u, v) of an edge between u and v is computed from the estimated number of documents common to u and v. (C)–(E) The status of the graph after selecting each collection (vertex).]

…estimated size of the collection. The number of sampled documents from collection C that are ranked in the top results (up to a fixed cutoff) by the central retrieval model is represented by r. Finally, |S| is the number of sampled documents from collection C. Relax uses the estimated values of R as the weights of the collections.

In the next step, the weight we(u, v) of an edge between two given collections u and v is calculated according to the approximated number of common relevant documents between u and v as follows:

we(u, v) = |Ru ∩ Rv| = (|Ru ∪ Rv| · K) / |Cu ∪ Cv|    (4)

Here, |Cu ∪ Cv| represents the total number of documents in both collections. Ru and Rv are, respectively, the estimated numbers of relevant documents in collections u and v. K is the estimated number of common documents between collections u and v, calculated by Eq. (3).

At each stage, the collection with the highest number of relevant documents is selected. The weights of the other collections are updated by subtracting the estimated number of their overlapped relevant documents with the selected collection (that is, by relaxing). In summary, our Relax selection method (Algorithm 1) is as follows:

1. Documents are downloaded from each collection using the query-based sampling technique.

2. The size of collections and the number of overlapped relevant documents between each pair of collections are estimated.

3. The federated environment is represented by a graph, where each vertex is a collection and the weight of each edge is computed using the number of common relevant documents between the connected pairs (Fig. 1).

4. The collection with the highest estimated number of relevant documents is selected.

5. Relax updates the graph by relaxing all collections and removing unnecessary edges.

6. Stop if there are no more edges or enough collections have been chosen. Otherwise, go to step 4.
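A compact Python sketch of the selection loop (steps 4–6), taking the estimated R values and edge weights from the previous steps as given; the graph in the usage example is illustrative, not the exact graph of Figure 1:

def relax_select(R, edges, max_collections=None):
    # R: {collection: estimated number of relevant documents}
    # edges: {frozenset({u, v}): estimated common relevant docs we(u, v)}
    R, edges = dict(R), dict(edges)
    selected = []
    while edges and (max_collections is None or len(selected) < max_collections):
        best = max(R, key=R.get)           # step 4: pick the top collection
        selected.append(best)
        for pair in list(edges):
            if best in pair:
                (other,) = pair - {best}
                R[other] -= edges[pair]    # step 5: relax the neighbour ...
                del edges[pair]            # ... and drop the spent edge
        del R[best]                        # step 6 loops until no edges remain
    return selected

R = {"C1": 8, "C2": 16, "C3": 15, "C4": 20}
edges = {frozenset({"C4", "C2"}): 4,
         frozenset({"C4", "C3"}): 5,
         frozenset({"C1", "C3"}): 2}
print(relax_select(R, edges))  # -> ['C4', 'C2', 'C3']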

Figure 1 shows a simple example of four overlapped collections. At the first step (A), the number of relevant documents in each collection, R, is estimated. At the next stage (B), the collection with the highest number of relevant documents (C4) is selected. The graph is relaxed by subtracting the estimated number of common relevant documents between the top collection and the connected collections. After each update, the edges connected to the most recently selected collection are removed. This process continues until there is no edge in the graph (C)–(E). That is, Relax selects collections according to the number of their unvisited relevant documents.

5. OVERLAP FILTERING FOR REDDE

Another strategy for avoiding duplicates in the final results is to remove collections with a high degree of overlap from the resource selection rankings. That is, initially the degree of overlap between collection pairs is estimated. Then, for each query, collections are ranked using a resource selection method such as ReDDE [Si and Callan, 2003a]. Each collection at rank µ is compared with the other collections at higher ranks. Collection Cµ is pruned from the original rank list if it has a large estimated overlap with at least …



Results Fusion

Remove duplicates from the result list

Give a higher score to a document that appears more than once

Fusion Methods

Document d appears in m of the n selected collections, with scores {s_i}.

Shadow Document: assumes that d also appears in the remaining n − m collections, each with the average score (1/m) Σ_{i=1}^{m} s_i, scaled by a coefficient k:

score(d) = Σ_{i=1}^{m} s_i + k (n − m) · (1/m) Σ_{i=1}^{m} s_i

Multi-Evidence:

score(d) = f(m) · (1/m) Σ_{i=1}^{m} s_i, where f(x) is a nondecreasing function
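Both scoring rules are easy to state in code. A sketch with illustrative choices for the free parameters (the slides fix neither k nor f):

def shadow_document_score(scores, n, k=0.5):
    # d was returned by m = len(scores) of the n selected collections;
    # pretend the other n - m collections scored it at the observed mean,
    # damped by the coefficient k.
    m = len(scores)
    mean = sum(scores) / m
    return sum(scores) + k * (n - m) * mean

def multi_evidence_score(scores, f=lambda m: m ** 0.5):
    # Boost the mean score by a nondecreasing function f of the number
    # of collections that returned d (sqrt is just one possible choice).
    m = len(scores)
    return f(m) * sum(scores) / m

# Document seen in 2 of 4 selected collections, with scores 0.8 and 0.6:
print(shadow_document_score([0.8, 0.6], n=4))  # 1.4 + 0.5 * 2 * 0.7 = 2.1
print(multi_evidence_score([0.8, 0.6]))        # sqrt(2) * 0.7 ≈ 0.99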


Results Fusion

[Fig. 1: Performances of six methods (MEM, SDM, Round-robin, Bayesian, Borda, CombMNZ) with different overlap rates, under [0,1] normalization. Y-axis: average precision at eight document levels (5, 10, 15, 20, 25, 30, 50, 100); x-axis: range of overlap rate (1: 0–0.2; 2: 0.2–0.4; 3: 0.4–0.6; 4: 0.6–0.8; 5: 0.8–1.0).]

…is not surprising, since all these methods exploit the multiple-evidence principle to boost the rankings of commonly retrieved documents, while Round-robin does not. Because CombMNZ relies most on this principle, it suffers the most when the overlap rate is low. In the same situation, SDM, MEM, Borda fusion and Bayesian fusion do not suffer as much, since a balancing mechanism is introduced.

In the following, let us discuss the experimental results in two groups, each with a different normalization method. First, with [0,1] normalization, Fig. 1 shows the average precision of six methods at eight document levels (5, 10, 15, 20, 25, 30, 50 and 100) with different overlap rates. SDM and MEM are close in performance in most situations. They perform better than Round-robin when the overlap rate among databases is no less than 40% (the differences are not always significant), they are close to Round-robin when the overlap rate is between 20% and 40%, and they are not as good as Round-robin when the overlap rate is less than 20% (the differences are not significant in most cases). CombMNZ is not as good as SDM and MEM in all cases (the maximal overlap rate used is about 95% in the experiment). However, we do not try to find out in which condition CombMNZ is better than SDM and MEM; if such a condition exists, it must be very close to 100%. Bayesian fusion does not work as well as SDM and MEM. Its performance is slightly better than Round-robin when the overlap rate is more than 80%. Borda fusion is not as good as Round-robin in all situations.

Figure 2 shows the average precision of four methods at eight document levels (5, 10, 15, 20, 25, 30, 50 and 100) with linear regression. SDM and MEM perform slightly better than Round-robin when the overlap rate is above 80%. When the overlap rate is between 60% and 80%, the performances of SDM and Round-robin are close. CombMNZ's performance is close to Round-robin's when the overlap rate is over 80%. In all other situations, Round-robin outperforms the three other methods.

In summary, both SDM and MEM perform better with [0,1] linear normalization than with linear regression normalization in most situations and on average. The only exception …



Overlapping Collections References

Milad Shokouhi and Justin Zobel. Federated text retrieval from uncooperative overlapped collections. In Proceedings of the ACM SIGIR, pages 495–502. ACM, 2007.

Yaniv Bernstein, Milad Shokouhi, and Justin Zobel. Compact features for detection of near-duplicates in distributed retrieval. In SPIRE, pages 110–121, 2006.

Shengli Wu and Sally McClean. Result merging methods in distributed information retrieval with overlapping databases. Inf. Retr., 10(3):297–319, 2007.

Shengli Wu and Fabio Crestani. Shadow document methods of results merging. In Proceedings of the ACM Symposium on Applied Computing, pages 1067–1072. ACM, 2004.


More DIR Summary

Evolving Collections

Query Expansion

Overlapping Collections

Multilingual Search

Distributed Multimedia Information Retrieval

Luo Si, Jamie Callan, Suleyman Cetintas, and Hao Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Inf. Retr., 11(1):1–24, 2008.

Jamie Callan, Fabio Crestani, and Mark Sanderson. Distributed Multimedia Information Retrieval: SIGIR 2003 Workshop on Distributed Information Retrieval, Toronto, Canada, August 2003: Revised, Selected, and Invited Papers (Lecture Notes in Computer Science, 2924). Springer-Verlag, 2004.


Q & A
