MetaSearch vs Harvesting and Indexing Lukas Koster Library of the University of Amsterdam -- 2009

Upload: lukas-koster

Post on 19-Jan-2015




A comparison between metasearch/federated search and harvesting & indexing in libraries.


Page 1: MetaSearch vs Harvesting and Indexing

MetaSearch vs Harvesting and Indexing

Lukas Koster
Library of the University of Amsterdam
http://commonplace.net
2009

Page 2: MetaSearch vs Harvesting and Indexing

MetaSearch vs Harvesting and Indexing - Lukas Koster - 2009

So many databases to search

Page 3: MetaSearch vs Harvesting and Indexing

MetaSearch – Federated Search

Search
Translate search syntax

Conversion
Merging
Deduplication
Ranking (first 30 per DB)

MetaSearch tool: Search Engine, Database Connectors
Databases

Searching and data fetching: one integrated, interdependent, on-the-fly procedure
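
The on-the-fly procedure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "remote databases", their query syntaxes (`au=` and `creator:`), their record formats, and the function names are all hypothetical, and an alphabetical sort stands in for real relevance ranking.

```python
BATCH_SIZE = 30  # "first 30 per DB", as on the slide

# Two hypothetical remote databases with different native record formats.
CATALOGUE = [{"ti": "The Turn of the Screw", "au": "James, Henry"},
             {"ti": "Daisy Miller", "au": "James, Henry"}]
REPOSITORY = [{"title": "The Turn of the Screw", "creator": "Henry James"},
              {"title": "Washington Square", "creator": "Henry James"}]

def search_catalogue(query: str) -> list[dict]:
    # This database is assumed to expect 'au=<name>'.
    field, _, value = query.partition("=")
    return [r for r in CATALOGUE
            if field == "au" and value.lower() in r["au"].lower()]

def search_repository(query: str) -> list[dict]:
    # This database is assumed to expect 'creator:<name>'.
    field, _, value = query.partition(":")
    return [r for r in REPOSITORY
            if field == "creator" and value.lower() in r["creator"].lower()]

# One connector per database: (translate query, run search, convert a hit).
CONNECTORS = {
    "catalogue": (lambda s: f"au={s}", search_catalogue,
                  lambda r: {"title": r["ti"]}),
    "repository": (lambda s: f"creator:{s}", search_repository,
                   lambda r: {"title": r["title"]}),
}

def metasearch(author_surname: str) -> list[dict]:
    merged = []
    for name, (translate, search, convert) in CONNECTORS.items():
        # Translate, search, and keep only the first batch per database.
        hits = search(translate(author_surname))[:BATCH_SIZE]
        # Convert each hit to the common format and merge into one list.
        merged.extend({**convert(h), "source": name} for h in hits)
    # Deduplicate on a normalised title, keeping the first occurrence.
    seen, unique = set(), []
    for rec in merged:
        key = rec["title"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # Rank: an alphabetical sort is a placeholder for relevance ranking.
    return sorted(unique, key=lambda r: r["title"])
```

Calling `metasearch("James")` runs every step (translation, fetching, conversion, merging, deduplication, ranking) inside the user's request, which is why each step adds to the response time.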

Page 4: MetaSearch vs Harvesting and Indexing

Technical bottlenecks

Search
Translate search syntax

Conversion
Merging
Deduplication
Ranking (first 30 per DB)

Access Authorisation

MetaSearch tool: Search Engine, Database Connectors
Databases

Page 5: MetaSearch vs Harvesting and Indexing

Technical bottlenecks

Changes in:
  Remote database server IP address
  Remote database server hostname
  Remote database server configuration
  Remote database authentication
  Firewall
  Database system
  Network

Page 6: MetaSearch vs Harvesting and Indexing

MetaSearch limitations

Differences in searches, indexes
  Author
  Subject
  Multiple languages

Speed (slowness)
Limited number of searchable databases
Not all results in first set
Relevance

Page 7: MetaSearch vs Harvesting and Indexing

Author searches

Variations in author name storage formats
  Henry James
  James, Henry
  James, H.
  H.James
  Which Henry James?
  Or is it: Henry, J./James Henry?

Variations in supported search formats
  Only one?
  All of the above?
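
To make the storage-format problem concrete, here is a minimal sketch of reducing the name variants above to one comparison key. The function `author_key` and its rule (the last token is the surname unless a comma says otherwise) are assumptions for illustration, not a method taken from the slides.

```python
import re

def author_key(name: str) -> tuple[str, str]:
    """Reduce a name to (surname, first initial), assuming Western name order."""
    name = name.strip()
    if "," in name:
        # "James, Henry" or "James, H.": surname comes before the comma.
        surname, _, given = name.partition(",")
    else:
        # "Henry James" or "H.James": the last token is taken as the surname.
        parts = [p for p in re.split(r"[.\s]+", name) if p]
        surname, given = parts[-1], " ".join(parts[:-1])
    given = given.strip()
    initial = given[0].upper() if given else ""
    return surname.strip().casefold(), initial
```

All four stored forms of "Henry James" collapse to the same key, but a comma-less inverted form such as "James Henry" is read as surname "Henry". A key like this also cannot tell which Henry James is meant, so the ambiguity raised on the slide remains.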

Page 8: MetaSearch vs Harvesting and Indexing

Variations in author names

Page 9: MetaSearch vs Harvesting and Indexing

Subject searches

Different qualification, keyword schemes per database
  LoC Subject Headings
  Dutch Basic Classification
  Local subject schemes

Different use of subjects per database
  Cooking
  Cookery
  Food

Different use of subjects within one database
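
One way to picture the subject problem is a crosswalk that maps each database's local term to a shared concept, so that a query for one term can be expanded to the others. The table below is hypothetical, and treating "Food" as equivalent to "Cooking" is itself a cataloguing decision assumed here only for illustration.

```python
# Hypothetical crosswalk: local subject term -> shared concept label.
SUBJECT_CROSSWALK = {
    "cooking": "cookery",
    "cookery": "cookery",
    "food": "cookery",
}

def expand_subject(term: str) -> set[str]:
    """Return every local term that shares a concept with `term`."""
    concept = SUBJECT_CROSSWALK.get(term.casefold())
    if concept is None:
        # Unknown terms are passed through unchanged.
        return {term.casefold()}
    return {t for t, c in SUBJECT_CROSSWALK.items() if c == concept}
```

A metasearch tool would have to apply such an expansion per database at query time; it does not help when one database uses its own subjects inconsistently.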


Page 10: MetaSearch vs Harvesting and Indexing

Multilingual searches

All words searches
Subject searches
  English “cooking”
  Japanese “???”

Title searches
  Translations (We need FRBR!)

Author searches (historical names)
  See: Erasmus

Page 11: MetaSearch vs Harvesting and Indexing

All processing on the fly

Issues, dependent on each other:
  Speed (slowness)
  Limited number of searchable databases
  Not all results in first set
  Relevance

Page 12: MetaSearch vs Harvesting and Indexing

Speed (slowness)

Dependent on

1. Search term transformation

2. Response time of external databases

3. Speed of internet connection

4. Conversion of results to presentation format

5. Merging of results

6. Deduplication of results

7. Relevance ranking
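
The seven factors above add up. A rough back-of-the-envelope model, assuming the external databases are queried in parallel (so the slowest one dominates) and using made-up timings in milliseconds, could look like this. The function name and all numbers are hypothetical.

```python
def metasearch_latency_ms(translate, db_responses, network, convert,
                          merge, dedup, rank, parallel=True):
    """Estimate total response time in milliseconds for one metasearch."""
    # In parallel the slowest database sets the pace; in sequence they add up.
    if parallel:
        remote = max(db_responses, default=0)
    else:
        remote = sum(db_responses)
    return translate + remote + network + convert + merge + dedup + rank

# Three databases answering in 200, 1500 and 400 ms.
total = metasearch_latency_ms(translate=5, db_responses=[200, 1500, 400],
                              network=100, convert=50, merge=20,
                              dedup=80, rank=40)
```

With these illustrative numbers a single slow database accounts for most of the total, even though the local steps are all fast.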

Page 13: MetaSearch vs Harvesting and Indexing

Limited number of databases

Searching too many databases takes too long

Local processing time influenced by
  Merging (takes time)
  Deduplication (takes time)
  Ranking (takes time)

Page 14: MetaSearch vs Harvesting and Indexing

Not all results in first set

Merging, deduplication, ranking of all results takes too long

Only first 30 or so of each database are processed initially

Get more: next 30 per database are fetched and processed
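
The "first 30, then the next 30" behaviour amounts to reading each database's result list in fixed-size batches. A minimal sketch, with the helper name `batches` being an assumption:

```python
from itertools import islice

BATCH_SIZE = 30  # "first 30 or so", as on the slide

def batches(results, size=BATCH_SIZE):
    """Yield successive batches from one database's result list."""
    it = iter(results)
    # Stop as soon as islice returns an empty batch.
    while batch := list(islice(it, size)):
        yield batch

# A database with 70 hits is delivered as batches of 30, 30 and 10.
sizes = [len(b) for b in batches(range(70))]
```

Only the first batch per database is merged, deduplicated and ranked for the initial result page; later batches are processed when the user asks for more.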

Page 15: MetaSearch vs Harvesting and Indexing


Relevance

Dependent on default sort order (relevance? date?) of each external database

Dependent on default ranking mechanism of each database

Local ranking initially performed on first batches of 30 records per database

After fetching additional records, ranking is done again: initial top results may go down
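
A small example shows why the initial top results can drop. The records, scores and batch size below are invented for illustration; each database is assumed to return hits in its own default order (not by relevance), and the local score is a made-up number.

```python
def rank(records):
    """Sort by descending local relevance score."""
    return sorted(records, key=lambda r: r["score"], reverse=True)

BATCH = 2  # kept small for the example; the slides mention about 30

# Results in each database's own default order, not in relevance order.
db_a = [{"id": "a1", "score": 0.4}, {"id": "a2", "score": 0.3},
        {"id": "a3", "score": 0.9}]
db_b = [{"id": "b1", "score": 0.5}, {"id": "b2", "score": 0.2},
        {"id": "b3", "score": 0.7}]

# Ranking over the first batch from each database only.
first = rank(db_a[:BATCH] + db_b[:BATCH])
# Ranking again after the next batch from each database has been fetched.
after_more = rank(db_a[:2 * BATCH] + db_b[:2 * BATCH])

first_ids = [r["id"] for r in first]
after_ids = [r["id"] for r in after_more]
```

Here `b1` is the top hit in the first ranking but falls to third place once `a3` and `b3` arrive, because the best records were not in the first batches.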

Page 16: MetaSearch vs Harvesting and Indexing


1. Don’t rank

2. Don’t deduplicate

3. Don’t merge (in advance)

If you don’t merge, there is no point in deduplicating or ranking!!

1. “Does not make much sense anyway”

2. “Does not always work anyway”

3. “So, you have separate lists that you can merge later on”

Page 17: MetaSearch vs Harvesting and Indexing

Search with MetaSearch

Page 18: MetaSearch vs Harvesting and Indexing

Translate search syntax on the fly

Page 19: MetaSearch vs Harvesting and Indexing

Fetching results

Page 20: MetaSearch vs Harvesting and Indexing

Conversion of results on the fly

Page 21: MetaSearch vs Harvesting and Indexing

Conversion of results on the fly

Page 22: MetaSearch vs Harvesting and Indexing

Conversion of results on the fly

Page 23: MetaSearch vs Harvesting and Indexing

Results with MetaSearch

Page 24: MetaSearch vs Harvesting and Indexing

Results with MetaSearch

Page 25: MetaSearch vs Harvesting and Indexing

Harvesting and indexing



Central index

H&I tool Databases


Searching and Data fetching: Two completely separate procedures
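
The separation of the two procedures can be sketched with a toy central index. The class `CentralIndex`, its record format and the whitespace tokenisation are assumptions for illustration; real H&I tools are far more elaborate.

```python
from collections import defaultdict

class CentralIndex:
    def __init__(self):
        self.records = {}                 # record id -> record
        self.postings = defaultdict(set)  # word -> set of record ids

    def harvest(self, source, records):
        """Procedure 1 (in advance): fetch records and index them locally."""
        for i, rec in enumerate(records):
            rec_id = f"{source}:{i}"
            self.records[rec_id] = rec
            for word in rec["title"].casefold().split():
                self.postings[word].add(rec_id)

    def search(self, query):
        """Procedure 2 (at query time): only the local index is consulted."""
        words = query.casefold().split()
        if not words:
            return []
        # A record matches when it contains every query word.
        ids = set.intersection(*(self.postings.get(w, set()) for w in words))
        return [self.records[i] for i in sorted(ids)]

index = CentralIndex()
index.harvest("catalogue", [{"title": "Daisy Miller"},
                            {"title": "The Turn of the Screw"}])
index.harvest("repository", [{"title": "Washington Square"}])
```

A call such as `index.search("turn screw")` never contacts the source databases, so their response times and availability do not affect the search itself.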

Search Engine

Page 26: MetaSearch vs Harvesting and Indexing

Advantages of H&I

Speed
No maximum number of searchable databases
All results in first set
No differences in searches, indexes
Relevance
Fewer technical bottlenecks
Central index always available in case of connection problem

Page 27: MetaSearch vs Harvesting and Indexing

H&I: Aquabrowser

Page 28: MetaSearch vs Harvesting and Indexing

H&I: Primo

Page 29: MetaSearch vs Harvesting and Indexing

MetaSearch = “Just in time”

Bookshop – Central Book Deposit

Always order on request
Risk of logistics problems

Page 30: MetaSearch vs Harvesting and Indexing

H&I = “Just in case”

Bookshop with large stock
Customers always find something
Maybe not the most recent stuff

