Uwe Schindler GES 2007 – May 2-4, 2007 Data Information Service based on Open Archives Initiative Protocols and Apache Lucene Uwe Schindler 1 , Benny Bräuer 2 , Michael Diepenbroek 1 1 PANGAEA ® Group at MARUM, University of Bremen, Bremen, Germany 2 Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany


Page 1

Uwe Schindler GES 2007 – May 2-4, 2007

Data Information Service based on Open Archives Initiative Protocols and Apache Lucene

Uwe Schindler (1), Benny Bräuer (2), Michael Diepenbroek (1)

(1) PANGAEA® Group at MARUM, University of Bremen, Bremen, Germany
(2) Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany

Page 2

Metadata Portals & Grid

• WDC-MARE with its information system PANGAEA® currently provides data portals for several EU/international projects.

• Not all data are stored centrally, so all datasets provided in portals must be consolidated from different sources!

• Features:
– Data stays at the data providers
– Metadata is harvested by the portal
– Search queries are handled by the centralized catalogue (Google-like search speed!)
– Scientist gets a link to the data at the provider

Metadata portal software is sufficient for C3-Grid, too!

Page 3

Metadata in C3-Grid

• Goal: build up an infrastructure for the earth system community in Germany

• Problem: we need an architecture which makes it possible to:
– Collect metadata files from data providers
– Store them in a “central index”
– Provide fast, generic access to this data for our users

Solution: Data Information Service

Page 4

Open Archives Protocol

• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.

• Almost all digital libraries support it (most famous ones: Fedora, arXiv and the CERN Document Server)

• Portals by Scientific Commons, OAIster, SUB

• uses it during web crawling (if available)

• Very simple to implement (XML over HTTP-REST)

• Repository software for databases or file system metadata providers is widely available (C3 uses mostly DLESE jOAI software on the data provider side)
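
To illustrate "XML over HTTP-REST": a minimal Python sketch of one harvesting step, assuming the standard OAI-PMH `ListRecords` verb with its `metadataPrefix`, `from` and `resumptionToken` arguments. The base URL `http://example.org/oai`, the sample response and the function names are made up for illustration; a real harvester would fetch the URL over HTTP and loop until no resumption token is returned.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, from_date=None,
                     resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # resumptionToken is exclusive: only the verb may accompany it.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
        if from_date:
            params["from"] = from_date  # incremental harvesting
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        header = record.find("oai:header", OAI_NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            "deleted": header.get("status") == "deleted",
        })
    token = root.findtext(".//oai:resumptionToken", default="",
                          namespaces=OAI_NS).strip()
    return records, (token or None)


# Made-up response, shortened to the parts used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-05-02T12:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:ABC123</identifier>
        <datestamp>2007-04-30</datestamp>
      </header>
      <metadata><placeholder/></metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:XYZ223</identifier>
        <datestamp>2007-05-01</datestamp>
      </header>
    </record>
    <resumptionToken>token-42</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://example.org/oai", "oai_dc", "2007-01-01")
records, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

Records flagged as deleted would be removed from the central index; all others would be (re)indexed.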


Page 6

Central indexing requirements

1. Open for any XML metadata format

2. Any mappings to document fields should be done by XPath

3. Possibility to map incompatible XML schemas during harvesting by XSLT on-the-fly

4. On-the-fly validation of (transformed) documents during harvesting

5. No relational database, only a full text search engine, that contains everything needed for operation

6. Range queries on specific fields (date/time or numeric)

7. Web service interface / programming API for the end user interface that is accessible from any language (Java/JSP, PHP, Perl,...)
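
Requirement 2 could look roughly like this: a small Python sketch in which a configuration table maps index field names to XPath expressions. The metadata format, the field names and the paths are invented for illustration, and the standard library's ElementTree supports only a subset of XPath, so this is a sketch of the idea rather than of the actual software.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: index field name -> XPath expression.
FIELD_XPATHS = {
    "identifier": "./identifier",
    "author": "./creators/creator",
    "abstract": "./abstract",
    "min_lat": "./coverage/south",
}

# Made-up metadata record in a made-up schema.
SAMPLE_METADATA = """<dataset>
  <identifier>ABC:123</identifier>
  <creators>
    <creator>A. Author</creator>
    <creator>B. Writer</creator>
  </creators>
  <abstract>Air pressure in the region</abstract>
  <coverage><south>-23.23</south></coverage>
</dataset>"""


def extract_fields(xml_text, field_xpaths):
    """Evaluate every configured XPath and collect the text of all matches."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name, xpath in field_xpaths.items():
        fields[name] = [
            element.text.strip()
            for element in root.findall(xpath)
            if element.text and element.text.strip()
        ]
    return fields


fields = extract_fields(SAMPLE_METADATA, FIELD_XPATHS)
```

Supporting another metadata format then only means supplying another mapping table; the indexing code itself stays unchanged.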

Page 7

Apache Lucene features

• Ranked searching: best results returned first

• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries for date/time values and numbers

• Fielded searching: all fields are searchable as a whole, each field separately (e.g. for author, parameter), or mixed

• Any combination of boolean operators between search terms (AND, OR, NOT, exact phrase)

• Sorting by any field

• Multiple-index searching with merged results

• Simultaneous searching and updates due to high-performance indexing

Page 8

Generic Framework

[Figure: architecture diagram; only the labels survive]

– Sources: Data Provider (×2), File System
– Harvesters: OAI-PMH Harvester (×2), Directory Harvester
– Index Builder: accept document as DOM tree; transform by XSL; validate against schema; apply XPath → field; serialize DOM → XML blob; add document to index
– Storage: Lucene Index (×3), Virtual Index (×2)
– Search Interface (×2)
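
The index builder steps named in the diagram can be sketched in Python as follows. The order (transform first, then validate) follows the requirement to validate transformed documents. Everything here is a stand-in: the `IndexBuilder` class and its hooks are hypothetical, a plain Python function replaces the XSL transformation, a simple check replaces schema validation, and a list replaces the Lucene index.

```python
import xml.etree.ElementTree as ET


class IndexBuilder:
    """Toy pipeline: DOM tree -> transform -> validate -> fields + blob."""

    def __init__(self, field_paths, transform=None, validate=None):
        self.field_paths = field_paths  # field name -> XPath expression
        self.transform = transform      # stand-in for "transform by XSL"
        self.validate = validate        # stand-in for schema validation
        self.documents = []             # stand-in for the Lucene index

    def add_document(self, xml_text):
        dom = ET.fromstring(xml_text)            # accept document as DOM tree
        if self.transform is not None:
            dom = self.transform(dom)            # map foreign schema on the fly
        if self.validate is not None and not self.validate(dom):
            return False                         # reject invalid documents
        fields = {
            name: [el.text for el in dom.findall(path) if el.text]
            for name, path in self.field_paths.items()
        }                                        # apply XPath -> field
        blob = ET.tostring(dom, encoding="unicode")  # serialize DOM -> XML blob
        self.documents.append({"fields": fields, "blob": blob})
        return True


def rename_id(dom):
    """Example transformation: a provider uses <id> instead of <identifier>."""
    for element in dom.iter("id"):
        element.tag = "identifier"
    return dom


def has_identifier(dom):
    """Example validation rule: every document needs an identifier."""
    return dom.find("./identifier") is not None


builder = IndexBuilder(
    {"identifier": "./identifier", "abstract": "./abstract"},
    transform=rename_id,
    validate=has_identifier,
)
accepted = builder.add_document(
    "<dataset><id>XYZ:223</id><abstract>humid region</abstract></dataset>")
rejected = builder.add_document(
    "<dataset><abstract>no identifier</abstract></dataset>")
```

Storing the serialized XML blob next to the fields is what lets the index serve complete metadata records without a relational database.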


Page 10

Search Interface

• Supports all standard Lucene search features

• Additional support for fast range queries to enable bounding boxes, etc.:
– implemented by redundant storage of “numerical terms” in different precisions
– recursive reduction of distinct terms (every numerical value is a term) on range query
– search time no longer dependent on index size

• Accessible via Java API or AXIS web service
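
The slide gives no implementation details, so the following Python sketch is only one possible reading of the range-query idea. Assumptions: values are non-negative 16-bit integers (`BITS`), and each value is stored once per precision level, dropping 4 more low bits each time (`STEP`). A range is then covered by a handful of coarse terms in the middle and fine terms at the edges. Real coordinates would first need an order-preserving mapping to integers, which is not shown.

```python
from collections import defaultdict

BITS = 16  # value width (assumption for this sketch)
STEP = 4   # bits dropped per precision level (assumption)


def index_terms(value):
    """All (shift, prefix) terms stored redundantly for one value."""
    return [(shift, value >> shift) for shift in range(0, BITS, STEP)]


def split_range(lo, hi):
    """Cover the inclusive range [lo, hi] with few (shift, prefix) terms."""
    terms = []
    shift = 0
    while True:
        diff = 1 << (shift + STEP)
        mask = ((1 << STEP) - 1) << shift
        has_lower = (lo & mask) != 0     # lower edge not aligned to next level
        has_upper = (hi & mask) != mask  # upper edge not aligned to next level
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + STEP >= BITS or next_lo > next_hi:
            # Nothing left for a coarser level: finish at this precision.
            terms += [(shift, p) for p in range(lo >> shift, (hi >> shift) + 1)]
            return terms
        if has_lower:
            terms += [(shift, p)
                      for p in range(lo >> shift, ((lo | mask) >> shift) + 1)]
        if has_upper:
            terms += [(shift, p)
                      for p in range((hi & ~mask) >> shift, (hi >> shift) + 1)]
        lo, hi = next_lo, next_hi
        shift += STEP


def build_index(values_by_doc):
    """Postings: (shift, prefix) term -> set of document ids."""
    postings = defaultdict(set)
    for doc_id, value in values_by_doc.items():
        for term in index_terms(value):
            postings[term].add(doc_id)
    return postings


def range_query(postings, lo, hi):
    """Union of the postings of all terms covering [lo, hi]."""
    hits = set()
    for term in split_range(lo, hi):
        hits |= postings.get(term, set())
    return hits


docs = {1: 3, 2: 16, 3: 287, 4: 300, 5: 301, 6: 40000}
index = build_index(docs)
hits = range_query(index, 3, 300)
```

The number of terms a query touches is bounded by `BITS` and `STEP`, not by how many distinct values the index holds, which is one way to read "search time no longer dependent on index size".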


Page 12

C3 Implementation

Fig. by T. Langhammer, ZIB

[Figure: diagram; only the labels survive]

– Portal: Google-style and range queries
– DIS: web service frontend; Apache Lucene index (full-text index, document cache); indexing of selected fields; harvesting backend
– OAI-PMH: CERA, PANGAEA®, Other Data Provider
– Harvested files: Metadata1.xml, Metadata2.xml, Metadata3.xml, Metadata4.xml, ...

Index contents shown in the figure:

Field        Term       Document
identifier   ABC:123    2
identifier   XYZ:223    6
identifier   MI6:007    12
abstract     region     2,23,112
abstract     pressure   3,23
abstract     humid      4,33,215
min_lat      030.43     1
min_lat      -023.23    2
data_uri     http://... 4
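
The Field / Term / Document listing in the figure is an inverted index: each (field, term) pair points to the documents containing it. A minimal Python sketch of how boolean queries run against such a structure, using a few rows from the figure; the function names are hypothetical and this is not Lucene's API.

```python
# (field, term) -> ids of the documents containing that term (from the figure).
POSTINGS = {
    ("identifier", "ABC:123"): {2},
    ("identifier", "XYZ:223"): {6},
    ("identifier", "MI6:007"): {12},
    ("abstract", "region"): {2, 23, 112},
    ("abstract", "pressure"): {3, 23},
    ("abstract", "humid"): {4, 33, 215},
}


def search_all(postings, *clauses):
    """AND query: documents that contain every (field, term) clause."""
    result = None
    for clause in clauses:
        docs = postings.get(clause, set())
        result = set(docs) if result is None else result & docs
    return result or set()


def search_any(postings, *clauses):
    """OR query: documents that contain at least one (field, term) clause."""
    result = set()
    for clause in clauses:
        result |= postings.get(clause, set())
    return result


both = search_all(POSTINGS, ("abstract", "region"), ("abstract", "pressure"))
either = search_any(POSTINGS, ("abstract", "region"), ("abstract", "pressure"))
```

A query only touches the posting lists of its own terms, which is where the "Google-like" search speed comes from.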

Page 13

Future

[Figure: diagram; only the labels survive]

– metadata of data
– metadata of workflows
– workflow query
– data query
– assemble workflow
– processing

Page 14

Thank You!

Software will be available soon as open source on Sourceforge.net!

News: http://wiki.pangaea.de/wiki/Portal