Uwe Schindler GES 2007 – May 2-4, 2007
Data Information Service based on
Open Archives Initiative Protocols and
Apache Lucene
Uwe Schindler (1), Benny Bräuer (2), Michael Diepenbroek (1)
(1) PANGAEA® Group at MARUM, University of Bremen, Bremen, Germany
(2) Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany
Metadata Portals & Grid
• WDC-MARE, with its information system PANGAEA®, currently provides data portals for several EU/international projects.
• Not all data are stored centrally, so all datasets provided in portals must be consolidated from different sources!
• Features:
– Data stays at the data providers
– Metadata is harvested by the portal
– Search queries are handled by the centralized catalogue (Google-like search speed!)
– Scientist gets link to data at the provider
Metadata portal software is sufficient for C3-Grid, too!
Metadata in C3-Grid
• Goal: build up an infrastructure for the earth system community in Germany
• Problem: we need an architecture which makes it possible to:
– Collect metadata files from data providers
– Store them in a “central index”
– Provide fast, generic access to this data for our users
Solution: Data Information Service
Open Archives Protocol
• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.
• Almost all digital libraries support it (most famous ones: Fedora, arXiv and the CERN Document Server)
• Portals by Scientific Commons, OAIster, SUB
• uses it during web crawling (if available)
• Very simple to implement (XML over HTTP-REST)
• Repository software for databases or file system metadata providers is widely available (C3 uses mostly DLESE jOAI software on the data provider side)
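A minimal sketch of what "XML over HTTP" means for a harvester: it builds a ListRecords request URL and reads record identifiers and the resumption token out of the response. The base URL, the sample response and the function names are assumptions for illustration only; the verb, the argument names and the namespace are those of OAI-PMH 2.0.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    A resumptionToken is an exclusive argument in OAI-PMH, so it
    replaces metadataPrefix on follow-up requests.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token is not None and token.text
    return ids, (token.text if has_token else None)


# Made-up response standing in for what a data provider would return
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://example.org/oai", metadata_prefix="oai_dc")
record_ids, next_token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=next_token)
```

A real harvester would fetch each URL over HTTP and repeat until no resumption token is returned.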
Central indexing requirements
1. Open for any XML metadata format
2. Any mappings to document fields should be done by XPath
3. Possibility to map incompatible XML schemas during harvesting by XSLT on-the-fly
4. On-the-fly validation of (transformed) documents during harvesting
5. No relational database, only a full-text search engine that contains everything needed for operation
6. Range queries on specific fields (date/time or numeric)
7. Web service interface / programming API for the end user interface that is accessible from any language (Java/JSP, PHP, Perl,...)
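Requirement 2 can be illustrated with a small sketch: a configuration table maps index field names to XPath expressions, and each harvested document is evaluated against it. The element names in the sample document and the helper name are assumptions, not the actual C3 metadata schema. Python's ElementTree supports only a subset of XPath, which is enough for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: index field name -> XPath relative to the document root
FIELD_XPATHS = {
    "identifier": "./id",
    "abstract": "./description/abstract",
    "min_lat": "./coverage/south",
}


def extract_fields(xml_text, field_xpaths):
    """Evaluate every configured XPath and collect non-empty text values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name, xpath in field_xpaths.items():
        values = [
            el.text.strip()
            for el in root.findall(xpath)
            if el.text and el.text.strip()
        ]
        if values:
            fields[name] = values
    return fields


# Made-up metadata document used only to exercise the mapping
SAMPLE_DOC = """<metadata>
  <id>ABC:123</id>
  <description><abstract>Air pressure in the region</abstract></description>
  <coverage><south>-23.23</south><north>30.43</north></coverage>
</metadata>"""

fields = extract_fields(SAMPLE_DOC, FIELD_XPATHS)
```

Because the mapping is plain configuration, supporting a new metadata format would only require a new table of expressions, not new code.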
Lucene features
• Ranked searching: best results returned first
• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries for date/time values and numbers
• Fielded searching. All fields are searchable as a whole, each field separately (e.g. for author, parameter), or mixed.
• Any combination of boolean operators between search terms (AND, OR, NOT, exact phrase)
• Sorting by any field
• Multiple-index searching with merged results
• Simultaneous searching and updates due to high-performance indexing
Generic Framework
[Figure: architecture diagram. Data providers and the file system are read by OAI-PMH harvesters and a directory harvester, which feed the Index Builder. The Index Builder accepts each document as a DOM tree, transforms it by XSL, validates it against a schema, applies XPath expressions to fill the fields, serializes the DOM to an XML blob, and adds the document to the index. The Lucene indexes and virtual indexes are accessed through the search interface.]
Search Interface
• Supports all standard Lucene search features
• Additional support for fast range queries to enable bounding boxes, etc.:
– implemented by redundant storage of “numerical terms” in different precisions
– recursive reduction of distinct terms (every numerical value is a term) on range query
– search time no longer dependent on index size
• Accessible via Java API or AXIS web service
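The range-query idea above can be sketched as follows. This is an illustration under stated assumptions, not the code of the Data Information Service: values are non-negative 16-bit integers, each precision level drops 4 more low bits, and all names are hypothetical. Every value is indexed once per precision, and a range is split recursively so that its interior is covered by a few coarse terms and only its edges by fine ones.

```python
from collections import defaultdict

BITS = 16  # assumed width of the indexed integers
STEP = 4   # assumed precision step: each level drops 4 more low bits


def terms_for(value):
    """All (shift, prefix) terms stored for one value, one per precision."""
    return [(shift, value >> shift) for shift in range(0, BITS, STEP)]


def build_index(values_by_doc):
    """Map every (shift, prefix) term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, value in values_by_doc.items():
        for term in terms_for(value):
            index[term].add(doc_id)
    return index


def range_terms(lo, hi, shift=0):
    """Split the inclusive prefix range [lo, hi] at this precision into terms."""
    if lo > hi:
        return []
    if shift + STEP >= BITS:
        # Coarsest level: enumerate the few remaining prefixes directly
        return [(shift, prefix) for prefix in range(lo, hi + 1)]
    mask = (1 << STEP) - 1
    terms = []
    # Lower edge: prefixes that do not start a block of the next coarser level
    while lo <= hi and (lo & mask) != 0:
        terms.append((shift, lo))
        lo += 1
    # Upper edge: prefixes that do not end a block of the next coarser level
    while lo <= hi and (hi & mask) != mask:
        terms.append((shift, hi))
        hi -= 1
    # The rest consists of whole blocks, covered at the coarser precision
    if lo <= hi:
        terms += range_terms(lo >> STEP, hi >> STEP, shift + STEP)
    return terms


def search(index, lo, hi):
    """Return the sorted ids of documents whose value lies in [lo, hi]."""
    hits = set()
    for term in range_terms(lo, hi):
        hits |= index.get(term, set())
    return sorted(hits)


# Made-up values; a latitude would first have to be encoded as an integer
docs = {doc_id: (doc_id * 7919) % 65536 for doc_id in range(500)}
index = build_index(docs)
matches = search(index, 1000, 42000)
```

The number of terms visited depends on the bit width and the precision step, not on how many distinct values the index holds, which is the point made in the last sub-bullet. The price is the redundant storage: here four terms per value instead of one.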
C3 Implementation
[Figure by T. Langhammer, ZIB: the Portal sends Google-style and range queries to the web service frontend of the DIS. The harvesting backend of the DIS collects metadata via OAI-PMH from CERA, PANGAEA® and other data providers into a document cache (Metadata1.xml, Metadata2.xml, Metadata3.xml, Metadata4.xml, ...). Selected fields are indexed into the Apache Lucene full-text index, illustrated by the following table.]

Field       Term        Document
identifier  ABC:123     2
identifier  XYZ:223     6
identifier  MI6:007     12
abstract    region      2, 23, 112
abstract    pressure    3, 23
abstract    humid       4, 33, 215
min_lat     030.43      1
min_lat     -023.23     2
data_uri    http://...  4
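The field/term/document listing in the figure is an inverted index. A toy sketch of that structure, assuming whitespace tokenization and lowercasing (Lucene's real analyzers and index format are far more elaborate), with made-up documents and hypothetical function names:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every (field, term) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index


def search_all(index, *field_terms):
    """Return ids of documents matching ALL given (field, term) pairs (AND)."""
    postings = [index.get((field, term.lower()), set()) for field, term in field_terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Made-up documents that loosely follow the fields shown in the figure
docs = {
    2: {"identifier": "ABC:123", "abstract": "pressure in the region"},
    3: {"identifier": "DEF:456", "abstract": "air pressure"},
    23: {"identifier": "GHI:789", "abstract": "region pressure humid"},
}
index = build_inverted_index(docs)
both = search_all(index, ("abstract", "region"), ("abstract", "pressure"))
by_id = search_all(index, ("identifier", "ABC:123"))
```

A fielded query is a lookup of one posting list; a boolean AND is an intersection of several, which is why such queries stay fast as the number of harvested documents grows.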
Future
[Figure: metadata of data and metadata of workflows are searched through data queries and workflow queries; the results are used to assemble a workflow for processing.]
Thank You!
Software will be available soon as open source on Sourceforge.net!
News: http://wiki.pangaea.de/wiki/Portal