oai: past, present and future

OAI: Past, Present and Future Michael L. Nelson [email protected] several slides stolen from Herbert Van de Sompel Open Archives Meeting Institute of Mechanical Engineers London 07/11/01


Page 1: OAI: Past, Present and Future

OAI: Past, Present and Future

Michael L. Nelson [email protected]
several slides stolen from Herbert Van de Sompel

Open Archives Meeting
Institute of Mechanical Engineers

London

07/11/01

Page 2: OAI: Past, Present and Future

Outline

• Past
– original goals, participants

• Present
– evolution of goals, terms, definitions, current status

• Future
– observations, use in the U.S., next steps

Page 3: OAI: Past, Present and Future

Background

• I met Herbert Van de Sompel in April 1999...
– we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
– we wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high-quality, yet often isolated, tech report servers, e-print servers, etc.

• most DLs had grown up along single disciplines
– little to no interoperability, “gardens” of DLs

Page 4: OAI: Past, Present and Future

The Rise and Fall of Distributed Searching

• wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
– Davis & Lagoze, JASIS 51(3), pp. 273-80
– Powell & French, Proc 5th ACM DL, pp. 264-265

• distributed searching of N nodes still viable, but only for small values of N
  • NCSTRL: N > 100; bad
  • NTRS/NIX: N <= 20; ok (but could be better)

Page 5: OAI: Past, Present and Future

The Rise and Fall of Distributed Searching

• Other problems of distributed searching (from STARTS)
– source-metadata problem
  • how do you know which nodes to search?
– query-language problem
  • syntax varies and drifts over time between the various nodes
– rank-merging problem
  • how do you meaningfully merge multiple result sets?

• Temptations:
– centralize all functions
  • “everything will be done at X”
– standardize on a single product
  • “everyone will use system Y”

Page 6: OAI: Past, Present and Future

Universal Preprint Service

• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
– based on NCSTRL+, a modified version of Dienst
  • support for “clustering”
  • support for “buckets”

• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI)
  • http://www.openarchives.org/

Page 7: OAI: Past, Present and Future

UPS Participants

Archive / DL                         Records in DL   Buckets in UPS   Buckets Linked to Full Content
arXiv (www.arxiv.org)                       128943            85204                            85204
CogPrints (cogprints.soton.ac.uk)              743              742                              659
NACA (naca.larc.nasa.gov)                     3036             3036                             3036
NCSTRL (www.ncstrl.org)                      29680            25184                             9084
NDLTD (www.ndltd.org)                         1590             1590                              951
RePEc (netec.mcc.ac.uk)                      71359            71359                            13582
Totals:                                     235361           187115                           112516

totals ca. July 1999

Page 8: OAI: Past, Present and Future

Metadata Harvesting

• Getting metadata out of archives
– not all archives support metadata extraction
  • some archives have undocumented metadata extraction procedures
– not all archives support rich criteria for extraction
  • single dump concept only

• Intellectual property and use rights not always clear
– many policies akin to “don’t ask, don’t tell”

Page 9: OAI: Past, Present and Future

Metadata Formatting and Quality

• Quality problems with:
– record duplication
– crucial missing fields
– internal errors
– ambiguous references to people and places, publications

• Different formats!
– arXiv: (local)
– CogPrints: (local)
– NACA: refer
– RePEc: ReDIF
– NDLTD: MARC
– NCSTRL: RFC-1807

unproven intuition: n digital libraries results in O(n) metadata formats

Page 10: OAI: Past, Present and Future

Buckets: Information Surrogates in UPS

• Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only

• Metadata was collected into “buckets”, with pointers back to the data files (still at the original sites)

Page 11: OAI: Past, Present and Future

Value Added Services Attached to the Buckets

SFX Reference Linking Service, developed at Univ of Ghent, Belgium.
- provides a layer of indirection between reference services available at a local site and the object itself

SFX “buttons” are attached to the buckets themselves
- communication occurs between SFX server and the bucket

Adding other services to the buckets is easy...

Page 12: OAI: Past, Present and Future

Data and Service Providers

• Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
  • provide non-technical context for sharing information also

• Service Providers
– harvest metadata from providers
– implement user interface to data

• Even if provided by the same DL, these are distinct functions

Page 13: OAI: Past, Present and Future

Data and Service Providers

• Self-describing archives
– Much of the learning about the constituent UPS archives occurred out of band…
– Given an unknown archive, we should be able to algorithmically determine the nature of the archive

[Diagram: two providers. The first has an input interface and a native end-user interface: no machine-based way to extract metadata. The second has an input interface, a native end-user interface, and a native harvesting interface: machine and user interfaces for extracting metadata.]

Page 14: OAI: Past, Present and Future

Data and Service Providers

[Diagram: one data provider with an input interface and a native harvesting interface; one data provider with an input interface, a native end-user interface, and a native harvesting interface; a service provider with a native end-user interface.]

Input and harvesting interfaces optional

Native end-user interface optional (e.g., RePEc)

Page 15: OAI: Past, Present and Future

Result… OAI
• The OAI was the result of the demonstration and discussion during the Santa Fe meeting
• Initial focus was on federating collections of scholarly e-print materials…
• …however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol

• Note:
– OAI is only about metadata -- not full text!
– OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
  • read: commercial publishers have an interest in OAI too...

Page 16: OAI: Past, Present and Future

OAI Timeline Highlights
• October 21-22, 1999 - initial UPS meeting
• February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
– precursor to the OAI metadata harvesting protocol
• June 3, 2000 - workshop at ACM DL 2000 (Texas)
• August 25, 2000 - OAI steering committee formed, DLF/CNI support
• September 7-8, 2000 - technical meeting at Cornell University
– defined the core of the current OAI metadata harvesting protocol
• September 21, 2000 - workshop at ECDL 2000 (Portugal)
• November 1, 2000 - Alpha test group announced (~15 organizations)
• January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
– purpose: freeze protocol for 12-16 months, generate critical mass
• February 26, 2001 - OAI Open Day in Europe (Berlin)
• July 3, 2001 - OAI protocol 1.1 announced
– to reflect changes in the W3C’s latest XML Schema recommendation
• September 8, 2001 - workshop at ECDL 2001 (Darmstadt)

Page 17: OAI: Past, Present and Future

Open Archives Initiative

The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)

Archive defined as a “collection of stuff” -- not the archivist’s definition of “archive”. “Repository” used in most OAI documents.

OAI is happening at break-neck speed...

Page 18: OAI: Past, Present and Future

Open Archives Initiative (OAI): exposure of metadata for harvesting

Open Archival Information System (OAIS): ensuring long-term preservation of archival materials

http://www.dlib.org/dlib/april01/04editorial.html
http://www.dlib.org/dlib/may01/05letters.html
http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html

[Diagram labels: OAIS; OAIS w/an OAI interface]

Page 19: OAI: Past, Present and Future

OAI Metadata Harvesting Protocol
• Then:
– OAI harvesting protocol originally a subset of the Dienst (NCSTRL) protocol
  • and originally called the “Santa Fe Convention”
– originally defined an OAI-specific metadata format
• Now:
– OAI metadata format dropped in favor of unqualified Dublin Core
  • other formats possible, but DC is required as lowest common denominator
– No longer dependent on Dienst
  • defined independently (though still easily mappable)

Page 20: OAI: Past, Present and Future

Overview of OAI Verbs

Verb                   Function

archival metadata
Identify               description of archive
ListMetadataFormats    metadata formats supported by archive
ListSets               sets defined by archive

harvesting verbs
ListIdentifiers        OAI unique ids contained in archive
ListRecords            listing of N records
GetRecord              listing of a single record

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
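
The verb-plus-arguments request shape summarized above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the base URL `http://example.org/oai/` and the helper name `build_request` are assumptions made for this sketch, while the verb names and the sample arguments (`identifier=oai:mlib:123a`, `metadataPrefix=dc`) are taken from the slides.

```python
from urllib.parse import urlencode

# The six verbs listed in the overview table.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **args):
    """Return a GET URL for one OAI request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown OAI verb: " + verb)
    params = {"verb": verb}
    # Skip arguments the caller left unset.
    params.update({k: v for k, v in args.items() if v is not None})
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request("http://example.org/oai/", "GetRecord",
                        identifier="oai:mlib:123a", metadataPrefix="dc"))
```

Keeping the verb check separate from the argument handling mirrors the slide: the verb set is fixed, while the arguments vary per verb.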

Page 21: OAI: Past, Present and Future

supporting protocol requests

herbert van de sompel

service provider (harvester) → data provider (repository)

Request: Identify

Response: Identify / Time / Request
• Repository identifier
• Base-URL
• Admin e-mail
• OAI protocol version
• Description

Page 22: OAI: Past, Present and Future

supporting protocol requests

herbert van de sompel

service provider (harvester) → data provider (repository)

Request: ListMetadataFormats
* identifier=oai:mlib:123a

Response: ListMetadataFormats / Time / Request
REPEAT
• Format prefix
• Format XML schema
/REPEAT

Page 23: OAI: Past, Present and Future

supporting protocol requests

herbert van de sompel

service provider (harvester) → data provider (repository)

Request: ListSets
* resumptionToken

Response: ListSets / Time / Request
REPEAT
• SetSpec
• SetName
/REPEAT

Page 24: OAI: Past, Present and Future

harvesting requests

herbert van de sompel

service provider (harvester) → data provider (repository)

Request: ListRecords
* from=a
* until=b
* set=klm
* metadataPrefix=dc
* resumptionToken

Response: ListRecords / Time / Request
REPEAT
• Identifier
• Datestamp
• Metadata
/REPEAT

Page 25: OAI: Past, Present and Future

harvesting requests

herbert van de sompel

service provider (harvester) → data provider (repository)

Request: ListIdentifiers
* from=a
* until=b
* set=klm
* resumptionToken

Response: ListIdentifiers / Time / Request
REPEAT
• Identifier
• Datestamp
/REPEAT

Page 26: OAI: Past, Present and Future

harvesting requests

herbert van de sompel

service provider (harvester) → data provider (repository)

Request: GetRecord
* identifier=oai:mlib:123a
* metadataPrefix=dc

Response: GetRecord / Time / Request
• Identifier
• Datestamp
• Metadata

Page 27: OAI: Past, Present and Future

Flow Control

• ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of:
– resumptionToken: an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off
  • each archive defines their own resumptionToken syntax; it may have visible semantics or not
– 503 http status code: “retry after”
  • up to the harvester to understand this code and respect it, and up to the archive to enforce it
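
The harvester's side of the 503 “retry after” convention might look like the sketch below. The function name `seconds_to_wait` and the 60-second fallback are assumptions for illustration; only the delay-in-seconds form of the `Retry-After` header is handled here.

```python
def seconds_to_wait(status, headers, default=60):
    """Return how long a polite harvester should pause before retrying.

    `status` is the HTTP status code and `headers` a plain dict of
    response headers. The name and the default are illustrative only.
    """
    if status != 503:
        return 0
    value = headers.get("Retry-After", "").strip()
    # A Retry-After value given as an HTTP date is not parsed in this
    # sketch; it falls back to the default delay.
    return int(value) if value.isdigit() else default


if __name__ == "__main__":
    print(seconds_to_wait(503, {"Retry-After": "120"}))
```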

Page 28: OAI: Past, Present and Future

resumptionToken

scenario: harvesting 277 records in 3 separate 100-record “chunks”

harvester → RDBMS: ListRecords
RDBMS → harvester: Records 1-100, resumptionToken=AXad31
harvester → RDBMS: ListRecords, resumptionToken=AXad31
RDBMS → harvester: Records 101-200, resumptionToken=pQ22-x
harvester → RDBMS: ListRecords, resumptionToken=pQ22-x
RDBMS → harvester: Records 201-277
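
The exchange above reduces to a simple loop on the harvester: keep passing back the last token until the archive stops returning one. The sketch below simulates the 277-record scenario in memory. Both `harvest_all` and `make_fake_repository` are hypothetical names, and the fake archive happens to encode an offset in its token, which a real harvester must not rely on because the token is opaque.

```python
def harvest_all(fetch):
    """Collect every record by following resumption tokens.

    `fetch(token)` stands in for one ListRecords request and returns
    (records, next_token), with next_token None on the final chunk.
    """
    records = []
    token = None
    while True:
        chunk, token = fetch(token)
        records.extend(chunk)
        if not token:
            return records


def make_fake_repository(total=277, chunk_size=100, log=None):
    """In-memory stand-in for an archive (illustrative only)."""
    def fetch(token):
        if log is not None:
            log.append(token)  # remember each request for inspection
        start = int(token) if token else 0
        end = min(start + chunk_size, total)
        chunk = list(range(start + 1, end + 1))
        next_token = str(end) if end < total else None
        return chunk, next_token
    return fetch


if __name__ == "__main__":
    requests_seen = []
    records = harvest_all(make_fake_repository(log=requests_seen))
    print(len(records), "records in", len(requests_seen), "requests")
```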

Page 29: OAI: Past, Present and Future

OAI Demos
• Data providers
– not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool
  • http://purl.org/net/oai_explorer
• 30+ registered data providers
– http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
– many being used for internal purposes; not registered
• Service providers
– Arc, the first known SP harvesting from OAI data providers
  • http://arc.cs.odu.edu/
• 3 registered service providers
– http://www.openarchives.org/service_provider/oai_sp.htm
– several more known to be in testing or creation

Page 30: OAI: Past, Present and Future

Field of Dreams
• It should be easy to be a data provider, even if it makes more work for the service provider.
– if enough data providers exist, the service providers will come (DPs >> SPs)
• Open-source / freely available tools
– “drop-in” data providers:
  • industrial strength: http://www.eprints.org/
  • personal size: http://kepler.cs.odu.edu/
– tools to make your existing DL a data provider:
  • http://www.openarchives.org/tools/tools.htm
  • also: OAI-implementers mailing list / mail archive!
– service providers:
  • only bits and pieces currently publicly available...

Page 31: OAI: Past, Present and Future

OAI Observation: Front-End Only

• No input/registry mechanism
– OAI harvesting protocol is always a front-end for something else
  • filesystem, Dienst, RDBMS, LDAP, etc.
– convenient for pre-existing DLs, but does not address “new” DLs
  • e.g., “we want to do OAI”

• Bounds the scope of OAI
– responsibilities and domain of OAI are still being discussed
– tension between functionality and simplicity

Page 32: OAI: Past, Present and Future

OAI Observation: No T&C

• No terms & conditions provisions in protocol
– assumes all metadata has uniform access rights
  • how to restrict metadata to certain hosts?
– introducing T&C would increase the scope of application, but at the expense of simplicity
  • how expensive do we want to make a “just-a-front-end protocol”?
  • maybe T&C is a good application for sets?

Page 33: OAI: Past, Present and Future

OAI Observation: No T&C
• Possible to use multiple OAI servers in a DMZ-like configuration…

[Diagram: a Public OAI Server takes OAI requests from arbitrary hosts; a Private OAI Server takes OAI requests from trusted hosts; both sit in front of the source database.]

could even use a separate copy of the database…

Page 34: OAI: Past, Present and Future

OAI Observation: No T&C

• Possible to use OAI harvesting protocol in closed, restricted systems

[Diagram: four DLs, OAI 1, OAI 2, OAI 3 and OAI 4, harvesting among themselves]

all OAI requests originate from these 4 DLs

Page 35: OAI: Past, Present and Future

OAI Observation: Monolithic
• An OAI server has no protocol-defined concept of “other” OAI servers
– backups, mirrors, etc. have to be resolved outside of the scope of OAI
  • scope vs. complexity again
– fully connected graph of DLs harvesting from each other is unnecessary
  • cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System
– 3rd party harvesting interfaces raise more T&C and data coherency issues

Page 36: OAI: Past, Present and Future

302 Load Balancing
• Interactive users on main DL machine should not be impacted by metadata harvesting
– don’t take deliveries through the front door
– not part of the protocol; defined outside the protocol

[Diagram, in sequence:]
1. harvester → OAI server at naca.larc.nasa.gov/oai/: http://blah/oai/?verb=ListIdentifiers
2. OAI server: if load > 0.05, redirect request (HTTP Status Code 302)
3. harvester → OAI server at buckets.dsi.internet2.edu/naca/oai/: http://blah/oai/?verb=ListIdentifiers
4. response: <?xml version="1.0" encoding="UTF-8"?>…<ListIdentifiers>…</ListIdentifiers>
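
The redirect decision on the primary server can be sketched as a small pure function. The 0.05 load threshold and the mirror host come from the slide; the function name `route_request`, its return shape, and the `http://` scheme on the mirror URL are assumptions made for illustration, and how the load figure is measured is left to the caller.

```python
def route_request(load, query,
                  mirror_base="http://buckets.dsi.internet2.edu/naca/oai/",
                  threshold=0.05):
    """Decide whether to answer locally or redirect the harvester.

    Returns (status_code, headers). A 302 carries a Location header
    pointing at the same query on the mirror; 200 means "serve it here".
    """
    if load > threshold:
        return 302, {"Location": mirror_base + "?" + query}
    return 200, {}


if __name__ == "__main__":
    print(route_request(0.40, "verb=ListIdentifiers"))
```

Because the redirect is plain HTTP, a harvester that follows 302 responses needs no OAI-specific support for this arrangement.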

Page 37: OAI: Past, Present and Future

OAI Observation: Data Coherency
• In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret
– what is an update vs. addition?
  • in the NACA OAI interface, they are reported as the same and it’s up to the harvesting system to figure it out
– deletions?
  • it is currently optional for OAI systems to mark records as deleted or not…
– still left to the harvester to interpret
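
One way a harvester could tell additions from updates is to compare each harvested identifier and datestamp against what it already holds. The sketch below assumes the local store is a plain dict mapping identifier to last-seen datestamp; the function name `classify` and the sample identifiers are hypothetical.

```python
def classify(local, harvested):
    """Split harvested (identifier, datestamp) pairs into additions and updates.

    `local` maps identifier -> datestamp already held by the service
    provider. Records with an unchanged datestamp are ignored.
    """
    added, updated = [], []
    for identifier, datestamp in harvested:
        if identifier not in local:
            added.append(identifier)
        elif local[identifier] != datestamp:
            updated.append(identifier)
    return added, updated


if __name__ == "__main__":
    held = {"oai:x:1": "2001-01-01", "oai:x:2": "2001-01-01"}
    incoming = [("oai:x:1", "2001-01-01"),
                ("oai:x:2", "2001-06-30"),
                ("oai:x:3", "2001-06-30")]
    print(classify(held, incoming))
```

Deletions are not covered: when an archive does not mark them, the harvester can only detect them by comparing a full harvest against its local copy.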

Page 38: OAI: Past, Present and Future

OAI Observation: Harvest Model
• Frequency of harvests
– all-at-once harvests?
  • initial harvest
  • resolving data coherency
– frequent incremental harvests?
  • far more efficient for both service and data providers
• Webcrawling vs. digital library models
– webcrawlers: little to no a priori information about target
– DLs: frequent harvesting of a small number of known targets
• Realization: we know very little about harvesting behavior…
– are we optimizing for all-at-once, when incremental will be more common?
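
The difference between the two harvest styles comes down to whether a `from` date is sent. A minimal sketch, assuming dates are expressed as YYYY-MM-DD and that the harvester remembers the day of its previous run; the function name `harvest_arguments` is hypothetical, while `from`, `until` and `metadataPrefix` are the argument names shown on the ListRecords slide.

```python
from datetime import date


def harvest_arguments(last_harvest, today, metadata_prefix="dc"):
    """Build ListRecords arguments for a full or incremental harvest.

    With last_harvest=None the request is an all-at-once harvest;
    otherwise only records changed since that day are requested.
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvest is not None:
        args["from"] = last_harvest.isoformat()
    args["until"] = today.isoformat()
    return args


if __name__ == "__main__":
    print(harvest_arguments(date(2001, 10, 31), date(2001, 11, 7)))
```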

Page 39: OAI: Past, Present and Future

Potentially Good Ideas (but we’re not sure yet)

• Sets
– intuition: we’ll be glad we included them
– arXiv the first to implement sets
  • their DL is roughly built on “sets”, so it was an easy mapping for them
  • a few other repositories have since adopted sets

• Flow control
– harvesting == denial of service attack?
– is “resumptionToken” solution not enough? too much?
  • need data providers with large collections and enough service providers to generate a load

Page 40: OAI: Past, Present and Future

Potentially Good Ideas (but we’re not sure yet)

• Metadata
– Q: “Which format should I use?”
  • A: any/all of them…
– lowest common denominator: unqualified Dublin Core
– Again, little known about actual behavior
  • will DC actually be useful? or too lossy?
  • will communities create/adopt specific formats?
  • will native (presumably richer) formats be harvested?

we very much want this to happen...
“The Return of MARC” ?!

Page 41: OAI: Past, Present and Future

XML Observations
• Not too much of a problem for data providers
– XML is easier to write than read
• Service providers…
– XML can be pretty picky… a large “ListRecords” result can be invalidated with a single error
  • harvest in chunks? individual records?
– author contributed metadata particularly a problem (e.g. control characters from copy-n-paste)
– one advantage of resumptionToken is that it compartmentalizes bad data
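
The compartmentalizing effect can be illustrated by parsing each chunk on its own, so that one malformed chunk does not discard the rest. This is a sketch under simplifying assumptions: the chunks are toy documents without namespaces, `identifier` is used as a bare element name, and `parse_chunks` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Parse each response chunk independently.

    Returns (identifiers, bad_chunk_indexes); a chunk that is not
    well-formed XML is recorded and skipped instead of aborting.
    """
    identifiers, bad = [], []
    for index, text in enumerate(chunks):
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            bad.append(index)
            continue
        identifiers.extend(el.text for el in root.iter("identifier"))
    return identifiers, bad


if __name__ == "__main__":
    chunks = [
        "<ListRecords><identifier>oai:x:1</identifier></ListRecords>",
        # A pasted control character makes this chunk ill-formed.
        "<ListRecords><identifier>oai:x:2\x0b</identifier></ListRecords>",
        "<ListRecords><identifier>oai:x:3</identifier></ListRecords>",
    ]
    print(parse_chunks(chunks))
```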

Page 42: OAI: Past, Present and Future

Current NTRS / NIX Architecture
• NASA-wide page that federates N center/project specific servers through distributed searching

[Diagram: the user's search for “cfd applications” goes to NTRS/NIX, which issues the same search for “cfd applications” to each node; each node independently maintained]

NTRS/NIX
http://techreports.larc.nasa.gov/cgi-bin/NTRS
http://nix.nasa.gov/

Page 43: OAI: Past, Present and Future

Current NTRS / NIX Architecture
• Or users can interact directly with the nodes of NTRS/NIX…

[Diagram: the user sends the search for “cfd applications” directly to individual nodes rather than through NTRS/NIX]

Page 44: OAI: Past, Present and Future

Proposed Strategy: Data Providers
• Reduce the high interoperability expectations of distributed searching…
• Each current node of NTRS, NIX and other NASA DLs becomes an OAI “data provider”
– LTRS & NACA already have test OAI interfaces
  • LTRS http://techreports.larc.nasa.gov/ltrs/oai/
  • NACA http://naca.larc.nasa.gov/oai/
– each node is free to run its own software / architecture / system / etc., but the method of metadata exposure is standardized
  • very low interoperability requirements
  • each node can continue to have a “user interface”

Page 45: OAI: Past, Present and Future

Proposed Strategy: Service Providers

• NTRS, NIX and other well known, “destination DLs” become OAI service providers
– no longer relying on distributed searching
– harvest metadata from their constituent data providers
– provide their value added services on local copies of the metadata
  • data remains resident at the local data providers

Page 46: OAI: Past, Present and Future

NTRS OAI Architecture

user

. . .

search for “cfd applications”

local copy of metadata

metadata harvested offline, through OAI interface

each node independently maintained

individual nodes can still support direct user interaction

NTRS

LTRS ATRS GTRS CASITRS

all searching, browsing, etc. performed on the metadata here

content (reports) remain archived at the local sites

Page 47: OAI: Past, Present and Future

Additional Models
• First step
– OAI interfaces for data providers
– DLs use OAI interfaces to move from distributed searching to metadata harvesting
• Other possibilities
– hierarchical harvesting
  • exposing metadata to other, possibly non-NASA DLs
  • harvesting from other, possibly non-NASA DLs
– multi-genre DLs
– re-apply the OAI protocol for harvesting / replicating content (not just metadata)
– 3rd party service providers

Page 48: OAI: Past, Present and Future

NASA DLs in the Larger STI Realm

[Diagram: NTRS above LTRS, ATRS, CASITRS, …; alongside it DOE, DOD, Universities, Publishers, . . ., International; this could be a fully connected graph]

NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.

NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.

We hope to influence the direction of the science.gov effort to use OAI.

Page 49: OAI: Past, Present and Future

New Kinds of DLs

• Drawing from the same pool of DPs
– different interfaces, capabilities and collection policies for:
  • public affairs
  • K-12 education
  • science & research
  • authors / librarians / managers
– NTRS and NIX could harvest from the same sources…
  • be the same DL, but with different interfaces?
  • be replaced with a new, all-encompassing DL?
– DL creators can now focus on collection management
  • “ala carting” their collections and sub collections
  • instead of fussing over syntax synchronization of remote search services

Page 50: OAI: Past, Present and Future

A Generic Harvesting Protocol
• The actual uses of OAI depend on your relative position and concerns:
– What is metadata vs. data?
– Who is a SP vs. a DP?
• Multiple OAI interfaces make many things possible:
– restricted / public interfaces
– Arc-like description of harvested archives
– updates of log files, authority lists, etc.
• Additional services can be built on top of OAI
– content replication
– awareness services

Page 51: OAI: Past, Present and Future

OAI Impact

• Lightweight interoperability protocol
– an OAI layer is added to your existing DL

• Separation of responsibilities
– service providers
– data providers

• http://www.openarchives.org/