

An overview of capabilities and real-world use cases for discovery, harvesting, and synchronization of resources on the web

http://www.openarchives.org/rs #resourcesync

ResourceSync ANSI/NISO Z39.99-2017

Martin Klein

Gretchen Gueguen

Mark Matienzo

Petr Knoth

ResourceSync was funded by the Sloan Foundation & JISC

Martin Klein, Los Alamos National Laboratory

@mart1nkle1n


ResourceSync - @mart1nkle1n DPLAfest, Chicago, April 20 2017

Background - OAI-PMH

•  Recurrent metadata exchange from a Data Provider to Service Providers

•  XML metadata only

•  Repository centric

•  Devised 1999-2002, prior to REST, prior to dominance of web search engines


Revisit the Problem Domain - ResourceSync

•  Synchronization of resources from a Source to Destinations

•  Web resources, anything with an HTTP URI & representation

•  Resource centric

•  Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications

•  Updated in 2017 to v1.1

One to One Synchronization

One to Many – Master Copy

Many to One - Aggregator

Selective Synchronization

Metadata Harvesting


ResourceSync Capabilities

•  Resource List
  •  Inventory, baseline synchronization
•  Change List
  •  Resource change events that occurred in a temporal interval, incremental synchronization
•  Resource Dump
•  Change Dump
•  Notifications (separate specification)
•  Archives (beta draft)


Sitemap

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2017-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2017-01-02T14:00:00Z</lastmod>
    <changefreq>daily</changefreq>
  </url>
  …
</urlset>


Resource List

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" type="application/pdf" />
    <rs:ln rel="describedby" href="http://example.com/res1_dublin_core_md.xml" type="application/xml" />
  </url>
  <url> … </url>
</urlset>
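A Destination can read a Resource List with any namespace-aware XML parser. The following is a minimal sketch using only the Python standard library. The function name `parse_resource_list` and the shape of the returned dictionaries are illustrative assumptions; they are not part of the ResourceSync specification or of any existing client.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as shown on the Resource List slide.
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"

EXAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" type="application/pdf" />
    <rs:ln rel="describedby"
           href="http://example.com/res1_dublin_core_md.xml"
           type="application/xml" />
  </url>
</urlset>
"""


def parse_resource_list(xml_text):
    """Return (capability, at, entries) for a Resource List document.

    Assumes the document carries a top-level rs:md element, as in the slide.
    """
    root = ET.fromstring(xml_text.strip())
    list_md = root.find(RS + "md")
    entries = []
    for url in root.findall(SM + "url"):
        md = url.find(RS + "md")  # per-resource metadata is optional
        entries.append({
            "uri": url.findtext(SM + "loc"),
            "hash": md.get("hash") if md is not None else None,
            "type": md.get("type") if md is not None else None,
            "links": [(ln.get("rel"), ln.get("href"))
                      for ln in url.findall(RS + "ln")],
        })
    return list_md.get("capability"), list_md.get("at"), entries


capability, at, entries = parse_resource_list(EXAMPLE)
```

For a baseline synchronization, a Destination would then retrieve each `uri` (and, if wanted, the `describedby` metadata link) and could compare the downloaded bytes against the advertised hash.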


Change List

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2017-01-02T09:00:00Z" until="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2017-01-02T13:00:00Z</lastmod>
    <rs:md change="created" datetime="2017-01-02T13:00Z" />
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <rs:md change="updated" datetime="2017-01-02T15:00Z" />
  </url>
</urlset>
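Incremental synchronization amounts to walking the Change List and acting on each change event. The sketch below turns a Change List into a list of steps for a Destination. It assumes entries are already in chronological order; the function name `plan_sync`, the step labels "fetch" and "remove", and the `res4` deletion entry are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"

# The slide's two entries plus a hypothetical "deleted" entry (res4).
CHANGE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2017-01-02T09:00:00Z" until="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md change="created" datetime="2017-01-02T13:00Z" />
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <rs:md change="updated" datetime="2017-01-02T15:00Z" />
  </url>
  <url>
    <loc>http://example.com/res4</loc>
    <rs:md change="deleted" datetime="2017-01-02T16:00Z" />
  </url>
</urlset>
"""


def plan_sync(xml_text):
    """Translate Change List entries into (action, uri) steps, in document order."""
    root = ET.fromstring(xml_text.strip())
    steps = []
    for url in root.findall(SM + "url"):
        uri = url.findtext(SM + "loc")
        change = url.find(RS + "md").get("change")
        if change in ("created", "updated"):
            steps.append(("fetch", uri))    # download the current representation
        elif change == "deleted":
            steps.append(("remove", uri))   # drop the local copy
        else:
            raise ValueError(f"unexpected change type: {change!r}")
    return steps


steps = plan_sync(CHANGE_LIST)
```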


ResourceSync Change Notifications

•  Notifications about change events to resources
•  Source notifies subscribed Destinations (cf. recurrent pull)
•  Push-based approach via WebSub
•  Similar, sitemap-based payload
•  Decrease synchronization latency between Source and Destination
•  Change Notification Specification v1.0


EHRI Use Case

•  Aggregation of information about Holocaust collections
  •  held by 1,800+ organizations worldwide
  •  into a central service
  •  EAD as exchange format
•  Diversity of data sources and locations
  •  databases, spreadsheets (“home collections”)

https://ehri-project.eu/ http://portal.ehri-project.eu

https://twitter.com/EHRIproject


EHRI Use Case

•  Special ResourceSync implementation
  •  Bridges gap between local systems and ResourceSync capability documents on a web server
  •  Filters local resources by subject, time period, etc.
  •  Set up by EHRI technical staff, run by contributing party
•  Baseline synchronization: Resource Lists
•  Incremental synchronization: Change Lists
•  Together with EAD files moved from local system to web server
  •  Dropbox, FTP, USB stick
•  Service: partners expose EADs, server collects and offers value-added services, e.g., graph database

https://ehri-project.eu/ http://portal.ehri-project.eu

https://twitter.com/EHRIproject


CLARIAH Use Case

•  Various institutions host evolving collections
•  Make collection items uniformly available via RDF graph
•  Central registry holds description of all collections
•  Researchers use Virtual Research Environment to
  •  Discover collections (via registry)
  •  Collect graphs from respective institution
  •  Keep graphs up to date

https://www.clariah.nl/ https://twitter.com/CLARIAH_NL


CLARIAH Use Case

•  Baseline synchronization
  •  Download graph from DB
  •  Serialized as one or more files, one RDF triple per line (+ s p o graph_name)
  •  “+” stands for “add”
  •  URIs of files listed in Resource List
•  Incremental synchronization
  •  Changes logged in one or more files, one change per line (+/- s p o graph_name)
  •  “+” stands for “add”, “-” for “delete”
  •  URIs of files listed in Change List
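The "+/- s p o graph_name" line format lends itself to a very small replay loop on the researcher's side. The sketch below is a hedged illustration, not the CLARIAH implementation: it treats everything after the leading sign as one opaque quad string, so it does not validate RDF syntax, and the example URIs and the function name `apply_changes` are assumptions.

```python
def apply_changes(quads, lines):
    """Replay '+ s p o g' / '- s p o g' lines against a set of quad strings."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        op, _, quad = line.partition(" ")
        quad = quad.strip()
        if op == "+":
            quads.add(quad)       # "+" stands for "add"
        elif op == "-":
            quads.discard(quad)   # "-" stands for "delete"
        else:
            raise ValueError(f"unknown operation in line: {line!r}")
    return quads


G = "<http://example.org/graph>"  # hypothetical graph name

# Baseline file (its URI would be listed in the Resource List): adds only.
baseline = [
    f"+ <http://example.org/s1> <http://example.org/p> <http://example.org/o1> {G}",
    f"+ <http://example.org/s2> <http://example.org/p> <http://example.org/o2> {G}",
]

# Change file (its URI would be listed in the Change List): adds and deletes.
changes = [
    f"- <http://example.org/s1> <http://example.org/p> <http://example.org/o1> {G}",
    f"+ <http://example.org/s3> <http://example.org/p> <http://example.org/o3> {G}",
]

graph = apply_changes(set(), baseline)
graph = apply_changes(graph, changes)
```

Because the baseline uses the same "+" notation as the change files, one function can handle both the initial load and every later update.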

https://www.clariah.nl/ https://twitter.com/CLARIAH_NL


ResourceSync Tools

•  Source implementation
  •  Python
  •  DANS & LANL & CORE
  •  Connectors to file system, Solr index
  •  OAI-PMH converter (planned)
  •  https://github.com/resourcesync/py-resourcesync
•  Client implementation
  •  Python
  •  https://github.com/resync/resync
•  Notification implementation
  •  PubSubHubbub
  •  https://github.com/resync/resourcesync_push

Hyku & DPLA ResourceSync Implementations

Gretchen Gueguen, Data Services Coordinator, Digital Public Library of America

Project Background

● IMLS National Leadership Grant (30 months)
● Foster a national digital platform through community-based repository infrastructure
● Leverage & contribute to Hydra, both in code and community

Primary Project Goals

1. Develop turnkey (“easy to install, easy to maintain”) Hydra-based application that leverages and improves on core code components
2. Develop metadata aggregation & enrichment tools
3. Work toward a hosted service in the cloud

Metadata Aggregation @DPLA

Methods for Data Aggregation:

● OAI-PMH (21 providers)
● Custom APIs/other (8 providers)
● Direct file transfer (3 providers)

Biggest Drawbacks:

● Re-synchronizing entire data sets
● Relying on HTTP requests

ResourceSync and Hyku

● ResourceSync publishing support built into MVP

● Test application with 50,000 records to start
  ○ Limit for a single list. To add more, we would need to make a list of lists.

● Resource lists and change lists are supported

● Resource or change dumps not currently supported

● Content negotiation for JSON-LD, N-Triples, and Turtle
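The "list of lists" mentioned above corresponds to splitting the inventory into several Resource Lists of at most 50,000 entries each and pointing to them from an index document that uses the Sitemap `<sitemapindex>` root. The following is a rough sketch of that idea, not Hyku's code: the record and list URIs and the helper names `chunk` and `build_index` are hypothetical.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
SM = "{" + SM_NS + "}"
RS = "{" + RS_NS + "}"
MAX_ENTRIES = 50_000  # per-list limit mentioned on the slide


def chunk(uris, size=MAX_ENTRIES):
    """Split a flat list of resource URIs into lists of at most `size` entries."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


def build_index(list_uris):
    """Serialize an index document that points at the individual Resource Lists."""
    ET.register_namespace("", SM_NS)
    ET.register_namespace("rs", RS_NS)
    root = ET.Element(SM + "sitemapindex")
    ET.SubElement(root, RS + "md", {"capability": "resourcelist"})
    for uri in list_uris:
        sitemap = ET.SubElement(root, SM + "sitemap")
        ET.SubElement(sitemap, SM + "loc").text = uri
    return ET.tostring(root, encoding="unicode")


# Hypothetical inventory of 120,001 records.
record_uris = [f"http://example.com/records/{i}" for i in range(120_001)]
chunks = chunk(record_uris)

# Hypothetical naming scheme for the individual Resource Lists.
list_uris = [f"http://example.com/resourcelist_{n}.xml"
             for n in range(1, len(chunks) + 1)]
index_xml = build_index(list_uris)
```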

ResourceSync and DPLA

Harvester developed for Hyku endpoint

● Development for this specific endpoint means that it’s not a full test of all ResourceSync capabilities

● We suspect that we will prefer the Dump to the List
  ○ Using the List means making HTTP calls for each item in order to do the content negotiation
  ○ Dump allows us to just download specifically what we need
  ○ We will still be downloading records that weren’t updated but given the typical size of the diff for each provider this single download may still be preferable to 100,000 HTTP requests

● Future implementations may require us to build on this initial harvester if the specifics are different
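To make the cost of the List approach concrete: harvesting via a Resource List means one content-negotiated HTTP request per listed item. The sketch below only constructs the requests (nothing is sent) so the count is visible. The record URIs and the helper name `build_requests` are hypothetical; the media types are the registered ones for JSON-LD, N-Triples, and Turtle.

```python
from urllib.request import Request

# Registered media types for the three serializations named on the Hyku slide.
ACCEPT = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
}


def build_requests(uris, fmt):
    """Prepare one GET request per resource, asking for the given serialization."""
    accept = ACCEPT[fmt]
    return [Request(uri, headers={"Accept": accept}) for uri in uris]


# Hypothetical URIs as they might appear in a Resource List.
uris = [f"http://example.org/records/{i}" for i in range(3)]
requests = build_requests(uris, "turtle")

# One request per listed item: a list of 100,000 items means 100,000 requests,
# whereas a Resource Dump would be retrieved as packaged content instead.
request_count = len(requests)
```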

Next Steps

Hyku:

● Possibly support Dump
● Increase test set over 50K

DPLA:

● Harvest from 3 DPLA providers implementing ResourceSync by end of year

IIIF & ResourceSync: Supporting discovery

Mark A. Matienzo, Stanford University Libraries
@anarchivist / https://orcid.org/0000-0003-3270-1306
DPLAFest, Chicago, Illinois, April 20, 2017

International Image Interoperability Framework

A community that develops Shared APIs, implements them in Software, and exposes interoperable Content

http://iiif.io/

IIIF Community

http://iiif.io/community

● IIIF Consortium
  ○ Currently 38 state/national libraries, universities, museums, tech firms
  ○ Provides sustainability and steering for the initiative
● Wider community
  ○ 80+ CH institutions, companies, and projects using IIIF standards
  ○ iiif-discuss list = 670+ members
  ○ IIIF Slack = 300+ members
● Community & Technical Specification Groups

Shared APIs

http://iiif.io/api/

● Image API
  ○ Transfer image pixels, regions, etc.
  ○ Image manipulation
● Presentation API
  ○ Presentation of an object (pixels + navigation and metadata)
  ○ Easily share and re-use, mix and match content
  ○ Annotate content
● Search API
  ○ Search annotations
● Authentication API
  ○ Provide interoperability for access-restricted content

Software Implementations

https://github.com/IIIF/awesome-iiif

IIIF Content

All kinds of image resources: artworks, photographs, manuscripts, newspapers

Investigating AV and 3D

“Discovery” in IIIF

Finding interoperable resources

Two main concerns:

● How can users find IIIF resources?

● How can users then get those resources into an environment where they can use them?

Scoping the problem

What resources can be discovered?

Types of resources in IIIF:

● Content (Image API)
● Description (Presentation API)

The Image API does not provide description of image content, just technical and rights metadata.

Discovery requires Description resources to provide information about Content resources.

Presentation API

A Manifest provides just enough metadata (descriptive, structural, etc.) to drive a viewer.

A Collection groups Manifests or other Collections.

http://iiif.io/api/presentation/2.1/

Community work

IIIF Discovery Technical Specification Group

iiif.io/community/groups/discovery/

IIIF Discovery TSG scope:

● Crawling and harvesting
● Content indexing
● Change notification
● Import to viewers

Presentation API constraints

Informing decisions

The Presentation API does not include semantic descriptions, but can reference them using seeAlso.

IIIF (including the Presentation API) has a resource-centric view of the web, not a service-centric view (cf Sitemaps/ResourceSync vs OAI-PMH).

Examples

Basic Sitemaps at NC State

● Example demonstrates use of simple Sitemaps without any extensions (so without ResourceSync as well)

● Intended to expand upon existing practice of publishing sitemaps from digital collections

Sitemap entry for manifests

<url>
  <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004/manifest</loc>
  <lastmod>2016-12-13T15:38:19Z</lastmod>
</url>

Sitemap entry for landing page

<url>
  <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004</loc>
  <lastmod>2017-03-27T19:33:52Z</lastmod>
</url>

Sample of NCSU Sitemaps

Courtesy Jason Ronallo, North Carolina State University

Prototyping at Europeana

Exploring Sitemaps and extensions for discovery of IIIF resources for harvesting

● Partnership with University College Dublin and National Library of Wales

● ResourceSync satisfied key needs identified within requirements

● ResourceSync accommodated additional metadata prototyped in an IIIF Sitemap Extension

● Follows several synchronization paradigms

Uses Sitemaps and IIIF Extension

<url>
  <loc>http://newspapers.library.wales/view/3320640</loc>
  <iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">
    http://dams.llgc.org.uk/iiif/newspaper/issue/3320640/manifest.json
  </iiif:Manifest>
  <dct:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3320639.json</dct:isPartOf>
  <lastmod>2014-11-08</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>

Example of NLW Sitemap Entry

Courtesy Nuno Freire, Europeana

Uses Sitemaps and ResourceSync and DCMES as Extensions

<url>
  <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
  <rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
  <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
  <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>

Example of UCD Resource List Entry

Courtesy Nuno Freire, Europeana

Uses Sitemaps, ResourceSync, and Sitemap Image Extension

Sample of UCD Resource List

Courtesy John Howard, University College Dublin

Conclusions

Strengths

● ResourceSync addresses core requirements for exposing IIIF resources for harvesting

● Can build on publication of existing sitemaps easily

● Leverages Many-to-One, Selective Synchronization, and Metadata Harvesting paradigms

● Can adopt additional extensions to implement needed features

● Plenty of opportunity to contribute; need more prototypes

Challenges

● IIIF community’s needs for discovery are not necessarily what other sitemap consumers want (e.g. Google)

● Identifying the primary resource influences structure

● Unclear whether search engines support custom extensions, and what ranking impact would be

Thank You!

Mark A. Matienzo, Stanford University Libraries
@anarchivist / https://orcid.org/0000-0003-3270-1306
DPLAFest, Chicago, Illinois, April 20, 2017

Seamless access to the world’s open access research papers via ResourceSync

Petr Knoth

Use Case 1: ResourceSync as a seamless layer over heterogeneous APIs

Use Case 1: What is CORE?

[Diagram labels: OA Repositories; OA Journals; Mostly OAI-PMH]

CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.


» Enrichment and harmonisation of aggregated data
» Products/services:
  › Portal
  › API
  › Data dumps
  › Recommendation system for libraries
  › Repository dashboard
  › B2B and analytical services


» 70 million+ metadata records
» Over 6 million full texts hosted on CORE
» ~1.5 million monthly active users
» Aggregating from 2,500 repositories and 10k OA journals

Use Case 1: Key issue

Key players do not provide interoperability for machine access to metadata and content of research papers.

[Chart: Accessing full-text by harvesting the website. Values: 35%, 23%, 18%, 12%, 12%. Legend: Major search engines; Recognised services upon approval]

[Chart: Restricting access to full-text. Values: 75%, 12%, 13%. Legend: Don't restrict access in any way; Specify a crawl delay; Allow access to specific robots]

[Chart: Reference of an article’s full-text on metadata. Values: 39%, 11%, 39%, 11%. Legend: Direct link to full-text; Interface supporting full-text transfer]

[Chart: Accessing content standards. Values: 50%, 42%, 8%. Legend: OAI; Own API; Z39.50]

[Chart: Files format. Values: 36%, 24%, 4%, 32%, 4%. Legend: PDF; HTML; Plain text; HTML; JSON]

[Chart: Automated downloads of OA full-text. Values: 54%, 31%, 15%. Legend: Website; API; FTP]

Use Case 1: Approach

[Diagram labels: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; Mostly OAI-PMH; A range of bespoke APIs; + many others]

Provide seamless access over non-standardised APIs.

What protocol?


» Why not OAI-PMH?
  › Slow and very inefficient for big repositories.
  › Standardised for metadata transfer but not for content transfer.
  › Very difficult to represent the richness of metadata from a broad range of data providers.

Use Case 1: ResourceSync as a seamless access layer

» Very scalable implementation on both the server and client side
» Interpretation of metadata happens using existing pipeline at the aggregator.
» 1.5 million OA publications from Elsevier, Springer and others already exposed.
» Available at: https://publisher-connector.core.ac.uk/resourcesync

[Diagram labels: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; Mostly OAI-PMH; A range of bespoke APIs; + many others; ResourceSync]

Use Case 2: Exposing enriched data for Text and Data Mining (TDM) via ResourceSync

Use Case 2: Subscribing to ResourceSync

[Diagram labels: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; Mostly OAI-PMH; A range of bespoke APIs; ResourceSync; + many others]

» Other aggregators can subscribe to the Publisher connector to make use of their ingestion pipelines and enrichment technologies

Use Case 2: Content ingestion in OpenMinTeD

[Diagram labels: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; ResourceSync; Mostly OAI-PMH; OMTD-SHARE (over REST); A range of bespoke APIs; + many others]

» CORE and OpenAIRE are content sources in the OpenMinTeD TDM platform (EU infrastructure project) being developed to enable the mining of scholarly literature.

Use Case 2: Exposing enriched data for TDM

[Diagram labels: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; ResourceSync; Mostly OAI-PMH; A range of bespoke APIs; + many others; ResourceSync]

» But others want similar solutions… typically, they want to be able to sync and host the data.

Use Case 3: Make repositories and journals adopt ResourceSync

Use Case 3: Replace OAI-PMH with ResourceSync

[Diagram labels: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; ResourceSync; Mostly OAI-PMH; OMTD-SHARE (over REST); A range of bespoke APIs; + many others; ResourceSync; ResourceSync]

» Will be a game changer…
» Advocated by COAR Next Generation Repositories WG

Key contributions and considerations

What’s new about our implementation of ResourceSync?

» Scales to many millions of resources as required by aggregators (as opposed to existing implementations for repositories that are scalable for tens of thousands of resources)
» Real-time updating of Resource Lists and Change Lists (avoiding unnecessary batch processes).
» Combination of real-time updates and scalability

Architectural choices

» Based on the principle of changes being communicated to a controller as they happen (rather than having to be detected prior to Resource List/Change List updates)
» Uses Elasticsearch as a database
» Hashing mechanism to distribute size of each Resource List link and a clever mechanism for iterative updating of Resource Lists
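The slides do not describe the hashing mechanism in detail. One plausible reading, sketched here purely as an assumption, is that each resource URI is hashed to pick which Resource List it belongs to, so that lists stay roughly equal in size and a given resource always maps to the same list. The function name `bucket_for`, the choice of SHA-256, the bucket count, and the example URIs are all illustrative and not taken from the CORE implementation.

```python
import hashlib
from collections import Counter


def bucket_for(uri, n_buckets):
    """Map a resource URI to a stable Resource List number in [0, n_buckets)."""
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % n_buckets


N_LISTS = 8  # hypothetical number of Resource Lists

# Hypothetical resource URIs.
uris = [f"https://example.org/article/{i}" for i in range(10_000)]

# Number of resources that would land in each Resource List.
sizes = Counter(bucket_for(uri, N_LISTS) for uri in uris)
```

Under this assumption, a change to one resource only touches the single list its URI hashes to, which fits the slide's point about updating Resource Lists iteratively rather than regenerating them in batch.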

Conclusions

» ResourceSync:
  › broad range of uses in scholarly communication.
  › solves problems with aggregating content over OAI-PMH, faster & more efficient aggregation => fresher data in aggregators compared to OAI-PMH
» We used ResourceSync to “liberate” over 1.5 million OA papers (and growing) from key publishers
» CORE soon to provide access to over 8 million OA full texts via ResourceSync.
» CORE actively contributes to the adoption of ResourceSync in the repositories community (as part of OpenMinTeD and COAR NGR)

An overview of capabilities and real-world use cases for discovery, harvesting, and synchronization of resources on the web

http://www.openarchives.org/rs #resourcesync

ResourceSync ANSI/NISO Z39.99-2017

@mart1nkle1n @G_AmSpinnrade @anarchivist @petrknoth