resourcesync - overview and real-world use cases for discovery, harvesting, and synchronization of...
An overview of capabilities and real-world use cases for discovery, harvesting, and synchronization of resources on the web
http://www.openarchives.org/rs #resourcesync
ResourceSync ANSI/NISO Z39.99-2017
Martin Klein
Gretchen Gueguen
Mark Matienzo
Petr Knoth
ResourceSync was funded by the Sloan Foundation & JISC
Martin Klein Los Alamos National Laboratory
@mart1nkle1n
ResourceSync - @mart1nkle1n DPLAfest, Chicago, April 20 2017
Background - OAI-PMH
• Recurrent metadata exchange from a Data Provider to Service Providers
• XML metadata only
• Repository centric
• Devised 1999-2002, prior to REST, prior to dominance of web search engines
Revisit the Problem Domain - ResourceSync
• Synchronization of resources from a Source to Destinations
• Web resources, anything with an HTTP URI & representation
• Resource centric
• Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications
• Updated in 2017 to v1.1
One to One Synchronization
One to Many – Master Copy
Many to One - Aggregator
Selective Synchronization
Metadata Harvesting
ResourceSync Capabilities
• Resource List
  • Inventory, baseline synchronization
• Change List
  • Resource change events that occurred in a temporal interval, incremental synchronization
• Resource Dump
• Change Dump
• Notifications (separate specification)
• Archives (beta draft)
Sitemap
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2017-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2017-01-02T14:00:00Z</lastmod>
    <changefreq>daily</changefreq>
  </url>
  …
</urlset>
Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" type="application/pdf" />
    <rs:ln rel="describedby" href="http://example.com/res1_dublin_core_md.xml" type="application/xml" />
  </url>
  <url> … </url>
</urlset>
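A Destination doing a baseline synchronization could read such a Resource List roughly as follows. This is a minimal sketch using only the Python standard library; the helper names `parse_resource_list` and `md5_matches` are illustrative assumptions and are not taken from any of the ResourceSync tools mentioned in this talk. The embedded document is the slide's example with the elided second entry left out.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Resource List from the slide (elided second <url> omitted).
RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" type="application/pdf" />
    <rs:ln rel="describedby" href="http://example.com/res1_dublin_core_md.xml" type="application/xml" />
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return (capability, at, entries) for a Resource List document."""
    root = ET.fromstring(xml_text)
    list_md = root.find("rs:md", NS)  # document-level metadata
    entries = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)  # per-resource metadata, may be absent
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "hash": md.get("hash") if md is not None else None,
            "type": md.get("type") if md is not None else None,
            "links": [(ln.get("rel"), ln.get("href"))
                      for ln in url.findall("rs:ln", NS)],
        })
    return list_md.get("capability"), list_md.get("at"), entries


def md5_matches(content, hash_attr):
    """Compare downloaded bytes with the md5 value of an rs:md hash attribute."""
    for token in (hash_attr or "").split():
        algo, _, value = token.partition(":")
        if algo == "md5":
            return hashlib.md5(content).hexdigest() == value.lower()
    return False  # no md5 value published, so nothing to compare against


capability, at, entries = parse_resource_list(RESOURCE_LIST)
print(capability, at, entries[0]["loc"])
```

After downloading each `loc`, the Destination can call `md5_matches` on the received bytes to confirm the copy is intact before storing it.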
Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2017-01-02T09:00:00Z" until="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2017-01-02T13:00:00Z</lastmod>
    <rs:md change="created" datetime="2017-01-02T13:00Z" />
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <rs:md change="updated" datetime="2017-01-02T15:00Z" />
  </url>
</urlset>
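For incremental synchronization, a Destination can turn a Change List into a work plan: which URIs to fetch again and which local copies to drop. The sketch below is illustrative only; `plan_incremental_sync` is a hypothetical helper, it assumes entries appear in chronological order in the document, and the third entry (a deletion) is an assumption added here to show that case; it is not on the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2017-01-02T09:00:00Z" until="2017-01-03T09:00:00Z" />
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md change="created" datetime="2017-01-02T13:00Z" />
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <rs:md change="updated" datetime="2017-01-02T15:00Z" />
  </url>
  <!-- hypothetical entry, not on the slide, to illustrate a deletion -->
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md change="deleted" datetime="2017-01-02T16:00Z" />
  </url>
</urlset>"""


def plan_incremental_sync(xml_text):
    """Return (to_fetch, to_delete) lists of URIs, processed in document order."""
    root = ET.fromstring(xml_text)
    to_fetch, to_delete = [], []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        change = url.find("rs:md", NS).get("change")
        # A later event for the same URI supersedes an earlier one.
        if loc in to_fetch:
            to_fetch.remove(loc)
        if loc in to_delete:
            to_delete.remove(loc)
        if change in ("created", "updated"):
            to_fetch.append(loc)
        elif change == "deleted":
            to_delete.append(loc)
    return to_fetch, to_delete


to_fetch, to_delete = plan_incremental_sync(CHANGE_LIST)
print("fetch:", to_fetch)
print("delete:", to_delete)
```

Collapsing repeated events per URI means a resource that was updated several times in the interval is downloaded only once.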
ResourceSync Change Notifications
• Notifications about change events to resources
• Source notifies subscribed Destinations (cf. recurrent pull)
• Push-based approach via WebSub
• Similar, sitemap-based payload
• Decrease synchronization latency between Source and Destination
• Change Notification Specification v1.0
EHRI Use Case
• Aggregation of information about Holocaust collections
  • held by 1,800+ organizations worldwide
  • into a central service
  • EAD as exchange format
• Diversity of data sources and locations
  • databases, spreadsheets (“home collections”)
https://ehri-project.eu/ http://portal.ehri-project.eu
https://twitter.com/EHRIproject
EHRI Use Case
• Special ResourceSync implementation
  • Bridges gap between local systems and ResourceSync capability documents on a web server
  • Filters local resources by subject, time period, etc.
  • Set up by EHRI technical staff, run by contributing party
• Baseline synchronization: Resource Lists
• Incremental synchronization: Change Lists
• Together with EAD files moved from local system to web server
  • Dropbox, FTP, USB stick
• Service: partners expose EADs, server collects and offers value-added services, e.g., graph database
https://ehri-project.eu/ http://portal.ehri-project.eu
https://twitter.com/EHRIproject
CLARIAH Use Case
• Various institutions host evolving collections
• Make collection items uniformly available via RDF graph
• Central registry holds description of all collections
• Researchers use Virtual Research Environment to
  • Discover collections (via registry)
  • Collect graphs from respective institution
  • Keep graphs up to date
https://www.clariah.nl/ https://twitter.com/CLARIAH_NL
CLARIAH Use Case
• Baseline synchronization
  • Download graph from DB
  • Serialized as one or more files, one RDF triple per line (+ s p o graph_name)
  • “+” stands for “add”
  • URIs of files listed in Resource List
• Incremental synchronization
  • Changes logged in one or more files, one change per line (+/- s p o graph_name)
  • “+” stands for “add”, “-” for “delete”
  • URIs of files listed in Change List
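The line format described above lends itself to a very small replay loop on the researcher's side. The following is a sketch under stated assumptions: each line is taken to hold exactly five whitespace-separated tokens (operation, subject, predicate, object, graph name) with no spaces inside a term, which is a simplification; the actual CLARIAH serialization is not shown in the slides. The function names and the `ex:`/`dc:` sample terms are hypothetical.

```python
def parse_line(line):
    """Split one '+/- s p o graph_name' line into (operation, quad)."""
    parts = line.split()
    if len(parts) != 5 or parts[0] not in ("+", "-"):
        raise ValueError(f"unexpected change line: {line!r}")
    op, s, p, o, g = parts
    return op, (s, p, o, g)


def apply_lines(graph, lines):
    """Apply add/delete lines, in order, to a set of quads."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        op, quad = parse_line(line)
        if op == "+":
            graph.add(quad)
        else:
            graph.discard(quad)  # deleting an absent quad is a no-op
    return graph


# Baseline file (URI listed in the Resource List): additions only.
baseline = [
    "+ ex:book1 dc:title ex:title1 ex:graphA",
    "+ ex:book1 dc:creator ex:person1 ex:graphA",
]
# Later change file (URI listed in the Change List): adds and deletes.
changes = [
    "- ex:book1 dc:title ex:title1 ex:graphA",
    "+ ex:book1 dc:title ex:title2 ex:graphA",
]

graph = apply_lines(set(), baseline)
graph = apply_lines(graph, changes)
print(sorted(graph))
```

Because the baseline uses the same line format as the change files, one code path covers both the initial load and every later update.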
https://www.clariah.nl/ https://twitter.com/CLARIAH_NL
ResourceSync Tools
• Source implementation
  • Python
  • DANS & LANL & CORE
  • Connectors to file system, Solr index
  • OAI-PMH converter (planned)
  • https://github.com/resourcesync/py-resourcesync
• Client implementation
  • Python
  • https://github.com/resync/resync
• Notification implementation
  • PubSubHubbub
  • https://github.com/resync/resourcesync_push
Hyku & DPLA ResourceSync Implementations
Gretchen Gueguen, Data Services Coordinator, Digital Public Library of America, gretchen@dp.la
Project Background
● IMLS National Leadership Grant (30 months)
● Foster a national digital platform through community-based repository infrastructure
● Leverage & contribute to Hydra, both in code and community
Primary Project Goals
1. Develop turnkey (“easy to install, easy to maintain”) Hydra-based application that leverages and improves on core code components
2. Develop metadata aggregation & enrichment tools
3. Work toward a hosted service in the cloud
Metadata Aggregation @DPLA
Methods for Data Aggregation:
● OAI-PMH (21 providers)
● Custom APIs/other (8 providers)
● Direct file transfer (3 providers)
Biggest Drawbacks:
● Re-synchronizing entire data sets
● Relying on HTTP requests
ResourceSync and Hyku
● ResourceSync publishing support built into MVP
● Test application with 50,000 records to start
  ○ Limit for a single list. To add more, we would need to make a list of lists.
● Resource lists and change lists are supported
● Resource or change dumps not currently supported
● Content negotiation for JSON-LD, N-Triples, and Turtle
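The "list of lists" mentioned for going beyond 50,000 records can be sketched as splitting the inventory into chunks and publishing an index that points at each chunk. The example below is an illustration only, not Hyku's code: `chunk` and `build_index` are hypothetical helpers, and the base URL and `resourcelist_NNNN.xml` file naming are assumptions.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
MAX_ENTRIES = 50_000  # per-list limit cited on the slide


def chunk(uris, size=MAX_ENTRIES):
    """Split the full inventory into lists of at most `size` URIs."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


def build_index(n_lists, base_url, at):
    """Serialize an index document that points at each component list."""
    ET.register_namespace("", SM)
    ET.register_namespace("rs", RS)
    root = ET.Element(f"{{{SM}}}sitemapindex")
    ET.SubElement(root, f"{{{RS}}}md", {"capability": "resourcelist", "at": at})
    for i in range(1, n_lists + 1):
        entry = ET.SubElement(root, f"{{{SM}}}sitemap")
        # Hypothetical naming scheme for the component lists.
        ET.SubElement(entry, f"{{{SM}}}loc").text = f"{base_url}/resourcelist_{i:04d}.xml"
    return ET.tostring(root, encoding="unicode")


uris = [f"http://example.org/records/{i}" for i in range(120_000)]
chunks = chunk(uris)
index_xml = build_index(len(chunks), "http://example.org/resourcesync", "2017-04-20T00:00:00Z")
print([len(c) for c in chunks])
```

A harvester would fetch the index first, then each component list in turn, so no single document has to exceed the per-list limit.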
ResourceSync and DPLA
Harvester developed for Hyku endpoint
● Development for this specific endpoint means that it’s not a full test of all ResourceSync capabilities
● We suspect that we will prefer the Dump to the List
  ○ Using the List means making HTTP calls for each item in order to do the content negotiation
  ○ Dump allows us to just download specifically what we need
  ○ We will still be downloading records that weren’t updated but given the typical size of the diff for each provider this single download may still be preferable to 100,000 HTTP requests
● Future implementations may require us to build on this initial harvester if the specifics are different
Next Steps
Hyku:
● Possibly support Dump
● Increase test set over 50K
DPLA:
● Harvest from 3 DPLA providers implementing ResourceSync by end of year
IIIF & ResourceSync: Supporting discovery
Mark A. Matienzo, Stanford University Libraries
@anarchivist / https://orcid.org/0000-0003-3270-1306
DPLAFest, Chicago, Illinois, April 20, 2017
International Image Interoperability Framework
A community that develops Shared APIs, implements them in Software, and exposes interoperable Content
http://iiif.io/
IIIF Community
http://iiif.io/community
● IIIF Consortium
  ○ Currently 38 state/national libraries, universities, museums, tech firms
  ○ Provides sustainability and steering for the initiative
● Wider community
  ○ 80+ CH institutions, companies, and projects using IIIF standards
  ○ iiif-discuss list = 670+ members
  ○ IIIF Slack = 300+ members
● Community & Technical Specification Groups
Shared APIs
http://iiif.io/api/
● Image API
  ○ Transfer image pixels, regions, etc.
  ○ Image manipulation
● Presentation API
  ○ Presentation of an object (pixels + navigation and metadata)
  ○ Easily share and re-use, mix and match content
  ○ Annotate content
● Search API
  ○ Search annotations
● Authentication API
  ○ Provide interoperability for access-restricted content
Software Implementations
https://github.com/IIIF/awesome-iiif
IIIF Content
All kinds of image resources: artworks, photographs, manuscripts, newspapers
Investigating AV and 3D
“Discovery” in IIIF
Finding interoperable resources
Two main concerns:
● How can users find IIIF resources?
● How can users then get those resources into an environment where they can use them?
Scoping the problem
What resources can be discovered?
Types of resources in IIIF:
● Content (Image API)
● Description (Presentation API)
The Image API does not provide description of image content, just technical and rights metadata.
Discovery requires Description resources to provide information about Content resources.
Presentation API
A Manifest provides just enough metadata (descriptive, structural, etc.) to drive a viewer.
A Collection groups Manifests or other Collections.
http://iiif.io/api/presentation/2.1/
Community work
IIIF Discovery Technical Specification Group
iiif.io/community/groups/discovery/
IIIF Discovery TSG scope:
● Crawling and harvesting
● Content indexing
● Change notification
● Import to viewers
Presentation API constraints
Informing decisions
The Presentation API does not include semantic descriptions, but can reference them using seeAlso.
IIIF (including the Presentation API) has a resource-centric view of the web, not a service-centric view (cf Sitemaps/ResourceSync vs OAI-PMH).
Examples
Basic Sitemaps at NC State
● Example demonstrates use of simple Sitemaps without any extensions, not even ResourceSync
● Intended to expand upon existing practice of publishing sitemaps from digital collections
Sitemap entry for manifests
<url>
  <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004/manifest</loc>
  <lastmod>2016-12-13T15:38:19Z</lastmod>
</url>
Sitemap entry for landing page
<url>
  <loc>https://d.lib.ncsu.edu/collections/catalog/bh1141pnc004</loc>
  <lastmod>2017-03-27T19:33:52Z</lastmod>
</url>
Sample of NCSU Sitemaps
Courtesy Jason Ronallo, North Carolina State University
Prototyping at Europeana
Exploring Sitemaps and extensions for discovery of IIIF resources for harvesting
● Partnership with University College Dublin and National Library of Wales
● ResourceSync satisfied key needs identified within requirements
● ResourceSync accommodated additional metadata prototyped in an IIIF Sitemap Extension
● Follows several synchronization paradigms
Uses Sitemaps and IIIF Extension
<url>
  <loc>http://newspapers.library.wales/view/3320640</loc>
  <iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">http://dams.llgc.org.uk/iiif/newspaper/issue/3320640/manifest.json</iiif:Manifest>
  <dct:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3320639.json</dct:isPartOf>
  <lastmod>2014-11-08</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
Example of NLW Sitemap Entry
Courtesy Nuno Freire, Europeana
Uses Sitemaps and ResourceSync and DCMES as Extensions
<url>
  <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
  <rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
  <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
  <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
Example of UCD Resource List Entry
Courtesy Nuno Freire, Europeana
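A harvester looking for IIIF resources in a Resource List like the UCD example could filter links on their `dcterms:conformsTo` value. This is a minimal sketch: `iiif_links` is a hypothetical helper, and the enclosing `urlset` element with its namespace declarations is an assumption added so the slide's entry parses on its own (the `dcterms` prefix is assumed to map to the Dublin Core terms namespace).

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
DCTERMS = "http://purl.org/dc/terms/"
IIIF_PRESENTATION_PREFIX = "http://iiif.io/api/presentation/"

# The UCD entry from the slide, wrapped in an assumed <urlset> element.
ENTRY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
    <rs:ln rel="alternate" href="https://digital.ucd.ie/view/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
</urlset>"""


def iiif_links(xml_text):
    """Return (loc, rel, href) for every link declaring IIIF Presentation API conformance."""
    root = ET.fromstring(xml_text)
    found = []
    for url in root.iter(f"{{{SM}}}url"):
        loc = url.findtext(f"{{{SM}}}loc")
        for ln in url.findall(f"{{{RS}}}ln"):
            conforms = ln.get(f"{{{DCTERMS}}}conformsTo", "")
            if conforms.startswith(IIIF_PRESENTATION_PREFIX):
                found.append((loc, ln.get("rel"), ln.get("href")))
    return found


for loc, rel, href in iiif_links(ENTRY):
    print(rel, href)
```

Matching on the API prefix rather than an exact version string keeps the filter usable if a Source later declares a newer Presentation API version.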
Uses Sitemaps, ResourceSync, and Sitemap Image Extension
Sample of UCD Resource List
Courtesy John Howard, University College Dublin
Conclusions
Strengths
● ResourceSync addresses core requirements for exposing IIIF resources for harvesting
● Can build on publication of existing sitemaps easily
● Leverages Many-to-One, Selective Synchronization, and Metadata Harvesting paradigms
● Can adopt additional extensions to implement needed features
● Plenty of opportunity to contribute; need more prototypes
Challenges
● IIIF community’s needs for discovery are not necessarily what other sitemap consumers want (e.g. Google)
● Identifying the primary resource influences structure
● Unclear whether search engines support custom extensions, and what ranking impact would be
Thank You!
Mark A. Matienzo, Stanford University Libraries
@anarchivist / https://orcid.org/0000-0003-3270-1306
DPLAFest, Chicago, Illinois, April 20, 2017
Seamless access to the world’s open access research papers via ResourceSync
Petr Knoth
Use Case 1: ResourceSync as a seamless layer over heterogeneous APIs
Use Case 1: What is CORE?
OA Repositories OA Journals
Mostly OAI-PMH
CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.
» Enrichment and harmonisation of aggregated data
» Products/services:
  › Portal
  › API
  › Data dumps
  › Recommendation system for libraries
  › Repository dashboard
  › B2B and analytical services
» 70 million+ metadata records
» Over 6 million full texts hosted on CORE
» ~1.5 million monthly active users
» Aggregating from 2,500 repositories and 10k OA journals
Use Case 1: Key issue
Key players do not provide interoperability for machine access to metadata and content of research papers.
[Figure: six pie charts. Panels: Accessing full-text by harvesting the website; Restricting access to full-text; Reference of an article’s full-text on metadata; Accessing content standards; Files format; Automated downloads of OA full-text]
Use Case 1: Approach
OA Repositories OA Journals
Key publishers (OA + hybrid OA)
Publisher connector
Mostly OAI-PMH
A range of bespoke APIs
+ many others
Provide seamless access over non-standardised APIs.
What protocol?
» Why not OAI-PMH?
  › Slow and very inefficient for big repositories.
  › Standardised for metadata transfer but not for content transfer.
  › Very difficult to represent the richness of metadata from a broad range of data providers.
Use Case 1: ResourceSync as a seamless access layer
» Very scalable implementation on both the server and client side
» Interpretation of metadata happens using existing pipeline at the aggregator.
» 1.5 million OA publications from Elsevier, Springer and others already exposed.
» Available at: https://publisher-connector.core.ac.uk/resourcesync
OA Repositories OA Journals
Key publishers (OA + hybrid OA)
Publisher connector
Mostly OAI-PMH
A range of bespoke APIs
+ many others
ResourceSync
Use Case 2: Exposing enriched data for Text and Data Mining (TDM) via ResourceSync
Use Case 2: Subscribing to ResourceSync
OA Repositories OA Journals
Key publishers (OA + hybrid OA)
Publisher connector
Mostly OAI-PMH
A range of bespoke APIs
ResourceSync
+ many others
» Other aggregators can subscribe to the Publisher connector to make use of their ingestion pipelines and enrichment technologies
Use Case 2: Content ingestion in OpenMinTeD
OA Repositories OA Journals
Key publishers (OA + hybrid OA)
Publisher connector ResourceSync
Mostly OAI-PMH
OMTD-SHARE (over REST)
A range of bespoke APIs
+ many others
» CORE and OpenAIRE are content sources in the OpenMinTeD TDM platform (EU infrastructure project) being developed to enable the mining of scholarly literature.
Use Case 2: Exposing enriched data for TDM
OA Repositories OA Journals
Key publishers (OA + hybrid OA)
Publisher connector ResourceSync
Mostly OAI-PMH
A range of bespoke APIs
+ many others
ResourceSync
» But others want similar solutions… typically, they want to be able to sync and host the data.
Use Case 3: Make repositories and journals adopt ResourceSync
Use Case 3: Replace OAI-PMH with ResourceSync
OA Repositories OA Journals
Key publishers (OA + hybrid OA)
Publisher connector ResourceSync
Mostly OAI-PMH
OMTD-SHARE (over REST)
A range of bespoke APIs
+ many others
ResourceSync
ResourceSync
» Will be a game changer…
» Advocated by COAR Next Generation Repositories WG
Key contributions and considerations
What’s new about our implementation of ResourceSync?
» Scales to many millions of resources as required by aggregators (as opposed to existing implementations for repositories that are scalable for tens of thousands of resources)
» Real-time updating of Resource Lists and Change Lists (avoiding unnecessary batch processes).
» Combination of real-time updates and scalability
Architectural choices
» Based on the principle of changes being communicated to a controller as they happen (rather than having to be detected prior to Resource List/Change List updates)
» Uses Elasticsearch as a database
» Hashing mechanism to distribute size of each Resource List link and a clever mechanism for iterative updating of Resource Lists
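The slides do not show how the hashing mechanism works, so the following is only a guess at the general idea, not CORE's implementation: a stable hash of each resource URI picks one of a fixed number of Resource Lists, which keeps the lists similar in size and lets a change touch just one list as it happens. `bucket_for`, `on_change`, and the list count are all hypothetical.

```python
import hashlib
from collections import Counter


def bucket_for(uri, n_lists):
    """Map a URI to a list number in [0, n_lists) using a stable hash."""
    digest = hashlib.sha1(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_lists


def on_change(lists, uri, lastmod, n_lists):
    """Record a change as it happens by updating only the affected list."""
    idx = bucket_for(uri, n_lists)
    lists.setdefault(idx, {})[uri] = lastmod
    return idx


N_LISTS = 10  # assumed number of component Resource Lists
lists = {}
for i in range(10_000):
    on_change(lists, f"http://example.org/papers/{i}", "2017-04-20T00:00:00Z", N_LISTS)

sizes = Counter({idx: len(entries) for idx, entries in lists.items()})
print(sorted(sizes.items()))
```

Python's built-in `hash()` is avoided here because it is salted per process for strings, which would send the same URI to different lists after a restart.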
Conclusions
» ResourceSync:
  › broad range of uses in scholarly communication.
  › solves problems with aggregating content over OAI-PMH, faster & more efficient aggregation => fresher data in aggregators compared to OAI-PMH
» We used ResourceSync to “liberate” over 1.5 million OA papers (and growing) from key publishers
» CORE soon to provide access to over 8 million OA full texts via ResourceSync.
» CORE actively contributes to the adoption of ResourceSync in the repositories community (as part of OpenMinTeD and COAR NGR)
@mart1nkle1n @G_AmSpinnrade @anarchivist @petrknoth