Subject Repositories: European collaboration in the international context
28-29 January 2010
Workshop: Technical infrastructure & interoperability
Benoit Pauwels, Université Libre de Bruxelles, Belgium


Workshop plan

• Theme 1: The Economists Online network of data providers
  – General infrastructure of the EO solution
  – DIDL/MODS: the EO metadata exchange format
  – RDF/XML Admin file: decentralized administration
  – Enrichment of metadata

• Theme 2: Economists Online and RePEc
  – Pulling metadata from RePEc
  – Pushing metadata to RePEc
  – Contribute to LogEc
  – Use CitEc

Workshop plan

• Theme (45’)
  – Introduction (BP, 20’)
  – 3 topics for brainstorming (breakout groups, 10’)
  – Breakout groups reporting back (all, 15’)

The Economists Online network of data providers

• Theme 1: The Economists Online network of data providers
  – General infrastructure of the EO solution
  – DIDL/MODS: the EO metadata exchange format
  – RDF/XML Admin file: decentralized administration
  – Enrichment of metadata

[Figure: general infrastructure of the EO solution. Labels in the diagram: Meresco (metadata harvester, objects crawler over HTTP, Lucene); EO portal (homemade, FOSS); exporter engine (homemade, FOSS); logs; OAI-PMH; RSS; SRU; other portals; RePEc. Metadata exchange format: DIDL / MODS (NEEO specs). Usage metadata exchange format: SWUPOFI Comm Profile.]

Technical decisions

Desired EO functionality                                 | Technical decision
Facetted search&find experience                          | Normalized/normalizable metadata
APA formatted citations                                  | Granular metadata
Publication list per author                              | Unambiguous identification of authors
Full text indexing/searching                             | Unambiguous links to full texts
Enrichment of metadata (JEL, datasets, citations, ReDIF) | Extensible metadata format

Metadata exchange format

• XML container structure that can hold semantically distinct metadata: DIDL
  – descriptive metadata
  – object files (by-ref)
  – splash page
  – enriched metadata
    o JEL
    o full text (by-ref)
    o datasets (by-ref)
    o [ references ]
    o RePEc handle and metadata (by-ref)
• Based on existing container structure defined by SurfShare
• “info:eu-repo” vocabularies (objectfile accessRights, version, ...)

Metadata exchange format

• Granular descriptive metadata: MODS (3.2)
  – Based on existing metadata structure defined by SurfShare
  – “info:eu-repo” vocabularies (publication type,
• Unambiguous identification of authors: DAI – Digital Author Identifier
  – National or institution-unique persistent identifier
• Solutions not specific to the NEEO project; continuous aim of standardization at a level that surpasses the project

DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
      Descriptor/type (« descriptiveMetadata »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by value (XML)
    Item[0..∞] (of type objectFile)
      Descriptor/type (« objectFile »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by ref. (URL)
    Item[0..1] (of type humanStartPage)
      Descriptor/type (« humanStartPage »)
      Component/Resource -- representation by ref. (URL)

EO Data model

• Publication is described as a complex (compound) object
  – persistent identifier
• Aggregation of 3 types of components
  – descriptiveMetadata (MODS)
  – objectFiles
  – humanStartPage
• Extensible
  – additional items can be stored within the complex object
• MODS
  – contains Digital Author Identifier (DAI) of EO author
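The data model above can be sketched as a small in-memory structure. This is only an illustration of the stated cardinalities (at least one descriptiveMetadata item, any number of objectFiles, at most one humanStartPage); the class and field names are hypothetical and are not taken from the NEEO application profiles.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One nested Item of the compound object (hypothetical model)."""
    item_type: str                 # e.g. "descriptiveMetadata", "objectFile", "humanStartPage"
    resource: str                  # inline XML (by value) or a URL (by reference)
    by_reference: bool = False
    identifier: Optional[str] = None
    modified: Optional[str] = None


@dataclass
class Publication:
    """Top-level Item: persistent identifier plus nested Items."""
    identifier: str
    modified: str
    items: List[Item] = field(default_factory=list)

    def count(self, item_type: str) -> int:
        return sum(1 for i in self.items if i.item_type == item_type)

    def problems(self) -> List[str]:
        """Return cardinality violations; an empty list means none were found."""
        found = []
        if self.count("descriptiveMetadata") < 1:
            found.append("at least one descriptiveMetadata item is required")
        if self.count("humanStartPage") > 1:
            found.append("at most one humanStartPage item is allowed")
        for i in self.items:
            # objectFile and humanStartPage are represented by reference (URL)
            if i.item_type in ("objectFile", "humanStartPage") and not i.by_reference:
                found.append(f"{i.item_type} must be represented by reference (URL)")
        return found
```

Because unknown item types are simply carried along, additional items (for example enrichment items) can be added to `items` without changing the model, which mirrors the "extensible" point above.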

Metadata exchange format

• Implementations in NEEO
  – DIDL application profile
  – MODS application profile
  – Vocabularies in DIDL and MODS
  – Technical guidelines for project partners
• Solutions: home-made or with external support
  – ARNO: home-made
  – DSpace: home-made, AtMire
  – EPrints: home-made, ECS-University of Southampton
  – Fedora: METS/MODS -> DIDL/MODS
  – DigiTool: METS/MARC -> DIDL/MODS

Decentralized registry service

• XML-RDF file
  – FOAF + NEEO-specific vocabulary
  – maintained by each data provider on a local web server
  – information about the institution: name, description, ...
  – OAI baseURL + OAI sets to harvest
  – EO authors: photograph, full name, affiliation, DAI
• Fetched over HTTP GET and validated by the EO Gateway at regular intervals
  – Automated harvesting process
  – Made visible through portal
• New partner
  – Create admin file
  – Ask for registration at [email protected], declaring location and validating admin file
  – If valid, you’re in
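A rough sketch of the kind of check a gateway could run on an admin file after fetching it. The RDF and FOAF namespaces are the standard ones, but the `neeo` namespace and its element names (`oaiBaseURL`, `oaiSet`, `dai`) are invented here for illustration; the real NEEO-specific vocabulary is not shown in these slides.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
NEEO = "http://example.org/neeo#"  # hypothetical namespace, NOT the real NEEO vocabulary

# Hypothetical admin file: one institution and one EO author.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}" xmlns:neeo="{NEEO}">
  <foaf:Organization>
    <foaf:name>Example University</foaf:name>
    <neeo:oaiBaseURL>http://repository.example.org/oai</neeo:oaiBaseURL>
    <neeo:oaiSet>economics</neeo:oaiSet>
  </foaf:Organization>
  <foaf:Person>
    <foaf:name>Jane Doe</foaf:name>
    <neeo:dai>example-dai-0001</neeo:dai>
  </foaf:Person>
</rdf:RDF>"""


def validate_admin_file(xml_text):
    """Return a list of problems; an empty list means the file passed these checks."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    org = root.find(f"{{{FOAF}}}Organization")
    if org is None:
        return ["no institution description found"]
    errors = []
    if not (org.findtext(f"{{{FOAF}}}name") or "").strip():
        errors.append("institution name is missing")
    if not (org.findtext(f"{{{NEEO}}}oaiBaseURL") or "").strip():
        errors.append("OAI baseURL is missing")
    for person in root.findall(f"{{{FOAF}}}Person"):
        name = (person.findtext(f"{{{FOAF}}}name") or "").strip()
        if not (person.findtext(f"{{{NEEO}}}dai") or "").strip():
            errors.append(f"author {name or '?'} has no DAI")
    return errors
```

Calling `validate_admin_file(SAMPLE)` returns an empty list; a scheduler could rerun the same check each time the file is fetched again.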

[Figure: the same infrastructure diagram, now with an Enrichment service attached. Labels in the diagram: Meresco (metadata harvester, objects crawler over HTTP, Lucene); EO portal (homemade, FOSS); exporter engine (homemade, FOSS); logs; OAI-PMH; RSS/Atom; SRU; other portals; RePEc; Enrichment service (SRU, OAI-PMH).]

Metadata enrichment

• “Automated” enrichment – JEL, full-text
  1. ES gets records to be enriched from EO, over SRU
     1. Based on date of request for enrichment of certain type and version
     2. Based on flag set in EO record
  2. ES creates enrichment record(s)
  3. ES makes enrichment records available to EO, over OAI-PMH
  4. EO harvests enrichment records from ES and integrates into original record
  5. EO reuses enrichment information in its services: index & present

• “Manual” enrichment – datasets
  1) Partner enters permalink of publication on DVN platform
  2) EO PMH-harvests DDI from DVN, and stores by-ref information
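Steps 1 and 4 of the automated flow are plain HTTP requests, so they can be sketched as URL builders. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords` for SRU; `verb`, `metadataPrefix`, `from`, `set` for OAI-PMH) come from the respective protocols. The base URLs, the CQL index `enrichment.flag` and the metadata prefix `enrichment` in the usage lines are made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, start=1, maximum=50):
    """Build an SRU searchRetrieve request (step 1: ES asks EO for records)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return f"{base_url}?{urlencode(params)}"


def oai_list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request (step 4: EO harvests from ES)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoints and query, for illustration only.
sru_url = sru_search_url("http://eo.example.org/sru", 'enrichment.flag = "jel"')
oai_url = oai_list_records_url("http://es.example.org/oai", "enrichment", from_date="2010-01-01")
```

A real harvester would also follow OAI-PMH resumption tokens and page through SRU results with `startRecord`; both are left out to keep the sketch short.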

Enriched publication

DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
    Item[0..∞] (of type objectFile)
    Item[0..1] (of type humanStartPage)
    Item[0..∞] (of type text): PDF, HTML, TXT
    Item[0..∞] (of type enrichedMetadata)
    Item[0..∞] (of type dataset): Dataset DDI
    Item[0..∞] (of type review): Review
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Item[1..∞] (of type descriptiveMetadata)
      Item[0..∞] (of type objectFile)

[Figure labels: EO; IR / ES. Caption: LinkedData / SemanticWeb / ORE ready]

Theme 1: The Economists Online network of data providers

» BO Group 1: DIDL/MODS
  » Scalable? Implementation by 100s of partners
  » Local experiences from existing partners: implementation issues you want to share?
  » Can this become a standard for exchange of metadata of IR contained publications? Where does this stand next to (flavours of) DC, SWAP,...?

» BO Group 2: XML Admin file
  » Scalable? Implementation by 100s of partners
  » Local experiences from existing partners: implementation issues you want to share?
  » DAI?

» BO Group 3: Enrichment model
  » Extensibility: vocabulary for semantics of components
  » Manual enrichment: need for enriched submission form, making it easy for people to make enriched publications
  » Automated (JEL, full text): sustainable?

Workshop plan

• Theme 2: Economists Online and RePEc
  – Pulling metadata from RePEc
  – Pushing metadata to RePEc
  – Contribute to LogEc
  – Use CitEc

RePEc model

• RePEc archives contain RePEc series, which contain working papers, articles, books, book chapters, software
  – Manually maintained by research centres, journal publishers, university departments all over the world
  – +/- 900 archives, more than 4000 series
  – ReDIF metadata format
• Network accessible over FTP or HTTP
• Aggregation by RePEc services:
  – EconPapers
  – IDEAS
  – Central PMH-accessible aggregated archive of AMF formatted metadata

RePEc model

Template-type: ReDIF-Paper 1.0
Author-Name: Capron, Henri
Author-Email: [email protected]
Author-Name: Meeusen, Wim
Author-Email: [email protected]
Author-Name: Dumont, Michel
Author-Person: pdu51
Author-Name: Cincera, Michele
Author-Person: pci5
Title: National innovation systems: pilot study of the Belgian innovation system
Creation-Date: 1998
Publication-Status: Published as a report for the Belgian Federal Office for Scientific, Technical and Cultural Affairs (OSTC)
File-URL: http://bib17.ulb.ac.be:8080/dspace/bitstream/2013/941/1/mc-0048.pdf
File-Format: application/pdf
Handle: RePEc:dul:ecoulb:2013-941
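A ReDIF template like the one on this slide is a sequence of `Field: value` lines, with `Author-*` fields grouped under the preceding `Author-Name`. The following is a simplified reader written for this example only: it ignores multi-line values and other clusters (such as `File-*`), and it is not the parser RePEc itself uses.

```python
SAMPLE = """Template-type: ReDIF-Paper 1.0
Author-Name: Dumont, Michel
Author-Person: pdu51
Author-Name: Cincera, Michele
Author-Person: pci5
Title: National innovation systems: pilot study of the Belgian innovation system
Creation-Date: 1998
File-URL: http://bib17.ulb.ac.be:8080/dspace/bitstream/2013/941/1/mc-0048.pdf
Handle: RePEc:dul:ecoulb:2013-941"""


def parse_redif(text):
    """Parse one ReDIF template into a dict, grouping Author-* fields per author."""
    record = {"authors": []}
    for line in text.splitlines():
        if not line.strip() or ":" not in line:
            continue  # simplified: blank and continuation lines are skipped
        key, value = line.split(":", 1)  # split on the first colon only
        key, value = key.strip(), value.strip()
        if key == "Author-Name":
            record["authors"].append({"Author-Name": value})  # starts a new author
        elif key.startswith("Author-") and record["authors"]:
            record["authors"][-1][key] = value  # attaches to the latest author
        else:
            record[key] = value
    return record


record = parse_redif(SAMPLE)
```

Splitting on the first colon only matters here: both the title and the handle contain further colons that belong to the value.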

RePEc model compared to IR model

• Very similar, BUT
• RePEc model:
  – Harvests only from “official” publisher repositories
  – Therefore: 1 work exists once in RePEc and is guaranteed to be the one and only “official” manifestation of the work
• IR model:
  – holds publications for which the institution is typically not the publisher
  – 1 work: 1 official manifestation + multiple author manifestations
  – one work can exist in:
    o one or more repositories
    o as different publication types
    o with different descriptive metadata
    o with different object files attached
    o with different object file metadata

Pushing and pulling metadata records from RePEc and IR into one system is bound to raise problems

Pull metadata from RePEc

• EO harvests AMF formatted metadata records from http://oai.repec.openlib.org/
• Overlap !!
  – Same records are harvested from IR and RePEc
  – Solution:
    o XML Admin file contains directive <not-from-repec-series>
    o Lets a partner specify which RePEc series do not need to be harvested from RePEc, since already delivered through IR
  – BUT:
    o IR contains articles produced by its authors
    o These articles are contained in a journal RePEc series
    o Overlap in EO cannot be avoided
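The effect of <not-from-repec-series> can be illustrated with a small filter over RePEc handles. A handle has the form `RePEc:archive:series:item`, so its first three parts name the series. How the directive is actually written in the admin file is not shown on the slide; here the excluded series are simply passed in as a list, and both function names are hypothetical.

```python
def series_of(handle):
    """Return the series part of a RePEc item handle, e.g. 'RePEc:dul:ecoulb'."""
    parts = handle.split(":")
    if len(parts) < 4 or parts[0] != "RePEc":
        raise ValueError(f"not a RePEc item handle: {handle!r}")
    return ":".join(parts[:3])


def drop_excluded(handles, not_from_repec_series):
    """Keep only handles whose series is not listed as already delivered through the IR."""
    excluded = set(not_from_repec_series)
    return [h for h in handles if series_of(h) not in excluded]


harvested = ["RePEc:dul:ecoulb:2013-941", "RePEc:aah:aarhec:1987-21"]
kept = drop_excluded(harvested, ["RePEc:dul:ecoulb"])
```

This only removes duplicates at series level. As the slide notes, an article held in the IR and also present in a journal's RePEc series would still appear twice.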

Push metadata to RePEc

• EO sets up “RePEc:ner” archive, containing ReDIF-X formatted records
• ReDIF-X
  – All records are delivered as “ReDIF-Paper”, but with extra fields denoting the “real” publication status and version of text
• Overlap !!
  – Most institutions already maintain RePEc series: these records must not be pushed by EO
  – XML Admin file controls which series to feed in this “ner” archive
    o <feed-repec>: boolean, to feed or not to feed
    o <feed-repec-series>
      · If not given: all records with fulltext that are not working papers are mapped to one series for that institution
      · RePEc series corresponds to OAI setspec of DIDL/MODS record
• BUT: IR inherent problem of multiple copies/versions is pushed to RePEc

Push metadata to RePEc: ReDIF-X

Template-type: ReDIF-Paper 1.0
Title: Block investments and the race for corporate control in Belgium
Author-Name: Chapelle, Ariane
Language: en
Note: info:eu-repo/semantics/published
X-PublishedAs-Type: article
X-PublishedAs-Article-Year: 2004
X-PublishedAs-Article-Journal: Corporate Ownership & Control
X-PublishedAs-Article-Volume: 2
X-PublishedAs-Article-Issue: 1
Order-URL: http://dipot.ulb.ac.be:8080/dspace/handle/2013/9943
File-URL: http://dipot.ulb.ac.be:8080/dspace/bitstream/2013/9943/1/ac-0007.pdf
File-Format: application/pdf
File-Version: authorVersion
Handle: RePEc:ulb:ecoulb:2013/9943
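As an illustration of the mapping, the sketch below serialises a publication held as a plain dictionary into a subset of the ReDIF-X fields on this slide. The dictionary keys are invented for the example, and the pattern `X-PublishedAs-<Type>-<Field>` is generalised from this single record; whether other publication types follow the same pattern is an assumption.

```python
def to_redif_x(pub):
    """Serialise a publication dict (hypothetical keys) as a ReDIF-X record."""
    lines = ["Template-type: ReDIF-Paper 1.0", f"Title: {pub['title']}"]
    lines += [f"Author-Name: {name}" for name in pub["authors"]]
    published = pub.get("published_as")
    if published:
        kind = published["type"]  # the "real" publication type, e.g. "article"
        lines.append(f"X-PublishedAs-Type: {kind}")
        for key, value in published["details"].items():
            lines.append(f"X-PublishedAs-{kind.capitalize()}-{key}: {value}")
    lines.append(f"File-URL: {pub['file_url']}")
    lines.append(f"File-Format: {pub['file_format']}")
    lines.append(f"File-Version: {pub['file_version']}")
    lines.append(f"Handle: {pub['handle']}")
    return "\n".join(lines)


example = {
    "title": "Block investments and the race for corporate control in Belgium",
    "authors": ["Chapelle, Ariane"],
    "published_as": {
        "type": "article",
        "details": {"Year": "2004", "Journal": "Corporate Ownership & Control",
                    "Volume": "2", "Issue": "1"},
    },
    "file_url": "http://dipot.ulb.ac.be:8080/dspace/bitstream/2013/9943/1/ac-0007.pdf",
    "file_format": "application/pdf",
    "file_version": "authorVersion",
    "handle": "RePEc:ulb:ecoulb:2013/9943",
}
redif_x = to_redif_x(example)
```

The record is still typed as ReDIF-Paper; only the `X-` fields and `File-Version` carry the information that the attached file is an author version of a published article.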

LogEc

• Aim: track abstract views and download clicks of publications presented through RePEc services (EconPapers, IDEAS, ... Economists Online)
• NOT: tracking of usage at the level of the archives
  – Downloads of publications contained in RePEc archives that are initiated by a user coming from Google do not show up in LogEc
• How:
  – EO logs abstract views and download clicks of object files
  – On a monthly basis, EO transforms these log entries into the requested LogEc format, using “rstat.pl”

    2009-10 EconomistsOnline RePEc:aah:aarhec:1987-21 a: 65.55.207.69 66.235.124.10 d: 66.235.124.10

• RePEc handle of publication is necessary

EO partners delivering content to RePEc directly (and that EO therefore doesn’t harvest from RePEc but from the IR) must include the RePEc handle in the DIDL/MODS record
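The monthly transformation can be pictured as grouping log events by RePEc handle. The sketch below mirrors the sample line on this slide (month, service name, handle, then the IP addresses behind abstract views `a:` and downloads `d:`). It is not “rstat.pl”, and the exact layout LogEc expects has not been checked; the event tuple format is an assumption.

```python
from collections import defaultdict


def monthly_summary(month, service, events):
    """Group (handle, kind, ip) events per handle; kind is 'a' (abstract view) or 'd' (download)."""
    per_handle = defaultdict(lambda: {"a": [], "d": []})
    for handle, kind, ip in events:
        if kind not in ("a", "d"):
            raise ValueError(f"unknown event kind: {kind!r}")
        per_handle[handle][kind].append(ip)
    lines = []
    for handle in sorted(per_handle):
        hits = per_handle[handle]
        lines.append(
            f"{month} {service} {handle} a: {' '.join(hits['a'])} d: {' '.join(hits['d'])}"
        )
    return lines


events = [
    ("RePEc:aah:aarhec:1987-21", "a", "65.55.207.69"),
    ("RePEc:aah:aarhec:1987-21", "a", "66.235.124.10"),
    ("RePEc:aah:aarhec:1987-21", "d", "66.235.124.10"),
]
summary = monthly_summary("2009-10", "EconomistsOnline", events)
```

Events without a RePEc handle cannot be reported this way, which is why the handle has to be present in the DIDL/MODS record.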

LogEc

DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
    Item[0..∞] (of type objectFile)
    Item[0..1] (of type humanStartPage)
    Item[0..∞] (of type descriptiveMetadata): RePEc handle, byRef to RePEc (AMF metadata)
      Descriptor/modified

[Figure labels: EO; RePEc]

CitEc

• Aim: citation analysis for RePEc publications
• How:
  – Analyze text: extract and parse list of references from publications
  – References are checked for availability in RePEc
  – Cites:
    o references to other RePEc publications
    o Textual references
  – CitedBy
  – Co-citations
• EO publications (from our IRs) are pushed to RePEc and are therefore pulled through the CitEc processing
• EO has access to the resulting CitEc data, and presents this through the EO portal (not yet, will be in Feb 2010)
• RePEc handle of publication is necessary

EO partners delivering content to RePEc directly (and that EO therefore doesn’t harvest from RePEc but from the IR) must include the RePEc handle in the DIDL/MODS record

Theme 2: Economists Online and RePEc

» BO Group 1: Push/pull to/from RePEc
  » ReDIF-X data structure
  » Duplicates; different versions of identical publication

» BO Group 2: Publishing models
  » Advantages/disadvantages of RePEc publishing model as opposed to IR publishing model
  » Push the two models together? Do we need to foresee specific services in the gateway or portal to make these two live together in peace?

» BO Group 3: Future RePEc/EO services
  » What services should EO and RePEc jointly be looking at in the future in the interest of the economics researcher?