COMSODE tools: Pushing data to the open ecosystem

Jindřich Mynarz, EEA.sk

ELAG 2015, Stockholm, June 9, 2015

The gist of the talk

To save legacy library data and satisfy internal and external requirements on your data you need ETL.

“Libraries have to focus on making their data infrastructure more efficient if they want to keep up with the ever changing needs of their audience and invest in sustainable service development.” — Lukas Koster (source)

Building tools to publish & reuse open data

EU FP7 project (2013➝2015)

Project partners:

● University of Milano-Bicocca, Italy

● Charles University in Prague, Czech Republic

● EEA, Czech Republic and Slovakia

● ADDSEN, Slovakia

● Spinque, the Netherlands

● Ministry of Interior of the Slovak Republic

Legacy library data

Save the data?

● …or let it go?

● What’s the cost of recovering the legacy?

● To save legacy data you need automation ⇒ ETL

● Unfortunately, paraphrasing Tolstoy, “tidy datasets are all alike but every messy dataset is messy in its own way.” (source)

Confusion of tongues

● MARC used to be (or still is?) the lingua franca. What's next?

● Many data formats need to be supported

● MARC→Web impedance mismatch

● Export & import in systems integration

Open Data Node

“(Linked) open data plumbing”

● Open Data Node (ODN) is a platform for publishing (open) data & automating internal data flows that enables progressive enhancement of data.

● Main product of the COMSODE project

● Free, open source, modular, integrated (e.g., single sign-on)

Open Data Node networks

● Data replication (e.g., local copy of name authority file)

● Data synchronization (e.g., periodic harvesting of incremental updates via OAI-PMH)

● Data distribution (e.g., shared cataloguing)
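
The incremental harvesting mentioned above can be sketched in Python. This is a minimal illustration, not ODN code: the base URL `https://example.org/oai` and the helper names are hypothetical, while `verb`, `metadataPrefix`, `from`, and `resumptionToken` are standard OAI-PMH request arguments.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, since: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request for records changed since the given date."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,  # e.g. the date of the previous successful harvest
    })
    return f"{base_url}?{query}"


def resume_url(base_url: str, token: str) -> str:
    """Build the follow-up request that continues an incomplete list."""
    query = urlencode({"verb": "ListRecords", "resumptionToken": token})
    return f"{base_url}?{query}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    element = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()
```

A scheduled job would store the date of its last successful run, call `list_records_url` with it, and keep requesting `resume_url` until `next_token` returns `None`.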

Open Data Node workflow

1. Catalogue your internal data

2. Create a data processing pipeline for the datasets to be published

3. Schedule the pipeline to be run to publish the data

Internal catalogue

● Map out the data you have or external data you use; both open and closed.

● If data cannot be found, it is as if it did not exist, so make data discoverable and provide it with descriptive metadata (DCAT-AP).

● Based on CKAN.
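
To make the idea of descriptive metadata concrete, here is a small Python sketch that describes one dataset with DCAT terms, the vocabulary DCAT-AP builds on, serialized as JSON-LD. The dataset URI, dump URL, and helper name are hypothetical, and the record shows only a few properties rather than a complete DCAT-AP description.

```python
import json

CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, description, access_url, license_uri):
    """Return a JSON-LD dictionary describing one dataset with one distribution."""
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:accessURL": {"@id": access_url},
            "dct:license": {"@id": license_uri},
        },
    }


record = describe_dataset(
    uri="https://example.org/dataset/name-authority",  # hypothetical identifier
    title="Name authority file",
    description="Local copy of a name authority file.",
    access_url="https://example.org/dumps/name-authority.nt",  # hypothetical dump
    license_uri="https://creativecommons.org/publicdomain/zero/1.0/",
)
print(json.dumps(record, indent=2))
```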

UnifiedViews

● An extensible ETL tool with native RDF support for automating repetitive data exchange and transformation tasks.

● Allows you to define, execute, monitor, debug (examine intermediate data), schedule, and share (import/export) data transformations.

● Open source, dual-licensed to enable commercial extensions

Extract-Transform-Load pipeline

Data flow of an ETL process in UnifiedViews is defined as a pipeline composed of data processing units.

Data processing units

Extractors

● Download file

● Load from SQL database

● SPARQL endpoint extractor

Transformers

● Zip/unzip

● Find/replace

● Parse and serialize RDF

● SPARQL Update

● XSLT

● ISO 2709 to MARCXML

● SPARQL SELECT to CSV

Loaders

● Files upload

● Load to Virtuoso

● Load to SQL database

+ Quality Assessment
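
The way a pipeline chains an extractor, a transformer, and a loader can be sketched in plain Python. This is only an illustration of the pattern, not the UnifiedViews plugin API: the function names, the sample CSV, and the `records` table are assumptions, and an in-memory SQLite database stands in for the target SQL database.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded file; the double spaces are the "mess" to clean up.
SOURCE = "id,title\n1,Name  authority file\n2,Bibliographic  records\n"


def extract(text):
    """Extractor: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def find_replace(rows, field, old, new):
    """Transformer: replace a substring in one field of every row."""
    return [{**row, field: row[field].replace(old, new)} for row in rows]


def load(rows, connection):
    """Loader: write the rows into an SQL table."""
    connection.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, title TEXT)")
    connection.executemany("INSERT INTO records (id, title) VALUES (:id, :title)", rows)
    connection.commit()


def run_pipeline(text, connection):
    """Run the units in order, passing each unit's output to the next one."""
    rows = extract(text)
    rows = find_replace(rows, "title", "  ", " ")
    load(rows, connection)
    return rows


connection = sqlite3.connect(":memory:")
run_pipeline(SOURCE, connection)
print(connection.execute("SELECT title FROM records ORDER BY id").fetchall())
```

Keeping each unit a separate function mirrors the idea that intermediate data can be examined between steps when debugging.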

Public catalogue

● Public interface that enables users to discover & access your data.

● Links to data dumps, APIs (REST API, SPARQL endpoint), and applications based on the data.

● Provides metadata, such as licence, dataset maintainer’s contact, or last update date.

● Based on CKAN.
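
Because the catalogue is based on CKAN, the metadata listed above should also be reachable programmatically. The Python sketch below builds a request to CKAN's Action API (`package_show`) and picks out licence, maintainer contact, and last update. The site URL, dataset identifier, and sample response are made up for illustration; the field names follow CKAN's dataset schema as I understand it and should be checked against the deployed CKAN version.

```python
import json
from urllib.parse import urlencode


def package_show_url(site_url, dataset_id):
    """Build a CKAN Action API request for one dataset's metadata."""
    query = urlencode({"id": dataset_id})
    return f"{site_url.rstrip('/')}/api/3/action/package_show?{query}"


def summarize(response_text):
    """Pick licence, maintainer contact, and last update from a package_show response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    dataset = payload["result"]
    return {
        "licence": dataset.get("license_title"),
        "maintainer": dataset.get("maintainer"),
        "contact": dataset.get("maintainer_email"),
        "last_update": dataset.get("metadata_modified"),
    }


# Hand-written sample response, not output from a real catalogue.
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "name": "name-authority",
        "license_title": "Creative Commons CCZero",
        "maintainer": "Data team",
        "maintainer_email": "data@example.org",
        "metadata_modified": "2015-06-09T10:00:00",
    },
})

print(package_show_url("https://example.org/", "name-authority"))
print(summarize(SAMPLE))
```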

COMSODE methodology

● Guidelines on how to use ODN for those with little open data experience

● Defines phases, practices, roles, and artifacts.

● Phases:

a. Development of open data publication plan

b. Preparation of publication

c. Realization of publication

d. Archiving

http://opendatanode.org/product/methodology-for-od-publishing

Open Data Node in use

● Reality check

○ Eating our own dog food

○ Testing the ODN’s versatility

● 150 datasets transformed by COMSODE partners

● Supporting 10 pilot projects, including:

○ eDemokracia: Slovak nation-wide e-government project

○ Czech Trade Inspection Authority

○ Slovak Environment Agency

○ Slovak National Library

Slovak National Library

COMSODE pilot

Impact

● Improve your internal & external data flows.

● Libraries are required to publish data by the EU directive on the re-use of public sector information.

○ If you release MARC, is the cost of access to the data marginal?

● Insiders have access, yet outsiders often have more experience in building value on the data.

In conclusion

♫ The pipelines, the pipelines are calling... ♫

To save legacy library data and satisfy internal and external requirements on your data you need ETL.

http://opendatanode.org

Image credits from the Noun Project: Database by Dmitry Baranovskiy, Counter by Sergey Demushkin, Ventil by Sergey Demushkin, Spider Web by Denis, Scroll by EliRatus, Chest by Victor Escorsin, Pipes by Christopher T. Howlett, Adoption by Luis Prado, Plumber by Luis Prado, Filter by Muneer A.Safiah, Lock by Alex Auda Samora, Lego by Jon Trillana, Atom by Mister Pixel