Linked Data Experiences at Springer Nature

Linked Data Experiences at Springer Nature, Michele Pasin, Lead Data Architect, Knowledge Graph Team

Upload: michele-pasin

Post on 18-Jan-2017


TRANSCRIPT

Page 1: Linked Data Experiences at Springer Nature

Linked Data Experiences at Springer Nature

Michele Pasin, Lead Data Architect, Knowledge Graph Team

Page 2: Linked Data Experiences at Springer Nature

Linked Data Experiences at Springer Nature, Leipzig, 09/2016

Outline

• Who we are

• Why semantic technologies

• Our work so far

• The Scigraph project

• Looking ahead

Page 3: Linked Data Experiences at Springer Nature

Who We Are

Page 4: Linked Data Experiences at Springer Nature


Formed in May 2015 through the merger of Nature Publishing Group, Palgrave Macmillan, Macmillan Education and Springer Science+Business Media

Page 5: Linked Data Experiences at Springer Nature


13k employees in over 50 countries, EUR 1.5 billion turnover

Page 6: Linked Data Experiences at Springer Nature

[Pre-Merger] Springer Science+Business Media brands

Page 7: Linked Data Experiences at Springer Nature

[Pre-Merger] Macmillan Science & Education brands

Holtzbrinck Publishing Group

Page 8: Linked Data Experiences at Springer Nature

We publish a lot of science! (since 1815)

13M documents
7M articles, 4M chapters
4k journals, 700k books

Page 9: Linked Data Experiences at Springer Nature

… and generate a lot of traffic

11.5M monthly visitors (nature.com)

260M visits per year, 600M downloads per year (link.springer.com)

Page 10: Linked Data Experiences at Springer Nature
Page 11: Linked Data Experiences at Springer Nature

> Collaborative effort between Springer Nature and Digital Science

> Supporting internal use cases, but also contributing to an emerging web of linked science data

> Not just publications data but a wealth of other related information

Page 12: Linked Data Experiences at Springer Nature

Why Semantic Technologies

Page 13: Linked Data Experiences at Springer Nature

Why is Semantics Important to Us?

Challenges: Data Silos
● Data is fragmented
● Data gets duplicated
● Data is hardcoded into applications

Change Drivers
● Digital first workflow
● User-centric design
● Unified Springer Nature domain

Page 14: Linked Data Experiences at Springer Nature

For example: our sites are currently organised around articles, journals and issues…

Page 15: Linked Data Experiences at Springer Nature

However, scientists are interested in answering questions about real world things…

Page 16: Linked Data Experiences at Springer Nature

Search engines do not know we have content about these things…

1st hit from nature.com…

Not linked to/from..

Page 17: Linked Data Experiences at Springer Nature

Vision

Today: Content base (PDF, XML, ePub, HTML, TIFF). We publish science.

Tomorrow: Knowledge Graph. We manage knowledge.

Page 18: Linked Data Experiences at Springer Nature

The Knowledge Graph is about collecting information about objects in the real world

…so that we can do a better job of providing users with what they're looking for

Page 19: Linked Data Experiences at Springer Nature

Three areas of knowledge we care about

Diagram relation labels: reads / writes, is about, interested in

Page 20: Linked Data Experiences at Springer Nature

Diagram relation labels: Reads / Writes, Works for, Funds, Lead researcher in, Produces, Studies, Located at, In proceedings, Contains, Cites, Has learning resource, Attends, Has topic

Page 21: Linked Data Experiences at Springer Nature

Opportunities: Tools & Services Along the Publishing Life Cycle

Stages: Research / Manuscript Creation, Manuscript Submission, Peer Review / Proposal Stage, Planning, Production, Publication, Distribution / Sales, Discovery

Roles: Researcher / Author, Editorial / Publisher, Reviewer

Page 22: Linked Data Experiences at Springer Nature

Our Work So Far

Page 23: Linked Data Experiences at Springer Nature

Our Work So Far

Timeline: 2012, 2013, 2014, 2015, 2016

Projects shown on the timeline:
● NPG Linked Data Platform
● Nature Ontologies Portal
● Springer Materials
● Springer Conferences
● Scigraph
● Content Hub
● Scigraph prototype
● Nero Project
● Linnaeus Project
● Springer Protocols
● CURI Semantic Annotation Project

Page 24: Linked Data Experiences at Springer Nature

NPG Linked Data Platform (2012)

Deliverables (2012–2014)
● Prototype for external use
● SPARQL query service
● Two RDF dataset releases in 2012
– April 2012 (22m triples)
– July 2012 (270m triples)
● Live updates to query endpoint

Led to (2014–)
● Focus on internal use-cases
● Publish ontology pages
● Periodic data snapshots

Page 25: Linked Data Experiences at Springer Nature

NPG Content Hub (2014): Hybrid Architecture

Features
● Hybrid RDF + XML architecture
– MarkLogic for XML, RDF/XML
– Triplestore (TDB) for RDF validation
● Repos for binary assets

Layout
● Semantic RDF/XML includes in XML
● RDF objects serialized in list order
● Application XML for subject hierarchy

Indexes
● Indexes over all elements
● Range indexes for datatypes (e.g. dates)

Page 26: Linked Data Experiences at Springer Nature

Subject Pages (2014)

Page 27: Linked Data Experiences at Springer Nature

NPG Ontologies Portal (2015): Data Publishing

Page 28: Linked Data Experiences at Springer Nature

Springer Materials (2014)

Page 29: Linked Data Experiences at Springer Nature

Springer Conferences Portal (2015)

Page 30: Linked Data Experiences at Springer Nature

Scigraph Project (2016): main objectives

Data Integration

> Consolidation of existing LD efforts via a single domain model

> Ingestion and normalisation of third party datasets

Discoverability

> Better end user applications [B2C]

> Metadata delivery & validation [B2B]

> Data publishing [B2developers]

Page 31: Linked Data Experiences at Springer Nature

Scigraph

what's in it > data architecture, taxonomies, ontologies

how it works > ETL, naming, validation, identity

Page 32: Linked Data Experiences at Springer Nature

Data Landscape

● Citations / References: 160M
● Articles: 7M
● Chapters: 3.6M
● Journals: 4K
● Books: 700k
● Subjects: 4K
● Article Types
● Grants: 2M
● Organizations: 60K
● Conferences: 10K
● Funders
● Publishers
● Universities
● Scigraph Core
● Persons: 1M
● Relations
● Publish states
● Vocabularies

Page 33: Linked Data Experiences at Springer Nature

Ontologies and Taxonomies: overview

Complexity (ontological depth), from lowest to highest:
● a glossary: a controlled vocabulary with NL definitions (e.g. lexicon)
● a taxonomy: a c.v. that captures broaderThan / narrowerThan relationships
● a thesaurus: taxonomy plus related terms; captures synonymy, homonymy etc.
● a DB/OO scheme: relational model, unconstrained use of arbitrary relations
● an axiomatized theory: arbitrary relations plus axioms, constraints and rules expressed in a logical language

Scigraph examples placed on the scale:
● Publishers, Relations, Publish-states (glossary)
● Subjects, Article Types (taxonomy)
● Scigraph Core ontology (towards the formal end of the scale)

Page 34: Linked Data Experiences at Springer Nature

The Core Ontology

- Language: OWL 2, Profile: ALCHI(D)
- Entities: ~73 classes, ~250 properties
- Principles: Incremental Formalization / Enterprise Integration / Model Coherence

http://www.nature.com/ontologies/core/

Page 35: Linked Data Experiences at Springer Nature

The Core Ontology: mappings

Core classes: :Thing, :Asset, :Publication, :Concept, :Event, :Subject, :Type, :Agent, :ArticleType, :PublishingEvent, :AggregationEvent, :Component, :Document, :Serial

External classes linked to core classes (grouped as on the slide; "=" in the diagram stands for owl:equivalentClass):
● cidoc-crm:Information_Carrier
● cidoc-crm:Conceptual_Object
● dbpedia:Agent, dc:Agent, dcterms:Agent, cidoc-crm:Agent, vcard:Agent, foaf:Agent
● event:Event, bibo:Event, schema:Event, cidoc-crm:TemporalEntity
● cidoc-crm:Type, vcard:Type
● fabio:SubjectTerm
● bibo:Document, cidoc-crm:Document, foaf:Document
● bibo:Periodical, fabio:Periodical, schema:Periodical
● bibo:DocumentPart
● fabio:Expression, cidoc-crm:InformationObject

Page 36: Linked Data Experiences at Springer Nature

SKOS taxonomies: PoolParty integration

Page 37: Linked Data Experiences at Springer Nature

SKOS taxonomies: Subjects

- Structure: SKOS, ~2500 concepts, multi hierarchical tree, 6 branches, 7 levels of depth
- Mappings: 100% of terms, using skos:broadMatch or skos:closeMatch (DBpedia and MeSH)
- Document tagging: mostly manual, different workflows, often costly and inconsistent
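To make "multi hierarchical tree" and "levels of depth" concrete, here is a minimal Python sketch of a subject hierarchy in which a concept may have more than one broader concept. The concept names are hypothetical and do not come from the actual Subjects taxonomy; the dictionary stands in for skos:broader links.

```python
# Hypothetical concepts; each maps to its broader concepts (skos:broader).
# A concept with two parents is what makes the tree "multi hierarchical".
BROADER = {
    "biological-sciences": [],
    "physical-sciences": [],
    "biophysics": ["biological-sciences", "physical-sciences"],
    "molecular-biophysics": ["biophysics"],
}


def depth(concept, broader=BROADER):
    """Level of a concept, counting top concepts as level 1 (longest path)."""
    parents = broader[concept]
    if not parents:
        return 1
    return 1 + max(depth(parent, broader) for parent in parents)


def top_concepts(concept, broader=BROADER):
    """All branches (top-level concepts) that a concept belongs to."""
    parents = broader[concept]
    if not parents:
        return {concept}
    return set().union(*(top_concepts(parent, broader) for parent in parents))
```

In this toy data, `molecular-biophysics` sits at level 3 and belongs to two branches at once, which is the situation a single-parent tree cannot express.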

Page 38: Linked Data Experiences at Springer Nature

Semi-Automatic tagging with Dimensions (from Uber Research)

Page 39: Linked Data Experiences at Springer Nature

Scigraph

what's in it > data architecture, taxonomies, ontologies

how it works > ETL, naming, validation, identity

Page 40: Linked Data Experiences at Springer Nature

Naming Architecture: federated model

> Dereference and 303 redirects:
- http://name.scigraph.com/{things}/
- http://data.scigraph.com/{things}/

> Two patterns: schemas and instances
- http://name.scigraph.com/ontologies/{domain}/
- http://name.scigraph.com/{domain}/{things}/

> Prefixes for schemas and instances
- @prefix sg: <http://name.scigraph.com/ontologies/core/> .

> Entity names follow a robust convention
- camel-case for naming terms, with an initial uppercase for classes and an initial lowercase for properties.

> Named graphs used to track provenance
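The naming rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Scigraph implementation: it assumes the 303 redirect target keeps the same path on the data host (the slide only shows the two URL patterns side by side), and the `journals/123/` path used in the usage note is a made-up example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

NAME_HOST = "name.scigraph.com"
DATA_HOST = "data.scigraph.com"

# Camel-case terms: initial uppercase for classes, initial lowercase for properties.
CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")
PROPERTY_NAME = re.compile(r"^[a-z][A-Za-z0-9]*$")


def redirect_target(name_uri: str) -> str:
    """Data URL that a 303 redirect from a name URI would point to.

    Assumption: the path is mirrored unchanged on the data host.
    """
    parts = urlsplit(name_uri)
    if parts.netloc != NAME_HOST:
        raise ValueError(f"not a name URI: {name_uri}")
    return urlunsplit(
        (parts.scheme, DATA_HOST, parts.path, parts.query, parts.fragment)
    )


def is_class_name(term: str) -> bool:
    """True if the term follows the class naming convention."""
    return bool(CLASS_NAME.match(term))


def is_property_name(term: str) -> bool:
    """True if the term follows the property naming convention."""
    return bool(PROPERTY_NAME.match(term))
```

For example, `redirect_target("http://name.scigraph.com/journals/123/")` yields the same path under `data.scigraph.com`, and `is_property_name("hasKeyProperty")` accepts a term that `is_class_name` rejects.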

Page 41: Linked Data Experiences at Springer Nature

Scigraph - Data Flow

● data sources: Peer Review, DDS, Core Media, UNSILO, TARGET, Uber Research, DBPedia etc..
● integration layer: JSON-LD API, DDS Adapter, TTL Loader, RDF Loader ..
  data is normalised and mapped to SN ontologies
● KNOWLEDGE GRAPH
● real time services: Peer Review Service, Search Service (Content Hub)
  data is extracted and denormalised so as to support applications
● applications: Peer Review, Oscar, Search
  data is delivered to applications via fast APIs

Page 42: Linked Data Experiences at Springer Nature

ETL Architecture: main features [in evolution]

Tech stack
> Airflow framework (Airbnb)
> Amazon S3 to make backups
> GraphDB triplestore (staging and presentation)
> Elasticsearch and APIs

Components & Principles
> Graph must be ‘ephemeral’
> Data sources versioning algorithm
> Identity Persistence service
> Validation via SHACL (TopBraid API)

Page 43: Linked Data Experiences at Springer Nature

ETL Architecture

Sources: Persons (zip), Articles (DB), Publishers (Dataset), Books (API)

Source formats: XML, RDF, JSON, CSV

Pipeline: Sources → Data Store (Amazon S3) → Data Staging (Triplestore) → Data Presentation (Triplestore) → Linked Data Browser, Analytics, Reporting, APIs

Processing steps
✴ Extraction
✴ Validation
✴ Identity Persistence
✴ Updating / Replacing named graphs
✴ Versioning service (md5 checksum, timestamps, origin version, etc...)
✴ Integration (union graph)
✴ Inference

Named Graphs
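As a rough illustration of the versioning step, the sketch below records an md5 checksum, a timestamp and the origin version for each source payload, then uses the checksum to decide whether a source needs reloading. The function names and record fields are hypothetical; the slide names the ingredients but not how they are combined.

```python
import hashlib
from datetime import datetime, timezone


def version_record(source_name: str, payload: bytes, origin_version: str) -> dict:
    """Describe one snapshot of a data source (hypothetical record layout)."""
    return {
        "source": source_name,
        "md5": hashlib.md5(payload).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "origin_version": origin_version,
    }


def needs_reload(previous, current) -> bool:
    """Reload when the source was never seen or its content checksum changed."""
    return previous is None or previous["md5"] != current["md5"]
```

Comparing checksums rather than timestamps means a source that is re-delivered unchanged does not trigger a rebuild of its named graph.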

Page 44: Linked Data Experiences at Springer Nature

Identity Persistence

Components: RDF Extractor, Identity Persistence Module, Identity Registry

● ingest #1: J1 (xml), id: 1, DOI: 123, issn: ABC
● ingest #2: J2 (xml), id: 2, issn: ABC
● ingest #3: J1 (xml), id: 1, DOI: 123, issn: ABC

Identifier in the registry: journals:76as67fda76sd67a

sgo:core Ontology
sg:Journal a owl:Class ;
    sg:hasKeyProperty sg:doi ;
    sg:hasKeyProperty sg:issn ;
    sg:hasKeyProperty sg:eissn ....
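One way to read the diagram is that records sharing the value of any key property resolve to the same persistent identifier across ingests. The Python sketch below shows that idea with an in-memory registry. It is an assumption-laden illustration: the class, the method names and the use of random UUIDs are hypothetical, and the real service may match and mint identifiers differently.

```python
import uuid

# Key properties per class, mirroring sg:hasKeyProperty in the ontology snippet.
KEY_PROPERTIES = {"Journal": ["doi", "issn", "eissn"]}


class IdentityRegistry:
    """In-memory stand-in for the Identity Registry (hypothetical design)."""

    def __init__(self):
        # (class name, key property, value) -> persistent identifier
        self._index = {}

    def resolve(self, class_name: str, record: dict) -> str:
        """Return the persistent identifier for a record, minting one if new."""
        keys = [
            (class_name, prop, record[prop])
            for prop in KEY_PROPERTIES[class_name]
            if record.get(prop)
        ]
        if not keys:
            raise ValueError("record has no key property value")
        # Reuse the identifier of the first key that is already registered.
        identifier = next(
            (self._index[key] for key in keys if key in self._index), None
        )
        if identifier is None:
            identifier = uuid.uuid4().hex
        # Register any key values not seen before under the same identifier.
        for key in keys:
            self._index.setdefault(key, identifier)
        return identifier
```

Resolving the three ingests from the slide in order (`{"doi": "123", "issn": "ABC"}`, then `{"issn": "ABC"}`, then the first record again) returns one identifier each time. The sketch does not merge two identities that were minted separately and later turn out to share a key.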

Page 45: Linked Data Experiences at Springer Nature

Data Validation: from SPIN to SHACL

> SPIN SPARQL syntax (2011, TopQuadrant)

> Example: “if a Journal instance has no short title, raise an Exception”

> Main drawback: hard to maintain, and hard for non-specialists to read

Page 46: Linked Data Experiences at Springer Nature

Data Validation: from SPIN to SHACL

> SHACL - Shapes Constraint Language (2016, TopQuadrant)

> Example: “all article instances should have a valid DOI”

> Example: “all grant instances should have at most one start year and one end year”

> Approach: polish data before entering the triplestore, use triplestore inference primarily for integration
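The example rules quoted on the validation slides can be written down as plain checks that run before data reaches the triplestore. The sketch below uses ordinary Python instead of SPIN or SHACL, purely to show the shape of the constraints. The property names (`shortTitle`, `startYear`, `endYear`) and the DOI pattern are assumptions, not taken from the Scigraph shapes.

```python
import re

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate(entity: dict) -> list:
    """Return a list of constraint violations for one entity (empty if valid)."""
    errors = []
    kind = entity.get("type")
    if kind == "Journal" and not entity.get("shortTitle"):
        errors.append("Journal has no short title")
    if kind == "Article" and not DOI_PATTERN.match(entity.get("doi") or ""):
        errors.append("Article has no valid DOI")
    if kind == "Grant":
        # Multi-valued properties are modelled as lists in this sketch.
        for prop in ("startYear", "endYear"):
            if len(entity.get(prop, [])) > 1:
                errors.append(f"Grant has more than one {prop}")
    return errors
```

Calling `validate` on each extracted entity and rejecting those with a non-empty result matches the "polish data before entering the triplestore" approach; a declarative shapes language expresses the same rules as data rather than code.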

Page 47: Linked Data Experiences at Springer Nature

Next Steps

Page 48: Linked Data Experiences at Springer Nature

Looking Ahead

Summary
● Scigraph is our latest LD platform - public version live in late 2016
● SW tech allows for scalable enterprise-level metadata management
● It is crucial to distinguish between data integration and (real time) data delivery
● Still a work in progress… suggestions or feedback very welcome!

Ongoing Work
● Ontology: federated model, more advanced inferencing capabilities
● Build internal/external APIs (JSON-LD), also integrating NoSQL
● Tools for analytics, reporting, visualisation, interactive exploration of the graph
● Entity extraction: scientific entities, places, people, events etc..
● We’re looking to collaborate… Crossref, W3C, building a Linked Science Web

Page 49: Linked Data Experiences at Springer Nature

Future: a scientific article X-ray?

Page 50: Linked Data Experiences at Springer Nature

The Knowledge Graph team

CORE TEAM
* Markus Kaindl: Product Owner
* Ben Kirkley: Project Manager
* Michele Pasin: Lead Data Architect
* Tony Hammond: Data Architect
* Matias Piipari: Lead Engineer
* Hilverd Reker: Software Engineer
* Artur Konczak: Software Engineer
* <blankNode>: Data Scientist
* <blankNode>: Data Engineer

DIGITAL SCIENCE
* Martin Szomszor: Data Scientist
* Richard Koks: Data Scientist
* Mario Diwersy: CTO, Uber Research

PROGRAM SPONSOR
* Henning Schoenenberger: Director Data & Metadata

Page 51: Linked Data Experiences at Springer Nature


[email protected]