Enterprise Knowledge Graphs
Sören Auer
https://www.eccenca.com
Sören Auer 2
The three Big Data "V"s: Variety is often neglected
Source: Gesellschaft für Informatik
Linked Data Principles
Addressing the neglected third V (Variety)
1. Use URIs to identify the “things” in your data
2. Use http:// URIs so people (and machines) can look them up on the web
3. When a URI is looked up, return a description of the thing (in RDF format)
4. Include links to related things
http://www.w3.org/DesignIssues/LinkedData.html
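Principles 2 and 3 can be illustrated with a small sketch of a URI lookup that asks for an RDF description. This is only an illustration: the helper name build_lookup and the choice of text/turtle as media type are assumptions, and whether a given server honours the Accept header depends on that server. The request is prepared but not sent.

```python
from urllib.request import Request


def build_lookup(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET asking for an RDF description of `uri`.

    `build_lookup` is a hypothetical helper; the Accept header is how a client
    states which representation of the "thing" it would like to receive.
    """
    return Request(uri, headers={"Accept": media_type}, method="GET")


# The URI identifies the thing (principle 1) and is an http:// URI (principle 2).
request = build_lookup("http://dbpedia.org/resource/Bonn")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual lookup.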
[1] Auer, Lehmann, Ngomo, Zaveri: Introduction to Linked Data and Its Lifecycle on the Web. Reasoning Web 2013
Linked (Open) Data: The RDF Data Model
RDF = Resource Description Framework
[Figure: example graph. DHL (full name: DHL International GmbH) has industry Logistics (labels: Logistik, 物流) and headquarters Post Tower; Post Tower is located in Bonn and has height 162.5 m.]
RDF Data Model (a bit more technical)
– Graph consists of:
  • Resources (identified via URIs)
  • Literals: data values with data type (URI) or language (multilinguality integrated)
  • Attributes of resources are also URI-identified (from vocabularies)
– Various data sources and vocabularies can be arbitrarily mixed and meshed
– URIs can be shortened with namespace prefixes; e.g. dbp: → http://dbpedia.org/resource/
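[Figure: the example graph with URIs and typed literals, read as triples]
dbp:DHL_International_GmbH foaf:name "DHL International GmbH"^^xsd:string
dbp:DHL_International_GmbH dbo:industry dbp:Logistics
dbp:DHL_International_GmbH ex:headquarters dbp:Post_Tower
dbp:Logistics rdfs:label "Logistik"@de
dbp:Logistics rdfs:label "物流"@zh
dbp:Post_Tower gn:locatedIn dbp:Bonn
dbp:Post_Tower ex:height _:h
_:h rdf:value "162.5"^^xsd:decimal
_:h ex:unit unit:Meter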
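The data model above can be sketched in a few lines of Python: triples are plain tuples, literals carry an optional datatype or language tag, and prefixed names expand to full URIs. This is a minimal sketch and not an RDF library; the Literal type, the expand and labels helpers, and the ex: namespace URI are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """A data value with an optional datatype URI or language tag."""
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/vocab#",  # hypothetical namespace for illustration
}


def expand(name: str) -> str:
    """Expand a prefixed name such as dbp:Bonn into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {name!r}")
    return PREFIXES[prefix] + local


# A graph is simply a collection of (subject, predicate, object) triples.
TRIPLES = [
    (expand("dbp:DHL_International_GmbH"), expand("ex:headquarters"), expand("dbp:Post_Tower")),
    (expand("dbp:Logistics"), expand("rdfs:label"), Literal("Logistik", lang="de")),
    (expand("dbp:Logistics"), expand("rdfs:label"), Literal("物流", lang="zh")),
]


def labels(subject: str, lang: str) -> list:
    """Return the rdfs:label values of `subject` in the given language."""
    label = expand("rdfs:label")
    return [o.value for s, p, o in TRIPLES
            if s == subject and p == label and isinstance(o, Literal) and o.lang == lang]


print(labels(expand("dbp:Logistics"), "de"))
```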
RDF mediates between different Data Models & bridges between Conceptual and Operational Layers
Tabular/Relational Data
Id    Title    Screen
5624  SmartTV  104cm
5627  Tablet   21cm

Prod:5624 rdf:type Electronics
Prod:5624 rdfs:label "SmartTV"
Prod:5624 hasScreenSize "104"^^unit:cm
...

Taxonomic/Tree Data
[Figure: tree with the nodes Electronics and Vehicle; Car, Bus and Truck below Vehicle]

Vehicle rdf:type owl:Thing
Car rdfs:subClassOf Vehicle
Bus rdfs:subClassOf Vehicle
...

Logical Axioms / Schema
Male rdfs:subClassOf Human
Female rdfs:subClassOf Human
Male owl:disjointWith Female
...
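The tabular case above can be sketched as a small conversion routine: each row becomes a subject, each column a predicate. This is a minimal sketch under stated assumptions; the function name rows_to_triples, the CSV rendering of the table and the string encoding of triples are choices made for illustration, not a mapping standard.

```python
import csv
import io

# The product table from the slide, written as CSV for this sketch.
TABLE = """Id,Title,Screen
5624,SmartTV,104cm
5627,Tablet,21cm
"""


def rows_to_triples(text: str, class_name: str = "Electronics") -> list:
    """Turn each table row into three triples, following the slide's example."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "Prod:" + row["Id"]
        screen = row["Screen"]
        # Split "104cm" into the number and the unit used as datatype.
        size = screen[:-2] if screen.endswith("cm") else screen
        triples.append((subject, "rdf:type", class_name))
        triples.append((subject, "rdfs:label", '"' + row["Title"] + '"'))
        triples.append((subject, "hasScreenSize", '"' + size + '"^^unit:cm'))
    return triples


for triple in rows_to_triples(TABLE):
    print(" ".join(triple))
```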
© Fraunhofer · Seite 7
Vocabulary Example
(two views, Visual Representation and RDF Representation, of Schema and Instantiation)

Visual Representation: Schema
Class: Company
Property     Expected type
inIndustry   Industry
fullName     String
headquarter  Building

Class: Building
Property     Expected type
locatedIn    Industry
height       unit:meter

Visual Representation: Instantiation
[Figure: example graph. DHL (full name: DHL International GmbH) has industry Logistics (labels: Logistik, 物流) and headquarters Post Tower; Post Tower is located in Bonn and has height 162.5 m.]

RDF Representation: Schema
Company rdf:type rdfs:Class
Building rdf:type rdfs:Class
inIndustry rdf:type rdf:Property
inIndustry rdfs:domain Company
inIndustry rdfs:range Industry
headquarter rdf:type rdf:Property
headquarter rdfs:domain Company
headquarter rdfs:range Building

RDF Representation: Instantiation
DHL rdf:type Company
DHL fullName "DHL Int. GmbH"
DHL inIndustry Logistics
DHL headquarter PostTower
PostTower rdf:type Building
PostTower locatedIn dbpedia:Bonn
PostTower height "162.5"^^meter
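How schema and instance triples relate can be sketched by checking the instance data against the declared domains and ranges. Note the assumptions: RDFS itself uses rdfs:domain and rdfs:range to infer types rather than to reject data, so treating them as constraints here is closer in spirit to RDF Data Shapes; the type of Logistics is added for illustration and is not on the slide; the function name violations is hypothetical.

```python
# property -> (rdfs:domain, rdfs:range), taken from the schema triples above
SCHEMA = {
    "inIndustry": ("Company", "Industry"),
    "headquarter": ("Company", "Building"),
}

# rdf:type statements; "Logistics": "Industry" is an assumption for this sketch
TYPES = {"DHL": "Company", "PostTower": "Building", "Logistics": "Industry"}

FACTS = [
    ("DHL", "inIndustry", "Logistics"),
    ("DHL", "headquarter", "PostTower"),
]


def violations(facts, schema, types):
    """List facts whose subject or object type differs from the declared domain or range."""
    problems = []
    for s, p, o in facts:
        if p not in schema:
            continue  # no declaration, nothing to check
        domain, range_ = schema[p]
        if types.get(s) != domain:
            problems.append(f"{s} {p} {o}: subject is not a {domain}")
        if types.get(o) != range_:
            problems.append(f"{s} {p} {o}: object is not a {range_}")
    return problems


print(violations(FACTS, SCHEMA, TYPES))
print(violations([("PostTower", "headquarter", "DHL")], SCHEMA, TYPES))
```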
Semantic Web Layer Cake 2001
http://www.w3.org/2001/10/03-sww-1/slide7-0.html
• Monolithic, based on XML
• Focus on heavyweight semantics (ontologies, logic, reasoning)
The Semantic Web Layer Cake 2015 – Bridging between Big & Smart Data
[Figure: layer cake with the following layers]
Unicode, URIs
XML, JSON, CSV, RDB, HTML
RDF
RDF/XML, JSON-LD, CSV2RDF, R2RML, RDFa
RDF Data Shapes
RDF-Schema
Vocabularies
Ontologies, SKOS Thesauri
Logic, SWRL Rules
SPARQL
Vertical layer: (Access control), Signature, Encryption (HTTPS/CERT/DANE), …
• Lingua Franca of Data integration with many technology interfaces (XML, HTML, JSON, CSV, RDB,…)
• Focus on lightweight vocabularies, rules, thesauri etc.
• Less “invasive”
RDF - the Lingua Franca of Data Integration
• RDF is simple
• We can easily encode and combine all kinds of data models (relational, taxonomic, graphs, object-oriented, …)
• RDF supports distributed data and schema
• We can seamlessly evolve simple semantic representations (vocabularies) to more complex ones (e.g. ontologies)
• Small representational units (URI/IRIs, triples) facilitate mixing and mashing
• RDF can be viewed from many perspectives: facts, graphs, ER, logical axioms, objects
• RDF integrates well with other formalisms: HTML (RDFa), XML (RDF/XML), JSON (JSON-LD), CSV, …
• Linking and referencing between different knowledge bases, systems and platforms facilitates the creation of sustainable data ecosystems
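Why small representational units make mixing easy can be shown with a short sketch: if two sources describe the same URI, integrating them is a set union of their triples. The source names, the helper functions and the shortened literal are assumptions for illustration only.

```python
# Two hypothetical sources that happen to describe the same resource.
SOURCE_A = {
    ("dbp:Post_Tower", "gn:locatedIn", "dbp:Bonn"),
}
SOURCE_B = {
    ("dbp:Post_Tower", "ex:height", "162.5"),
    ("dbp:Post_Tower", "gn:locatedIn", "dbp:Bonn"),  # overlaps with source A
}


def merge(*graphs):
    """Combine graphs by set union; identical triples collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return all (predicate, object) pairs known about a subject, sorted."""
    return sorted((p, o) for s, p, o in graph if s == subject)


combined = merge(SOURCE_A, SOURCE_B)
print(describe(combined, "dbp:Post_Tower"))
```

No schema alignment step is needed before the union; shared URIs are what connect the two sources.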
Successful application domains: Linked Data & Semantic Integration
Search Engine Optimization & Web-Commerce
• Schema.org used by >20% of Web sites
• Major search engines exploit semantic descriptions
Pharma, Life Sciences
• Mature, comprehensive vocabularies and ontologies
• Billions of disease, drug, clinical trial descriptions
Digital Libraries
• Many established vocabularies (DublinCore, FRBR, EDM)
• Millions of items aggregated from thousands of memory institutions in Europeana, German Digital Library
© Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme
IAIS
The Web evolves into a Web of Data
Linked Open Data
Facebook Open Graph
Knowledge Graphs – A definition
• Fabric of concept, class, property, relationships, entity descriptions
• Uses a knowledge representation formalism (typically RDF, RDF-Schema, OWL)
• Holistic knowledge (multi-domain, source, granularity):
  • instance data (ground truth),
  • open (e.g. DBpedia, WikiData), private (e.g. supply chain data), closed data (product models),
  • derived, aggregated data,
  • schema data (vocabularies, ontologies)
  • meta-data (e.g. provenance, versioning, documentation, licensing)
  • comprehensive taxonomies to categorize entities
  • links between internal and external data
  • mappings to data stored in other systems and databases
Knowledge Graph Challenges & Opportunities
Knowledge graphs typically cover
• Multiple domains
• Various levels of granularity
• Data from multiple sources
• Various degrees of structure

Challenges
• Quality
• Coherence
• Co-evolution
• Update propagation
• Curation & interaction

Opportunities
• Background knowledge for various applications (e.g. question answering, data integration, machine learning)
• Facilitate intra-organizational data sharing and exchange (data value chains)
Comparison of various enterprise data integration paradigms

Paradigm          | Data model | Integr. strategy | Conceptual/operational | Heterogeneous data | Intern./extern. data | No. of sources | Type of integr. | Domain coverage | Semantic repres.
XML Schema        | DOM trees  | LaV              | operational            | [?]                | [?]                  | medium         | both            | medium          | high
Data Warehouse    | relational | GaV              | operational            | -                  | partially            | medium         | physical        | small           | medium
Data Lake         | various    | LaV              | operational            | [?]                | [?]                  | large          | physical        | high            | medium
MDM               | UML        | GaV              | conceptual             | -                  | -                    | small          | physical        | small           | medium
PIM / PCS         | trees      | GaV              | operational            | partially          | partially            | -              | physical        | medium          | medium
Enterprise search | document   | -                | operational            | [?]                | partially            | large          | virtual         | high            | low
EKG               | RDF        | LaV              | both                   | [?]                | [?]                  | medium         | both            | high            | very high

[?] = cell content not preserved in the transcript; in the Enterprise search row, "partially" may belong to either of the two adjacent columns.
[1] Michael Galkin, Sören Auer, Simon Scerri: Enterprise Knowledge Graphs: A Survey. Submitted to 37th International Conference on Information Systems. 2016.
Knowledge Graph Technology
Adding a Semantic Layer to Data Lakes
Departments: Management, Accounting, Marketing, Sales, Support, R&D

Semantic Data Lake
• central place for model, schema and data historization
• combination of scale-out (cost reduction) and semantics (increased control & flexibility)
• grows incrementally (pay-as-you-go)

[Figure: architecture with the following parts]
Inbound
• Data Sources
• Inbound Raw Data Store
Data Lake (order of magnitude cheaper scalable data store)
Knowledge Graph for Relationship Definition and Meta Data
JSON-LD, CSVW, R2RML, XML2RDF
Outbound and Consumption
• Frontend to Access Relationship and KPI Definition / Documentation
• Frontend to Access (ad hoc) Reports
• Outbound Data Delivery to Target Systems
© eccenca.com See also https://www.eccenca.com/en/products-corporate-memory.html
W3C R2RML – Relational to RDF Mapping
R2RML: RDB to RDF Mapping Language, W3C Recommendation 27 September 2012
Editors: Souripriya Das, Seema Sundara, Richard Cyganiak
http://www.w3.org/TR/r2rml/
Example R2RML Mapping
1. Either the resulting RDF knowledge base is materialized in a triple store and
2. subsequently queried using SPARQL,
3. or the materialization step is avoided by dynamically mapping an input SPARQL query into a corresponding SQL query, which renders exactly the same results as the SPARQL query being executed against the materialized RDF dump.
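The two options can be contrasted in a small sketch using an in-memory SQLite database. The mapping dictionary below is a hypothetical stand-in written in the spirit of R2RML; it is not R2RML syntax, and the rewriting covers only a single triple pattern of the form ?s <predicate> ?o rather than full SPARQL. Table, column and URI names are assumptions.

```python
import sqlite3

# Hypothetical mapping: one table, a subject URI template, one predicate per column.
MAPPING = {
    "table": "product",
    "subject": "http://example.org/product/{id}",
    "columns": {
        "title": "http://www.w3.org/2000/01/rdf-schema#label",
        "screen": "http://example.org/vocab#screenSize",
    },
}


def make_db():
    """Create a tiny relational source to map from."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT, screen TEXT)")
    db.executemany("INSERT INTO product VALUES (?, ?, ?)",
                   [(5624, "SmartTV", "104cm"), (5627, "Tablet", "21cm")])
    return db


def materialize(db, mapping):
    """Option 1: dump every row as (subject, predicate, object) triples."""
    cols = list(mapping["columns"])
    sql = f"SELECT id, {', '.join(cols)} FROM {mapping['table']} ORDER BY id"
    triples = []
    for row in db.execute(sql):
        subject = mapping["subject"].format(id=row[0])
        for col, value in zip(cols, row[1:]):
            triples.append((subject, mapping["columns"][col], value))
    return triples


def query_virtual(db, mapping, predicate):
    """Option 2: answer the pattern `?s <predicate> ?o` by rewriting it to one SQL query."""
    col = next(c for c, p in mapping["columns"].items() if p == predicate)
    sql = f"SELECT id, {col} FROM {mapping['table']} ORDER BY id"
    return [(mapping["subject"].format(id=i), predicate, v) for i, v in db.execute(sql)]


db = make_db()
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
from_dump = [t for t in materialize(db, MAPPING) if t[1] == LABEL]
print(from_dump == query_virtual(db, MAPPING, LABEL))
```

Both paths return the same triples for the pattern; the virtual path never stores them.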
SPARQLMap – Mapping RDB 2 RDF
Example: Sparqlify
• Rationale: Exploit existing formalisms (SQL, SPARQL Construct) as much as possible
• Flexible & versatile mapping language
• Translates one SPARQL query into exactly one efficiently executable SQL query
• Solid theoretical formalization based on SPARQL-relational algebra transformations
• Extremely scalable through an elaborate view candidate selection mechanism
• Used to publish 20B triples for LinkedGeoData
[1] Stadler, Unbehauen, Auer, Lehmann: Sparqlify – Very Large Scale Linked Data Publication from Relational Databases.
[2] Unbehauen, Stadler, Auer: Optimizing SPARQL-to-SQL Rewriting. iiWAS 2013
[3] Auer, et al.: Triplify: light-weight linked data publication from relational databases. WWW 2009
[Figure labels: SPARQL Construct, Bridge, SQL View]
Semantified Big Data Architecture Blueprint
[1] Mami, Scerri, Auer, Vidal: Towards the Semantification of Big Data Technology. DEXA 2016
[Figure: pipeline stages] Data sources, Ingestion, Storage, Queries
Semantic Lifting with Mappings
Storing of semantic and semantified data in Apache Parquet files on HDFS
SEBIDA Implementation Architecture
SEBIDA Evaluation Results
• Loads data faster
• Has quite different query performance characteristics: faster in 5 out of 12 queries, similar performance in 2, slower in 5
VOCOL: COLLABORATIVE VOCABULARY CURATION ENVIRONMENT
Comprehensive Support for Evolving Vocabularies
Industry 4.0: Semantic Models as Bridge between Shop & Office Floor
Semantic Administrative Shell & Reference Architecture for Industry 4.0 (RAMI4.0)
• The Administrative Shell (Verwaltungsschale) provides a digital identity for arbitrary Industry 4.0 components (e.g. sensors, actuators/robots), exposing data covering the whole life-cycle
• The Reference Architecture for Industry 4.0 (RAMI4.0) provides a conceptual framework for implementing comprehensive Industry 4.0 scenarios
• We have implemented both concepts, along with a number of IEC and ISO standards, in a comprehensive information model ready to be deployed in production environments
VoCol collaborative Development Environment for Vocabularies
Versioning: Git/Bitbucket
• Integrates a number of tools & services for different aspects of vocabulary development
• Is centered around Git version control (or Bitbucket), thus supporting the branching and merging of vocabularies
• Supports the roundtrip between
  • Schema/vocabulary development
  • Competency questions (expressed in SPARQL)
  • Example data
• Bridges between conceptual models and executable code
http://eis.iai.uni-bonn.de/Projects/VoCol.html
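The roundtrip between vocabulary, competency questions and example data can be sketched as follows. In VoCol the questions are expressed in SPARQL; the single-pattern matcher below is only a stand-in so that the sketch runs without an RDF library, and the function name match, the example data and the question are assumptions.

```python
# Example data using the vocabulary terms from the earlier vocabulary example.
EXAMPLE_DATA = [
    ("DHL", "rdf:type", "Company"),
    ("DHL", "headquarter", "PostTower"),
    ("PostTower", "rdf:type", "Building"),
]


def match(triples, pattern):
    """Return variable bindings for one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Competency question: "Which building is the headquarters of DHL?"
answers = match(EXAMPLE_DATA, ("DHL", "headquarter", "?building"))
print(answers)
```

If a question yields no answer on the example data, either the vocabulary lacks a term or the example data does not use it, which is the kind of feedback the roundtrip is meant to surface.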
Development based on Git – Version Control
Git is by now the most widely used version control system. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
Git was initially designed and developed in 2005 by Linux kernel developers for Linux kernel development.
Git is the basis for a variety of open-source or commercial services and products such as:
• GitHub/Bitbucket: Web-based Git repository hosting services with millions of users
• GitLab/Gitolite: open-source Web-based Git repository management platforms
• Team Foundation Server: since the 2013 release, Microsoft has added native support for Git
Git is easily extensible and can be integrated into arbitrary workflows via Git hooks.
VoCol Collaborative Vocabulary Development Environment Entry Page
VoCol Environment: Dynamic Documentation
VoCol Environment: Dynamic Visualization
VoCol Environment: Analytics
VoCol Environment: Version Control with Git/GitHub/GitLab/Bitbucket
VoCol Environment: Integrated SPARQL Querying, e.g. for checking competency questions
VoColMap Visualization
VoCol Environment: Direct Turtle Editing
VoCol Environment: Vocabulary Evolution Report
INDUSTRIAL DATA SPACE
Vocabulary-based Integration facilitates Data-driven Businesses
Vocabulary
The work on the Industrial Data Space is complementary to and closely interlinked with the Plattform Industrie 4.0
Retail 4.0, Bank 4.0, Insurance 4.0, …, Industry 4.0
Industry 4.0: focus on the manufacturing industry
Smart Services
Transmission, Networks
Real-time systems
Industrial Data Space: focus on data
Data
…
The Industrial Data Space Initiative
• Community of >30 large German and European companies
• Pre-competitive, publicly funded innovation project involving 11 Fraunhofer institutes for developing the IDS reference architecture
• Current members of the Industrial Data Space Association
Images: © Fotolia / Francesco De Paoli, Nmedia, hakandogu
Semantic Data Linking for Enterprise Data Value Chains
                          | Data Lake                 | Industrial Data Space                         | Pure Internet
                          | centralized, monopolistic | federated, secure, "trusted", standards-based | completely decentralized, open, insecure
Data management           | Central repository        | Decentralized                                 | Decentralized
Data ownership            | Central                   | Decentralized                                 | Decentralized
Data linking              | Single provider           | Federated, on demand                          | Missing
Data security             | Bilateral                 | Certified system                              | Bilateral
Market structure          | Central provider          | Role system                                   | Unstructured
Transport infrastructure  | Internet                  | Internet                                      | Internet
Images: © Fotolia 77260795 ∙ 73040142 ∙ 58947296 ∙ 68898041
Basic principles of the Industrial Data Space
• On Demand Interlinking
• Linked Light Semantics
• Security with Industrial Data Container
• Certified Roles
Image sources: iStockphoto
Industrial Data Space: On Demand Interlinking
[Figure: Enterprises 1 to 6 interlinked on demand through Services A to G]
All data stays with its owner and remains controlled and secured. Data is shared only on request for a service. There is no central platform.
Industrial Data Space
[Figure: architecture diagram with the following components]
• Company A and Company B, each with an Internal IDS Connector and an External IDS Connector
• Industrial Data Space Broker: Clearing, Registry, Index
• Industrial Data Space App Store: Apps, Vocabulary
• Third Party Cloud Provider
• Internet
• Data flows: Upload, Download, Search
IDS Architecture Overview
Big Data is not Just Volume and Velocity
Variety (& Veracity) are key challenges
Linked Data helps in dealing with both
• The Linked Data life-cycle requires integrating and adapting results from a number of disciplines
  – NLP,
  – Machine Learning,
  – Knowledge Representation,
  – Data Management,
  – User Interaction
  – …
• Applications in a number of domains
  – cultural heritage,
  – life sciences,
  – industry 4.0 / cyber-physical systems,
  – smart cities,
  – mobility,
  – …

Linked Data links not only data but also:
• Various disciplines
• Applications and use cases
Creating Knowledge out of Interlinked Data
Thanks for your attention!
Sören Auer
http://www.iai.uni-bonn.de/~auer | http://[email protected]
https://www.eccenca.com