![Page 1: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/1.jpg)
This project has received funding from the European Union's Horizon 2020 Research and Innovation
programme under grant agreement No 809965.
SANSA: Distributed Semantic Analytics
![Page 2: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/2.jpg)
Killing two birds with one stone… Why not installing SANSA during the presentation? 😉
❖ From a the SANSA-Notebooks github repository
<https://github.com/sansa-stack/sansa-notebooks>
❖ Requirements: ➢ docker ➢ docker-compose
❖ Only 5 commands: ➢ git clone https://github.com/sansa-stack/sansa-notebooks
➢ cd sansa-notebooks/
➢ make
➢ make up
➢ make load-data
➢ Goto: http://localhost/
2
![Page 3: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/3.jpg)
SANSA: Motivation ❖ Abundant machine readable structured information is
available (e.g. in RDF) ➢ Across SCs, e.g. Life Science Data ➢ General: DBpedia, Google knowledge graph ➢ Social graphs: Facebook, Twitter
❖ Need for scalable querying, inference and machine learning ➢ Link prediction ➢ Knowledge base completion ➢ Predictive analytics
3
![Page 4: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/4.jpg)
Social Networks
![Page 5: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/5.jpg)
Social Networks as Graphs
http://www.touchgraph.com/assets/images/TouchGraph%20Hashable%20SXSW.png
![Page 6: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/6.jpg)
Knowledge Graphs
❖Modelling entities and their relationships
❖ Analysis: finding underlying structure of graph e.g. to predict unknown relationships
❖ Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, WikiData
![Page 7: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/7.jpg)
Knowledge Graph
![Page 8: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/8.jpg)
Motivation ❖ Over the last years, the size of the Semantic Web has
increased and several large-scale datasets were published ➢ As of August 2018
Source: LOD-Cloud (http://lod-cloud.net/ )
8
![Page 9: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/9.jpg)
Motivation ❖ Over the last years, the size of the Semantic Web has
increased and several large-scale datasets were published ➢ Based on LOD Stats (http://lodstats.aksw.org/ )
Source: LOD-Cloud (http://lod-cloud.net/ )
~10, 000 datasets Openly available
online using Semantic Web standards
9
![Page 10: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/10.jpg)
Motivation ❖ Over the last years, the size of the Semantic Web has
increased and several large-scale datasets were published ➢ Based on LOD Stats (http://lodstats.aksw.org/ )
Source: LOD-Cloud (http://lod-cloud.net/ )
many datasets RDFized and kept private
(e.g. Supply chain, manufacture, ethereum
dataset, etc.)
~10, 000 datasets Openly available
online using Semantic Web standards
10
![Page 11: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/11.jpg)
Motivation ❖ Dealing with such amount of data makes many tasks hard to
be solved on single machines
Source: LOD-Cloud (http://lod-cloud.net/ )
11
![Page 12: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/12.jpg)
Motivation ❖ Dealing with such amount of data makes many tasks hard to
be solved on single machines
Source: LOD-Cloud (http://lod-cloud.net/ )
Vocabulary Reuse
Find a suitable vocabulary for your dataset
12
![Page 13: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/13.jpg)
Motivation ❖ Dealing with such amount of data makes many tasks hard to
be solved on single machines
Source: LOD-Cloud (http://lod-cloud.net/ )
Vocabulary Reuse
Find a suitable vocabulary for your dataset
Coverage Analysis
Does dataset contain
necessary information?
13
![Page 14: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/14.jpg)
Motivation ❖ Dealing with such amount of data makes many tasks hard to
be solved on single machines
Source: LOD-Cloud (http://lod-cloud.net/ )
Vocabulary Reuse
Find a suitable vocabulary for your dataset
Coverage Analysis
Does dataset contain
necessary information?
Privacy Analysis
Does dataset contain
sensitive information?
14
![Page 15: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/15.jpg)
Motivation ❖ Dealing with such amount of data makes many tasks hard to
be solved on single machines
Source: LOD-Cloud (http://lod-cloud.net/ )
Vocabulary Reuse
Find a suitable vocabulary for your dataset
Coverage Analysis
Does dataset contain
necessary information?
Privacy Analysis
Does dataset contain
sensitive information?
Entity Linking
Which datasets are good
candidates for interlinking?
15
![Page 16: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/16.jpg)
Tasks hard to solve on single machines (>1TB memory consumption):
• Querying and processing LinkedGeoData • Dataset statistics and quality assessment of the LOD Cloud • Vandalism and outlier detection in Wikidata • Inference on life science data (e.g. UniProt, EggNOG, StringDB) • Clustering of DBpedia data • Clustering of user-logs of the Big Data Europe integrator platform for the
creation of user profiles • Large-scale enrichment and link prediction for e.g. DBpedia →
LinkedGeoData
Why Distributed RDF Data Processing?
16
![Page 17: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/17.jpg)
SANSA Stack Vision
17
![Page 18: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/18.jpg)
Why combining Big Data and SW?
“Big Data” Processing (Spark/Flink) Semantic Technology Stack
Data Integration Manual pre-processing Partially automated,
standardised
Modelling Simple (often flat feature vectors) Expressive
Support for data
exchange
Limited (heterogeneous formats
with limited schema information)
Yes (RDF & OWL W3C
Recommendations)
Business value Direct Indirect
Horizontally
scalable
Yes No
Idea: combine advantages of both worlds
18
![Page 19: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/19.jpg)
SANSA Stack ❖ It’s core is a processing data flow engine that provides data
distribution, and fault tolerance for distributed computations over RDF large-scale datasets
❖ SANSA includes several libraries for creating applications: ➢ Read / Write RDF / OWL library ➢ Querying library ➢ Inference library ➢ ML- Machine Learning core library
http://sansa-stack.net/
19
![Page 20: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/20.jpg)
SANSA Layers
20 20
![Page 21: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/21.jpg)
SANSA: Read Write Layer
21
![Page 22: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/22.jpg)
SANSA: Read Write Layer ❖ Ingest RDF and OWL data in different formats using Jena /
OWL API style interfaces ❖ Represent data in multiple formats ➢ (e.g. RDD, Data Frames, GraphX, Tensors)
❖ Allow transformation among these formats ❖ Compute dataset statistics and apply functions to URIs,
literals, subjects, objects → Distributed LODStats
val triples = spark.rdf(Lang.NTRIPLES)(input)
triples.find(None, Some(NodeFactory.createURI("http://dbpedia.org/ontology/influenced")), None)
val rdf_stats_prop_dist = triples.statsPropertyUsage()
22
![Page 23: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/23.jpg)
SANSA: Read Write Layer features
RDF Data
Distributed
data
structures
Da
ta r
ep
res
en
tati
on
Gra
ph
pro
ce
ss
ing
OW
L
Pa
rtit
ion
ing
str
ate
gie
s
RD
F s
tats
Qu
ality
as
se
ss
me
nt
Te
ns
ors
/KG
E
R2
RM
L M
ap
pin
gs
SANSA Engine
23
![Page 24: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/24.jpg)
SANSA: OWL Support ❖ Distributed processing of OWL axioms ❖ Support for Manchester OWL & functional syntax ❖ Derived distributed data structures: ➢ E.g. matrix representation of subclass-of axioms
to compute its closure via matrix operations
val rdd =spark.owl(Syntax.MANCHESTER)("file.owl") // get all subclass-of axioms val sco = rdd.filter(_.isInstanceOf[OWLSubClassOfAxiom])
24
![Page 25: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/25.jpg)
RDF to Tensors (experimental) • Tijk is 1 if triple (i-th entity, k-th predicate, j-th entity) exists
and 0 otherwise
25
![Page 26: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/26.jpg)
SANSA: Query Layer
26
![Page 27: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/27.jpg)
SANSA: Query Layer ❖ To make generic queries efficient and fast using:
➢ Intelligent indexing ➢ Splitting strategies ➢ Distributed Storage
❖ SPARQL query engine evaluation ➢ (SPARQL-to-SQL approaches, Virtual Views)
❖ Provision of W3C SPARQL compliant endpoint
val triples = spark.rdf(Lang.NTRIPLES)(input)
val sparqlQuery = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
val result = triples.sparql(sparqlQuery) 27
![Page 28: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/28.jpg)
Querying via SPARQL & Partitioning
SANSA Engine
RDF/OWL Layer
Data Ingestion
Partitioning
Query Layer
Sparqlifying Distributed Data
Structures
Results Views Views
28
![Page 29: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/29.jpg)
Querying via SPARQL & Partitioning
29
![Page 30: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/30.jpg)
SANSA-DataLake
❖ A solution for the virtual Big Data integration: Semantic Data Lake
➢ Directly query original data without prior
transformation/loading
❖ Scalable cross-source query execution (join)
❖ Extensible (programmatically)
➢ Do not reinvent the wheel: use existing engine connectors
(wrappers)
30
![Page 31: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/31.jpg)
Querying via Semantic Data Lake C
SV
SANSA Engine
Da
ta L
ake
La
ye
r
Wra
pp
ers
Query
Layer
Results
Par
qu
et
Mo
ngo
Cas
san
dra
Databases
val result = spark.sparqlDL(query, mappings, config)
Distributed Data
Structures
SPARQL query Data Lake
31
![Page 32: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/32.jpg)
SANSA-DataLake
32
![Page 33: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/33.jpg)
SANSA-DataLake
DF2 DF1 DFn
Join
ParSets
Relevant Data Sources
Distributed Query
Processing
Query Decomposition
Mappings
DFr
Final Results
Data Wrapping Relevant Entity
Extraction
Data Lake
Parallel Operational Area (POA)
Union
Config
Query
4
1
2 3
33
![Page 34: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/34.jpg)
SANSA: Inference Layer
34
![Page 35: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/35.jpg)
SANSA: Inference Layer • The volume of semantic data growing rapidly • Implicit information needs to be derived by reasoning
• Reasoning - the process of deducing implicit information from existing RDF data by using W3C Standards for Modelling: RDFS or OWL fragments
35
![Page 36: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/36.jpg)
SANSA: Inference Layer • Reasoning can be performed in two different strategies:
– The forward chaining strategy derives and stores the derived RDF data back into original RDF data storage for late queries from applications (data-driven).
– The backward chaining strategy derives implicit RDF data on the fly during query process (goal-directed).
• The forward chaining strategy has lower query response time and high load time.
36
![Page 37: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/37.jpg)
SANSA: Inference Layer • Parallel in-memory inference via rule-based forward chaining • Beyond state of the art: dynamically build a rule dependency
graph for a rule set
→ Adjustable performance
→ Allows domain-specific customisation
37
![Page 38: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/38.jpg)
SANSA: Inference Layer Parallel RDFS Reasoning Algorithm based on Spark
Input RDD
Filtered RDDs
Input
Ontology
Intermediate Join
results RDD Single Rule
results
Rule Set
results
...
Iterate until fixed point iteration
filter
filter
filter
join
union
distinct
38
![Page 39: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/39.jpg)
SANSA: Inference Layer Some RDFS inference rules • (X R Y), (R subPropertyOf Q) ➞ (X Q Y) • (X R Y), (R domain C) ➞ (X type C) • (X type C), (C subClassOf D) ➞ (X type D)
40
![Page 40: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/40.jpg)
SANSA: Inference Layer
RDFS rule dependency graph (simplified)
41
![Page 41: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/41.jpg)
SANSA: Inference Layer Some OWL Horst Rules: • p owl:inverseOf q, v p w ➞ w q v • p owl:inverseOf q, v q w ➞ w p v
43
![Page 42: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/42.jpg)
SANSA: Inference Layer • Part of OWL Horst rule dependency
graph (simplified)
a) p owl:inverseOf q, v p w ➞ w q v
b) p owl:inverseOf q, v q w ➞ w p v
p rdfs:subPropertyOf q
, s p o ➞ s q o
p rdf:type owl:SymmetricProperty
v p u ➞ u p v
p rdf:type owl:TranstiveProperty
v p u
W p v
➞ u p v
44
![Page 43: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/43.jpg)
SANSA: Inference Layer • SANSA-Inference Layer support RDFs and OWL-Horst
reasoning in Triples and OWLAxioms
• Triple based forward chaining:
// load triples from disk
val graph = RDFGraphLoader.loadFromDisk(spark, input, parallelism)
val reasoner = new ForwardRuleReasonerOWLHorst(spark.sparkContext)
// compute inferred graph
val inferredGraph = reasoner.apply(graph)
45
![Page 44: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/44.jpg)
SANSA: Inference Layer • Axiom based forward chaining:
// load axioms from disk
var owlAxioms = spark.owl(Syntax.FUNCTIONAL)(input)
// create reasoner and compute inferred graph
val inferredGraph = profile match {
case RDFS => new ForwardRuleReasonerRDFS(spark.sparkContext,
parallelism)(owlAxioms)
case OWL_HORST => new
ForwardRuleReasonerOWLHorst(spark.sparkContext, parallelism)(owlAxioms)
case _ => throw new RuntimeException("Invalid profile: '" +
profile + "'")
} 46
![Page 45: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/45.jpg)
SANSA: ML Layer
47
![Page 46: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/46.jpg)
SANSA: ML Layer ❖ Distributed Machine Learning (ML) algorithms that work on
RDF data and make use of its structure / semantics ❖ Algorithms: ➢ Graph Clustering
■ Power Iteration, ■ BorderFlow, ■ Link based ■ Modularity based clustering
➢ Association rule mining (AMIE+ = mining horn rules from RDF data using partial completeness assumption and type constraints)
➢ Outlier detection ➢ KG Kernels 48
![Page 47: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/47.jpg)
Scope
49
![Page 48: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/48.jpg)
RDF By Modularity Clustering • A Hierarchical clustering method • Starts with each vertex in its own community • Iteratively join pairs of community choosing the one with
greatest increase in the optimizing function Q – Optimization function identifies the significant community
structure
• The cut off point maximal value of Q. • Scales as the square of the network size
50
RDFByModularityClusteringAlg(spark.sparkContext, numIterations, input, output)
spark.stop
![Page 49: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/49.jpg)
Power Iteration Clustering • Simple and fast version of spectral clustering technique • Efficient and scalable in terms of time O(n) and space • Applying PowerIteration to the row normalized affinity
matrix • Partitioning clustering algorithm
– Outputs one-level clustering solution
51
val lang = Lang.NTRIPLES val triples = spark.rdf(lang)(input) val graph = triples.asStringGraph() val cluster = RDFGraphPowerIterationClustering(spark, graph, output, k, maxIterations) cluster.saveAsTextFile(output)
![Page 50: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/50.jpg)
BorderFlow Clustering local graph clustering algorithm
Designing for directed and undirected weighted graphs
Clusters in BorderFlow:
Maximal intra-cluster density
Minimal outer-cluster density
52
val lang = Lang.NTRIPLES val triples = spark.rdf(lang)(input) val graph = triples.asGraph() val borderflow = algName match { case "borderflow" => BorderFlow(spark, graph, output, outputevlsoft, outputevlhard) case "firsthardening" => FirstHardeninginBorderFlow(spark, graph, output, outputevlhard) case _ => throw new RuntimeException("'" + algName + "' - Not supported, yet.") }
![Page 51: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/51.jpg)
Link Based Clustering • Hierarchical link clustering method • Bottom up approach of hierarchical called the
“agglomerative” • Clusters are created recursively • The similarity S between links can be given by e.g. Jaccard
similarity 1. val lang = Lang.NTRIPLES 2. val triples = spark.rdf(lang)(input) 3. val graph = triples.asStringGraph()
4. AlgSilviaClustering(spark, graph, output, outputeval)
53
![Page 52: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/52.jpg)
DBSCAN Clustering • Clustering of POIs on the basis of spatially closed
coordinates. • POIs located in same geolocation are clustered irrespective
of their categories.
54
1. val lang = Lang.NTRIPLES 2. val triples = spark.rdf(lang)(input) 3. val graph = triples.asStringGraph()
4. AlgSilviaClustering(spark, graph, output,
outputeval)
![Page 53: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/53.jpg)
Numerical Outliers Detection • Detecting numerical outliers in large RDF dataset. • Spark minHashLSH used to create the cohort of similar class. • A scalable approach to find the outliers in a massive dataset. • Numerical outliers detected in the data are useful for
improving the quality of RDF Data
55
![Page 54: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/54.jpg)
RDF Graph Kernels Given an RDF graph constructing a tree for each instance and counting the number of paths in it.
Literals in RDF can only occur as objects in triples and therefore have no out-going edges in the RDF graph.
Apply Term Frequency-Inverse Document Frequency (TF-IDF) for vectors.
56
1. val rdfFastGraphKernel = RDFFastGraphKernel(spark, triples, "http://swrc.ontoware.org/ontology#affiliation")
2. val data = rdfFastGraphKernel.getMLLibLabeledPoints
3. RDFFastTreeGraphKernelUtil.predictLogisticRegressionMLLIB(data,
4, iteration)
![Page 55: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/55.jpg)
Rule Mining • Association Rule Mining under Incomplete Evidence (AMIE) • Atoms : are facts with the subject and object position
substituted with variables e.g. (isChildOf(?a,?b)) • Rules : made up of atoms having a head (one atom) and the
body (multiple atoms) • The body predicts the head • A Rule can be written as
• OR
57
![Page 56: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/56.jpg)
Interactive SANSA in your Browser
58
![Page 57: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/57.jpg)
BDE & General Integration ❖ SANSA = Scala / Maven Repositories based on Spark / Flink ❖ Easy to include both in BDE platform and any Spark / Flink
environment
59
![Page 58: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/58.jpg)
SANSA Planning and Pulse ❖ SANSA 0.6 in June 2018, releases every 6 months ❖ Apache Open Source License ❖ Project activity: ➢ Contributors (at least one commit): 17 ➢ Commits per day: 5.9 - Commits previous year: 2175 ➢ Github stars (all repos): 187
60
![Page 59: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/59.jpg)
Conclusions and Next steps ❖ A generic stack for (big) Linked Data
➢ Build on top of a state-of-the-art distributed frameworks (Spark, Flink)
❖ Out-of-the-box framework for scalable and distributed semantic data analysis combining semantic web and distributed machine learning for (1) querying, (2) inference and (3) analytics of RDF datasets.
❖ Next steps ➢ Support for SPARQL 1.1 and other partitioning strategies
(Query Layer) ➢ Backward chaining and better evaluation (Inference Layer) ➢ More algorithms and definition of ML pipelines (ML Layer)
61
![Page 60: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/60.jpg)
Associated Publications 1. Distributed Semantic Analytics using the SANSA Stack by Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick
Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen in
Proceedings of 16th International Semantic Web Conference – Resources Track (ISWC’2017), 2017 [BibTex].
2. The Tale of Sansa Spark by Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler,
Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngomo Ngonga, and Hajira Jabeen in
Proceedings of 16th International Semantic Web Conference, Poster & Demos, 2017 [BibTex].
3. DistLODStats: Distributed Computation of RDF Dataset Statistics by Gezim Sejdiu, Ivan Ermilov, Jens Lehmann, and
Mohamed Nadjib-Mami in Proceedings of 17th International Semantic Web Conference, 2018. [BibTex]
4. STATisfy Me: What are my Stats? by Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and Mohamed-Nadjib Mami. In
Proceedings of 17th International Semantic Web Conference, Poster & Demos, 2018.
5. Profiting from Kitties on Ethereum: Leveraging Blockchain RDF with SANSA by Damien Graux; Gezim Sejdiu; Hajira
Jabeen; Jens Lehmann; Danning Sui; Dominik Muhs; and Johannes Pfeffer. In 14th International Conference on Semantic
Systems, Poster & Demos, 2018.
6. SPIRIT: A Semantic Transparency and Compliance Stack by Patrick Westphal, Javier Fernández, Sabrina Kirrane and
Jens Lehmann. In 14th International Conference on Semantic Systems, Poster & Demos, 2018.
7. Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD) by
Hajira Jabeen; Rajjat Dadwal; Gezim Sejdiu; and Jens Lehmann. In 21st International Conference on Knowledge Engineering
and Knowledge Management (EKAW’2018), 2018.
8. Clustering Pipelines of large RDF POI Data by Rajjat Dadwal; Damien Graux; Gezim Sejdiu; Hajira Jabeen; and Jens
Lehmann. In ESWC 2019 (Poster Track).
62
![Page 61: SANSA: Distributed Semantic Analytics · 2019-06-21 · SANSA: Motivation Abundant machine readable structured information is available (e.g. in RDF) Across SCs, e.g. Life Science](https://reader034.vdocuments.us/reader034/viewer/2022042811/5fa3355c2f7db875301479a6/html5/thumbnails/61.jpg)
This project has received funding from the European Union's Horizon 2020 Research and Innovation
programme under grant agreement No 809965.
THANK YOU !
Prof. Jens Lehmann Dr. Damien Graux Dr. Hajira Jabeen