linked open data - how to juggle with more than a billion triples
DESCRIPTION
Slides of my inauguration talk at the University of Mannheim, Germany, in October 2012. Download this slide set to enjoy all animations.
TRANSCRIPT
Slide 1
How to Juggle with More than a Billion Triples?
Ansgar Scherp – [email protected]
Research Group on Data and Web Science
Universität Mannheim
October 2012
Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/
Slide 2
My thanks go to …
• Marianna
• Simon Schenk
• Carsten Saathoff
• Thomas Franz
• Thomas Gottron
• Steffen Staab
• Arne Peters
• Bastian Krayer
• Daniel Eißing
• Mathias Konrath
• Daniel Schmeiß
• Anton Baumesberger
• Frederik Jochum
• Alexander Kleinen
And many more …
Slide 3
Scenario
• Tim plans to travel
  – from London
  – to a customer in Cologne
Slide 4
Website of the German Railway
It works, why bother…?
Eurostar
DB
KVB
Slide 5
Let's Try Different Queries
• Bottlenecks in public transportation?
• Compare the connections with flights?
• Visualize on a map?
• …
None of these queries can be answered, because the data …
Slide 6
… is locked in silos!
– High integration effort
– Little reuse of data
B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
Slide 7
Linked Data
• Publishing and interlinking of data
• Of different quality and purpose
• From different sources on the Web
World Wide Web       Linked Data
Documents            Data
Hyperlinks           Typed links
HTML                 RDF
Addresses (URIs)     Addresses (URIs)
Example: http://www.uni-mannheim.de/
Slide 8
Relevance of Linked Data?
Slide 9
Linked Data: May '07 to Sept. '11
[Figure: LOD cloud diagram, < 31 billion triples. Domains: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences]
Source: http://lod-cloud.net
Slide 10
Linked Data Principles
1. Identification
2. Interlinkage
3. Dereferencing
4. Description
Slide 11
Example: Big Lynx
Big Lynx Company
Matt Briggs
Scott Miller
Source: http://lod-cloud.net
< 31 billion triples
?
Slide 12
1. Use URIs for Identification
B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
http://biglynx.co.uk/people/matt-briggs
http://biglynx.co.uk/people/scott-miller
Matt Briggs
Scott Miller
Slide 13
Example: Big Lynx
Big Lynx Company
Matt Briggs
Scott Miller
How to model relationships like knows?
Slide 14
Resource Description Framework (RDF)
• Description of resources with RDF triples
Subject Predicate Object
Matt Briggs is a Person
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
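As a rough illustration of the subject/predicate/object structure, a triple can be held as a plain tuple of three URIs. This is only a sketch in Python: the names `Triple` and `to_ntriples` are made up for this example, and it assumes all three terms are URIs (no literals or blank nodes).

```python
from typing import NamedTuple

# Standard namespace URIs for the rdf: and foaf: prefixes.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"


class Triple(NamedTuple):
    """Hypothetical helper type: one RDF statement with URI terms only."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Write a URI-only triple as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# "Matt Briggs is a Person"
matt_is_person = Triple(
    "http://biglynx.co.uk/people/matt-briggs",
    RDF + "type",
    FOAF + "Person",
)
print(to_ntriples(matt_is_person))
```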
Slide 15
1. Use URIs also for Relations
B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
http://biglynx.co.uk/people/matt-briggs
http://biglynx.co.uk/people/scott-miller
foaf:knows
foaf:knows
Slide 16
Example: Big Lynx
Big Lynx Company
DBpedia
Matt's private website
"same person"
Matt Briggs
"lives here"
Dave Smith
London
…
Matt Briggs
Scott Miller
Slide 17
2. Establishing Interlinkage
• Relation links between resources
<http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .
• Identity links between resources
<http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
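One way to see what an identity link buys: a consumer can collect every URI that denotes the same thing by following owl:sameAs in both directions. The sketch below does this over an in-memory set of triples; the function name `same_as_aliases` and the tuple representation are assumptions made for this example, not something the slides prescribe.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_BASED_NEAR = "http://xmlns.com/foaf/0.1/based_near"

# The two links from the slide, as (subject, predicate, object) tuples.
triples = {
    ("http://biglynx.co.uk/people/dave-smith", FOAF_BASED_NEAR,
     "http://dbpedia.org/resource/London"),
    ("http://biglynx.co.uk/people/matt-briggs", OWL_SAME_AS,
     "http://www.matt-briggs.eg.uk#me"),
}


def same_as_aliases(uri, triples):
    """Return all URIs reachable from `uri` via owl:sameAs, in either direction."""
    aliases, frontier = {uri}, [uri]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != OWL_SAME_AS:
                continue
            # owl:sameAs is symmetric, so check both orientations.
            for a, b in ((s, o), (o, s)):
                if a == current and b not in aliases:
                    aliases.add(b)
                    frontier.append(b)
    return aliases
```

Relation links such as foaf:based_near are left alone here: they connect different things rather than merge identities.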
Slide 18
Example: Big Lynx
DBpedia
Big Lynx Company
Matt's private website
foaf:based_near
owl:sameAs
Matt Briggs
Dave Smith
London
Matt Briggs
"same person"
"lives here"
Slide 19
3. Dereferencing of URIs
• Looking up web documents
• How can we “look up” things of the real world?
http://biglynx.co.uk/people/matt-briggs
Slide 20
Two Approaches
1. Hash URIs
– The URI contains a fragment separated by #, e.g.,
http://biglynx.co.uk/vocab/sme#Team
2. 303 URIs: redirect via a "303 See Other" response
Request: http://biglynx.co.uk/people/matt-briggs
Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
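The client side of both approaches can be sketched with the Python standard library. For a hash URI, the client strips the fragment before fetching; for a 303 URI, it simply requests the URI and follows the redirect. The helper names `document_uri` and `rdf_request` and the choice of `Accept` media types are assumptions of this sketch. The request is only built, not sent.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def document_uri(uri: str) -> str:
    """For a hash URI, the document to fetch is the URI without its fragment."""
    return urldefrag(uri).url


def rdf_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET asking for an RDF representation."""
    return Request(
        document_uri(uri),
        headers={"Accept": "text/turtle, application/rdf+xml"},
    )


# Hash URI: the fragment "#Team" never reaches the server.
print(document_uri("http://biglynx.co.uk/vocab/sme#Team"))

# 303 URI: the request goes to the URI itself; the server's redirect
# then points to the describing document.
req = rdf_request("http://biglynx.co.uk/people/matt-briggs")
print(req.full_url, req.get_header("Accept"))
```

Passing such a request to `urllib.request.urlopen` would follow a 303 redirect automatically, so the caller ends up with the describing document in either case.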
Slide 21
Example: Big Lynx
DBpedia
Big Lynx Company
Matt's private website
foaf:based_near
owl:sameAs
Matt Briggs
Dave Smith
London
Matt Briggs
Description of Matt?
Slide 22
4. Description of URIs
[Figure: RDF graph describing biglynx:matt-briggs. Nodes: biglynx:matt-briggs, foaf:Person, biglynx:dave-smith, dp:Birmingham, dp:London, _:point, "51.509", "-0.118". Edge labels: rdf:type, foaf:knows, foaf:based_near, ex:loc, wgs84:lat, wgs84:long]
Slide 23
Formalization of Description
Given an RDF graph $G = (V, P, E)$ with $V = R \cup B \cup L$ and $E \subseteq (R \cup B) \times P \times V$:
$$I^{0} = \{ (s, p, o) \mid (s, p, o) \in E \wedge s = n \}$$
$$I^{j+1} = \{ (o, p', o') \in E \mid \exists (s, p, o) \in I^{j} : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I^{k} \}$$
$$\mathrm{SimpleCBD}(n) = \bigcup_{j=0}^{\infty} I^{j}$$
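Read operationally, SimpleCBD(n) starts with all triples whose subject is n and then keeps adding the triples of any blank node that appears as an object, until nothing new is found. A minimal Python sketch of that reading follows. It assumes triples are string tuples and that blank nodes are recognizable by a `_:` prefix; both are conventions of this example, as are the `ex:` identifiers in the sample graph.

```python
def simple_cbd(n, edges, is_blank=lambda term: term.startswith("_:")):
    """Collect the description of node n from a set of (s, p, o) triples."""
    included = {t for t in edges if t[0] == n}  # all statements about n
    layer = included
    while layer:
        # Blank nodes reached as objects in the most recent layer ...
        blank_objects = {o for (_, _, o) in layer if is_blank(o)}
        # ... contribute their own statements, unless already included.
        layer = {t for t in edges
                 if t[0] in blank_objects and t not in included}
        included |= layer
    return included


# Hypothetical example graph.
graph = {
    ("ex:matt", "rdf:type", "foaf:Person"),
    ("ex:matt", "foaf:knows", "ex:dave"),
    ("ex:matt", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", '"51.509"'),
    ("_:point", "wgs84:long", '"-0.118"'),
    ("ex:dave", "foaf:based_near", "ex:Birmingham"),
}
description = simple_cbd("ex:matt", graph)
```

In this example the description contains the five triples about `ex:matt` and `_:point`. The statement about `ex:dave` is left out, because `ex:dave` is a named resource that can be dereferenced on its own.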
Slide 24
W3C RDF / RDF Schema Vocabulary
• Set of URIs defined in the rdf: and rdfs: namespaces
• rdf:type, rdf:Property, rdf:XMLLiteral, rdf:List, rdf:first, rdf:rest, rdf:Seq, rdf:Bag, rdf:Alt, rdf:value, ...
• rdfs:domain, rdfs:range, rdfs:Resource, rdfs:Literal, rdfs:Datatype, rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:comment, rdfs:label, …
Slide 25
Semantic Web Layer Cake (Simplified)
Slide 26
Exploration of Linked Data
Source: http://lod-cloud.net
< 31 billion triples
WordNet
Swoogle
GeoNames
Slide 27
Naive Approach
• Download all data
• Store it in a really big database
• Program the queries
• Design the user interface
Swoogle
WordNet
GeoNames
RDFS
Rules
Geo
Inflexible
Monolithic
Not scalable
Slide 28
SemaPlorer Approach
GeoQueries
GeoNames
Flexible
Scalable
Extensible
RDFS Rules
WordNet Swoogle
Fulltext
12 months in 2005/06, 700 million triples
> 1 billion triples
placeOfBirth
birthplace
Slide 29
SemaPlorer – Semantic Social Media
Watch video online: http://vimeo.com/2057249
Slide 31
Searching for Linked Data Sources
Source: http://lod-cloud.net
< 31 billion triples
Persons that are
- Politicians and
- Actors?
Slide 32
Idea: Index of Data Sources
“Politician and Actor”
?
Query
SELECT ?x
FROM …
WHERE {
  ?x rdf:type ex:Actor .
  ?x rdf:type ex:Politician .
}
Index
Slide 33
The Naive Approach
1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract the schema
4. Provide lookup
- Big machinery
- Data is processed late
- High effort to scale with the LOD cloud
Can we be smarter?
Slide 34
Idea: Schema-level index
• Define families of graph patterns
• Assign instances to graph patterns
• Map graph patterns to context (source URI)
Construction
• Stream-based for scalability
• Little loss of accuracy
Note
• Index is defined over instances
• But stores the context
Slide 35
Input Data: N-Quads
<subject> <predicate> <object> <context>
Example:
<http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>
w3p:#me
foaf:Person
http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf
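To make the input format concrete, here is a deliberately simplified parser for one such line. It is a sketch that assumes all four terms are URIs in angle brackets; real N-Quads also allow literals and blank nodes, which a proper parser (the slides mention NxParser) handles. The names `Quad` and `parse_quad` are invented for this example.

```python
import re
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # the data source the statement was found in


# Four <...> terms, optionally followed by the terminating dot.
_QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$"
)


def parse_quad(line: str) -> Quad:
    """Parse one URI-only n-quad line; raise ValueError on anything else."""
    match = _QUAD.match(line)
    if match is None:
        raise ValueError(f"not a URI-only n-quad: {line!r}")
    return Quad(*match.groups())


example = parse_quad(
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>"
)
```

The fourth term is what distinguishes a quad from a triple: it records where the statement came from, which is exactly what an index of data sources needs.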
Slide 36
SchemEX Approach
• Stream-based schema extraction
• While crawling the data
[Figure: SchemEX pipeline. Labels: LOD-Crawler, RDF-Dump, Triple Store, Parser (NxParser), Nquad-Stream, Instance-Cache (FIFO), Schema-Extractor, Schema-Level Index, RDF, RDBMS]
Slide 37
Building the Index from a Stream
• Stream of n-quads (coming from an LD crawler)
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: FIFO instance cache; node labels 1 to 6 and C1, C2, C3]
• Linear runtime complexity w.r.t. the number of input triples
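The following sketch shows how a bounded FIFO cache keeps the pass over the stream linear: each quad is touched once, and an instance is written to the index when it falls out of the cache. It is a simplified illustration, not the SchemEX implementation. It only tracks rdf:type statements, and the function name `build_index` and the in-memory dictionary index are assumptions.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_index(quads, cache_size):
    """Map each observed set of types to the contexts (sources) it occurs in."""
    if cache_size < 1:
        raise ValueError("cache_size must be at least 1")
    cache = OrderedDict()     # subject -> (set of types, set of contexts)
    index = defaultdict(set)  # frozenset of types -> set of contexts

    def flush(entry):
        types, contexts = entry
        index[frozenset(types)] |= contexts

    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # FIFO: evict the instance that entered the cache first.
                _, oldest = cache.popitem(last=False)
                flush(oldest)
            cache[s] = (set(), set())
        types, contexts = cache[s]
        contexts.add(c)
        if p == RDF_TYPE:
            types.add(o)

    for entry in cache.values():  # write out what is still cached
        flush(entry)
    return dict(index)


stream = [
    ("a", RDF_TYPE, "Person", "d1"),
    ("b", RDF_TYPE, "Person", "d2"),
    ("a", RDF_TYPE, "Male", "d1"),
]
large = build_index(stream, cache_size=10)
small = build_index(stream, cache_size=1)
```

With the large cache, instance `a` is indexed under the type set {Person, Male}. With a cache of size 1, `a` is evicted before its second type arrives and ends up split across {Person} and {Male}. This is the accuracy that a small window gives up in exchange for bounded memory.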
Slide 38
Building the Schema and Index
[Figure: layered schema-level index. Layers: RDF classes (C1, C2, C3, …, Ck), type clusters (TC1, TC2, …, TCm), equivalence classes (EQC1, EQC2, …, EQCn), data sources (DS1, DS2, DS3, DS4, DS5, …, DSx). Edge labels: consistsOf, hasEQClass, hasDataSource, p1, p2]
Slide 39
Layer 1: RDF Classes
timbl:card#i
foaf:Person
foaf:Person
http://www.w3.org/People/Berners-Lee/card
http://dig.csail.mit.edu/2008/...
C1
DS 1, DS 2, DS 3
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}
All instances of a particular type
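The first layer amounts to a lookup from a class to the data sources that contain instances of it. A minimal sketch, assuming quads are (subject, predicate, object, context) tuples; the function name `class_index` and the second source URI are hypothetical:

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


def class_index(quads):
    """Layer 1: map each RDF class to the contexts holding instances of it."""
    index = defaultdict(set)
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            index[o].add(c)
    return dict(index)


quads = [
    ("http://www.w3.org/People/Berners-Lee/card#i", RDF_TYPE, FOAF_PERSON,
     "http://www.w3.org/People/Berners-Lee/card"),
    # Hypothetical second source, for illustration only.
    ("http://example.org/people/alice", RDF_TYPE, FOAF_PERSON,
     "http://example.org/ds2"),
]
sources_for_person = class_index(quads)[FOAF_PERSON]
```

A query for all instances of foaf:Person can then be routed to just these sources.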
Slide 40
Layer 2: Type Clusters
timbl:card#i
foaf:Person
foaf:Person
http://www.w3.org/People/Berners-Lee/card
DS 1, DS 2, DS 3
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}
C1
C2
pim:Male
tc4711
pim:Male
All instances belonging to exactly the same set of types
TC1
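A type cluster groups instances by their exact set of types, so a conjunctive type query maps to one lookup key. The sketch below illustrates this with prefixed names as plain strings. The function name `type_clusters`, the `ex:` instances and the source labels `ds1` to `ds3` are assumptions of the example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # prefixed names are used as plain strings here


def type_clusters(quads):
    """Layer 2: map each exact set of types to the contexts it occurs in."""
    types = defaultdict(set)
    contexts = defaultdict(set)
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            types[s].add(o)
            contexts[s].add(c)
    clusters = defaultdict(set)
    for s, type_set in types.items():
        clusters[frozenset(type_set)] |= contexts[s]
    return dict(clusters)


quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", "ds1"),
    ("timbl:card#i", RDF_TYPE, "pim:Male", "ds1"),
    ("ex:bob", RDF_TYPE, "foaf:Person", "ds2"),
    ("ex:bob", RDF_TYPE, "pim:Male", "ds2"),
    ("ex:carol", RDF_TYPE, "foaf:Person", "ds3"),
]
clusters = type_clusters(quads)
```

Here `ex:carol` lands in a different cluster from the other two instances, because her type set is {foaf:Person} only. Membership requires exactly the same set of types, not just an overlap.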
Slide 41
Layer 3: Equivalence Classes
Two instances are equivalent iff:
• They are in the same TC
• They have the same properties
• The property targets are in the same TC
Similar to 1-bisimulation
EQC1
DS 1, DS 2, DS 3
C1
C2
TC1
C3
TC2
p
Slide 42
Layer 3: Equivalence Classes
timbl:card#i
foaf:Person
foaf:Person
http://www.w3.org/People/Berners-Lee/card
pim:Male
tc4711
pim:Male
foaf:PPD
timbl:card
eqc0815
foaf:PPD
tc1234
eqc0815-maker-tc1234
foaf:maker
SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
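The equivalence used on this layer can be turned into a grouping key: an instance's own type cluster, plus the set of (property, type cluster of the target) pairs. The sketch below is one possible reading of that definition over in-memory triples; the function name `equivalence_classes` and the `ex:` example data are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # prefixed names are used as plain strings here


def equivalence_classes(triples):
    """Group subjects by (own type set, {(property, target's type set)})."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    def type_cluster(node):
        return frozenset(types.get(node, ()))

    links = defaultdict(set)
    subjects = set()
    for s, p, o in triples:
        subjects.add(s)
        if p != RDF_TYPE:
            links[s].add((p, type_cluster(o)))

    classes = defaultdict(set)
    for s in subjects:
        classes[(type_cluster(s), frozenset(links[s]))].add(s)
    return dict(classes)


triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", RDF_TYPE, "pim:Male"),
    ("ex:alice", "foaf:maker", "ex:doc1"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "pim:Male"),
    ("ex:bob", "foaf:maker", "ex:doc2"),
    ("ex:carol", RDF_TYPE, "foaf:Person"),
    ("ex:carol", RDF_TYPE, "pim:Male"),
    ("ex:doc1", RDF_TYPE, "foaf:PersonalProfileDocument"),
    ("ex:doc2", RDF_TYPE, "foaf:PersonalProfileDocument"),
]
groups = list(equivalence_classes(triples).values())
```

In this example `ex:alice` and `ex:bob` fall into one class. `ex:carol` shares their type cluster but has no foaf:maker link, so she forms a class of her own.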
Slide 43
Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL’s FOAF profile
• LDspider with ~ 2k triples / sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all types, TCs, EQCs
• Good precision/recall ratio at 50k+
• Commodity hardware (4GB RAM, single CPU)
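For the comparison against the reference schema, precision and recall can be computed per index query from the set of data sources the stream-based index returns and the set the reference index returns. The sketch below uses the standard definitions; the function name and the convention of returning 1.0 for an empty set are assumptions, and the example sources are made up.

```python
def precision_recall(retrieved, relevant):
    """Standard precision and recall of a retrieved set against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty set counts as perfect (1.0).
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical query result: the stream-based index returns three sources,
# the reference index lists four.
p, r = precision_recall({"ds1", "ds2", "ds3"}, {"ds1", "ds2", "ds4", "ds5"})
```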
Slide 44
Quality of Stream-based Index Construction
+ Runtime hardly increases with window size
+ Memory consumption scales with window size
Slide 47
And 2012? Get the Google Feeling!
Slide 48
Semantic Data Management Chain
• Research topics in a greater context
Use, Aggregate, Publish, Collect
OntoMDE
Core Ontologies
Kreuzverweis.com
SchemEX*
Mobile Facets
SemaPlorer*
* Winners of the Billion Triple Challenge (SchemEX in 2011, SemaPlorer in 2008)
See: dws.informatik.uni-mannheim.de
Slide 50
Recommended Readings
• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x
• Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp: SemaPlorer - Interactive semantic exploration of data and media based on a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009). URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001