Querying Distributed RDF Data Sources with SPARQL
Presented by Bastian Quilitz and Ulf Leser, Humboldt-Universität zu Berlin, ESWC 2008
2009-07-23
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
Copyright 2009 by CEBT
Introduction
SPARQL engines have to deal with thousands of RDF data sets
on a local machine
across multiple, distributed machines
Integrated access to multiple RDF data sources is a key challenge for many semantic web applications
Current SPARQL implementations load all RDF graphs onto the local machine before querying them
This usually incurs a large overhead in network traffic
Center for E-Business Technology 2009 IDS & IDB Lab. Seminar – 2/19
Introduction
DARQ, an engine for federated SPARQL queries
Provides transparent query access to multiple SPARQL services
DARQ stands for Distributed ARQ; it is an extension to ARQ, the SPARQL query engine of Jena
Available under the GPL license at http://darq.sf.net/
[Figure: DARQ overview diagram. Parts labelled "In this presentation": Data Source, Building Sub-queries, Metadata for each DS. The remaining parts are labelled "Do not care".]
Preliminaries
A SPARQL query Q is defined as Q = (E, DS, R)
E : the algebra expression of the SPARQL query
DS : an RDF data source
R : the query type (SELECT, CONSTRUCT, DESCRIBE, ASK)
The algebra expression E consists of
Graph Patterns
– Triple Pattern : (s, p, o)
– Basic Graph Pattern (BGP) : a set of triple patterns
– Filtered BGP : a BGP with constraints
Solution Modifiers
– Such as PROJECTION, DISTINCT, LIMIT or ORDER BY
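
The definition Q = (E, DS, R) can be written down as a small data model. The following Python sketch is only an illustration: all class and field names are hypothetical and are not taken from DARQ or ARQ, and the endpoint URL is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class QueryType(Enum):
    """R: the query type."""
    SELECT = "SELECT"
    CONSTRUCT = "CONSTRUCT"
    DESCRIBE = "DESCRIBE"
    ASK = "ASK"


@dataclass(frozen=True)
class TriplePattern:
    """(s, p, o); a term starting with '?' is a variable."""
    s: str
    p: str
    o: str

    def variables(self) -> Set[str]:
        return {t for t in (self.s, self.p, self.o) if t.startswith("?")}


@dataclass
class FilteredBGP:
    """A basic graph pattern (set of triple patterns) plus constraints."""
    patterns: List[TriplePattern]
    filters: List[str] = field(default_factory=list)


@dataclass
class SolutionModifiers:
    projection: List[str] = field(default_factory=list)
    distinct: bool = False
    order_by: Optional[str] = None
    limit: Optional[int] = None


@dataclass
class AlgebraExpression:
    """E: graph patterns plus solution modifiers."""
    pattern: FilteredBGP
    modifiers: SolutionModifiers


@dataclass
class Query:
    """Q = (E, DS, R)."""
    expression: AlgebraExpression
    data_source: str  # DS, here just an endpoint URL (placeholder)
    query_type: QueryType


example = Query(
    expression=AlgebraExpression(
        pattern=FilteredBGP(
            patterns=[
                TriplePattern("?x", "foaf:name", "?name"),
                TriplePattern("?x", "foaf:mbox", "?mbox"),
            ],
            filters=['regex(?name, "^Tim")', 'regex(?mbox, "w3c")'],
        ),
        modifiers=SolutionModifiers(
            projection=["?name", "?mbox"], order_by="?name", limit=5
        ),
    ),
    data_source="http://example.org/sparql",  # hypothetical endpoint
    query_type=QueryType.SELECT,
)
```

The `example` object mirrors the FOAF query used throughout this summary.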
An Example SPARQL Query
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox WHERE {
?x foaf:name ?name .
?x foaf:mbox ?mbox .
FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
} ORDER BY ?name LIMIT 5
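Annotations on the query: SELECT is the query type and ?name ?mbox is the projection; each "?x foaf:..." line is a triple pattern (TP); the two triple patterns together form a BGP, which together with the FILTER forms a filtered BGP (FBGP); ORDER BY and LIMIT are solution modifiers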
Query Processing
A query is processed in four stages:
Parsing : converts the query string into a tree model of the SPARQL query. The DARQ query engine reuses the parser shipped with ARQ
Query Planning : the query engine decomposes the original query and, according to the information in the service descriptions, builds multiple sub-queries, each of which can be answered by one known data source
Query Optimization : the query optimizer takes the sub-queries and rewrites them into a more efficient query execution plan
Query Execution : the query execution plan is executed. The sub-queries are sent to the data sources and the results are integrated
Service Descriptions
Information about each data source is needed
To find the relevant data sources for the different triple patterns
To decompose the query into sub-queries
Service descriptions
Describe the data available from a data source
Allow limitations on access patterns to be defined
Include statistical information used for query optimization
Are represented in RDF
Service Descriptions
Data Description
A service description defines capabilities, which state what data a data source can provide
Ex) sd:capability [ sd:predicate rdf:type ];
The definition of capabilities is based on predicates
– DARQ currently supports only queries with bound predicates
Limitation on Access Pattern
DARQ supports limitations on access patterns, i.e. a data source may require the subject or object of certain predicates to be bound
Ex) sd:requiredBindings [ sd:subjectBinding foaf:name ];
Ex) sd:requiredBindings [ sd:objectBinding foaf:name ];
Service Descriptions
Statistical Information
Helps the query optimizer to find a cost-effective query plan
Includes
– Ns : the total number of triples in the data source
– Optional information for each predicate
nD(p) : the number of triples with predicate p in the data source D
sselD(p) : the selectivity of a triple pattern with predicate p when the subject is bound (default = 1 / nD(p))
oselD(p) : the selectivity of a triple pattern with predicate p when the object is bound (default = 1)
Only simple statistics are used => every data source can provide them
– More precise statistics would be preferable but will often not be available
Service Descriptions
The data source defined in the example can answer queries for foaf:name, foaf:mbox and foaf:weblog.
Objects for a triple with predicate foaf:name must always start with a letter from A to R
In total it stores 112 triples
The data source has limitations on access patterns, i.e. a query must contain a triple pattern with predicate foaf:name or foaf:mbox whose object is bound
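
The access-pattern rule for this example source can be checked mechanically. The sketch below is a minimal Python illustration under stated assumptions: the function names and the tuple-based representation of triple patterns and required bindings are hypothetical and do not reflect DARQ's internal API.

```python
def is_bound(term):
    """A term is bound when it is not a variable (variables start with '?')."""
    return not term.startswith("?")


def satisfies_required_bindings(patterns, required_bindings):
    """Return True if the patterns satisfy at least one required binding.

    patterns: list of (subject, predicate, object) tuples.
    required_bindings: list of (position, predicate) pairs, where position
    is "subject" or "object". An empty list means no access limitation.
    """
    if not required_bindings:
        return True
    for position, predicate in required_bindings:
        for s, p, o in patterns:
            term = s if position == "subject" else o
            if p == predicate and is_bound(term):
                return True
    return False


# The example source: foaf:name or foaf:mbox must appear with a bound object.
required = [("object", "foaf:name"), ("object", "foaf:mbox")]

accepted = [("?x", "foaf:name", '"Tim"'), ("?x", "foaf:weblog", "?blog")]
rejected = [("?x", "foaf:name", "?name"), ("?x", "foaf:weblog", "?blog")]

accepted_ok = satisfies_required_bindings(accepted, required)
rejected_ok = satisfies_required_bindings(rejected, required)
```

Here `accepted` passes because foaf:name has a literal object, while `rejected` fails because every object is a variable.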
Query Planning
Query planning is based on the information provided by the service descriptions
It consists of two stages
Source Selection: determines which data sources are relevant to a given query
– The algorithm simply matches the triple patterns of the query against the capabilities of the data sources
Ex) the capability sd:capability [ sd:predicate rdf:type ]; matches the query
SELECT ?x WHERE { ?x rdf:type foaf:Person }
– As a result, every triple pattern in a BGP has a set of corresponding data sources
– The results of source selection are used to build sub-queries, each of which can be answered by a single data source
Building Sub-Queries
– Each relevant data source gets a sub-query
– Each sub-query consists of a filtered BGP
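
The two planning stages can be sketched in a few lines of Python. This is a simplified illustration, not DARQ's actual algorithm: the source names A, B and C and the dictionary layout are hypothetical, capabilities are reduced to sets of predicates, and access-pattern limitations and filters are ignored.

```python
from collections import defaultdict


def select_sources(patterns, capabilities):
    """Source selection: map each triple pattern to the sources
    whose capabilities list the pattern's predicate."""
    selection = {}
    for pattern in patterns:
        _, predicate, _ = pattern
        if predicate.startswith("?"):
            # Capabilities are defined per predicate, so it must be bound.
            raise ValueError("unbound predicate not supported: %r" % (pattern,))
        selection[pattern] = [
            source
            for source, predicates in capabilities.items()
            if predicate in predicates
        ]
    return selection


def build_sub_queries(selection):
    """Group the triple patterns by data source: one sub-query per source."""
    sub_queries = defaultdict(list)
    for pattern, sources in selection.items():
        for source in sources:
            sub_queries[source].append(pattern)
    return dict(sub_queries)


# Hypothetical sources and their capabilities (sets of predicates).
capabilities = {
    "A": {"foaf:name", "foaf:mbox"},
    "B": {"foaf:mbox"},
    "C": {"foaf:name"},
}
name_tp = ("?x", "foaf:name", "?name")
mbox_tp = ("?x", "foaf:mbox", "?mbox")

selection = select_sources([name_tp, mbox_tp], capabilities)
sub_queries = build_sub_queries(selection)
```

With these inputs, source A receives both triple patterns in one sub-query, while B and C each receive only the pattern they can answer.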
Query Planning
[Figure: sub-query building example]
Sample triples in the data sources: (Person, name, "TBL"), (Person, mbox, "[email protected]"), (Person, name, "ABC"), (Person, mbox, "[email protected]")
Data source capabilities shown: { foaf:name, foaf:mbox }; { foaf:mbox }; { foaf:name }
Query sent to DARQ: SELECT ?name ?mbox WHERE { ?x foaf:name ?name . ?x foaf:mbox ?mbox . FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c")) } ORDER BY ?name LIMIT 5
Resulting sub-queries: (?x foaf:name ?name); (?x foaf:mbox ?mbox); (?x foaf:name ?name)(?x foaf:mbox ?mbox)
Query Optimization - Logical
Rule-based Query Rewriting
Based on [Perez, J. et al., ISWC 2006]
Reduces the number of BGPs and variables
Moves value constraints (FILTER expressions) into the sub-queries
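
Moving value constraints into sub-queries can be illustrated as follows. This Python sketch rests on simplifying assumptions: the FILTER is already split into its conjuncts, a conjunct is pushed into every sub-query that binds all of its variables, and variables are found with a naive regular expression. The function names and source names are hypothetical.

```python
import re


def variables_in(text):
    """Naively collect SPARQL variables such as ?name from a string."""
    return set(re.findall(r"\?\w+", text))


def push_filters(sub_queries, conjuncts):
    """Attach each filter conjunct to every sub-query that binds all of
    its variables. Conjuncts that fit no sub-query are returned separately
    and would have to be evaluated after the results are integrated."""
    pushed = {source: [] for source in sub_queries}
    remaining = []
    for expr in conjuncts:
        needed = variables_in(expr)
        placed = False
        for source, patterns in sub_queries.items():
            bound = {t for tp in patterns for t in tp if t.startswith("?")}
            if needed <= bound:
                pushed[source].append(expr)
                placed = True
        if not placed:
            remaining.append(expr)
    return pushed, remaining


# Sub-queries per hypothetical source, as lists of triple patterns.
name_pattern = ("?x", "foaf:name", "?name")
mbox_pattern = ("?x", "foaf:mbox", "?mbox")
queries_by_source = {
    "A": [name_pattern, mbox_pattern],
    "B": [mbox_pattern],
    "C": [name_pattern],
}
name_filter = 'regex(?name, "^Tim")'
mbox_filter = 'regex(?mbox, "w3c")'

pushed, remaining = push_filters(queries_by_source, [name_filter, mbox_filter])
```

Evaluating the constraints at the data sources reduces the number of results that have to be transferred and joined.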
Query Optimization - Physical
Physical optimization is cost-based: it relies on estimating the size of intermediate results
The result size estimation is based on the statistics provided in the service descriptions
Estimates are defined for single triple patterns, multiple triple patterns (BGP), and joins
An example of a single triple pattern
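
The slide's formula is not preserved in this transcript. The Python sketch below shows one plausible estimate for a single triple pattern, built only from the statistics nD(p), sselD(p) and oselD(p) and their defaults as defined in the service descriptions. The case analysis, the class and function names, and the value n = 64 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredicateStats:
    n: int                         # nD(p): triples with predicate p in source D
    ssel: Optional[float] = None   # sselD(p), optional
    osel: Optional[float] = None   # oselD(p), optional

    def subject_selectivity(self):
        if self.ssel is not None:
            return self.ssel
        # Default from the service description: 1 / nD(p)
        return 1.0 / self.n if self.n else 0.0

    def object_selectivity(self):
        # Default from the service description: 1
        return self.osel if self.osel is not None else 1.0


def estimate_result_size(pattern, stats, both_bound_estimate=1.0):
    """Estimate the number of results of one triple pattern (s, p, o).

    Assumption: a pattern with subject and object both bound matches at
    most one triple, so a constant upper bound is used for that case.
    """
    s, p, o = pattern
    ps = stats[p]
    s_bound = not s.startswith("?")
    o_bound = not o.startswith("?")
    if s_bound and o_bound:
        return both_bound_estimate
    if s_bound:
        return ps.n * ps.subject_selectivity()
    if o_bound:
        return ps.n * ps.object_selectivity()
    return float(ps.n)


# Hypothetical statistics for one data source.
stats = {
    "foaf:name": PredicateStats(n=64),
    "foaf:mbox": PredicateStats(n=64, osel=0.25),
}

unbound = estimate_result_size(("?x", "foaf:name", "?name"), stats)
subject_bound = estimate_result_size(("ex:tim", "foaf:name", "?name"), stats)
object_bound_default = estimate_result_size(("?x", "foaf:name", '"Tim"'), stats)
object_bound = estimate_result_size(("?x", "foaf:mbox", "ex:mail"), stats)
```

Note that with the default oselD(p) = 1 a bound object does not reduce the estimate at all, which is why more precise statistics are preferable when a source can provide them.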
Evaluation
Dataset : a subset of DBpedia, 31.5 million triples in total
Contains RDF data extracted from Wikipedia
http://dbpedia.org
Evaluation
2 physical machines, 5 logical SPARQL endpoints
Evaluation
Query optimization brought significant performance improvements
My opinion
The experiments do not account for data loading time
DARQ needs to be compared with other systems
– http://esw.w3.org/topic/LargeTripleStores
Conclusion
DARQ offers a single interface for querying multiple, distributed SPARQL end-points
Using SPARQL Standard => Flexible
Using Service Descriptions
– Data sources can be added and/or removed dynamically
– A query can be federated and optimized with statistical information
Limitation
Predicates must be bound (a triple pattern such as s ?p o is not allowed)
CONSTRUCT, DESCRIBE, ASK are not supported
GRAPH, UNION, OPTIONAL are not supported
Paper Evaluation
Pros
Good idea
– Distributed SPARQL processing is a relatively new research field
Defining service descriptions
Dealing with all aspects of a query engine
Implementation
My Comments
Too simple, and still slow
Many limitations