arxiv.org · arxiv:1507.04614v2 [cs.db] 19 jul 2015 ldql: a query language for the web of linked...

39
arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version) Olaf Hartig 1 and Jorge P´ erez 2 1 http://olafhartig.de/ 2 Department of Computer Science, Universidad de Chile [email protected] Abstract The Web of Linked Data is composed of tons of RDF documents in- terlinked to each other forming a huge repository of distributed semantic data. Effectively querying this distributed data source is an important open problem in the Semantic Web area. In this paper, we propose LDQL, a declarative lan- guage to query Linked Data on the Web. One of the novelties of LDQL is that it expresses separately (i) patterns that describe the expected query result, and (ii) Web navigation paths that select the data sources to be used for computing the result. We present a formal syntax and semantics, prove equivalence rules, and study the expressiveness of the language. In particular, we show that LDQL is strictly more expressive than the query formalisms that have been proposed previously for Linked Data on the Web. The high expressiveness allows LDQL to define queries for which a complete execution is not computationally feasible over the Web. We formally study this issue and provide a syntactic sufficient con- dition to avoid this problem; queries satisfying this condition are ensured to have a procedure to be effectively evaluated over the Web of Linked Data. 1 Introduction In recent years an increasing amount of structured data has been published and inter- linked on the World Wide Web (WWW) in adherence to the Linked Data principles [3]. These principles are based on standard Web technologies. In particular, (i) the Hypertext Transfer Protocol (HTTP) is used to access data, (ii) HTTP-based Uniform Resource Identifiers (URIs) are used as identifiers for entities described in the data, and (iii) the Resource Description Framework (RDF) is used as data model. Then, any HTTP URI in an RDF triple presents a data link that enables software clients to retrieve more data by looking up the URI with an HTTP request. The adoption of these principles has lead to the creation of a globally distributed dataspace: the Web of Linked Data. The emergence of the Web of Linked Data makes possible an online execution of declarative queries over up-to-date data from a virtually unbounded set of data sources, each of which is readily accessible without any need for implementing source-specific APIs or wrappers. This possibility has spawned research interest in approaches to query Linked Data on the WWW as if it was a single (distributed) database. For an overview on query execution techniques proposed in this context refer to [12]. The main contribution of this paper is the proposal of LDQL, a novel query lan- guage for the Web of Linked Data. The most important feature of LDQL is that it clearly separates query components for selecting query-relevant regions of the Web of Linked Data, from components for specifying the query result that has to be constructed from the data in the selected regions. The most basic construction in LDQL are tuples of the form L, Qwhere L is an expression used to select a set of relevant documents, and Q is a query intended to be executed over the data in these documents as if they This document is an extended version of a paper published in ISWC 2015 [13].

Upload: others

Post on 06-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

arX

iv:1

507.

0461

4v2

[cs.

DB

] 19

Jul

201

5

LDQL: A Query Language for the Web of Linked Data(Extended Version)⋆

Olaf Hartig1 and Jorge Perez2

1 http://olafhartig.de/

2 Department of Computer Science, Universidad de [email protected]

Abstract The Web of Linked Data is composed of tons of RDF documents in-terlinked to each other forming a huge repository of distributed semantic data.Effectively querying this distributed data source is an important open problemin the Semantic Web area. In this paper, we propose LDQL, a declarative lan-guage to query Linked Data on the Web. One of the novelties of LDQL is thatit expresses separately (i) patterns that describe the expected query result, and(ii) Web navigation paths that select the data sources to be used for computingthe result. We present a formal syntax and semantics, prove equivalence rules,and study the expressiveness of the language. In particular, we show that LDQLis strictly more expressive than the query formalisms that have been proposedpreviously for Linked Data on the Web. The high expressiveness allows LDQLto define queries for which a complete execution is not computationally feasibleover the Web. We formally study this issue and provide a syntactic sufficient con-dition to avoid this problem; queries satisfying this condition are ensured to havea procedure to be effectively evaluated over the Web of Linked Data.

1 IntroductionIn recent years an increasing amount of structured data has been published and inter-linked on the World Wide Web (WWW) in adherence to the Linked Data principles [3].These principles are based on standard Web technologies. Inparticular, (i) the HypertextTransfer Protocol (HTTP) is used to access data, (ii) HTTP-based Uniform ResourceIdentifiers (URIs) are used as identifiers for entities described in the data, and (iii) theResource Description Framework (RDF) is used as data model.Then, any HTTP URIin an RDF triple presents adata link that enables software clients to retrieve more databy looking up the URI with an HTTP request. The adoption of these principles has leadto the creation of a globally distributed dataspace: theWeb of Linked Data.

The emergence of the Web of Linked Data makes possible anonline executionofdeclarative queries over up-to-date data from a virtually unbounded set of data sources,each of which is readily accessible without any need for implementing source-specificAPIs or wrappers. This possibility has spawned research interest in approaches to queryLinked Data on the WWW as if it was a single (distributed) database. For an overviewon query execution techniques proposed in this context refer to [12].

The main contribution of this paper is the proposal of LDQL, anovel query lan-guage for the Web of Linked Data. The most important feature of LDQL is that itclearly separates query components for selecting query-relevant regions of the Web ofLinked Data, from components for specifying the query result that has to be constructedfrom the data in the selected regions. The most basic construction in LDQL are tuplesof the form〈L,Q〉 whereL is an expression used to select a set of relevant documents,andQ is a query intended to be executed over the data in these documents as if they

⋆ This document is an extended version of a paper published in ISWC 2015 [13].

Page 2: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

were a single RDF repository. In an abstract setting one can use several formalisms toexpressL andQ. In our proposal, for the former part we introduce the notionof linkpath expressionsthat are a form of nested regular expressions (with some other impor-tant features) used to navigate the link graph of the Web. Forthe latter, we use standardSPARQL graph patterns. To begin evaluating these queries one needs to specify a set ofseed URIs. The language also possesses features to dynamically (at query time) identifynew seed URIs to evaluate portions of a query. Additionally,such queries can be com-bined by using conjunctions, disjunctions, and projection. We present a formal syntaxand semantics for LDQL, propose some rewrite rules, and study its expressive power.

While there does not exist a standard language for expressing queries over LinkedData on the WWW, a few options have been proposed. In particular, a first strand of re-search focuses on extending the scope of SPARQL such that an evaluation of SPARQLqueries over Linked Data has a well-defined semantics [9,11,14,18]. A second strandof research focuses on navigational languages [7,14]. Although these languages havedifferent motivations, a commonality of all these proposals is that, in contrast to LDQL,the definition of query-relevant regions of the Web of LinkedData and the definition ofquery-relevant data within the specified regions are mixed.

As our second main contribution we compare LDQL with three previously pro-posed formalisms for querying the Web of Linked Data:SPARQL under reachability-based query semantics[11], NautiLOD[7], andSPARQL Property Path patterns undercontext-based semantics[14]. We formally prove that LDQL is strictly more expressivethan every one of these. We show that for every queryQ in the previous languages, onecan effectively construct an LDQL query which is equivalenttoQ. Moreover, for everyone of the previous languages, there exists an LDQL query that cannot be expressed inthat language. These results show that LDQL presents an interesting expressive power.

The downside of the expressiveness provided by LDQL is the existence of queriesfor which a complete execution is not feasible in practice. To capture this issue formally,we define a notion ofWeb-safenessfor LDQL queries. Then, the obvious question thatarises is how to identify LDQL queries that are Web-safe. Ourlast technical contributionis the identification of a sufficient syntactic condition forWeb-safeness.

The rest of the paper is structured as follows. Section2 introduces a data modelthat provides the basis for defining the semantics of LDQL. InSection3 we formallydefine the syntax and semantics of LDQL and show some simple algebraic properties.In Section4 we compare LDQL with the three mentioned languages, and in Section 5we focus on Web-safeness. Section6 concludes the paper and sketches future work.Proofs of the formal results in this paper can be found in the Appendix.

A preliminary version of some of the results in this paper have been presented in aworkshop [10]. This paper is a substantial extension of [10] refining the definition ofLDQL and introducing important changes to the syntax and thesemantics of the lan-guage. Moreover, the comparison with previous proposals was not discussed in [10].

2 Data Model

In this section we introduce a structural data model that captures the concept of a Webof Linked Data formally. As usual [7,9,11,14,18], for the definitions and analysis in thispaper, we assume that the Web is fixed during the execution of any single query.

Page 3: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

We use the RDF data model [5] as a basis for our model of a Web of Linked Data.That is, we assume three pairwise disjoint, infinite setsU (URIs),B (blank nodes), andL (literals). AnRDF triple is a tuple〈s, p, o〉 ∈ T with T = (U ∪B)×U×(U ∪B∪L).For any RDF triplet = 〈s, p, o〉 we writeuris(t) to denote the set of all URIs int.

Additionally, we assume another infinite setD that is disjoint fromU , B, andL,respectively. We refer to elements in this set asdocumentsand use them to represent theconcept of Web documents from which Linked Data can be extracted. Hence, we as-sume a function, saydata, that maps each documentd ∈ D to a finite set of RDF triplesdata(d) ⊆ T such that the data of each document uses a unique set of blank nodes.

Given these preliminaries, we are ready to define aWeb of Linked Data.

Definition 1. A Web of Linked Data is a tupleW = 〈D, adoc〉 that consists of a setof documentsD ⊆ D and a partial functionadoc : U → D that is surjective.

Functionadoc of a Web of Linked DataW = 〈D, adoc〉 captures the relationshipbetween the URIs that can be looked up in this Web and the documents that can beretrieved by such lookups. Since not every URI can be looked up, the function is par-tial. For any URIu ∈ U with u ∈ dom(adoc) (i.e., any URI that can be looked upin W ), documentd = adoc(u) can be considered the authoritative source of data foruin W (hence, the nameadoc). To accommodate for documents that are authoritativefor multiple URIs, we do not require injectivity for functionadoc. However, we requiresurjectivity because we conceive documents as irrelevant for a Web of Linked Data ifthey cannot be retrieved by any URI lookup in this Web.

LetW = 〈D, adoc〉 be a Web of Linked Data.W is said to be finite [11] if its setDof documents is finite. In this paper we assume that every Web of Linked Data is finite.Given documentsd, d′ ∈ D and a triplet ∈ data(d), we say that a URIu ∈ uris(t)establishes adata link from d to d′, if adoc(u) = d′. As a final concept, we formalizethe notion of alink graphassociated toW. This graph has documents inD as nodes,and directed edges representing data links between documents. Each edge is associatedwith a label that identifies both the particular RDF triple and the URI in this triple thatestablishes the corresponding data link. These labels shall provide the basis for definingthe navigational component of our query language.

Definition 2. The link graph of a Web of Linked DataW = 〈D, adoc〉, is a directed,edge-labeled multigraph,GW = 〈D,EW 〉, with set of edgesEW ⊆ D×(T×U)×D de-fined asEW =

〈dsrc, (t, u), dtgt〉 | t ∈ data(dsrc), u ∈ uris(t) anddtgt = adoc(u)

.

For a link graph edgee = 〈dsrc, (t, u), dtgt〉, tuple(t, u) is the label ofe. Moreover,we sometimes writee ∈ GW to denote thate is an edge in the link graphGW .

Example 1.As a running example for this paper assume a simple Web of Linked DataWex = 〈Dex, adocex〉 with three documents,dA, dB, anddC (i.e.,Dex = dA, dB, dC).The data in these documents are the following sets of RDF triples:

data(dA) = 〈uA, p1, uB〉, data(dB) = 〈uB, p1, uC〉;

〈uB, p2, uC〉; data(dC) = 〈uA, p2, uC〉;

Page 4: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Figure 1. The link graphGWex of our example Web of Linked DataWex.

and for functionadocex we have:adocex(uA)=dA, adocex(uB)=dB, adocex(uC)=dC,andadocex(p1) = dA (i.e., dom(adocex) = uA, uB, uC, p1). This Web contains 10data links. For instance, URIuA in the RDF triple〈uA, p2, uC〉 ∈ data(dC) establishesa data link to documentdA. Hence, the corresponding edge in the link graph ofWex is⟨

dC, (〈uA, p2, uC〉, uA), dA⟩

. Figure1 illustrates the link graphGWexwith all 10 edges.

3 Definition of LDQL

This section defines our Linked Data query language, LDQL. LDQL queries are meantto be evaluated over a Web of Linked Data and each such query isbuilt from two typesof components:Link path expressions(LPEs) for selecting query-relevant documents ofthe queried Web of Linked Data; and SPARQL graph patterns forspecifying the queryresult that has to be constructed from the data in the selected documents. For this paper,we assume that the reader is familiar with the definition of SPARQL [8], including thealgebraic formalization introduced in [16,2]. In particular, for SPARQL graph patternswe closely follow the formalization in [2] considering operatorsAND, OPT, UNION, FILTER,andGRAPH, plus the operatorBIND defined in [8]. We begin this section by introducingthe most basic concept of our language, the notion of link patterns. We use link patternsas the basis for navigating the link graph of a Web of Linked Data.

3.1 Link Patterns

A link pattern is a tuple in(

U ∪ ,+)

×(

U ∪ ,+)

×(

U ∪L∪ ,+)

. Link pat-terns are used to match link graph edges in the context of a designatedcontextURI. Thespecial symbol+ denotes a placeholder for the context URI. The special symbol de-notes a wildcard that will drive the direction of the navigation. Before formalizing howlink graph edges actually match link patterns, we show some intuition. Consider thelink graph of WebWex in Example1 (see Fig.1), and the link pattern〈+, p1, 〉. Intu-itively, in the context of URIuA, the edge with label(〈uA, p1, uB〉, uB) from documentdA to documentdB, matches the link pattern〈+, p1, 〉. Notice that in the matching,the context URIuA takes the place of symbol+, anduB takes the place of the wildcardsymbol . Notice thatuB also denotes the direction of the edge that matches the linkpattern. On the other hand, the edge with label(〈uA, p1, uB〉, uA) from dA to dA, doesnot match〈+, p1, 〉; althoughuB can take the place of the wildcard symbol, thedirection of the edge is not touB. That is, when matching an edge labeled by(t, u) werequire URIu to be taking the place of a wildcard in the link pattern. When more thanone wildcard symbol is used, the link pattern can be matched by edges pointing to the

Page 5: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

direction of any of the URIs taking the place of a wildcard. For instance, in the contextof uA, the link pattern〈 , p2, 〉 is matched by edges〈dA, (〈uB, p2, uC〉, uB), dB〉 and〈dA, (〈uB, p2, uC〉, uC), dC〉. The next definition formalizes this notion of matching.

Definition 3. A link graph edge with label(〈x1, x2, x3〉, u) matchesa link pattern〈y1, y2, y3〉 in the context of a URIuctx if the following two properties hold:

1. there existsi ∈ 1, 2, 3 such thatyi = andxi = u, and2. for everyi ∈ 1, 2, 3 eitheryi = + andxi = uctx, or yi = xi, or yi = .

One of the rationales for adopting the notion of a context URIand the+ symbolin our definition of link patterns, is to support cases in which link graph navigationhas to be focused solely on data links that areauthoritative. A data link representedby link graph edge〈dsrc, (t, u), dtgt〉 ∈ GW is authoritative in a Web of Linked DataW = 〈D, adoc〉 if dsrc = adoc(u′) for some URIu′ ∈ uris(t). Thus, if we fix a contextURI uctx, a link pattern that uses the+ symbol allows us to follow only authoritativedata links from documentdctx = adoc(uctx).

3.2 LDQL Queries

The most basic construction in LDQL queries are tuples of thefrom 〈L, P 〉 whereLis an expression used to select a set of documents from the Webof Linked Data, andP is a SPARQL graph pattern to query these documents as if they were a single RDFdataset. In an abstract setting, one can use any formalism tospecifyL as long asLdefines sets of RDF documents. In our proposal we use what we call link path expres-sions(LPEs) that are a form of nested regular expressions [17] over the alphabet of linkpatterns. Every link path expression begins its navigationin a context URI, traversesthe Web, and returns a set of URIs; these URIs are used to construct an RDF datasetwith all the documents to be retrieved by looking up the URIs.This dataset is passedto the SPARQL graph pattern to obtain the final evaluation of the whole query. Besidesthe basic constructions of the form〈L, P 〉, in LDQL one can also useAND, UNION andprojection, to combine them. We also introduce an operatorSEED that is used to dynam-ically change, at query time, the seed URI from which the navigation begins. The nextdefinition formalizes the syntax of LDQL queries and LPEs.

Definition 4. The syntax of LDQL is given by the following production rulesin whichlp is an arbitrary link pattern,?v is a variable,P is a SPARQL graph pattern (as per [2]),V is a finite set of variables, andU is a finite set of URIs:

q := 〈lpe , P 〉 | (SEED U q) | (SEED ?v q) | (q AND q) | (q UNION q) | πV q

lpe := ε | lp | lpe/lpe | lpe|lpe | lpe∗ | [lpe] | 〈?v, q〉

Any expression that satisfies the productionq is anLDQL query , any expression thatsatisfies the productionlpe is a link path expression (LPE), and any LDQL query ofthe form〈lpe , P 〉 is abasic LDQL query.

Page 6: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Before going into the formal semantics of LDQL and LPEs, we give some moreintuition about how these expressions are evaluated in a Webof Linked DataW. Asmentioned before, the most basic expression in LDQL is of theform 〈lpe, P 〉. To evalu-ate this expression overW we will need a setS of seedURIs. When evaluating〈lpe, P 〉,every one of the seed URIs inS will trigger a navigation of link graphGW via the linkpath expressionlpe starting on that seed. That is, the seed URIs are passed tolpe ascontextURIs in which the LPE should be evaluated. These evaluationsof lpe will resultin a set of URIs that are used to construct a dataset over whichP is finally evaluated.

Regarding the navigation of link graphGW, the most basic form of navigation is tofollow a single link graph edge that matches a link patternlp. When a navigation viaa link patternlp is triggered from a context URIu, we proceed as follows. We firstgo to the authoritative document foru, that isadoc(u), and try to find outgoing linkgraph edges that matchlp in the context ofu (as explained in Section3.1). Every oneof these matches defines a new context URIu′ from which the navigation can continue.More complex forms of navigation are obtained by combining link patterns via clas-sical regular expression operators such as concatenation/, disjunction|, and recursiveconcatenation(·)∗. The nesting operator[·] is used to test for existence of paths. Whena context URIu is passed to an expression[lpe], it checks whetherGW contains a pathfrom dctx = adoc(u) that matcheslpe. If such a path exists, the navigation can con-tinue from the same context URIu. The most involved form of navigation is by usingthe expression〈?v, q〉 with q an LDQL query. To evaluate this expression from contextURI u one first has to passu as a seed URI forq and recursively evaluateq from thatseed. This evaluation generates a set of solution mappings,and for every one of thesemappings its value on variable?v is used as the new context URI from which the navi-gation continues. Finally, note that our notion of LPEs doesnot provide an operator fornavigating paths in their inverse direction. The reason foromitting such an operator isthat traversing arbitrary data links backwards is impossible on the WWW.

To formally define the semantics of LDQL we need to introduce some terminology.We first define a functiondatasetW (·) that from a set of URIs constructs an RDFdataset with all the documents pointed to by those URIs inW. Formally, given a Web ofLinked DataW = 〈D, adoc〉 and a setU of URIs,datasetW (U) is an RDF dataset (asper [8,2]) that has the set of triplest ∈ data(adoc(u)) | u ∈ U ∩ dom(adoc) asdefault graph. Moreover, for every URIu ∈ U ∩ dom(adoc), datasetW (U) containsthe named graph〈u, data(adoc(u))〉.

Example 2.Consider the WebWex in Example1 and the set of URIsU = uA, uC.ThendatasetWex

(U) has〈uA, p1, uB〉, 〈uB, p2, uC〉, 〈uA, p2, uC〉 as default graph, andtwo named graphs,〈uA, 〈uA, p1, uB〉, 〈uB, p2, uC〉〉 and〈uC, 〈uA, p2, uC〉〉.

In the formalization of the semantics of LDQL, we use the standard join operator⋊⋉over sets of solution mappings [8,16]. We also make use of the semantics of SPARQLgraph patterns over datasets as defined in [2]. In particular, given an RDF datasetD, anRDF graphG in D, and a SPARQL graph patternP , we denote by[[P ]]DG the evaluationof P overG in D [2, Definition 13.3].

We are now ready to formally define the semantics of LDQL and LPEs. Given a Webof Linked DataW and a setS of URIs, we formalize the evaluation of LDQL queries

Page 7: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

overW from the seed URIsS, as a function[[·]]SW that given an LDQL query, producesa set of solution mappings. Similarly, the evaluation of LPEs overW from a contextURI u, is formalized as a function[[·]]uW that given an LPE, produces a set of URIs.

Definition 5. Given a finite setS⊆U , theS-based evaluationof LDQL queries over aWeb of Linked DataW =〈D, adoc〉, denoted by[[·]]SW , is defined recursively as follows:

[[〈lpe, P 〉]]SW = [[P ]]DG whereD = datasetW(⋃

u∈S [[lpe]]uW

)

with default graphG,

[[(SEED U q)]]SW = [[q]]UW ,

[[(SEED ?v q)]]SW =⋃

u∈U

(

[[q]]uW ⋊⋉ µu

)

whereµu = ?v 7→ u for all u ∈ U ,

[[(q1 UNION q2)]]SW = [[q1]]

SW ∪ [[q2]]

SW ,

[[(q1AND q2)]]SW = [[q1]]

SW ⋊⋉ [[q2]]

SW ,

[[πV q ]]SW = µ | there existsµ′∈ [[q]]SW such thatµ andµ′ are

compatible anddom(µ) = dom(µ′) ∩ V .

Now for the semantics of LPEs, given a context URIuctx ∈ dom(adoc), theuctx-basedevaluationof LPEs overW, denoted by[[·]]uctx

W , is defined recursively as follows:

[[ ε ]]uctx

W = uctx,

[[lp]]uctx

W = u ∈ U | there exist a link graph edge〈dsrc, (t, u), dtgt〉 ∈ GW , withdsrc = adoc(uctx), that matcheslp in the context ofuctx,

[[lpe1/lpe2]]uctx

W = u ∈ [[lpe2]]u′

W | u′ ∈ [[lpe1]]uctx

W ,

[[lpe1|lpe2]]uctx

W = [[lpe1]]uctx

W ∪ [[lpe2]]uctx

W ,

[[lpe∗]]uctx

W = uctx ∪ [[lpe]]uctx

W ∪ [[lpe/lpe]]uctx

W ∪ [[lpe/lpe/lpe]]uctx

W ∪ ... ,

[[ [lpe] ]]uctx

W = uctx | [[lpe]]uctx

W 6= ∅,

[[ 〈?v, q〉 ]]uctx

W = u ∈ U | there existsµ ∈ [[q]]uctxW such thatµ(?v) = u.

Moreover, ifuctx /∈ dom(adoc), then[[lpe]]uctx

W = ∅ for every LPE.

Example 3.Let lpeex be the LPE〈 , p1, 〉∗/[〈 , p2, 〉]. This LPE selects docu-ments that can be reached via arbitrarily long paths of data links with predicatep1and, additionally, have some outgoing data link with predicatep2. For our exampleWeb Wex and context URIuA, the LPE selects documentsdA = adocex(uA) anddC = adocex(uC). More precisely, we have[[lpeex]]

uA

Wex= uA, uC. Note that docu-

mentdB can also be reached via ap1–path, but it does not pass thep2–related test.

Example 4.Consider a set of URIsSex = uA and a basic LDQL query〈lpeex, Bex〉whose LPE islpeex as introduced in Example3 and whose SPARQL graph pattern is abasic graph pattern that contains two triple patterns,Bex = 〈?x, p1, ?y〉, 〈?x, p2, ?z〉.Given that we have[[lpeex]]

uA

Wex= uA, uC (cf. Example3), datasetWex

([[lpeex]]uA

Wex)

has the default graph〈uA, p1, uB〉, 〈uB, p2, uC〉, 〈uA, p2, uC〉 (cf. Example2). Then,according to the query semantics, the result of query〈lpeex, Bex〉 overWex using seedsSex consists of a single solution mapping, namelyµ = ?x 7→ uA, ?y 7→ uB, ?z 7→ uC.

Page 8: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Example 5.Consider an LDQL queryqex =(

SEED ?x⟨

ε, 〈?x, p1, ?w〉⟩)

whose sub-query is a basic LDQL query that has a single triple pattern asits SPARQL graphpattern. Additionally, letq′ex =

lpeex, 〈?x, p1, ?y〉, 〈?x, p2, ?z〉⟩

be the basic LDQLquery introduced in Example4, and letq′′ex be the conjunction of these two queries; i.e.,q′′ex = (qex AND q′ex). By Example4 we know that[[q′ex]]

Sex

Wex= µ with µ = ?x 7→ uA,

?y 7→ uB, ?z 7→ uC. Furthermore, based on the data given in Example1, it is easy tosee that[[qex]]

Sex

Wex= µ1, µ2 with µ1 = ?x 7→ uA, ?w 7→ uB andµ2 = ?x 7→ uB,

?w 7→ uC. For theSex-based evaluation ofq′′ex overWex, the result sets[[qex]]Sex

Wexand

[[q′ex]]Sex

Wexhave to be joined. Thus, we need to computeµ1, µ2 ⋊⋉ µ, which results in

a single mappingµ′ = µ1 ∪ µ = ?x 7→ uA, ?w 7→ uC, ?y 7→ uB, ?z 7→ uC.

3.3 Algebraic Properties of LDQL Queries

As a basis for the discussion in the next sections, we show some simple algebraic prop-erties. We say that LDQL queriesq andq′ are semantically equivalent, denoted byq≡q′,if [[q]]SW = [[q′]]SW holds for every Web of Linked DataW and every finite setS ⊆ U .

Lemma 1. The operatorsAND andUNION are associative and commutative.

Lemma 2. Letq1, q2, q3 be LDQL queries, the following semantic equivalences hold:

(q1 AND (q2 UNION q3)) ≡ ((q1 AND q2) UNION (q1 AND q3)) (1)

πV (q1 UNION q2) ≡ (πV q1 UNION πV q2) (2)

(SEED U (q1 UNION q2)) ≡ ((SEED U q1) UNION (SEED U q2)) (3)

(SEED ?v (q1 UNION q2)) ≡ ((SEED ?v q1) UNION (SEED ?v q2)) (4)

Lemma1 allows us to write sequences of eitherAND or UNION without parentheses.Our next result shows the power of the construction〈?v, q〉. In particular, it shows thesomehow surprising finding that link patternslp, concatenation/, disjunction|, and thetest[·], are justsyntactic sugaras they can be simulated by usingε, 〈?v, q〉 and(·)∗.

Proposition 1. For every LDQL queryq, there exists an LDQL queryq′ s.t.q ≡ q′ andevery LPE inq′ consists only of the symbolε, the construction〈?v, q〉, and operator(·)∗.

Proof (Sketch).The proof is based on a recursive translation of link path expressionsbeginning with link patterns. For instance, a link pattern of the form 〈+, p, 〉 is en-coded by〈?v, 〈ε, (GRAPH ?u (?u, p, ?v))〉〉, and we can similarly encode all types of linkpatterns. To encode/ we make use of〈?v, q〉 and the operatorAND insideq as follows.Consider an LPEr = r1/r2. It can be shown thatr is equivalent to〈?v, q〉 whereq is:

(

〈r1, (GRAPH ?x )〉 AND(

SEED ?x 〈r2, (GRAPH ?v )〉) )

.

Similarly, to encode| we make use ofUNION and to encode[·] we use projection.

Although not strictly necessary, we decided to keep link patterns and operators/, |,and[·] since they represent a natural and intuitive way of expressing navigation paths.

Page 9: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

4 Comparison with Previous Linked Data Query FormalismsIn this section, we compare LDQL with alternative formalisms to query Linked Data onthe WWW. There are some general query languages for the WWW (proposed beforethe advent of Linked Data) that are related to our proposal; in particular, WebSQL [15],which is similar in spirit to LDQL but different in the features that the languages posses.Two main novelties of LDQL compared with WebSQL are the possibility to dynami-cally select seed URIs at query time, and the traversal of links according to propertiesof the queried documents that can be defined in the same LDQL query. Neither of theseare expressible in WebSQL. While a complete formal comparison between LDQL andWebSQL is certainly very interesting, we leave it for futurework and, instead, focus onthree more recent proposals of query formalisms for the Web of Linked Data [7,11,14].We formally show that LDQL is strictly more expressive than every one of them.

4.1 Comparison with Property Paths under Context-Based Query SemanticsProperty paths (PPs for short) were introduced in SPARQL 1.1as a way of addingnavigational power to the language [8]. PPs are a form of regular expressions that areevaluated over a single (local) RDF graph; a PP expression isused to retrieve pairs〈a, b〉of nodes in the graph such that there is a path froma to b whose sequence of edge labelsbelongs (as a string) to the regular language defined by the expression. The syntax ofPP expressions is given by the following grammar3, wherep, u1, u2, ... , uk are URIs.

pe := p | !(u1|u2| · · · |uk) | pe/pe | pe|pe | pe∗

A PP-pattern is defined as a tuple of the form〈α, pe, β〉 wherepe is a PP expression,andα andβ are inU ∪ L ∪ V .

In [14] the authors adapted the semantics of PP-patterns so that they can be usedto query the Web of Linked Data. The proposed query semanticsis calledcontext-based semantics[14]. To define this semantics, the authors first introduce the notionof a context selectorfor a Web of Linked DataW. This context selector is a functionCW(·) that given a URIu ∈ dom(adoc) returns the RDF triples indata(adoc(u))that haveu in the subject position. Formally, for every URIu ∈ dom(adoc) we haveCW(u) = 〈s, p, o〉 ∈ data(adoc(u)) | s = u. To simplify the exposition, the authorsextended the definition ofCW(·) to also handle URIs not indom(adoc), and literalsand blank nodes. For any such RDF terma they defineCW(a) as the empty set.

The context-based semantics for PPs over the Web of Linked Data in [14] is a bagsemantics that follows closely the semantics for PPs definedin the normative semanticsof SPARQL 1.1 [8]. Hence, both semantics use a procedure, theArbitraryLengthPathprocedure [8], to define the semantics of the(·)∗ operator. It was shown in [1] that forsets semantics, the normative semantics of PPs can be definedby using standard tech-niques for regular expressions. To make the comparison withLDQL, in this paper weadapt the context-based semantics for PPs presented in [14] by following the techniquesin [1], and consider only sets of mappings. To this end, we define a function[[·]]ctxtW , thatgiven a PP-pattern, returns its evaluation under context-based semantics over the Webof Linked DataW. In the definition, for a solution mappingµ and an RDF termα, we

3 In [14] the reverse path construction ˆpe is also considered. We do not consider it here as theform of navigation of these reverse paths does not representa traversal of the link graph.

Page 10: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

use the notationµ[α] with the following meaning:µ[α] = µ(α) if α ∈ dom(µ), andµ[α] = α in the other case. Similarly,µ[〈s, p, o〉] = 〈µ[s], µ[p], µ[o]〉.

[[(α, p, β)]]ctxtW = µ | dom(µ) = α, β ∩ V andµ[〈α, p, β〉] ∈ CW (µ[α])

[[(α, !(u1| · · · |uk), β)]]ctxtW = µ | dom(µ) = α, β ∩ V and existsp s.t.

µ[〈α, p, β〉] ∈ CW (µ[α]) andp /∈ u1, ... , uk

[[(α, pe1/pe2, β)]]ctxtW = πα,β∩V

(

[[(α, pe1, ?v)]]ctxtW ⋊⋉ [[(?v, pe2, β)]]

ctxtW

)

[[(α, pe1|pe2, β)]]ctxtW = [[(α, pe1, β)]]

ctxtW ∪ [[(α, pe2, β)]]

ctxtW

[[(α, pe∗, β)]]ctxtW = µ | dom(µ) = α, β ∩ V, µ[α] = µ[β] andµ[α] ∈ terms(W )∪

[[(α, pe, β)]]ctxtW ∪ [[(α, pe/pe, β)]]ctxtW ∪ [[(α, pe/pe/pe, β)]]ctxtW ∪ · · ·

A PP-based SPARQL query[14] is an expression formed by combining PP-patternsusing the standard SPARQL operatorsAND, UNION, OPT, FILTER and so on, following thestandard semantics for these operators [2]. Our next results show that LDQL is strictlymore expressive than PP-based SPARQL queries under context-based semantics.

Theorem 1. There exists an LDQL query that cannot be expressed as a PP-basedSPARQL query under context-based semantics.

Proof (Sketch).One can show that LDQL queryq=(

SEED U⟨

〈+, p, 〉, (?x, ?x, ?x)⟩)

with U = u cannot be expressed by PPs under context-based semantics becausethis semantics is “blind” to triples that are not authoritative. For instance, in a WebW = 〈d, d′, adoc〉 with data(d) = 〈u, p, u′〉, data(d′) = 〈u′, p, u〉, 〈u, u, u〉,adoc(u) = d andadoc(u′) = d′, the evaluation ofq is the solution mapping?x 7→ u.Notice that the only authoritative triple ind′ is 〈u′, p, u〉 asd′ = adoc(u′) 6= adoc(u).Hence, one can prove that PP-based SPARQL queries under context-based semanticscannot access triple〈u, u, u〉 in d′, and thus, will never have?x 7→ u as solution.

Theorem 2. Letα, β ∈ U ∪ L ∪ V . Then, for every PP-pattern〈α, pe, β〉, there existsan LDQL queryq such that[[〈α, pe, β〉]]ctxtW = [[q]]∅W for every Web of Linked DataW.

Proof (Sketch).In the proof we provide a translation scheme from PPs to LDQL.Onemajor complication is that PPs can retrieve literals and, ingeneral, values that are not indom(adoc), which are difficult to handle by LPEs. For every PP-pattern〈?x, pe, ?y〉 weconstruct an LDQL queryQpe(?x, ?y). For example, for〈?x, pe1/pe2, ?y〉, our query isπ?x,?y

(

Qpe1(?x, ?z) ANDQpe

2(?z, ?y)

)

, and for〈?x, !(u1| · · · |uk), ?y〉 the translationis(

SEED ?x⟨

ε,(

(?x, ?p, ?y) FILTER (?p 6= u1 ∧ · · · ∧ ?p 6= uk))⟩)

. To handlepe∗ weneed to use the construction〈?v, q〉 of LPEs, plus(·)∗.

4.2 Comparison with NautiLOD

NautiLOD is a navigation language to traverse Linked Data onthe WWW and to per-form actions (such as sending emails) during the traversal [7]. We compare LDQL withNautiLOD without action rules. The syntax of NautiLOD expressions (without actions)is given by the following grammar (wherep ∈ U andP is a SPARQL graph pattern).

ne := p | pˆ | 〈 〉 | ne/ne | ne|ne | ne∗ | ne[(ASKP )]

Page 11: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

In terms of our data model4, the semantics of NautiLOD expressions over a Web ofLinked DataW =〈D, adoc〉 from URIu∈dom(adoc) is defined recursively as follows.

[[ p ]]uW = u′ | 〈u, p, u′〉 ∈ data(adoc(u))

[[ pˆ ]]uW = u′ | 〈u′, p, u〉 ∈ data(adoc(u))

[[ 〈 〉 ]]uW = u′ | 〈u, p, u′〉 ∈ data(adoc(u)) for somep ∈ U

[[ ne1/ne2 ]]uW = u′′ | u′′∈ [[ ne2 ]]

u′

W for someu′∈ [[ ne1 ]]uW with u′∈ dom(adoc)

[[ ne1| ne2 ]]uW = [[ ne1 ]]

uW ∪ [[ ne2 ]]

uW

[[ ne∗ ]]uW = u ∪ [[ ne]]uW ∪ [[ ne/ne]]uW ∪ [[ ne/ne/ne]]uW ∪ · · ·

[[ ne[(ASKP )] ]]uW = u′ | u′ ∈ [[ ne]]uW , u′ ∈ dom(adoc) and[[P ]]data(adoc(u′)) 6= ∅

We next show that for every NautiLOD expression there existsan equivalent LDQLquery. Notice that the evaluation of a NautiLOD expression is a set of URIs, whereas theevaluation of an LDQL query is a set of mappings. Thus, to formally state our result wecompare NautiLOD with LDQL queries that have a singlefree variable. Letq(?x) be anLDQL query with?x as free variable. We say thatq(?x) and a NautiLOD expressionneare equivalent if for every Web of Linked DataW = 〈D, adoc〉 and URIsu, u′ with

u ∈ dom(adoc) it holds thatu′∈ [[ne]]uW if and only if ?x 7→ u′ ∈ [[q(?x)]]uW .

Theorem 3. For every NautiLOD expression ne, there exists an LDQL queryq(?x),with ?x a free variable, that is equivalent to ne.

Proof (Sketch).The proof begins with a simple translation that replaces every p ∈ U ina NautiLOD expression by a link pattern〈+, p, 〉. For instance, the expressionp1/p∗2is translated into〈+, p1, 〉/〈+, p2, 〉∗. To translate〈 〉 and[(ASKP )] we use〈?v, q〉.The complete translation poses several other complications (as described in the ap-pendix). In particular, the last step of NautiLOD expressions must be translated byusing a SPARQL pattern and not an LPE. For this we use the following property. Givena regular expressionr that does not generate the empty word, one can always writer asr1/a1| · · · |rk/ak where theai’s are base symbols of the alphabet. Thus, we can trans-late r by using LPEs to translate theri’s as outlined above; next, translate theai’s byusing a method similar to the proof of Theorem2, and finally useUNION for |.

Along the same lines of Theorem1 one can prove the following result.

Theorem 4. There exists an LDQL queryq(?x) that cannot be expressed in NautiLOD.

4.3 Comparison with SPARQL under Reachability-Based QuerySemantics

In [11] the author introduces a family of reachability-based query semantics based onwhich SPARQL graph patterns can be used as a query language for Linked Data onthe WWW. Similar to how the scope of the SPARQL part of a basic LDQL query isrestricted to particular documents, reachability-based semantics restrict the scope of

4 In [7], all URIs have an assigned set of RDF triples (which may be empty). In our data modelone can have URIs not indom(adoc). Hence, to properly capture the semantics of NautiLODin terms of our data model we have to introduce conditions of the form “u′ ∈ dom(adoc).”

Page 12: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

SPARQL queries to documents that can be reached by traversing a well-defined set ofdata links. To specify what data links belong to such a set, the notion of areachabilitycriterion is used; that is, a functionc : T ×U ×P → true, false whereP denotes theset of all SPARQL graph patterns. Then, given such a reachability criterion c, a finitesetS of URIs and a SPARQL graph patternP , a documentd ∈ D is (c, S, P )-reachablein a Web of Linked DataW = 〈D, adoc〉 if any of the following two conditions holds:

1. There exists a URIu ∈ S such thatadoc(u) = d; or2. there exists a link graph edge〈dsrc, (t, u), dtgt〉 ∈ GW such that (i)dsrc is (c, S, P )-

reachable inW, (ii) c(t, u, P ) = true, and (iii) dtgt = d.

Notice how the second condition restricts the notion of reachability by ignoringdata links that do not satisfy the given reachability criterion c. Concrete examples ofreachability criteria arecAll, cNone, andcMatch [11], wherecAll selects all data links, andcNone ignores all data links; i.e.,cAll(t, u, P ) = true andcNone(t, u, P ) = false for alltuples〈t, u, P 〉 ∈ T × U × P . In contrast to such an all-or-nothing strategy, criterioncMatch returnstrue for every data link whose triple matches a triple pattern of the givengraph pattern; formally,cMatch(t, u, P ) = true if and only if there exists some solutionmappingµ such thatµ[tp] = t for an arbitrary triple patterntp that is contained inP .

Given the notion of a reachability criterion, it is possibleto define a family of (reach-ability-based) query semantics for SPARQL. To this end, letc be a reachability criterion,letS be a finite set of URIs, and letP be a SPARQL graph pattern. Then, for any Web ofLinked DataW = 〈D, adoc〉, theS-based evaluationof P overW underc-semantics,

denoted by[[P ]]R(c,S)W , is the set of solution mappings[[P ]]G whereG is the RDF graph

that consists of all triples from all documents that are (c, S, P )-reachable inW.While there exist an infinite number of possible reachability criteria, in this paper

we focus oncAll, cNone, andcMatch. The following two results show that LDQL is strictlymore expressive than SPARQL graph patterns under any of these three query semantics.

Theorem 5. Let c ∈ cAll, cNone, cMatch. For every SPARQL graph patternP there

exists an LDQL queryq such that[[P ]]R(c,S)W = [[q]]SW for every WebW andS ⊆ U .

Proof (Sketch).We only sketch the case ofcAll-semantics. In this case, one can provethat the LPElpecAll = 〈 , , 〉∗ simulates the reachability criterioncAll, and, thus,

[[P ]]R(cAll,S)W = [[〈lpecAll , P 〉]]SW . One can also find LPEs to simulatecNone andcMatch.

Theorem 6. Letc ∈ cAll, cNone, cMatch. There exists an LDQL queryq for which there

does not exist a SPARQL patternP such that[[P ]]R(c,S)W = [[q]]SW for everyW andS ⊆ U .

5 Web-Safeness of LDQL QueriesIn this section we study the “Web-safeness” of LDQL queries,where, informally, wecall a queryWeb-safeif a complete execution of the query over the WWW is possiblein practice (which is not the case for all LDQL queries as we shall see). To provide amore formal definition of this notion of Web-safeness we makethe following obser-vations. While the mathematical structures introduced by our data model capture thenotion of Linked Data on the WWW formally (and, thus, allow usto provide a formal

Page 13: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

semantics for LDQL queries), in practice, these structuresare not available completelyfor the WWW. For instance, given that an infinite number of strings can be used asHTTP URIs [6], we cannot assume complete information about which URIs are in thedomain of the partial functionadoc (i.e., can be looked up to retrieve some document)and which are not; in fact, disclosing this information would require a process that sys-tematically tries to look up every possible HTTP URI and, thus, would never terminate.Therefore, it is also impossible to guarantee the discoveryof every document in thesetD (without looking up an infinite number of URIs). Consequently, any query whoseexecution requires a complete enumeration of this set is notfeasible in practice. Basedon these observations, we defineWeb-safenessof LDQL queries as follows.

Definition 6. An LDQL queryq is Web-safeif there exists an algorithm that, for anyfinite Web of Linked DataW = 〈D, adoc〉 and any finite setS of URIs, computes[[q]]SWby looking up only a finite number of URIs without assuming an apriori availability ofany information about the setsD anddom(adoc).

Example 6.Recall our example queriesqex, q′ex, andq′′ex (cf. Example5). For queryqex =

(

SEED ?x⟨

ε, 〈?x, p1, ?z〉⟩)

, any URIu ∈ U may be used to obtain a nonemptysubset of the query result as long as a lookup ofu retrieves a document whose data in-cludes RDF triples that match〈u, p1, ?z〉. Therefore, without access toD ordom(adoc)of the queried WebW = 〈D, adoc〉, the completeness of the computed query resultcan be guaranteed only by checking each of the infinitely manypossible HTTP URIs.Hence, queryqex is not Web-safe. In contrast, although it containsqex as a subquery,queryq′′ex = (qex AND q′ex) is Web-safe, and so isq′ex = 〈lpeex, Bex〉. GivenuA as seedURI, a possible execution algorithm forq′ex may first compute[[lpeex]]

uA

W by traversingthe queried WebW based onlpeex. Thereafter, the algorithm retrieves documents bylooking up all URIsu ∈ [[lpeex]]

uA

W (or simply keeps these documents after the traver-sal); and, finally, the algorithm evaluates patternBex over the union of the RDF data inthe retrieved documents. IfW is finite (i.e., contains a finite number of documents), thetraversal process requires a finite number of URI lookups only, and so does the retrievalof documents in the second step; the final step does not look upany URI. To see thatq′′ex is also Web-safe we note that after executing subqueryq′ex (e.g., by using the algo-rithm as outlined before), the execution of the other (non-Web-safe) subqueryqex canbe reduced to a finite number of URI lookups, namely the URIs bound to variable?xin solution mappings obtained for subqueryq′ex. Although any other URI may also beused to obtain solution mappings forqex, such solution mappings cannot be joined withany of the solution mappings forq′ex and, thus, are irrelevant for the result ofq′′ex.

The example illustrates that there exists an LDQL query thatis not Web-safe. Infact, it is not difficult to see that the argument for the non-Web-safeness of queryqex asmade in the example can be applied to any LDQL query of the form(SEED ?x q) wheresubqueryq is a (satisfiable) basic LDQL query; that is, none of these queries is Web-safe. However, the example also shows that more complex queries that contain suchnon-Web-safe subqueries may still be Web-safe. Therefore,we now show properties toidentify LDQL queries that are Web-safe even if some of theirsubqueries are not. Webegin with queries of the forms〈lpe, P 〉, πV q, (SEED U q), and(q1 UNION ... UNION qn).

Page 14: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Proposition 2. An LDQL queryq is Web-safe if any of the following properties holds:

1. Queryq is of the form〈lpe, P 〉 andlpe is Web-safe, where we call an LPE Web-safeif either (i) it is of the form〈?v, q′〉 and LDQL queryq′ is Web-safe, or (ii) it is ofany form other than〈?v, q′〉 and all its subexpressions (if any) are Web-safe LPEs;

2. Queryq is of the formπV q′ or (SEED U q′), and subqueryq′ is Web-safe; or

3. Queryq is of the form(q1 UNION ... UNION qn) and eachqi (1≤ i≤n) is Web-safe.

It remains to discuss LDQL queries of the form(q1 AND ... AND qm). Our discussionof queryq′′ex in Example6 suggests that such queries can be shown to be Web-safe if allnon-Web-safe subqueries are of the form(SEED ?v q) and it is possible to execute thesesubqueries by using variable bindings obtained from other subqueries. A necessary con-dition for this execution strategy is that the variable in question (i.e.,?v) is guaranteedto be bound in every possible solution mapping obtained fromthe other subqueries.

To allow for an automated verification of this condition we adopt Buil-Aranda etal.’s notion of strongly bound variables [4]. To this end, for any SPARQL graph pat-ternP , let sbvars(P ) denote the set of strongly bound variables inP as defined byBuil-Aranda et al. [4]. For the sake of space, we do not repeat the definition here. How-ever, we emphasize thatsbvars(P ) can be constructed recursively, and each variable insbvars(P ) is guaranteed to be bound in every possible solution forP [4, Proposition 1].To carry over these properties to LDQL queries, we use the notion of strongly boundvariables in SPARQL patterns to define the following notion of strongly bound variablesin LDQL queries; thereafter, in Lemma3, we show the desired boundedness guarantee.

Definition 7. The set ofstrongly bound variables in an LDQL queryq, denoted bysbvars(q), is defined recursively as follows:

1. If q is of the form〈lpe , P 〉, thensbvars(q) = sbvars(P ).2. If q is of the form(q1 AND q2), thensbvars(q) = sbvars(q1) ∪ sbvars(q2).3. If q is of the form(q1 UNION q2), thensbvars(q) = sbvars(q1) ∩ sbvars(q2).4. If q is of the formπV q

′, thensbvars(q) = sbvars(q′) ∩ V .5. If q is of the form(SEED U q′), thensbvars(q) = sbvars(q′).6. If q is of the form(SEED ?v q′), thensbvars(q) = sbvars(q′) ∪ ?v.

Lemma 3. Letq be an LDQL query. For every finite setS of URIs, every Web of LinkedDataW, and every solution mappingµ ∈ [[q]]SW , it holds thatsbvars(q) ⊆ dom(µ).

We are now ready to show the following result.

Theorem 7. An LDQL query of the form(q1 AND q2 AND ... AND qm) is Web-safe if thereexists a total order≺ over the set of subqueriesq1, q2, ... , qm such that for eachsubqueryqi (1 ≤ i ≤ m), it holds that either (i)qi is Web-safe or (ii)qi is of theform (SEED ?v q) whereq is Web-safe and?v ∈

qj≺qisbvars(qj).

Proof (Sketch).We prove Theorem7 based on an iterative algorithm that generalizesthe execution of queryq′′ex as outlined in Example6. That is, the algorithm executes thesubqueriesq1 ... qm sequentially in the order≺ such that each iteration executes one ofthe subqueries by using the solution mappings computed during the previous iteration.

Page 15: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

With the results in this section we have all ingredients to devise a procedure toshow Web-safeness for a large number of queries (including queries that are arbitrar-ily nested). However, as a potential limitation of such a procedure we note that The-orem7 can be applied only in cases in which all non-Web-safe subqueries are of theform (SEED ?v q). For instance, the theorem cannot be applied to show that an LDQLquery of the form

(

q1 AND (q2 UNION (SEED ?x q3)))

is Web-safe if?x ∈ sbvars(q1) andq1, q2 andq3 are Web-safe. On the other hand, for the semantically equivalent query(

(q1 AND q2) UNION (q1 AND (SEED ?x q3)))

we can show Web-safeness based on Theo-rem7 (and Proposition2). Fortunately, we may leverage the following fact to improvethe effectiveness of applying Theorem7 in the procedure that we aim to devise.

Fact 1. If an LDQL queryq is Web-safe, then so is any LDQL queryq′ with q′ ≡ q.

As a consequence of Fact1, we may use the equivalences in Lemma2 to rewritea given query into an equivalent query that is more suitable for testing Web-safenessbased on our results. To this end, we introduce specific normal forms for LDQL queries:

Definition 8. An LDQL query is in UNION-free normal form if it is of the form(q1 AND ... AND qm) with m ≥ 1 and eachqi (1 ≤ i ≤ m) is either (i) a basic LDQLquery or (ii) of the formπV q, (SEED U q) or (SEED ?v q) such that subqueryq is inUNION-free normal form. An LDQL query is inUNION normal form if it is of the form(q1 UNION ... UNION qn) with n≥1 and eachqi (1≤ i≤n) is in UNION-free normal form.

The following result is an immediate consequence of Lemma2.

Corollary 1. Every LDQL query is equivalent to an LDQL query inUNION normal form.

In conjunction with Fact1, Corollary1 allows us to focus on LDQL queries inUNION

normal form without losing generality. We are now ready to specify our procedure thatapplies the results in this paper to test a given LDQL queryq for Web-safeness: First,by using the equivalences in Lemma2, the query has to be rewritten into a semanti-cally equivalent LDQL queryqnf=(q1 UNION ... UNION qn) that is inUNION normal form.Next, the following test has to be repeated for every subquery qi (1 ≤ i ≤ n); recall thateach of these subqueries is inUNION-free normal form; i.e.,qi = (qi1 AND ... AND qimi

).The test is to find an order for their subqueriesqi1, ... , q

imi

that satisfies the conditionsin Theorem7. Every top-level subqueryqi (1 ≤ i ≤ n) for which such an order exists,is Web-safe (cf. Theorem7). If all top-level subqueries are identified to be Web-safe bythis test, thenqnf is Web-safe (cf. Proposition2), and so isq (cf. Fact1).

The given conditions are sufficient to show Web-safeness of LDQL. It remains openwhether there exists a (decidable) sufficientandnecessary condition for Web-safeness.

6 Concluding Remarks and Future Work

LDQL, the query language that we introduce in this paper, allows users to expressqueries over Linked Data on the WWW. We defined LDQL such that navigational fea-tures for selecting the query-relevant documents on the Webare separate from patternsthat are meant to be evaluated over the data in the selected documents. This separationdistinguishes LDQL from other approaches to express queries over Linked Data.

Page 16: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

We focused on expressiveness, by comparing LDQL with previous formalisms, andon the notion of Web-safeness. Several topics remain open for future work. One ofthem is the complexity of query evaluation. A classical complexity analysis is easy toperform if we assume that all the data and documents are available as if they were in acentralized repository, and that they can be processed via aRAM machine model. Weconjecture that under this model, the data complexity of evaluating LDQL will be poly-nomial. Nevertheless, a more interesting complexity analysis should consider a modelthat captures the inherent way of accessing the Web of LinkedData via HTTP requests,the overhead of data communication and transfer, the distribution of data and docu-ments, etc. A more practical direction for future research on LDQL is the developmentof approaches to actually implement LDQL queries efficiently.

AcknowledgementsPerez is supported by the Millennium Nucleus Center for Seman-tic Web Research, Grant NC120004, and Fondecyt grant 1140790.

References1. Arenas, M., Conca, S., Perez, J.: Counting beyond a yottabyte, or how SPARQL 1.1 property

paths will prevent adoption of the standard. In: WWW 2012. pp. 629–638 (2012)2. Arenas, M., Gutierrez, C., Perez, J.: On the Semantics ofSPARQL. In: Semantic Web Infor-

mation Management - A Model-Based Perspective, chap. 13, pp. 281–307. Springer (2009)3. Berners-Lee, T.: Linked Data. At http://www.w3.org/DesignIssues/LinkedData.html (2006)4. Buil-Aranda, C., Arenas, M., Corcho, O.: Semantics and Optimization of the SPARQL 1.1

Federation Extension. In: Proc. 8th Extended Semantic Web Conf. (2011)5. Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 Concepts and Abstract Syntax. W3C Rec-

ommendation (Feb 2014)6. Fielding, R., Gettys, J., Mogul, J.C., Frystyk, H., Masinter, L., Leach, P.J., Berners-Lee, T.:

Hypertext Transfer Protocol – HTTP/1.1 (Jun 1999)7. Fionda, V., Pirro, G., Gutierrez, C.: NautiLOD: A FormalLanguage for the Web of Data

Graph. ACM Transactions on the Web 9(1), 5:1–5:43 (2015)8. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 Query Language. W3C Recom-

mendation (Mar 2013)9. Harth, A., Speiser, S.: On Completeness Classes for QueryEvaluation on Linked Data. In:

Proc. 26th AAAI Conf. (2012)10. Hartig, O.: LDQL: A Language for Linked Data Queries. In AMW 201511. Hartig, O.: SPARQL for a Web of Linked Data: Semantics andComputability. In: Proc. 9th

Extended Semantic Web Conf. (2012)12. Hartig, O.: An Overview on Execution Strategies for Linked Data Queries. Datenbank-

Spektrum 13(2) (2013)13. Hartig, O., Perez, J.: LDQL: A Query Language for the Webof Linked Data. In: Proc. 14th

Int. Semantic Web Conf. (2015)14. Hartig, O., Pirro, G.: A Context-Based Semantics for SPARQL Property Paths over the Web.

In: Proc. 12th Extended Semantic Web Conf. (2015)15. Mendelzon, A. O., Mihaila, G. A., Milo T.: Querying the World Wide Web. In: PDIS (1996)16. Perez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. ACM Transac-

tions on Database Systems 34 (2009)17. Perez, J., Arenas, M., Gutierrez, C.: nSPARQL: A Navigational Language for RDF. J. Web

Sem. 8(4), 255–270 (2010)18. Umbrich, J., Hogan, A., Polleres, A., Decker, S.: Link Traversal Querying for a Diverse Web

of Data. Semantic Web Journal (2014)

Page 17: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

A Proofs

A.1 Proof of Lemma 1

We formalize the claims in Lemma1 as follows: Letq1, q2, andq3 be LDQL queries,the following semantic equivalences hold:

(q1 AND q2) ≡ (q2 AND q1) (5)

(q1 UNION q2) ≡ (q2 UNION q1) (6)

(q1 AND (q2 AND q3)) ≡ ((q1 AND q2) AND q3) (7)

(q1 UNION (q2 UNION q3)) ≡ ((q1 UNION q2) UNION q3) (8)

Since the definition of LDQL operatorsAND andUNION is equivalent to their SPARQLcounterparts, these semantic equivalences follow from corresponding equivalences forSPARQL graph patterns as shown by Perez et al. [16, Lemma 2.5].

A.2 Proof of Lemma 2

The equivalences follow directly from the definition of every operator.

A.3 Proof of Proposition1

The proof is based on a recursive translation of link path expressions beginning with linkpatterns. Let〈y1, y2, y3〉 be a link pattern. We construct an LPEtransL(〈y1, y2, y3〉) asfollows. Assume thaty1 = , then we construct the LDQL query

q1 =⟨

ε, (GRAPH ?u (?out, Y2, Y3))⟩

where (i) if y2 = + thenY2 =?u, (ii) if y2 ∈ U thenY2 = y2 and (iii) if y2 =thenY2 =?y2. And similarly, if y3 = + thenY3 =?u, (ii) if y3 ∈ U thenY3 = y3and (iii) if y3 = thenY3 =?y3. If ?y2 = then que can construct queryq2 =〈ε, (GRAPH ?u (Y1, ?out, Y3))〉, and if?y3 = queryq3 = 〈ε, (GRAPH ?u (Y1, Y2, ?out))〉,following a similar process as forq1. Now consider the queryq which is theUNION ofthe above queries for everyyi = . Then LPEtransL(〈y1, y2, y3〉) is constructed as

transL(〈y1, y2, y3〉) = 〈?out, q〉.

It is not difficult to prove that[[transL(〈y1, y2, y3〉)]]uW = [[〈y1, y2, y3〉]]uW .We now define the translation in general:

– For the case of LPEr = r1/r2, we have thattransL(r) = 〈?v, q〉 whereq is:(

〈transL(r1), (GRAPH ?x )〉 AND(

SEED ?x 〈transL(r2), (GRAPH ?v )〉) )

.

– For the case of LPEr = r1|r2, we have thattransL(r) = 〈?v, q〉 whereq is:(

〈transL(r1), (GRAPH ?v )〉 UNION 〈transL(r2), (GRAPH ?v )〉)

.

Page 18: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

– For the case of LPEr = [r1], we have thattransL(r) = 〈?v, q〉 whereq is:(

〈ε, (GRAPH ?v )〉 AND π?v

(

SEED ?v 〈transL(r1), (GRAPH ?x )〉) )

.

The general proof proceed by induction. We next prove that[[transL(r1|r2)]]uW =

[[r1|r2]]uW . The proof for the other cases are similar. Thus assume thatu′ ∈ [[r1|r2]]uW ,then we know thatu′ ∈ [[r1]]

uW ∪ [[r2]]

uW . If u′ ∈ [[r1]]

uW then by induction hypothesis

we know thatu′ ∈ [[transL(r1)]]uW . Now notice that

[[〈transL(r1), (GRAPH ?v )〉]]uW = [[(GRAPH ?v )]]D

WhereD = datasetW ([[transL(r1)]]uW ). Thus given thatu′ ∈ [[transL(r1)]]

uW we

know thatD has a dataset〈u′, data(adoc(u′))〉, which implies that?v → u′ is a so-

lution for [[(GRAPH ?v )]]D, and thus?v → u′ ∈ [[〈transL(r1), (GRAPH ?v )〉]]uW .

From this it is straightforward to conclude thatu′ ∈ [[transL(r1|r2)]]uW . The other di-

rection is similar.

A.4 Proof of Theorem1

Consider the LDQLQ query given by(

SEED u⟨

〈+, p, 〉, (?x, ?x, ?x)⟩)

with u, p ∈ U . Now assume that there exists a property path patternP and a set of URIsS such that

[[P ]]ctxtW = [[Q]]SW

for every Web of Linked DataW . Let u′ ∈ U . Consider nowW1 having only twodocumentsd1 = (u, p, u′) andd2 = (a, a, a) and such thatadoc(u) = d1 andadoc(u′) = d2. Moreover, considerW2 having also two documentsd1 = (u, p, u′)andd3 = (b, b, b) such thatadoc(u) = d1 andadoc(u′) = d3. First notice that foreveryS we have that

[[Q]]SW1= ?x → a 6= [[Q]]SW2

= ?x → b

Notice thatCW1(u) = CW2 (u) = (u, p, u′) andCW1(u′) = CW2(u′) = ∅. Ingeneral, we have that for every termv 6= u it holds thatCW1(v) = CW2 (v) = ∅. Thisessentially shows that the context selectorsCW1 andCW2 are equivalent. Given thatthe semantics of property paths is based on context selectors it is easy to prove that forevery PP-based SPARQL queryR we have that[[R]]ctxtW1

= [[R]]ctxtW2. This can be done by

induction in the construction of PP-based SPARQL queries. For example, the evaluationof a base PP-pattern of the form(v, p, β), with v ∈ U andβ ∈ U ∪ V overW1 is givenby

[[(v, p, β)]]ctxtW1= µ | dom(µ) = β ∩ V andµ[〈v, p, β〉] ∈ CW1(v)

which is equal to[[(v, p, β)]]ctxtW2sinceCW1(v) = CW2(v). All the other cases for the

construction of property paths are equivalent. Moreover, since for the case of propertypath patterns the evaluation is the same overW1 and overW2, we have that for a general

Page 19: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

PP-based query using operatorAND , UNION , OPT and so on, the evaluation is also thesame. Thus we have that

[[P ]]ctxtW1= [[P ]]ctxtW2

but also that[[Q]]SW1

6= [[Q]]SW2

which contradicts the fact that[[P ]]ctxtW = [[Q]]SW for every Web of Linked DataW .

A.5 Proof of Theorem2

We associate to every property-path expressionr, an LDQL queryQr(?x, ?y) with ?xand?y as free variables. The definition ofQr(?x, ?y) is by induction in the constructionof property-path expressions. In the construction, all thevariables mentioned, besides?x and?y, are considered as fresh variables.

– If r ∈ U thenQr(?x, ?y) = (SEED ?x 〈ε, (?x, r, ?y)〉).– If r = !(u1 | · · · | uk) with ui ∈ U thenQr(?x, ?y) is defined as

(

SEED ?x⟨

ε,(

(?x, ?p, ?y) FILTER (?p 6= u1 ∧ · · · ∧?p 6= uk))⟩

)

.

– If r = r1/r2 thenQr(?x, ?y) is defined as

π?x,?y

(

Qr1(?x, ?z) AND Qr2(?z, ?y))

.

– If r = r1|r2 thenQr(?x, ?y) is defined as

(

Qr1(?x, ?y) UNIONQr2(?x, ?y))

.

– If r = r∗1 thenQr(?x, ?y) is defined as follows. First consider the LDQL query

Qε(?x, ?y) = π?x,?y(SEED ?f 〈ε, P 〉)

whereP is the following pattern

P =(

(?x, ?p, ?o) AND (?y, ?p, ?o) FILTER (?x =?y))

UNION(

(?s, ?x, ?o) AND (?s, ?y, ?o) FILTER (?x =?y))

UNION(

(?s, ?p, ?x) AND (?s, ?p, ?y) FILTER (?x =?y))

Now consider the LDQL queryQs(?v) defined as

Qs(?v) =(

〈ε, (GRAPH ?u )〉 ANDQr1(?u, ?v))

.

Then, queryQr(?x, ?y) is defined by

Qε(?x, ?y) UNION(

(SEED ?x 〈(?v,Qs(?v))∗, (GRAPH ?z )〉) AND Qr1(?z, ?y)

)

Page 20: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

We prove now that for every property path pattern(?x, r, ?y) we have that

[[(?x, r, ?y)]]ctxtW = [[Qr(?x, ?y)]]∅W .

The proof is by induction in the construction ofQr(?x, ?y). We proceed by cases.

– Assume thatr ∈ U . Thenµ ∈ [[Qr(?x, ?y)]]∅W if and only if

µ ∈ [[(SEED ?x 〈ε, (?x, r, ?y)〉)]]∅W .

Notice that this occurs if and only if there exists a mappingµ′ and a URIu suchthat µ′ ∈ [[〈ε, (?x, r, ?y)〉]]

uW , µ′ is compatible with the mapping?x → u,

andµ = µ′ ∪ ?x → u. Now, given that[[ε]]uW = u, we have thatµ′ ∈

[[〈ε, (?x, r, ?y)〉]]uW if and only if µ′ ∈ [[(?x, r, ?y)]]D with D the data set with

data(adoc(u)) as default graph. With all this we have thatµ ∈ [[Qr(?x, ?y)]]∅W if

and only if dom(µ) = ?x, ?y, µ(?x) ∈ dom(adoc), and(µ(?x), r, µ(?y)) ∈data(adoc(µ(?x))), which is exactly the property

µ((?x, r, ?y)) ∈ CW (µ(?x)).

This last property holds if and only ifµ ∈ [[(?x, r, ?y)]]ctxtW .– For the case in whichr = !(u1 | · · · | uk) with ui ∈ U , the proof is similar. We

have that

µ ∈ [[

(

SEED ?x⟨

ε,(

(?x, ?p, ?y) FILTER (?p 6= u1 ∧ · · · ∧?p 6= uk))⟩

)

]]∅W

if and only if µ is in the evaluation of(

(?x, ?p, ?y) FILTER (?p 6= u1 ∧ · · · ∧?p 6= uk))

over the graphdata(µ(?x)). This happens if and only if

µ((?x, p, ?y)) ∈ CW (µ(?x))for p /∈ u1, ... , uk,

which is exactly the property

µ ∈ [[(?x, !(u1 | · · · | uk), ?y)]]ctxtW .

– For the casesr = r1/r2, r = r1|r2, the semantics of the corresponding LDQLquery exactly matches the semantics of the property path expression. Just noticethat the semantics ofAND is that of the join, and the semantics ofUNION is that ofthe set union.

– For the case ofr = r∗1 we have thatµ ∈ [[(?x, r∗1 , ?y)]]ctxtW if and only if dom(µ) =

?x, ?y and (i) µ(?x) = µ(?y) and µ(?x), µ(?y) ∈ terms(W ), or (ii) µ ∈[[(?x, rk1 , ?y)]]

ctxtW for somek > 0. For the case(i) it is easy to see thatµ ∈

[[Qε(?x, ?y)]]∅W . Just notice that ifµ(?x) is in terms(W ) then there exists a URIu ∈

dom(adoc) and a triplet in data(adoc(u)) such thatµ(?x) appears int. If µ(?x)appears in the subject position, then we know thatµ is compatible with a mapping in

Page 21: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

[[(

(?x, ?p, ?o) AND (?y, ?p, ?o) FILTER (?x =?y))

]]D whereD is a dataset with a de-fault graph in whicht appears. Finally, given thatQε(?x, ?y) = π?x,?y(SEED ?f 〈ε, P 〉)

and we know thatu is a possible value for variable?f , we obtain thatµ ∈ [[Qε(?x, ?y)]]∅W .

If µ(?x) appears in the predicate or object position, the proof is similar. For the case(ii) we will show that

µ ∈ [[Qε(?x, ?y)]]∅W∪

k−1⋃

i=0

[[π?x,?y

(

(SEED ?x 〈(?v,Qs(?v))i, (GRAPH ?z )〉) AND Qr1(?z, ?y)

)

]]∅W (9)

We will use an inductive argument. Assumek = 1, thenµ ∈ [[(?x, r1, ?y)]]ctxtW . By

the induction hypothesis on the construction of property paths, we have thatµ ∈[[Qr1(?x, ?y)]]

∅W . Now, if µ(?x) /∈ dom(adoc) then we have thatµ(?x) = µ(?y)

and thusµ ∈ [[Qε(?x, ?y)]]∅W . If µ(?x) ∈ dom(adoc), then it is easy to see that

µ ∈ [[π?x,?y

(

(SEED ?x 〈ε, (GRAPH ?z )〉) ANDQr1(?z, ?y))

]]∅W

This is because?x → µ(?x), ?z → µ(?x) ∈ [[(SEED ?x 〈ε, (GRAPH ?z )〉)]]∅Wwhich is compatible withµ. Now assume thatµ ∈ [[(?x, rk+1

1 , ?y)]]ctxtW , thus

µ ∈ π?x,?y([[(?x, rk1 , ?z)]]

ctxtW ⋊⋉ [[(?z, r1, ?y)]]

ctxtW ).

By induction hypothesis we have that

µ ∈ π?x,?y

([

[[Qε(?x, ?z)]]∅W∪

k−1⋃

i=0

[[π?x,?z

(

(SEED ?x 〈(?v,Qs(?v))i, (GRAPH ?u )〉) AND Qr1(?u, ?z)

)

]]∅W

]

⋊⋉ [[Qr1(?z, ?y)]]∅W

)

.

Then we know that there existsj such that0 ≤ j ≤ k − 1 such that

µ ∈ π?x,?y

([

[[Qε(?x, ?z)]]∅W∪

[[π?x,?z

(

(SEED ?x 〈(?v,Qs(?v))j , (GRAPH ?u )〉) AND Qr1(?u, ?z)

)

]]∅W

]

⋊⋉ [[Qr1(?z, ?y)]]∅W

)

.

If µ ∈ π?x,?y

(

[[Qε(?x, ?z)]]∅W ⋊⋉ [[Qr1(?z, ?y)]]

∅W

)

= [[Qr1(?x, ?y)]]∅W we can

apply the same argument as in the base case. Now assume that

Page 22: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

µ ∈ π?x,?y

(

[[π?x,?z

(

(SEED ?x 〈(?v,Qs(?v))j , (GRAPH ?u )〉) AND Qr1(?u, ?z)

)

]]∅W

⋊⋉ [[Qr1(?z, ?y)]]∅W

)

.

Then we know that there exists a mappingµ′ andµ′′ such that

µ′ ∈ [[π?x,?z

(

(SEED ?x 〈(?v,Qs(?v))j , (GRAPH ?u )〉) ANDQr1(?u, ?z)

)

]]∅W

andµ′′ ∈ [[Qr1(?z, ?y)]]

∅W

µ equalsµ′ ∪ µ′′ restricted to variables?x, ?y. Notice thatµ′′ is compatible withµ′ thus, we have thatµ′(?z) = µ′′(?z). Now, if µ′′(?z) /∈ dom(adoc), sinceµ′′ ∈ [[Qr1(?z, ?y)]]

∅W , then necessarilyµ′′(?z) = µ′′(?y), and given thatµ′ is

compatible withµ′′ we obtain thatµ′(?z) = µ′′(?y). All this implies that

µ ∈ π?x,?y

(

[[π?x,?z

(

(SEED ?x 〈(?v,Qs(?v))j , (GRAPH ?u )〉) AND Qr1(?u, ?y)

)

]]∅W

)

.

and thus (9) holds. Assume now thatµ′′(?z) ∈ dom(adoc). We will prove that

µ′ ∈ [[(SEED ?x 〈(?v,Qs(?v))j+1, (GRAPH ?z )〉)]]∅W .

We know that

µ′ ∈ [[π?x,?z

(

(SEED ?x 〈(?v,Qs(?v))j , (GRAPH ?u )〉) ANDQr1(?u, ?z)

)

]]∅W

Thusµ′ equalsµ1 ∪ µ2 (restricted to variables?x, ?z) where

µ1 ∈ [[(SEED ?x 〈(?v,Qs(?v))j , (GRAPH ?u )〉)]]∅W

andµ2 ∈ [[Qr1(?u, ?z)]]

∅W

Thus, regardingµ1 we know that there exists a sequence of URIs,u1, u2, ... uj

such thatµ1(?x) = u1, µ1(?u) = uj andui+1 ∈ [[(?v,Qs(?v))]]ui

W . Now, recallthat the definition ofQs(?v) is

Qs(?v) =(

〈ε, (GRAPH ?f )〉 AND Qr1(?f, ?v))

.

Then essentially what we have is that

?f → ui, ?v → ui+1 ∈ [[Qr1(?f, ?v)]]∅W .

Page 23: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Moreover, sinceµ1 andµ2 are compatible, we know thatµ1(?u) = µ2(?u) = uj

and sinceµ2 ∈ [[Qr1(?u, ?z)]]∅W we know that

?f → uj, ?v → µ2(?z) ∈ [[Qr1(?f, ?v)]]∅W .

Finally, given that we are assuming thatµ′′(?z) = µ2(?z) is in dom(adoc) we havethat

?x → µ1(?x), ?z → µ2(?z) ∈ [[(SEED ?x 〈(?v,Qs(?v))j+1, (GRAPH ?z )〉)]]∅W

which is what we wanted to prove. Thus we have that

µ′ ∈ [[(SEED ?x 〈(?v,Qs(?v))j+1, (GRAPH ?z )〉)]]∅W .

and also thatµ′′ ∈ [[Qr1(?z, ?y)]]

∅W

and given thatµ equalsµ′ ∪ µ′′ restricted to variables?x, ?y, we have that

µ ∈ [[π?x,?y

(

(SEED ?x 〈(?v,Qs(?v))j+1, (GRAPH ?z )〉) AND Qr1(?z, ?y)

)

]]∅W

and sincej + 1 ≤ k we obtain

µ ∈ [[Qε(?x, ?y)]]∅W∪

k⋃

i=0

[[π?x,?y

(

(SEED ?x 〈(?v,Qs(?v))i, (GRAPH ?z )〉) AND Qr1(?z, ?y)

)

]]∅W

If one assumes thatµ ∈ [[Qr(?x, ?y)]]ctxtW then by an argument on exactly the same

lines of the argument above, one can show thatµ ∈ [[(?x, r∗1 , ?y)]]ctxtW .

We have shown how to construct an equivalent LDQL query for every property pathpattern with two variables. If the triple does not have two variables, we need a slightlydifferent construction, in particular for the case in which(·)∗ is used. We now showthe details of the construction but leave the complete proofas an exercise (it can becompleted using the arguments of the previous part of this proof).

Consider a propery path pattern(α, r, β) whereα is a URI or variable, andβ is aURI, variable or literal. Then for the casesr = p ∈ U , r =!(u1| · · · |uk), r = r1/r2,r = r1|r2, we construct a query asQr(α, β) whereQr(α, β) is queryQr(?x, ?y)where all occurrences of?x has been replaced byα and all occurrences of?y has beenreplaced byβ. For the case ofr = r∗1 we need to do a slightly different construction.For a pattern(u, r, ?y) we construct a queryPr(?y) as

〈ε, BIND(u AS ?y)〉 UNION(

(SEED u 〈(?v,Qs(?v))∗, (GRAPH ?z )〉) ANDQr1(?z, ?y)

)

For a pattern(?x, r, v) we construct a querySvr (?x) as

〈ε, BIND(u AS ?x)〉 UNION(

(SEED ?x 〈(?v,Qs(?v))∗, (GRAPH ?z )〉) ANDT (?z)

)

Page 24: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

whereT (?z) is eitherQr1(?z, v) orSvr1(?z) depending on the form ofr1. Finally, for a

pattern(u, r, v) we construct a queryUr as

〈ε, (BIND(u AS ?x) AND BIND(v AS ?y)) FILTER (?x =?y)〉 UNION(

(SEED u 〈(?v,Qs(?v))∗, (GRAPH ?z )〉) ANDT (?z)

)

whereT (?z) is eitherQr1(?z, v) or Svr1(?z) depending on the form ofr1.

Finally consider a property path pattern(ℓ, r, β), whereℓ is a literal. Then for thecasesr = p ∈ U , r =!(u1| · · · |uk) we should translate it into an unsatisfiable query.One way of obtaining that query is, for example, with an expression

〈ε, (BIND(ℓ AS ?x) AND BIND(ℓ AS ?y)) FILTER (?x 6=?y)〉

For the casesr = r1/r2 andr = r1|r2 we follow the same construction as ifℓ were aURI but with the last base case. For the case ofr = r∗1 , if β is a variabley we considerthe following query

〈ε, BIND(ℓ AS ?y)〉.

and ifβ is a URI or literal the query

〈ε, (BIND(ℓ AS ?x) AND BIND(β AS ?y)) FILTER (?x =?y)〉.

The correctness of this translation can be proved along the same lines as for the case ofproperty path pattern(?x, r, ?y).

A.6 Proof of Theorem3

We proceed by induction showing how to translate every posible NautiLOD query. Thetranslation works in two parts. We first define the following function transN (·) thatgiven a NautiLOD query, produces an LPE.

transN (p) = 〈+, p, 〉

transN (pˆ) = 〈 , p,+〉

transN (〈 〉) =(

?x,⟨

ε, (GRAPH ?u (?u, ?p, ?x))⟩)

transN (n1/n2) = transN (n1)/ transN (n2)

transN (n1|n2) = transN (n1)| transN (n2)

transN (n∗) = transN (n)∗

transN (n[(ASK P )]) = transN (n)/[(?x, 〈ε, (GRAPH ?x P )〉]

Before presenting the complete translations, we prove the following result. Letn be aNautiLOD expression, then for every Web of Linked Data and URIsu, v ∈ dom(adoc)we have that

v ∈ [[n]]uW if and only if v ∈ [[transN (n)]]uW .

The proof is by induction in the construction of the NautiLODexpression.

Page 25: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

– for the case ofp ∈ U we have that

[[p]]uW = u′ | (u, p, u′) ∈ data(adoc(u))

notice thatv ∈ dom(adoc) andv ∈ [[p]]uW , if and only if there is a link fromdocumentadoc(u) to documentadoc(v) that matches〈+, p, 〉. This happens, if

and only ifv ∈ [[〈+, p, 〉]]uW , which is what we wanted to prove.

– the case forpˆ is similar but using〈+, p, 〉.– the case for〈 〉. Just notice thatv ∈ [[〈 〉]]uW if and only if there exists ap ∈U such that(u, p, v) ∈ data(adoc(u)). On the other hand we have thatv ∈

[[transN (〈 〉)]]uW = [[

(

?x,⟨

ε, (GRAPH ?u (?u, ?p, ?x))⟩)

]]uW if and only if v ∈

[[π?x(GRAPH ?u (?u, ?p, ?x))]]D whereD = data(adoc(u)), 〈u, data(adoc(u))〉.

Thus v ∈ [[transN (〈 〉)]]uW if and only if there existsp such that(u, p, v) ∈

data(adoc(u)). This proves the desired property.– for the case of an expressionn1/n2, we have thatv in dom(adoc) is in [[n1/n2]]

uW

if and only if, there existsv′ ∈ dom(adoc) such thatv′ ∈ [[n1]]uW andv ∈ [[n2]]

v′

W .The we can apply or induction hypothesis and we have thatv ∈ [[n1/n2]]

uW if

and only if v′ ∈ [[transN (n1)]]uW and v ∈ [[transN (n2)]]

v′W , and thusv ∈

[[transN (n1/n2)]]uW .

– casesn1|n2 andn∗ are direct from the definition of NautiLOD and LDQL.– for the case of expressionn[(ASKP )] we have thatv ∈ [[n[(ASK P )]]]uW if and only ifv ∈ [[n]]uW , v ∈ dom(adoc) and[[P ]]data(adoc(v)) 6= ∅. On the other hand, we have

thatv ∈ [[transN (n[(ASKP )])]]uW if and only if

v ∈ [[transN (n)/[(?x, 〈ε, (GRAPH ?x P )〉]]]uW .

This happens if and only if there exists av′ such thatv′ ∈ [[transN (n)]]uW and

v ∈ [[[(?x, 〈ε, (GRAPH ?x P )〉]]]v′W . From the last property and the semantics of

[·] in LDQL, we have thatv = v′ and that[[(?x, 〈ε, (GRAPH ?x P )〉]]vW 6= ∅.

The last holds if and only if[[π?x(GRAPH ?x P )]]D 6= ∅, with D the RDF dataset

data(adoc(v)), 〈v, data(adoc(v))〉. Thus we have thatv ∈ [[transN (n[(ASKP )])]]uW

if and only if v ∈ [[transN (n)]]uW and[[P ]]data(adoc(u)) 6= ∅. Applying our induc-

tion hypothesis we havev ∈ [[n]]uW and[[P ]]data(adoc(u)) 6= ∅, which is exactly whatwe needed to prove.

Notice that the hypothesis thatv ∈ dom(adoc) was fundamental to prove the previousresult. Nevertheless, the output of a NautiLOD query can be aURI not in dom(adoc)or even a literal, so we need to do a different translation in general. Thus, we use nowtransN (·) to translate a general NautiLOD query. Given a NautiLOD expressionn wehave two cases. Assume first thatn, as a regular expression, does not produce the emptystringε. Then, by using reglar language results, we know that we can write an equivalentexpressionn′ of the form

n1/e1 | · · · | nk/ek | m1[(ASKP1)] | · · · | mℓ[(ASKPℓ)]

Page 26: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

where everyni andmj is a NautiLOD query, and everyei is either of the formp, orpˆ,or 〈 〉. We are ready now to produce an LDQL queryQn(?x) which is equivalent ton.The query is constructed as follows.

Qn(?x) = π?x

(

〈transN (n1), Q1〉 UNION · · · UNION 〈transN (nk), Qk〉 UNION

〈transN (m1), (GRAPH ?x P1)〉 UNION · · · UNION 〈transN (mℓ), (GRAPH ?x Pℓ)〉

)

where queryQi depends on the form ofei:

– if ei = p thenQi = (GRAPH ?u (?u, p, ?x))– if ei = pˆ thenQi = (GRAPH ?u (?x, p, ?u))– if ei = 〈 〉 thenQi = (GRAPH ?u (?u, ?p, ?x))

Now to prove the correctness of our construction, assume that v ∈ [[n]]uW . Then weknow thatv ∈ [[ni/ei]]

uW or v ∈ [[mi[(ASKPi)]]]

uW for somei. If v ∈ [[ni/ei]]

uW we know

that there exists av′ such thatv′ ∈ [[ni]]uW andv ∈ [[ei]]

v′

W . Notice that, sincev ∈ [[ei]]v′

W ,andei is eitherp, or pˆ, or 〈 〉 then we know thatv′ is in dom(adoc). Thus we can

apply our previous result to conclude fromv′ ∈ [[ni]]uW that v′ ∈ [[transN (ni)]]

uW .

Now if ei = p then fromv ∈ [[ei]]v′

W we conclude that(v′, p, v) ∈ data(adoc(v′)) andthus [[(?u, p, ?x)]]data(adoc(v′)) contains the mappingµ = ?u → v′, ?x → v, then[[(GRAPH ?u (?u, p, ?x))]]D hasµ as solution, withD = data(adoc(v′)), 〈v′, data(adoc(v′))〉.

Given thatv′ ∈ [[transN (ni)]]uW , we have that

µ = ?u → v′, ?x → v ∈ [[〈transN (ni), Qi〉]]uW .

Finally, given thatQn(?x) only keep the?x variable, we have that?x → v is in[[Qn(?x)]]

uW , which is what we wanted to show. Ifei = pˆ or ei = 〈 〉 the proof is the

essentially the same.Now assume thatv ∈ [[mi[(ASKPi)]]]

uW . This implies thatv is in [[mi]]

uW and that

[[Pi]]data(adoc(v)) 6= ∅. By the semantics of NautiLOD, we have thatv is in dom(adoc)(otherwise we could not have been able to evaluateP ), and thus we can apply our re-sult above to obtain thatv ∈ [[transN (mi)]]

uW . Now, given that[[Pi]]data(adoc(v)) 6= ∅

we have that[[(GRAPH ?x Pi)]]D 6= ∅ whereD = data(adoc(v)), 〈v, data(adoc(v))〉.

Moreover, we have that every mappingµ in [[(GRAPH ?x Pi)]]D is such thatµ(?x) = v.

All these facts implies that mappingµ′ = ?x → v is in [[〈transN (mℓ), (GRAPH ?x Pℓ)〉]]uW ,

and thusµ′ is in [[Qn(?x)]]uW which is exactly what we wanted to prove.

If we start by assuming thatµ = ?x → v is in [[Qn(?x)]]uW , then following a

similar reasoning as above one concludes thatv ∈ [[n]]uW .To complete the proof we have to cover the case in whichn, as a regular expres-

sion, can produce the empty string. Then, by applying some classical regular languagesproperties, one can rewriten asε|n′ with n′ an expression that does not produce theempty stringε. Thus we can translaten into the LDQL query

〈ε, (GRAPH ?x )〉 UNION Qn′(?x)

Page 27: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Notice that for everyu ∈ data(adoc(v)) we have that[[〈ε, (GRAPH ?x )〉]]uW results

in a single mappingµ = ?x → u.

A.7 Proof of Theorem4

Recall that NautiLOD can only express paths and no combination of those paths viaSPARQL operators is allowed. Thus, it is easy to prove that NautiLOD cannot expressoperators such asSEED , AND , UNION that are natively allowed in LDQL. Thus to makea stronger claim, we will prove that there exists simple LDQLquery not using thementioned operators, that cannot be expressed using NautiLOD. The proof is similar tothe proof of Theorem1.

Thus, consider the LDQLQ(?x) query given by

〈+, p, 〉, (?x, ?x, ?x)⟩

with p ∈ U . Now assume that there exists a NautiLOD expressionn such that

[[n]]vW = [[Q(?x)]]vW

for every Web of Linked DataW and v ∈ dom(adoc). Let u, u′, a, b be differentelements inU that are not mentioned inn. Consider nowW1 having only two doc-umentsd1 = (u, p, u′) and d2 = (a, a, a) and such thatadoc(u) = d1 andadoc(u′) = d2. Moreover, considerW2 having also two documentsd1 = (u, p, u′)andd3 = (b, b, b) such thatadoc(u) = d1 andadoc(u′) = d3. First notice that

[[Q(?x)]]uW1

= ?x → a 6= [[Q(?x)]]uW2

= ?x → b

We now prove that[[n]]uW1= [[n]]uW2

which is a contradiction. To prove this, we showthat for every subexpressione of n, and for every possible URIv, it holds that[[e]]vW1

=[[e]]vW2

. First notice thatW1 andW2 has only two URIs indom(adoc), namely,u andu′, thus, we only have to reason for the cases in whichv = u or v = u′. We proceed byinduction.

– Assume thate = r ∈ U . Given that inW1 andW2 the URIu is associated with thesame document (documentd1), then[[r]]uW1

= [[r]]uW2. Moreover, given thatr 6= a

andr 6= b (recall thatn does not mentiona or b), we have that[[r]]u′

W1= [[r]]u

W2= ∅.

– Assume thate = rˆ with r ∈ U . Exactly the same argument as the above caseapplies.

– Assume thate = 〈 〉. For the same reason as in the above two cases we have that[[r]]uW1

= [[r]]uW2. Now consider[[〈 〉]]u

W1. Then we have that URIv is in [[〈 〉]]u

W1if

and only if, there exists somep such that(u′, p, v) ∈ data(adoc(u′)), but the onlytriple in data(adoc(u′)) is (a, a, a) and sincea 6= u′ we have that[[〈 〉]]u

W1= ∅.

For a similar reason we obtain that[[〈 〉]]u′

W2= ∅, completing this part of the proof.

– The casese = r1/r2, e = r1|r2 ande = r∗ follows from the base cases provedabove.

Page 28: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

– Assumee = r[(ASK P )]. By definition we have that

[[r[(ASK P )]]]vW = v′ | v′ ∈ [[r]]vW , v′ ∈ dom(adoc) and[[P ]]data(adoc(v′)) 6= ∅

By induction hypothesis we have that[[r]]vW1= [[r]]vW2

for v = u, u′. Thus we onlyneed to prove that the evaluation ofP is always the same. given thatdata(adoc(u))is the same document inW1 andW2, we have that foru the property holds. Nowconsider[[P ]]d2

and[[P ]]d3with d1 = (a, a, a) andd2 = (b, b, b). Recall that

P does not mentiona or b, thus we have that ifµ ∈ [[P ]]d2then the mappingµ′

obtained fromµ by replacing every occurrence ofa by b, is in [[P ]]d3, and vice

versa. Thus we have that[[P ]]d2= ∅ if and only if [[P ]]d3

= ∅. This proves that[[r[(ASK P )]]]vW1

= [[r[(ASK P )]]]vW2for v = u, u′.

We have finished the proof that[[n]]uW1= [[n]]uW2

thus contradicting the fact thatn isequivalent toQ(?x).

A.8 Proof of Theorem5

Let P be an arbitrary SPARQL graph pattern, letW = 〈D, adoc〉 be an arbitrary Webof Linked Data, and letS be some finite set of URIs. To prove the theorem we usethe (basic) LDQL queries〈lpecAll , P 〉, 〈lpecNone , P 〉, and〈lpecMatch , P 〉, with the follow-ing LPEs:

lpecAll is 〈 , , 〉∗,

lpecNone is ε, and

lpecMatch is

(

〈?s, q1〉 | 〈?p, q1〉 | 〈?o, q1〉 | ... | 〈?s, qm〉 | 〈?p, qm〉 | 〈?o, qm〉)∗

where?s, ?p and ?o are fresh variables (not used inP ), m is the number oftriple patterns inP , and for each such triple patterntpk (1 ≤ k ≤ m)there exists a subqueryqk of the form 〈ε, Pk〉 with a SPARQL pat-tern Pk that is constructed as follows:Pk contains the triple pattern〈?s, ?p, ?o〉 and—depending on the form of the corresponding triple pat-tern tpk = 〈sk, pk, ok〉—may contain additionalFILTER operators; in par-ticular, if sk /∈ V , thenPk containsFILTER ?s = sk; if pk /∈ V , thenPk

containsFILTER ?p = pk; and ifok /∈ V , thenPk containsFILTER ?o = ok.

Then, for each reachability criterionc ∈ cAll, cNone, cMatch with its correspondingLPE lpe

c as specified above, we have to show the following equivalence:

[[P ]]R(c,S)W = [[〈lpec, P 〉]]SW . (10)

By the definition of the reachability-based query semantics(cf. Section4.3) andthe definition of LDQL query semantics (cf. Definition5), it is sufficient to prove thefollowing lemma to show that (10) holds for eachc ∈ cAll, cNone, cMatch.

Lemma 4. For eachc ∈ cAll, cNone, cMatch, the set of all documents that are (c, S, P )-reachable inW is equivalent to the following set of documents:

DcLPE = adoc(u) | u ∈ [[lpec]]uctx

W for someuctx ∈ S.

Page 29: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Notice that for eachc ∈ cAll, cNone, cMatch, the setDcLPE is the set of documents

selected by evaluatinglpec overW using every URI inS as context URI. In the follow-ing, we prove Lemma4 for each of the three reachability criteria,cAll, cNone, andcMatch.

cAll-semantics: To prove Lemma4 for cAll we show that the setDcAllLPE is both a subset

and a superset of the set of all (cAll, S, P )-reachable documents inW.We begin with the former. Hence, for an arbitrary document inDcAll

LPE we have toshow that this document is (cAll, S, P )-reachable inW. Let dLPE ∈ DcAll

LPE be such adocument. SincedLPE ∈ DcAll

LPE, we know that there exist two URIs,uctx andu, such that

– uctx ∈ S,– u ∈ [[lpecAll ]]uctx

W , and– dLPE = adoc(u).

Then, either we haveuctx = u oructx 6= u. In the following, we discuss these two cases.If uctx = u, thendLPE = adoc(uctx) and, thus, documentdLPE is (cAll, S, P )-reach-

able inW because it satisfies the first of the two alternative conditions for reachabilityas given in Section4.3.

If uctx 6= u, then, given thatu ∈ [[lpecAll ]]uctx

W , there exists a nonempty sequence oflink graph edges

〈d1, (t1, u1), d′1〉 ∈ GW , 〈d2, (t2, u2), d

′2〉 ∈ GW , ... , 〈dn, (tn, un), d

′n〉 ∈ GW

such that

– d1 = adoc(uctx),– d′i = di+1 for all i ∈ 1, ... , n− 1, and– d′n = dLPE (andun = u).

Then, sinced1 = adoc(uctx) anductx ∈ S, we have that documentd1 is (cAll, S, P )-reachable inW (the document satisfies the first of the two conditions for reachabilityas given in Section4.3). As a consequence, we can use the fact thatd′i = di+1 for alli ∈ 1, ... , n− 1 to show that all other documents connected by the sequence oflinkgraph edges are also (cAll, S, P )-reachable inW (they satisfy the second condition).Therefore, due tod′n = dLPE, documentdLPE is (cAll, S, P )-reachable inW.

After showing that in both cases,uctx = u anductx 6= u, documentdLPE ∈ DcAllLPE

is (cAll, S, P )-reachable inW, we conclude that the setDcAllLPE is a subset of the set of all

(cAll, S, P )-reachable documents inW. It remains to show thatDcAllLPE is also a superset.

To this end, letdR be a document that is (cAll, S, P )-reachable inW. We have toshow thatdR is in DcAll

LPE. We note that documentdR may be (cAll, S, P )-reachable inW because it satisfies either the first or the second of the two alternative conditions forreachability as given in Section4.3. In the following, we discuss both cases.

If dR satisfies the first condition, there exists a URIuR∈S such thatadoc(uR) = dR.SincelpecAll is 〈 , , 〉∗, we also haveuR ∈ [[lpecAll ]]uR

W . Therefore, we can use URIuR

as bothuctx andu in the definition ofDcAllLPE, which shows thatdR ∈ DcAll

LPE.

Page 30: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

If dR satisfies the second condition, then there exist both a seed URI u0 ∈ S and anonempty sequence of link graph edges

〈d1, (t1, u1), d′1〉 ∈ GW , 〈d2, (t2, u2), d

′2〉 ∈ GW , ... , 〈dn, (tn, un), d

′n〉 ∈ GW

such that

– d1 = adoc(u0),– d′i = di+1 for all i ∈ 1, ... , n− 1, and– d′n = dR and, thus,dR = adoc(un).

Moreover, every such link graph edge〈dj , (tj , uj), d′j〉 matches link pattern〈 , , 〉

in the context of URIuj−1 (1 ≤ j ≤ n). Therefore, sincelpecAll is 〈 , , 〉∗, we haveun ∈ [[lpecAll ]]u0

W . Then, withdR = adoc(un) andu0 ∈ S, we can useun asu andu0 asuctx in the definition ofDcAll

LPE, which shows thatdR ∈ DcAllLPE.

In conclusion, independent of whetherdR satisfies the first or the second conditionfor being (cAll, S, P )-reachable inW, we find thatdR ∈ DcAll

LPE. Hence,DcAllLPE is not only

a subset of all (cAll, S, P )-reachable documents inW, but also a superset thereof, whichshows that both sets are equivalent (as claimed in Lemma4).

cNone-semantics: To prove Lemma4 for cNone we show that the setDcNoneLPE is both a

subset and a superset of the set of all (cNone, S, P )-reachable documents inW.To begin with the former, assume an arbitrary document indLPE ∈ DcNone

LPE . We haveto show that this document is (cNone, S, P )-reachable inW. SincedLPE ∈ DcAll

LPE, weknow that there exist two URIs,uctx andu, such that

– uctx ∈ S,– u ∈ [[lpecNone ]]uctx

W , and– dLPE = adoc(u).

Given thatlpecNone is ε, by u ∈ [[lpecNone ]]uctx

W and Definition5, we obtain thatu =uctx and, thus,dLPE = adoc(uctx). Therefore, documentdLPE is (cNone, S, P )-reachablein W because it satisfies the first of the two alternative conditions for reachability asgiven in Section4.3. As a consequence, we can conclude that the setDcNone

LPE is a subsetof the set of all (cNone, S, P )-reachable documents inW.

To show thatDcNoneLPE is also a superset, letdR be an arbitrary document that is

(cNone, S, P )-reachable inW. We have to show thatdR is inDcNoneLPE . We note thatdR can

be (cNone, S, P )-reachable inW only if it satisfies the first of the two alternative condi-tions for reachability as given in Section4.3(for cNone, the second condition cannot besatisfied by any document becausecNone(t, u, P ) = false for all 〈t, u, P 〉 ∈ T ×U×P).Therefore, given thatdR satisfies the first condition, there exists a URIuR ∈ S suchthat adoc(uR) = dR. SincelpecNone is ε, we also haveuR ∈ [[lpecNone ]]uR

W . Therefore,we can use URIuR as bothuctx andu in the definition ofDcNone

LPE and, thus, obtain thatdR ∈ DcNone

LPE , which shows that the setDcNoneLPE is a superset of the set of all (cNone, S, P )-

reachable documents inW. Since we have shown before thatDcNoneLPE is also a subset of

the set of all documents that are (cNone, S, P )-reachable inW, we conclude that bothsets are equivalent. Hence, Lemma4 holds for reachability criterioncNone.

Page 31: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

cMatch-semantics: It remains to prove Lemma4 for cMatch. To this end, we showthat the setDcMatch

LPE is both a subset and a superset of the set of all documents thatare(cMatch, S, P )-reachable inW. As before, we begin with the former.

Let dLPE be an arbitrary document inDcAllLPE. We have to show that this document

is (cMatch, S, P )-reachable inW. SincedLPE ∈ DcMatch

LPE , we know by the definition ofDcMatch

LPE (as given in Lemma4) that there exist two URIs,uctx andu, such that

– uctx ∈ S,– u ∈ [[lpecMatch ]]uctx

W , and– dLPE = adoc(u).

Given thatu ∈ [[lpecMatch ]]uctx

W , there exists a nonempty sequence of URIsu0, u1, ... , un

and a corresponding sequence of documentsd0, d1, ... , dn such that

– di = adoc(ui) for eachi ∈ 0, ... , n,– u0 = uctx,– un = u (and, thus,dn = dLPE), and– for eachi ∈ 1, ... , n, there exists a triple patterntpk in P (1 ≤ k ≤ m) such

thatui ∈ [[〈?v, qk〉]]ui−1

W where?v ∈ ?s, ?p, ?o andqk is the LDQL query thatcorresponds totpk as specified in the definition oflpecMatch above.

We show by induction overn that alln+1 documents,d0, d1, ... , dn, are (cMatch, S, P )-reachable inW, and, thus, so isdLPE = dn.

Base case (n= 0): d0 is (cMatch, S, P )-reachable inW becaused0 = adoc(u0),u0 = uctx, anductx ∈ S; i.e.,d0 satisfies the first condition as specified in Section4.3.

Induction step (n> 0): By induction, we assume that documentdn-1 is (cMatch, S, P )-reachable inW. To show thatdn is also (cMatch, S, P )-reachable inW we aim to showthat dn satisfies the second condition for reachability as given in Section 4.3. Thatis, we aim to show that there exists a link graph edge〈dsrc, (t, u), dtgt〉 ∈ GW suchthat (i) dsrc is (cMatch, S, P )-reachable inW, (ii) cMatch(t, u, P ) = true, (iii) u = un,and (iv) dtgt = dn. Let dsrc be dn-1, which is (cMatch, S, P )-reachable inW by ourinductive hypothesis. Hence, it remains to show the existence of a link graph edge〈dn-1, (t, un), dn〉 ∈ GW for which cMatch(t, un, P ) = true. To this end, we use thefourth of the four aforementioned properties of the sequence of URIsu0, u1, ... , un.

Let tpk = 〈sk, pk, ok〉 be a triple pattern inP such thatun ∈ [[〈?v, qk〉]]un−1

W where?v ∈ ?s, ?p, ?o andqk is the LDQL query that corresponds totpk as specified in thedefinition of lpecMatch above; i.e.,qk is a basic LDQL query of the form〈ε, Pk〉 whereSPARQL patternPk contains the triple pattern〈?s, ?p, ?o〉 and (i) if sk /∈ V , thenPk

containsFILTER ?s = sk, (ii) if pk /∈ V , thenPk containsFILTER ?p = pk, and (iii) ifok /∈ V , thenPk containsFILTER ?o = ok.

Sinceun ∈ [[〈?v, qk〉]]un−1

W , by Definition5, there exists a solution mappingµ such

thatµ ∈ [[qk]]un−1W andµ(?v) = un. Moreover, sinceqk is of the form〈ε, Pk〉, we

haveµ ∈ [[Pk]]D

G whereD = datasetW (un−1) with default graphG = data(dn−1).Then, due to the construction ofPk, it is easily verified that there exists an RDFtriple t ∈ data(dn−1) such thatµ[tpk] = t andun ∈ uris(t). As a consequence,(i) cMatch(t, un, tpk) = true and (ii) by Definition2, there exists a link graph edge

Page 32: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

〈dn−1, (t, un), dn〉 ∈ GW . Finally, sincetpk is a triple pattern inP , we also havecMatch(t, un, P ) = true.

While this concludes showing that the setDcMatch

LPE is a subset of the set of all documentsthat are (cMatch, S, P )-reachable inW, we now show that it is also a superset thereof.

LetdR be a document that is (cMatch, S, P )-reachable inW. We have to show thatdRis inDcMatch

LPE . SincedR is (cMatch, S, P )-reachable inW, there exist a nonempty sequenceof URIsu0, u1, ... , un, a corresponding sequence of documents

d0 = adoc(u0), d1 = adoc(u1), d2 = adoc(u2), ... , dn = adoc(un),

and a corresponding sequence of link graph edges

〈d′1, (t1, u1), d1〉 ∈ GW , 〈d′2, (t2, u2), d2〉 ∈ GW , ... , 〈d′n, (tn, un), dn〉 ∈ GW

such that

– u0 ∈ S,– cMatch(ti, ui, P ) = true for all i ∈ 1, ... , n,– d′i = di−1 for all i ∈ 1, ... , n, and– dn = dR and, thus,dR = adoc(un).

We aim to show that each of then + 1 documents,d0, d1, d2, ... , dn, is inDcMatch

LPE ,and, thus, so isdR = dn. To this end, it is sufficient to show that each of then+1 URIs,u0, u1, ... , un, is in [[lpecMatch ]]u0

W . Then, withu0 ∈ S, for eachi ∈ 0, ... , n we canuse URIui asu andu0 asuctx in the definition ofDcMatch

LPE , which shows that documentdi = adoc(ui) is in DcMatch

LPE . We use proof by induction.

Base case (n= 0): SincelpecMatch is of the form(·)∗, we haveu0 ∈ [[lpecMatch ]]u0

W .

Induction step (n> 0): By induction, we assume thatun−1 ∈ [[lpecMatch ]]u0

W . Then,to show thatun is also in[[lpecMatch ]]u0

W , it is sufficient to show thatun is in [[lpecMatch

step ]]un−1

W

wherelpecMatch

step is 〈?s, q1〉 | 〈?p, q1〉 | 〈?o, q1〉 | ... | 〈?s, qm〉 | 〈?p, qm〉 | 〈?o, qm〉 such that

lpecMatch =

(

lpecMatch

step

)∗. Due to the existence of link graph edge〈d′n, (tn, un), dn〉 ∈ GW

with d′n = dn−1, we know by Definition2 that there exists a tripletn ∈ data(dn−1)with un ∈ uris(tn). Moreover, sincecMatch(tn, un, P ) = true, there exist both a triplepatterntpk in P and a solution mappingµ such thatµ[tpk] = tn. Then, given theLDQL query qk = 〈ε, Pk〉 that is constructed fortpk as specified in the definition oflpe

cMatch , it easy to verify that there exists a solution mappingµ′ such thatµ′ ∈ [[Pk]]D

G

andµ′(?v) = un, where?v ∈ ?s, ?p, ?o andD = datasetW (un−1) with default

graphG = data(dn−1). Then, by Definition5, we also haveµ′ ∈ [[qk]]un−1W and,

thus, un ∈ [[〈?v, qk〉]]un−1

W . Since〈?v, qk〉 is a disjunct inlpecMatch

step , we also obtainun ∈ [[lpecMatch

step ]]un−1

W and, thus,un ∈ [[lpecMatch ]]u0

W .

As argued before, as a consequence ofun ∈ [[lpecMatch ]]u0

W (andu0 ∈ S), we canshow that documentdR = adoc(un) is inDcMatch

LPE by usingun asu andu0 asuctx in thedefinition ofDcMatch

LPE (cf. Lemma4). Therefore, the setDcMatch

LPE is not only a subset of theset of all documents that are (cMatch, S, P )-reachable inW (as shown before), but alsoa superset. Hence, both sets are equivalent and, thus, Lemma4 holds forcMatch.

Page 33: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

A.9 Proof of Theorem6

In the proof we use the following simple LDQL queryQ(?x) given by⟨

〈+, p, 〉, (?x, ?x, ?x)⟩

.

We prove first that the reachability criterioncNone cannot expressQ(?x). On thecontrary, assume that there exists a SPARQL patternP such that

[[P ]]R(cNone,S)W = [[Q(?x)]]SW

for everyS andW . Letu, u′, a, b be different elements inU that are not mentioned inP .Consider nowW1 having only two documentsd1 = (u, p, u′) andd2 = (a, a, a)and such thatadoc(u) = d1 andadoc(u′) = d2. Moreover, considerW2 having alsotwo documentsd1 = (u, p, u′) andd3 = (b, b, b) such thatadoc(u) = d1 andadoc(u′) = d3. First notice that

[[Q(?x)]]uW1

= ?x → a 6= [[Q(?x)]]uW2

= ?x → b

It is easy to see that[[P ]]R(cNone,u)W1

= [[P ]]R(cNone,u)W2

. Just notice that fromu, theset of reachable documents following thecNone criterium is the same setd1 in both

W1 andW2. Thus we have that[[P ]]R(cNone,u)W1

= [[P ]]R(cNone,u)W2

but [[Q(?x)]]uW1

6=

[[Q(?x)]]uW2

which is a contradiction.To continue with the proof, we now show that the reachabilitycriterioncAll cannot

expressQ(?x). To obtain a contradiction, assume that there exists a patternP such that

[[P ]]R(cAll,S)W = [[Q(?x)]]SW

for everyS andW . Letu, u′, a, b be different elements inU that are not mentioned inP .Consider nowW1 = (d1, d2, d3, adoc1) having three documentsd1 = (u, p, u′),d2 = (a, a, a) andd3 = (b, b, b) and such thatadoc1(u) = d1, adoc1(u′) = d2andadoc1(a) = d3. Moreover, considerW2 = (d1, d2, d3, adoc2) having exactly thesame documents asW1, and such thatadoc2(u) = d1, adoc2(u′) = d3 andadoc2(b) =d2. First notice that

[[Q(?x)]]uW1

= ?x → a 6= [[Q(?x)]]uW2

= ?x → b.

Now notice that fromu, the set of reachable documents inW1 following the cAllcriterium is the setd1, d2, d3; d1 is the document associated tou, d2 is reachablefrom d1 via the URIu′, andd3 is reachable fromd2 via the URIa. Moreover, the setreachable documents fromu in W2 is alsod1, d2, d3; d1 is the document associatedto u, d3 is reachable fromd1 via the URIu′, andd2 is reachable fromd3 via URI b.Given that the set of reachable documents is the same in bothW1 andW2 we have[[P ]]

R(cAll,u)W1

= [[P ]]R(cAll,u)W2

. Given that[[Q(?x)]]uW1

6= [[Q(?x)]]uW2

we obtain ourdesired contradiction.

We consider now the case ofcMatch, and prove that it cannot expressQ(?x). Toobtain a contradiction, assume that there exists a patternP such that

[[P ]]R(cMatch,S)W = [[Q(?x)]]SW

Page 34: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

for everyS andW . Let u, u′, u′′, a be different elements inU that are not mentionedin P . Consider nowW1 having two documentsd1 = (u, p, u′) andd2 = (a, a, a)and such thatadoc(u) = d1 andadoc(u′) = d2. Moreover, considerW2 having alsotwo documentsd′1 = (u′′, p, u′) andd′2 = (a, a, a) such thatadoc(u) = d′1 andadoc(u′) = d′2. First notice that

[[Q(?x)]]uW1

= ?x → a 6= [[Q(?x)]]uW2

= ∅.

We prove now that[[P ]]R(cMatch,u)W1

= [[P ]]R(cMatch,u)W2

. Now given thatd1 is the docu-ment associated tou in W1, we have thatd1 is (cMatch, u, P )-reachable inW1. Simi-larly, we know thatd′1 is (cMatch, u, P )-reachable inW2. Moreover, given thatP doesnot mentionu, u′ andu′′ we have that(u, p, u′) matches a triple pattern inP if and onlyif (u′′, p, u′) matches a triple pattern inP . Thus we have thatd2 is (cMatch, u, P )-reachable inW1 if and only if d′2 is (cMatch, u, P )-reachable inW2. Thus we haveonly two cases, either

– d1 is the set of (cMatch, u, P )-reachable documents inW1, andd′1 is the setof (cMatch, u, P )-reachable documents inW2, or

– d1, d2 is the set of (cMatch, u, P )-reachable documents inW1, andd′1, d′2 is

the set of (cMatch, u, P )-reachable documents inW2.

In the first case we have that[[P ]]R(cMatch,u)W1

is obtained by evaluatingP over graph

G1 = (u, p, u′), and that[[P ]]R(cMatch,u)W2

is obtained by evaluatingP over graphG2 = (u′′, p, u′). Given thatP does not mentionu, u′ andu′′ we obtain that theevaluation ofP overG1 is the same as the evaluation ofP overG2 which impliesthat [[P ]]

R(cMatch,u)W1

= [[P ]]R(cMatch,u)W2

. In the second case,[[P ]]R(cMatch,u)W1

is obtained

by evaluatingP over graphG1 = (u, p, u′), (a, a, a), and [[P ]]R(cMatch,u)W2

is ob-tained by evaluatingP over graphG2 = (u′′, p, u′), (a, a, a). For the same reasonas above we have that the evaluation ofP is the same overG1 and overG2 whichimplies that[[P ]]

R(cMatch,u)W1

= [[P ]]R(cMatch,u)W2

. We have proved that[[P ]]R(cMatch,u)W1

=

[[P ]]R(cMatch,u)W2

, while [[Q(?x)]]uW1

6= [[Q(?x)]]uW2

which is our desired contradiction.

A.10 Proof of Proposition2

Property 1 Let q be an arbitrary basic LDQL query of the form〈lpe, P 〉 such thatlpe is Web-safe. To show thatq is Web-safe we provide Algorithm1. In line 3 thealgorithm calls a subroutine, EXECLPE, that evaluates a given LPE in the context ofa given URI (cf. Algorithm2). The correctness of the algorithm and its subroutine iseasily checked. Moreover, a trivial proof by induction on the possible structure of LPEscan show that for any Web-safe LPE, the given subroutine looks up a finite number ofURIs only. The crux of such a proof is twofold: First, the evaluation of LPEs of the formlpe

∗ (lines26 to 34 in Algorithm 2) is guaranteed to reach a fixed point for anyfiniteWeb of Linked Data. Second, the evaluation of LPEs of the form〈?v, q〉 (lines 38 to42) uses an algorithm for subqueryq that has the properties as required in Definition6.Due to the Web-safeness of the given LPE and, thus, ofq, such an algorithm exists.

Page 35: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Algorithm 1 Execution of a basic LDQL query〈lpe, P 〉 using a setS of URIs as seed.1: Φ := a new empty set of URIs2: for all u ∈ S do3: Φ := Φ∪ EXECLPE(lpe, u)4: end for5: G := a new empty set of RDF triples (i.e., an empty RDF graph)6: N := a new empty set of pairs consisting of a URI and an RDF graph7: for all u ∈ Φ do8: if looking up URIu results in retrieving a document, sayd then9: G := G ∪ data(d)

10: N := N ∪ 〈u,data(d)〉11: end if12: end for13: return [[P ]]

〈G,N〉G // [[P ]]

〈G,N〉G can be computed by using any algorithm that implements

// the standard (set-based) SPARQL evaluation function [2]

Property 2 First, letq be an LDQL query of the formπV q′ such that subqueryq′ is

Web-safe. Due to the Web-safeness ofq′, there exists an algorithm forq′ that has theproperties as required in Definition6. We may use this algorithm to construct an algo-rithm for q; that is, our algorithm forq calls the algorithm forq′, applies the projectionoperator to the result, and returns the set of solution mappings resulting from this pro-jection. Since the application of the projection operator does not involve URI lookups,the constructed algorithm forq has the properties as required in Definition6. Second,let q be an LDQL query of the form(SEED U q′) such thatq′ is Web-safe. Hence, thereexists an algorithm forq′ that has the properties as required in Definition6. Then, show-ing the Web-safeness ofq is trivial because the algorithm forq′ can also be used forq.

Property 3 Let q be an LDQL query of the form(q1 UNION ... UNION qn) such thateach subqueryqi (1≤ i≤n) is Web-safe. Hence, for each subqueryqi, there exists analgorithm that has the properties as required in Definition6. Then, the Web-safeness ofqueryq is easily shown by specifying another algorithm that calls the algorithms of thesubqueries sequentially and unions their results.

A.11 Proof of Lemma3

Lemma3 follows from Definition7 and Buil-Aranda et al.’s result [4, Proposition 1].

A.12 Proof of Theorem7

We prove Theorem7 based on Algorithm3, which is an iterative algorithm that gener-alizes the execution strategy outlined for queryq′′ex in Example6. That is, the algorithmexecutes the subqueriesq1, q2, ... , qm sequentially in the order≺ such that each itera-tion step (lines2 to 24) executes one of the subqueries by using the solution mappingscomputed during the previous step (which are passed on via the setsΩ0, Ω1, ... , Ωm).

Page 36: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Algorithm 2 EXECLPE(lpe, uctx)1: if looking up URIuctx results in retrieving a document, saydctx then2: if lpe is ε then3: return a new singleton setuctx

4: else iflpe is a link patternlp = 〈y1, y2, y3〉 then5: lp′ := 〈y′

1, y′2, y

′3〉, where〈y′

1, y′2, y

′3〉 is a link pattern generated fromlp such that any

occurrence of symbol+ in lp is replaced by URIuctx

6: Φ := a new empty set of URIs7: for all 〈x1, x2, x3〉 ∈ data(dctx) do8: if (y′

1 = x1 or y′1 = ) and (y′

2 = x2 or y′2 = ) and (y′

3 = x3 or y′3 = ) then

9: for all i ∈ 1, 2, 3 do10: if y′

i = and xi is a URI whose lookup retrieves a documentthen11: Φ := Φ ∪ xi12: end if13: end for14: end if15: end for16: return Φ

17: else iflpe is of the formlpe1/lpe2 then18: Φ′ := EXECLPE(lpe1, uctx)19: Φ := a new empty set of URIs20: for all u′∈ Φ′ doΦ := Φ∪ EXECLPE(lpe2, u

′) end for21: return Φ

22: else iflpe is of the formlpe1|lpe2 then23: Φ1 := EXECLPE(lpe1, uctx)24: Φ2 := EXECLPE(lpe2, uctx)25: return Φ1 ∪ Φ2

26: else iflpe is of the forml∗ then27: Φcur := EXECLPE(ε,uctx)28: lpe ′ := l29: repeat30: Φprev := Φcur

31: Φcur := Φcur ∪ EXECLPE(lpe′, uctx)32: lpe ′ := an LPE of the formlpe′/l33: until Φcur = Φprev

34: return Φcur

35: else iflpe is of the form[lpe ′] then36: Φ := EXECLPE(lpe′, uctx)37: if Φ 6= ∅ then return a new singleton setuctx else returna new empty setend if

38: else iflpe is of the form〈?v, q〉 then39: Ω := EXEC(q,uctx) // where EXEC denotes an arbitrary algorithm that can be used

// to compute theuctx-based evaluation ofq over the queried// Web of Linked Data

40: Φ := a new empty set of URIs41: for all µ ∈ Ω for which ?v ∈ dom(µ) andµ(?v) ∈ U doΦ := Φ ∪ µ(?v) end for42: return Φ

43: end if

44: else45: return a new empty set46: end if

Page 37: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

Algorithm 3 Execution of an LDQL queryq of the form(q1 AND q2 AND ... AND qm)using a finite setS of URIs as seed.Require: m ≥ 1Require: LDQL queryq is given as an arrayQ consisting of all subqueries ofq such that

the order of the subqueries in this array satisfies the conditions as given in Theorem7.

1: Ω0 := µ∅, whereµ∅ is the empty solution mapping; i.e.,dom(µ∅) = ∅2: for j := 1, ... ,m do

3: Ωtmp := a new empty set of solution mappings4: qj := thej-th subquery in arrayQ

5: if qj is of the form(SEED ?v q′) then6: Utmp := a new empty set of URIs7: for all µ ∈ Ωj−1 do8: if µ(?v) is a URIthenUtmp := Utmp ∪ µ(?v) end if9: end for

10: for all u ∈ Utmp do11: Ωtmp := Ωtmp ∪ EXEC(q′,u) // where EXEC denotes an arbitrary algorithm that

// can be used to compute theu-based evaluation// of q′ over the queried Web of Linked Data

12: end for13: else14: Ωtmp := EXEC(qj, S) // where EXEC denotes an arbitrary algorithm that can

// be used to compute theS-based evaluation ofqj// over the queried Web of Linked Data

15: end if

16: Ωj := a new empty set of solution mappings17: for all µ ∈ Ωj−1 do18: for all µ′ ∈ Ωtmp do19: if µ andµ′ are compatiblethen20: Ωj := Ωj ∪ µjoin, whereµjoin = µ ∪ µ′

21: end if22: end for23: end for

24: end for25: return Ωm

To prove that Algorithm3 has the properties as required in Definition6 we have toshow that the algorithm is sound and complete (i.e., for any finite setS of URIs andany Web of Linked DataW, the algorithm returns[[q]]SW ) and that it is guaranteed tolook up a finite number of URIs only. We show these properties by induction on themiteration steps performed by the algorithm. To this end, we assume that the indices asused for the subqueriesq1, q2, ... , qm reflect the order≺, that is, subqueryq1 is the firstaccording to≺, subqueryq2 is the second, and so on.

Base Case (m = 1): By the conditions in Theorem7, the first subquery (accordingto ≺) must be Web-safe and, thus, cannot be of the form(SEED ?v q′). Hence, the

Page 38: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

algorithm enters the corresponding else-branch (line14). Due to the Web-safeness ofq1, there exists an algorithm for subqueryq1, sayA1, that has the properties as requiredin Definition 6. Algorithm 3 uses algorithmA1 to obtainΩtmp = [[q1]]

SW (whereW is

the queried Web of Linked Data), which requires only a finite number of URI lookups.Thereafter, Algorithm3 computesΩ1 = Ω0 ⋊⋉ Ωtmp (lines 16 to 23) and returnsΩ1 (line 25), which does not require any more URI lookups. Hence, form = 1, thealgorithm looks up a finite number of URIs (if the queried Web of Linked Data is finite).SinceΩ0 contains only the empty solution mappingµ∅ (line 1), which is compatiblewith any other solution mapping, we haveΩ1 = Ωtmp and, thus,Ω1 = [[q1]]

SW .

Induction Step (m > 1): By induction we assume that after completing the (m–1)-th iteration, the algorithm has looked up a finite number of URIs only and the currentintermediate resultΩm−1 covers the conjunction of subqueriesq1, q2, ... , qm−1; that is,Ωm−1 = [[(q1 AND q2 AND ... AND qm−1)]]

SW . We show that them-th iteration also looks

up a finite number of URIs only and thatΩm = [[(q1 AND q2 AND ... AND qm)]]SW .If subqueryqm is Web-safe, it is not difficult to see these properties: Since qm is

Web-safe, there exists an algorithm forqm, sayAm, that has the properties as requiredin Definition 6. The corresponding call of algorithmAm in line 14 of Algorithm 3looks up a finite number of URIs only, and the subsequent join computation in lines16 to 23 does not require any more lookups. Moreover, the result of calling algorithmAm in line 14 is Ωtmp = [[qm]]SW and, since the subsequent join computation returnsΩm = Ωm−1 ⋊⋉ Ωtmp, we haveΩm = [[(q1 AND q2 AND ... AND qm)]]SW , as desired.

It remains to discuss the case of subqueryqm being of the form(SEED ?v q′),where, by the conditions in Theorem7, subqueryq′ is Web-safe. Hence, there ex-ists an algorithm forq′, sayA′, that has the properties as required in Definition6.In this case, Algorithm3 first iterates over all solution mappings inΩm−1 to popu-late a setUtmp with all URIs that any of these mappings binds to variable?v (lines6to 9). Due to the finiteness assumed for all queried Webs of LinkedData (cf. Def-inition 6), Ωm−1 is finite. Hence, the resulting setUtmp contains a finite number ofURIs. Therefore, the subsequent loop in lines10 to 12 calls algorithmA′ a finitenumber of times and, thus, them-th iteration looks up a finite number of URIs only.To show the remaining claim,Ωm = [[(q1 AND q2 AND ... AND qm)]]SW , we first showΩm ⊆ [[(q1 AND q2 AND ... AND qm)]]SW . Let µjoin be an arbitrary solution mapping inΩm; i.e.,µjoin ∈ Ωm. By lines17to 23, there exist solution mappingsµ andµ′ such that

(i) µ ∈ Ωm−1 = [[(q1 AND q2 AND ... AND qm−1)]]SW , (ii) µ′ ∈ Ωtmp =

u∈Utmp[[q′]]

uW ,

(iii) µ andµ′ are compatible, and (iv)µjoin = µ ∪ µ′. Then, by Definition5, we haveµjoin ∈ [[(q1 AND q2 AND ... AND qm)]]SW and, thus,Ωm⊆ [[(q1 AND q2 AND ... AND qm)]]SW .Finally, we showΩm ⊇ [[(q1 AND q2 AND ... AND qm)]]SW . Assume an arbitrary solu-tion mappingµ∗ ∈ [[(q1 AND q2 AND ... AND qm)]]SW . Then, by Definition5, there existtwo solution mappingsµ∗

1 andµ∗2 such that (i)µ∗

1 ∈ [[(q1 AND q2 AND ... AND qm−1)]]SW ,

(ii) µ∗2 ∈ [[qm]]SW , (iii) µ∗

1 andµ∗2 are compatible, and (iv)µ∗ = µ∗

1 ∪ µ∗2. By our in-

duction hypothesis, we haveµ∗1 ∈ Ωm−1. Then, given lines17 to 23, we have to show

thatµ∗2 ∈ Ωtmp whereΩtmp is the set of solution mappings computed during them-th

iteration. Sinceqm is of the form(SEED ?v q′), it holds thatΩtmp =⋃

u∈Utmp[[q′]]

uW

whereUtmp = u ∈ U | µ(?v) = u for someµ ∈ [[(q1 AND q2 AND ... AND qm−1)]]SW .

Hence, to show thatµ∗2 ∈ Ωtmp we show that there exists a URIu ∈ Utmp such thatµ∗

2

Page 39: arxiv.org · arXiv:1507.04614v2 [cs.DB] 19 Jul 2015 LDQL: A Query Language for the Web of Linked Data (Extended Version)⋆ Olaf Hartig1 and Jorge Pe´rez2 1

is in [[q′]]uW . Sinceµ∗

2 ∈ [[qm]]SW , by Definition5, solution mappingµ∗2 binds variable

?v to a URI, sayu∗; i.e.,?v ∈ dom(µ∗2) andµ∗

2(?v) = u∗ with u∗ ∈ U . Furthermore,by Lemma3 and the condition in Theorem7 (i.e.,?v ∈

qk≺qmsbvars(qk)), solution

mappingµ∗1 also has a binding for variable?v, and, sinceµ∗

1 andµ∗2 are compatible,

these bindings are the same, that is,µ∗1(?v) = µ∗

2(?v). Hence, for URIu∗ = µ∗2(?v) it

holds thatu∗ ∈ Utmp. Then, by Definition5, we obtain thatµ∗2 ∈ [[q′]]

uW , which shows

thatµ∗2 ∈ Ωtmp and, thus, we can conclude thatΩm ⊇ [[(q1 AND q2 AND ... AND qm)]]SW .

A.13 Proof of Corollary 1

Corollary1 is an immediate consequence of Lemma2.