
An Offline Optimal SPARQL Query Planning Approach to Evaluate Online Heuristic Planners

Mihaela Bornea, Julian Dolby, Achille Fokoue, Anastasios Kementsietsidis, and Kavitha Srinivas

IBM T.J. Watson Research

Abstract. In graph databases, a given graph query can be executed in a large variety of semantically equivalent ways. Each such execution plan produces the same results, but at different computation costs. The query planning problem consists of finding, for a given query, an execution plan with the minimum cost. The traditional greedy or heuristic cost-based approaches addressing the query planning problem do not guarantee by design the optimality of the chosen execution plan. In this paper, we present a principled framework to solve the query planning problem by casting it into an Integer Linear Programming problem, and discuss its applications to testing and improving heuristic-based query planners.

Keywords: RDF, SPARQL, Query Optimization, Query Planning, ILP

1 Introduction

Obtaining good performance for declarative query languages requires an optimized total system, with an efficient data layout, good data statistics, and careful query optimization (e.g. [1]). One key piece of such systems is a query planner that translates a declarative query into a concrete execution plan with minimal cost. This problem has been extensively studied, in particular in the relational database literature [18, 6, 3, 4]. The traditional solution builds a cost model that, based on data statistics [14, 16], is able to estimate the cost of a given query execution plan. However, since the number of execution plans can be extremely large, only a small subset of all valid plans is constructed (using heuristics and/or greedy approaches that consider plans likely to have a low cost) [9, 10]. The cost of each selected candidate plan is then estimated using the cost model, and the cheapest plan is selected for execution. The chosen plan is a local optimum and is not guaranteed to be a global optimum. Even with sub-optimal plans, the performance of an optimizer is still considered satisfactory if it performs better (in terms of evaluation times) than competing optimizers. Yet, there is an alternative metric to measure how well the optimizer performs: how far its locally optimal plans are from globally optimal plans. However, finding a global optimum is challenging and is one of the reasons why heuristic planners were devised in the first place. To the best of our knowledge, there is no practical mechanism for assessing how good these planners are, i.e. whether they produce optimal plans given the data layout and statistics available.


In this paper, we describe an efficient offline technique to find optimal plans for the graph query planning problem. The problem is NP-hard, as shown in [8] and through reduction from TSP in Section 2. Our approach to finding optimal plans works by casting the problem as an integer linear programming (ILP) problem. An ILP problem consists of an objective function (in our case, the cost of executing the query) that needs to be minimized under a set of inequality constraints (in our case, these constraints encode the semantics of the input graph query and of valid execution plans). Both the objective function and the set of constraints are expressed as linear expressions over a set of variables, some of which are restricted to take only integer values. Although ILP is also known to be NP-hard in the worst case, in practice, highly optimized solvers exist to efficiently solve our formulation of the query planning problem (see Section 4). Furthermore, we show that our ILP formulation can be used to evaluate the effectiveness of any greedy/heuristic planning solution. In fact, the two approaches can potentially be combined, with the ILP formulation being used to precompile specific queries that may occur frequently within a workload, or to test the heuristic solution to find how far away it is from optimal.

Our main contributions in the paper are as follows:

1. We present an abstract general formulation of the planning problem which decouples it from the actual resolution of the problem.

2. We show how to translate that abstract formulation into an ILP problem.

3. Using an implementation of this approach for SPARQL 1.0, we show how it can be concretely applied to test and improve a state-of-the-art online heuristic-based planner in DB2RDF [1]. This formal and principled evaluation approach helps uncover new opportunities for further optimization in a mature system which outperforms existing systems (Virtuoso, Jena, Sesame and RDF3X) across four benchmark and real datasets [1].

The rest of this paper is organized as follows. Section 2 presents an algebraic representation of queries, and uses it to introduce the universe of alternative query plans for an input query q. In Section 3, we cast the planning problem as an ILP problem. In particular, our approach is inspired by electronic circuit design and we intuitively construct a single concise circuit board that captures the whole universe of plans. Then, we introduce appropriate constraints and cost functions and use an ILP solver to identify an optimal sub-circuit in the board that connects all the circuit components (i.e. all the input query sub-patterns), which corresponds to the optimal query plan. Section 4 empirically demonstrates that our approach is a practical formalization of the optimization problem, for testing and improving query planners and for offline query optimization.

2 The SPARQL Query Planning Problem

2.1 Planning Problem Input

There are three inputs to the process of SPARQL planning:


    SELECT ? WHERE {
      ?x home “Palo Alto”                 t1
      { ?x founder ?y                     t2
        UNION
        ?x member ?y }                    t3
      { ?y industry “Software”            t4
        ?z developer ?y                   t5
        ?y revenue ?n }                   t6
      OPTIONAL { ?y employees ?m }        t7
    }

(a) Sample query q

    t1, scan = 106
    t1, acs = 100
    t1, aco = 30

(b) The cost C for t1

[Figure: two plan trees built from JOIN, UNION, and LEFT-JOIN nodes. One tree has the annotated leaves (t4, aco), (t2, aco), (t3, aco), (t1, acs), (t6, acs), (t5, aco), (t7, acs); the other has the annotated leaves (t1, aco), (t2, aco), (t3, aco), (t4, acs), (t5, aco), (t6, acs), (t7, acs).]

(c) Two syntactic reorderings

[Figure: data flow inside the top-level component, which produces x, y, z, n, m, through the components (t1, acs), (t4, aco), UNION, (t6, acs), (t5, aco), and OPTIONAL.]

(d) Optimal query flow

Fig. 1. Sample input, alternative plans, and optimal flow

1. The query q: The SPARQL query conforms to the SPARQL standard. Therefore, each query q is composed of a set of hierarchically nested graph patterns P, with each graph pattern p ∈ P being either a simple triple pattern¹ or a more complex pattern such as AND, UNION, or OPTIONAL.

2. The access methods M: Access methods provide alternative ways to evaluate a pattern p ∈ P. The methods are system-specific, and dependent on existing indexes in the store. For example, a store might have indexes to access a triple by subject (access-by-subject (acs)) or by object (access-by-object (aco)), or it might access a triple by a scan (access-by-scan (scan)).

3. The cost C: Each access method for a pattern p is annotated with a cost, based on system-specific notions of how expensive it is to execute the method. The cost C may be derived from statistics maintained by the system about the characteristics of a particular dataset, or from well-known costs for a particular access method (e.g., scans are more expensive than index-based access).

Figure 1 shows a sample input where query q retrieves the people that founded or are board members of companies in the software industry, and live in “Palo Alto”. For each such company, the query retrieves the products that were developed by it, its revenue, and optionally its number of employees. Three different access methods are assumed in M: one that performs a data scan (scan), one that retrieves all the triples given a subject (acs), and one that retrieves all the triples given an object (aco). The cost C for accessing a specific pattern p, given an access method, is the third input, an example of which is shown for triple t1 in Figure 1(b). This query will form our running example and the rest of the figure will be explained in the following sections.

¹ We reuse the same notation as the SPARQL algebra (section 18 of the spec), and assume, w.l.o.g., that every triple appears in a singleton Basic Graph Pattern (BGP).


2.2 Query Flattening

To simplify the planning process, we introduce a function flat(q) to eliminate unnecessary syntactic nesting that might occur in the query. Specifically, since each query q is composed of a set of hierarchically nested graph patterns P, for each graph pattern p ∈ P_AND (i.e., the set of AND patterns in q), we flatten nested AND patterns because they do not reflect any change in the semantics of the SPARQL query. Note that when we flatten the query q, we ensure that any OPTIONAL pattern associated with a nested AND pattern stays scoped with the AND pattern, so that the flattened query remains equivalent to the original one.
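The flattening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Triple` and `Group` classes are hypothetical stand-ins for a parsed query, and the rule that a nested AND directly holding an OPTIONAL is left nested is one conservative way to keep the OPTIONAL in scope, not necessarily the rule used in the implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Triple:
    name: str                      # e.g. "t1"


@dataclass
class Group:
    kind: str                      # "AND", "UNION", or "OPTIONAL"
    children: List[Union["Triple", "Group"]]


def flat(p):
    """Splice nested AND groups into their parent AND group."""
    if isinstance(p, Triple):
        return p
    kids = [flat(c) for c in p.children]
    if p.kind != "AND":
        return Group(p.kind, kids)
    merged = []
    for c in kids:
        is_and = isinstance(c, Group) and c.kind == "AND"
        # Assumption: an AND that directly holds an OPTIONAL stays nested,
        # so that the OPTIONAL keeps its original scope.
        holds_optional = is_and and any(
            isinstance(g, Group) and g.kind == "OPTIONAL" for g in c.children
        )
        if is_and and not holds_optional:
            merged.extend(c.children)
        else:
            merged.append(c)
    return Group("AND", merged)


# The running example: t1, a UNION of t2 and t3, a nested group, an OPTIONAL.
t = {i: Triple(f"t{i}") for i in range(1, 8)}
q = Group("AND", [
    t[1],
    Group("UNION", [Group("AND", [t[2]]), Group("AND", [t[3]])]),
    Group("AND", [t[4], t[5], t[6]]),
    Group("OPTIONAL", [t[7]]),
])
top = [getattr(c, "name", None) or c.kind for c in flat(q).children]
```

For the running example, `top` lists six direct sub-patterns of the flattened top-level AND, which matches the count used later for the top-level component.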

2.3 Planning Problem Formulation

Given a query q, the SPARQL specification (section 18) defines a transformation of q into an algebraic expression, denoted algebra(q), that corresponds to a valid evaluation of the query. The tree on the left in Figure 1(c) shows algebra(flat(q)) for our example query.

Due to the guaranteed correctness of the transformation from a query to a SPARQL algebraic expression, the SPARQL Algebra is clearly a good starting point to define our notion of a valid execution plan of a query. However, the algebraic expression generated for a query q suffers from two important limitations. First, it is underspecified: it implies an execution order, but, for example, it does not specify the access method to use to access a given triple pattern. Second, the implied execution order only mirrors the order in which patterns appear in the original query (no join order optimization). Thus, the evaluation order entailed by the generated algebraic expression is likely to be suboptimal.

In this section, we first formally define a valid plan as an annotated SPARQL algebraic expression. Annotations make an algebraic expression fully specified and executable. Annotations indicate, for example, the precise access method used to access a triple pattern, or, for a JOIN node, whether it is a PRODUCT (i.e. when the two operands of the JOIN node have no variables in common). Then, we present a generalization of the transformation from a SPARQL query to a SPARQL algebraic expression that, for a given query q, generates a very large universe Uq of valid plans of q. Plans in Uq are obtained by considering all permutations of elements in all AND patterns of the flattened query flat(q) and all valid annotations of all algebraic nodes. Finally, the query planning problem is defined as finding a plan in Uq with the lowest cost.

Annotated SPARQL Algebra. The access method annotation function, denoted am, maps a Basic Graph Pattern (BGP) containing a single triple t in algebra(flat(q)) to an access method m ∈ M to use to evaluate t.

We observe that the JOIN operator is still ambiguous. JOIN(e1, e2) can represent one of the following three concrete operations: (1) a cartesian product operation if e1 and e2 have no variables in common, (2) an efficient filter (or linear) join if at least one required variable of e2 is produced by e1 (in this case e1 is evaluated first and its results are fed to e2 for evaluation), or (3) a regular join in which e1 and e2 are independently evaluated and then merged. Likewise, a LEFTJOIN(e1, e2) can represent either an efficient filter (or linear) left outer join (when at least one required variable of e2 is produced by e1) or a regular left join.

The second annotation function, called the join annotation function and denoted jan, maps a join expression to one element of the set A = {PRODUCT, LINEAR, REGULAR} to indicate the precise nature of the join to be performed, and it maps a left join to an element of A − {PRODUCT}.

Given an access method annotation function am and a join annotation function jan, we define the required variables function, denoted required[am, jan] (or simply required when there is no ambiguity), and the available variables function, denoted available[am, jan] (or simply available). For an algebraic sub-expression e in algebra(flat(q)), required(e) is the set of all variables required to evaluate e, and available(e) is the set of all variables available after the evaluation of e. These two functions can be defined inductively for all types of expressions. Due to space limitations, they are presented only for BGP, JOIN, LEFTJOIN, and UNION:

– e = BGP(t): required(e) = R(t, am(e)) and available(e) = R(t, am(e)) ∪ P(t, am(e))

– (e = JOIN(e1, e2) or e = LEFTJOIN(e1, e2)) and jan(e) = LINEAR: required(e) = required(e1) ∪ (required(e2) − available(e1)) and available(e) = available(e1) ∪ available(e2)

– otherwise (e = OP(e1, e2) and OP ∈ {JOIN, LEFTJOIN, UNION}): required(e) = required(e1) ∪ required(e2) and available(e) = available(e1) ∪ available(e2)
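The inductive definitions above translate directly into a pair of recursive functions. The sketch below is illustrative only. It assumes that R(t, m) is the set of variables access method m needs bound (the subject for acs, the object for aco, nothing for scan) and that P(t, m) is the set of remaining variables of t; the `TRIPLES` table and the tuple encoding of expressions are hypothetical.

```python
# Hypothetical triple table: name -> (subject, object); variables start with "?".
TRIPLES = {"t1": ("?x", '"Palo Alto"'), "t2": ("?x", "?y")}


def is_var(term):
    return term.startswith("?")


def R(t, m):
    """Assumed required variables: acs needs the subject, aco the object."""
    s, o = TRIPLES[t]
    term = {"acs": s, "aco": o}.get(m)        # scan requires nothing
    return {term} if term and is_var(term) else set()


def P(t, m):
    """Assumed produced variables: every variable of t that is not required."""
    s, o = TRIPLES[t]
    return {x for x in (s, o) if is_var(x)} - R(t, m)


def required(e, am, jan):
    if e[0] == "BGP":
        return R(e[1], am[e[1]])
    _, e1, e2 = e
    r1, r2 = required(e1, am, jan), required(e2, am, jan)
    if e[0] in ("JOIN", "LEFTJOIN") and jan.get(e) == "LINEAR":
        return r1 | (r2 - available(e1, am, jan))
    return r1 | r2


def available(e, am, jan):
    if e[0] == "BGP":
        return R(e[1], am[e[1]]) | P(e[1], am[e[1]])
    _, e1, e2 = e
    return available(e1, am, jan) | available(e2, am, jan)


# JOIN(t1 by object, t2 by subject): t1 produces ?x, which t2 requires.
e = ("JOIN", ("BGP", "t1"), ("BGP", "t2"))
am = {"t1": "aco", "t2": "acs"}
linear_required = required(e, am, {e: "LINEAR"})
regular_required = required(e, am, {e: "REGULAR"})
```

With the LINEAR annotation the join requires no variable, because ?x flows from t1 into t2; with the REGULAR annotation ?x remains required.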

Definition 1 An annotated SPARQL algebraic expression is a tuple (e, am, jan) such that: (1) e is a SPARQL algebraic expression whose BGP sub-expressions consist of a single triple, (2) am is a function that maps each BGP sub-expression of e to an access method m ∈ M, and (3) jan is a function that maps each JOIN or LEFTJOIN sub-expression of e to an element of {PRODUCT, LINEAR, REGULAR} such that, for two algebraic expressions e1 and e2:

– jan(LEFTJOIN(e1, e2)) ∈ {LINEAR, REGULAR}

– jan(JOIN(e1, e2)) = PRODUCT iff available(e1) ∩ available(e2) = ∅

– (jan(op(e1, e2)) = LINEAR ∧ op ∈ {JOIN, LEFTJOIN}) implies required(e2) ∩ available(e1) ≠ ∅

We can now formally define a query plan as an annotated SPARQL algebraic expression that does not require any variable to be provided from the outside.

Definition 2 A query plan is an annotated SPARQL algebraic expression (e,am, jan) such that required(e) = ∅. PL denotes the set of all plans.

Universe of Valid Plans Considered. For a given query q, we define the set EQq of queries equivalent to q after permuting elements in AND patterns of the flattened query flat(q). A query q1 is said to be a syntactic reordering of q2, denoted q1 ∼ q2, when q1 and q2 are syntactically identical after reordering elements in AND patterns of q1.

The AND pattern in which an optional pattern op appears in the original query, before flattening, defines the scope of op and the set mand(op) of mandatory variables for the left outer join operation in the algebra. We need to ensure that an optional pattern is never moved to a position where these mandatory variables are not in scope (or bound). For an AND pattern g = AND(p1, ..., pn) and an integer 0 ≤ i ≤ n, we define inscopevars(g, i) as the set of in-scope variables at position i of the AND pattern g.

\[
\mathit{inscopevars}(g, i) =
\begin{cases}
\emptyset & \text{if } i = 0 \\
\mathit{inscopevars}(g, i-1) \cup \mathit{inscopevars}(p_i) & \text{otherwise}
\end{cases}
\]

where inscopevars(p_i) corresponds to the set of in-scope variables of p_i as defined in section 18.2.1 of the SPARQL specification. For an optional pattern op in flat(q) appearing at position pos(op) of an AND pattern g, its set of bound variables, denoted bound(op), is defined as bound(op) = inscopevars(g, pos(op) − 1) ∩ inscopevars(op).

For a given query q, we can now define a set of equivalent queries EQq as follows:

EQq = {q′ | q′ ∼ flat(q) ∧ for each optional pattern op in q′, mand(op) ⊆ bound(op)}
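As an illustration of how the scoping condition prunes reorderings, the following sketch enumerates the valid orders of a single flattened AND pattern. It is not the implementation described in this paper: the element names, their in-scope variable sets, and the choice mand(opt7) = {y} are assumptions made for the running example.

```python
from itertools import permutations


def inscopevars_at(order, vars_of, i):
    """In-scope variables of the first i elements of an AND pattern."""
    scope = set()
    for p in order[:i]:
        scope |= vars_of[p]
    return scope


def bound(order, vars_of, op):
    pos = order.index(op) + 1          # positions are 1-based, as in the text
    return inscopevars_at(order, vars_of, pos - 1) & vars_of[op]


def equivalent_orders(elements, vars_of, mand):
    """All permutations in which every optional keeps its mandatory variables."""
    return [
        o for o in permutations(elements)
        if all(mand[op] <= bound(o, vars_of, op) for op in mand)
    ]


# Assumed in-scope variables for the six elements of the flattened example.
VARS = {
    "t1": {"x"}, "U": {"x", "y"}, "t4": {"y"},
    "t5": {"y", "z"}, "t6": {"y", "n"}, "opt7": {"y", "m"},
}
MAND = {"opt7": {"y"}}                 # assumed mandatory variables of the OPTIONAL
orders = equivalent_orders(list(VARS), VARS, MAND)
```

Of the 720 permutations, the 144 in which the OPTIONAL comes first or is preceded only by t1 are rejected, because ?y is not yet in scope.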

Finally, the universe of plans considered, Uq, is defined as:

Uq = {p = (e, am, jan) ∈ PL | e = algebra(q′) ∧ q′ ∈ EQq}

If q consists of a single AND group with n triple patterns, and, for each triple pattern, there are k possible access methods, the cardinality of Uq can be as large as n! · k^n (assuming only one implementation for joins and left joins other than PRODUCT).

The Planning Problem. The planning problem consists in finding a minimal-cost plan p ∈ Uq for a query q.

Plans in Uq are obtained by considering all permutations of elements in all AND patterns of the flattened query flat(q); to show that the planning problem is NP-hard, we show that choosing an ordering in a single AND is NP-hard.

Definition 3 (AND Planning Problem) Planning is the process of creating a plan, i.e. a series of plan steps, that covers all the sub-patterns in a single AND node. We formulate a planning problem as P(N, M, A, R, C) in terms of access methods (M), available (A) and required (R) variable functions and costs (C), where N is the set of direct sub-patterns of a given AND node.

Definition 4 (AND Planning Solution) A solution to the planning problem is a graph G whose nodes $G_N$ are pairs in $N \times M$ such that each $n \in N$ occurs exactly once, i.e. $|N| = |G_N| \wedge \forall n \in N\ \exists a \in M : n \times a \in G_N$. The edges $G_E$ connect the nodes; an edge $n_1 \times a_1 \to n_2 \times a_2$ is allowed only if $A(n_1 \times a_1) \cap R(n_2 \times a_2) \neq \emptyset$. For every node $n \times a$, all required variables are provided, i.e.

\[
\forall\, n \times a \in G_N\ \ \forall v \in R(n \times a)\ \ \exists\, n_1 \times a_1 \in G_N : (n_1 \times a_1 \to n \times a) \in G_E \wedge v \in A(n_1 \times a_1).
\]

A topological sort of this graph represents a plan.

We use the cost of G to mean the sum of all node costs, i.e. $\sum_{n \in G_N} C(n)$. A minimal solution is simply one than which no solution with lower cost exists. A minimal solution is what we would ideally like to find in query planning.
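Definition 4 can be checked mechanically. The sketch below validates a candidate solution graph and returns one plan (a topological order) for it. It is a minimal illustration: A, R, and C are given as plain dictionaries keyed by (pattern, method) pairs, and the cost of 50 for (t2, acs) is a made-up value (only the cost of 30 for (t1, aco) comes from Figure 1(b)). It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def plan_order(N, nodes, edges, A, R):
    """Return a plan (list of (pattern, method) steps), or None if invalid."""
    if sorted(n for n, _ in nodes) != sorted(N):
        return None                    # each n in N must occur exactly once
    for src, dst in edges:
        # An edge is allowed only if it carries at least one required variable.
        if src not in nodes or dst not in nodes or not (A[src] & R[dst]):
            return None
    preds = {node: {s for s, d in edges if d == node} for node in nodes}
    for node, sources in preds.items():
        provided = set().union(*(A[s] for s in sources))
        if not R[node] <= provided:
            return None                # some required variable is not supplied
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None                    # a cyclic flow has no topological sort


N = ["t1", "t2"]
nodes = {("t1", "aco"), ("t2", "acs")}
A = {("t1", "aco"): {"x"}, ("t2", "acs"): {"x", "y"}}
R = {("t1", "aco"): set(), ("t2", "acs"): {"x"}}
C = {("t1", "aco"): 30, ("t2", "acs"): 50}      # 50 is hypothetical
edges = {(("t1", "aco"), ("t2", "acs"))}

plan = plan_order(N, nodes, edges, A, R)
cost = sum(C[n] for n in nodes)
```

Dropping the single edge makes the graph invalid, since nothing then supplies ?x to (t2, acs).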

Definition 5 (TSP Planning) The traveling salesperson problem (TSP) is a classic NP-complete problem that requires a salesperson to visit each of a set of cities using the cheapest possible connections and visiting each city exactly once. Formally, a TSP problem can be formulated as a graph T and a cost function C giving a cost for every edge in T_E. We use C_T to denote costs of edges in T. We translate a TSP problem into our planning problem as follows:

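\[
\begin{aligned}
N &\equiv T_N \\
M &\equiv \{\, a_{e_1 e_2} \mid e_1 \in T_E \wedge e_2 \in T_E \wedge \exists v, v_1, v_2 : e_1 = v_1 \to v \wedge e_2 = v \to v_2 \,\} \\
A(n, a_{e_1 e_2}) &\equiv
\begin{cases}
v_{e_2} & \exists n_1, n_2 : e_1 = n_1 \to n \wedge e_2 = n \to n_2 \\
\emptyset & \text{otherwise}
\end{cases} \\
R(n, a_{e_1 e_2}) &\equiv
\begin{cases}
v_{e_1} & \exists n_1, n_2 : e_1 = n_1 \to n \wedge e_2 = n \to n_2 \\
\bot & \text{otherwise}
\end{cases} \\
C(n \times a_{e_1 e_2}) &\equiv
\begin{cases}
C_T(e_2) & \exists n_i, n_j : e_1 = n_i \to n \wedge e_2 = n \to n_j \\
\infty & \text{otherwise}
\end{cases}
\end{aligned}
\]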
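The translation in Definition 5 is simple enough to write out. The sketch below builds N, M, A, R, and C from a small directed graph. The encoding is an assumption made for illustration: edges are (source, target) pairs, the variable of an edge e is represented as ("v", e), and ⊥ is a sentinel that no access method produces.

```python
import math

BOTTOM = "⊥"      # a variable that no access method ever produces


def tsp_to_planning(t_nodes, t_cost):
    """t_cost maps each directed edge (u, v) of T to its cost C_T."""
    edges = list(t_cost)
    N = set(t_nodes)
    # One access method per pair (incoming edge, outgoing edge) of some node.
    M = [(e1, e2) for e1 in edges for e2 in edges if e1[1] == e2[0]]

    def fits(n, a):
        e1, e2 = a
        return e1[1] == n and e2[0] == n

    def A(n, a):
        return {("v", a[1])} if fits(n, a) else set()

    def R(n, a):
        return {("v", a[0])} if fits(n, a) else {BOTTOM}

    def C(n, a):
        return t_cost[a[1]] if fits(n, a) else math.inf

    return N, M, A, R, C


# A three-city cycle a -> b -> c -> a.
T_COST = {("a", "b"): 1, ("b", "c"): 2, ("c", "a"): 3}
N, M, A, R, C = tsp_to_planning("abc", T_COST)
via_b = (("a", "b"), ("b", "c"))       # enter b through a->b, leave through b->c
```

The list comprehension that builds M is quadratic in the number of edges, which is the bound used in the proof below.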

Theorem 1 (AND Planning is NP-Hard). Finding a minimal solution to the AND planning problem is NP-hard.

Proof. The proof is by reduction from TSP. We show first how to solve TSP as a planning problem, and second that the construction of a planning problem given a TSP problem is polynomial. A minimal solution to the TSP planning problem (Definition 5) for graph T is a solution to the original TSP problem, i.e. it denotes a lowest-cost path that covers all the nodes in the original graph exactly once. For each node n in T, the possible nodes in the planning problem are n × a_{e1 e2} for all possible pairs where e1 is an incoming edge of n and e2 is an outgoing edge of n. Other access methods are not possible since they all require ⊥, which is never produced. Since exactly one of these nodes must be in the plan solution, it follows that every solution traverses n precisely once. This is true for every such node n, hence any solution must traverse each node exactly once, and hence is a tour. All such paths are permitted since each pair of incident edges for each node is defined as an access method; therefore, if planning were not NP-hard, we could find the cheapest such path efficiently. Clearly, constructing the planning problem from the original graph T is polynomial: the sets V (of variables) and N are linear in the size of T, and M is at most quadratic. Hence, planning must be NP-hard by reduction from TSP.


3 Integer Linear Programming Approach

For a query q, the universe Uq of all plans defined in section 2 is too large foran exhaustive search of an element with the lowest cost. For q with 15 triplepatterns and 3 access methods for each triple, assuming enough compute powerto generate and cost a billion plans per second, 594 years are needed! To solvethe query planning problem, we cast it as an integer programming problem.
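The estimate above follows from the n! · k^n bound of Section 2 and can be reproduced with a few lines of arithmetic. The conversion uses a year of 365.25 days, which is our assumption.

```python
from math import factorial

n, k = 15, 3                              # triple patterns, access methods per triple
plans = factorial(n) * k ** n             # upper bound n! * k^n on the size of Uq
seconds = plans / 1e9                     # at one billion plans per second
years = seconds / (365.25 * 24 * 3600)    # assuming 365.25-day years
```

Here `plans` is on the order of 1.9 × 10^19, and `years` comes out just above 594.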

In this section, we present a principled and general approach to correctly solve an arbitrarily complex query planning problem by casting it into an Integer Linear Programming (ILP) problem. It consists of the following key steps:

– Control-aware Data Flow Construction. The access methods applicable to a given triple pattern depend on the variables that are available (in-scope variables) when it is evaluated. Since patterns typically share variables, the evaluation of one is often dependent on the evaluation of another. For example, in Figure 1, the triple pattern t1 shares the variable ?x with triple patterns t2 and t3 appearing in the union. Hence, there is an inter-dependency between t1 and the union pattern containing t2 and t3 as, depending on the execution methods used and the order of execution of t1 and UNION(t2, t3), the variable ?x may “flow” from t1 to UNION(t2, t3) or in the reverse direction. The Data Flow Construction step builds a data structure that captures all potentially valid ways in which variables can “flow” between various parts of the query. This data flow structure is control-aware because it explicitly rules out variable flows that would clearly violate the semantics of control statements in the query. For example, the ?y variable shared between t2 and t3 cannot be produced by one and used by the other because it would violate the semantics of a UNION pattern.

– Constraint Generation. To ensure completeness (i.e., all plans in Uq are considered), the Control-aware Data Flow has to capture all potentially valid flows and execution orders. Unfortunately, it also contains many invalid flows and execution orders that cannot be ruled out a priori. For example, it encodes a cyclic flow of variable ?x from t1 to UNION(t2, t3) and from UNION(t2, t3) to t1. To ensure soundness (i.e., all solutions of the ILP problem can be converted into a valid plan in Uq), the constraint generation step generates, from the Control-aware Data Flow structure, constraints that “dynamically” rule out all invalid flows and execution orders (e.g., constraints ruling out cyclic data flows). These constraints constitute the linear constraints of the ILP problem formulation.

– Cost Function Formulation. The cost function is expressed as a linear expression of the various elements of the Control-aware Data Flow structure. It is such that, in an optimal plan, cheaper patterns (in terms of estimated variable bindings) are evaluated first before feeding their bindings to more expensive patterns.

– Solving the Resulting ILP Problem. Using an optimized ILP solver (e.g., IBM ILOG CPLEX), we solve the ILP problem of minimizing the cost function under the generated set of constraints.


– Conversion of an ILP Solution into a Plan. Finally, an ILP solution is converted into a valid plan in Uq.
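To give a feel for the shape of such a formulation, the sketch below writes a toy 0/1 linear program for two triple patterns and solves it by exhaustive enumeration rather than with an ILP solver. It is a deliberately simplified stand-in, not the formulation of this paper: it has one binary variable per (triple, access method) component, a "one method per triple" constraint, a linear supply constraint for required variables, and a cost that is a plain sum. It omits cycle-elimination constraints, which this tiny instance does not need. The costs for t1 are taken from Figure 1(b); the costs for t2 and the required/produced sets are hypothetical.

```python
from itertools import product

# (triple, method): (required vars, produced vars, cost)
COMPONENTS = {
    ("t1", "scan"): (set(), {"x"}, 106),
    ("t1", "acs"): ({"x"}, set(), 100),
    ("t1", "aco"): (set(), {"x"}, 30),
    ("t2", "scan"): (set(), {"x", "y"}, 200),   # t2 costs are hypothetical
    ("t2", "acs"): ({"x"}, {"y"}, 10),
    ("t2", "aco"): ({"y"}, {"x"}, 50),
}
KEYS = sorted(COMPONENTS)


def feasible(x):
    # Constraint 1: exactly one access method is active per triple.
    for t in {t for t, _ in KEYS}:
        if sum(x[c] for c in KEYS if c[0] == t) != 1:
            return False
    # Constraint 2: x[c] <= sum of x[d] over components d of other triples
    # that produce v, for every variable v required by c.
    for c in KEYS:
        for v in COMPONENTS[c][0]:
            suppliers = sum(
                x[d] for d in KEYS if d[0] != c[0] and v in COMPONENTS[d][1]
            )
            if x[c] > suppliers:
                return False
    return True


def solve():
    best = None
    for bits in product((0, 1), repeat=len(KEYS)):
        x = dict(zip(KEYS, bits))
        if not feasible(x):
            continue
        cost = sum(COMPONENTS[c][2] * x[c] for c in KEYS)   # linear objective
        if best is None or cost < best[0]:
            best = (cost, {c for c in KEYS if x[c]})
    return best


best_cost, active = solve()
```

The enumeration visits 2^6 assignments, which is only viable for toy instances; the point of handing the same constraints to a solver is to avoid exactly this blow-up.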

3.1 Control-aware Data Flow Construction

Our approach to building a Control-aware Data Flow for an arbitrarily complex graph pattern is inspired by electronic circuit design. A Control-aware Data Flow consists of a set of hierarchically nested components. A component c is responsible for the evaluation of an arbitrarily complex graph or triple pattern p, which is the key of c (denoted key(c)). Multiple components may be assigned to the same key. In this case, they represent alternative ways of evaluating their key (e.g., multiple access methods for the same triple pattern).

A component can be viewed from the outside (i.e., its external view) as a black box connected to a set of input pins (one for each variable it may need to perform its function), and a set of output pins (one for each variable that may become available to other components as a result of its evaluation). Each pin can be in one of two states: activated or deactivated. An activated input pin indicates that its corresponding variable is indeed available to use inside the black box. An activated output pin indicates that its corresponding variable is available to other components after the evaluation of its black box. Likewise, the black box to which input and output pins are connected can be activated (i.e., enabled and performing its function) or deactivated (disabled). The external view of the component representing our query example has no input pins and produces ?x, ?y, ?z, ?m, ?n.

From the inside (i.e., its internal view), a component responsible for the evaluation of a pattern can be viewed as performing its function by:

1. Wiring inputs it receives from the outside and outputs of some of its internal sub-components to inputs of other internal sub-components.

2. The exact nature of the wiring is dictated by the semantics of the graph pattern type (e.g. UNION, OPTIONAL, AND, etc.). For example, some components (e.g., UNION) disallow connections between variables of their sub-components (e.g., variable ?y produced by t2 cannot be fed to ?y in t3).

3. Since we do not know a priori the optimal data flow inside a component, we have to conservatively consider all potentially valid wirings. However, in a given plan (solution to our problem), only a subset of wires will be activated.

A component is formally defined as follows:

Definition 6 Let V be an infinite set of variables. Let T be a finite set of types. A component C of depth d (for d a non-negative integer) is a triple (EV, IV, t), where t ∈ T is the type of the component C.

EV, called its external view, is defined as a pair ($G^e$, var) consisting of:

– the directed graph $G^e = (V^e = IP \cup \{bb\} \cup OP,\ E^e)$, whose set of vertices $V^e$ is partitioned into three disjoint sets (the singleton {bb} containing the black box, the set IP of input pins, and the set OP of output pins), and whose set of edges $E^e$ is as follows: $E^e = \{(p, bb) \mid p \in IP\} \cup \{(bb, p) \mid p \in OP\}$. bb is called the black box of the external view EV. Elements of IP (resp. OP) are called input (resp. output) pins of the external view EV. $G^e$ is called the external graph of C.

– the function var maps an element of IP ∪ OP to a variable in V such that if p1 and p2 are distinct pins in IP (or in OP), then var(p1) ≠ var(p2).

[Figure: internal sub-components of the top-level component over the variables x, y, z, n, m: one component per triple pattern and access method, such as (t1, acs), (t1, aco), (t1, scan), (t4, acs), (t4, aco), (t4, scan), (t5, acs), (t5, aco), (t5, scan), (t6, acs), (t6, aco), (t6, scan), together with the UNION and OPTIONAL components, five JOIN components, and one PRODUCT component.]

Fig. 2. Top Level Component for q

EV is also uniquely characterized by the 4-tuple (IP, bb, OP, var) consisting of input pins, black box, output pins, and the variable function. The function input (resp. output) maps a component to its set IP (resp. OP) of input (resp. output) pins. The function blackbox maps a component to its black box.

IV, called the internal view of C, is defined by a pair (SC, $G^i$) consisting of:

– A finite set of components SC. If d = 0, then SC = ∅; otherwise, SC is made of components $C_k$ of depth $d_k$ such that $0 \le d_k < d$ and there is a component $C_j \in SC$ whose depth is $d_j = d - 1$. Elements of SC are called sub-components of C.

– A graph $G^i = (V^i, E^i)$, called the internal graph of C and representing all potentially valid data flows inside C, such that

  • The set $V^i$ of internal vertices consists of the vertices in all external graphs of sub-components of C: $V^i = \{ n \mid n \in V^e_k \wedge EV_k = (G^e_k = (V^e_k, E^e_k), var_k) \wedge C_k = (EV_k, IV_k, t_k) \in SC \}$.

  • The set $E^i$ of internal edges contains all edges in all external graphs of sub-components of C: $S = \{ (n, n') \mid (n, n') \in E^e_k \wedge EV_k = (G^e_k = (V^e_k, E^e_k), var_k) \wedge C_k = (EV_k, IV_k, t_k) \in SC \} \subseteq E^i$.

  • If $(n, n') \in E^i$ does not belong to the external graph of any sub-component of C (i.e., $(n, n') \notin S$), then it must be an edge between an output pin n in the external view of a sub-component $C_k$ and an input pin n′ in the external view of a sub-component $C_j$ such that $C_k \neq C_j$ and n and n′ are associated with the same variable (i.e., $var_k(n) = var_j(n')$).


[Figure: internal view of the UNION component over the variables x, y, z, n, m: proxy components P(t1, scan), P(t1, acs), P(t1, aco), P(t4, scan), P(t4, acs), P(t4, aco), P(t5, scan), P(t5, acs), P(t5, aco), P(t6, scan), P(t6, acs), P(t6, aco), P(JOIN), P(PRODUCT), and P(OPTIONAL), connected to the two child components AND{t2} and AND{t3}.]

Fig. 3. Union Component in q

The function internalGraph maps a component to its internal graph $G^i$. The function subcomp maps a component to its internal sub-components SC.
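The external view of Definition 6 maps naturally onto a small data structure. The sketch below is one possible encoding, not the one used in our implementation: the `Component` class, its field names, and the tuple representation of pins are all hypothetical. Because var assigns distinct variables to the pins on each side, a set of variable names per side is enough to identify the pins.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple


@dataclass
class Component:
    name: str                                   # e.g. "t1,acs"
    ctype: str                                  # the type t of the component
    inputs: FrozenSet[str]                      # variables with an input pin
    outputs: FrozenSet[str]                     # variables with an output pin
    subcomponents: List["Component"] = field(default_factory=list)
    internal_edges: Set[Tuple] = field(default_factory=set)

    def external_graph(self):
        """Vertices IP ∪ {bb} ∪ OP and edges {(p, bb)} ∪ {(bb, p)}."""
        bb = (self.name, "bb")
        ip = {(self.name, "in", v) for v in self.inputs}
        op = {(self.name, "out", v) for v in self.outputs}
        vertices = ip | {bb} | op
        edges = {(p, bb) for p in ip} | {(bb, p) for p in op}
        return vertices, edges


# (t1, acs) inside the top-level AND: needs x, may expose all in-scope variables.
t1_acs = Component("t1,acs", "acs", frozenset("x"), frozenset("xyznm"))
vertices, edges = t1_acs.external_graph()
```

For this component the external graph has seven vertices (one input pin, the black box, five output pins) and six edges.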

Definition 7 In the internal view (SC, $G^i = (V^i, E^i)$) of a component C, a sub-component P1 = (EV1 = (IP1, bb1, OP1), IV1, var1) is called a potential predecessor of a component P2 = (EV2 = (IP2, bb2, OP2), IV2, var2) iff an output pin of P1 is connected to an input pin of P2, i.e., there is (op1, ip2) ∈ $E^i$ such that op1 ∈ OP1 and ip2 ∈ IP2.

If P1 is a potential predecessor of P2, then, in the internal view of P2, there is a special component P′1 of type PROXY, whose external view has no input variables and has as many output variables as P1. P′1 represents P1 inside the internal view of P2. The delegator function maps each sub-component of type PROXY of a component P2 to the unique potential predecessor of P2 it represents.

A component C of type PROXY is a direct proxy iff delegator(C) is not of type PROXY.

Algorithm 1, invoked with a graph pattern GP to evaluate and an empty set of potential predecessors, builds components responsible for the evaluation of GP (i.e. components representing alternative evaluation strategies for GP).

We now illustrate the definition of a component (Definition 6) and Algorithms 1 and 2 on our running example. Figure 2 shows all the internal sub-components of the top-level component responsible for the evaluation of the main pattern of our query. There are three components associated with each triple pattern, one for each access method (lines 4-12 of Algorithm 1). Triple access methods that do not require any variables (e.g., (t1, aco)) only have their produced variables as output variables: they are potential independent starting points of the evaluation and do not depend on any other components. However, in an AND pattern, a sub-component c that may need at least one variable (e.g., (t1, acs)) has as output pins all in-scope variables occurring in the AND pattern. This is needed because its required input variables may be provided by another sub-component d (corresponding to a join or left join with d), and d may have available, after its evaluation, variables not produced by c, which will also remain available after c's evaluation and should, therefore, appear in the set of potential output variables of c. In addition to sub-components responsible for the evaluation of each sub-pattern of the main pattern, Figure 2 shows two kinds of special components: one product component and five join components. A product component represents a product operation performed on the two components connected to its input pins, whereas a join component corresponds to a regular join performed on the two components connected to its input pins. Since the top pattern has 6 sub-patterns with two join variables (x and y), there can be at most 5 (6-1) regular joins and 1 (2-1) product (construction of join and product components is done in lines 18-26 of Algorithm 2). Figure 2 shows all potential connections to the x input pin of the component (t1, acs) (dotted lines) and all potential connections to the x input pin of the union component (continuous lines). In the internal view of a component associated with an AND group, for each variable x, all output pins corresponding to x are connected to all input pins corresponding to x (lines 29-33 of Algorithm 2).
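The wiring rule for AND groups (every output pin for a variable connected to every input pin for that variable on a different sub-component) can be sketched as follows. This is an illustration rather than a transcription of Algorithm 2: sub-components are reduced to hypothetical (name, input variables, output variables) triples, and the variable sets below are chosen only to keep the example small.

```python
def wire_and(subs):
    """Connect each output pin to every matching input pin of another sub-component.

    subs: list of (name, input_vars, output_vars) triples.
    """
    edges = set()
    for src, _, outs in subs:
        for dst, ins, _ in subs:
            if src == dst:
                continue                     # no wire from a component to itself
            for v in outs & ins:
                edges.add(((src, "out", v), (dst, "in", v)))
    return edges


SUBS = [
    ("t1,aco", set(), {"x"}),
    ("t1,acs", {"x"}, {"x", "y"}),
    ("UNION", {"x", "y"}, {"x", "y"}),
]
wires = wire_and(SUBS)
```

In this small instance the wiring already contains both a wire from (t1, acs) to the union and one from the union back to (t1, acs), the kind of potential cyclic flow that the constraints of Section 3.2 have to exclude.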

Figure 3 shows the internal view of the union component, whose external view is present in the internal view of the top-level component. The first three rows of its sub-components (from the top) consist of proxy components. Two sub-components (AND(t2) and AND(t3)), called child components, are responsible for the evaluation of the two sub-patterns of the union pattern. Finally, Figure 3 shows all connections between the external views of all sub-components of the union component (lines 13-15 of Algorithm 2). As opposed to the internal view for an AND component, the connections in the internal view of a union component are limited to connections from proxy components to child components.

3.2 Constraint Generation

As mentioned earlier, not all data flows captured by components built by Algorithm 1 are valid. In this section, we introduce a set of constraints to rule out invalid flows.

Decision Variables and Candidate Solutions. Given a set of components C responsible for the evaluation of a pattern GP, the function α, referred to as the decision variable function, maps each vertex n (resp. edge (n1, n2)) in the external or internal graphs of components in C and their direct and indirect sub-components to a unique boolean variable α(n) (resp. α((n1, n2))) that indicates whether n (resp. (n1, n2)) is activated. The range of α, denoted range(α), contains all the decision variables associated with C. A candidate solution is a function δ from range(α) to {0, 1}. It assigns an activation state (0 or 1) to each vertex and edge of components directly or indirectly contained in C. Given a candidate solution δ and a set of components C, the corresponding candidate component solution, denoted ∆(C), is the set of components obtained by retaining only


Algorithm 1: MakeComponents

Data: (GP, PR) where GP is a graph pattern and PR is the set of components containing all potential predecessors of components to create.

Result: C a set of components responsible for the evaluation of GP

1 begin
2   C ← ∅;
3   available ← outputVariables(PR);
4   if type(GP) = TRIPLE then
5     foreach acm ∈ M do
6       input ← R(GP, acm);
7       if input = ∅ then available ← P(GP, acm) else available ← available ∪ P(GP, acm)
        proxies ← makeProxies(PR);
8       EV ← makeExternalView(input, available);
9       C ← C ∪ (EV, (proxies, ⋃_{p ∈ proxies} externalGraph(p)), acm);
10    end
11  else
12    if type(GP) = OPTIONAL then input ← mand(GP) else input ← inscopevars(GP)
      available ← available ∪ input;
13    EV ← makeExternalView(input, available);
14    IV ← makeInternalView(GP, PR);
15    C ← C ∪ (EV, IV, type(GP));
16  end
17 end


Algorithm 2: MakeInternalView

Data: (GP, PR) where GP is a graph pattern and PR is the set of components containing all potential predecessors of the internal view to create.

Result: (SC, G^i = (V^i, E^i)) the internal view of the component responsible for the evaluation of GP

1 begin
2   (V^i, E^i, SC, children) ← (∅, ∅, ∅, ∅);
3   available ← outputVariables(PR);
4   proxies ← makeProxies(PR);
5   V^i ← ⋃_{p ∈ proxies} vertices(externalGraph(p));
6   E^i ← ⋃_{p ∈ proxies} edges(externalGraph(p));
7   foreach sub-pattern SGP of GP do
8     foreach (EV = (G^e = (V^e, E^e), var), IV, t) ∈ makeComponents(SGP, proxies) do
9       (V^i, E^i) ← (V^i ∪ V^e, E^i ∪ E^e);
10      children ← children ∪ (EV, IV, t);
11    end
12  end
13  foreach (EV1 = (IP1, bb1, OP1, var1), IV1, t1) ∈ proxies do
14    foreach (EV2 = (IP2, bb2, OP2, var2), IV2, t2) ∈ children do
        E^i ← E^i ∪ {(n1, n2) | n1 ∈ OP1 ∧ n2 ∈ IP2 ∧ var1(n1) = var2(n2)}
15  end
16  if type(GP) = AND then
17    available ← available ∪ inscopevars(GP);
18    for i ← 1 to |subPatterns(GP)| − 1 do
19      EV = (G^e = (V^e, E^e), var) ← makeExternalView(inscopevars(GP), available);
20      (V^i, E^i) ← (V^i ∪ V^e, E^i ∪ E^e);
21      children ← children ∪ (EV, (∅, ∅), JOIN);
22    end
23    for i ← 1 to |joinVariables(GP)| − 1 do
24      EV = (G^e = (V^e, E^e), var) ← makeExternalView(inscopevars(GP), available);
25      (V^i, E^i) ← (V^i ∪ V^e, E^i ∪ E^e);
26      children ← children ∪ (EV, (∅, ∅), PRODUCT);
27    end
28    foreach p ∈ children do addPredecessors(p, children − {p})
      foreach C1 = (EV1 = (IP1, bb1, OP1, var1), IV1, t1) ∈ children do
29      foreach C2 = (EV2 = (IP2, bb2, OP2, var2), IV2, t2) ∈ children do
30        if C1 ≠ C2 then
            E^i ← E^i ∪ {(n1, n2) | n1 ∈ OP1 ∧ n2 ∈ IP2 ∧ var1(n1) = var2(n2) ∧ (t1 ≠ PROXY ∨ t2 ∉ {JOIN, PRODUCT})}
31      end
32    end
33  end
34  SC ← proxies ∪ children;
35 end


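activated vertices and edges (i.e., elements of the set ACT = α^{-1}(δ^{-1}({1}))). It is defined as follows:

$$
\begin{aligned}
\Delta(C) = \{\, c \mid{} & \delta(\alpha(\mathrm{blackbox}(c))) = 1 \;\wedge\; c' \in C \text{ s.t. } \mathrm{blackbox}(c) = \mathrm{blackbox}(c') \\
& \wedge\; \mathrm{input}(c) = \mathrm{input}(c') \cap ACT \;\wedge\; \mathrm{output}(c) = \mathrm{output}(c') \cap ACT \\
& \wedge\; \mathrm{subcomp}(c) = \Delta(\mathrm{subcomp}(c')) \\
& \wedge\; ig = \mathrm{internalGraph}(c) \;\wedge\; ig' = \mathrm{internalGraph}(c') \\
& \wedge\; \mathrm{vertices}(ig) = \mathrm{vertices}(ig') \cap ACT \\
& \wedge\; \mathrm{edges}(ig) = \{ (v, v') \in \mathrm{edges}(ig') \mid \delta(\alpha(v)) = 1 \wedge \delta(\alpha(v')) = 1 \wedge \delta(\alpha((v, v'))) = 1 \} \,\}
\end{aligned}
$$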

Figure 1(d) shows the internal view of the top level component for an optimum component solution ∆(C). This solution corresponds to the right-hand plan shown in Figure 1(c).

Constraint Definition and Classification. For a query q, whose main graph pattern is GP, and for C returned by Algorithm 1, most candidate solutions δ are invalid in the sense that the corresponding candidate component solutions ∆(C) cannot be converted into a valid plan in Uq (e.g., if ∆(C) still contains cyclic data flows). We introduce constraints to rule out invalid candidate solutions. A constraint is a logical expression written as a function of decision variables that expresses a relation that must hold for all valid candidate solutions. We express a constraint as a linear inequality of the form a0 × x0 + ... + ak × xk ≥ b or a0 × x0 + ... + ak × xk ≤ b, where k is a positive integer, and, for 0 ≤ i ≤ k, ai and b are real number constants and xi are decision variables. Constraints fall into one of the following categories: generic component constraints, generic graph constraints, predecessor constraints, output pin constraints, and component-type specific constraints.

Generic Component Constraints. Generic component constraints are applicable to the external view of every component. They enforce the semantics of an external view as defined in Definition 6.

(C1) If a black box bb is not activated (i.e., α(bb) = 0), then each of its input or output pins p is also deactivated (i.e., α(p) = 0): α(p) ≤ α(bb)

(C2) A pin p is connected to its black box bb iff it is activated:
(C2-a) For p an input pin: α((p, bb)) = α(p)
(C2-b) For p an output pin: α((bb, p)) = α(p)

(C3) In the internal view of a component c, whose internal graph is G = (V, E), if an input pin ip of a sub-component sc of c is activated, then it must have at least one activated incoming edge:

$$\sum_{(op, ip) \in E} \alpha((op, ip)) \ge \alpha(ip)$$

(C4) Each key k (query fragment) must be executed exactly once:

$$\sum_{c \text{ s.t. } \mathrm{key}(c) = k} \alpha(\mathrm{blackbox}(c)) = 1$$

Components of types JOIN, PRODUCT, and PROXY are not associated with any key, so this constraint does not apply to them.


Generic Graph Constraints. Generic graph constraints enforce proper data flow semantics.

(C5) If an edge (n, m) is activated, then nodes n and m must also be activated: α(n) + α(m) ≥ 2 × α((n, m))

(C6) The internal graph G = (V, E) of a component c must be acyclic. For each vertex v ∈ V, we map v to a new integer decision variable representing its position, denoted pos(v), such that 0 ≤ pos(v) ≤ |V| − 1 (where |V| denotes the cardinality of the set V). The position associated with each vertex introduces an implicit ordering that we use to informally express the acyclicity constraint as follows: if an edge (n, m) is activated, then pos(n) + 1 ≤ pos(m) (i.e., pos(n) < pos(m)). The formal ILP acyclicity constraint is expressed as follows for an edge (n, m) ∈ E:

pos(n) + 1 + (|V| × (α((n, m)) − 1)) ≤ pos(m)

Note that if (n, m) is activated (i.e., α((n, m)) = 1), the previous constraint becomes what we wanted (i.e., pos(n) < pos(m)); otherwise, it is always satisfied, as pos(n) + 1 − |V| ≤ 0 (by definition of pos(n)) and pos(m) is a non-negative integer.
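The effect of (C6) can be checked by brute force on a tiny graph. The following is a minimal sketch, not the authors' implementation: the function names and the three-vertex example are assumptions made for illustration, and the position variables are simply enumerated instead of being handed to an ILP solver.

```python
from itertools import product

def acyclicity_holds(pos, alpha_edge, edges, n_vertices):
    """Check pos(n) + 1 + |V| * (alpha((n, m)) - 1) <= pos(m) for every edge."""
    return all(
        pos[n] + 1 + n_vertices * (alpha_edge[(n, m)] - 1) <= pos[m]
        for (n, m) in edges
    )

def has_valid_positions(vertices, edges, alpha_edge):
    """Brute-force search for positions in 0..|V|-1 that satisfy (C6)."""
    k = len(vertices)
    for values in product(range(k), repeat=k):
        pos = dict(zip(vertices, values))
        if acyclicity_holds(pos, alpha_edge, edges, k):
            return True
    return False

# Hypothetical internal graph whose potential edges form a cycle a -> b -> c -> a.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]

all_active = {e: 1 for e in edges}                      # the full cycle is activated
one_dropped = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 0}

cycle_feasible = has_valid_positions(vertices, edges, all_active)
dag_feasible = has_valid_positions(vertices, edges, one_dropped)
```

With every edge of the cycle activated, no position assignment exists, whereas deactivating one edge makes the constraint satisfiable (for instance with positions 0, 1, 2).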

Predecessor Constraints. These constraints enforce the semantics of potential predecessors as defined in Definition 7, together with additional constraints that ensure that every valid solution can be converted into a valid plan in Uq.

(C7) In the internal view of a component c = (EVc = (IPc, bbc, OPc, varc), IVc, tc), the proxy component pp representing a potential predecessor p of c (i.e., delegator(pp) = p) is activated iff at least one of the output pins of p is connected to one of the input pins of c. Let EVp = (IPp, bbp, OPp, varp) be the external view of the potential predecessor p of c and let IV = (SC, G = (V, E)) be the internal view of the component d that has both c and p as its sub-components (i.e., c and p are in SC). The formal ILP constraint is as follows (see footnote 2):

$$\alpha(\mathrm{blackbox}(pp)) = \max_{(op, ip) \in E \text{ s.t. } op \in OP_p \,\wedge\, ip \in IP_c} \alpha((op, ip))$$

Footnote 2: A constraint with min and max can easily be translated into a standard linear constraint, and most ILP solvers directly support them.

(C8) As explained in section 3.4, in the translation of the solution to the ILP problem into a plan in Uq, a predecessor of c that is not of type PROXY is joined (or left joined) with c, whereas a predecessor of type PROXY simply allows already bound variables to be used in access methods inside c. Since join operators in plans in Uq have exactly two operands, we have to limit the maximum number, M, of direct proxies dp (i.e., delegator(dp) is not of type PROXY) that an activated component can have to 1 (except for components of type JOIN and PRODUCT, which must have exactly two direct proxies). Let S be the linear expression of the number of activated direct predecessor proxies of a component c:

$$S = \sum_{\substack{pp \in \mathrm{subcomp}(c) \,\wedge\, \mathrm{type}(pp) = \mathrm{PROXY} \\ \wedge\, \mathrm{type}(\mathrm{delegator}(pp)) \ne \mathrm{PROXY}}} \alpha(\mathrm{blackbox}(pp))$$

(C8-a) M is equal to 1 for all components except join and product components, for which it is equal to 2: S ≤ M × α(blackbox(c))

(C8-b) The minimum number, m, of direct predecessor proxies is 0 for all components except join and product components, for which it is 2: S ≥ m × α(blackbox(c))
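The footnote to (C7) states that max constraints can be turned into standard linear constraints. A minimal sketch of one common encoding for 0/1 variables is given below; it is an assumption for illustration (the paper does not say which encoding or solver feature was used), and the helper names are hypothetical. For binary y and x_1..x_k, y = max(x_i) is enforced by y ≥ x_i for every i together with y ≤ Σ x_i.

```python
from itertools import product

def max_linearization_ok(y, xs):
    """Linear encoding of y = max(xs) for 0/1 variables:
    y >= x_i for every i, and y <= sum(xs)."""
    return all(y >= x for x in xs) and y <= sum(xs)

def check_all(k):
    """For every 0/1 assignment of k variables, the only feasible y must be max(xs)."""
    for xs in product((0, 1), repeat=k):
        feasible = [y for y in (0, 1) if max_linearization_ok(y, xs)]
        if feasible != [max(xs)]:
            return False
    return True

# Exhaustive check over all assignments of three edge-activation variables.
encoding_is_exact = check_all(3)
```

The enumeration confirms that, for binary variables, the linear inequalities admit exactly the value of the max, which is what (C7) needs for α(blackbox(pp)).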

(C9) A component p provides the value of a variable x to a component c with at least one non-proxy sub-component (i.e., p’s x output pin is connected to c’s x input pin) iff the x output pin of the predecessor proxy pp representing p in c is connected to another sub-component of c. This constraint ensures that variables provided by predecessors are indeed used by sub-components. Let op be the output pin of p for variable x. Let ip be the input pin of c for x. Let G = (V, E) be the internal graph of c and let op′ be the output pin for x of the proxy pp. The formal ILP constraints are expressed as follows:

(C9-a) $$|\{(op', v) \mid (op', v) \in E\}| \times \alpha((op, ip)) \ge \sum_{(op', v) \in E} \alpha((op', v))$$

(C9-b) $$\alpha((op, ip)) \le \sum_{(op', v) \in E} \alpha((op', v))$$

Note that these constraints do not apply to components without non-proxy sub-components, such as join, product, proxy and simple triple access method components.

(C10) The activation status of output pins is identical in an activated proxy and the potential predecessor it represents. Let pp be a proxy representing a component p (i.e., delegator(pp) = p). For each variable x in the set of output variables of p, let op_x be the output pin of p associated with x and let op′_x be the output pin of pp associated with x:

(C10-a) α(op_x) + (α(blackbox(pp)) − 1) ≤ α(op′_x)
(C10-b) α(op′_x) + (α(blackbox(pp)) − 1) ≤ α(op_x)

Output Pin Constraints. These constraints control the default activation of output pins of components. A variable is available after the execution of a component c whose type is different from PROXY iff it is either an in-scope variable of the graph pattern associated with c (C11) or a variable provided by a direct predecessor (C12). Note that the activation status of the output pins of proxy components is controlled by constraints (C10) (see footnote 3).

(C11) If a non-proxy component c, responsible for the evaluation of a graph pattern GP (i.e., key(c) = GP), is activated, then all the output pins of c associated with in-scope variables of GP must be activated (as these variables are available after the execution of c). Let op be an output pin of c for a variable x ∈ inscopevars(GP): α(op) ≥ α(blackbox(c))

Footnote 3: The implementation of a Minus component also overrides these default activation constraints.

(C12) Let op be an output pin for a variable x of a non-proxy component c that is either associated with no keys (e.g., join or product) or associated with a graph pattern GP s.t. x ∉ inscopevars(GP). op is activated iff at least one direct predecessor proxy pp in the internal view of c has an activated output pin associated with x:

$$\alpha(op) = \max_{op' \in \mathrm{output}(pp) \,\wedge\, pp \in \mathrm{subcomp}(c) \,\wedge\, \mathrm{isDirectProxy}(pp)} \alpha(op')$$

where isDirectProxy(pp) is defined as isDirectProxy(pp) = (type(pp) = PROXY) ∧ (type(delegator(pp)) ≠ PROXY).

Component-type Specific Constraints. These constraints are applicable to components of a specific type.

(C13) An activated component c = (EV = (IP, bb, OP, var), IV, acm) associated with a triple pattern tp = key(c) (i.e., c’s type acm is in the set of access methods M) must have all its input pins corresponding to required variables of the access method acm activated. For x ∈ R(tp, acm) and ip ∈ IP s.t. var(ip) = x: α(ip) ≥ α(blackbox(c))

(C14) Plans in Uq are rooted trees. For a component c of type AND whose internal graph is G = (V, E), let G′ = (V′, E′) be the inverse of G restricted to the set SC′ of non-proxy sub-components with at least one input pin or one output pin (i.e., SC′ = {sc | sc ∈ subcomp(c) ∧ type(sc) ≠ PROXY ∧ input(sc) ∪ output(sc) ≠ ∅}). To ensure that a valid ILP candidate solution can be translated into a rooted tree, G′ must be a rooted tree. This constraint is enforced by the following two specific constraints:

(C14a) A sub-component in the internal view of c is defined as a sink iff it is activated and has no outgoing edges. There must be at most one sink in the set SC′:

$$\sum_{sc \in SC'} \Big( \alpha(\mathrm{blackbox}(sc)) - \max_{(op, ip) \in E \text{ s.t. } op \in \mathrm{output}(sc)} \alpha((op, ip)) \Big) \le 1$$

(C14b) A sub-component sc ∈ SC′ of c can have activated outgoing edges to at most one other sub-component of c (i.e., there is at most one activated predecessor proxy pp s.t. delegator(pp) = sc across internal views of all sub-components of c). Formally, for sc ∈ SC′:

$$\sum_{\substack{pp \in \mathrm{subcomp}(s) \,\wedge\, s \in \mathrm{subcomp}(c) \\ \wedge\, \mathrm{type}(pp) = \mathrm{PROXY} \,\wedge\, \mathrm{delegator}(pp) = sc}} \alpha(pp) \le 1$$
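The sink-counting expression of (C14a) can be illustrated on toy data. The sketch below is a simplification made for illustration only: it collapses pins and proxies into per-component lists of outgoing edge activations, and the component names are hypothetical.

```python
def sink_expression(active, out_edges):
    """alpha(blackbox(sc)) minus the max activation over outgoing edges, as in (C14a).
    The result is 1 exactly when sc is activated and has no activated outgoing edge."""
    return active - max(out_edges, default=0)

def at_most_one_sink(components):
    """components maps a name to (activation, list of outgoing edge activations)."""
    return sum(sink_expression(a, outs) for a, outs in components.values()) <= 1

# Hypothetical AND component: t1 and t4 both feed an activated join, the only sink.
chain = {"t1": (1, [1]), "t4": (1, [1]), "join": (1, [])}

# Same sub-components, but the join is deactivated, leaving two disconnected sinks.
forest = {"t1": (1, []), "t4": (1, []), "join": (0, [0])}

chain_ok = at_most_one_sink(chain)
forest_ok = at_most_one_sink(forest)
```

In the first configuration the sum is exactly 1, so the constraint holds; in the second, two activated sub-components have no outgoing edge and the constraint rejects the candidate solution.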


(C15) The two activated predecessors of a merger component c (i.e., a component of type JOIN or PRODUCT) must have at least one activated output variable in common if the type of c is JOIN; otherwise (i.e., type(c) = PRODUCT), they should have no activated output variable in common. We first introduce the expression CV that indicates whether the predecessors of c have a common variable, and then we use it to express merger component constraints. To define CV, for each variable x associated with an input pin of c, we introduce a new boolean decision variable CV_x. CV_x indicates whether all the direct proxies in the internal view of c have x as a common variable. CV_x satisfies the following two constraints: CV_x ≥ S_x − 1 and S_x ≥ 2 × CV_x, where S_x is the following expression indicating the number of direct proxies in c that have an activated output pin associated with x:

$$S_x = \sum_{\substack{pp \in \mathrm{subcomp}(c) \\ \wedge\, \mathrm{isDirectProxy}(pp)}} \begin{cases} \alpha(op) & \text{if } op \in \mathrm{output}(pp) \wedge \mathrm{var}(op) = x \\ 0 & \text{otherwise} \end{cases}$$

Since a merger component has exactly two activated direct proxies (see constraints (C8-a) and (C8-b)), the first constraint on CV_x ensures that if x is a variable common to all predecessors, then CV_x = 1, and the second constraint ensures that if CV_x = 1, then x is a variable common to all predecessors.

$$CV = \max_{x \,\mid\, \exists ip \in \mathrm{input}(c) \text{ s.t. } \mathrm{var}(ip) = x} CV_x$$

(C15a) For c a component of type JOIN: CV ≥ α(blackbox(c))

(C15b) For c a component of type PRODUCT: (1 − CV) ≥ α(blackbox(c))
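The two inequalities on CV_x in (C15) can be verified by enumeration. This is a small illustrative sketch (the function name is hypothetical): since an activated merger component has exactly two direct proxies, S_x can only be 0, 1 or 2.

```python
def feasible_cv(s_x):
    """Values of CV_x in {0, 1} allowed by CV_x >= S_x - 1 and S_x >= 2 * CV_x."""
    return [cv for cv in (0, 1) if cv >= s_x - 1 and s_x >= 2 * cv]

# S_x counts the direct proxies (0, 1 or 2) with an activated output pin for x.
cv_by_count = {s_x: feasible_cv(s_x) for s_x in (0, 1, 2)}
```

For each value of S_x exactly one value of CV_x remains feasible, and it is 1 only when both direct proxies expose x.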

(C16) Plans in Uq are such that join or product operations have at most one operand that is either a join, product or left outer join operation. In our translation of an ILP solution into a valid plan in Uq (see section 3.4 for more details), a join operation can be introduced in a plan through the translation of an explicit join component or when a component c has as predecessor a non-proxy component p (this results in the translation of c as a join between the translation of p and the translation of the internal view of c). Thus, to be able to convert an ILP solution into a plan in Uq, we need to ensure that a join or product component has at most one non-proxy predecessor p such that p itself has non-proxy predecessors. For a component c, the expression hasNPP(c) indicates whether c has at least one non-proxy predecessor: if c has no direct proxy sub-component, hasNPP(c) = 0; otherwise,

$$\mathrm{hasNPP}(c) = \max_{pp \in \mathrm{subcomp}(c) \,\wedge\, \mathrm{isDirectProxy}(pp)} \alpha(\mathrm{blackbox}(pp))$$

For a component c of type JOIN or PRODUCT, the constraint can now be formally expressed as:

$$\sum_{\substack{pp \in \mathrm{subcomp}(c) \,\wedge\, \mathrm{isDirectProxy}(pp) \\ \wedge\, \mathrm{delegator}(pp) = p}} \min(\alpha(\mathrm{blackbox}(pp)), \mathrm{hasNPP}(p)) \le 1$$


The last constraints, (C17) to (C20), enforce the proper semantics of an optional pattern and the left join operation it is translated into (see section 3.4 for more details about the translation of optional components).

(C17) An activated direct proxy pp in the internal view of a component c of type OPTIONAL must have all the mandatory variables of the optional pattern key(c) associated with c activated in its output. Let pp be a direct proxy sub-component of c,

(C17a) if the output variables of pp do not contain all mandatory variables of key(c) (i.e., mand(key(c)) ⊈ vars(output(pp))), then α(blackbox(pp)) = 0

(C17b) otherwise (i.e., mand(key(c)) ⊆ vars(output(pp))):

$$\sum_{\substack{x \in \mathrm{mand}(\mathrm{key}(c)) \,\wedge\, op \in \mathrm{output}(pp) \\ \wedge\, \mathrm{var}(op) = x}} \alpha(op) \ge \alpha(\mathrm{blackbox}(pp)) \times |\mathrm{mand}(\mathrm{key}(c))|$$

(C18) If an activated component c of type OPTIONAL has no non-proxy predecessors, then it must be a predecessor of a component j of type JOIN. Let G = (V, E) be the internal graph of the component d that has c as one of its sub-components.

$$\sum_{\substack{j \in \mathrm{subcomp}(d) \,\wedge\, j = (EV, IV, \mathrm{JOIN}) \\ \wedge\, EV = (IP, bb, OP, var)}} \;\max_{\substack{(op, ip) \in E \text{ s.t.} \\ op \in \mathrm{output}(c) \,\wedge\, ip \in IP}} \alpha((op, ip)) \ge \alpha(\mathrm{blackbox}(c)) - \mathrm{hasNPP}(c)$$

where hasNPP(c) is defined in (C16).

(C19) A join component c can have at most one predecessor p of type OPTIONAL such that p has no non-proxy predecessors.

$$\sum_{\substack{pp \in \mathrm{subcomp}(c) \,\wedge\, \mathrm{isDirectProxy}(pp) \\ \wedge\, \mathrm{delegator}(pp) = p}} \min(\alpha(\mathrm{blackbox}(pp)), \mathrm{hasNPP}(p)) \le 1$$

(C20) If a join component c has a predecessor p1 of type OPTIONAL such that p1 has no non-proxy predecessors, then the other predecessor p2 must have all the mandatory variables of the optional pattern key(p1) associated with p1 activated on its output pins. This constraint is formally expressed as follows. Let pp1 be a proxy sub-component of a join component c such that delegator(pp1) = p1 and type(p1) = OPTIONAL, and let pp2 be any other direct proxy sub-component of c different from pp1:

(C20a) if the output variables of pp2 do not contain all mandatory variables of key(p1) (i.e., mand(key(p1)) ⊈ vars(output(pp2))), then α(blackbox(pp2)) = 0

(C20b) otherwise (i.e., mand(key(p1)) ⊆ vars(output(pp2))):

$$\sum_{\substack{x \in \mathrm{mand}(\mathrm{key}(p1)) \,\wedge\, op \in \mathrm{output}(pp2) \\ \wedge\, \mathrm{var}(op) = x}} \alpha(op) \ge \alpha(\mathrm{blackbox}(pp2)) \times |\mathrm{mand}(\mathrm{key}(p1))|$$


3.3 Cost Function Formulation

For each component c, we associate a new positive real number variable, denoted cost(c), for the cost of c. The cost structure of a component c whose type is different from PROXY is defined as:

$$\mathrm{cost}(c) = \lambda_0 + \sum_{\substack{sc \in \mathrm{subcomp}(c) \\ \wedge\, \mathrm{type}(sc) \ne \mathrm{PROXY}}} \lambda_{sc} \times \mathrm{cost}(sc) + \sum_{\substack{sc \in \mathrm{subcomp}(c) \\ \wedge\, \mathrm{isDirectProxy}(sc)}} \lambda'_{sc} \times \mathrm{cost}(sc)$$

where λ_0, λ_sc, and λ′_sc for sc ∈ subcomp(c) are positive real number constants whose values depend on the type of c and its sub-components.

For example, for a component c associated with a triple pattern (i.e., key(c) is a triple pattern) with the access method acm ∈ M, λ_0 is the cost of evaluating the triple pattern key(c) using access method acm.

For a component c of type PROXY representing a component p (i.e., delegator(c) = p),

1. if c is activated, then cost(c) = cost(p). This is expressed using the following two ILP constraints:

(a) cost(c) + MAXCOST × (α(blackbox(c)) − 1) ≤ cost(p)
(b) cost(p) + MAXCOST × (α(blackbox(c)) − 1) ≤ cost(c)

2. if c is not activated, then cost(c) = 0, which is expressed using the following ILP constraint: cost(c) − MAXCOST × α(blackbox(c)) ≤ 0

where MAXCOST is an upper bound on the cost of all components. A value of MAXCOST can be computed by conservatively assuming that all components are activated. However, in practice, instead of relying on the previous three linear constraints, we use explicit if-then constraints provided by ILP solvers such as IBM CPLEX. This avoids numerical instabilities that could occur due to the potentially large value of MAXCOST.
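The three linear constraints on the cost of a proxy can be checked by enumeration. The sketch below is illustrative only: the value of MAXCOST, the restriction to integer costs, and the function names are assumptions, and non-negativity of costs is taken from the definition of cost(c).

```python
MAXCOST = 100  # hypothetical upper bound on the cost of any component

def proxy_cost_constraints_hold(cost_c, cost_p, active):
    """The three linear constraints tying cost(c) of a proxy c to cost(p)."""
    a = cost_c + MAXCOST * (active - 1) <= cost_p   # (a): active => cost(c) <= cost(p)
    b = cost_p + MAXCOST * (active - 1) <= cost_c   # (b): active => cost(p) <= cost(c)
    z = cost_c - MAXCOST * active <= 0              # inactive => cost(c) <= 0
    return a and b and z

def feasible_proxy_costs(cost_p, active):
    """Integer proxy costs in 0..MAXCOST that satisfy all three constraints."""
    return [c for c in range(MAXCOST + 1)
            if proxy_cost_constraints_hold(c, cost_p, active)]

active_costs = feasible_proxy_costs(cost_p=37, active=1)
inactive_costs = feasible_proxy_costs(cost_p=37, active=0)
```

When the proxy is activated, the only feasible cost is that of the component it represents; when it is deactivated, constraints (a) and (b) become vacuous and the third constraint forces the cost to 0.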

For a query q whose main graph pattern is GP and such that the invocation of Algorithm 1 with arguments GP and ∅ returns a set C of components, the ILP problem to solve is as follows:

$$\text{minimize } \sum_{c \in C} \mathrm{cost}(c)$$

subject to all constraints defined in section 3.2 and the cost constraints defined in this section.

3.4 Soundness and Completeness

Before presenting soundness and completeness results, we briefly introduce important notations. Let q be a query whose main graph pattern is GP. Let C be the set of components returned by the invocation of Algorithm 1 with arguments GP and ∅. The set Φq denotes the set of constraints generated for C and presented in section 3.2. The set of candidate solutions satisfying all constraints in Φq is denoted ILPSq. For δ ∈ ILPSq, cost(δ) is defined as:

$$\mathrm{cost}(\delta) = \sum_{c \in C} \delta(\mathrm{cost}(c))$$


Finally, in a plan p ∈ Uq, for some operators (REGULARJOIN, PRODUCT, and UNION) the order of evaluation of their operands does not affect the total estimated cost (see footnote 4). We say that two plans p1 and p2 are cost equivalent, denoted p1 ≈ p2, iff one can be transformed into the other by a sequence of applications of the commutative operation com and the associative operation asso on REGULARJOIN, PRODUCT, and UNION. For op ∈ {REGULARJOIN, PRODUCT, UNION}, com(op(e1, e2)) = op(e2, e1) and asso(op(e1, op(e2, e3))) = op(op(e1, e2), e3).
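One way to decide cost equivalence is to compare canonical forms. The following is a minimal sketch under stated assumptions, not part of the paper: plans are encoded as nested tuples (op, left, right) with strings as leaves, and the operator tag "LINEARJOIN" is a hypothetical name for an operator whose operand order matters. Nested applications of the same commutative and associative operator are flattened and their operands sorted.

```python
COMMUTATIVE_OPS = {"REGULARJOIN", "PRODUCT", "UNION"}

def canonical(plan):
    """Normal form of a plan: leaves are strings, inner nodes are (op, left, right)."""
    if isinstance(plan, str):
        return plan
    op, *operands = plan
    operands = [canonical(e) for e in operands]
    if op not in COMMUTATIVE_OPS:
        return (op, *operands)          # operand order matters, keep it
    flat = []
    for e in operands:
        # Associativity: absorb a nested node that uses the same operator.
        if isinstance(e, tuple) and e[0] == op:
            flat.extend(e[1:])
        else:
            flat.append(e)
    # Commutativity: operand order is irrelevant, so sort deterministically.
    return (op, *sorted(flat, key=repr))

def cost_equivalent(p1, p2):
    """True iff the two plans have the same canonical form."""
    return canonical(p1) == canonical(p2)

p1 = ("REGULARJOIN", "t1", ("REGULARJOIN", "t2", "t3"))
p2 = ("REGULARJOIN", ("REGULARJOIN", "t3", "t1"), "t2")
p3 = ("LINEARJOIN", "t1", "t2")
p4 = ("LINEARJOIN", "t2", "t1")
```

Here p1 and p2 differ only by applications of com and asso and are therefore reported as equivalent, while swapping the operands of the order-sensitive operator in p3 and p4 is not.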

The soundness and completeness of the ILP approach is established by the following theorem:

Theorem 2. Let q be a query whose main graph pattern is GP. There exists a pair of functions (β, σ) such that β is a function from Uq to ILPSq and σ is a function from ILPSq to Uq such that:

1. if p ∈ Uq, then σ(β(p)) ≈ p
2. if δ ∈ ILPSq, then cost(β(σ(δ))) = cost(δ)

Proof. Algorithm 4 shows a concrete implementation of β, which converts a plan in Uq into a candidate solution in ILPSq.

Algorithm 3 shows a concrete implementation of σ, which converts a candidate solution in ILPSq into a plan in Uq. For REGULARJOIN, PRODUCT, and UNION, whose order of evaluation of operands does not affect the total estimated cost, the plan produced by σ is such that operands with lower estimated cost are evaluated first (thanks to sorting in increasing cost performed at lines 6, 16, and 36).

Properties (1) and (2) are satisfied by (β, σ) because (a) β is such that if p1 and p2 are in Uq and β(p1) = β(p2), then p1 ≈ p2, (b) two distinct candidate solutions δ1, δ2 in ILPSq are mapped to the same plan p only when they differ by the proxy predecessors (i.e., predecessors of type PROXY) used to access already bound variables, and (c) proxy predecessors result in indirect proxies in the internal view of their successors (which are not costed; see section 3.3).

4 Evaluation

To examine the effectiveness of the ILP-based planner as a testing framework, we conducted experiments with 5 different benchmarks: LUBM [5], SP2Bench [17], DBpedia [13], UOBM [11] and a private benchmark, PRBench, used in earlier work [1]. Our focus in this paper was to determine whether the ILP-based planner could be used to test the greedy approach outlined in the DB2RDF system [1], given that this is one relatively mature implementation of a greedy approach to SPARQL planning. Our evaluation of the ILP testing framework had two goals: (1) to demonstrate that the framework can actually compute optimal plans for

4 Note that this is not the case for a LINEAR JOIN


Algorithm 3: toAnnotatedExpr

Data: (δ, c) where δ is a candidate solution satisfying all the ILP constraints, c is an activated component for the solution δ, and ∆(c)’s internal view IV = (SC, G = (V0 ∪ V1, E1)) has vertices in V0 without input or output pins and vertices in V1 with at least one pin.

Result: (e, am, jan) the annotated algebraic expression of ∆(c)

1 begin
2   switch type(c) do
3     case type(c) ∈ M: (e, am(e), ∅) ← (key(c), type(c), ∅)
      case OPTIONAL: (e, am, jan) ← toAnnotatedExpr(δ, uniqueNonProxy(SC))
      case UNION:
4       l ← sortByCost(nonProxies(SC));
5       (e, am, jan) ← (null, ∅, ∅);
6       for i ← 2 to |l| − 1 do
7         (e_i, am_i, jan_i) ← toAnnotatedExpr(δ, l_i);
8         e ← e = null ? e_i : UNION(e, e_i);
9         (am, jan) ← union((am, jan), (am_i, jan_i));
10      end
11    endsw
12    case AND:
13      (e, am, jan) ← (null, ∅, ∅);
14      foreach sc s.t. blackbox(sc) ∈ sortByCost(V0) do
15        (e_sc, am_sc, jan_sc) ← toAnnotatedExpr(δ, sc);
16        e ← e = null ? e_sc : PRODUCT(e, e_sc);
17        (am, jan) ← union((am, jan), (am_sc, jan_sc));
18      end
19      l ← topologicalSort(G1 = (V1, E1), {sc | blackbox(sc) ∈ V1});
20      map ← ∅;
21      for i ← 0 to |l| − 1 do
22        if type(l_i) ∉ {JOIN, PRODUCT} then
23          (e_i, am_i, jan_i) ← toAnnotatedExpr(δ, l_i);
24          (am, jan) ← union((am, jan), (am_i, jan_i));
25          if i = 0 ∧ e ≠ null then e_i ← PRODUCT(e, e_i)
            p ← uniqueNonProxyPredecessor(l_i);
26          if p ≠ null then
27            e_p ← getValue(map, p);
28            if type(l_i) = OPTIONAL then e_i ← LEFTJOIN(e_p, e_i) else e_i ← JOIN(e_p, e_i)
              jan(e_i) ← LINEAR;
29          end
30        else
31          (p1, p2) ← sortByCost(nonProxyPredecessors(l_i));
32          (a, b) ← (getValue(map, p1), getValue(map, p2));
33          if type(l_i) = PRODUCT then e_i ← PRODUCT(a, b) else
34            if p1 & p2 not optional w/o non-proxy predecessors then e_i ← JOIN(a, b) else
35              // p2 is the optional without non-proxy predecessors;
36              e_i ← LEFTJOIN(a, b);
37          end
38          jan(e_i) ← REGULAR;
39        end
40        map ← (l_i, e_i); e ← e_i;
41      end
42    endsw
43  endsw
44 end


Algorithm 4: toILPSolution

Data: (ae = (e, am, jan), ρ, C, p, PP) where ae is an annotated algebraic expression, ρ is a function that associates a SPARQL algebraic expression u to the set of SPARQL graph patterns it represents, C is a set of components c such that key(c) contains all elements in ρ(e), p is the unique activated non-proxy predecessor of the activated component in C, and PP is a set of activated potential proxy predecessors.

Result: δ a candidate solution satisfying all the ILP constraints

1 begin
2   switch type(e) do
3     case e = BGP(t)
4       foreach c ∈ C do
5         if type(c) ≠ am(e) then δ(α(blackbox(c))) ← 0 else
6           δ(α(blackbox(c))) ← 1;
7           setInputPins(c, required(e));
8           connectToPredecessors(c, {p}, PP, required(e));
9           setOutputPins(c, vars(output(p)) ∪ available(e));
10        end
11      end
12    endsw
13    case e = LEFTJOIN(e1, e2) or e = JOIN(e1, e2)
14      c ← unique element of C;
15      if δ(α(blackbox(c))) is not defined then
16        δ(α(blackbox(c))) ← 1;
17        setOutputPins(c, vars(output(p)));
18        foreach m ∈ subcomp(c) s.t. type(m) ∈ {JOIN, OPTIONAL} do δ(α(blackbox(m))) ← 0
19      end
20      setInputPins(c, required(e));
21      setOutputPins(c, available(e));
22      C1 ← findComponents(e1, ρ, C);
23      C2 ← findComponents(e2, ρ, C);
24      δ1 ← toILPSolution((e1, am, jan), ρ, C1, p, PP);
25      if jan(e) = LINEAR then δ2 ← toILPSolution((e2, am, jan), ρ, C2, ∆1(C1), PP) else
26        δ2 ← toILPSolution((e2, am, jan), ρ, C2, null, PP);
27        if jan(e) = PRODUCT then j ← getInactivatedProductSubComp(c) else j ← getInactivatedJoinSubComp(c)
          δ(α(blackbox(j))) ← 1;
28        jinput ← vars(output(∆1(C1))) ∪ vars(output(∆2(C2)));
29        setInputPins(j, jinput);
30        connectToPredecessors(j, {∆1(C1), ∆2(C2)}, ∅, jinput);
31        setOutputPins(j, jinput);
32      end
33      δ ← union(δ, δ1, δ2);
34    endsw
35    case e = UNION(e1, e2)
36      c ← unique element of C;
37      if δ(α(blackbox(c))) is not defined then
38        δ(α(blackbox(c))) ← 1;
39        setOutputPins(c, vars(output(p)));
40      end
41      setInputPins(c, required(e));
42      setOutputPins(c, available(e));
43      connectToPredecessors(c, {p}, PP, required(e));
44      C1 ← findComponents(e1, ρ, C);
45      C2 ← findComponents(e2, ρ, C);
46      δ1 ← toILPSolution((e1, am, jan), ρ, C1, p, PP);
47      δ2 ← toILPSolution((e2, am, jan), ρ, C2, p, PP);
48      δ ← union(δ, δ1, δ2);
49    endsw
50  endsw
51 end


a wide variety of queries, and (2) to determine if the framework could be used to uncover optimization opportunities in a mature planner.

We describe each of the benchmarks briefly:

• LUBM: The LUBM benchmark consists of 12 queries and an ontology that was modified to OWL QL expressivity.

• UOBM: The UOBM benchmark consists of 14 queries and an ontology that was modified to OWL QL expressivity. OWL QL query expansion is applied to LUBM and UOBM queries.

• SP2Bench: SP2Bench is an extract of DBLP data with corresponding SPARQL queries. We used this benchmark as is, with no modifications.

• DBpedia: The DBpedia SPARQL benchmark is a set of query templates derived from actual query logs against the public DBpedia SPARQL endpoint [13]. We used these templates with the DBpedia 3.7 dataset, and obtained 20 queries that had non-empty result sets.

• PRBench: The private benchmark reflects data from a tool integration scenario where specific information about the same software artifacts is generated by different tools, and RDF data provides an integrated view of these artifacts across tools. The benchmark has fairly complex queries, and is therefore a good test for the ILP planner because the search space is large.

Experiments were conducted on a machine with two 2.3 GHz processors, each with 6 cores, and 64 GB of RAM (the maximum Java heap size allocated was 5 GB) running 64-bit Linux. The ILP solver used is IBM ILOG CPLEX Version 12.5.

For each query, we first computed an optimal plan op using our ILP approach. Then, we translated the plan returned by the greedy planner into a solution s to the ILP planning problem. Finally, we compared the cost of s with the cost of the optimal solution op to check the optimality of the greedy plan. Table 1 shows a summary of the ILP results on the 5 benchmarks. The average ILP planning time per benchmark ranged from 0.45 to 27.2 s, which indicates that the ILP approach is practical for testing SPARQL planners, especially if one considers the size of the search space for many of the 91 queries that were tested. A significant number of queries (66/91) were planned in under 1 second. The slowest planning problem took 11 minutes, for the largest query in PRBench (a 1005-line query).

Further, as shown in Table 1, the framework helped identify 7 cases where the greedy plans were suboptimal. For at least one of those cases, the ILP planner’s optimal plan helped us identify obvious opportunities for improving the greedy algorithm. Specifically, the greedy planner in DB2RDF missed opportunities for exploiting star queries (i.e., queries on the same entity, for which the DB2RDF [1] data layout is designed to provide a very efficient evaluation without any join) due to heuristics that did not adequately reflect the performance gain from stars. Once the optimal plans highlighted the problem, we were able to tune the greedy planner with better heuristics and verify that these new heuristics made that plan optimal with negligible added overhead. In the other 6 cases, it was quite clear that any greedy approach would arrive at a suboptimal plan.


| Dataset | #Queries | Avg Time (s) | StDev (s) | Min - Max (s) | #Queries < 1 s | #Suboptimal |
|---|---|---|---|---|---|---|
| LUBM | 12 | 0.4 | 0.5 | 0.02 - 1.3 | 9 | 1 |
| UOBM | 14 | 2 | 6 | 0.02 - 24 | 11 | 2 |
| SP2Bench | 17 | 1.7 | 3.2 | 0.01 - 10 | 8 | 2 |
| DBpedia | 20 | 1 | 2.2 | 0.02 - 6.5 | 15 | 2 |
| PRBench | 28 | 27.2 | 127 | 0.06 - 673 | 23 | None |

Table 1. Query optimality results
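The aggregate figures quoted in the text can be recomputed from the rows of Table 1. The short sketch below only re-adds the published per-benchmark numbers; the variable names are illustrative, "None" in the last column is encoded as 0, and the query-weighted overall average is a derived quantity that does not appear in the paper.

```python
# (queries, avg time in s, queries under 1 s, suboptimal greedy plans) per benchmark
table1 = {
    "LUBM":     (12, 0.4, 9, 1),
    "UOBM":     (14, 2.0, 11, 2),
    "SP2Bench": (17, 1.7, 8, 2),
    "DBpedia":  (20, 1.0, 15, 2),
    "PRBench":  (28, 27.2, 23, 0),
}

total_queries = sum(q for q, _, _, _ in table1.values())
under_one_second = sum(u for _, _, u, _ in table1.values())
suboptimal = sum(s for _, _, _, s in table1.values())

# Query-weighted mean planning time over all benchmarks (derived, not from the paper).
weighted_avg = sum(q * t for q, t, _, _ in table1.values()) / total_queries
```

The totals agree with the 91 queries, the 66 queries planned in under one second, and the 7 suboptimal greedy plans mentioned in the text.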

5 Related Work and Conclusion

Query optimization has been researched in the context relational databases.Greedy and heuristic based algorithms have been introduced to avoid the expo-nential cost of producing all possible plans, at the expense of query performance.At the base of relational optimizers is the System-R optimization framework [18]which was subsequently extended to Starburst [6] and Volcano/Cascade [3, 4].Numerous techniques have been proposed to improve query performance in rela-tional databases and are referenced in [9, 10, 2]. There is currently a large bodyof research in SPARQL query optimization based on graph-specific greedy andheuristic algorithms. Typical approaches perform bottom-up SPARQL query op-timization, i.e., individual triples [19] or conjunctive SPARQL patterns [7, 15] areindependently optimized, and then the optimizer orders and merges these indi-vidual plans into one global plan. These approaches rely on statistics [12] to as-sign costs to query plans. Optimization here is restricted to conjunctive patterns.The work in [20] contrasts with these approaches and adopts a heuristic based op-timization mechanism where the statistics are ignored. [1] focusses on importantcharacteristics of SPARQL queries, often with deep, nested sub-queries whoseinter-relationships are lost when optimizations are limited by the scope of singletriple or individual conjunctive patterns (as in prior work). The query optimizercaptures the inherent inter-relationships due to the sharing of common variablesor constants of different query components. These inter-relationships often spanthe boundaries of simple conjuncts and are often across the different levels ofnesting of a query, i.e., they are not visible to existing bottom-up optimizers.

In this paper, we investigated the optimal SPARQL query planning problem in the context of offline query optimization and planner testing. We formally introduced the universe of alternative query plans for an input query q. To solve the planning problem efficiently, we devised an approach that casts it as an ILP problem. We experimented with well-known datasets and large numbers of queries, and illustrated that our approach consistently finds optimal plans in a reasonable amount of time (a few minutes in the worst case).
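The gap between a heuristic planner and an optimal one can be illustrated on a toy join-ordering instance. The sketch below is not the ILP formulation of this paper: exhaustive enumeration of left-deep orders stands in for the solver, and the cardinalities, selectivities, cost model (sum of estimated intermediate result sizes), and all names are hypothetical.

```python
from itertools import permutations

# Hypothetical statistics, not taken from the paper: estimated cardinality
# of each triple pattern and selectivity of each pairwise join.
CARD = {"t1": 1000, "t2": 50, "t3": 200, "t4": 10}
SEL = {
    frozenset(("t1", "t2")): 0.01,
    frozenset(("t2", "t3")): 0.5,
    frozenset(("t3", "t4")): 0.001,
    frozenset(("t1", "t4")): 0.2,
}


def selectivity(a, b):
    """Selectivity of joining patterns a and b (1.0 means cross product)."""
    return SEL.get(frozenset((a, b)), 1.0)


def plan_cost(order):
    """Toy cost of a left-deep plan: sum of estimated intermediate sizes."""
    size = CARD[order[0]]
    cost = size
    joined = [order[0]]
    for t in order[1:]:
        size *= CARD[t]
        for j in joined:
            size *= selectivity(j, t)
        joined.append(t)
        cost += size
    return cost


def greedy_plan():
    """Start from the smallest pattern, then always add the cheapest next one."""
    remaining = set(CARD)
    first = min(remaining, key=lambda t: (CARD[t], t))
    order = [first]
    remaining.remove(first)
    while remaining:
        nxt = min(remaining, key=lambda t: (plan_cost(order + [t]), t))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


def optimal_plan():
    """Enumerate every left-deep order; feasible only for tiny queries."""
    return min(permutations(sorted(CARD)), key=plan_cost)


if __name__ == "__main__":
    g, o = greedy_plan(), optimal_plan()
    print("greedy :", g, plan_cost(g))
    print("optimal:", o, plan_cost(o))
```

By construction the enumerated plan is never more expensive than the greedy one under the same cost model, so the difference between the two costs gives a simple measure of how far a heuristic is from optimal. Enumeration grows factorially with the number of patterns, so it is usable only on very small queries.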

References

1. Bornea, M., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., Bishwaranjan, B.: Building an efficient RDF store over a relational database. In: Proceedings of the ACM SIGMOD conference. SIGMOD 2013 (2013)

2. Chaudhuri, S.: An overview of query optimization in relational systems. In: SIGACT-SIGMOD-SIGART. pp. 34–43 (1998)

3. Graefe, G.: The cascades framework for query optimization. Data Engineering Bulletin 18 (1995)

4. Graefe, G., DeWitt, D.J.: The exodus optimizer generator. pp. 160–172 (1987)

5. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics 3(2–3), 158–182 (2005)

6. Haas, L.M., Freytag, J.C., Lohman, G.M., Pirahesh, H.: Extensible query processing in starburst. SIGMOD Rec. pp. 377–388 (1989)

7. Hartig, O., Heese, R.: The SPARQL Query Graph Model for Query Optimization. pp. 564–578 (2007)

8. Ibaraki, T., Kameda, T.: On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst. 9(3), 482–502 (Sep 1984), http://doi.acm.org/10.1145/1270.1498

9. Ioannidis, Y.E.: Query optimization. In: The Computer Science and Engineering Handbook, pp. 1038–1057 (1997)

10. Jarke, M., Koch, J.: Query optimization in database systems. ACM Comput. Surv. pp. 111–152 (1984)

11. Ma, L., Yang, Y., Qiu, Z., Xie, G., Pan, Y., Liu, S.: Towards a complete OWL ontology benchmark. pp. 125–139. ESWC’06 (2006), http://dx.doi.org/10.1007/11762256_12

12. Maduko, A., Anyanwu, K., Sheth, A., Schliekelman, P.: Estimating the cardinality of RDF graph patterns. In: WWW. pp. 1233–1234 (2007)

13. Morsey, M., Lehmann, J., Auer, S., Ngonga Ngomo, A.C.: DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data. In: ISWC 2011 (2011)

14. Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD. pp. 28–36 (1988)

15. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. The VLDB Journal 19(1), 91–113 (Feb 2010)

16. Poosala, V., Ioannidis, Y.E., Haas, P.J., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: SIGMOD. pp. 294–305 (1996)

17. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: A SPARQL Performance Benchmark. CoRR abs/0806.4627 (2008)

18. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD (1979)

19. Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., Reynolds, D.: SPARQL basic graph pattern optimization using selectivity estimation. In: WWW (2008)

20. Tsialiamanis, P., Sidirourgos, L., Fundulaki, I., Christophides, V., Boncz, P.: Heuristics-based query optimisation for SPARQL. In: EDBT. pp. 324–335 (2012)
