centro de investigacio´n de la web sem´antica center for...

Centro de Investigacion de la Web SemanticaCenter for Semantic Web Research

www.ciws.cl, @ciwschile

Four institutions

PUC, Chile

◮ Marcelo Arenas (Boss)◮ SW, data exchange,

semistrucured data

◮ Juan Reutter◮ SW, graph DBs, DLs

◮ Cristian Riveros◮ data exchange, semistr. data

◮ Jorge Baier◮ planning, search

Univ. of Talca

Renzo Angles (SW)

UTFSM

Carlos Buil (SW)

Univ. of Chile

◮ Pablo Barcelo (Deputy)◮ graph DBs, DB theory

◮ Claudio Gutierrez◮ SW, graph DBs

◮ Jorge Perez◮ SW, data exchange

◮ Aidan Hogan◮ SW, semistr. data

◮ Barbara Poblete◮ web mining, SNA

◮ Benjamın Bustos◮ multimedia

Query Languages for Graph DBs(Bridging the Gap Between Theory and Practice)

Pablo Barcelo and Aidan HoganDCC, Universidad de Chile

Center for Semantic Web Research (www.ciws.cl)

BACKGROUND AND OBJECTIVES

Graph databases

Trendy applications:

◮ Social network analysis

◮ Semantic web

◮ Scientific databases

◮ Software bug localization

◮ Geo-routing

Graph databases

Trendy applications:

◮ Social network analysis

◮ Semantic web

◮ Scientific databases

◮ Software bug localization

◮ Geo-routing

More in general:

◮ Wherever connections are as important as data

What is a graph database?

A data management system that exposes a graph data model.1

1Graph databases. Robinson, Webber, & Eifrem. O’Reilly, 2013.

What is a graph database?

A data management system that exposes a graph data model.1

Several existing graph DB engines and query languages:

◮ DEX/Sparksee - basic algebra

◮ IBM System G - Gremlin

◮ Neo4J - Cypher

◮ Oracle PGX - PGQL

◮ RDF stores (Virtuoso, AllegroGraph, Oracle, IBM) - SPARQL

1Graph databases. Robinson, Webber, & Eifrem. O’Reilly, 2013.

What graph databases are good for?

◮ Flexible modelling of interconnected data

◮ Agile evolution of the data model

◮ Scalable evaluation of join-intensive queries

My personal story

◮ Since 2009: Working on theory of query languages for graph DBs

◮ Since 2015: Working group of LDBC for the design of such language

My personal story

◮ Since 2009: Working on theory of query languages for graph DBs

◮ Since 2015: Working group of LDBC for the design of such language

Conclusion:

◮ Theory and practice are more connected than expected

Objectives

◮ Identify topics of common interest for theoreticians and developers

◮ Formalize relevant concepts (syntax, semantics, terminology, etc)

◮ Understand tradeoff expressiveness/efficiency

Overview of the course

////////Aidan/////will///////teach///////////Monday,///////////Tuesday,//////and///////////////Wednesday:

◮ ///////////Semantic//////web,////////RDF,/////and///////////SPARQL

◮ /////Half//////class////////based////on////////////////presentation,//////half///////based////on/////lab

Pablo will teach on Thursday and Friday:

◮ Property graphs, graph patterns, path queries

◮ Lectures presenting how theory of graph DBs relates to practice

Syllabus

Syllabus

//////////Monday:////////////Semantic///////web,//////RDF//////data/////////model,////////////Wikidata

//////////Tuesday:////////////SPARQL/////1.0,///////basic////////graph///////////patterns,///////basic/////////////operations//////////(union,/////////////projection,////////////optional),//////////solution/////////////modifiers,///////graph////////////construct/

//////////////Wednesday:////////////SPARQL/////1.1,///////////property////////paths,////////////negation,////////////////aggregation,//////////////distribution,///////////updates

Syllabus




Thursday: Graph DBs, property graphs, graph patterns, expressiveness,complexity of evaluation

Syllabus




Thursday: Graph DBs, property graphs, graph patterns, expressiveness,complexity of evaluation

Friday: Path query languages, regular path queries, ungrouping,expressiveness, complexity

Evaluation

35% in-class labs, 15% applied assignment, 50% theoretical assignment

THE DATA MODEL:PROPERTY GRAPHS

The data model is important as it must be:

◮ Flexible enough to accomodate scenarios of practical interest

◮ Simple enough to allow for a clean presentation

◮ Expressive enough for theoretical issues to appear in full force

The data model is important as it must be:

◮ Flexible enough to accomodate scenarios of practical interest

◮ Simple enough to allow for a clean presentation

◮ Expressive enough for theoretical issues to appear in full force

This is accomplished by the model of property graphs

A bit of history and context is needed


The simplest possible graph data model: Directed graphs


The simplest possible graph data model: Directed graphs

Clint EastwoodThe Bridges of

Madison County

Meryl Streep Out of Africa

Any problems with this data model?


It does not allow to classify relationships


It does not allow to classify relationships


Madison County

C. Eastwood acts in and directed this movie

Edge-labeled graphs

Allow for such classification

Edge-labeled graphs

Allow for such classification


Madison County

acts in

directs

Meryl Streepacts in

Formal definition of edge-labeled graphs

An edge-labelled graph G is a pair (V ,E ), where:

1. V is a finite set of vertices (or nodes).

2. E is a finite set of labeled edges; formally,

E ⊆ V × Lab× V ,

where Lab is a set of labels.

What if actors play different roles in movies?

What if actors play different roles in movies?

Peter Sellers

Lionel Mandrake

plays

Merkin Muffley

plays

Dr.Strangelove

movie

movie

18 minutes

34 minutes

screentime

screentime

Any problems with this approach?


We might need to modify the schema every time we add attributes



More convenient to allow attributes in nodes/edges explicitly



More convenient to allow attributes in nodes/edges explicitly

◮ This corresponds to what property graphs do

A property graph

role = Scorpio

ref = IMDb

name = Clint Eastwood

gender = male

n1 : Person

title = The Bridges of

Madison County

n2 : Movie

role = Robert Kincaid

ref = IMDb

e1 : acts in

e2 : directs

title = Dirty Harry

n5 : Movie

role = Harry Callahan

ref = IMDb

e5 : acts in

name = Andy Robinson

gender = male

n6 : Person

e6 : acts in

Another property graph

firstName = Julie

lastName = Freud

country = Chile

n1 : Person

firstName = John

lastName = Cook

gender = male

country = Chile

n2 : Person

e1 : knows

e2 : knows

name = U2

n3 : Tag

e3 : hasInterest

content = I love U2

language = en

n4 : Post

e4 : hasTag

content = Queen is awesome

n5 : Post

date = 14-09-15

e5 : likes

date = 15-03-14

e6 : likes

date = 23-10-15

e7 : likes

Yet another property graph

authored by

Author

name: Leonid Libkin

affil: U. of Edinburgh

Author

name: Robert Tarjan

affil: Princeton Univ.

Article

name: jacmHopcroft74

year: 1974

Venue

name: FOCS67

authored by

corresponding: YES

Author

Author

Author

Author

name: John Hopcroft

affil: Cornell Univ.

name: Jeff Ullman

affil: Stanford Univ.

name: Ron Fagin

affil: IBM Almaden

name: Moshe Vardi

affil: Rice University

Author

name: Limsoon Wong

affil: NU Singapore

Article

name: FOCSHopcroft67

year: 1967

Article

name: PODSUllman89

year: 1989

Article

name: PODSFaginUV83

year: 1983

Article

name: PODSVardi95

year: 1995

Article

name: PODSLibkin95

year: 1995

Article

name: IPLLibkinW95

year: 1995

Journal

name: JACM

Venue

Venue

Venue

Journal

name: PODS89

name: PODS83

name: PODS95

name: IPL

journal

journal

part of

part of

part of

part of

Conference

Conference

name: FOCS

name: PODS

series

series

part of

serie

s

series

authored by

authored by

authored by

authored by

authored by

corresponding: YES

corresponding: YES

authored by

authored by

cooresponding: YES

cooresponding: YES

authored by

authored by

authored by

cooresponding: YES

What is a property graph then?

◮ It is a graph

◮ It is directed

◮ It is labeled in nodes and edges

◮ Nodes and edges can be attributed

A formal definition

A property graph G is a tuple (V ,E , ρ, λ, σ), where:

1. V is a finite set of vertices (or nodes).

2. E is a finite set of edges.

3. ρ : E → (V × V ) is a total function.◮ ρ(e) = (v1, v2) indicates that e is an edge from v1 to v2.

4. λ : (V ∪ E )→ Lab is a total function with Lab a set of labels.◮ If λ(m) = ℓ, then ℓ is the label of node or edge m.

5. σ : (V ∪ E )× Prop→ Val is a partial functionwith Prop a finite set of properties and Val a set of values.

◮ If σ(m, p) = v , then v is the value of property p in m.

How would you represent property graphs?

How would you represent property graphs?

Labeled graphs: set of triples

Property graphs:set of triples + special triples describing property/value pairs of objects

GRAPH PATTERNS

Basic graph patterns are:

The basic unit for querying property graphs



A graph-structured query that is to be matched against a property graph



A graph-structured query that is to be matched against a property graph

Definition of basic graph pattern (bgp):

◮ A property graph that uses variables and constants

◮ Variables can appear as:node and edge names, labels, properties, and values

Examples of basic graph patterns

x3

x1 x2

acts in acts in

firstName = x1

lastName = x2

x10 : Person

firstName = x3

lastName = x4

x11 : Person

x12 : knows

x13 : knows

x8 = x9

x16 : x7

date = x5

x14 : likes

date = x6

x15 : likes

Evaluation of bgps

Set p(G ) of all possible “matchings” of bgp p in property graph G

Evaluation of bgps

Set p(G ) of all possible “matchings” of bgp p in property graph G

Matching:

◮ Replacement of variables in p by constants such that resultingproperty graph G ′ is “contained” in G

Example of evaluation



x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7



x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e2 directs n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry

n1 e5 acts in n5 Dirty Harry e2 directs n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C.

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C.

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C.

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry

Which semantic is the right semantics?

◮ Homomorphism-based semantics: All matchings are valid

◮ Isomorphism-based semantics: Only 1-1 matchings are valid

◮ Node/edge injective semantics: Self-defining

Different semantics by example

Homomorphism-based semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e2 directs n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry X

n1 e5 acts in n5 Dirty Harry e2 directs n2 The Bridges of Madison C. X

n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry X

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C. X

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. X

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. X

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. X

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. X

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry X


Isomorphism-based semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9



n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry ✗

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C. ✗

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. ✗

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. ✗

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. ✗

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. ✗

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry ✗


Node-injective semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9





n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. ✗

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. ✗





Edge-injective semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9





n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. X

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. X




To make bgps usable we need more features

◮ Projection: Only some variables appear in the output, i.e., π(p(G ))

◮ Filter: Prune based on the values of variables, i.e., σ(π(G ))

◮ Union: Take the union of two bgps, i.e., p1(G ) ∪ p2(G )

◮ Difference: Take the difference of two bgps, i.e., p1(G ) \ p2(G )

◮ Optional: Extend matchings of p1 with those in p2, if possible

Projection

Some variables in the output are distinguished

Projection

Some variables in the output are distinguished

Evaluation under projection:

◮ Consider all possible matchings of the graph pattern

◮ Remove columns that do not correspond to distinguished variables

Example of projection


x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7



x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e2 directs n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry

n1 e5 acts in n5 Dirty Harry e2 directs n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C.

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C.

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C.

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry



x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7

x3 x5 x7 x9

directs The Bridges of Madison C. acts in Dirty Harry

acts in Dirty Harry directs The Bridges of Madison C.

acts in The Bridges of Madison C. acts in Dirty Harry

acts in Dirty Harry acts in The Bridges of Madison C.

acts in The Bridges of Madison C. directs The Bridges of Madison C.

directs The Bridges of Madison C. acts in The Bridges of Madison C.

acts in The Bridges of Madison C. acts in The Bridges of Madison C.

directs The Bridges of Madison C. directs The Bridges of Madison C.

acts in Dirty Harry acts in Dirty Harry

Example of filter


x1 : Person

title = x5

year = z

x4 : Movie

x2 : x3title = x9

year = y

x8 : Movie

x6 : x7

FILTER x5 6= x9 AND y < 1985 AND z < 1985

Examples of union and difference

◮ Movies in which C. Eastwood acted or directed

◮ Movies in which C. Eastwood acted but not directed

OPTIONAL operator

Allows to match parts of the data only if available

◮ Important in the context of semistructured data

◮ Developed by the RDF community

◮ Corresponds to left-outer join in relational algebra

An example with OPTIONAL

The property graph

evaluated in

Paper

year: 1985

name: AKV85

Conference

name: STOC85

published in

Paper

year: 1985

name: Saf85

Review

text: "This paper..."

publ

ished

in


The property graph

evaluated in

Paper

year: 1985

name: AKV85

Conference

name: STOC85

published in

Paper

year: 1985

name: Saf85

Review


publ

ished

in

The pattern with optional

(x ,published in, y) OPTIONAL (y , evaluated in, z)


A matching

name: STOC85

Paper

year: 1985

name: AKV85

published in

Paper

year: 1985

name: Saf85

Review


publ

ished

in

evaluated in

Conference


(x ,published in, y) OPTIONAL (x , evaluated in, z)


Another matching

name: STOC85

published in

Paper

year: 1985

name: Saf85

Review


publ

ished

in

evaluated in

Paper

year: 1985

name: AKV85

Conference


(x ,published in, y) OPTIONAL (x , evaluated in, z)

Pattern trees

A way to formalize bgps + optional operator

Pattern trees

A way to formalize bgps + optional operator

A well-designed pattern tree (WDPT) is a tuple (T , λ) such that:

1. T is a rooted tree

2. λ maps each node t in T to a bgp

3. for every variable y , the set of nodes of T where y appears isconnected (well-designedness)

Example of WDPT

WDPT t

{(x , rec by , y), (x , publ in, “after2010”)}

{(x ,NMErank , z)} {(y , formed in, z ′)}

Example of WDPT

WDPT t

{(x , rec by , y), (x , publ in, “after2010”)}

{(x ,NMErank , z)} {(y , formed in, z ′)}

It represents the following graph pattern:

((x , rec by , y), (x , publ in, “after2010”) OPT (x ,NMErank , z)

)

OPT (y , formed in, z ′)

Evaluation of WDPTs

Consider WDPT t = (T , λ) with root r and subtree T ′ of T rooted in r :

Evaluation of WDPTs


pT ′ := bgp formed by all triples in the nodes of T ′.

Evaluation of WDPTs


pT ′ := bgp formed by all triples in the nodes of T ′.

A matching from p to G is a replacement hof some variables of p with constants such that:

1. There is subtree T ′ of T rooted in r such thath is a matching from pT ′ to G .

2. h cannot be extended any further.

COMPLEXITY OF EVALUATION FOR BGPS

A simplified scenario

◮ We consider edge-labeled graphs only

◮ Variables only occur as node names

◮ Bgps are equipped with projection

The decision problem we study

PROBLEM : Evaluation

INPUT : An edge-labeled graph G and a bgp p

QUESTION : Is it the case that p(G ) 6= ∅?

The decision problem we study

PROBLEM : Evaluation

INPUT : An edge-labeled graph G and a bgp p

QUESTION : Is it the case that p(G ) 6= ∅?

Recall that:

◮ p(G ) 6= ∅ ⇐⇒ there is a matching from p to G .

◮ Matching: replacement of variables in p by constants such thatresulting graph G ′ is contained in G (homomorphism)

How many such replacements we have to consider?


At most nm, where:

◮ n = number of nodes in G

◮ m = number of variables in p


At most nm, where:



This is exponential in the size of the input!


At most nm, where:




But can we do better?


At most nm, where:




But can we do better?

◮ Most likely no, under complexity-theoretical assumptions

The complexity of evaluation

Proposition (Chandra and Merlin, 1977)

The evaluation problem is NP-complete

The complexity of evaluation

Proposition (Chandra and Merlin, 1977)

The evaluation problem is NP-complete

◮ For the upper bound:Guess a replacement, check it is a matching

◮ For the lower bound:Reduce from 3-colorability

Proof of the lower bound

Given an undirected graph G = (V ,E ), where:

◮ V = {1, . . . , n} and E ⊆ V × V (symmetric)




Represent G as the bgp p:

{(xi ,E , xj ) | (i , j) ∈ E}





{(xi ,E , xj ) | (i , j) ∈ E}

Consider edge-labeled graph G ′ given as:

{(a,E , r ), (r ,E , a), (a,E , g ), (g ,E , a), (g ,E , r ), (r ,E , g)}





{(xi ,E , xj ) | (i , j) ∈ E}

Consider edge-labeled graph G ′ given as:

{(a,E , r ), (r ,E , a), (a,E , g ), (g ,E , a), (g ,E , r ), (r ,E , g)}

=⇒ p(G ′) 6= ∅ iff G is 3-colorable

Does this proof represent a practical situation?


The bgp p is bigger than the edge-labeled graph G ′

◮ In practice, G ′ is large and p is much smaller


The bgp p is bigger than the edge-labeled graph G ′

◮ In practice, G ′ is large and p is much smaller

Question:Should p and G be be assigned the same “relevance” in evaluation?

A radical solution

Consider only G to be part of the input: The bgp p is fixed

A radical solution


This corresponds to the notion of data complexity (Vardi, 1981)

A radical solution



Data complexity of bgp evaluation:

◮ Can be solved in polynomial time O(|G ||p|), since now p is fixed

◮ Actually, it belongs to a low complexity class inside LOGSPACE

◮ Evaluation of bgps in data complexity is “parallelizable”

A radical solution



Data complexity of bgp evaluation:

◮ Can be solved in polynomial time O(|G ||p|), since now p is fixed

◮ Actually, it belongs to a low complexity class inside LOGSPACE

◮ Evaluation of bgps in data complexity is “parallelizable”

Data complexity: good explanation to success of evaluation in practice

Still, if G is very large...

... a running time of O(|G ||p|) is not so good

Still, if G is very large...

... a running time of O(|G ||p|) is not so good

A different idea:

◮ Understand the structure of bgps used in practice

◮ We have good indications that many of them are nearly-acyclic

Acyclic bgps

Those whose underlying undirected graph is acyclic

Acyclic bgps

Those whose underlying undirected graph is acyclic

◮ (x , a, y), (y , b, z), (x , c , z ′) is acyclic

◮ (x , a, y), (y , b, z), (x , c , z ′), (z ′, d , x), (x , e, x) is also acyclic

◮ (x , a, y), (y , b, z), (x , c , z) is not acyclic

Evaluation of acyclic bgps

Theorem (Yannakakis, 1981)

Acyclic bgps can be evaluated in linear time O(|G | · |p|).

Evaluation of acyclic bgps

Theorem (Yannakakis, 1981)

Acyclic bgps can be evaluated in linear time O(|G | · |p|).

Idea of the proof:

◮ Bottom-up dynamic programming algorithm

◮ Labels each node v with the set

{h(v) | h is a matching of subtree of p rooted in v}

In practice bgps are not acyclic, but “nearly” acyclic

Question: How do we measure the degree of acyclicity of a bgp?

In practice bgps are not acyclic, but “nearly” acyclic

Question: How do we measure the degree of acyclicity of a bgp?

Approach taken in the literature:

◮ Use notion of treewidth: measures the degree of acyclicity of a graph

◮ Measure the degree of acyclicity of the underlying graph of the bgp

Treewidth of a graph

Measures how much an undirected graph “resembles” a tree:

◮ Acyclic graphs have treewidth one

◮ Cycles have treewidth two

◮ The complete graph on k elements has treewidth (k − 1)

◮ The k × k grid has treewidth k

Tree decompositions

Notion used to define treewidth

◮ It decomposes a graph as a tree while preserving connectivity

Tree decompositions



A tree decomposition of G = (V ,E ) is a pair (T , λ) such that:

◮ T is a tree

◮ λ assigns a set of nodes in V to each node t in T

◮ If {u, v} ∈ E , then u, v appear together in one such λ(t)

◮ For u ∈ V , the set of λ(t)’s that contain u is connected

Tree decompositions



A tree decomposition of G = (V ,E ) is a pair (T , λ) such that:

◮ T is a tree

◮ λ assigns a set of nodes in V to each node t in T

◮ If {u, v} ∈ E , then u, v appear together in one such λ(t)

◮ For u ∈ V , the set of λ(t)’s that contain u is connected

The width of (T , λ) is(max {|λ(t)| : t ∈ T} − 1

)

Definition of treewidth

The treewidth of G = (V ,E ) is

◮ The minimum width of its tree decompositions

Definition of treewidth

The treewidth of G = (V ,E ) is

◮ The minimum width of its tree decompositions

The treewidth of a bgp is the treewidth of its underlying graph

Bgps of small treewidth are well-behaved

Theorem (Chekuri and Rajaraman, 1997)

Evaluation for bgps of treewidth k can be solved in time

O(|G |k · |p|).

Bgps of small treewidth are well-behaved

Theorem (Chekuri and Rajaraman, 1997)

Evaluation for bgps of treewidth k can be solved in time

O(|G |k · |p|).

Idea of the proof:

◮ Compute a tree decomposition (T , λ) of p of width k

◮ Perform a dyn programming procedure on (T , λ) as for acyclic bgps

Can one improve on bounded treewidth?

Can one improve on bounded treewidth?

Grohe’s Theorem (Grohe, 2002): The following are equivalent

◮ A class C of bgps is tractable

◮ The treewidth of the bgps in C is bounded modulo equivalence

OPTIMIZATION OF BGPS

A theoretical notion of optimization

Find an equivalent bgp of minimal size



Bgps p and p′ are equivalent if

◮ p(G ) = p′(G ), over every edge-labeled graph



Bgps p and p′ are equivalent if

◮ p(G ) = p′(G ), over every edge-labeled graph

We assume that bgps are equipped with projection

◮ We write p(x) to denote that x are the distinguished variables of p

Equivalence of bgps is the same than evaluation

Theorem (Chandra and Merlin, 1977)

The following are equivalent for bgps p(x) and p′(x ′):

◮ p(x) and p′(x ′) are equivalent

◮ x ∈ p′(p) and x ′ ∈ p(p′)





◮ x ∈ p′(p) and x ′ ∈ p(p′)

As a corollary:

◮ Checking equivalence of bgps is NP-complete





◮ x ∈ p′(p) and x ′ ∈ p(p′)

As a corollary:

◮ Checking equivalence of bgps is NP-complete

This is considered to be a positive result (bgps are small)

Minimization of bgps

Theorem (Hell and Nesetril, 1992)

A minimal equivalent bgp of p is

◮ unique (up to renaming of variables)

◮ always contained in p

◮ corresponds to the core of p

Minimization of bgps

Theorem (Hell and Nesetril, 1992)

A minimal equivalent bgp of p is

◮ unique (up to renaming of variables)

◮ always contained in p

◮ corresponds to the core of p

As a corollary:

◮ Minimization of bgps can be carried out in PNP (feasible)

NAVIGATION

Graphs are there to be navigated

Traversing the edges of the graph recursively is useful for:

◮ Detecting implicit relations over the data

◮ Allowing more flexibility at the moment of querying

Recall our property graph?

authored by

Author

name: Leonid Libkin

affil: U. of Edinburgh

Author

name: Robert Tarjan

affil: Princeton Univ.

Article

name: jacmHopcroft74

year: 1974

Venue

name: FOCS67

authored by

corresponding: YES

Author

Author

Author

Author

name: John Hopcroft

affil: Cornell Univ.

name: Jeff Ullman

affil: Stanford Univ.

name: Ron Fagin

affil: IBM Almaden

name: Moshe Vardi

affil: Rice University

Author

name: Limsoon Wong

affil: NU Singapore

Article

name: FOCSHopcroft67

year: 1967

Article

name: PODSUllman89

year: 1989

Article

name: PODSFaginUV83

year: 1983

Article

name: PODSVardi95

year: 1995

Article

name: PODSLibkin95

year: 1995

Article

name: IPLLibkinW95

year: 1995

Journal

name: JACM

Venue

Venue

Venue

Journal

name: PODS89

name: PODS83

name: PODS95

name: IPL

journal

journal

part of

part of

part of

part of

Conference

Conference

name: FOCS

name: PODS

series

series

part of

serie

s

series

authored by

authored by

authored by

authored by

authored by

corresponding: YES

corresponding: YES

authored by

authored by

cooresponding: YES

cooresponding: YES

authored by

authored by

authored by

cooresponding: YES

An example of navigation in such graph

Find pairs of authors linked by a coauthorship sequence

An example of navigation in such graph

Find pairs of authors linked by a coauthorship sequence

(x , (authored by−1 · authored by)∗ , y

)

Regular path queries (RPQs)

Expressions of the form (x , r , y), where r is a regular expression:

r , r ′ := ∅ | ǫ | a | a−1 | r ∪ r ′ | r · r ′ | r∗ (a ∈ Lab)

Regular path queries (RPQs)

Expressions of the form (x , r , y), where r is a regular expression:

r , r ′ := ∅ | ǫ | a | a−1 | r ∪ r ′ | r · r ′ | r∗ (a ∈ Lab)

Evaluation:

◮ Set r(G ) of pairs of nodes linked by a path whose label is in L(r)

Paths by example

A path in the edge-labeled graph:

:Ronald FagininPods:83

:John E. HopcroftinFocs:FOCS8

journal:jacm Jacm:HopcroftT74 :Robert E Tarjan

:Jeffrey Ullman

conf:focs Focs:HopU67a

:Moshe Y. Vardi

series

series

journal

partOf

partOf

creatorcreator

creatorcreator

creator

creator

Pods:FaginUV83creator

creator:Leonid Libkin

partOf

creatorcreator

:Limsoon Wongjournal

inPods:89

series

Pods:Ullman89creatorpartOf

Pods:Vardi95

conf:pods

partOf

IPL:LibkinW95

inPods:95

Pods:Libkin95

journal:IPL

creator

series

Labels of path by example

The label of such path is partOf series:




:Jeffrey Ullman


:Moshe Y. Vardi

series

series

journal

partOf

partOf

creatorcreator

creatorcreator

creator

creator



partOf

creatorcreator


inPods:89

series

Pods:Ullman89creatorpartOf

Pods:Vardi95

conf:pods

partOf

IPL:LibkinW95

inPods:95

Pods:Libkin95

journal:IPL

creator

series

More formally

Given an edge-labeled graph G = (V ,E ) such that

◮ E ⊆ V × Lab× V

More formally


◮ E ⊆ V × Lab× V

A path in G is a sequence of consecutive edges in G , i.e.,

v1a1−→ v2

a2−→ v3 · · · vkak−→ vk+1

More formally


◮ E ⊆ V × Lab× V

A path in G is a sequence of consecutive edges in G , i.e.,

v1a1−→ v2

a2−→ v3 · · · vkak−→ vk+1

The label of such path is the word

a1a2 · · · ak ∈ Lab∗

Evaluation problem for RPQs

PROBLEM : Evaluation for RPQs

INPUT : An edge-labeled graph G ,

a pair (u, v) of nodes in G , and a RPQ (x , r , y)

QUESTION : Is it the case that (u, v) ∈ r(G )?

(i.e., is there a path from u to v with label in L(r)?)

The complexity of evaluation for RPQs

Theorem (Folklore)

The evaluation of RPQs can be solved in linear time O(|G | · |r |)

The complexity of evaluation for RPQs

Theorem (Folklore)


For the proof:Exploit the connection between edge-labeled graphs, RPQs, and NFAs

Non-deterministic finite automata (NFAs)

Tuple A = (Q, q0,F , δ), where:

◮ Q is a finite set of states

◮ q0 ∈ Q is the initial state

◮ F ⊆ Q are the final states

◮ δ ⊆ Q × Lab× Q is the transition relation

Acceptance by NFAs

L(A) = set of words “accepted” by A

Acceptance by NFAs


When is the worda1a2 . . . ak ∈ Lab∗

accepted by A?

Acceptance by NFAs


When there are states

q0a1q1a2q2 . . . qk−1akqk ∈ Lab∗

such that qk ∈ F and (qi−1, ai , qi ) ∈ δ for each 1 ≤ i ≤ k

The emptiness problem for NFAs

Given an NFA A, is L(A) 6= ∅?



◮ Can be solved in time O(|A|) by DFS




Given NFAs A1 and A2, is L(A1) ∩ L(A2) 6= ∅?




Given NFAs A1 and A2, is L(A1) ∩ L(A2) 6= ∅?

◮ Construct the cross product A1 ×A2 in time O(|A1| · |A2|)

◮ Check if L(A1 ×A2) 6= ∅ in time O(|A1| · |A2|)

Edge-labeled graphs can be seen as NFAs

◮ Nodes are states

◮ Edges ua−→ v are transitions

◮ There are no initial and final states

Regular expressions can be turned into NFAs

Proposition

There is an algorithm that given a regular expression r

◮ constructs in time O(|r |) an NFA A such that L(A) = L(r)

Proof of the theorem on evaluation of RPQs

Theorem


Proof of the theorem on evaluation of RPQs

Theorem


Proof idea:

◮ Compute in linear time from r an equivalent NFA A◮ Compute in linear time the NFA (G , u, v):

◮ Obtained from G by setting u and v as initial and final state, resp.

◮ Then (u, v) ∈ r(G ) iff NFA (G , u, v)×A accepts a word

◮ Check the latter in time O(|G | · |A|) = O(|G | · |r |)

A semantics based on simple paths?

Simple paths: No node is repeated

A semantics based on simple paths?

Simple paths: No node is repeated

Simple paths semantics:

◮ Motivated by apps for which simple paths are more natural

◮ Studied in the context of semistructured data in the early 90s

◮ Revival due to application in early versions of SPARQL 1.1

RPQs under simple paths semantics

Problem: Evaluation for RPQs under simple paths

Input: An edge-labeled graph G ,

a pair (u, v) of nodes in G , and a RPQ (x , r , y)

Question: Is there a simple path from

u to v in G whose label is in L(r)?

Complexity of finding regular simple paths

Theorem (Mendelzon and Wood, 1989)

Evaluation for RPQs under simple paths is NP-complete

◮ This holds even if r = (aa)∗

◮ In fact, even if r = w∗, for w any word of length ≥ 2

Complexity of finding regular simple paths

Theorem (Mendelzon and Wood, 1989)

Evaluation for RPQs under simple paths is NP-complete

◮ This holds even if r = (aa)∗

◮ In fact, even if r = w∗, for w any word of length ≥ 2

Proof:

◮ The following is NP-complete (Lapaugh and Papadimitriou, 1984)• Given a digraph G = (V ,E ) and two nodes u, v ,

is there a directed path from u to v of even length?

In conclusion

◮ Evaluation of RPQs under simple paths is NP-complete

◮ This holds even in data complexity

◮ This casts serious doubts on the applicability of the semantics

EXTENDING BGPS WITH NAVIGATION

Conjunctive regular path queries (CRPQs)

Obtained by combining bgps + RPQs

Example of C2RPQ

The C2RPQ

(x , creator−, y), (y , partOf · series, z), (y , creator , u)

computes pairs (a1, a2) that are coauthors of a conference paper



conf:pods


:Jeffrey Ullman


:Moshe Y. Vardi

series

series

journal

partOf

partOf

creatorcreator

creatorcreator

creator

creator



partOf

creatorcreator


creatorPods:Ullman89inPods:89

partOf

series

Pods:Libkin95

IPL:LibkinW95

partOf creatorinPods:95

journal:IPL

series

Pods:Vardi95

Example of C2RPQ

The C2RPQ



creator

inPods:83



:Jeffrey Ullman

conf:focs Focs:HopU67apartOf

creatorcreator

creatorcreator

Pods:FaginUV83


partOf

creatorcreator


creator

partOfseries creatorconf:pods :Ronald Fagin

zy

u

x

:Moshe Y. Vardi

series

journal

series

partOfPods:Ullman89

creatorinPods:89

Pods:Vardi95inPods:95partOf

IPL:LibkinW95

Pods:Libkin95

series

journal:IPL

creator

Example of C2RPQ

The C2RPQ



inPods:83


conf:pods


:Jeffrey Ullman


:Moshe Y. Vardi

series

journal

partOf

creatorcreator

creatorcreator

creator

Pods:FaginUV83


partOf

creatorcreator


creator

partOfseries creator:Ronald Fagin

a1

a2creator

inPods:89 Pods:Ullman89

series

partOf

IPL:LibkinW95

series

journal:IPL

creatorpartOf

Pods:Libkin95

Pods:Vardi95inPods:95

Formal definition of CRPQS

Set q of the form:

{(x1, L1, y1), . . . , (xm, Lm, ym)},

such that the xi , yi ’s are variables and the ri ’s reg exps

Formal definition of CRPQS

Set q of the form:

{(x1, L1, y1), . . . , (xm, Lm, ym)},

such that the xi , yi ’s are variables and the ri ’s reg exps

Evaluation: Set q(G ) of all possible “matchings” of q in G , i.e.,

◮ Replacements of the xi , yi by constants ci , disuch that (ci , di ) is linked by path with label in L(ri )

CRPQs vs RPQs

The CRPQ


is not expressible as a RPQ over the graph database:



conf:pods


:Jeffrey Ullman


:Moshe Y. Vardi

series

series

journal

partOf

partOf

creatorcreatorcreatorcreator

creator

creator



partOf

creatorcreator


creatorPods:Ullman89inPods:89

partOf

series

Pods:Libkin95

IPL:LibkinW95

partOf creatorinPods:95

journal:IPL

series

Pods:Vardi95

Evaluation for CRPQs

PROBLEM : Evaluation for CRPQs

INPUT : An edge-labeled graph G and a CRPQ q

QUESTION : Is it the case that q(G ) 6= ∅?

Complexity of evaluation for CRPQs

Proposition (Folklore)

Evaluation of CRPQs is NP-complete

◮ But polynomial in data complexity

Complexity of evaluation for CRPQs

Proposition (Folklore)

Evaluation of CRPQs is NP-complete


Restrictions on degree of acyclicity yield tractability

QUERIES ON GRAPHS WITH DATA

Querying the data

So far queries only talk about the topology of the data

◮ We have restricted ourselves to data graphs

Querying the data



Queries that combine topology and data are important in practice:

◮ Example: People of the same age connected by professional links

Querying the data



Queries that combine topology and data are important in practice:

◮ Example: People of the same age connected by professional links

We need to develop navigational queries for property graphs

Data graphs and data paths

We work with data graphs as a simplification of property graphs



A data graph G is a tuple (V ,E , κ), where:

◮ (V ,E ) is an edge-labeled graph

◮ κ : V → D, where D is a set of data values



A data graph G is a tuple (V ,E , κ), where:

◮ (V ,E ) is an edge-labeled graph

◮ κ : V → D, where D is a set of data values

With each pathv1

a1−→ v2 · · · vkak−→ vk+1

we associate a data path

κ(v1)a1−→ κ(v2) · · · κ(vk)

ak−→ κ(vk+1)

The choice of a formalism

Formalism for querying data paths has to be chosen with care:

Theorem (Libkin and Vrgoc, 2012)

The following problem is NP-complete:

◮ Is there a data path from u to v with no repeated data value?

The choice of a formalism

Formalism for querying data paths has to be chosen with care:

Theorem (Libkin and Vrgoc, 2012)

The following problem is NP-complete:

◮ Is there a data path from u to v with no repeated data value?

If a language expresses this property:

◮ It is NP-hard in data complexity ⇒ impractical

Regular expressions with memory (REMs)

◮ They permit to specify when data values are remembered and used

◮ Data values are remembered in k registers {x1, . . . , xk}

◮ At any point we can compare a data value with one in the registers

◮ Based on the notion of register automata

REM: Example

Consider the REM ↓x .a+[x=].

Intuition:

◮ Store the current data value d in register x

◮ After reading a word in a+ check that d is seen again

Evaluation: Pairs (v , v ′) of nodes:

◮ Linked by a path labeled in a+.

◮ v and v ′ contain the same data value.

REM: Conditions

Conditions: Compare a data value with the ones in the registers

Conditions over {x1, . . . , xk} are given by the grammar:

c := x=i | ¬c | c ∧ c (1 ≤ i ≤ k)

Then (d , τ) |= c for d ∈ D and τ = (d1, . . . , dk) ∈ Dk :

◮ (d , τ) |= x=i iff d = di

◮ Boolean combinations are standard

REMs: Syntax and semantics (Intuition)

REMs over Σ and {x1, . . . , xk} are defined by grammar:

e := ε | a | e ∪ e | e · e | e+ | e[c] | ↓ x .e

where a ∈ Lab, c condition, and x tuple in {x1, . . . , xk}



e := ε | a | e ∪ e | e · e | e+ | e[c] | ↓ x .e


Intuition: Evaluation of REM e on data graph G is:• pairs (v , v ′) of nodes linked by a data path ρ such that ρ |= e, where:



e := ε | a | e ∪ e | e · e | e+ | e[c] | ↓ x .e



◮ ρ |= e[c] if and only if

τ ∈ Dk st (κ(vk+1), τ) |= c

κ(v1)a1−→ κ(v2) · · · κ(vk)

ak−→ κ(vk+1)︸︷︷︸

can be parsed wrt estarting from empty registers

ρD

finishing in register value



e := ε | a | e ∪ e | e · e | e+ | e[c] | ↓ x .e



◮ ρ |=↓ x .e if and only ifρD

κ(v1)a1−→ κ(v2) · · · κ(vk)

ak−→ κ(vk+1)︸︷︷︸

can be parsed wrt estarting from the register valuethat assigns κ(v1) to each x ∈ x

REM: No closure under complement

Consider the REM Σ∗ · (↓x .Σ+[x=]) · Σ∗:



◮ Defines pairs of nodes linked by data path ρ such that:• ρ contains the same data value twice.

◮ The complement of this language is the NP-complete propertymentioned before



◮ Defines pairs of nodes linked by data path ρ such that:• ρ contains the same data value twice.

◮ The complement of this language is the NP-complete propertymentioned before

Corollary:

◮ REMs are not closed under complement

Complexity of REM evaluation

Theorem (Libkin, Vrgoc, 2012)

Evaluation for REMs is Pspace-complete


Complexity of REM evaluation

Theorem (Libkin, Vrgoc, 2012)

Evaluation for REMs is Pspace-complete


Both bounds extend to the class of conjunctive REMs

PATH QUERIES

CRPQs and path queries

CRPQs fall short of expressive power for applications that need:

◮ to include paths in the output of a query, and

◮ to define complex relationships among labels of paths.

CRPQs and path queries

CRPQs fall short of expressive power for applications that need:

◮ to include paths in the output of a query, and

◮ to define complex relationships among labels of paths.

Examples:

◮ Semantic Web queries:• establish semantic associations among paths.

◮ Biological applications:• compare paths based on similarity.

◮ Route-finding applications:• compare paths based on length or number of occurences of labels.

◮ Data provenance and semantic search over the Web:• require returning paths to the user.

Path comparisons

We use a set S of relations on words.

◮ Example: S may contain• Unary relations: Regular, context-free languages, etc.• Binary relations: prefix, equal length, subsequence, etc.

◮ Comparisons among labels of paths = Pertenence to some S ∈ S.• Example: w1 is a substring of w2.

◮ We assume S contains all regular languages.

Extended CRPQs

The S-extended CRPQs (ECRPQ(S)) are rules obtained from a CRPQ:

Ans(z , ) ← (x1, L1, y1), . . . , (xm, Lm, ym),

◮ by joining each pair (xi , yi ) with a path variable πi ,

◮ comparing labels of paths in πj wrt Sj ∈ S• for πj a tuple of path variables among the πi ’s,

◮ projecting some of πi ’s as a tuple χ in the output.

Extended CRPQs


Ans(z, ) ← (x1, π1, y1), . . . , (xm, πm, ym),




Extended CRPQs


Ans(z, ) ← (x1, π1, y1), . . . , (xm, πm, ym),∧

1≤j≤t Sj(πj )




Extended CRPQs


Ans(z, χ) ← (x1, π1, y1), . . . , (xm, πm, ym),∧

1≤j≤t Sj(πj )




Extended CRPQs and our requirements

ECRPQs meet our requirements:

Ans(z, χ) ← (x1, π1, y1), . . . , (xm, πm, ym),∧

1≤j≤t Sj(πj)



Ans(z, χ) ← (x1, π1, y1), . . . , (xm, πm, ym),∧

1≤j≤t Sj(πj)

◮ They allow to export paths in the output.

◮ They allow to compare labels of paths with relations Sj ∈ S.



Ans(z, χ) ← (x1, π1, y1), . . . , (xm, πm, ym),∧

1≤j≤t Sj(πj )

◮ They allow to export paths in the output.

◮ They allow to compare labels of paths with relations Sj ∈ S.

Evaluation of ECRPQs

Evaluation of the ECRPQ(S)

θ(z, χ) : Ans(z , χ) ← (x1, π1, y1), . . . , (xm, πm, ym),∧

j Sj(πj )

is defined in terms of “matchings”




j Sj(πj )


Matching for ECRPQs: Same than for CRPQs but:

◮ Each πi is mapped to a path ρi in the graph DB.

◮ If πj = (πj1 , . . . , πjk ) then: (Lab(ρj1), . . . , Lab(ρjk ))︸︷︷︸

the labels of (ρj1 , . . . , ρjk )

∈ Sj .




j Sj(πj )


Matching for ECRPQs: Same than for CRPQs but:

◮ Each πi is mapped to a path ρi in the graph DB.

◮ If πj = (πj1 , . . . , πjk ) then: (Lab(ρj1), . . . , Lab(ρjk )) ∈ Sj .

The evaluation θ(G) of θ(z, χ) on G is the set:

{(σ(z), σ(χ)) | σ is matching from θ to G}.

Considerations about ECRPQ(S)

• ECRPQ(S) extends the class of CRPQs.

◮ Ans(z)←∧

i(xi , Li , yi ) = Ans(z) ←∧

i(xi , πi , yi ), Li (πi ).

• Expressiveness and complexity of ECRPQ(S):

◮ Depends on the class S.

• We study two such classes with roots in formal language theory:

◮ Regular relations (Elgot, Mezei (1965)).

◮ Rational relations (Nivat (1968)).

COMPARING PATHS WITH REGULAR RELATIONS

Introduction

• Regular relations: Regular languages for relations of any arity

◮ REG: Class of regular relations.

• Bottomline:

ECRPQ(REG): Reasonable expressiveness and complexity.

Regular relations

n-ary regular relation:

Set of n-tuples (w1, . . . ,wn) of stringsaccepted by synchronous automaton over Σn.

Regular relations

n-ary regular relation:

Set of n-tuples (w1, . . . ,wn) of stringsaccepted by synchronous automaton over Σn.

◮ The input strings are written in the n-tapes.

◮ Shorter strings are padded with symbol ⊥.

◮ At each step:The automaton simultaneously reads next symbol on each tape.

Synchronous automata

w1 = a a b · · · a b c

w2 = a b a · · · a

w3 = b b · · ·...

...

wn = a b b · · · a c


w1 = a a b · · · a b c

w2 = a b a · · · a ⊥ ⊥

w3 = b b ⊥ · · · ⊥ ⊥ ⊥...

...

wn = a b b · · · a c ⊥


w1 = a a b · · · a b c

w2 = a b a · · · a ⊥ ⊥

w3 = b b ⊥ · · · ⊥ ⊥ ⊥...

...

wn = a b b · · · a c ⊥

⇑

Examples of regular relations

• All regular languages.

• The prefix relation defined by:

( ⋃

a∈Σ

(a, a))∗·( ⋃

a∈Σ

(a,⊥))∗.

• The equal length relation defined by:

( ⋃

a,b∈Σ

(a, b))∗.

• Pairs of strings at edit distance at most k , for fixed k ≥ 0.

Examples of regular relations

• All regular languages.

• The prefix relation defined by:

( ⋃

a∈Σ

(a, a))∗·( ⋃

a∈Σ

(a,⊥))∗.

• The equal length relation defined by:

( ⋃

a,b∈Σ

(a, b))∗.

• Pairs of strings at edit distance at most k , for fixed k ≥ 0.

Proposition

The subsequence, subword and suffix relations are not regular.

ECRPQ(REG)

ECRPQ(REG): Class of queries of the form

Ans(z, χ) ←∧

i (xi , πi , yi ),∧

j Sj(πj),

where each Sj is a regular relation (B., Libkin, Lin, Wood (2012)).

ECRPQ(REG)


Ans(z, χ) ←∧

i (xi , πi , yi ),∧

j Sj(πj),


Example: The ECRPQ(REG) query

Ans(x , y) ← (x , π1, z), (z , π2, y), a∗(π1), b

∗(π2), equal length(π1, π2)

computes pairs of nodes linked by a path labeled in {anbn | n ≥ 0}.

ECRPQ(REG)


Ans(z, χ) ←∧

i (xi , πi , yi ),∧

j Sj(πj),


Example: The ECRPQ(REG) query

Ans(x , y) ← (x , π1, z), (z , π2, y), a∗(π1), b

∗(π2), equal length(π1, π2)

computes pairs of nodes linked by a path labeled in {anbn | n ≥ 0}.

Corollary

ECRPQ(REG) properly extends the class of CRPQs.

Complexity of evaluation of ECRPQ(REG)

Theorem (B., Libkin, Lin and Wood, 2012)

Evaluation for ECRPQ(REG) is Pspace-complete.


COMPARING PATHS WITH RATIONAL RELATIONS

Introduction

ECRPQ(REG) queries are still short of expressive power:

◮ RDF or biological networks:• Compare strings based on subsequence and subword relations.

◮ These relations are rational: Accepted by “asynchronous” automata.• RAT: Class of rational relations.

Bottomline:

◮ ECRPQ(RAT) evaluation:• Undecidable or very high complexity.

◮ Restricting the syntactic shape of queries yields tractability.

Rational relations

n-ary rational relation:Set of n-tuples (w1, . . . ,wn) of stringsaccepted by asynchronous automaton with n heads.

Rational relations



◮ At each step:The automaton enters a new state and move some tape heads.

Rational relations



◮ At each step:The automaton enters a new state and move some tape heads.

n-ary rational relation:Described by regular expression over alphabet (Σ ∪ {ǫ})n.

Examples of rational relations

• All regular relations.

• The subsequence relation �ss defined by:

(( ⋃

a∈Σ

(a, ǫ))∗

⋃

b∈Σ

(b, b)

)∗

·( ⋃

a∈Σ

(a, ǫ))∗.

• The subword relation �sw defined by:

( ⋃

a∈Σ

(a, ǫ))∗·( ⋃

b∈Σ

(b, b))∗·( ⋃

a∈Σ

(a, ǫ))∗.

Examples of rational relations

• All regular relations.

• The subsequence relation �ss defined by:

(( ⋃

a∈Σ

(a, ǫ))∗

⋃

b∈Σ

(b, b)

)∗

·( ⋃

a∈Σ

(a, ǫ))∗.

• The subword relation �sw defined by:

( ⋃

a∈Σ

(a, ǫ))∗·( ⋃

b∈Σ

(b, b))∗·( ⋃

a∈Σ

(a, ǫ))∗.

Proposition

The pairs (w1,w2) such that w1 is the reversal of w2 is not rational.

ECRPQ(RAT)

ECRPQ(RAT): Class of queries of the form

Ans(z, χ) ←∧

i (xi , πi , yi ),∧

j Sj(πj),

where each Sj is a rational relation (B., Figueira, Libkin (2012)).

Example: The ECRPQ(RAT) query

Ans(x , y) ← (x , π1, z), (y , π2,w), π1 �ss π2

computes x , y that are origins of paths ρ1 and ρ2 such that:

◮ λ(ρ1) is a subsequence of λ(ρ2).

Evaluation of ECRPQ(RAT) queries

Evaluation of queries in ECRPQ(RAT) is undecidable, but:

◮ True if we allow only practically motivated rational relations?• For example, �ss and �sw.




Adding subword relation to ECRPQ(REG) leads to undecidability:

Theorem (B., Figueira and Libkin, 2012)

Evaluation for ECRPQ(REG ∪{�sw})) is undecidable.




Adding subword relation to ECRPQ(REG) leads to undecidability:


Evaluation for ECRPQ(REG ∪{�sw})) is undecidable.

Adding subsequence preserves decidability, but at a very high cost:


Evaluation for (ECRPQ(REG ∪{�ss})) is decidable,but non-primitive-recursive.

Acyclic ECRPQ(RAT) queries

Acyclic ECRPQ(RAT) queries yield tractable data complexity.

◮ Queries of the form:

Ans(z)←∧

i≤k

(xi , πi , yi ), Li (πi ),∧

j

Sj(πj1 , πj2),

where the graph on {1, . . . , k} defined by edges (πj1 , πj2) is acyclic.

Acyclic ECRPQ(RAT) queries

Acyclic ECRPQ(RAT) queries yield tractable data complexity.

◮ Queries of the form:

Ans(z)←∧

i≤k

(xi , πi , yi ), Li (πi ),∧

j

Sj(πj1 , πj2),

where the graph on {1, . . . , k} defined by edges (πj1 , πj2) is acyclic.

Acyclic ECRPQ(RAT) is not more expensive than ECRPQ(REG):


Evaluation of acyclic ECRPQ(RAT) queries is Pspace-complete.


What are your conclusions?

centro de investigacio´n de la web sem´antica center for...

Documents