
XPath Query Processing DBPL9 Tutorial, Sept. 8, 2003, Part 2

Georg Gottlob, TU Wien

Christoph Koch, U. Edinburgh

Based on joint work with R. Pichler

Contents

Part 1
• XPath Basics
• Axis Evaluation
• Experiments with current systems
• Polynomial-time evaluation of Core XPath
• Core XPath and datalog
• Polynomial-time evaluation of full XPath

Part 2
• Context simplification and efficient evaluation of XPath
• Parallel complexity of XPath
• Automata-based techniques:
  – XPath on Streaming XML
  – Expressive queries and automata
• Further relevant work

Context Simplification and Efficient Evaluation of XPath

Alternative context representation

• Contexts are represented as (“previous context node”, “current context node”) pairs rather than (“context node”, “position”, “size”) triples.

• “Position” and “size” need to be recomputed on demand.

• Complexity lowered to time O(|data|^4 * |query|^2), space O(|data|^3 * |query|^2).

//a/b[position() + 1 = size()]

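[Figure: example tree. The root 0:c has the children 1:a and 5:a; 1:a has the children 2:b, 3:b, 4:b; 5:a has the children 6:b, 7:b.]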

child::b … { (1,2), (1,3), (1,4), (5,6), (5,7) }

child::b[position()+1=size()] … { (1,3), (5,6) }
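
The pair representation can be tried out on the example tree with a few lines of Python. This is only a minimal sketch: the dictionary `TREE` and the helpers `child_step` and `position_and_size` are names invented here, and the tree shape is the one read off the pairs listed above.

```python
# Example tree: node id -> (label, children in document order).
TREE = {
    0: ("c", [1, 5]),
    1: ("a", [2, 3, 4]),
    5: ("a", [6, 7]),
    2: ("b", []), 3: ("b", []), 4: ("b", []),
    6: ("b", []), 7: ("b", []),
}


def child_step(prev_nodes, label):
    """child::label from each previous node, as (previous, current) pairs."""
    return [(p, c) for p in prev_nodes
            for c in TREE[p][1] if TREE[c][0] == label]


def position_and_size(pair, label):
    """Recompute position and size on demand from a (previous, current) pair."""
    prev, cur = pair
    siblings = [c for c in TREE[prev][1] if TREE[c][0] == label]
    return siblings.index(cur) + 1, len(siblings)


# //a : all nodes labeled "a"
a_nodes = sorted(n for n, (lab, _) in TREE.items() if lab == "a")

# child::b, kept as pairs without position and size
pairs = child_step(a_nodes, "b")

# child::b[position() + 1 = size()]
selected = []
for pair in pairs:
    pos, size = position_and_size(pair, "b")
    if pos + 1 == size:
        selected.append(pair)

print(pairs)
print(selected)
```

Position and size are never stored with the context; they are derived from the previous context node whenever a predicate asks for them.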

Context Simplification Technique

1. Only materialize relevant context.

2. Use the Core XPath evaluation algorithm for outermost and innermost paths, e.g. //a/b/c//d[…]/e[…(a/b/c)].

3. Treat “position” and “size” in a loop.
   • Because of the tree shape of the query, loops never have to be nested.

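[Figure: query tree of a query built from the steps /child::b[…] and child::b[…] with the conditions position() + 1 = last() and position() = count(descendant::a). Path steps are annotated with (cn); the positional conditions are annotated with a (cn, cp, cs) loop. Annotation at the inner step: compute the node set for which child::b[…] is true.]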

• “Wadler Fragment” [Wadler, 1999]: Core XPath + position(), last(), and arithmetics.

• Evaluation in quadratic time and linear space.

• For x in [[//a]], compute the contexts (y,p,n) in x.[[b]]. Compute Y = { y | (y,p,n) ∈ x.[[b]] and p*2 = n }.
• Similarly, compute Z = { z | z.[[ d[position()*3 = last()] ]] is true }.
• Compute X = { x | z ∈ Z, x ∈ z.[[child::c]]^-1 } in linear time.
• Result is { w | v ∈ X ∩ Y, w ∈ v.[[descendant::e]] }.

Linear Space Fragment

//a/b[position() * 2 = last() and c/d[position()*3 = last()]]//e

[Figure: query tree of the query above. Path steps are annotated with (cn); the two positional predicates are annotated with (cn,cp,cs).]
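
The four steps for the query //a/b[position() * 2 = last() and c/d[position()*3 = last()]]//e can be sketched as follows. The small tree (`LABEL`, `CHILDREN`) and all function names are assumptions made for illustration, and the sketch is not tuned to meet the stated time bound. The point it shows is that the (y, p, n) contexts are enumerated in a loop and never stored.

```python
# Hypothetical document: node id -> label, and node id -> children in order.
LABEL = {0: "r", 1: "a", 2: "b", 3: "b", 4: "c", 5: "d", 6: "d", 7: "d",
         8: "e", 9: "a", 10: "b", 11: "e"}
CHILDREN = {0: [1, 9], 1: [2, 3], 2: [4, 8], 4: [5, 6, 7], 9: [10], 10: [11]}


def children(node, label):
    return [c for c in CHILDREN.get(node, []) if LABEL[c] == label]


def descendants(node):
    out = []
    for c in CHILDREN.get(node, []):
        out.append(c)
        out.extend(descendants(c))
    return out


def positional(parents, label, test):
    """Nodes y with (y, p, n) in x.[[child::label]] for some x in parents
    such that test(p, n) holds; contexts are looped over, not materialized."""
    hits = set()
    for x in parents:
        sibs = children(x, label)
        for p, y in enumerate(sibs, start=1):
            if test(p, len(sibs)):
                hits.add(y)
    return hits


all_nodes = list(LABEL)
a_nodes = [n for n in all_nodes if LABEL[n] == "a"]           # [[//a]]

# Y: b-children of a-nodes with position() * 2 = last()
Y = positional(a_nodes, "b", lambda p, n: p * 2 == n)

# Z: nodes z for which d[position()*3 = last()] is non-empty
Z = {z for z in all_nodes if positional([z], "d", lambda p, n: p * 3 == n)}

# X: inverse of child::c applied to Z
X = {x for x in all_nodes if any(c in Z for c in children(x, "c"))}

# Result: descendant::e of the nodes in X ∩ Y
result = sorted(w for v in X & Y for w in descendants(v) if LABEL[w] == "e")
print(Y, Z, X, result)
```

Only node sets (Y, Z, X) are kept between the steps, which is what keeps the space requirement linear in the document.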

Parallel Complexity of XPath


• Known: XPath is in P w.r.t. combined complexity [G., K., and Pichler, VLDB 2002].

• P-hardness => it is unlikely that there is an efficient parallel algorithm (conjecture: NC ≠ P).

• Even quite restrictive fragments of XPath are P-hard:
  – Core XPath using only the child, parent, and descendant axes, with no “branching” of tree patterns.
  – Proof by encoding circuits, somewhat involved!

• But: without negation, Core XPath is in LOGCFL (⊆ NC^2, highly parallelizable!)

PF – Path Query Fragment

• PF = Core XPath without conditions.
• E.g. //a/b//c/parent::d//f/g/ancestor::a/*

• Theorem: PF is NL-complete w.r.t. combined complexity (and L-reductions).

• Membership: paths are easy to guess and check in NL.
• NL-hardness by reduction from Graph Reachability …

Where can we go from v2 in one step?

[Figure: example graph with its tree encoding, and the one-step path expression, composed of child::, descendant::, and parent:: steps over the labels c and e, with parts repeated |V| times.]

• Reachable from v2 in one step: v1, v3!

PF is NL-hard.

• Reachability in precisely m steps:

[Formula: a path query that starts with a /descendant:: step to v_i, applies the one-step expression m times, and ends with a self:: test for v_j.]

• Add a loop at each node of the graph => reachability in at most m steps.
• Set m = |E|.
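
The self-loop trick used in the reduction can be checked on a toy graph. This is a sketch under assumptions: only the edges from v2 to v1 and v3 come from the slide; the edge (v3, v4), the use of a directed graph, and the function names are made up here.

```python
def reach_exact(edges, src, m):
    """Nodes reachable from src by walks of exactly m edges."""
    frontier = {src}
    for _ in range(m):
        frontier = {v for (u, v) in edges if u in frontier}
    return frontier


def with_self_loops(nodes, edges):
    """Add a loop at each node; exact-m reachability becomes at-most-m."""
    return set(edges) | {(v, v) for v in nodes}


nodes = {"v1", "v2", "v3", "v4"}
edges = {("v2", "v1"), ("v2", "v3"), ("v3", "v4")}   # (v3, v4) is hypothetical
m = len(edges)                                       # m = |E|

one_step = reach_exact(edges, "v2", 1)
exact_m = reach_exact(edges, "v2", m)
at_most_m = reach_exact(with_self_loops(nodes, edges), "v2", m)
print(one_step, exact_m, at_most_m)
```

Without loops no walk of exactly three edges starts at v2 in this graph, so the exact-m query returns nothing; with loops a walk can idle at a node, so every node within at most m steps is found.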

Further fragments with low parallel complexity

Combined complexity of Core XPath is in L if:

1. Only one-step axes are used (child, parent; self).

2. Only transitive downward axes are used (descendant, descendant-or-self, …).

Increasing the Size of the LOGCFL Fragment

• “Positive Wadler fragment” [Wadler, 2000]: just like positive Core XPath, but with position arithmetics in conditions.
  – child::a[position()+1 = last()] … get the second-last child labeled “a”.
  – No iteration of predicates: child::a[…][…].

• Theorem (combined complexity): the positive WF is
  – LOGCFL-complete;
  – with iterated predicates (already when iterated at most twice), P-complete.

Increasing the Size of the LOGCFL Fragment

• pXPath: “positive”/parallel XPath.
  1. No negation.
  2. No iterated predicates […][…].
  3. Depth of nesting of arithmetic operations inside a predicate is bounded by some constant.
  4. Forbidden built-in functions: count, sum, string, local-name, name, namespace-uri, string-length, normalize-space.
  5. Forbidden: relational operations on booleans.

• Theorem. pXPath is LOGCFL-complete (combined complexity).

• Maximal parallelizable fragment of XPath, unless P = NC.
  – Adding any of the features (1) – (5) leads to P-hardness.

Combined Complexity of XPath

Data and Query Complexity

• Theorem. PF is L-complete under NC^1-reductions (data complexity).

• Theorem. XPath without multiplication and concatenation is in L w.r.t. query complexity.

• Surprisingly, data complexity and query complexity are low; combined complexity is higher!

[Figure: data complexity diagram placing XPath and PF in L, with PF marked L-complete (NC^1-reductions).]

Processing Xpath on Streams using Finite Automata

FSA on Streams

• Translate an XPath path query into an FSA and process a stream of (e.g.) SAX events.
  – Very good scalability, low memory consumption (stack needed).

• Selective dissemination of information (SDI) / publish-subscribe (cf. XFilter [Altinel and Franklin, VLDB 2000], XTrie [Chan et al., ICDE 2002]).
  – Boolean queries.
  – Extensions to support branching tree patterns, condition predicates, backward axes, …
  – Goal is to evaluate multiple queries at once (10^4 – 10^6 queries).

Example: $x in //a/b

[Figure, animated over several slides: the NFA for //a/b, the equivalent DFA with the states (0), (01), and (02), and an XML tree with a and b nodes. While the document is read, the DFA state of each open element is kept on a stack: (0), then (0)(01), then (0)(01)(01), then (0)(01)(01)(02). Each node at which state (02) is reached is bound to $x. When an element is closed, its state is popped again, until only (0) remains at the end of the document.]
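
The stack-based run can be sketched in Python. The transition table below is the subset construction of the NFA 0 -any-> 0, 0 -a-> 1, 1 -b-> 2; the event list is a made-up document, `<a><a><b/></a><a/><b><a><b/></a></b></a>`, not necessarily the tree shown on the slides, and the names `DFA`, `SELECTING`, and `run` are illustrative.

```python
# DFA for //a/b; any label not listed leads back to state "0".
DFA = {
    ("0", "a"): "01", ("0", "b"): "0",
    ("01", "a"): "01", ("01", "b"): "02",
    ("02", "a"): "01", ("02", "b"): "0",
}
SELECTING = {"02"}


def run(events):
    """Process ("start", tag) / ("end", tag) events in document order.
    Returns the selected nodes (numbered by start event) and the stack trace."""
    stack = ["0"]
    selected, trace = [], [list(stack)]
    count = 0
    for kind, tag in events:
        if kind == "start":
            count += 1
            state = DFA.get((stack[-1], tag), "0")
            stack.append(state)
            if state in SELECTING:
                selected.append(count)   # bind $x to this node
        else:
            stack.pop()                  # closing tag: restore parent state
        trace.append(list(stack))
    return selected, trace


# Hypothetical stream of SAX-like events.
events = [
    ("start", "a"), ("start", "a"), ("start", "b"), ("end", "b"), ("end", "a"),
    ("start", "a"), ("end", "a"),
    ("start", "b"), ("start", "a"), ("start", "b"), ("end", "b"), ("end", "a"),
    ("end", "b"), ("end", "a"),
]
selected, trace = run(events)
print(selected)
print(trace[:4])
```

The stack never grows beyond the document depth plus one, which is where the low memory consumption of the streaming approach comes from.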

Size of DFAs

Example: //a/*/*/b

• Exponential in the size of the XPath statement, but
  – only exponential in the number of occurrences of “*”;
  – in the case of an automaton for multiple queries, exponential in the number of occurrences of “//”.

• Lazy evaluation of the DFA
  – Computation of states and transitions only on demand.
  – Saves much time and space in practice: documents are usually from a quite restrictive language.

[Green, Miklau, Onizuka, Suciu, ICDT 2003]
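
To illustrate the difference between eager and lazy construction, here is a small sketch for //a/*/*/b. It is not the implementation of the cited paper: the NFA encoding, the class `LazyDFA`, and the two-letter alphabet are assumptions made for the example.

```python
def nfa_step(states, symbol):
    """NFA for //a/*/*/b: 0 -any-> 0, 0 -a-> 1, 1 -*-> 2, 2 -*-> 3, 3 -b-> 4."""
    nxt = {0}
    if symbol == "a":
        nxt.add(1)
    if 1 in states:
        nxt.add(2)
    if 2 in states:
        nxt.add(3)
    if 3 in states and symbol == "b":
        nxt.add(4)
    return frozenset(nxt)


def eager_states(alphabet):
    """Full subset construction: every reachable DFA state."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for sym in alphabet:
            t = nfa_step(s, sym)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


class LazyDFA:
    """Computes and stores a transition only when it is first needed."""

    def __init__(self):
        self.start = frozenset({0})
        self.table = {}                      # (state, symbol) -> state

    def step(self, state, symbol):
        key = (state, symbol)
        if key not in self.table:
            self.table[key] = nfa_step(state, symbol)
        return self.table[key]

    def states(self):
        return {self.start} | set(self.table.values())


eager_count = len(eager_states("ab"))

lazy = LazyDFA()
state = lazy.start
for label in ["a", "c", "c", "b"]:           # one root-to-node path
    state = lazy.step(state, label)
print(eager_count, len(lazy.states()), 4 in state)
```

On a document whose paths all look alike, the lazy automaton only ever creates the handful of states those paths reach, while the eager construction pays for every combination of “a” occurrences in the last three positions.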

Extensions

• Branching tree patterns.
• Condition predicates.
• Backward axes.

• Boolean queries (“Can the tree pattern be embedded into the XML document?”)
  – Rather than node-selecting queries.

Highly Expressive Queries and Automata

Motivation

• Scalability in databases = (all three points at the same time)
  – Strictly linear time.
  – Little main memory required (DB in secondary storage).
  – Little jumping around in the data; sequential scans of the disk preferred (streaming).
    • Paged sequential reading is much faster than random access.

• Node-selecting queries on unranked trees (XML)
  – Higher expressiveness than what is possible with a single pass.

• Folklore: unary MSO queries can be evaluated in two passes through the tree.

The Arb Query Processor

Evaluates node-selecting queries
– In two sequential scans of the data.
– Memory requirements: O(depth(tree)), otherwise independent of the size of the DB.
– Highly parallelizable.
– Tree automata-based.
– High expressiveness: unary Monadic Second-Order Logic (MSO).
– Succinct representation of automata.

[Frick, Grohe, K., LICS 2003; K., VLDB 2003]

Selecting Tree Automata (STAs)

• STA: nondeterministic bottom-up tree automaton with a set of selecting states.
• Select a node if it is assigned a selecting state in all (or one) accepting runs.

[Figure: the two alternative selection conditions, “in all accepting runs” or “in at least one accepting run”.]

• Expressive power: unary MSO queries on trees.

[Neven’s thesis]; [Frick, Grohe & K., LICS 2003]

Two-Phase Query Evaluation

From the STA, construct:

1. Deterministic bottom-up tree automaton
   – Compute reachable states.

2. Deterministic top-down tree automaton (with selection)
   • Eliminate state-to-node assignments that do not lead to an accepting run.
   • Select the nodes of the query result.

[Frick, Grohe & K., LICS 2003]

Representation on Disk

[Figure, built up over several slides: an unranked tree with 14 nodes labeled a, b, and c, shown with its FirstChild and NextSibling edges and numbered 1 to 14 in document order.]

The tree is stored as one record per node, in document order. Each record holds the node label and two bits (first child present, next sibling present):

Node:      1  2  3  4  5  6  7  8  9  10 11 12 13 14
Children?: 10 01 11 01 01 00 01 11 01 01 01 01 00 00
Label:     a  b  a  a  b  c  b  a  b  c  c  a  b  b
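
A minimal sketch of this file layout, assuming that the two bits per record mean "has a first child" and "has a next sibling" (an interpretation that agrees with the FirstChild/NextSibling legend). The function names `decode` and `encode` are made up for the example.

```python
LABELS = "a b a a b c b a b c c a b b".split()
BITS = "10 01 11 01 01 00 01 11 01 01 01 01 00 00".split()


def decode(bits):
    """Rebuild the children lists (1-based ids, document order) in one
    forward scan over the two-bit records."""
    n = len(bits)
    parent = {1: None}
    children = {i: [] for i in range(1, n + 1)}
    pending = []          # nodes with a first child whose next sibling is still to come
    for i in range(1, n + 1):
        if parent[i] is not None:
            children[parent[i]].append(i)
        if i == n:
            break
        has_child, has_sibling = bits[i - 1][0] == "1", bits[i - 1][1] == "1"
        if has_child and has_sibling:
            pending.append(i)
        if has_child:
            parent[i + 1] = i                       # next record is the first child
        elif has_sibling:
            parent[i + 1] = parent[i]               # next record is the next sibling
        else:
            parent[i + 1] = parent[pending.pop()]   # subtree finished
    return children


def encode(children, root=1):
    """Inverse direction: emit the two bits per node in document order."""
    bits = []

    def visit(node, has_next):
        kids = children[node]
        bits.append(f"{int(bool(kids))}{int(has_next)}")
        for k, kid in enumerate(kids):
            visit(kid, k + 1 < len(kids))

    visit(root, False)
    return bits


tree = decode(BITS)
print({n: kids for n, kids in tree.items() if kids})
print(encode(tree) == BITS)
```

Under this reading, node 1 has the children 2, 3, 7, 8, and 14, and no pointers are needed on disk: the tree shape is fully determined by the order of the records and the two bits.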


Running Automata by Sequential Disk Scans

Running Automata by Seq. Scans

• Deterministic top-down tree automaton
  – One sequential forward scan of the data.
  – Memory: stack bounded by the depth of the tree.

• Deterministic bottom-up tree automaton
  – One sequential backward scan of the data.
  – Memory: stack bounded by the depth of the tree.

• For unranked trees!

Bottom-up Traversal

[Figure, animated over several slides: the backward scan over the file from node 14 to node 1, with the stack shown at each step.]

Node:      1  2  3  4  5  6  7  8  9  10 11 12 13 14
Children?: 10 01 11 01 01 00 01 11 01 01 01 01 00 00

Stack contents (nodes whose states are on the stack) after each node is read:

14; 14 13; 14 12; 14 11; 14 10; 14 9; 8; 7; 7 6; 7 5; 7 4; 3; 2; 1
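
The backward scan can be reproduced with a short sketch. It assumes the two bits per record mean "has a first child" and "has a next sibling"; the transition function `count_a`, which simply counts nodes labeled a, is a stand-in for a real automaton and not part of the tutorial.

```python
LABELS = "a b a a b c b a b c c a b b".split()
BITS = "10 01 11 01 01 00 01 11 01 01 01 01 00 00".split()


def bottom_up(labels, bits, delta):
    """Run a deterministic bottom-up automaton in one backward scan.
    The stack holds (node id, state) pairs; the trace records node ids only."""
    stack, trace = [], []
    for i in range(len(labels), 0, -1):
        has_child, has_sibling = bits[i - 1][0] == "1", bits[i - 1][1] == "1"
        # The first child was read last, so its state is on top of the stack.
        child = stack.pop() if has_child else None
        sibling = stack.pop() if has_sibling else None
        state = delta(labels[i - 1],
                      child[1] if child else None,
                      sibling[1] if sibling else None)
        stack.append((i, state))
        trace.append([node for node, _ in stack])
    return stack[0][1], trace


def count_a(label, child_state, sibling_state):
    """Toy transition function: number of 'a' nodes seen so far."""
    return int(label == "a") + (child_state or 0) + (sibling_state or 0)


root_state, trace = bottom_up(LABELS, BITS, count_a)
print(root_state)
print(trace)
```

The trace follows the stack contents listed above, and the stack never holds more than two entries for this tree.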

Monadic Datalog and TMNF

• Monadic datalog: datalog in which all “intensional predicates” are unary.
• Over unranked, ordered, finite trees:
  – Unary: Root, hasFirstChild, hasNextSibling, Label_a, and their complements.
  – Binary: FirstChild, NextSibling.

• TMNF (“tree-marking normal form”) - restricted syntax:
  – P(x) :- P1(x), P2(x).
  – P(x) :- P0(x0), R(x0, x).
  – P(x) :- P0(x0), R(x, x0).

Example:

D0(x) :- Root(x).
D1(x) :- D0(x0), FirstChild(x0, x).
D0(x) :- D1(x0), FirstChild(x0, x).
D0(x) :- D0(x0), NextSibling(x0, x).
D1(x) :- D1(x0), NextSibling(x0, x).

• D0: nodes at even depth in the tree.
• D1: nodes at odd depth in the tree.
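
The even/odd depth program can be run with a naive fixpoint loop. This sketch is for illustration only: it is not the linear-time evaluation algorithm, and the five-node tree given by `ROOT`, `FIRST_CHILD`, and `NEXT_SIBLING` is made up.

```python
# Hypothetical tree: node 1 has the children 2, 3; node 2 has the children 4, 5.
ROOT = {1}
FIRST_CHILD = {(1, 2), (2, 4)}
NEXT_SIBLING = {(2, 3), (4, 5)}


def evaluate():
    """Apply the five rules until no new facts are derived."""
    d0, d1 = set(ROOT), set()                            # D0(x) :- Root(x).
    changed = True
    while changed:
        new_d1 = (d1
                  | {x for (x0, x) in FIRST_CHILD if x0 in d0}    # D1 :- D0, FirstChild
                  | {x for (x0, x) in NEXT_SIBLING if x0 in d1})  # D1 :- D1, NextSibling
        new_d0 = (d0
                  | {x for (x0, x) in FIRST_CHILD if x0 in d1}    # D0 :- D1, FirstChild
                  | {x for (x0, x) in NEXT_SIBLING if x0 in d0})  # D0 :- D0, NextSibling
        changed = (new_d0, new_d1) != (d0, d1)
        d0, d1 = new_d0, new_d1
    return d0, d1


d0, d1 = evaluate()
print(sorted(d0), sorted(d1))
```

Nodes 1, 4, and 5 sit at depths 0 and 2 and end up in D0; nodes 2 and 3 sit at depth 1 and end up in D1.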

Known Facts about Monadic Datalog

[Gottlob & K., PODS 2002]:
• Monadic datalog over trees can be evaluated in time O(|Program| * |Data|).
• Monadic datalog over trees captures the unary MSO queries over trees.

[Gottlob & K., LICS 2002], [Frick, Grohe, K., LICS 2003]:
• Linear-time reduction to TMNF.
• Linear-time reduction also from Core XPath to TMNF (negation!).

[Grohe and Schweikardt, CSL 2003]:
• But: monadic datalog is much less succinct than MSO and monadic fixpoint logic.
  – However, no problems observed in practice yet.

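TMNF Example

P1(x) :- Root(x).
P2(y) :- P1(x), FirstChild(x,y).
P3(y) :- P2(x), FirstChild(x,y).
P4(y) :- P3(x), FirstChild(y,x).
P5(y) :- P4(x), FirstChild(y,x).

[Figure, animated over several slides: a chain of three nodes (the root, its first child, and that node's first child), each annotated with the predicates derived so far.]

Predicates derived at (root, child, grandchild), step by step:

1. {P1}, {}, {}
2. {P1}, {P2}, {}
3. {P1}, {P2}, {P3}
4. {P1}, {P2,P4}, {P3}
5. {P1,P5}, {P2,P4}, {P3}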

Implementation

• The bottom-up phase has to deal with nondeterminism: very large sets of states are possible.

• Compact representation of state sets using residual logic programs.

• Compilation of a TMNF program P into
  – a deterministic bottom-up automaton:
    • Sets of reachable states of the STA become states of the deterministic automaton.
    • Each such state is represented as a residual logic program.
  – a deterministic top-down automaton.

• Both are evaluated lazily: transitions are computed on demand and stored.

Propositional “Local” Program

TMNF program:

P1(x) :- Root(x).
…

“Local” program:

P1 :- Root.
P2[1] :- P1.
P3[1] :- P2.
P4 :- P3[1].
P5 :- P4[1].

[Figure: a node with its two successors, marked [1] and [2].]

Bottom-up Run

Local program:

P1 :- Root.
P2[1] :- P1.
P3[1] :- P2.
P4 :- P3[1].
P5 :- P4[1].

[Figure, animated over several slides: the three-node chain, with the residual program computed at each node from the leaf upwards.]

• Leaf: {}
• Middle node: {P4 :- P2}
  – Represents 2^3 * (2^2 - 1) = 24 reachable states.
• Root: {P4 :- P2} + {Root; P4[1] :- P2[1]} gives {P1; P2[1]; P4[1]; P5}

Top-down Run

[Figure, animated over several slides: the same chain, processed from the root downwards.]

• Root: {P1; P2[1]; P4[1]; P5}
• Middle node: {P4 :- P2} + {P2; P4} gives {P2; P3[1]; P4}
• Leaf: {} + {P3} gives {P3}

Example: Parallel Regular Expression Matching

• Encode the string as an almost complete binary “infix” tree.

• Represent the backward step between leaves as a caterpillar (tree-walking) expression.

• Express the regular expression over strings as a monadic datalog program over the infix tree.

[Figure: the string “example” encoded as an infix tree.]

Some further interesting work

• Structural Joins, Twig Joins
  – [Al-Khalifa et al., ICDE 2002; Bruno, Koudas, and Srivastava, SIGMOD 2002; …]
  – Exploit tree structure to compute matches of a tree pattern in time O(|input| + |output|).

• Index Structures for Path Expressions
  – [Kemper and Moerkotte, 1992; Milo and Suciu, ICDT 1999]
  – Bisimulation; data guides, 1-indexes, t-indexes, …

• Optimization of XPath
  – Containment and Minimization [Miklau and Suciu, PODS 2002; Neven and Schwentick, ICDT 2003; Wood, WebDB 2001, ICDT 2003; Deutsch and Tannen, KRDB 2001]
  – Satisfiability [Hidders, DBPL 2003]
  – Axiom systems for query rewriting [Benedikt, Fan and Kuper, ICDT 2003]

• Closure Properties for XPath Fragments
  – [Benedikt, Fan and Kuper, ICDT 2003]
