XPath Query Processing DBPL9 Tutorial, Sept. 8, 2003, Part 2
Georg Gottlob, TU Wien
Christoph Koch, U. Edinburgh
Based on joint work with R. Pichler
Contents
Part 1
• XPath Basics
• Axis Evaluation
• Experiments with current systems
• Polynomial-time evaluation of Core XPath
• Core XPath and datalog
• Polynomial-time evaluation of full XPath
Part 2
• Context simplification and efficient evaluation of XPath
• Parallel complexity of XPath
• Automata-based techniques:
– XPath on Streaming XML
– Expressive queries and automata.
• Further relevant work
Time and space bound
Bottom-up evaluation based on CVT (context-value tables):
– Time O(|data|^5 * |query|^2), space O(|data|^4 * |query|^2).
Space bound (n … number of nodes in input document):
• Contexts are at most triples: at most n^3 contexts.
• Sizes of values:
– Node sets: at most O(n)
– Strings, numbers: at most O(|data| * |query|) (iterated concatenation of strings, multiplication of numbers)
• Each CVT is of size O(|data|^4 * |query|).
Time bound: the most expensive computation is O(n^2)
– Relational operation “=” on node sets (e.g. a/b//c[d//e/f/g = h/i//j])
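The space bound can be spelled out as a short calculation. This is a sketch that assumes one context-value table per subexpression of the query (hence at most |query| tables) and n ≤ |data|; neither assumption is stated on the slide itself.

```latex
\[
\underbrace{n^{3}}_{\text{contexts}} \cdot
\underbrace{O(|data| \cdot |query|)}_{\text{size of one value}}
= O(|data|^{4} \cdot |query|) \quad \text{per CVT},
\]
\[
\underbrace{|query|}_{\text{tables}} \cdot O(|data|^{4} \cdot |query|)
= O(|data|^{4} \cdot |query|^{2}) \quad \text{in total}.
\]
```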
Alternative context representation
• Contexts represented as (“previous context node”, “current context node”) rather than (“context node”, “position”, “size”).
• Need to recompute “position” and “size” on demand.
• Complexity lowered to time O(|data|^4 * |query|^2), space O(|data|^3 * |query|^2).
//a/b[position() + 1 = size()]
[Figure: example tree. Root 0:c has children 1:a and 5:a; 1:a has children 2:b, 3:b, 4:b; 5:a has children 6:b, 7:b.]
child::b … { (1,2), (1,3), (1,4), (5,6), (5,7) }
child::b[position()+1=size()] … { (1,3), (5,6) }
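The pair representation can be tried out on the example tree. The following is a minimal sketch, not the tutorial's implementation: the dictionary `TREE` and the helpers `children`, `child_step`, and `position_and_size` are hypothetical names introduced here, and the helpers make no attempt to meet the stated complexity bounds.

```python
# Toy document from the slide: node id -> (label, parent id).
TREE = {
    0: ("c", None),
    1: ("a", 0), 2: ("b", 1), 3: ("b", 1), 4: ("b", 1),
    5: ("a", 0), 6: ("b", 5), 7: ("b", 5),
}


def children(node, label):
    """Children of `node` carrying `label`, in document order."""
    return [n for n in sorted(TREE) if TREE[n][1] == node and TREE[n][0] == label]


def child_step(context_nodes, label):
    """child::label as a set of (previous context node, current context node) pairs."""
    return {(prev, cur) for prev in context_nodes for cur in children(prev, label)}


def position_and_size(pair, label):
    """Recompute position and size on demand from a (previous, current) pair."""
    prev, cur = pair
    siblings = children(prev, label)
    return siblings.index(cur) + 1, len(siblings)


a_nodes = [n for n in sorted(TREE) if TREE[n][0] == "a"]   # //a
pairs = child_step(a_nodes, "b")                             # child::b

# child::b[position() + 1 = size()]
selected = set()
for pair in pairs:
    position, size = position_and_size(pair, "b")
    if position + 1 == size:
        selected.add(pair)
```

Only two node ids are stored per context; position and size are derived from the previous context node whenever a predicate asks for them.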
Context Simplification Technique
1. Only materialize relevant context.
2. Core XPath evaluation algorithm for outermost and innermost paths: //a/b/c//d[…]/e[…(a/b/c)].
3. Treating “position” and “size” in a loop.
• Because of tree shape of query, loops never have to be nested.
[Figure: query tree with the node labels “/”, “descendant::a”, “child::b[ ]” (twice), “position() = count( )”, and “position() + 1 = last()”. Annotations: “(cn)”, “(cn, cp, cs) - loop” (twice), and “Compute node set for which child::b[…] is true”.]
Linear Space Fragment
• “Wadler Fragment” [Wadler, 1999]: Core XPath + position(), last(), and arithmetic.
• Evaluation in quadratic time and linear space.
Example: //a/b[position() * 2 = last() and c/d[position()*3 = last()]]//e
[Figure: the query tree of the example, with each path step annotated (cn) and each of the two predicates annotated (cn,cp,cs).]
• For x in [[//a]] compute contexts (y,p,n) in x.[[b]]. Compute Y = { y | (y,p,n) ∈ x.[[b]] and p*2 = n }.
• Similarly, compute Z = { z | z.[[ d[position()*3 = last()] ]] is true }.
• Compute X = { x | z ∈ Z, x ∈ z.[[ child::c ]]^-1 } in linear time.
• Result is { w | v ∈ X ∩ Y, w ∈ v.[[descendant::e]] }.
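The four steps can be followed on a small document. This is a sketch under stated assumptions: the document `DOC` and the helpers `kids`, `descendants`, and `contexts` are made up for illustration, and the naive helpers do not reach the quadratic-time bound. What the sketch does show is that (node, position, size) triples are only produced inside a loop for one context node at a time and never stored as a whole table.

```python
# Hypothetical toy document: node id -> (label, parent id); ids follow document order.
DOC = {
    0: ("r", None),
    1: ("a", 0),
    2: ("b", 1), 3: ("c", 2), 4: ("d", 3), 5: ("d", 3), 6: ("d", 3), 7: ("e", 2),
    8: ("b", 1), 9: ("e", 8),
    10: ("a", 0), 11: ("b", 10), 12: ("e", 11),
}


def kids(node, label):
    """Children of `node` with the given label, in document order."""
    return [n for n in sorted(DOC) if DOC[n][1] == node and DOC[n][0] == label]


def descendants(node):
    """All proper descendants of `node`."""
    found = []
    for n in sorted(DOC):
        parent = DOC[n][1]
        while parent is not None:
            if parent == node:
                found.append(n)
                break
            parent = DOC[parent][1]
    return found


def contexts(x, label):
    """Yield the contexts (node, position, size) of x.[[child::label]] for a single x."""
    matches = kids(x, label)
    for position, y in enumerate(matches, start=1):
        yield y, position, len(matches)


a_nodes = [n for n in sorted(DOC) if DOC[n][0] == "a"]            # [[//a]]
# Y: b-children of some a-node with position * 2 = last
Y = {y for x in a_nodes for (y, p, n) in contexts(x, "b") if p * 2 == n}
# Z: nodes at which d[position() * 3 = last()] is true
Z = {z for z in DOC if any(p * 3 == n for (_, p, n) in contexts(z, "d"))}
# X: nodes that have a c-child in Z (inverse of child::c)
X = {DOC[z][1] for z in Z if DOC[z][0] == "c"}
# Result: e-descendants of nodes satisfying both conjuncts
result = {w for v in X & Y for w in descendants(v) if DOC[w][0] == "e"}
```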
Summary
Full XPath
• Bottom-up algorithm based on CVT
– Time O(|data|^5 * |query|^2), space O(|data|^4 * |query|^2).
• Top-down evaluation
– Time O(|data|^4 * |query|^2), space O(|data|^3 * |query|^2).
• Context-reduction technique
– Time O(|data|^4 * |query|^2), space O(|data|^2 * |query|^2).
Wadler fragment
– Time O(|data|^2 * |query|^2), space O(|data| * |query|).
Core XPath
– Time and space O(|data| * |query|).
Parallel Complexity of XPath
• Known: XPath is in P w.r.t. combined complexity [G., K., and Pichler, VLDB 2002].
• P-hardness => unlikely that there is an efficient parallel algorithm (conjecture: NC ≠ P).
• Even quite restrictive fragments of XPath are P-hard
– Core XPath using only child, parent, and descendant axes, no “branching” of tree patterns.
– Proof by encoding circuits, somewhat involved!
• But: without negation, Core XPath is in LOGCFL (⊆ NC^2, highly parallelizable!)
PF – Path Query Fragment
• PF = Core XPath without conditions.
• E.g. //a/b//c/parent::d//f/g/ancestor::a/*
• Theorem: PF is NL-complete w.r.t. combined complexity (and L-reductions).
• Membership: paths easy to guess and check in NL.
• NL-hardness by reduction from Graph Reachability …
PF is NL-hard.
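• Reachability in precisely m steps:
[Formula garbled in extraction. Recoverable pieces: the query begins with /descendant::v_i/, ends with self::v_j, and repeats m times a sub-path built from child::*, descendant::c, parent::*, and child::e steps, with repetition counts depending on k and |V|.]
• Add loop at each node to graph => reachability in at most m steps.
• Set m = |E|.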
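The graph-side trick (self-loops turn “exactly m steps” into “at most m steps”) can be checked directly. This sketch covers only that step, not the XPath encoding; the graph and the function names are hypothetical.

```python
def reachable_in_exactly(edges, source, target, m):
    """True if some walk of exactly m edges leads from source to target."""
    frontier = {source}
    for _ in range(m):
        frontier = {w for (v, w) in edges if v in frontier}
    return target in frontier


def with_self_loops(edges, nodes):
    """Add a loop (v, v) at every node, so shorter walks can be padded to length m."""
    return set(edges) | {(v, v) for v in nodes}


# Hypothetical example graph: 1 -> 2 -> 3, node 4 isolated.
nodes = {1, 2, 3, 4}
edges = {(1, 2), (2, 3)}
m = len(edges)            # m = |E|, as on the slide
looped = with_self_loops(edges, nodes)

# Node 2 is one step away from node 1: not reachable in exactly 2 steps
# in the original graph, but reachable once loops allow waiting.
exact_without_loops = reachable_in_exactly(edges, 1, 2, m)
exact_with_loops = reachable_in_exactly(looped, 1, 2, m)
```

A simple path never repeats an edge, so it has at most |E| edges; this is why m = |E| suffices for plain reachability.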
Further fragments with low parallel complexity
Combined complexity of Core Xpath is in L if:
1. Only one-step axes are used (child, parent; self).
2. Only transitive downward axes are used (descendant, descendant-or-self, …).
Increasing the Size of the LOGCFL Fragment
• “Positive Wadler fragment” [Wadler, 2000]: just like positive Core XPath, but with position arithmetic in conditions.
– child::a[position()+1 = last()] … get the second-last child labeled “a”.
– No iteration of predicates: child::a[…][…].
• Theorem (combined complexity): the positive WF is
– LOGCFL-complete;
– with iterated predicates (already when iterated at most twice), it is P-complete.
Increasing the Size of the LOGCFL Fragment
• pXPath: “positive”/parallel XPath.
1. No negation.
2. No iterated predicates […][…].
3. Depth of nesting of arithmetic operations inside a predicate is bounded by some constant.
4. Forbidden built-in functions: count, sum, string, local-name, name, namespace-uri, string-length, normalize-space.
5. Forbidden: relational operations on booleans.
• Theorem. pXPath is LOGCFL-complete (combined complexity).
• Maximal parallelizable fragment of XPath, unless P = NC.
– Adding any of the features excluded by (1) – (5) leads to P-hardness.
Data and Query Complexity
• Theorem. PF is L-complete under NC1-reductions (data complexity).
• Theorem. XPath w/o multiplication, concatenation is in L w.r.t. query complexity.
• Surprisingly, data complexity and query complexity are low; combined complexity is higher!
[Figure: “Data complexity” diagram with the labels L, L-complete (NC1-red.), XPath, and PF.]
FSA on Streams
• Translate XPath path query into FSA, process stream of (e.g.) SAX events.
– Very good scalability, low memory consumption (stack needed)
• Selective dissemination of information (SDI) / publish-subscribe (cf. XFilter [Altinel and Franklin, VLDB 2000], XTrie [Chan et al., ICDE 2002]).
– Boolean queries.
– Extensions to support branching tree patterns, condition predicates, backward axes, …
– Goal is to evaluate multiple queries at once (10^4 – 10^6 queries).
Size of DFAs
• Exponential in the size of XPath statement, but
– Only exponential in number of occurrences of “*”.
– In case of automaton for multiple queries, exponential in number of occurrences of “//”.
• Lazy evaluation of DFA
– Computation of states and transitions only on demand.
– Saves much time and space in practice: documents usually from quite restrictive language.
[Green, Miklau, Onizuka, Suciu, ICDT 2003]
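A lazily built DFA over a stream of open/close events can be sketched in a few lines. This is an illustration under assumptions, not the cited system: the class `LazyDFA`, the function `run`, and the `("start", name)` / `("end", name)` event tuples are invented here (they stand in for SAX callbacks), and only child and descendant steps with name tests or “*” are supported.

```python
class LazyDFA:
    """DFA whose states are sets of NFA states; transitions are built on demand."""

    def __init__(self, steps):
        self.steps = steps                 # list of (axis, name test)
        self.start = frozenset({0})        # NFA state i = first i steps matched
        self.table = {}                    # (DFA state, label) -> DFA state

    def step(self, state, label):
        key = (state, label)
        if key not in self.table:          # compute the transition only when needed
            successor = set()
            for i in state:
                if i == len(self.steps):
                    continue               # fully matched: nothing further to match
                axis, name = self.steps[i]
                if name in ("*", label):
                    successor.add(i + 1)   # this element matches step i + 1
                if axis == "descendant":
                    successor.add(i)       # keep waiting at any depth below
            self.table[key] = frozenset(successor)
        return self.table[key]


def run(dfa, events):
    """Return the indices of start events whose element is selected by the query."""
    stack = [dfa.start]                    # one DFA state per open element
    matches = []
    for index, (kind, label) in enumerate(events):
        if kind == "start":
            state = dfa.step(stack[-1], label)
            stack.append(state)
            if len(dfa.steps) in state:
                matches.append(index)
        else:
            stack.pop()
    return matches


# <r><a><b/><c><b/></c></a><b/></r>
EVENTS = [("start", "r"), ("start", "a"), ("start", "b"), ("end", "b"),
          ("start", "c"), ("start", "b"), ("end", "b"), ("end", "c"),
          ("end", "a"), ("start", "b"), ("end", "b"), ("end", "r")]

dfa = LazyDFA([("descendant", "a"), ("child", "b")])     # //a/b
matches = run(dfa, EVENTS)
```

The stack depth equals the nesting depth of the document, and `dfa.table` only ever contains transitions that the input actually exercised.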
Extensions
• Branching tree patterns.
• Condition predicates.
• Backward axes.
• Boolean queries (“Can tree pattern be embedded into XML document?”)
– Rather than node-selecting queries.
Motivation
• Scalability in databases = (all three points at the same time)
– Strictly linear time.
– Little main memory required (DB in secondary storage).
– Little jumping around in the data, sequential scans of disk preferred (streaming).
• Paged sequential reading much faster than random access.
• Node-selecting queries on unranked trees (XML)
– Higher expressiveness than what is possible with single pass.
• Folklore: unary MSO queries can be evaluated in two passes through the tree.
The Arb Query Processor
Evaluates node-selecting queries
– In two sequential scans of the data.
– Memory requirements: O(depth(tree)), otherwise independent of size of DB.
– Highly parallelizable.
– Tree automata-based.
– High expressiveness: unary Monadic Second Order Logic (MSO).
– Succinct representation of automata.
[Frick, Grohe, K., LICS 2003; K., VLDB 2003]
Selecting Tree Automata (STAs)
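• STA: nondeterministic bottom-up tree automata with a set of selecting states.
• Select a node if it is assigned a selecting state in all (or, alternatively, in at least one) accepting runs.
[The formulas for the two selection semantics are not recoverable from the extraction.]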
• Expressive power: unary MSO queries on trees.
[Neven’s thesis]; [Frick, Grohe & K., LICS 2003]
Two-Phase Query Evaluation
From STA:
1. Deterministic bottom-up tree automaton
– compute reachable states.
2. Deterministic top-down tree automaton (with selection)
• Eliminate state-to-node assignments that do not lead to accepting run.
• Select nodes of query result.
[Frick, Grohe & K., LICS 2003]
Representation on Disk
Nodes are stored in document order. Each record holds the node label and two bits (“Children?”): the first bit says whether the node has a first child, the second whether it has a next sibling.
Node:      1  2  3  4  5  6  7  8  9  10 11 12 13 14
Label:     a  b  a  a  b  c  b  a  b  c  c  a  b  b
Children?: 10 01 11 01 01 00 01 11 01 01 01 01 00 00
[Figure: the example tree with nodes 1 to 14, drawn in its binary FirstChild/NextSibling encoding.]
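The two-bit records can be produced and read back with one pass each. The sketch below assumes the bit order (has first child, has next sibling) suggested by the FirstChild/NextSibling legend; the nested-tuple tree format and the function names `encode` and `decode_parents` are hypothetical.

```python
# The slide's example tree as nested (label, children) tuples.
TREE = ("a", [
    ("b", []),
    ("a", [("a", []), ("b", []), ("c", [])]),
    ("b", []),
    ("a", [("b", []), ("c", []), ("c", []), ("a", []), ("b", [])]),
    ("b", []),
])


def encode(tree, has_sibling=False, out=None):
    """Emit (label, bits) records in document order."""
    label, kids = tree
    out = [] if out is None else out
    out.append((label, ("1" if kids else "0") + ("1" if has_sibling else "0")))
    for i, kid in enumerate(kids):
        encode(kid, i < len(kids) - 1, out)
    return out


def decode_parents(bits):
    """Recover parent pointers (node ids start at 1) in one forward scan."""
    parent = {}
    pending = []            # parents still waiting for a later sibling
    current_parent = None
    for node, b in enumerate(bits, start=1):
        has_child, has_sibling = b[0] == "1", b[1] == "1"
        parent[node] = current_parent
        if has_child:
            if has_sibling:
                pending.append(current_parent)   # come back here after the subtree
            current_parent = node
        elif not has_sibling:
            current_parent = pending.pop() if pending else None
        # no child but a sibling: the next record has the same parent
    return parent


records = encode(TREE)
labels = " ".join(label for label, _ in records)
bits = " ".join(b for _, b in records)
parents = decode_parents(bits.split())
```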
Running Automata by Seq. Scans
• Deterministic top-down tree automaton
– One sequential forward scan of the data.
– Memory: Stack bounded by depth of tree.
• Deterministic bottom-up tree automaton
– One sequential backward scan of the data.
– Memory: Stack bounded by depth of tree.
• For unranked trees!
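A backward scan over the two-bit records is enough to run a deterministic bottom-up automaton on the FirstChild/NextSibling encoding. The following sketch assumes the record layout of the “Representation on Disk” slide; `backward_scan` and the `delta` callbacks are hypothetical names, and `delta` stands in for the automaton's transition function.

```python
BITS = "10 01 11 01 01 00 01 11 01 01 01 01 00 00".split()


def backward_scan(bits, delta):
    """Run delta(node, child_state, sibling_state) bottom-up in one backward scan.

    Returns the state at the root and the stack contents after each record.
    """
    stack, trace = [], []
    for node in range(len(bits), 0, -1):
        has_child = bits[node - 1][0] == "1"
        has_sibling = bits[node - 1][1] == "1"
        # Scanning backward, the first-child result was pushed last, so it is
        # on top of the stack; the next-sibling result lies beneath it.
        child_state = stack.pop() if has_child else None
        sibling_state = stack.pop() if has_sibling else None
        stack.append(delta(node, child_state, sibling_state))
        trace.append(list(stack))
    return stack[0], trace


# Using the node id itself as "state" makes the stack contents visible.
root_state, trace = backward_scan(BITS, lambda node, child, sibling: node)

# A tiny automaton-like computation: count the nodes of the whole tree.
total, _ = backward_scan(
    BITS, lambda node, child, sibling: 1 + (child or 0) + (sibling or 0))
```

On this input the stack never holds more than two entries.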
Bottom-up Traversal
[Animation, condensed: the records are scanned backward from node 14 to node 1. The stack contents after reading each node are listed below.]
Node:      1  2  3  4  5  6  7  8  9  10 11 12 13 14
Children?: 10 01 11 01 01 00 01 11 01 01 01 01 00 00
Node read   Stack
14          14
13          14 13
12          14 12
11          14 11
10          14 10
9           14 9
8           8
7           7
6           7 6
5           7 5
4           7 4
3           3
2           2
1           1
Monadic Datalog and TMNF
• Monadic datalog: datalog, all “intensional predicates” are unary.
• Over unranked, ordered, finite trees:
– Unary: Root, hasFirstChild, hasNextSibling, Label_a, and their complements.
– Binary: FirstChild, NextSibling
• TMNF (“tree-marking normal form”) - restricted syntax:
– P(x) :- P1(x), P2(x).
– P(x) :- P0(x0), R(x0, x).
– P(x) :- P0(x0), R(x, x0).
Example:
D0(x) :- Root(x).
D1(x) :- D0(x0), FirstChild(x0, x).
D0(x) :- D1(x0), FirstChild(x0, x).
D0(x) :- D0(x0), NextSibling(x0, x).
D1(x) :- D1(x0), NextSibling(x0, x).
• D0: nodes at even depth in tree.
• D1: nodes at odd depth in tree.
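The even/odd-depth program can be run with a naive fixpoint loop. This is a sketch for illustration only: it is not the linear-time evaluation algorithm cited in the tutorial, and the tree `PARENT`, the rule encoding `RULES`, and the helper names are assumptions made here.

```python
# Hypothetical input tree as parent pointers (the 14-node example tree).
PARENT = {1: None, 2: 1, 3: 1, 7: 1, 8: 1, 14: 1,
          4: 3, 5: 3, 6: 3,
          9: 8, 10: 8, 11: 8, 12: 8, 13: 8}

# Rules of the form Head(x) :- Body(x0), Rel(x0, x).
RULES = [("D1", "D0", "FirstChild"),
         ("D0", "D1", "FirstChild"),
         ("D0", "D0", "NextSibling"),
         ("D1", "D1", "NextSibling")]


def relations(parent):
    """Derive Root, FirstChild, and NextSibling from parent pointers."""
    kids = {}
    for node in sorted(parent):
        kids.setdefault(parent[node], []).append(node)
    root = {n for n, p in parent.items() if p is None}
    first_child = {(p, ks[0]) for p, ks in kids.items() if p is not None}
    next_sibling = {(a, b) for ks in kids.values() for a, b in zip(ks, ks[1:])}
    return root, first_child, next_sibling


def evaluate(parent):
    """Apply the rules until no new facts are derived (naive fixpoint)."""
    root, first_child, next_sibling = relations(parent)
    binary = {"FirstChild": first_child, "NextSibling": next_sibling}
    facts = {"D0": set(root), "D1": set()}      # D0(x) :- Root(x).
    changed = True
    while changed:
        changed = False
        for head, body, rel in RULES:
            new = {x for (x0, x) in binary[rel] if x0 in facts[body]} - facts[head]
            if new:
                facts[head] |= new
                changed = True
    return facts


facts = evaluate(PARENT)
```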
Known Facts about Monadic Datalog
[Gottlob & K., PODS 2002]:
• Monadic datalog over trees (M.dl.o.t.) can be evaluated in time O(|Program| * |Data|).
• M.dl.o.t. captures the unary MSO queries over trees.
[Gottlob & K., LICS 2002], [Frick, Grohe, K. LICS 2003]:
• Linear-time reduction to TMNF.
• Linear-time reduction also from Core XPath to TMNF (negation!)
[Grohe and Schweikardt, CSL 2003]:
• But: M.dl. much less succinct than MSO, monadic fixpoint logic.
– However, no problems observed in practice yet.
TMNF Example
P1(x) :- Root(x).
P2(y) :- P1(x), FirstChild(x,y).
P3(y) :- P2(x), FirstChild(x,y).
P4(y) :- P3(x), FirstChild(y, x).
P5(y) :- P4(x), FirstChild(y, x).
[Animation, condensed: three nodes (the root, its first child, and that child's first child) and the predicate sets derived for them after each step.]
Step   Root       Middle node   Bottom node
1      {P1}       {}            {}
2      {P1}       {P2}          {}
3      {P1}       {P2}          {P3}
4      {P1}       {P2,P4}       {P3}
5      {P1,P5}    {P2,P4}       {P3}
Implementation
• Bottom-up phase has to deal with nondeterminism – very large sets of states possible.
• Compact representation of state sets using residual logic programs.
• Compilation of TMNF program P into
– Deterministic bottom-up automaton
• Sets of reachable states of STA become states of the deterministic bottom-up automaton.
• Each such state is represented as a residual logic program.
– Deterministic top-down automaton.
• Both evaluated lazily: Transitions computed on demand and stored.
Propositional “Local” Program
P1(x) :- Root(x).
P2(y) :- P1(x), FirstChild(x,y).
P3(y) :- P2(x), FirstChild(x,y).
P4(y) :- P3(x), FirstChild(y, x).
P5(y) :- P4(x), FirstChild(y, x).
P1 :- Root.
P2[1] :- P1.
P3[1] :- P2.
P4 :- P3[1].
P5 :- P4[1].
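[Figure: a node and its two successors, labeled [1] and [2]; the “local” program refers to a predicate P at successor [1] as P[1].]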
Bottom-up Run
Local program A:
P1 :- Root.
P2[1] :- P1.
P3[1] :- P2.
P4 :- P3[1].
P5 :- P4[1].
[Animation, condensed: residual programs computed for the three nodes, from the bottom node up to the root.]
1. Bottom node: {}
2. Middle node: {P4 :- P2}. Represents 2^3 * (2^2 - 1) = 24 reachable states.
3. Root: A + {Root; P4[1] :- P2[1]} gives {P1; P2[1]; P4[1]; P5}.
Top-down Run
Local program A:
P1 :- Root.
P2[1] :- P1.
P3[1] :- P2.
P4 :- P3[1].
P5 :- P4[1].
[Animation, condensed: the results of the bottom-up run are completed from the root down to the bottom node.]
1. Root: {P1; P2[1]; P4[1]; P5}.
2. Middle node: A + {P4 :- P2} + {P2; P4} gives {P2; P3[1]; P4}.
3. Bottom node: {} + {P3} gives {P3}.
Example: Parallel Regular Expression Matching
• Encode string as almost complete binary “infix” tree.
• Represent backward step between leaves as caterpillar (tree-walking) expression.
• Express regular expression over strings as monadic datalog program over infix tree.
[Figure: the string “example” encoded as an infix tree; the letters e, x, a, m, p, l, e appear in in-order.]
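The infix encoding can be illustrated by splitting the string at its midpoint. This is a sketch under an assumption: midpoint splitting yields a balanced tree of logarithmic depth, which may differ in shape from the “almost complete” tree on the slide for some string lengths. The function names are hypothetical.

```python
def infix_tree(s):
    """Build a balanced binary tree whose in-order traversal spells s.

    Trees are (left, character, right) tuples; the empty tree is None.
    """
    if not s:
        return None
    mid = len(s) // 2
    return (infix_tree(s[:mid]), s[mid], infix_tree(s[mid + 1:]))


def inorder(tree):
    """Read the string back by an in-order traversal."""
    if tree is None:
        return ""
    left, char, right = tree
    return inorder(left) + char + inorder(right)


def depth(tree):
    """Number of nodes on the longest root-to-leaf path."""
    if tree is None:
        return 0
    left, _, right = tree
    return 1 + max(depth(left), depth(right))


tree = infix_tree("example")
root_char = tree[1]
```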
Some further interesting work
• Structural Joins, Twig Joins
– [Al-Khalifa et al., ICDE 2002; Bruno, Koudas, and Srivastava, SIGMOD 2002; …]
– Exploit tree structure to compute matches of tree pattern in time O(|input| + |output|).
• Index Structures for Path Expressions
– [Kemper and Moerkotte, 1992; Milo and Suciu, ICDT 1999]
– Bisimulation; data guides, 1-indexes, t-indexes, …
• Optimization of XPath
– Containment and Minimization [Miklau and Suciu, PODS 2002; Neven and Schwentick, ICDT 2003; Wood, WebDB 2001, ICDT 2003; Deutsch and Tannen, KRDB 2001]
– Satisfiability [Hidders, DBPL 2003]
– Axiom systems for query rewriting [Benedikt, Fan and Kuper, ICDT 2003]
• Closure Properties for XPath Fragments
– [Benedikt, Fan and Kuper, ICDT 2003]