1 prefix path streaming: a new clustering method for xml twig pattern matching ting chen, tok wang...
TRANSCRIPT
1
Prefix Path Streaming: a New Clustering Method for XML Twig
Pattern Matching
Ting Chen, Tok Wang Ling, Chee-Yong ChanSchool of Computing,
National University of SingaporeSept, 2004
2
Outline
XML Twig Pattern Matching Background State of the Art: TwigStack Sub-optimality of TwigStack Contributions of our Work
PPS: Prefix-Path-Streaming Notations Difference of Tag Streaming and PP Streaming
PPS Stream Pruning PPS Based Twig Pattern Matching
Algorithm: PPS-TwigStack Analysis
Performance Conclusion
3
XML Twig Pattern Matching XML Data Model
A XML document is commonly modeled as a rooted, ordered and labeled tree.
E.g. Note that identifiers (e.g. b1) are given to tree nodes for easy reference
book
preface chapter chapter
paragraph section
section
figure
paragraph
section
figure
paragraph figure
paragraph
………….
title
title
p1
t1
c1
s1
s2
t2 p2
pf1
f1
s3
p3
f2
c2
f3
p4
b1
D1:
4
XML Twig Pattern Matching Regional Coding [1]
Node Label: (startPos: endPos, LevelNum) startPos and endPos are calculated by performing a pre-order
traversal of the document tree; LevelNum is the level of the node in the tree.
E.g. book (0: 50, 1)
preface (1:3, 2) chapter (4:22, 2) chapter(23:45, 2)
paragraph (2:2, 3) section (5:21, 3)
section(7:12, 4)
figure (10:10, 6)
paragraph(9:11, 5)
section(13:17, 4)
figure (15:15, 6)
paragraph(14:16, 5) figure (19:19, 5)
paragraph(18:20, 4)title: (6:6, 4)
title: (8:8, 5)
D1:
1. M.P. Consens and T.Milo. Optimizing queries on files. In In Proceedings of ACM SIGMOD, 1994.
5
XML Twig Pattern Matching
What is a Twig Pattern? A twig pattern is a small tree whose nodes are predicates (e.g.
element type test) and edges are either Parent-Child (P-C)
edges or Ancestor-Descendant (A-D) edges. E.g. An XPath query Q1 selects Figure elements which are
descendants of some Paragraph elements which in turn are children of Section elements having at least one child element Title
Section
Title Paragraph
Figure
Q1: Section[Title]/Paragraph//Figure
6
XML Twig Pattern Matching
Twig Pattern Matching Problem Statement
Given a query twig pattern Q, and a XML database D that has index structures (e.g. regional coding scheme) to identify database nodes that satisfy each of Q’s node predicates, compute ALL the answers to
Q in D. E.g. The matches for twig pattern Section[Title]/Paragraph//Figure in the
document D1 are:
(s1, t1, p4, f3)
(s2, t2, p2, f1)
D1: b1
pf1 c1 c2
p1 s1
s2
f1
p2
s3
f2
p3 f3
s4t1
t2
7
XML Twig Pattern Matching TwigStack[2]: a Holistic approach
Tag Streaming: all elements of tag q are grouped in a stream Tq ordered by their startPos
Optimal when all the edges in twig pattern are A-D edges Two-phase algorithm:
Phase 1 TwigJoin: a list of intermediate paths are outputted Phase 2 Merge: merge the intermediate path list to get the result
1. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal xml pattern matching. In In Proceedings of ACM SIGMOD, 2002.
8
XML Twig Pattern Matching TwigStack Review
A node q in a twig pattern Q is coupled with a stack Sq An element e with tag E is pushed into its stack SE if and only if e is in some
match to Q. E.g. Only color highlighted elements are pushed into their stacks. Thus it is ensured that no redundant paths are output.
An element e is popped out from its stack if all matches involving it have been reported
Thus we ensure that the memory space used by stacks is bounded.
Q: Section[//Title]//Paragraph//Figure
SSection
STitle
SParagraph
SFigure
D1:b1
pf1 c1 c2
p1 s1
s2
f1
p2
s3
f2
p3 f3
s4t1
t2
9
XML Twig Pattern Matching Optimality of TwigStack for A-D edge only twig pattern
Each stream Tq is scanned only once Obviously q should appear the twig pattern
No redundant intermediate result: All intermediate paths output in Phase 1 appear in the final result;
CPU and I/O cost: O(|Input| + |Output|) Space Complexity: O(|Longest Path in the XML tree|) E.g. Q2:Section[//Title]//Paragraph//Figure on document D1
Phase 1 Output: <s1,t1>, <s1,t2>, <s2,t2> <s1,p2,f1>, <s2,p2,f1>, <s1,p3,f2>, <s1,p4,f3>,Note that <s3,p3,f2> will not be output ; although it matches one root to leaf
path of Q2, it is not in a match to Q2. Phase 2 (i.e. the final) output:
< s1,t1,p2,f1>, < s1,t1,p3,f2>, < s1,t1,p4,f3>, < s1,t2,p2,f1>, < s1,t2,p3,f2>, < s1,t2,p4,f3>, < s2,t2,p2,f1>
10
XML Twig Pattern Matching TwigStack is not optimal for twig pattern with P-C
edge E.g. Q1:Section[Title]/Paragraph//Figure on document D1
s1 s2 s3
t1 t2 p1 p2 p3 p4
f1 f2 f3
•The only match involving s1 is
<s1,t1,p4,f3>
•To know s1 is in a match requires to advance the stream Tpara to p4. p2 and p3 should be read.
•p2 is in a match <s2,t2,p2,f1> to Q1 but p3 is not. However, since s2 and s3 have not been read, both p2 and p3 should be stored.
•Space complexity is no longer
O(|Longest Path in the XML tree|)
Tsection
Ttitle Tpara
Tfigure
11
XML Twig Pattern Matching TwigStack is not optimal for twig pattern with P-C edge
Remove elements p4 and f3 in document D1 to get D2 To answer Q1:Section[Title]/Paragraph//Figure on document D2
TwigStack will push s1 into stack Ssection because s1 is in a match to Section[//Title]//Paragraph//Figure (instead of Q1)
Consequently, a redundant intermediate path s1/t1 will be output.
D1: Matches to Q1 are highlighted D2: Matches to Q1 are highlighted
b1
pf1 c1 c2
p1 s1
s2
f1
p2
s3
f2
p3 f3
s4t1
t2
b1
pf1 c1 c2
p1 s1
s2
f1
p2
s3
f2
p3
t1
t2
12
XML Twig Pattern Matching Why is TwigStack not optimal for twig pattern with
P-C edge ? Match Order
s1 s2 s3
t1 t2 p1 p2 p3 p4
f1 f2 f3
•The only match involving s1 is
m1 = <s1,t1,p4,f3>
•The only match involving s2 is
m2 = <s2,t2,p2,f1>
•s1 is ahead of s2 in Tsection
but
p4 is behind p2 in stream Tpara
•m1 doesn’t have all its elements be ahead of the corresponding elements of m2 in their streams. Neither does m2.
Tsection
Ttitle Tpara
Tfigure
Q1:Section[Title]/Paragraph//Figure
13
XML Twig Pattern Matching Why is TwigStack optimal for twig pattern with only
A-D edge? Match Order
s1 s2 s3
t1 t2 p1 p2 p3 p4
f1 f2 f3
•s1 is in a match m1 = <s1,t1,p2,f1> to Q
•The only match involving s2 is
m2 = <s2,t2,p2,f1>
•Now m1 has all its elements be ahead of the corresponding elements of m2 in their streams.
•It is possible to know whether s1 is in a match to Q without skipping any matching elements of s2 to Q.
Tsection
Ttitle Tpara
Tfigure
Q:Section[//Title]//Paragraph//Figure
14
XML Twig Pattern Matching Tag Streaming can not achieve optimality. How about other streaming method? Solution: Divide elements of the same tag into more than one stream as the
following diagram shows E.g: Notice that now the match of s1 is not “blocked” by that of s2 any more. Question: How can the division be done?
s1
t1
p2 p3
f1 f2
Q1:Section[Title]/Paragraph//Figure
s2 s3 p4
t2 f3
s1 s2 s3
t1 t2 p1 p2 p3 p4
f1 f2 f3
Tsection
Ttitle Tpara
Tfigure
15
Our Contributions Prefix Path Streaming: a New XML streaming scheme for XML twig
pattern matching. PPS reduces I/O cost by pruning irrelevant streams. PPS allows optimal processing of a large class of twig patterns with
both A-D and P-C edges. Optimality achieved by outputting only intermediate paths (as Phase 1
output) which appears in final results for the class of twig patterns. The class of twig pattern includes:
A-D only twig pattern (with any number of branch nodes) ; and P-C only twig pattern (with any number of branch nodes); and 1-branch node twig pattern (which can have both A-D and P-C edges)
16
XML Twig Pattern Matching Our contributions
XML Twig Pattern
A-D edge only Twig Pattern
1-branchnode Twig Pattern
P-C edge only Twig Pattern
Optimal for Tag Streaming
Optimal for PP Streaming
17
Prefix Path Streaming The prefix-path of an element in a XML document tree is the
path from the document root to the element. A prefix-path stream (PP stream) contains all elements in the
document tree with the same prefix-path, ordered by their startPos numbers.
Example: For the sample XML document D1, there are 11 PP streams
compared with 7 streams generated by Tag Streaming. For example, instead of a single stream Tsection in Tag Streaming,
now we have two PP streams: Tbook/chapter/section and Tbook/chapter/section/section.
18
Prefix Path Streaming Notations
Twig Pattern Q: a twig pattern q : a node in twig pattern. Qq: the sub-twig of Q rooted at q We do not allow nodes with the same tag in a twig pattern; therefore we
will use node q and tag q interchangeably according to the context Element e is used to refer to a node in an XML document tree; the
capital letter E is used to denote the tag of e Element Stream
T is used to denote an element stream Tag Stream is denoted by Tq where q is the tag name of elements in Tq. PP Stream is denoted by Tprefix-path where prefix-path is the common prefix
path of elements in Tprefix-path A PP Stream is of class q if its elements are of tag q.
A match to a twig pattern is denoted by M
19
Prefix Path Streaming PP Streaming allows some degree of “parallel” access to sequential files
containing elements with the same tag. More formally, Definition 1: (Match Order)
Intuitively a match MA is ahead of another match MB if no element of MA is placed behind any element of MB in the same stream (either tag stream or PP stream)
Note that the definition of Match Order can be applied to both Tag Streaming and PP streaming
20
Prefix Path Streaming
Tag Streaming Neither M1 ≤ M2 nor M2 ≤ M1
because of (s1,s2) and (p2, p4)
Informally, we can think M1 and M2 are “tangled”.
PP Streaming M1 ≤ M2 and M2 ≤ M1 M1 and M2 are not
“tangled”
s1 s2 s3
t1 t2 p1 p2 p3 p4
f1 f2 f3
Tsection
Ttitle Tpara
Tfigure
Match OrderExample: Q1:Section[Title]/Paragraph//Figure
M1: <s1,t1,p4,f3> and M2: <s2,t2,p2,f1>
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/T TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/T
TB/C/S/S/P/F
TB/C/S/P/F
21
Properties of Prefix Path Streaming Definition (Element Order)
Example Under Tag Streaming for the XML document and Twig Pattern
Q1 in the previous slide t1 ≤ Q1 s2 ; but s2 ≤ Q1 t1 does not hold because t2 is placed behind
t1 in the stream TTitle
Neither s1 ≤ Q1 s2 nor s2 ≤ Q1 s1 Under PP Streaming
s1 ≤ Q1 s2 and s2 ≤ Q1 s1 Intuitively, a ≤Q b means it is possible to know a
is in match to QA without skipping any match involving b to QB.
22
Properties of Prefix Path Streaming Why can Tag Streaming provide optimal solution for A-D edge only twig
pattern?
PP Streaming has similar property for a larger class of twig pattern
Informally, referring back to slide 20, Lemma 1 says under Tag Streaming, there is no problem of “tangle” for A-D only twig pattern;
Lemma 2 says under PP Streaming, there is no problem of “tangle” for all three classes of twig patterns.
Lemma 2 makes it possible to devise optimal algorithms for a larger class of twig patterns on PP streams
23
Prefix Path Stream Pruning
Using the set of PP Stream labels, we can prune away many PP streams without accessing the data E.g for Q1:Section[Title]/Paragraph//Figure , the PP
stream in document D1 TBook/Preface/Paragraph can be pruned
Details please refer to our paper [3].
24
PP Stream Based Twig Pattern Matching PPS-TwigStack:
Similar to TwigStack, a stack Sq is coupled with a node q in the twig pattern.
Using stacks, exponential number of matches can be encoded in linear space. For more details, see [2].
E.g.SSection
STitle SParagraph
SFigure
Section
Title Paragraph
Figure
25
PP Stream Based Twig Pattern Matching
Overview Similar to TwigStack, our algorithm
operates in two phases (See slide 7).
Our goal is to reduce the number of intermediate paths output in Phase 1 Twig Join (See slide 7).
Flow Chart of the procedure of Phase 1
In each step getNext() finds a “potential” matching element e
Depending on the current stack states, e is then being pushed into its stack SE or being used only for updating stack matching states. (See TwigStack for details)
Phase 1 ends when all leaf streams (streams whose tag is leave node of Q)
Phase 2 “Merge” of TwigStack remains unchanged.
TA
TB
e := getNext()
If e should be pushed into its stack*
Push e to its stack and output a path if e is a leaf element
If all leaf streams end END
Advance the stream e belongs to
……
NO
NO
YES
YES
Pop elements from stack SE and Sparent(E) which are not ancestor of e
*: e is pushed into its stack SE if it has a parent or ancestor element in stack Sparent(E)
26
PP Stream Based Twig Pattern Matching
Which element will getNext() select in the remaining elements? 1. e of tag E is in a match to QE and e ≤Q e’ for each remaining element
e’; and2. e does not have a parent (ancestor)* element p of tag parent(E) such
that p is in a match to Qparent(E). AndAdditionally, to correctly maintain the stack states, we require3. e has the smallest startPos among elements which satisfy condition (1)
and (2) and have tag E or Sibling(E). Sibling(E) is a node which is sibling node of E in Q.
Why? To find e, no element appearing in matches will be skipped because of (1) and
thus we don’t need to store any elements. (2) ensures that an ancestor element in a match to Q will always be found before
its descendant element. (3) ensures that when an element is popped out of its stack, all of its matches
have been output.
*: depending on the edge type of <E,parent(E)>
27
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
1st call getNext() returns s1
•s1 is then pushed into Ssection
•Note that s1 is in a match to QSection which is Q1 itself.TB/C/S/T
TB/C/S/S/T
current stream cursor position
current stream end
28
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
2nd call getNext() returns t1
•t1 is then pushed into STitle
•Path s1/t1 outputted; since s1 is in a match to Q1, s1/t1 will be in the final match
t1
TB/C/S/T
TB/C/S/S/T
current stream cursor position
current stream end
29
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
3rd call getNext() returns s2
•s2 is then pushed into Ssection
s2TB/C/S/T
TB/C/S/S/T
current stream cursor position
current stream end
30
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
4th call getNext() returns t2
•t2 is then pushed into Stitle
•Path s2/t2 outputted
•Note that p4 is not selected because it satisfies all conditions but (3); the reason is that by selecting p4, s2 will be popped out prematurely.
s2
TB/C/S/T
TB/C/S/S/T
t2
current stream cursor position
current stream end
31
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
5th call getNext() returns p2
•p2 is pushed into Sparagraph
s2
TB/C/S/T
TB/C/S/S/T
p2
current stream cursor position
current stream end
32
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
6th call getNext() returns f1
•f1 is pushed into SFigure
•Path s2/p2/f1 outputted
s2
TB/C/S/T
TB/C/S/S/T
p2
f1
current stream cursor position
current stream end
33
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
7th call getNext() returns p3
•s3 is skipped because it is not in any match to Qsection
•p3 is selected but discarded because it doesn’t have a parent element in Ssection
•p2 and s2 are popped out because they are not ancestor of p3
s2
TB/C/S/T
TB/C/S/S/T
p2
current stream cursor position
current stream end
34
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
8th call getNext() returns f2
•f2 is then discarded because it does not have an ancestor element in Sparagraph
TB/C/S/T
TB/C/S/S/T
current stream cursor position
current stream end
35
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
9th call getNext() returns p4
•p4 is pushed into Sparagraph
TB/C/S/T
TB/C/S/S/T
p4
current stream cursor position
current stream end
36
PP Stream Based Twig Pattern Matching Example: PPS-TwigStack
– Q1:Section[Title]/Paragraph//Figure
s1
t1
p2 p3
f1 f2
s2 s3 p4
t2 f3
TB/C/S
TB/C/S/S TB/C/S/P
TB/C/S/S/P
TB/C/S/S/P/F
TB/C/S/P/F
SSection
STitle SParagraph
SFigure
s1
10th call getNext() returns f3
•f3 is pushed into SFigure
•Path s1/p4/f3 outputted
•Matching ends because all streams end.TB/C/S/T
TB/C/S/S/T
p4
f3current stream cursor position
current stream end
37
PP Stream Based Twig Pattern Matching
Just like TwigStack for A-D only twig patterns, for the three classes of twig pattern PPS-TwigStack Push into stacks only elements which appear in the final results No redundant paths are output as the intermediate results
getNext() in PPS-TwigStack selects element e of tag E such
that it is in a match to sub-twig QE getNext() is a recursive method as its counterpart in TwigStack It works by ensuring for each child node Ei of E in Q,
there exists an element ei such that ei is in a match of sub-twig QEi
e and ei satisfy the A-D or P-C relationship of edge <E,Ei> getNext() in PPS-TwigStack is more complex than that of
TwigStack because the descendant elements of an element e can scatter on several PP streams
For each child node Ei of E in Q, we consider all possible PP streams of class Ei to find the suitable child/descendant element ei of e.
38
PP Stream Based Twig Pattern Matching
Example: Call stacks of the 1st getNext() element s1 is returned The current cursor position of each stream is denoted in
parentheses. e.g. TBCS(s1)
<paragraph, {TB/C/S/S/P(p2),TB/C/S/P(p4)} >
<section, {TB/C/S(s1),TB/C/S/S(s2)}>
< figure, {TB/C/S/S/P/F(f1),TB/C/S/P/F(f3)} >
getNext(Paragraph)
getNext(Figure)
getNext(title)
<title, {TB/C/S/T(t1),TB/C/S/S/T(t2)}>
getNext(Section)
39
PP Stream Based Twig Pattern Matching
Analysis of PPS-TwigStack Perform a single forward scan of all useful PP streams. No redundant intermediate paths
The worst case I/O complexity is linear in the sum of sizes of the useful PP streams and the output list.
CPU complexity is O((|no. of PP streams after pruning|) * (|Input list|+|Output list|)),
which is also independent of the size of partial matches to any root-to-leaf path of the twig.
The space complexity is the maximum length of a root-to-leaf path.
If a twig pattern is not in any of the three classes afore-mentioned, we can simply run TwigStack by simulating a single stream Tq in Tag Streaming using multiple PP streams of class q.
40
Performance The impact of Prefix-Path streaming on XML document
storage. In most practical cases, the total number of PPS streams in XML
documents can be considered as linear to the number of different tags. Compared with the space used by Tag Streaming approach, PP
streaming requires little extra space. Fragmentation Factor = space used by PPS / space used by Tag Streaming
Data Set Size(MB)
No. of tags
No of PP streams
Max no. of PP streams per tag
Fragmentation factor
XMark xmark1 50 77 536 96 1.08
xmark2 130 77 548 99 1.06
xmark3 500 77 548 99 1.02
DBLP 100 27 63 6 1.01
41
Performance Effects of Prefix-Path stream pruning.
We measure the effect of PPS pruning on several twig queries in XMark benchmark .
Pruning can reduce the I/O cost significantly Query 1: site/people/person/name
Query 2: site/open_auctions/open_auction/bidder/increase/
Query 3: site/closed_auctions/closed_auction/price
Query 4: site/regions//item
Query 5: site/people/person[//age and //income]/name
Query No. of Pages before pruning
No of PPS streams before pruning
No. of Pages after Pruning
No. of PPS streams after pruning
Page PruningRatio
Q1 293 17 60 4 4.9
Q2 180 6 68 5 2.6
Q3 14 4 14 4 1
Q4 40 10 11 8 3.6
Q5 316 19 84 6 3.76
42
Performance Performance Comparison of PPS-TwigStack and
TwigStack The running times of PPS-TwigStack (with pruning) and the TwigStack on
the five queries on a XMark document of size 200M Bytes PPS-TwigStack out-performs TwigStack in all the queries except Q3
which PPS-pruning can’t find any irrelevant PPS stream.
0
2
4
6
8
Exe
cuti
on
T
ime(
Sec
on
d)
Q1 Q2 Q3 Q4 Q5
PPS TwigStack TwigStack
43
Performance Performance Comparison of PPS-TwigStack and TwigStack
Even there is no PP streams which can be pruned, PPS-TwigStack still outperforms TwigStack because it outputs fewer intermediate paths.
We synthesize XML documents which have a lot of partial matches to branches of twig pattern and there is no PP stream can be pruned.
We control and vary the ratio of portions of XML document that contain partial matches and that contain full matches.
Run PPS-TwigStack and TwigStack on a 300 MB documents and twig pattern A/B[C//D]//E. Notice PPS-TwigStack is optimal for the query.
020406080
100
5% 10% 15% 20% 25% 30%Portions of document having matches
Exe
cuti
on
T
ime(
Sec
on
d)
PPS TwigStack TwigStack
44
Conclusion
We propose a novel clustering method Prefix-Path Streaming (PPS) for XML twig pattern matching.
PP Streaming can reduce I/O cost by avoiding scanning of irrelevant PP streams.
PP Streaming optimal processing of a large class of twig patterns with both A-D and P-C edges.
The work is published in [3].
45
References:
1. M.P. Consens and T.Milo. Optimizing queries on files. In In Proceedings of ACM SIGMOD, 1994.
2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal xml pattern matching. In In Proceedings of ACM SIGMOD, 2002.
3. T. Chen, T. W. Ling, C.Y. Chan, Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching. In In Proceedings of DEXA, 2004.