A Unified Model for XQuery Evaluation over XML Data Streams
Jinhui Jian
Hong Su
Elke A. Rundensteiner
Worcester Polytechnic Institute
ER 2003
Need for Stream Processing
New environment
  Data sources are everywhere
  Data requests are everywhere
New applications
  Sensor networks
  Analysis of XML web logs
  Selective dissemination of XML information (e.g., news)
Specific Challenges for XML Streams
<biditems>
<book year="2001">
<title>Dream Catcher</title>
<author><last>King</last><first>S.</first></author>
<publisher>Bt Bound </publisher>
<price> 20 </price>
</book>
…
Token-by-Token access manner
Timeline of arriving tokens: <biditems> <book> <title> Dream Catcher </title> …
Token: not a direct counterpart of a tuple
Pattern retrieval + Filtering/Restructuring
FOR $b in doc("biditems.xml")//book
LET $p := $b/price/text(),
    $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>{ $t }</Inexpensive>
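To make the token-by-token access manner concrete, here is a minimal Python sketch that splits an XML string into start-tag, text, and end-tag tokens in document order. The `(kind, value)` token shape and the `tokenize` helper are assumptions made for illustration, not part of the original system. Attributes and self-closing tags are ignored for brevity.

```python
import re

# Hypothetical token shape: (kind, value) with kind in {"start", "end", "text"}.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|([^<]+)")

def tokenize(xml_text):
    """Yield tokens one at a time, in document order, without building a tree."""
    for m in TOKEN_RE.finditer(xml_text):
        closing, name, text = m.groups()
        if name is not None:
            # A tag: "</x>" is an end token, "<x ...>" is a start token.
            yield ("end" if closing else "start", name)
        elif text.strip():
            # Character data between tags; whitespace-only runs are skipped.
            yield ("text", text.strip())

stream = ("<biditems><book><title>Dream Catcher</title>"
          "<price>20</price></book></biditems>")
tokens = list(tokenize(stream))
for tok in tokens:
    print(tok)
```

A single token such as `("start", "book")` carries no complete book; a tuple for the query only exists once all tokens up to the matching end tag have been seen.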
Two Computation Paradigms
Automata-based [yfilter02, x-scan01, xsm02, xsq03, xpush03…]
Algebraic [niagara00, …]
This project intends to integrate both paradigms into one
Automata Paradigm:
FOR $b in stream("biditems.xml")//book
LET $p := $b/price/text(),
    $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>{ $t }</Inexpensive>
[Figure: automaton with states 1 to 5 and transitions labeled *, book, title, price, and text()]
Auxiliary structures for:
1. Buffering data
2. Evaluating predicates
3. Restructuring buffered data
…
//book
//book/title
//book/price/text()
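The pattern-retrieval part of the automata paradigm can be sketched as follows. This is a simplified illustration, not the authors' implementation: the transition table is a guess at the automaton on the slide (state 1 with a `*` self-loop, then book, price, title), and a stack of active state sets is used so that an end tag restores the states that were active before the matching start tag.

```python
# Assumed reading of the slide's automaton: (state, label) -> next state.
TRANSITIONS = {
    (1, "*"): 1,       # self-loop: "//" may skip any number of elements
    (1, "book"): 2,
    (2, "price"): 3,
    (2, "title"): 4,
}
ACCEPTING = {2: "//book", 4: "//book/title"}
TEXT_STATE = 3  # text read in state 3 matches //book/price/text()

def run(tokens):
    """Return (pattern, value) pairs in the order the automaton recognizes them."""
    stack = [{1}]          # stack of active state sets, one entry per open element
    matches = []
    for kind, value in tokens:
        if kind == "start":
            nxt = set()
            for state in stack[-1]:
                for label in (value, "*"):
                    if (state, label) in TRANSITIONS:
                        nxt.add(TRANSITIONS[(state, label)])
            stack.append(nxt)
            matches += [(ACCEPTING[s], value) for s in sorted(nxt) if s in ACCEPTING]
        elif kind == "end":
            stack.pop()    # backtrack to the states active before this element
        elif kind == "text" and TEXT_STATE in stack[-1]:
            matches.append(("//book/price/text()", value))
    return matches

tokens = [
    ("start", "biditems"), ("start", "book"),
    ("start", "title"), ("text", "Dream Catcher"), ("end", "title"),
    ("start", "price"), ("text", "20"), ("end", "price"),
    ("end", "book"), ("end", "biditems"),
]
matches = run(tokens)
print(matches)
```

The sketch only recognizes patterns. Buffering, predicate evaluation, and restructuring would need the auxiliary structures listed above.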
Algebraic Computation
[Figure: XML tree with three book nodes; a book has the children title, author (with last and first), publisher, and price, each ending in a Text node]
Plan before push-down (in data-flow order): Navigate //book, price; Navigate //book, title; Select price < 30; Tagger
Plan after push-down (in data-flow order): Navigate //book, price; Select price < 30; Navigate //book, title; Tagger
Selection push-down enabled
FOR $b in doc("biditems.xml")//book
LET $p := $b/price/text(),
    $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>{ $t }</Inexpensive>
[Figure: <book year="2001"> … </book> and <book> … </book> elements enter Navigate //book, /title, which outputs <title> … </title> elements]
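The effect of selection push-down can be illustrated with a small tuple-based sketch in Python. The operator functions `navigate`, `select`, and `tagger` are hypothetical stand-ins for the operators named on the slide, and the second book ("Costly Tome", price 45) is invented sample data. Both plan orders give the same answer, but the pushed-down plan navigates to the title only for books that pass the price filter.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<biditems>"
    "<book><title>Dream Catcher</title><price>20</price></book>"
    "<book><title>Costly Tome</title><price>45</price></book>"  # invented sample
    "</biditems>"
)

def books():
    """Source of tuples: one tuple per book element, bound to column 'b'."""
    return ({"b": b} for b in DOC.iter("book"))

def navigate(tuples, src, path, dst):
    """For each tuple, bind dst to each element reached from column src via path."""
    for t in tuples:
        for node in t[src].findall(path):
            yield {**t, dst: node}

def select(tuples, pred):
    """Keep only the tuples that satisfy pred."""
    return (t for t in tuples if pred(t))

def tagger(tuples, tag, col):
    """Wrap the element in column col in a new tag."""
    for t in tuples:
        yield f"<{tag}>{ET.tostring(t[col], encoding='unicode')}</{tag}>"

def cheap(t):
    return float(t["p"].text) < 30

# Plan A: navigate to price and title first, then select.
priced = navigate(books(), "b", "price", "p")
plan_a = list(tagger(select(navigate(priced, "b", "title", "t"), cheap),
                     "Inexpensive", "t"))

# Plan B: selection pushed down below the title navigation.
priced = navigate(books(), "b", "price", "p")
plan_b = list(tagger(navigate(select(priced, cheap), "b", "title", "t"),
                     "Inexpensive", "t"))

print(plan_a)
print(plan_b)
```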
Observations
Automata paradigm
  Good and long studied for pattern retrieval on tokens
  Patches needed for complex filtering and restructuring
Algebraic paradigm
  Good and long studied for expressing and optimizing query plans on sets of tuples
  Tokenized inputs not accommodated yet
Each paradigm has deficiencies
The two paradigms complement each other
Research Challenges
  How to integrate the two models?
  How to optimize a query within the integrated query model?
Raindrop Approach: Uniform Modeling in an Algebraic Framework
[Figure: two views of the uniform algebraic plan. Outer view: an XML data stream enters the algebraic plan, which returns the query answer. Inner view: a token-based plan (automata plan) consumes the XML data stream and passes a tuple stream to a tuple-based plan, which returns the query answer.]
Modeling the Automata in Algebraic Plan: Black Box [xscan] vs. White Box
Black box: a single Xscan operator with the bindings $b := //book, $p := $b/price, $t := $b/title
White box: SJoin //book over Extract //book/price and Extract //book/title
FOR $b in stream("biditems.xml")//book
LET $p := $b/price/text(),
    $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>{ $t }</Inexpensive>
A Unified Process at the Logical View
FOR $b in doc("biditems.xml")//book
LET $p := $b/price/text(),
    $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>{ $t }</Inexpensive>
Token-based plan (automata plan)
Tuple-based plan
Tuple-based plan
SJoin //book
Extract $p, //book/price
Extract $t, //book/title
SJoin //book
Extract //book/price
Extract //book/title
Select //book/price > 50
Navigate //book, //book/title
The Algebra Core

Op                    Symbol subscript  Semantics
Selection             pred              Filter tuples based on the predicate pred
Projection            v                 Filter columns in the input tuples based on the variable list v
Join                  pred              Join input tuples based on the predicate pred
Aggregate             f                 Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger (T)            pt                Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate              p1, p2            Take input elements of path p1 and output ancestor elements of path p2
Extract               p                 Identify elements of path p from the input stream
Structural Join (SJ)  p                 Join input tuples on their structural relationship, e.g., the common parent relationship p

Operator groups on the slide: Relational-like, XML-Specific
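Extract Operator
[Figure: automaton with transitions labeled *, book, and title, attached to the operator Extract //book/title]
Input token stream: <bib> <book> <title> Dream Catcher </title> … </book> …
Output: <title> Dream Catcher </title>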
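A minimal sketch of what an Extract operator might do, assuming tokens are `(kind, value)` pairs: it watches the stack of open element names, starts buffering when the path ends in book/title, and emits the buffered tokens as one unit when the matching end tag arrives. The `extract` function and its parameters are illustrative names, not the system's API.

```python
def extract(tokens, parent, child):
    """Emit the token list of every <child> directly under a <parent>,
    i.e. the pattern //parent/child."""
    stack = []      # names of currently open elements
    buf = None      # tokens of the element being collected, if any
    depth = None    # stack depth at which the collected element was opened
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
            if buf is None and stack[-2:] == [parent, child]:
                buf, depth = [], len(stack)
        if buf is not None:
            buf.append((kind, value))
        if kind == "end":
            if buf is not None and len(stack) == depth:
                yield buf            # the element is complete: hand it downstream
                buf = None
            stack.pop()

tokens = [
    ("start", "bib"), ("start", "book"),
    ("start", "title"), ("text", "Dream Catcher"), ("end", "title"),
    ("end", "book"), ("end", "bib"),
]
titles = list(extract(tokens, "book", "title"))
print(titles)
```

The operator turns a run of tokens into a single composed element, which is the point where token-based input becomes usable by tuple-based operators.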
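Structural Join Operator
[Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price, feeding Extract //book/title and Extract //book/price, whose outputs are combined by SJoin //book]
Input token stream: <biditems> <book> <title> Dream Catcher </title> … </book> …
Extracted <title>…</title> and <price>…</price> elements are paired by SJoin //book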
FOR $b in doc("biditems.xml")//book
LET $p := $b/price/text(),
    $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>{ $t }</Inexpensive>
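The structural join on the common parent //book can be sketched as below. For brevity this hypothetical `sjoin_book` folds the two Extract operators into the same loop and keeps only text values rather than whole elements. It also assumes books are not nested, and the second book is invented sample data. The join fires when a book's end tag arrives, pairing the titles and prices collected inside that book.

```python
from itertools import product

def sjoin_book(tokens):
    """On each </book>, join the titles and prices collected inside that book."""
    stack, titles, prices, out = [], [], [], []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "text" and stack[-2:] == ["book", "title"]:
            titles.append(value)         # stands in for Extract //book/title
        elif kind == "text" and stack[-2:] == ["book", "price"]:
            prices.append(value)         # stands in for Extract //book/price
        elif kind == "end":
            stack.pop()
            if value == "book":
                # Everything buffered so far shares this book as parent.
                out.extend(product(titles, prices))
                titles, prices = [], []
    return out

tokens = [
    ("start", "biditems"),
    ("start", "book"),
    ("start", "title"), ("text", "Dream Catcher"), ("end", "title"),
    ("start", "price"), ("text", "20"), ("end", "price"),
    ("end", "book"),
    ("start", "book"),  # invented second book
    ("start", "title"), ("text", "Costly Tome"), ("end", "title"),
    ("start", "price"), ("text", "45"), ("end", "price"),
    ("end", "book"),
    ("end", "biditems"),
]
pairs = sjoin_book(tokens)
print(pairs)
```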
Optimization via Query Rewriting
In or Out?
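[Figure: the XML data stream enters the token-based plan (automata plan), which passes a tuple stream to the tuple-based plan, which returns the query answer; the label "Pattern retrieval" marks the computation that can sit in or out of the automata]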
Plan Alternatives
The pull-out plan (in data-flow order): Extract //book; Navigate /price; Select price < 30; Navigate book/title; Tagger
The push-in plan (in data-flow order): Extract //book/price and Extract //book/title; SJoin //book; Select price < 30; Tagger
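The two alternatives can be compared with a small sketch that uses Python's standard `xml.etree.ElementTree.iterparse` as a stand-in for the stream reader. The functions `pull_out` and `push_in` are hypothetical names, the Tagger step is omitted, and the second book is invented sample data. In the pull-out version only whole books are retrieved from the stream, and price and title are navigated afterwards. In the push-in version title and price are picked up during the scan and joined when the book ends.

```python
import io
import xml.etree.ElementTree as ET

XML = (b"<biditems>"
       b"<book><title>Dream Catcher</title><price>20</price></book>"
       b"<book><title>Costly Tome</title><price>45</price></book>"  # invented
       b"</biditems>")

def pull_out(xml_bytes):
    """Stream side does only Extract //book; the rest runs on each book element."""
    out = []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "book":
            price = elem.find("price")               # Navigate /price
            if float(price.text) < 30:               # Select price < 30
                out.append(elem.find("title").text)  # Navigate book/title
    return out

def push_in(xml_bytes):
    """Stream side extracts //book/title and //book/price; SJoin fires at </book>."""
    out, titles, prices, path = [], [], [], []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            path.append(elem.tag)
            continue
        if path[-2:] == ["book", "title"]:
            titles.append(elem.text)                 # Extract //book/title
        elif path[-2:] == ["book", "price"]:
            prices.append(elem.text)                 # Extract //book/price
        elif elem.tag == "book":
            # SJoin //book followed by Select price < 30
            out += [t for t in titles for p in prices if float(p) < 30]
            titles, prices = [], []
        path.pop()
    return out

print(pull_out(XML))
print(push_in(XML))
```

Both functions return the same titles. Which one is cheaper depends on factors such as selectivity and the number of patterns, which is what the experiments vary.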
Pattern Retrieval Alternatives
[Figure: streams of <title>…</title> and <price>…</price> elements]
In Automata (/title, /price)
[Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price, shown with the sample <book year="2001"> element and streams of <book>, <title>, and <price> elements]
Out of Automata (/title, /price)
[Figure: automaton with states 1 and 2 and transitions labeled * and book, shown with the labels t2, t10, and SJ]
Experiment:
[Figure: two plots, one for selectivity = 5% and one for selectivity = 90%]
Related Work
Camp 1: Complete Automata Model [XSQ, XSM, XPush]
for $x in $R/a return
  for $y in $x/b return
    <res>{ $y, $x }</res>
[Figure: the automaton for this query, with states labeled by triples such as 0,0,0, 1,0,0, and 2,1,0, and transitions labeled with low-level pointer operations such as *r=<b>|w(x,<b>),r++]
All details are presented on the same level (and low level!)
  Hard to understand
  Not suitable for optimizing at different levels
Little has been studied on using automata as a query processing paradigm
Camp 2: Automata-Algebra Loosely Coupled Model [Tukwila, YFilter]
Fixed interface for automata computation (all pattern retrieval pushed down)
  No opportunity of pushing/pulling computation into/from automata
Bloated, black box operator
  Algebraic rewriting impossible for internal optimization
[Figure: a black-box Automata Plan operator with the bindings $b := //book, $p := //book/price, $t := //book/title and the outputs $b, $p, $t]
Contributions
Combining automata and algebra leads to a powerful query processing model
Modeling:
  Uniform, simple logical view – better understandability
Optimization:
  Uniform rewriting – more optimization opportunities (e.g., push-in/pull-out)
  Optimization necessity is verified by experiments
Experiment 2
[Figure: two plots, one for number of patterns = 2 and one for number of patterns = 20]