a uniform and layered algebraic framework for xqueries on xml streams hong su jinhui jian elke a....
Post on 21-Dec-2015
213 views
TRANSCRIPT
A Uniform and Layered Algebraic Framework for XQueries on XML Streams
Hong Su
Jinhui Jian
Elke A. Rundensteiner
Worcester Polytechnic Institute
CIKM, Nov 5, 2003
2
Need for Stream Processing
New computing environment Data sources can be anywhere/anytime
On-line arriving data Data requests can be anywhere/anytime
Real-time response requirement New applications
Relational Sensor networks
XML Analysis of XML web logs Selective dissemination of XML information (e.g., news)
3
What’s Special for XML Stream Processing<Biditems>
<book year=“2001">
<title>Dream Catcher</title>
<author><last>King</last><first>S.</first></author>
<publisher>Bt Bound </publisher>
<price> 30 </initial>
</book>
…
<biditems> <book> <title> Dream Catcher </title> …
Token-by-Token access manner
timeline
Pattern retrieval + Filtering + Restructuring
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>
Token: not a direct counterpart of a tuple
30Bt BoundS.KingDream2001
pricepublisherfirstlasttitleyear
Pattern Retrieval on Token Streams
4
Two Computation Paradigms
Automata-based [yfilter02, xscan01, xsm02, xsq03, xpush03…]
Algebraic [niagara00, …]
This Raindrop framework intends to integrate both paradigms into one
5
Automata-Based Paradigm
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>
1book*
2
4title
price
Auxiliary structures for:
1. Buffering data
2. Filtering
3. Restructuring
…
//book
//book/title
//book/price3
6
Algebraic Computation
book bookbook
title author
last first
publisher price
Text
Text Text
Text Text
<book> …</book>
<title>… </title>
<book>… </book>
Navigate$b, /title -> $t
Navigate $b, /price->$p
Navigate $b, /title->
$t
Tagger
Select $p < 30
Logic Plan
Navigate //$b, /title->$t
Rewrite by “pushing down selection”
Navigate $b,/price->$p
Select $p < 30
Tagger
Rewritten Logic Plan
Navigate-Index $b, /price -> $p
Select $p < 30
Tagger
Navigate-Scan $b, /title -> $t
Physical Plan
Choose low-level implementation alternatives
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>
$b
$b $t
…
…
7
Observations
Either paradigm has deficiencies
Both paradigms complement each other
Automata Paradigm Algebra Paradigm
Good for pattern retrieval on tokens Does not support token inputs
Need patches for filtering and restructuring
Good for filtering and restructuring
Present all details on same low level Support multiple descriptive levels (declarative->procedural)
Little studied as query processing paradigm
Well studied as query process paradigm
8
How to Integrate Two Paradigms
9
How to Integrate Two Models? Design choices
Extend algebraic paradigm to support automata? Extend automata paradigm to support algebra? Come up with completely new paradigm?
Extend algebraic paradigm to support automata Practical
Reuse & extend existing algebraic query processing engines Natural
Present details of automata computation at low level Present semantics of automata computation (target patterns)
at high level
10
Raindrop: Four-Level Framework
Semantics-focused PlanSemantics-focused Plan
Stream Logic PlanStream Logic Plan
Stream Physical PlanStream Physical Plan
Stream Execution PlanStream Execution Plan
Abstraction Level
High (Declarative)
Low (Procedural)
11
Level I: Semantics-focused Plan [Rainbow-ZPR02] Express query semantics regardless of store
d or stream input sources Reuse existing techniques for stored XML pr
ocessing Query parser Initial plan constructor Rewriting optimization
Decorrelation Selection push down …
12
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>
<Biditems>
<book year=“2001">
<title>Dream Catcher</title>
<author><last>King</last><first>S.</first></author>
<publisher>Bt Bound </publisher>
<price> 30 </initial>
</book>
…
$S1<Biditems> … </Biditems>
$S1<Biditems>… </Biditems>
$b<book> … </book>
<Biditems> … </Biditems>
<book> … </book>
$S1<Biditems>… </Biditems>
$b<book>… </book>
$p
30
<Biditems> … </Biditems>
<book>… </book>
…
$S1<Biditems>…
</Biditems>
$b<book>… </book>
$p 30
$t Dream
Catcher
<Biditems>… </Biditems>
<book>. .. </book>
… …
NavUnnest $S1, //book ->$b
NavNest $b, /price/text() ->$p
NavNest $b, /title ->$t
Select$p<30
Tagger “Inexpensive”, $t->$r
Example Semantics-focused Plan
13
Level II: Stream Logical Plan
Extend semantics-focused plan to accommodate tokenized stream inputs New input data format:
contextualized tokens New operators:
StreamSource, Nav, ExtractUnnest, ExtractNest, StructuralJoin
New rewrite rules: Push-into-Automata
14
One Uniform Algebraic View
Token-based plan (automata plan)
Tuple-based plan
Tuple stream
XML data stream
Query answer
Algebraic Stream Logical Plan
15
Modeling the Automata in Algebraic Plan:Black Box[XScan01] vs. White Box
$b := //book$p := $b/price$t := $b/title
Black Box
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 20Return <Inexpensive> $t </Inexpensive>
XScan
StructuralJoin$b
ExtractNest $b, $p
ExtractNest $b, $t
White Box
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
16
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>
Tuple-based plan
Token-based plan (automata plan)
17
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>
StructuralJoin$b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
Tuple-based plan
18
Example Uniform Algebraic Plan
FOR $b in stream(biditems.xml) //bookLET $p := $b/price $t := $b/titleWHERE $p < 30Return <Inexpensive> $t </Inexpensive>
StructuralJoin$b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /price->$p
Navigate $b, /title->$t
Navigate $S1, //book ->$b
Select$p<30
Tagger “Inexpensive”, $t->$r
19
From Semantics-focused Plan to Stream Logical Plan
StructuralJoin$b
ExtractNest $b, $p
ExtractNest $b, $t
Nav $b, /price/text()->$p
Nav $b, /title->$t
Nav $S1, //book ->$b
Select$p<30
Tagger “Inexpensive”, $t->$r
NavUnnest $S1, //book ->$b
NavNest $b, /price/text() ->$p
NavNest $b, /title ->$t
Select$p<30
Tagger “Inexpensive”, $t->$r
Apply “push into automata”
Apply “push into automata”
20
Level III: Stream Physical Plan
For each stream logical operator, define how to generate outputs when given some inputs Multiple physical implementations may be provided fo
r a single logical operator Automata details of some physical implementation are
exposed at this level Nav, ExtractNest, ExtractUnnest, Structural Join
21
One Implementation of Extract/Structural Join
1book
title*
4price
3
2
ExtractNest$b, $t
ExtractNest/$b, $p
SJoin//book
<title>…</title> <price>…</price>
<biditems> <book> <title> Dream Catcher </title> … </book>…
<price>…</price><title>…</title>
Nav $b, /price->$p
Nav $b, /title->$t
Nav ., //book ->$b
22
Level IV: Stream Execution Plan
Describe coordination between operators regarding when to fetch the inputs When input operator generates one output tuple When input operator generates a batch When a time period has elapsed …
Potentially unstable data arrival rate in stream makes fixed scheduling strategy unsuitable Delayed data under scheduling may stall engine Bursty data not under scheduling may cause overflow
23
Raindrop: Four-Level Framework (Recap)
Semantics-focused PlanSemantics-focused Plan
Stream Logic PlanStream Logic Plan
Stream Physical PlanStream Physical Plan
Stream Execution PlanStream Execution Plan
Express the semantics of query regardless of
input sources
Accommodate tokenized
input streams
Describe how operators manipulate
given data
Decides the Coordination
among operators
24
Optimization Opportunities
25
Optimization Opportunities
Semantics-focused PlanSemantics-focused Plan
Stream Logic PlanStream Logic Plan
Stream Physical PlanStream Physical Plan
Stream Execution PlanStream Execution Plan
General rewriting (e.g., selection push down)
General rewriting (e.g., selection push down)
Break-linear-navigation rewriting
Break-linear-navigation rewriting
Physical implementations choosing
Physical implementations choosing
Execution strategy choosing
Execution strategy choosing
26
From Semantics-focused to Stream Logical Plan: In or Out?
Token-based plan (automata plan)
Tuple-based Plan
Tuple stream
XML data stream
Query answer
Pattern retrieval in Semantics-focused plan
Apply “push into automata”
27
Plan Alternatives
Nav $b, /price->$p
ExtractNest $b, $p
ExtractNest $b, $t
SJoin //book
Select price < 30
Tagger
Nav $b, /title->$t
Nav $S1, //book->$b
ExtractNest $S1, $b
Navigate /price
Select price<30
Navigate book/title
Tagger
Nav $S1, //book->$b
NavUnnest $S1, //book ->$b
NavNest $b, /price ->$p
NavNest $b, /title ->$t
Select$p<30
Tagger “Inexpensive”, $t->$r
Out In
28
Experimentation Results
Execution Time on 85M XML Stream Under Various Selectivity
25000
30000
35000
40000
45000
50000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Selectivity of Selection
Exe
cutio
n Ti
me
(ms)
1 Nav
2 Navs
3 Navs
4 Navs
5 Navs
29
Contributions
Combined automata and algebra based paradigms into one uniform algebraic paradigm
Provided four layers in algebraic paradigm Query semantics expressed at high layer Automata computation on streams hidden at low layer
Supported optimization at an iterative manner (from high abstraction level to low abstraction level)
Illustrated enriched optimization opportunities by experiments
30
Email: [email protected]