
A Unified Model for XQuery Evaluation over XML Data Streams

Jinhui Jian

Hong Su

Elke A. Rundensteiner

Worcester Polytechnic Institute

ER 2003

Need for Stream Processing

New environment
Data sources are everywhere
Data requests are everywhere

New applications
Sensor networks
Analysis of XML web logs
Selective dissemination of XML information (e.g., news)

Specific Challenges for XML Streams

<biditems>

<book year="2001">

<title>Dream Catcher</title>

<author><last>King</last><first>S.</first></author>

<publisher>Bt Bound </publisher>

<price> 20 </price>

</book>

…

Token-by-Token access manner

timeline

<biditems> <book> <title> Dream Catcher </title> …

Token: not a direct counterpart of a tuple
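As a rough illustration of token-by-token access, the sketch below turns chunks of XML text into a flat sequence of start, text, and end tokens using Python's standard `xml.parsers.expat` module. The function name `tokenize` and the `(kind, value)` token shape are assumptions made for this example, not part of the presented system; attributes are dropped for brevity.

```python
import xml.parsers.expat

def tokenize(chunks):
    """Yield (kind, value) tokens as chunks of an XML stream arrive.

    Hypothetical helper for illustration: kind is "start", "text" or "end".
    """
    pending = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # ask expat to merge adjacent text callbacks
    parser.StartElementHandler = lambda name, attrs: pending.append(("start", name))
    parser.EndElementHandler = lambda name: pending.append(("end", name))

    def on_text(data):
        # Skip whitespace-only text between elements.
        if data.strip():
            pending.append(("text", data.strip()))

    parser.CharacterDataHandler = on_text

    for chunk in chunks:
        parser.Parse(chunk, False)  # not final: more input may follow
        yield from pending
        pending.clear()
    parser.Parse("", True)  # signal end of stream
    yield from pending

stream = ["<biditems><book>", "<title>Dream Catcher</title>", "</book></biditems>"]
for token in tokenize(stream):
    print(token)
```

A single book element is spread over many such tokens, which is why a token is not a direct counterpart of a tuple.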

Pattern retrieval + Filtering/Restructuring

FOR $b in doc(biditems.xml)//book
LET $p := $b/price/text(), $t := $b/title
WHERE $p < 30
RETURN <Inexpensive> $t </Inexpensive>

Two Computation Paradigms

Automata-based [yfilter02, x-scan01, xsm02, xsq03, xpush03…]

Algebraic [niagara00, …]

This project intends to integrate both paradigms into one

Automata Paradigm:

FOR $b in stream(biditems.xml)//book
LET $p := $b/price/text(), $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>$t</Inexpensive>

[Figure: automaton with states 1 to 5 and transitions labeled *, book, title, price, and text()]

Auxiliary structures for:

1. Buffering data

2. Evaluating predicates

3. Restructuring buffered data

…

//book

//book/title

//book/price/text()
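The three patterns above can be recognized by one small automaton driven by tokens. The sketch below is a minimal illustration, assuming tokens of the form `(kind, value)` and a state numbering read off the slide (1 to 2 on book, 2 to 4 on title, 2 to 3 on price, text under price as the final step). The names `TRANSITIONS`, `FINAL`, `TEXT_FINAL`, and `run` are hypothetical, and the stack of state sets is a common way to handle the `//` axis, not necessarily the one used in the presented system.

```python
# Transitions on element names; state 1 also has a '*' self-loop for '//'.
TRANSITIONS = {(1, "book"): 2, (2, "title"): 4, (2, "price"): 3}
FINAL = {2: "//book", 4: "//book/title"}   # patterns matched on a start tag
TEXT_FINAL = {3: "//book/price/text()"}    # pattern matched on a text token

def run(tokens):
    """Return (pattern, value) pairs in the order the patterns fire."""
    stack = [{1}]  # one set of active states per open element
    matches = []
    for kind, value in tokens:
        if kind == "start":
            nxt = set()
            for state in stack[-1]:
                if state == 1:
                    nxt.add(1)  # '*' self-loop keeps '//' alive at any depth
                target = TRANSITIONS.get((state, value))
                if target is not None:
                    nxt.add(target)
            stack.append(nxt)
            for state in sorted(nxt):
                if state in FINAL:
                    matches.append((FINAL[state], value))
        elif kind == "text":
            for state in sorted(stack[-1]):
                if state in TEXT_FINAL:
                    matches.append((TEXT_FINAL[state], value))
        elif kind == "end":
            stack.pop()  # backtrack to the states of the parent element
    return matches

tokens = [("start", "biditems"), ("start", "book"), ("start", "title"),
          ("text", "Dream Catcher"), ("end", "title"), ("start", "price"),
          ("text", "20"), ("end", "price"), ("end", "book"), ("end", "biditems")]
print(run(tokens))
```

The automaton only reports where patterns occur; buffering, predicate evaluation, and restructuring still need the auxiliary structures listed above.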

Algebraic Computation

[Figure: input book elements (title, author with last and first, publisher, price, each with text) processed by two algebraic plans built from the operators Navigate //book, price; Navigate //book, title; Select price < 30; and Tagger]

Selection push-down enabled

FOR $b in doc(biditems.xml)//book
LET $p := $b/price/text(), $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>$t</Inexpensive>

<book year="2001"> … </book>

<book> … … </book>

<title> … </title>

Navigate //book, /title
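To make the tuple-based view concrete, the sketch below models Navigate, Select, and Tagger as Python generators over tuples (dictionaries of variable bindings) and runs the example query with and without selection push-down. All function names are hypothetical, the second book in `DOC` is made-up sample data, and the whole document is parsed up front, so this illustrates the algebra only, not stream processing.

```python
import xml.etree.ElementTree as ET

# Sample input; the second book is invented for this illustration.
DOC = """<biditems>
  <book year="2001"><title>Dream Catcher</title><price> 20 </price></book>
  <book year="2002"><title>Expensive Book</title><price> 45 </price></book>
</biditems>"""

def navigate(tuples, src, path, dst):
    """For each tuple, bind dst to every element reached from src via path."""
    for t in tuples:
        for node in t[src].findall(path):
            yield {**t, dst: node}

def select(tuples, pred):
    """Keep only the tuples that satisfy pred."""
    return (t for t in tuples if pred(t))

def tagger(tuples, tag, var):
    """Wrap the element bound to var in a new <tag> element."""
    for t in tuples:
        inner = ET.tostring(t[var], encoding="unicode").strip()
        yield f"<{tag}>{inner}</{tag}>"

def books():
    root = ET.fromstring(DOC)
    return ({"b": b} for b in root.iter("book"))

def cheap(t):
    return float(t["p"].text) < 30

def plan_without_pushdown():
    t = navigate(books(), "b", "price", "p")
    t = navigate(t, "b", "title", "t")   # titles fetched for every book
    t = select(t, cheap)
    return list(tagger(t, "Inexpensive", "t"))

def plan_with_pushdown():
    t = navigate(books(), "b", "price", "p")
    t = select(t, cheap)                 # filter before fetching titles
    t = navigate(t, "b", "title", "t")
    return list(tagger(t, "Inexpensive", "t"))

print(plan_without_pushdown())
print(plan_with_pushdown())
```

Both plans return the same answer; the second one simply avoids navigating to titles of books that the selection would discard anyway.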

Observations

Automata paradigm
Good and long studied for pattern retrieval on tokens
Patches needed for complex filtering and restructuring

Algebraic paradigm
Good and long studied for expressing and optimizing query plans on sets of tuples
Tokenized inputs not accommodated yet

Each paradigm has deficiencies

Both paradigms complement each other

Research Challenges

How to integrate the two models?

How to optimize a query within the integrated query model?

Raindrop Approach: Uniform Modeling in an Algebraic Framework

[Figure: an XML data stream enters the uniform algebraic plan and a query answer leaves it. Inside the uniform algebraic plan, a token-based plan (automata plan) passes a tuple stream to a tuple-based plan.]

Modeling the Automata in Algebraic Plan: Black Box [xscan] vs. White Box

Black Box: a single Xscan operator with bindings $b := //book, $p := $b/price, $t := $b/title

White Box: SJoin //book with Extract //book/price and Extract //book/title

FOR $b in stream(biditems.xml)//book
LET $p := $b/price/text(), $t := $b/title
WHERE $p < 30
RETURN <Inexpensive>$t</Inexpensive>

A Unified Process at the Logical View



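[Figure: two views of the unified plan. In the first, a tuple-based plan sits on SJoin //book with Extract $p, //book/price and Extract $t, //book/title. In the second, the tuple-based plan is shown with Select //book/price > 50 and Navigate //book, //book/title, together with SJoin //book, Extract //book/price, and Extract //book/title.]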

The Algebra Core

Operator (symbol or parameters): Semantics

Selection (parameter pred): Filter tuples based on the predicate pred
Projection (parameter v): Filter columns in the input tuples based on the variable list v
Join (parameter pred): Join input tuples based on the predicate pred
Aggregate (parameter f): Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger (symbol T, parameter pt): Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate (parameters p1, p2): Take input elements of path p1 and output descendant elements of path p2
Extract (parameter p): Identify elements of path p from the input stream
Structural Join (symbol SJ, parameter p): Join input tuples on their structural relationship, e.g., the common parent relationship p

The operators fall into two groups: relational-like and XML-specific.

Extract Operator

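[Figure: automaton (states 1 and 2, transitions *, book, title) with the operator Extract //book/title attached]

Input tokens: <bib> <book> <title> Dream Catcher </title> … </book> …

Output of Extract //book/title: <title> Dream Catcher </title>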
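A minimal sketch of what an Extract operator does for a pattern of the form //parent/child: it watches the token stream, starts buffering when the pattern matches, and emits the assembled element when the matching end tag arrives. The function name `extract`, the `(kind, value)` token shape, and the use of a path stack in place of explicit automaton states are assumptions made for this illustration. Nested matches inside an already buffered element are not reported separately.

```python
def extract(tokens, parent, child):
    """Yield serialized <child> elements whose parent element is <parent>."""
    path = []          # names of the currently open elements
    buf = None         # tokens of the element being assembled, if any
    match_depth = 0    # depth at which the current match started
    for kind, value in tokens:
        if kind == "start":
            path.append(value)
            if buf is None and path[-2:] == [parent, child]:
                buf = []                 # pattern matched: start buffering
                match_depth = len(path)
            if buf is not None:
                buf.append(f"<{value}>")
        elif kind == "text":
            if buf is not None:
                buf.append(value)
        elif kind == "end":
            if buf is not None:
                buf.append(f"</{value}>")
                if len(path) == match_depth:
                    yield "".join(buf)   # element complete: emit it
                    buf = None
            path.pop()

tokens = [("start", "bib"), ("start", "book"), ("start", "title"),
          ("text", "Dream Catcher"), ("end", "title"),
          ("end", "book"), ("end", "bib")]
print(list(extract(tokens, "book", "title")))
```

The operator consumes tokens but produces whole elements, which is where the token-based part of a plan hands over to the tuple-based part.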

Structural Join Operator

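[Figure: automaton (states 1 to 4, transitions *, book, title, price) with Extract //book/title and Extract //book/price feeding SJoin //book]

Input tokens: <biditems> <book> <title> Dream Catcher </title> … </book> …

Extract outputs: <title>…</title> and <price>…</price>

SJoin //book output: <price>…</price><title>…</title>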
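One way to picture the structural join is as a containment test on token positions: a title and a price are paired when both lie inside the same book. The sketch below assumes each extracted element carries the positions of its start and end tokens; that encoding, the function name `structural_join`, and the sample positions are assumptions for illustration, not details taken from the slides. Containment also accepts deeper descendants, so it is looser than a strict common-parent check.

```python
from itertools import product

def structural_join(books, titles, prices):
    """Pair titles and prices that fall inside the same book.

    books: list of (start, end) token positions
    titles, prices: lists of (start, end, xml) triples
    """
    for b_start, b_end in books:
        def inside(elem):
            return b_start < elem[0] and elem[1] < b_end
        ts = [e[2] for e in titles if inside(e)]
        ps = [e[2] for e in prices if inside(e)]
        for title, price in product(ts, ps):
            yield {"title": title, "price": price}

# Hypothetical token positions for two books in one stream.
books = [(1, 12), (13, 20)]
titles = [(2, 4, "<title>Dream Catcher</title>"), (14, 16, "<title>Other</title>")]
prices = [(9, 11, "<price>20</price>"), (17, 19, "<price>45</price>")]

for row in structural_join(books, titles, prices):
    print(row)
```

Each output tuple combines the outputs of the two Extract operators, so a downstream Select or Tagger can work on ordinary tuples.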

Optimization via Query Rewriting

In or Out?


[Figure: XML data stream, pattern retrieval, tuple stream, tuple-based plan, query answer]

Plan Alternatives

The pull-out plan
Extract //book
Navigate /price
Select price < 30
Navigate book/title
Tagger

The push-in plan
Extract //book/price
Extract //book/title
SJoin //book
Select price < 30
Tagger

Pattern Retrieval Alternatives

Example input:

<book year="2001"> <title>Dream Catcher</title> <author> <last> King </last> <first> S. </first> </author> <publisher> Bt Bound </publisher> <price> 20 </price> </book>

In Automata (/title, /price)

[Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price; <title>…</title> and <price>…</price> elements are retrieved by the automaton]

Out of Automata (/title, /price)

[Figure: automaton with states 1 and 2 and transitions labeled * and book; <title>…</title> and <price>…</price> elements are retrieved outside the automaton; labels t2, t10, and SJ]

Experiment:

[Figure: experiment results in two panels, Selectivity = 5% and Selectivity = 90%]

Related Work

Camp 1: Complete Automata Model [XSQ, XSM, XPush]

All details are presented on the same level (and a low level!)
Hard to understand
Not suitable for optimizing at different levels
Little has been studied on using automata as a query processing paradigm

Camp 2: Automata-Algebra Loosely Coupled Model [Tukwila, YFilter]

Fixed interface for automata computation (all pattern retrieval pushed down)
No opportunity of pushing/pulling computation into/from automata
Bloated, black box operator
Algebraic rewriting impossible for internal optimization

[Figure: an Automata Plan operator with bindings $b := //book, $p := //book/price, $t := //book/title and outputs $b, $p, $t]

Contributions

Combining automata and algebra leads to a powerful query processing model

Modeling: uniform, simple logical view (better understandability)

Optimization: uniform rewriting (more optimization opportunities, e.g., push-in/pull-out)

Optimization necessity is verified by experiments

Email: [email protected]

Experiment 2

[Figure: experiment results in two panels, Number of patterns = 2 and Number of patterns = 20]
