
Buffering in Query Evaluation over XML Streams

Ziv Bar-Yossef
Technion

Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center

2

XML Document

1: <department>
2: <name>
3: Software Testing
4: </name>
5: <employee id="1">
6: <name>
7: Alice
8: </name>
9: <position>
10: engineer
11: </position>
12: </employee>
13: <employee id="2">
14: <name>
15: Bob
16: </name>
17: <position>
18: engineer
19: </position>
20: </employee>
21: <employee id="3">
22: <name>
23: Carole
24: </name>
25: <position>
26: assistant
27: </position>
28: </employee>
29: <manager id="4">
30: <name>
31: John
32: </name>
33: </manager>
34: </department>

3

XML Document Tree

[Figure: tree of the XML document. The root has one child, department, whose children are name ("Software Testing"), three employee nodes (each with @id, name, and position: 1/Alice/engineer, 2/Bob/engineer, 3/Carole/assistant), and one manager node (@id 4, name John).]

4

XPath Queries

/department[manager/name = "John"]/employee[position = "engineer"]/name

[Figure: the XML document tree, annotated with the query.]

5

XPath Queries

/department[employee/name = manager/name]/name

[Figure: the XML document tree, annotated with the query.]

6

XPath

- XPath 2.0
- Forward axes only
- Eval(Q,D): nodes in D that match Q

Two modes of XPath evaluation:
- Full-fledged evaluation: given Q and D, output Eval(Q,D)
- Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
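The difference between the two modes can be made concrete with a small non-streaming sketch in Python. It hard-codes the query /department[manager/name = "John"]/employee[position = "engineer"]/name against the sample department document; the helper names `eval_query`, `filter_query`, and `text_of` are illustrative assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET

# The sample department document, written compactly.
DOC = (
    "<department><name>Software Testing</name>"
    '<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    '<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    '<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    '<manager id="4"><name>John</name></manager>'
    "</department>"
)


def text_of(node):
    """Text content of a leaf node, with surrounding whitespace removed."""
    return (node.text or "").strip()


def eval_query(root):
    """Full-fledged evaluation: return the list of nodes in Eval(Q, D)."""
    if root.tag != "department":
        return []
    # Predicate on department: some manager/name equals "John".
    if not any(text_of(n) == "John" for n in root.findall("manager/name")):
        return []
    # Keep the name children of employees that have position = "engineer".
    return [
        name
        for emp in root.findall("employee")
        if any(text_of(p) == "engineer" for p in emp.findall("position"))
        for name in emp.findall("name")
    ]


def filter_query(root):
    """Filtering: only report whether Eval(Q, D) is nonempty."""
    return bool(eval_query(root))


root = ET.fromstring(DOC)
selected = [text_of(n) for n in eval_query(root)]
print(selected)
print(filter_query(root))
```

This version loads the whole document into memory, so it says nothing about streaming cost; it only pins down what each mode has to return.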

7

XML Streams

XML stream: sequence of SAX events
- startDocument(), endDocument(), startElement(name), endElement(name), text(str), …

Critical resources
- Memory
- Processing time

Why XML streams?
- For transferring XML between systems
- For efficient access to large XML documents
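As an illustration of what such an event sequence looks like, the sketch below logs the callbacks that Python's standard `xml.sax` parser issues for a tiny document. The class name `EventLogger` and the tuple encoding of events are assumptions made for this example; note that `xml.sax` calls the text event `characters`.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX callback as a tuple, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # A parser is allowed to deliver one text node in several chunks.
        if content.strip():
            self.events.append(("text", content))


logger = EventLogger()
xml.sax.parseString(
    b"<department><name>Software Testing</name></department>", logger
)
for event in logger.events:
    print(event)
```

The handler sees each event exactly once and in document order, which is why anything needed later has to be kept in memory explicitly.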

8

Streaming XML Algorithms

- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …

All of them use lots of memory on certain queries & documents

9

Memory Bottleneck I: Storage of Large Transition Tables

Framework of most algorithms:
- Q → NFA
- Simulate NFA by DFA

Caveat: exponential blowup

However: exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
- Algorithm for filtering XML streams whose space is linear in the query size
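One way to see why a precomputed DFA table is avoidable is to simulate the NFA directly, keeping only the set of currently active states per open element. The sketch below does this for linear path queries without predicates. It is a textbook-style illustration under assumed encodings (a query is a list of `(axis, name)` steps, the stream is a list of `("start", name)` / `("end", name)` pairs), not the algorithm of the cited paper.

```python
def matches(query, events):
    """Return True if some element in the stream matches the linear path query.

    query:  list of (axis, name) steps; axis is "/" (child) or "//" (descendant).
    events: iterable of ("start", name) / ("end", name) pairs.
    """
    n = len(query)
    # State i means "the first i steps of the query are matched".
    # The stack holds one set of active states per open element.
    stack = [frozenset({0})]
    for kind, name in events:
        if kind == "start":
            nxt = set()
            for i in stack[-1]:
                if i == n:
                    continue  # the accepting state has no outgoing step
                axis, step_name = query[i]
                if step_name == name:
                    nxt.add(i + 1)  # advance along the matching step
                if axis == "//":
                    nxt.add(i)  # a descendant step may still match deeper
            if n in nxt:
                return True
            stack.append(frozenset(nxt))
        else:
            stack.pop()
    return False


# Events for <x><a><b/></a></x>
stream = [
    ("start", "x"), ("start", "a"), ("start", "b"),
    ("end", "b"), ("end", "a"), ("end", "x"),
]
print(matches([("//", "a"), ("/", "b")], stream))
print(matches([("/", "a"), ("/", "b")], stream))
```

Each stack entry holds at most one state per query step, so the memory used here is bounded by the query size times the depth of the open-element stack, with no transition table stored.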

10

Memory Bottleneck II: Buffering of Document Fragments

Scenario 1: buffering nodes, which may or may not be part of the output.

/department[manager/name = "John"]/employee[position = "engineer"]/name

[Figure: the XML document tree, annotated with the query.]
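A hand-coded streaming evaluator for the Scenario 1 query shows where the buffering comes from: engineer names arrive before the manager element, so they must be held until manager/name = "John" is either confirmed or ruled out. The following is a minimal sketch using Python's `xml.sax`; the class `Scenario1Evaluator`, the `peak` counter, and the `run` helper are assumptions for illustration, and the code assumes at most one name per employee.

```python
import xml.sax

DOC = (
    b"<department><name>Software Testing</name>"
    b'<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    b'<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    b'<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    b'<manager id="4"><name>John</name></manager>'
    b"</department>"
)


class Scenario1Evaluator(xml.sax.ContentHandler):
    """Evaluates only the Scenario 1 query, buffering undecided names."""

    def __init__(self):
        super().__init__()
        self.path = []               # names of the currently open elements
        self.text = []               # text chunks of the innermost open element
        self.manager_ok = False      # has manager/name = "John" been seen?
        self.emp_name = None         # name of the employee being read
        self.emp_is_engineer = False
        self.buffer = []             # engineer names awaiting the manager predicate
        self.output = []
        self.peak = 0                # largest number of names held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        path = "/".join(self.path)
        if path == "department/employee/name":
            self.emp_name = value
        elif path == "department/employee/position":
            self.emp_is_engineer = self.emp_is_engineer or value == "engineer"
        elif path == "department/employee":
            if self.emp_is_engineer and self.emp_name is not None:
                self.buffer.append(self.emp_name)
            self.emp_name, self.emp_is_engineer = None, False
        elif path == "department/manager/name" and value == "John":
            self.manager_ok = True
        held = len(self.buffer) + (self.emp_name is not None)
        self.peak = max(self.peak, held)
        if self.manager_ok:          # predicate resolved: emit instead of holding
            self.output.extend(self.buffer)
            self.buffer = []
        self.path.pop()
        self.text = []

    def endDocument(self):
        self.buffer = []             # unresolved candidates are never output


def run(doc):
    handler = Scenario1Evaluator()
    xml.sax.parseString(doc, handler)
    return handler


result = run(DOC)
print(result.output, result.peak)
```

On the sample document the evaluator holds Alice and Bob, plus Carole until her position is known, before the manager element arrives. Moving the manager element to the front of the document would let every name be decided as soon as its employee closes.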

11

Memory Bottleneck II: Buffering of Document Fragments

Scenario 2: buffering nodes needed for evaluating pending predicates.

/department[employee/name = manager/name]/name

[Figure: the XML document tree, annotated with the query.]
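The predicate [employee/name = manager/name] compares two sets of values that arrive at different times, so whichever side arrives first has to be remembered. The sketch below checks only that predicate in one pass with `xml.etree.ElementTree.iterparse`, stopping as soon as a match is found. The function name `predicate_holds` and the returned `peak` counter are assumptions introduced for this example.

```python
import io
import xml.etree.ElementTree as ET

DOC = (
    b"<department><name>Software Testing</name>"
    b'<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    b'<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    b'<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    b'<manager id="4"><name>John</name></manager>'
    b"</department>"
)


def predicate_holds(xml_bytes):
    """One-pass check of [employee/name = manager/name] on department.

    Returns (result, peak), where peak is the largest number of name
    values buffered at any point.
    """
    employee_names, manager_names = set(), set()
    path, peak = [], 0
    parser = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in parser:
        if event == "start":
            path.append(elem.tag)
            continue
        where = "/".join(path)
        value = (elem.text or "").strip()
        if where == "department/employee/name":
            if value in manager_names:
                return True, peak      # eager: decide at the first match
            employee_names.add(value)
        elif where == "department/manager/name":
            if value in employee_names:
                return True, peak
            manager_names.add(value)
        peak = max(peak, len(employee_names) + len(manager_names))
        path.pop()
        elem.clear()                   # drop the content of the finished element
    return False, peak


print(predicate_holds(DOC))
print(predicate_holds(DOC.replace(b"John", b"Bob")))
```

Unlike Scenario 1, this buffering is needed even for filtering, because the answer itself depends on values seen earlier in the stream.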

12

Memory Bottleneck II: Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that are nested within each other.

- Relevant only when document is "recursive"
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]

13

Our Results

- Quantitative space lower bounds for:
  - Full-fledged evaluation of queries with predicates (Scenario 1)
  - Filtering/full-fledged evaluation of queries with "multi-variate" predicates (Scenario 2)
- Matching upper bound
  - Eager evaluation of predicates
- In all other scenarios: no buffering required
  - Filtering non-recursive documents using queries with "univariate" predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

14

Related Work

- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

15

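Document Concurrency

- $Q$: query
- $D = \sigma_1,\ldots,\sigma_n$: document
  - Each $\sigma_i$ is a SAX event
  - $\delta_t = (\sigma_1,\ldots,\sigma_t)$
- Definition: $x \in D$ is alive at step $t$ if $x \in \delta_t$ and $\exists\, \alpha, \beta$ s.t. $x \in \mathrm{Eval}(Q, \delta_t\alpha)$ and $x \notin \mathrm{Eval}(Q, \delta_t\beta)$
- t-concurrency(D,Q): number of distinct nodes that are alive at step $t$
- concurrency(D,Q): $\max_t$ t-concurrency(D,Q)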

16

Lower Bound Notions

A "normal" lower bound:
- For every algorithm A, there exist Q and D s.t. A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D.
- Q and D may be "pathological"
- Doesn't say much about real-world queries/documents

An "ideal" lower bound:
- For every A, every Q, and every D, A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D.
- Too good to be true
  - A can have D and Q "hard-coded", and then know the result a priori
  - Space of A on D and Q = minimum description length of Q and D

17

Our Lower Bound

Theorem: For every A, every Q, and every D, there exists an almost isomorphic document D', s.t. A uses $\Omega(\text{concurrency}(D,Q))$ bits of space on Q and D'.
- D' is the same as D, except for a few extra empty nodes with auxiliary names.
- Theorem holds only if:
  - Q is "star-free"
  - D is non-recursive

18

Why isn’t this Obvious?

Reason 1: we want the theorem to work for every Q and D, not only ones with high minimum description length (MDL).

Reason 2:
- Obvious: if x is alive at step t $\Rightarrow$ A has to buffer x
  - Because: A may or may not need to output x
- Not obvious: if x and y are alive at step t $\Rightarrow$ A has to buffer both
  - If x and y are not "independent", maybe it's enough to buffer just x (or just y)

19

Proof of Lower Bound

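- $C$ = t-concurrency(D,Q)
- $x_1,\ldots,x_C$ = distinct nodes alive at step $t$
- Recall: for every $x_i$ there exist $\alpha_i$ and $\beta_i$ s.t. $x_i \in \mathrm{Eval}(Q, \delta_t\alpha_i)$ and $x_i \notin \mathrm{Eval}(Q, \delta_t\beta_i)$, where $\delta_t$ denotes the first $t$ events of $D$
- Lemma: there exist a single $\alpha$ and a single $\beta$ s.t. for all $i$, $x_i \in \mathrm{Eval}(Q, \delta_t\alpha)$ and $x_i \notin \mathrm{Eval}(Q, \delta_t\beta)$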

20

Proof of Lower Bound (cont.)

- For every $S \subseteq \{1,\ldots,C\}$ define document $D_S$:
  - $D_S$ is the same as $D$, except
  - For every $i \in S$, we "mark" $x_i$
  - Marking: an extra empty child with an auxiliary name
- Note: $D_S$ is almost-isomorphic to $D$
- $\delta_t^S$ = first $t$ events in $D_S$

21

Proof of Lower Bound (cont.)

- A = any algorithm
- Consider the state of A after processing $\delta_t^S$:
  - If suffix = $\beta$, none of the $x_i$'s should be output
    $\Rightarrow$ A could not have output any $x_i$ by step $t$
  - If suffix = $\alpha$, there is no information in the suffix about $S$, but $S$ can be reconstructed from the output
    $\Rightarrow$ the state of A at step $t$ must have all information about $S$
- Conclusion: space $\geq \Omega(C)$
- Actual proof: by one-way communication complexity
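The counting step behind this conclusion can be spelled out as a short informal derivation. It is a sketch of the intuition only (the slide notes that the actual proof goes through one-way communication complexity), and the notation is an assumption made here: $\delta_t^{S}$ is the first $t$ events of $D_S$, and $\alpha$ is the single suffix from the lemma under which every $x_i$ is selected. It uses the earlier observation that A has output nothing by step $t$, so everything A outputs afterwards depends only on its state and on the suffix.

```latex
\begin{align*}
&\text{Assume } S \neq S' \text{ but } A \text{ is in the same state after }
  \delta_t^{S} \text{ and after } \delta_t^{S'}. \\
&\text{Then } A \text{ emits the same output on } \delta_t^{S}\alpha
  \text{ and on } \delta_t^{S'}\alpha, \\
&\text{yet the correct outputs differ (they reveal } S \text{ and } S'
  \text{ respectively), a contradiction.} \\
&\Rightarrow\ \#\text{states} \;\ge\;
  \bigl|\{\, S : S \subseteq \{1,\dots,C\} \,\}\bigr| = 2^{C}
  \quad\Rightarrow\quad
  \text{space} \;\ge\; \log_2 2^{C} = C \text{ bits.}
\end{align*}
```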

22

Conclusions

Our contributions:
- Quantitative space lower bounds
  - Full-fledged evaluation of queries with predicates
  - Filtering/full-fledged evaluation of queries with "multi-variate" predicates
- Matching upper bound

Open problems:
- Quantitative lower bounds for XQuery evaluation over streams
- Address larger fragments of XPath

23

Memory Bottleneck II: Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that are nested within each other.

//a[b and c]

[Figure: a recursive document tree under root, with nested a elements that have b and c children.]

- Relevant only when document is "recursive"
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
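The dependence on recursion depth can be seen in a small filter for //a[b and c]: every open a element is a candidate whose b and c children may still arrive, so one record per nested a has to be kept. The sketch below is an illustration under assumed encodings (events are `("start", name)` / `("end", name)` pairs, and the function name `filter_a_with_b_and_c` is made up here), not an algorithm taken from the talk.

```python
def filter_a_with_b_and_c(events):
    """Filter for //a[b and c].

    Returns (matched, peak), where peak is the largest number of open
    `a` candidates held at the same time.
    """
    stack = []  # one entry per open element: None, or a flag record for an `a`
    matched, peak = False, 0
    for kind, name in events:
        if kind == "start":
            parent = stack[-1] if stack else None
            if parent is not None and name in ("b", "c"):
                parent[name] = True  # this child satisfies part of the predicate
            stack.append({"b": False, "c": False} if name == "a" else None)
            peak = max(peak, sum(1 for flags in stack if flags is not None))
        else:
            flags = stack.pop()
            if flags is not None and flags["b"] and flags["c"]:
                matched = True
    return matched, peak


# <a><a><b/></a><c/></a>: no single `a` has both a b child and a c child.
no_match = [
    ("start", "a"), ("start", "a"), ("start", "b"), ("end", "b"),
    ("end", "a"), ("start", "c"), ("end", "c"), ("end", "a"),
]
# <a><b/><a><b/><c/></a></a>: the inner `a` has both children.
match = [
    ("start", "a"), ("start", "b"), ("end", "b"),
    ("start", "a"), ("start", "b"), ("end", "b"),
    ("start", "c"), ("end", "c"), ("end", "a"), ("end", "a"),
]
print(filter_a_with_b_and_c(no_match))
print(filter_a_with_b_and_c(match))
```

In a non-recursive document no a is nested inside another a, so at most one flag record is live at a time.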

24

Concurrency: Example

1: <department>
2: <name>
3: Software Testing
4: </name>
5: <employee id="1">
6: <name>
7: Alice
8: </name>
9: <position>
10: engineer
11: </position>
12: </employee>
13: <employee id="2">
14: <name>
15: Bob
16: </name>
17: <position>
18: engineer
19: </position>
20: </employee>
21: <employee id="3">
22: <name>
23: Carole
24: </name>
25: <position>
26: assistant
27: </position>
28: </employee>
29: <manager id="4">
30: <name>
31: John
32: </name>
33: </manager>
34: </department>

/department[manager/name = "John"]/employee[position = "engineer"]/name

Labels on the three employee name nodes, in document order: alive, alive, dead
