Buffering in Query Evaluation over XML Streams Ziv Bar-Yossef Technion Marcus Fontoura Vanja Josifovski IBM Almaden Research Center

Post on 21-Dec-2015


Buffering in Query Evaluation over XML Streams

Ziv Bar-Yossef (Technion)
Marcus Fontoura, Vanja Josifovski (IBM Almaden Research Center)


XML Document

<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>


XML Document Tree

root
  department
    name: Software Testing
    employee
      @id: 1
      name: Alice
      position: engineer
    employee
      @id: 2
      name: Bob
      position: engineer
    employee
      @id: 3
      name: Carole
      position: assistant
    manager
      @id: 4
      name: John


XPath Queries

/department[manager/name = "John"]/employee[position = "engineer"]/name

[Figure: the query overlaid on the example document tree.]


XPath Queries

/department[employee/name = manager/name]/name

[Figure: the query overlaid on the example document tree.]


XPath

- XPath 2.0
- Forward axes only
- Eval(Q,D): the nodes in D that match Q

Two modes of XPath evaluation:
- Full-fledged evaluation: given Q and D, output Eval(Q,D)
- Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
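The difference between the two modes can be sketched in a few lines. This is a non-streaming illustration only, using Python's `xml.etree.ElementTree`, which supports a limited XPath subset. The helper names `eval_query` and `filter_query` and the simplified query (one predicate, evaluated relative to the `department` root) are assumptions made for the example, not part of the talk.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<department><name>Software Testing</name>'
    '<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    '<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    '<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    '<manager id="4"><name>John</name></manager>'
    '</department>'
)


def eval_query(root, path):
    """Full-fledged evaluation: return the text of every matching node."""
    return [node.text for node in root.findall(path)]


def filter_query(root, path):
    """Filtering: report only whether at least one node matches."""
    return root.find(path) is not None


department = ET.fromstring(DOC)
# Simplified query, relative to the <department> root element.
QUERY = "./employee[position='engineer']/name"

print(eval_query(department, QUERY))
print(filter_query(department, QUERY))
```

Filtering needs only a yes/no answer, so it never has to retain the matching nodes themselves.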


XML Streams

- XML stream: a sequence of SAX events
  - startDocument(), endDocument(), startElement(name), endElement(name), text(str), …
- Critical resources
  - Memory
  - Processing time
- Why XML streams?
  - For transferring XML between systems
  - For efficient access to large XML documents
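To make the event sequence concrete, here is a minimal sketch using Python's standard `xml.sax` module. The `EventLogger` class and the `sax_events` helper are hypothetical names introduced for this example. Python's SAX API delivers text through `characters()`, which is recorded here as a `text` event to match the naming above.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX callback as a tuple, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # Skip whitespace-only chunks between elements.
        if content.strip():
            self.events.append(("text", content.strip()))


def sax_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = sax_events('<employee id="1"><name>Alice</name></employee>')
for event in events:
    print(event)
```

The parser hands over one event at a time and keeps no tree, so anything the query processor wants to see again later has to be stored by the processor itself.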


Streaming XML Algorithms

- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …

All of them use lots of memory on certain queries & documents


Memory Bottleneck I: Storage of Large Transition Tables

- Framework of most algorithms:
  - Q → NFA
  - Simulate the NFA by a DFA
- Caveat: exponential blowup
- However: exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  - Algorithm for filtering XML streams whose space is linear in the query size


Memory Bottleneck II: Buffering of Document Fragments

Scenario 1: buffering nodes that may or may not be part of the output.

/department[manager/name = "John"]/employee[position = "engineer"]/name

[Figure: the query overlaid on the example document tree.]
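Scenario 1 can be sketched as a hand-written streaming evaluator for this one query. It is an illustration under stated assumptions, not the algorithm from the talk: the class `Scenario1`, the helper `run_scenario1`, and the `peak` counter are hypothetical names, and the query is hard-coded rather than compiled. Employee names are held until both predicates are resolved, because the `manager` element arrives after the employees in this document.

```python
import xml.sax

DOC = (
    '<department><name>Software Testing</name>'
    '<employee id="1"><name>Alice</name><position>engineer</position></employee>'
    '<employee id="2"><name>Bob</name><position>engineer</position></employee>'
    '<employee id="3"><name>Carole</name><position>assistant</position></employee>'
    '<manager id="4"><name>John</name></manager>'
    '</department>'
)


class Scenario1(xml.sax.ContentHandler):
    """Hard-coded evaluator for
    /department[manager/name="John"]/employee[position="engineer"]/name."""

    def __init__(self):
        super().__init__()
        self.path = []           # names of the currently open elements
        self.text = []           # text chunks of the innermost open element
        self.manager_ok = False  # has manager/name = "John" been seen?
        self.current = []        # names of the employee being read
        self.engineer = False    # has that employee position = "engineer"?
        self.buffer = []         # names waiting for the manager predicate
        self.output = []
        self.peak = 0            # most names held at any one time

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["department", "employee"]:
            self.current, self.engineer = [], False

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "employee", "name"]:
            self.current.append(value)
            self.peak = max(self.peak, len(self.buffer) + len(self.current))
        elif self.path == ["department", "employee", "position"]:
            self.engineer = self.engineer or value == "engineer"
        elif self.path == ["department", "employee"]:
            if self.engineer:
                # Emit now if the manager predicate is known, else keep waiting.
                target = self.output if self.manager_ok else self.buffer
                target.extend(self.current)
            self.current = []
        elif self.path == ["department", "manager", "name"] and value == "John":
            self.manager_ok = True
            self.output.extend(self.buffer)  # flush everything that was waiting
            self.buffer = []
        self.path.pop()
        self.text = []


def run_scenario1(xml_text):
    handler = Scenario1()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.output, handler.peak


print(run_scenario1(DOC))
```

On the example document the evaluator holds three names at once: Alice and Bob while waiting for the manager, plus Carole until her position rules her out.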


Memory Bottleneck II: Buffering of Document Fragments

Scenario 2: buffering nodes needed for evaluating pending predicates.

/department[employee/name = manager/name]/name

[Figure: the query overlaid on the example document tree.]
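Scenario 2 can be sketched the same way. The predicate compares two sets of values that arrive at different times, so the values seen so far must be kept until a match is found. The class `Scenario2`, the helper `run_scenario2`, and the `peak` counter are hypothetical names for this illustration; the query is hard-coded and the document is assumed to have a single `department` root.

```python
import xml.sax


class Scenario2(xml.sax.ContentHandler):
    """Hard-coded evaluator for /department[employee/name = manager/name]/name."""

    def __init__(self):
        super().__init__()
        self.path, self.text = [], []
        self.employee_names = set()  # buffered until a manager name matches
        self.manager_names = set()   # buffered until an employee name matches
        self.dept_names = []         # candidate output nodes
        self.matched = False
        self.peak = 0                # most predicate values held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def _see(self, value, own, other):
        if self.matched:
            return  # predicate already true: nothing more to remember
        if value in other:
            self.matched = True
            self.employee_names.clear()
            self.manager_names.clear()
        else:
            own.add(value)
            held = len(self.employee_names) + len(self.manager_names)
            self.peak = max(self.peak, held)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "name"]:
            self.dept_names.append(value)
        elif self.path == ["department", "employee", "name"]:
            self._see(value, self.employee_names, self.manager_names)
        elif self.path == ["department", "manager", "name"]:
            self._see(value, self.manager_names, self.employee_names)
        self.path.pop()
        self.text = []


def run_scenario2(xml_text):
    handler = Scenario2()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return (handler.dept_names if handler.matched else []), handler.peak


NO_MATCH = (
    '<department><name>Software Testing</name>'
    '<employee><name>Alice</name></employee>'
    '<employee><name>Bob</name></employee>'
    '<manager><name>John</name></manager></department>'
)
print(run_scenario2(NO_MATCH))
print(run_scenario2(NO_MATCH.replace("John", "Bob")))
```

Unlike Scenario 1, this buffering is needed even for filtering, because the predicate itself cannot be decided without remembering earlier values.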


Memory Bottleneck II: Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that are nested within each other.

- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]


Our Results

- Quantitative space lower bounds for:
  - Full-fledged evaluation of queries with predicates (Scenario 1)
  - Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
- Matching upper bound
  - Eager evaluation of predicates
- In all other scenarios: no buffering required
  - Filtering non-recursive documents using queries with “univariate” predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]


Related Work

- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]


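Document Concurrency

- $Q$: query
- $D = \sigma_1, \ldots, \sigma_n$: document
  - Each $\sigma_i$ is a SAX event
  - $D_t = (\sigma_1, \ldots, \sigma_t)$: the first $t$ events of $D$
- Definition: $x \in D$ is alive at step $t$ if $x \in D_t$ and there exist suffixes $\alpha, \beta$ s.t.
  - $x \in \mathrm{Eval}(Q, D_t\alpha)$
  - $x \notin \mathrm{Eval}(Q, D_t\beta)$
- t-concurrency(D,Q): number of distinct nodes that are alive at step $t$
- concurrency(D,Q): $\max_t$ t-concurrency(D,Q)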


Lower Bound Notions

- A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
  - Q and D may be “pathological”
  - Doesn’t say much about real-world queries/documents
- An “ideal” lower bound: for every A, every Q, and every D, A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
  - Too good to be true
  - A can have D and Q “hard-coded”, and then know the result a priori
  - Space of A on D and Q = minimum description length of Q and D


Our Lower Bound

- Theorem: for every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D’.
  - D’ is the same as D, except for a few extra empty nodes with auxiliary names.
- Theorem holds only if:
  - Q is “star-free”
  - D is non-recursive


Why isn’t this Obvious?

- Reason 1: we want the theorem to work for every Q and D, not only ones with high minimum description length (MDL).
- Reason 2:
  - Obvious: if x is alive at step t ⇒ A has to buffer x
    - Because: A may or may not need to output x
  - Not obvious: if x and y are alive at step t ⇒ A has to buffer both
    - If x and y are not “independent”, maybe it’s enough to buffer just x (or just y)


Proof of Lower Bound

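- $C$ = t-concurrency(D,Q)
- $x_1, \ldots, x_C$ = distinct nodes alive at step $t$
- Recall: for every $x_i$ there exist $\alpha_i$ and $\beta_i$ s.t. (with $D_t$ the first $t$ events of $D$)
  - $x_i \in \mathrm{Eval}(Q, D_t\alpha_i)$
  - $x_i \notin \mathrm{Eval}(Q, D_t\beta_i)$
- Lemma: there exist a single $\alpha$ and a single $\beta$ s.t. for all $i$,
  - $x_i \in \mathrm{Eval}(Q, D_t\alpha)$
  - $x_i \notin \mathrm{Eval}(Q, D_t\beta)$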


Proof of Lower Bound (cont.)

- For every $S \subseteq \{1, \ldots, C\}$ define a document $D_S$:
  - $D_S$ is the same as $D$, except that for every $i \in S$ we “mark” $x_i$
  - Marking: an extra empty child with an auxiliary name
- Note: $D_S$ is almost-isomorphic to $D$
- $D_t^S$ = first $t$ events in $D_S$


Proof of Lower Bound (cont.)

- $A$ = any algorithm
- Consider the state of $A$ after processing $D_t^S$ (the first $t$ events of $D_S$):
  - If suffix = $\beta$: none of the $x_i$’s should be output ⇒ $A$ could not have output any $x_i$ by step $t$
  - If suffix = $\alpha$: there is no information in the suffix about $S$, but $S$ can be reconstructed from the output ⇒ the state of $A$ at step $t$ must have all information about $S$
- Conclusion: space ≥ $\Omega(C)$
- Actual proof: by one-way communication complexity
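The conclusion can be read as a counting argument. The following is a simplified sketch, assuming $A$ is deterministic and writing $D_t^S$ for the first $t$ events of $D_S$; it is not the communication-complexity proof itself. Since every one of the $2^C$ subsets $S$ must be recoverable from the state alone, distinct subsets force distinct states:

```latex
\[
\bigl|\{\text{states of } A \text{ after } D_t^S\}\bigr|
\;\ge\; \bigl|\{S : S \subseteq \{1,\ldots,C\}\}\bigr| \;=\; 2^C
\quad\Longrightarrow\quad
\text{space} \;\ge\; \log_2 2^C \;=\; C \text{ bits}.
\]
```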


Conclusions

- Our contributions:
  - Quantitative space lower bounds
    - Full-fledged evaluation of queries with predicates
    - Filtering/full-fledged evaluation of queries with “multi-variate” predicates
  - Matching upper bound
- Open problems:
  - Quantitative lower bounds for XQuery evaluation over streams
  - Address larger fragments of XPath


Memory Bottleneck II: Buffering of Document Fragments

Scenario 3: buffering multiple candidate matches that are nested within each other.

//a[b and c]

[Figure: a recursive document tree with nested a elements and b and c children.]

- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
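The nesting in Scenario 3 can be sketched with a stack that holds one entry per open element. Every open `a` element whose predicate is still undecided is a pending candidate, so the number of pending candidates can grow with the recursion depth. The class `NestedCandidates`, the helper `run_nested`, and the numbering of `a` elements in document order are assumptions made for this illustration of `//a[b and c]`.

```python
import xml.sax


class NestedCandidates(xml.sax.ContentHandler):
    """Hard-coded evaluator for //a[b and c]; a elements are numbered 1, 2, ..."""

    def __init__(self):
        super().__init__()
        self.open = []     # one entry per open element: a frame for a, else None
        self.count = 0     # number of a elements seen so far
        self.matches = []  # ids of matching a elements, in order of decision
        self.peak = 0      # most undecided candidates held at once

    def _undecided(self):
        return sum(1 for f in self.open if f is not None and not f["done"])

    def startElement(self, name, attrs):
        parent = self.open[-1] if self.open else None
        if parent is not None and name in ("b", "c"):
            parent[name] = True  # record a b or c child of the enclosing a
            if parent["b"] and parent["c"] and not parent["done"]:
                parent["done"] = True  # eager: decided once both are seen
                self.matches.append(parent["id"])
        if name == "a":
            self.count += 1
            frame = {"id": self.count, "b": False, "c": False, "done": False}
            self.open.append(frame)
        else:
            self.open.append(None)
        self.peak = max(self.peak, self._undecided())

    def endElement(self, name):
        self.open.pop()


def run_nested(xml_text):
    handler = NestedCandidates()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches, handler.peak


NESTED = "<a><a><a><b/><c/></a><b/></a><c/></a>"
print(run_nested(NESTED))
```

In `NESTED` three `a` elements are open and undecided at the same moment, and only the innermost one ends up with both a `b` and a `c` child.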


Concurrency: Example

/department[manager/name = "John"]/employee[position = "engineer"]/name

<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>                  (alive)
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>                    (alive)
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>                 (dead)
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>