engineering succinct dom - university of leicester · 2011-10-11 · xml “bloat” •verbose...

24
Engineering Succinct DOM Engineering Succinct DOM O’Neil Delpratt, Rajeev Raman, Naila Rahman

Upload: others

Post on 22-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Engineering Succinct DOMEngineering Succinct DOM

O’Neil Delpratt, Rajeev Raman, Naila Rahman

Page 2: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

XML “Bloat”

• Verbose data representation.

▫ interoperability

▫ human-readability

IAB 2008

2

▫ human-readability

• XML processing uses much more memory.

▫ benefits?

• “Premature optimisation is the root of all evil”

▫ SOAP packets ~ 1MB

Page 3: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

DOM

book

#doc

7

<book catalogue=“java”><author>Selman</author><title>Java3d programming</title><year>2000</year>

</book>

cat “java”

IAB 2008

3

titleauthor year1

4 6

7

“[cr][sp]”

“Selman” “2000”

“[cr][sp]” “[cr][sp]”“[cr][sp]”

title year

“Java3d programming”

DOM (Document Object Model) – basic in-memory representation.

Navigate tree; retrieve information associated with node.

Standard DOM implementations: pointer-based.

In-memory representation = 300% to 800% of file size

For static documents, is there a “pointerless” representation, with full DOM functionality? Succinct DOM

Page 4: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Succinct DOM• Supports most static methods of DOM Core L3▫ document changes not supported.

• Supports x R y, R є {ancestor, descendant,preceding, following,

IAB 2008

4

descendant,preceding, following,

preceding/following-sibling}

• Good worst-case behaviour:▫ navigation, x R y etc. in O(1) time.▫ good upper bounds on space usage.

• “Very fast” with “reasonable compression”.• Leicester Research Archive (Google title)▫ SDOM library (linux/i686), test harness, Valgrind

3.5x faster to 7x slower than Xerces-C

(i) significantly smaller than original XML file or (ii) comparable to “query-

friendly” XML compressors

Page 5: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Previous Work• Many XML compressors: (XMill, Xgrind, XPress, XCQ, Xbzip…) ▫ Query-friendly (Xgrind, Xpress, XCQ, Xbzip …) focus on XPath queries, not navigation

▫ Navigation typically slow – order of milliseconds.▫ SDOM/(-CT)

� (at worst) few times slower than memory access.

IAB 2008

5

� (at worst) few times slower than memory access.� no direct support for Xpath queries.

• DOM implementations▫ SEDOM –basic operations slow (ms), CR between SDOM and SDOM-CT (allows updates)

▫ DDOM* – space better than Xerces (allows updates)▫ TinyTree – space much better than Xerces (static)▫ ISX (WWW’07):

� similar approach to ours, seems 5x-10x slower from paper, significantly worse CR, but fast updates.

• Summary: Succinct DOM occupies new point in Space/Time tradeoff, first to surpass performance of Xerces with good compression.

Page 6: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Succinct DOM

• Based on Succinct Data Structures.

▫ “Information-theoretic” optimality.

▫ Succinctness != compression.

IAB 2008

6

▫ Succinctness != compression.

• Highly optimised components [GRRR’06, DRR’07]

Page 7: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

• Space usage?▫ Naïve: ≥ 2 pointers/node

▫ Succinct:

Tree Structure

1

221log

n

n

n

)(log2 nOn −=

IAB 2008

)(log2 nOn −=

7

Page 8: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Tree Structure

• Parenthesis sequence:

▫ findclose/open, enclose

Xerces: 5 ptrs/node(160 bits/node)

Succinct: 2 bits/node

a b c d e f g h i j

8

( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ) ( ) )

▫ findclose/open, enclose

� enclose(10) = 1

• O(1) time

▫ 2.86n bits [GRRR ‘06]

IAB 2008

Page 9: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Tree Nodes

• Numbered in document order

▫ Text nodes

� numbered consecutively 1..t in “document order”

IAB 2008

9

Bit-vector data structure with Rank operation

� numbered consecutively 1..t in “document order”

▫ Non-text tree nodes

� numbered consecutively 1..e in “document order”

� index into array of “short codes”, where each entry is log(p + 12) bits.

Number of distinct

element names

Specify other kinds of nodes e.g. processing-instruction.

Page 10: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Text Nodes<book catalogue=“java”><author>Selman</author><title>Java3d programming</title><year>2000</year></book>

book

#doc

7

cat “java”

• “String value” of text node

titleauthor year1

4 6

7

“[cr][sp]”

“Selman” “2000”

“[cr][sp]” “[cr][sp]”“[cr][sp]”

title year

“Java3d programming”

IAB 2008

10

Page 11: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Text Nodes

• Text nodes numbered 1..t

S e l m a n c

r

s

pJ a v a 3 D … m m i n g c

r

s

p2 0 0

IAB 2008

11

Text data “array”

• Naïve: 4-8 bytes per text node.

▫ Average text node: 10-12 characters – uncompressed.

• Offset array compression [DRR ‘07]: ~ 1 byte/node.

S e l m a nr p

J a v a 3 D … m m i n gr p

2 0 0

Offset “array”

Page 12: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Text data “array”

SDOM SDOM-CT (Q&D)

12

IAB 2008

(All text nodes concatenated in document order; offsets given by offset “array”)

• Array of uncompressed characters

▫ text node value is null-terminated: can return pointer into array.

▫ Pure succinct representation.

1. Bzip-compress 16KB blocks, decompress on demand.

o cache 4 uncompressed blocks.

2. Store array in FM-index or Compressed Suffix Array.

o seems slower than (1)

� Use (1) if locality in text access; (2) otherwise.

� Other alternatives e.g. hashing.

Page 13: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Node class

• Most important class in DOM for navigation.▫ represents a node in the DOM tree.▫ SDOM Node: 2 ints + ptr to Document object.

• SDOM initially has no objects

IAB 2008

13

• SDOM initially has no Node objects▫ Created upon traversal; no check to see if a tree node has an existing Node object:� y = x.getFirstChild();

� z = x.getFirstChild(); /* z != y */

• No garbage collection…• Better use TreeWalker class▫ NextNode/PreviousNodemethods are very fast.

Page 14: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

IAB 2008

14

Page 15: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Files used (UW repository)

File Size Nodes

Max.

DepthMax node degree

#text

chars

#attr

Chars

Orders 5MB 300,004 3 30001 1MB NEG

Lineitem 32MB 2,045,954 3 120351 6MB 0

IAB 2008

15

Lineitem 32MB 2,045,954 3 120351 6MB 0

XPATH 50MB 2,522,571 5 42075 13MB NEG

Treebank_e 82MB 7,312,613 36 112769 57MB NEG

SwissProt 110MB 10,599,084 5 100000 35MB 13MB

DBLP 128MB 10,595,379 6 657717 64MB 7MB

XCDNA 594MB 25,221,153 7 82237 256MB 0

Also Xmark files: not shown. Machine: Pentium 2GB RAM. Ubuntu. g++ 3.3.

Page 16: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

SDOM space usage

File Size SDOM Xerces Saxon

Orders 5MB 37% 451% 157%

Lineitem 32MB 28% 399% 161%

• DOM implementations (with uncompressed text)

• Percentages relative to file sizes.

IAB 2008

16

Lineitem 32MB 28% 399% 161%

XPath 50MB 33% 383% 137%

Treebank 82MB 84% 866% 266%

SwissPrt 110MB 60% 704% 272%

DBLP 128MB 68% 737% 240%

XCDNA 594MB 50% 491% 136%

sizes.

• Textual data left uncompressed!

Page 17: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

SDOM-CT space usage

• Comparison with:▫ Query-Friendly XML compressors

IAB 2008

17

� XBZipIndex, XPRESS,

� XQZip, XGRIND.

▫ XMill

�SDOM-CT is a “middle-of-the-road” compressor.

�about 25% of SDOM-CT’s space is purely to support fast navigation.

Page 18: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Speed test

• Basic test: several kinds of traversals▫ Document-order▫ Reverse document-order▫ Upward path enumeration

IAB 2008

18

▫ Upward path enumeration• Full test:▫ Document-order traversal plus:

� traverse all attributes and values� search for a (uncommon) substring in all text nodes.

• Compare traversals using Node and TreeWalkerNextNode methods.• All XML documents (except XCDNA) fit comfortably in main memory even for Xerces.

Page 19: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Basic Test (Document/RDO)

IAB 2008

19

20

25

30

0

5

10

15

orders lineitem XPATH treebank_e SwissProt dblp XCDNA

Tim

e (Secs)

Xerces-DocOrder SDOM-DocOrder Xerces-ReverseDocOrder SDOM-ReverseDocOrder

Page 20: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Upward Path Enumeration

IAB 2008

20

100

120

0

20

40

60

80

orders lineitem XPATH treebank_e SwissProt dblp XCDNA

Tim

e (Secs)

Xerces SDOM

Page 21: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Full Test

IAB 2008

21

20

25

30

0

5

10

15

20

treebank_e SwissProt dblp XCDNA

Tim

e (Secs)

Xerces-TreeNav Xerces-NextNode SDOM-TreeNav SDOM-NextNode SDOM-CT-TreeNav

Page 22: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Text compression

libBzip2-blocks FMIndex

Files Text path-order doc-order doc-order

Orders 1MB 22% 30% 29%

IAB 2008

22

Lineitem 6MB 21% 31% 26%

XPATH 13MB 9% 13% 13%

Treebank 57MB 42% 45% 56%

SwissProt 49MB 17% 29% 20%

DBLP 71MB 28% 35% 30%

Difference between path-order and doc-order is fairly small for bzip2 (cf. Xmill).

Page 23: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

PARSING

“Spike” caused by use of vector for accumulating Textual data as it is

IAB 2008

23

2,000

2,500

3,000

Memory usage (MB)

DOM parsers memory usage -XCDNA.xml (594MB)

Textual data as it is processed by SAX.

Easy fix for much “preprocessing” space usage.

-

500

1,000

1,500

Memory usage (MB)

Memory usage snap-shot

SDOM Xerces-c

Page 24: Engineering Succinct DOM - University of Leicester · 2011-10-11 · XML “Bloat” •Verbose data representation. interoperability human-readability IAB 2008 2 •XML processing

Conclusion/Future Work

• Succinct DOM: very fast, reasonable compression, and robust performance. ▫ Incorporate into higher-level application

IAB 2008

24

▫ Incorporate into higher-level application (Xalan/Saxon/OpenOffice)? ▫ Optimise parsing further?▫ Succinctness + secondary storage?

• Commercialisation?:▫ Saxon £250 license▫ Intel high-performance XSLT library $500▫ IBM XML parsing hardware ~ $5k.