engineering succinct dom - university of leicester · 2011-10-11 · xml “bloat” •verbose...
TRANSCRIPT
Engineering Succinct DOMEngineering Succinct DOM
O’Neil Delpratt, Rajeev Raman, Naila Rahman
XML “Bloat”
• Verbose data representation.
▫ interoperability
▫ human-readability
IAB 2008
2
▫ human-readability
• XML processing uses much more memory.
▫ benefits?
• “Premature optimisation is the root of all evil”
▫ SOAP packets ~ 1MB
DOM
book
#doc
7
<book catalogue=“java”><author>Selman</author><title>Java3d programming</title><year>2000</year>
</book>
cat “java”
IAB 2008
3
titleauthor year1
4 6
7
“[cr][sp]”
“Selman” “2000”
“[cr][sp]” “[cr][sp]”“[cr][sp]”
title year
“Java3d programming”
DOM (Document Object Model) – basic in-memory representation.
Navigate tree; retrieve information associated with node.
Standard DOM implementations: pointer-based.
In-memory representation = 300% to 800% of file size
For static documents, is there a “pointerless” representation, with full DOM functionality? Succinct DOM
Succinct DOM• Supports most static methods of DOM Core L3▫ document changes not supported.
• Supports x R y, R є {ancestor, descendant,preceding, following,
IAB 2008
4
descendant,preceding, following,
preceding/following-sibling}
• Good worst-case behaviour:▫ navigation, x R y etc. in O(1) time.▫ good upper bounds on space usage.
• “Very fast” with “reasonable compression”.• Leicester Research Archive (Google title)▫ SDOM library (linux/i686), test harness, Valgrind
3.5x faster to 7x slower than Xerces-C
(i) significantly smaller than original XML file or (ii) comparable to “query-
friendly” XML compressors
Previous Work• Many XML compressors: (XMill, Xgrind, XPress, XCQ, Xbzip…) ▫ Query-friendly (Xgrind, Xpress, XCQ, Xbzip …) focus on XPath queries, not navigation
▫ Navigation typically slow – order of milliseconds.▫ SDOM/(-CT)
� (at worst) few times slower than memory access.
IAB 2008
5
� (at worst) few times slower than memory access.� no direct support for Xpath queries.
• DOM implementations▫ SEDOM –basic operations slow (ms), CR between SDOM and SDOM-CT (allows updates)
▫ DDOM* – space better than Xerces (allows updates)▫ TinyTree – space much better than Xerces (static)▫ ISX (WWW’07):
� similar approach to ours, seems 5x-10x slower from paper, significantly worse CR, but fast updates.
• Summary: Succinct DOM occupies new point in Space/Time tradeoff, first to surpass performance of Xerces with good compression.
Succinct DOM
• Based on Succinct Data Structures.
▫ “Information-theoretic” optimality.
▫ Succinctness != compression.
IAB 2008
6
▫ Succinctness != compression.
• Highly optimised components [GRRR’06, DRR’07]
• Space usage?▫ Naïve: ≥ 2 pointers/node
▫ Succinct:
Tree Structure
−
−
1
221log
n
n
n
)(log2 nOn −=
IAB 2008
)(log2 nOn −=
7
Tree Structure
• Parenthesis sequence:
▫ findclose/open, enclose
Xerces: 5 ptrs/node(160 bits/node)
Succinct: 2 bits/node
a b c d e f g h i j
8
( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ) ( ) )
▫ findclose/open, enclose
� enclose(10) = 1
• O(1) time
▫ 2.86n bits [GRRR ‘06]
IAB 2008
Tree Nodes
• Numbered in document order
▫ Text nodes
� numbered consecutively 1..t in “document order”
IAB 2008
9
Bit-vector data structure with Rank operation
� numbered consecutively 1..t in “document order”
▫ Non-text tree nodes
� numbered consecutively 1..e in “document order”
� index into array of “short codes”, where each entry is log(p + 12) bits.
Number of distinct
element names
Specify other kinds of nodes e.g. processing-instruction.
Text Nodes<book catalogue=“java”><author>Selman</author><title>Java3d programming</title><year>2000</year></book>
book
#doc
7
cat “java”
• “String value” of text node
titleauthor year1
4 6
7
“[cr][sp]”
“Selman” “2000”
“[cr][sp]” “[cr][sp]”“[cr][sp]”
title year
“Java3d programming”
IAB 2008
10
Text Nodes
• Text nodes numbered 1..t
S e l m a n c
r
s
pJ a v a 3 D … m m i n g c
r
s
p2 0 0
IAB 2008
11
Text data “array”
• Naïve: 4-8 bytes per text node.
▫ Average text node: 10-12 characters – uncompressed.
• Offset array compression [DRR ‘07]: ~ 1 byte/node.
S e l m a nr p
J a v a 3 D … m m i n gr p
2 0 0
Offset “array”
Text data “array”
SDOM SDOM-CT (Q&D)
12
IAB 2008
(All text nodes concatenated in document order; offsets given by offset “array”)
• Array of uncompressed characters
▫ text node value is null-terminated: can return pointer into array.
▫ Pure succinct representation.
1. Bzip-compress 16KB blocks, decompress on demand.
o cache 4 uncompressed blocks.
2. Store array in FM-index or Compressed Suffix Array.
o seems slower than (1)
� Use (1) if locality in text access; (2) otherwise.
� Other alternatives e.g. hashing.
Node class
• Most important class in DOM for navigation.▫ represents a node in the DOM tree.▫ SDOM Node: 2 ints + ptr to Document object.
• SDOM initially has no objects
IAB 2008
13
• SDOM initially has no Node objects▫ Created upon traversal; no check to see if a tree node has an existing Node object:� y = x.getFirstChild();
� z = x.getFirstChild(); /* z != y */
• No garbage collection…• Better use TreeWalker class▫ NextNode/PreviousNodemethods are very fast.
IAB 2008
14
Files used (UW repository)
File Size Nodes
Max.
DepthMax node degree
#text
chars
#attr
Chars
Orders 5MB 300,004 3 30001 1MB NEG
Lineitem 32MB 2,045,954 3 120351 6MB 0
IAB 2008
15
Lineitem 32MB 2,045,954 3 120351 6MB 0
XPATH 50MB 2,522,571 5 42075 13MB NEG
Treebank_e 82MB 7,312,613 36 112769 57MB NEG
SwissProt 110MB 10,599,084 5 100000 35MB 13MB
DBLP 128MB 10,595,379 6 657717 64MB 7MB
XCDNA 594MB 25,221,153 7 82237 256MB 0
Also Xmark files: not shown. Machine: Pentium 2GB RAM. Ubuntu. g++ 3.3.
SDOM space usage
File Size SDOM Xerces Saxon
Orders 5MB 37% 451% 157%
Lineitem 32MB 28% 399% 161%
• DOM implementations (with uncompressed text)
• Percentages relative to file sizes.
IAB 2008
16
Lineitem 32MB 28% 399% 161%
XPath 50MB 33% 383% 137%
Treebank 82MB 84% 866% 266%
SwissPrt 110MB 60% 704% 272%
DBLP 128MB 68% 737% 240%
XCDNA 594MB 50% 491% 136%
sizes.
• Textual data left uncompressed!
SDOM-CT space usage
• Comparison with:▫ Query-Friendly XML compressors
IAB 2008
17
� XBZipIndex, XPRESS,
� XQZip, XGRIND.
▫ XMill
�SDOM-CT is a “middle-of-the-road” compressor.
�about 25% of SDOM-CT’s space is purely to support fast navigation.
Speed test
• Basic test: several kinds of traversals▫ Document-order▫ Reverse document-order▫ Upward path enumeration
IAB 2008
18
▫ Upward path enumeration• Full test:▫ Document-order traversal plus:
� traverse all attributes and values� search for a (uncommon) substring in all text nodes.
• Compare traversals using Node and TreeWalkerNextNode methods.• All XML documents (except XCDNA) fit comfortably in main memory even for Xerces.
Basic Test (Document/RDO)
IAB 2008
19
20
25
30
0
5
10
15
orders lineitem XPATH treebank_e SwissProt dblp XCDNA
Tim
e (Secs)
Xerces-DocOrder SDOM-DocOrder Xerces-ReverseDocOrder SDOM-ReverseDocOrder
Upward Path Enumeration
IAB 2008
20
100
120
0
20
40
60
80
orders lineitem XPATH treebank_e SwissProt dblp XCDNA
Tim
e (Secs)
Xerces SDOM
Full Test
IAB 2008
21
20
25
30
0
5
10
15
20
treebank_e SwissProt dblp XCDNA
Tim
e (Secs)
Xerces-TreeNav Xerces-NextNode SDOM-TreeNav SDOM-NextNode SDOM-CT-TreeNav
Text compression
libBzip2-blocks FMIndex
Files Text path-order doc-order doc-order
Orders 1MB 22% 30% 29%
IAB 2008
22
Lineitem 6MB 21% 31% 26%
XPATH 13MB 9% 13% 13%
Treebank 57MB 42% 45% 56%
SwissProt 49MB 17% 29% 20%
DBLP 71MB 28% 35% 30%
Difference between path-order and doc-order is fairly small for bzip2 (cf. Xmill).
PARSING
“Spike” caused by use of vector for accumulating Textual data as it is
IAB 2008
23
2,000
2,500
3,000
Memory usage (MB)
DOM parsers memory usage -XCDNA.xml (594MB)
Textual data as it is processed by SAX.
Easy fix for much “preprocessing” space usage.
-
500
1,000
1,500
Memory usage (MB)
Memory usage snap-shot
SDOM Xerces-c
Conclusion/Future Work
• Succinct DOM: very fast, reasonable compression, and robust performance. ▫ Incorporate into higher-level application
IAB 2008
24
▫ Incorporate into higher-level application (Xalan/Saxon/OpenOffice)? ▫ Optimise parsing further?▫ Succinctness + secondary storage?
• Commercialisation?:▫ Saxon £250 license▫ Intel high-performance XSLT library $500▫ IBM XML parsing hardware ~ $5k.