Efficient XML Interchange
XML
Why is XML good?
• A widely accepted standard for data representation
• Fairly simple format
• Flexible
It’s not used by everyone, but it’s used by enough people to make for a rich tools environment
It’s flexible enough to be used in lots of contexts
It’s text based and human readable, which makes it a good archival format
XML
XML in 10 points: http://www.w3.org/XML/1999/XML-in-10-Points
Includes (3) “XML is text, but isn’t meant to be read”, and (4) “XML is verbose by design”
XML can be read by humans (but should not have to be), and is not very compact
XML
These design principles also make it very difficult to use XML in some environments
• Wireless military links: low bandwidth
• Mobile devices: battery life limitations
• Processing efficiency: it can take CPU cycles to parse XML
• Data binding
Limitations
Lots of ships have 64 Kbit/sec at best. It is problematic to ship XML across these links
CPUs are on Moore’s law curve, but battery power is limited by the state of chemistry. We can’t assume that faster processors will save us. Lots of applications for hand held devices with limited battery power (cell phones, etc.)
Cell phones don’t necessarily have strong CPUs, so parsing XML can be expensive relative to other tasks
Data Binding
This is a more subtle problem.
<Point x="1.0" y="2.0"/>
How do you convert this to an object? You need to parse the string “1.0”, then convert it to a binary representation
It’s the difference between
string x;
and
float x;
Data Binding
Typically something comes in from the wire, and you have to do the Java equivalent of
Float.parseFloat("1.0");
This is expensive when working with numeric-heavy documents
It is much more efficient to keep the value x in a binary representation in the document, then simply read it on the receiving side
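The two routes can be compared in a minimal Java sketch. The class and method names are made up for illustration, and the 4-byte big-endian IEEE 754 layout is an assumption chosen for simplicity; it is not EXI's actual wire format for floats.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BindingSketch {
    // Text route: the value arrives as characters and has to be parsed.
    public static float fromText(byte[] wire) {
        String s = new String(wire, StandardCharsets.UTF_8);
        return Float.parseFloat(s);
    }

    // Binary route (assumed layout: 4 bytes, big-endian IEEE 754):
    // the value is read directly, with no string in between.
    public static float fromBinary(byte[] wire) {
        return ByteBuffer.wrap(wire).getFloat();
    }

    public static void main(String[] args) {
        byte[] text = "1.0".getBytes(StandardCharsets.UTF_8);
        byte[] binary = ByteBuffer.allocate(4).putFloat(1.0f).array();
        System.out.println(fromText(text));
        System.out.println(fromBinary(binary));
    }
}
```

Both calls yield the same float; the difference is the work done on the receiving side.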
Efficient XML Interchange
EXI relaxes some of the requirements of XML in order to be more compact, faster to parse, and have better data binding characteristics
• Relax the “human readable” requirement
• Allow binary data
What you get is an alternate encoding of the XML infoset that is more compact, faster to parse, and allows deployment in new environments that XML previously could not be deployed in
EXI
EXI is being developed by a W3C working group and is on a standards track. The hope is that this will become a W3C-blessed encoding of the XML infoset
Working group draft now working its way to approval.
Approval needs multiple implementations, the blessing of the W3C Technical Architecture Group, and approval by other W3C working groups (encryption, processors, etc.)
EXI
• Represents the same data as an XML document, only in a more efficient encoding
• Minimal impact on other XML technologies, such as encryption
• More efficient to parse, better data binding performance
EXI
http://www.w3.org/XML/EXI
Includes file format specification, primer on EXI, best practices
Note that one thing that is NOT specified is an API for accessing the data. This is an important and significant omission
Lack of a standardized typed API means we still have to go through string representations
Typed API
What is meant by a typed API?
DOM and SAX return string values:
Attr anAttribute;
…
// DOM returns a String attribute value here
String val = anAttribute.getValue();
And then we need to convert val into a float via
Float aFloat = Float.parseFloat(val);
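A self-contained version of that string route, as a sketch using the standard Java DOM classes. The class name DomStringRoute and the helper readX are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;

public class DomStringRoute {
    // Reads the "x" attribute of the root element and converts it to a float.
    public static float readX(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Attr anAttribute = doc.getDocumentElement().getAttributeNode("x");
            String val = anAttribute.getValue();  // DOM hands back a String
            return Float.parseFloat(val);         // the conversion is left to the caller
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read x from document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readX("<Point x=\"1.0\" y=\"2.0\"/>"));
    }
}
```

Even if the bytes on the wire were binary, an API shaped like this would force the value through a String first.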
Typed API
But what we often want is the value specified in the schema:
Float aFloat = anAttribute.getFloat();
There are proposals for a generalized typed API, but it is not part of this standard
EXI
EXI has several options to handle different situations.
• You have an XML document and a schema
• You have an XML document but no schema
• You have an XML document, and a schema that almost, but not quite, matches the document
Element and Attribute Names
Tag names take up a lot of space, and can be somewhat expensive to parse
<Name first="James" last="Madison"> <State>Virginia</State></Name>
Count up the characters used for markup here:
31/55 ~= 50-60% of file size for markup tags
If we replace the character tags with numeric stand-ins we can get much more compact, and it will be faster to parse
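A rough Java sketch of the numeric stand-in idea. This is a simplification for illustration only: the NameCodes class is hypothetical, and EXI itself assigns codes through its grammars rather than through a simple lookup table like this one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NameCodes {
    // Hypothetical table: element or attribute name -> small integer code.
    private final Map<String, Integer> codes = new LinkedHashMap<>();

    // Return the code already assigned to a name, or assign the next free one.
    public int codeFor(String name) {
        Integer code = codes.get(name);
        if (code == null) {
            code = codes.size();
            codes.put(name, code);
        }
        return code;
    }

    public int size() {
        return codes.size();
    }

    // Bits needed to tell `distinct` codes apart; assumes distinct >= 1.
    public static int bitsNeeded(int distinct) {
        return 32 - Integer.numberOfLeadingZeros(distinct - 1);
    }

    public static void main(String[] args) {
        NameCodes table = new NameCodes();
        for (String name : new String[] {"Name", "first", "last", "State"}) {
            System.out.println(name + " -> " + table.codeFor(name));
        }
        System.out.println("bits per name: " + bitsNeeded(table.size()));
    }
}
```

With the four names in the example, each name fits in 2 bits instead of 4 to 5 characters, and the parser compares integers rather than strings.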
Schema-Informed
If you have a schema, that gives you type information about the XML document. You know that <foo x="1.0"/> means that x is a float value rather than a string, because the schema tells you that.
That means you can store the “1.0” value in a binary format, which is generally more compact and has the potential to have better data binding with a typed API
Schemaless
What if you don’t have a schema? This means you can’t exploit type information. But EXI should support this situation, because it should be a general solution
EXI handles this by replacing repeating strings with a compact identifier
Schemaless
<Address town="Monterey" zip="93943"/>
The strings “Monterey” and the zip code are likely to be repeated many times in an XML document. We can create a table of these values, and then use the table ID rather than the whole string
String ID
Monterey 1
93940 2
San Jose 3
98842 5
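A minimal Java sketch of such a table. The ValueTable class and the "#" prefix used to mark an ID are assumptions for illustration; the string tables in the actual EXI format are more elaborate than this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ValueTable {
    private final Map<String, Integer> ids = new HashMap<>();

    // First sighting: remember the string and emit it in full.
    // Later sightings: emit only the table ID, marked with '#'.
    public String encode(String value) {
        Integer id = ids.get(value);
        if (id != null) {
            return "#" + id;
        }
        ids.put(value, ids.size() + 1);  // IDs start at 1, as in the table above
        return value;
    }

    public static void main(String[] args) {
        ValueTable table = new ValueTable();
        List<String> out = new ArrayList<>();
        for (String town : new String[] {"Monterey", "San Jose", "Monterey", "Monterey"}) {
            out.add(table.encode(town));
        }
        System.out.println(out);  // prints [Monterey, San Jose, #1, #1]
    }
}
```

The decoder can rebuild the same table as it reads, so the table itself never has to be sent.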
“Almost” Schemas
If you have a document that doesn’t quite match the schema, EXI can take a forgiving attitude. It uses the schema to encode the types it knows about, and uses strings and string table identifiers to handle the ones not described by the schema
Implementations
As of now there is one implementation of the draft spec, Efficient XML from Agile Delta (http://www.agiledelta.com)
Other open source projects underway, and some commercial projects
The standards process requires that multiple independent implementations be available before the standard is approved
Results
Example: Distributed Interactive Simulation (DIS) is an IEEE standard for modeling and simulation. It is a binary standard that contains (x,y,z), velocity, acceleration, and other numeric-heavy data
We did an XML representation of the binary DIS standard
Results
            DIS Binary (bytes)   DIS XML     EXI Format
1 PDU       144                  1167        129
1000 PDUs   464,480              3,924,680   365,564
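The ratios behind the table can be worked out with a few lines of Java. The SizeRatios class is only a convenience for this arithmetic; the byte counts are the ones reported above.

```java
public class SizeRatios {
    // Size relative to a baseline: below 1.0 means smaller than the baseline.
    public static double ratio(long size, long baseline) {
        return (double) size / baseline;
    }

    public static void main(String[] args) {
        // 1 PDU: binary 144, XML 1167, EXI 129
        System.out.println("1 PDU     XML/binary: " + ratio(1167, 144));
        System.out.println("1 PDU     EXI/binary: " + ratio(129, 144));
        // 1000 PDUs: binary 464,480, XML 3,924,680, EXI 365,564
        System.out.println("1000 PDUs XML/binary: " + ratio(3_924_680, 464_480));
        System.out.println("1000 PDUs EXI/binary: " + ratio(365_564, 464_480));
    }
}
```

Text XML comes out at a little over 8 times the binary size in both rows, while EXI is roughly 10% smaller than binary for one PDU and roughly 21% smaller for 1000 PDUs.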
Results
• Somewhat smaller than the original binary format. The exact size varies with the numeric data, while the original binary format is always the same size. EXI seems to be consistently smaller, though
• AND it is marked up in a way that makes it equivalent to an XML file. This means we can easily access all the tools of the XML ecosystem by simply converting it to a text XML representation
Conclusions
Replace all text XML with EXI? No! EXI is intended to expand the use of XML into use cases that XML could not service. XML mostly does fine in its existing environment
EXI can be used to XML-ify existing binary protocols and get slightly better performance with greatly increased interoperability (no one knows DIS binary, everyone knows XML)
Next great frontier: typed XML APIs