Post on 19-Dec-2015
The XML Trial: FINDING of FACTS
Arnaud Sahuguet, Chief Inquisitor
Penn Database Research Group
Preliminary Remarks
What is XML used for
• Messages (XML-RPC)
• Text content (HTML, WML)
• Data content (FinXML, BioML)
• Documents (DocBook)
• Component serialization (Java Beans)
Everything!
XML applications have different properties/requirements
• Order vs. no order
• Notion of “equivalence” between documents
• Nested vs. flat
• Structure vs. semi-structure
XML and DTDs are 2 distinct issues
• XML does not need DTDs (well-formedness)
• The structure of an XML document can be described using other representations
– grammars
– schemas
• Questioning DTDs does not mean questioning XML itself
• XML is just a mark-up after all
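To make the well-formedness point concrete, here is a minimal sketch in Python showing that a parser can accept or reject a document with no DTD involved. It uses the standard `xml.etree.ElementTree` module; the two sample documents are invented for the illustration and do not come from the survey.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents, made up for this example.
GOOD = "<folder><title>Links</title><bookmark href='http://example.org'/></folder>"
BAD = "<folder><title>Links</folder>"  # <title> is never closed

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Validity against a DTD is a separate, stronger check that this sketch does not attempt.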
DTDs
What is a DTD[ISO 8879]
A document type definition specifies:
• the generic identifiers (GIs) of elements that are permissible in a document of this type
• for each GI, the possible attributes, their range of values and defaults
• for each GI, the structure of its contents, including:
– which elements can occur and in what order
– whether text characters can occur
– whether non-character data can occur
• The purpose of a DTD is to make it possible to determine whether the mark-up for an individual document is correct, and also to supply mark-up that is missing because it can be inferred unambiguously from other mark-up present.
What is a DTD (cont’d)
• A DTD contains
– element declarations
– attribute declarations
– general entity declarations
– parameter entity declarations
– notations
– processing instructions (<? … ?>)
• Elements are defined according to a content model built from the operators ? + * , |
• Attributes can be CDATA, NMTOKEN, etc.
• Attributes can be optional (#IMPLIED), mandatory (#REQUIRED) or fixed to a constant value (#FIXED)
• Mixed content corresponds to (#PCDATA|xxx)*
• Notations are a way to describe the content of an entity reference (e.g. a jpg picture)
What is a DTD (cont’d)
<!ELEMENT title (#PCDATA)>
<!ELEMENT info (metadata+)>
<!ELEMENT metadata EMPTY>
<!ATTLIST metadata owner CDATA #REQUIRED>
<!ELEMENT folder (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST folder %node.att; folded (yes|no) #FIXED 'yes'>
<!ELEMENT bookmark (title?, info?, desc?)>
<!ATTLIST bookmark %node.att; %url.att;>
<!ELEMENT desc (#PCDATA)>
<!ELEMENT separator EMPTY>
<!ELEMENT alias EMPTY>
<!ATTLIST alias ref IDREF #REQUIRED>
<!NOTATION jpg PUBLIC '-//JPG'>
<!ENTITY folder SYSTEM "folder.jpg" NDATA jpg>
<!ENTITY bookmark SYSTEM "bookmark.jpg" NDATA jpg>
<!ENTITY % local.node.att "">
<!ENTITY % local.url.att "">
<!ENTITY % local.nodes.mix "">
<!ENTITY % node.att
  "id ID #IMPLIED
   added CDATA #IMPLIED
   %local.node.att;">
<!ENTITY % url.att
  "href CDATA #REQUIRED
   visited CDATA #IMPLIED
   modified CDATA #IMPLIED
   %local.url.att;">
<!ENTITY % nodes.mix
  "bookmark|folder|alias|separator
   %local.nodes.mix;">
<!ELEMENT xbel (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST xbel %node.att; version CDATA #FIXED "1.0">
What is the role of a DTD
• Constrain structure (SCHEMA)
• Declare entities (MODULARITY)
• Provide some default values for attributes
XML vs SGML
• No tag omission
• No exceptions
• Restrictions on mixed content
• No AND (&) operator
• No distinction between CDATA and RCDATA
• In SGML, 39 types of attributes
How are DTDs being used
Methodology of the survey
• Harvesting
– xml.org
• Cleansing
– missing elements, typos, etc.
• Normalization
– expansion of entities
– translation into our internal data-model
• “Mining”
• Visualization
• Data Model
– Node = list of Node, list of Attribute
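The "expansion of entities" step can be pictured as plain textual substitution of parameter entities. The sketch below is an assumption about what such a normalization might look like, not the tool actually used in the survey; the function name `expand_parameter_entities`, the regular expressions and the sample DTD fragment are all hypothetical. It handles only internal parameter entities declared with double quotes, and it follows the XML 1.0 rule that the first declaration of an entity is the binding one.

```python
import re

# Matches declarations such as: <!ENTITY % name "replacement text">
PARAM_ENTITY = re.compile(r'<!ENTITY\s+%\s+([\w.\-]+)\s*"([^"]*)"\s*>')
# Matches references such as: %name;
PARAM_REF = re.compile(r'%([\w.\-]+);')

def expand_parameter_entities(dtd: str, max_passes: int = 10) -> str:
    """Textually replace %name; references by their declared values."""
    entities = {}
    for name, value in PARAM_ENTITY.findall(dtd):
        entities.setdefault(name, value)  # first declaration is binding
    body = PARAM_ENTITY.sub("", dtd)      # drop the declarations themselves
    for _ in range(max_passes):           # entities may refer to other entities
        expanded = PARAM_REF.sub(
            lambda m: entities.get(m.group(1), m.group(0)), body)
        if expanded == body:
            break
        body = expanded
    return body.strip()

# Hypothetical fragment in the style of the bookmark DTD shown earlier.
SAMPLE = '''
<!ENTITY % local.nodes.mix "">
<!ENTITY % nodes.mix "bookmark|folder|alias|separator %local.nodes.mix;">
<!ELEMENT folder (title?, (%nodes.mix;)*)>
'''

print(expand_parameter_entities(SAMPLE))
```

References to undeclared entities are left untouched, which makes them easy to spot during the cleansing step.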
Issues that have not been looked at
• Are DTDs being used
– well-formed vs. valid documents
• How do documents actually use DTDs
– what is the meaning of * or +
Most DTDs are not correct
• missing elements
• wrong syntax
• incompatible attribute declarations
If the DTD itself is not correct, how can we hope to validate any document?
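One of these defects, an element that is referenced in a content model but never declared, is easy to detect mechanically. The following is a rough sketch under simplifying assumptions: parameter entities are already expanded, and the regular expressions (hypothetical, written for this example) are a crude approximation of the real DTD grammar.

```python
import re

# <!ELEMENT name content-model>
ELEMENT_DECL = re.compile(r'<!ELEMENT\s+([\w.\-:]+)\s+([^>]*)>')
NAME = re.compile(r'[A-Za-z_][\w.\-:]*')
KEYWORDS = {"EMPTY", "ANY", "PCDATA"}  # not element names

def undeclared_elements(dtd: str) -> set:
    """Return element names used in content models but never declared."""
    declared, referenced = set(), set()
    for name, model in ELEMENT_DECL.findall(dtd):
        declared.add(name)
        referenced.update(n for n in NAME.findall(model) if n not in KEYWORDS)
    return referenced - declared

# Hypothetical fragment: <info> is referenced but has no declaration.
SAMPLE = '''
<!ELEMENT bookmark (title?, info?, desc?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
'''

print(undeclared_elements(SAMPLE))
```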
DTDs are not always a connected graph
• The root element is defined only at the level of the document, not in the DTD.
• Use of ANY
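A DTD can be viewed as a graph whose nodes are element declarations and whose edges go from an element to the elements named in its content model. The sketch below builds that graph and lists the elements that no other element refers to, i.e. the possible roots. It is an illustration only: the helper names are hypothetical, the regular expressions approximate the DTD grammar, parameter entities are assumed to be expanded, and ANY is modeled as an edge to every declared element.

```python
import re

ELEMENT_DECL = re.compile(r'<!ELEMENT\s+([\w.\-:]+)\s+([^>]*)>')
NAME = re.compile(r'[A-Za-z_][\w.\-:]*')
KEYWORDS = {"EMPTY", "PCDATA"}

def reference_graph(dtd):
    """Map each declared element to the set of elements it may contain."""
    decls = ELEMENT_DECL.findall(dtd)
    declared = {name for name, _ in decls}
    graph = {}
    for name, model in decls:
        if model.strip() == "ANY":
            graph[name] = set(declared)  # ANY allows every declared element
        else:
            graph[name] = {n for n in NAME.findall(model) if n not in KEYWORDS}
    return graph

def candidate_roots(graph):
    """Elements that no *other* element refers to."""
    referenced = {c for parent, kids in graph.items() for c in kids if c != parent}
    return set(graph) - referenced

def reachable(graph, root):
    """All elements reachable from `root` (depth-first traversal)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

# Hypothetical DTD with a disconnected element.
SAMPLE = '''
<!ELEMENT xbel (folder*)>
<!ELEMENT folder (title?, folder*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT orphan (#PCDATA)>
'''

graph = reference_graph(SAMPLE)
print(sorted(candidate_roots(graph)))
print(sorted(reachable(graph, "xbel")))
```

More than one candidate root, or declared elements unreachable from the intended root, indicates a DTD that is not a single connected graph.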
Encoding of tuples
• Given the absence of the AND content-model, most DTDs represent tuple <a,b,c> as (a|b|c)*.
• The correct syntax would be:
– SGML: ( a & b & c )
– XML: (a,b,c) | (a,c,b) | (b,c,a) | (b,a,c) | (c,b,a) | (c,a,b)
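The XML encoding of an unordered tuple is simply the list of all orderings of its members. A short sketch (the function name `and_group_as_xml` is made up for this example) shows how quickly it grows:

```python
from itertools import permutations

def and_group_as_xml(names):
    """Expand the SGML group (a & b & c) into an XML choice of sequences."""
    return " | ".join("(" + ",".join(p) + ")" for p in permutations(names))

print(and_group_as_xml(["a", "b", "c"]))
# Number of alternatives for a tuple of 5 fields: 5! = 120
print(len(list(permutations("abcde"))))
```

With n fields the model has n! alternatives, which helps explain why DTD authors fall back on the much looser (a|b|c)*.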
“|” is used and overused
Encoding Inheritance
• Parameter entities are used to capture “syntactic” inheritance
Some features are almost never used
• Attribute types like NMTOKEN
• IDREF
• Notations
• Processing instructions
Global comments
• DTDs are full of mistakes
• DTDs -- when expanded -- are really messy
• People tend to pick a “type” much larger than they really need
What’s wrong with DTDs
• too document-oriented
– DTDs have been designed to interface with text processing tools
• too simple and too complicated at the same time
• too limited to represent complex structures
• IDREFs are not typed
• No notion of record
• No notion of inheritance/sub-typing
• Content-model ambiguous
• too many ways to represent the same thing
• names are global, not local
• no obvious way to offer versioning, extension, evolution
Improvements
Analogy XML/ProgLang
ProgLang            XML
===========================================
type-checking       validation
constants           entity reference
macros              entity parameter
void*               ANY
void*               IDREF
header file         DTD
#ifdef              conditional section
standard library    key entities
namespace           namespace
It is interesting to remark that features like inheritance, type inference, polymorphism or modules are missing
Analogy XML/ProgLang (cont’d)
XML    Functional             Object Oriented
=========================================================
|      variant, union type    abstract class
,      record with order      ordered inheritance
&      record                 inheritance
?      option                 null
+/*    list                   list
Have a look at Phil Wadler’s talk.
Immediate Solutions
• Remove ANY
• DTD = single rooted connected graph
• Support for “&”
• Need for DTD validators
• Forget about:
– notations
– conditional sections
– ID IDREF (as they are)
I thought I was too drastic, but on the xml-dev mailing list there is an even more brutal proposal (XML 2.0 alpha, by Rick Jelliffe).
What is the future of DTDs
Family Tree of Schema Languages for Markup Languages (Rick Jelliffe © 1999)
What should DTDs be used for
• validation
• efficient XML storage (persistency extension, or database storage)
• optimization of path-expression queries
• documentation
• design of DTD extensions (to resolve shortcomings)
• efficient parsing
• design of supporting tools
How DTDs could be used
• Like software components/libraries
– import, export
– inheritance
– over-riding
• Is there a need for a DTD/schema repository?
Major Challenges
• Combining text and data processing in a unified framework
– Structured query vs. document query
– Markup algebra
• Doing “versioning” the right way
– subtyping
– need for backward/forward compatibility
My message
• XML needs some modern PL tools/features
• XML and Java
– Java is far from perfect but it established some features like gc, threads, distributed computation as sine qua non requirements of a programming language
– XML should try to do the same for text/data processing
• XML is not an abstract thing. People are using it and we should keep that in mind.
On-going research
XML algebra
• ECFG (Franck Neven)
• XML model (Phil Wadler)
• Semi-structured schema (Beeri/Milo)
• Deterministic Data Model (Penn)
• Union-types (Peter, Benjamin)
• Mark-up Algebra
– WebL
– Algebra for querying Text Regions (Consens/Milo)
– Nested Text-Region Algebra (Jaakola, Kilpeläinen) + sgrep
XML semantics
• Path constraints (Jérôme Siméon, Wenfei Fan)
• XSLT and XPath (Phil Wadler)
• Extending DB constraints for Codi to XML (Penn)
• F-Logic
From XML to XML query languages
• XPath and XSLT (Phil Wadler)
• XSLT (Franck Neven)
• XML-QL
• UnQL
Misc.
Mapping from OO to XML (POQL/INRIA)
• Class Person tuple(name: string, age: integer, spouse: Person)
• <!ELEMENT person (name, age, person?)>
<!ATTLIST person id ID #IMPLIED
                 spouse IDREF #REQUIRED>
<person id="p1" spouse="p2">
  <name>Vassilis</name>
  <age>32</age>
  <person id="p2" spouse="p1">
    <name>Irène</name>
    <age>29</age>
  </person>
</person>

<person id="p1" spouse="p2">
  <name>Vassilis</name>
  <age>32</age>
</person>
<person id="p2" spouse="p1">
  <name>Irène</name>
  <age>29</age>
</person>
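The flat variant of this mapping, where the spouse is an ID/IDREF pair rather than a nested element, can be sketched in a few lines of Python. This is an illustration of the idea only, not the POQL/INRIA mapping itself: the `Person` dataclass, the `to_flat_xml` function and the wrapping `people` element are assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    id: str
    name: str
    age: int
    spouse: Optional["Person"] = None  # object reference, like spouse: Person

def to_flat_xml(people):
    """Serialize persons side by side; the object reference becomes an IDREF."""
    root = ET.Element("people")  # hypothetical wrapper so the output has one root
    for p in people:
        attrs = {"id": p.id}
        if p.spouse is not None:
            attrs["spouse"] = p.spouse.id
        node = ET.SubElement(root, "person", attrs)
        ET.SubElement(node, "name").text = p.name
        ET.SubElement(node, "age").text = str(p.age)
    return root

vassilis = Person("p1", "Vassilis", 32)
irene = Person("p2", "Irène", 29, spouse=vassilis)
vassilis.spouse = irene  # the cycle is harmless: only ids are written out

root = to_flat_xml([vassilis, irene])
print(ET.tostring(root, encoding="unicode"))
```

The cyclic object graph is what makes the nested encoding awkward: one of the two persons must be chosen as the container, whereas the IDREF encoding treats both symmetrically but leaves the reference untyped.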
Schematron (Rick Jelliffe)
• Idea: encoding structure using tree constraints
• Not based on grammars but on tree patterns
• Semantics
– find a context node in the document
– check for constraints (i.e. XPath expressions)
• Features
– in the spirit of XSL (patterns, rules)
– based on XPath
• Benefits
– a “schema” specification can be more or less refined
– supports variations of the schema (versions, etc.)
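The two-step semantics (find a context node, then check constraints on it) can be mimicked with a toy evaluator. The sketch below is not Schematron and does not implement XPath: it assumes that a rule's context is a bare element name and that a test is either `@attr` (attribute present) or a `|`-separated list of child element names. The function names `holds` and `validate` and the sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

def holds(node, test):
    """Evaluate a tiny subset of tests: '@name' or 'child1 | child2'."""
    for alt in (part.strip() for part in test.split("|")):
        if alt.startswith("@"):
            if alt[1:] in node.attrib:
                return True
        elif node.find(alt) is not None:
            return True
    return False

def validate(schema, document):
    """Collect messages of failed <assert>s and of <report>s that fire."""
    messages = []
    for rule in schema.iter("rule"):
        # Step 1: find the context nodes in the document.
        for node in document.iter(rule.get("context")):
            # Step 2: check each constraint against the context node.
            for check in rule:
                ok = holds(node, check.get("test"))
                if (check.tag == "assert" and not ok) or \
                   (check.tag == "report" and ok):
                    messages.append((check.text or "").strip())
    return messages

# Hypothetical schema: one pattern with a single rule.
SCHEMA = ET.fromstring("""
<schema>
  <pattern name="demo">
    <rule context="pattern">
      <assert test="rule">A pattern element should contain at least one rule element.</assert>
      <assert test="@name">A pattern element should have an attribute called name.</assert>
    </rule>
  </pattern>
</schema>""")

# Hypothetical document: its <pattern> has a rule but no name attribute.
DOC = ET.fromstring("<schema><pattern><rule context='x'/></pattern></schema>")

print(validate(SCHEMA, DOC))
```

Because each rule is independent, a schema written this way can be made stricter or looser by adding or removing rules, which is the "more or less refined" benefit listed above.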
Example
<!-- +//IDN sinica.edu.tw//DTD Schematron 1.0a//EN -->
<!ELEMENT schema ( title?, pattern+ )>
<!ELEMENT assert ( #PCDATA )>
<!ELEMENT pattern ( rule+ )>
<!ELEMENT report ( #PCDATA )>
<!ELEMENT rule ( assert | report )+>
<!ELEMENT title ( #PCDATA )>
<!ATTLIST schema ns CDATA #IMPLIED >
<!ATTLIST assert test CDATA #REQUIRED >
<!ATTLIST pattern name CDATA #REQUIRED see CDATA #IMPLIED >
<!ATTLIST report test CDATA #REQUIRED >
<!ATTLIST rule context CDATA #REQUIRED >
<schema>
  <title>Demonstration Patterns for the Schematron Itself</title>
  <pattern name="The Open Schematron DTD 1.0">
    <rule context="schema">
      <assert test="pattern">A schema element should contain at least one pattern elements.</assert>
    </rule>
    <rule context="pattern">
      <assert test="rule">A pattern element should contain at least one rule elements.</assert>
      <assert test="@name">A pattern element should have an attribute called name.</assert>
    </rule>
    <rule context="rule">
      <assert test="assert | report ">A rule element should contain at least one assert or report elements.</assert>
      <assert test="@context">A rule element should have an attribute called context. This should be an XPath for selecting nodes to make assertions and reports about.</assert>
    </rule>
    <rule context="assert">
      <assert test="@test">An assert element should have an attribute called test. This should be an XSLT expression.</assert>
    </rule>
    <rule context="report">
      <assert test="@test">A report element should have an attribute called test. This should be an XSLT expression.</assert>
    </rule>
  </pattern>
</schema>
Example (cont’d)
<schema>
  <pattern name="The Closed Schematron DTD 1.0a">
    <rule context="schema">
      <assert test="count(*) = count(pattern | title)">Unexpected element(s) found: a schema element should contain only pattern elements.</assert>
      <assert test="pattern">A schema element should contain at least one pattern element.</assert>
      <report test="phase">The element phase is only used in the 1.2 DTD</report>
    </rule>
    <rule context="pattern">
      <assert test="count(*) = count(rule)">Unexpected element(s) found: A pattern element should contain only rule elements.</assert>
      <assert test="rule">A pattern element should contain at least one rule elements.</assert>
      <assert test="@name">A pattern element should have an attribute called name.</assert>
    </rule>
    <rule context="rule">
      <assert test="count(*) = count(assert | report ) ">Unexpected element(s) found: a rule element should contain only assert and report elements.</assert>
      <assert test="assert | report ">A rule element should contain at least one assert or report elements.</assert>
      <assert test="@context">A rule element should have an attribute called context. This should be an XPath for selecting nodes to make assertions and reports about.</assert>
      <report test="key">The element key is only used in the 1.2 DTD</report>
    </rule>
    <rule context="assert">
      <assert test="@test">An assert element should have an attribute called test. This should be an XSLT expression.</assert>
      <report test="name">The element name is only used in the 1.1 DTD</report>
    </rule>
    <rule context="report">
      <assert test="@test">A report element should have an attribute called test. This should be an XSLT expression.</assert>
      <report test="name">The element name is only used in the 1.1 DTD</report>
    </rule>
  </pattern>
</schema>
Anonymous Content Types (Rick Jelliffe)
• EMPTY and ANY are built-in content types
• What about offering some new ones
• SINGLE (only 1 child, no data element)
• PAIR (two children of different types, no data element)
• PAIRS (multiple of PAIR)
• SAME (zero or more elements of the same type)
• LEAF (1 empty sub-element or PCDATA)
• LEAVES (multiple of LEAF)
• UNIQUE (any number of elements; one type appears once)
• NONRECURSIVE (element cannot contain itself)
Looks like polymorphism to me :-)
A new data-model for XML...
• Node, NodeList
• Only “&”, “+” and “?” are allowed
<?xml version="1.0" ?>
<!DOCTYPE box [
<!ELEMENT box ( box )* >
<!ATTLIST box
  id ID #REQUIRED
  length-breadth-width NMTOKENS #REQUIRED
  units NMTOKEN #REQUIRED >
]>
<box id="b1" length-breadth-width="3 5 8" units="cm">
  <box id="b2" />
</box>
…and what you can do with it
• Some attributes always go together
– ( id & ( unit & length-breadth-width)?)
• IDREFS can be modeled as ( terminal-node & Element )
– household IDREF can be specified as (mother? & father? & child* & grandparents* & grandchild* & unmarried-sibling* & refugee* & pet* & ghost*)