the xml trial: finding of facts arnaud sahuguet, chief inquisitor penn database research group

The XML Trial: Finding of Facts

Arnaud Sahuguet, Chief Inquisitor
Penn Database Research Group

Preliminary Remarks

What is XML used for

• Messages (XML-RPC)
• Text content (HTML, WML)
• Data content (FinXML, BioML)
• Documents (DocBook)
• Component serialization (Java Beans)

Everything!

XML applications have different properties/requirements

• Order vs. no order
• Notion of “equivalence” between documents
• Nested vs. flat
• Structure vs. semi-structure

XML and DTDs are 2 distinct issues

• XML does not need DTDs (well-formedness)
• The structure of an XML document can be described using other representations
– grammars
– schemas
• Questioning DTDs does not mean questioning XML itself
• XML is just a mark-up after all
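
To make the well-formedness point concrete, here is a minimal sketch in Python. It assumes the standard library's non-validating `xml.etree.ElementTree` parser, and the helper name `is_well_formed` is hypothetical. It only shows that a document can be checked with no DTD in sight.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A document with no DOCTYPE declaration at all still parses.
print(is_well_formed("<folder><title>Research</title></folder>"))
# Mismatched tags violate well-formedness, with or without a DTD.
print(is_well_formed("<folder><title>Research</folder></title>"))
```

Validity against a DTD is a separate, stronger check that this parser does not perform.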

What is a DTD [ISO 8879]

A document type definition specifies:
• the generic identifiers (GIs) of elements that are permissible in a document of this type
• for each GI, the possible attributes, their range of values and defaults
• for each GI, the structure of its contents, including:
– which elements can occur and in what order
– whether text characters can occur
– whether non-character data can occur

• The purpose of a DTD is to make it possible to determine whether the mark-up for an individual document is correct, and also to supply mark-up that is missing because it can be inferred unambiguously from other mark-up present.

What is a DTD (cont’d)

• A DTD contains
– element declarations
– attribute declarations
– general entity declarations
– parameter entity declarations
– notations
– processing instructions (<? … ?>)

• Elements are defined according to a content model [? + * , |]
• Attributes can be CDATA, NMTOKEN, ID, IDREF, etc.
• Attributes can be optional (#IMPLIED), mandatory (#REQUIRED), or fixed to a constant value (#FIXED)
• Mixed content corresponds to (#PCDATA|xxx)*
• Notations are a way to describe the content of an entity reference (e.g. a JPEG picture)

What is a DTD (cont’d)

<!ENTITY % local.node.att "">
<!ENTITY % local.url.att "">
<!ENTITY % local.nodes.mix "">

<!ENTITY % node.att
  "id ID #IMPLIED
   added CDATA #IMPLIED
   %local.node.att;">

<!ENTITY % url.att
  "href CDATA #REQUIRED
   visited CDATA #IMPLIED
   modified CDATA #IMPLIED
   %local.url.att;">

<!ENTITY % nodes.mix
  "bookmark|folder|alias|separator
   %local.nodes.mix;">

<!NOTATION jpg PUBLIC '-//JPG'>

<!ENTITY folder SYSTEM "folder.jpg" NDATA jpg>
<!ENTITY bookmark SYSTEM "bookmark.jpg" NDATA jpg>

<!ELEMENT xbel (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST xbel %node.att; version CDATA #FIXED "1.0">

<!ELEMENT title (#PCDATA)>

<!ELEMENT info (metadata+)>

<!ELEMENT metadata EMPTY>
<!ATTLIST metadata owner CDATA #REQUIRED>

<!ELEMENT folder (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST folder %node.att; folded (yes|no) #FIXED 'yes' >

<!ELEMENT bookmark (title?, info?, desc?)>
<!ATTLIST bookmark %node.att; %url.att;>

<!ELEMENT desc (#PCDATA)>

<!ELEMENT separator EMPTY>

<!ELEMENT alias EMPTY>
<!ATTLIST alias ref IDREF #REQUIRED>

What is the role of a DTD

• Constrain structure: SCHEMA
• Declare entities: MODULARITY
• Provide some default values for attributes

XML vs SGML

• No tag omission
• No exceptions
• Restriction on mixed content
• No AND (&) operator
• No distinction between CDATA and RCDATA
• In SGML, 39 types of attributes

How are DTDs being used

Methodology of the survey

• Harvesting
– xml.org
• Cleansing
– missing elements, typos, etc.
• Normalization
– expansion of entities
– translation into our internal data-model
• “Mining”
• Visualization

• Data Model
– Node = list of Node, list of Attribute

Issues that have not been looked at

• Are DTDs being used
– well-formed vs valid documents
• How do documents actually use DTDs
– what is the meaning of * or +

Most DTDs are not correct

• missing elements
• wrong syntax
• incompatible attribute declarations

If the DTD itself is not correct, how can we hope to validate any document?
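
A check for the "missing elements" kind of error could look like the sketch below. It is not the tool used in the survey. It assumes parameter entities have already been expanded (the normalization step), it uses regular expressions rather than a real DTD parser, and the function names are hypothetical.

```python
import re

# Matches <!ELEMENT name content-model>; assumes entities are already expanded.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+([^>]*)>")
NAME = re.compile(r"[A-Za-z_][\w.:-]*")
KEYWORDS = {"EMPTY", "ANY", "PCDATA"}

def content_model_graph(dtd_text):
    """Map each declared element to the element names used in its content model."""
    graph = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        graph[name] = set(NAME.findall(model)) - KEYWORDS
    return graph

def undeclared_elements(dtd_text):
    """Return names that appear in some content model but are never declared."""
    graph = content_model_graph(dtd_text)
    referenced = set().union(*graph.values())
    return referenced - set(graph)

dtd = """
<!ELEMENT bookmark (title?, info?, desc?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
"""
print(undeclared_elements(dtd))  # info is used but never declared
```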

DTDs are not always a connected graph

• The root is defined only at the level of the document (in its DOCTYPE declaration), not in the DTD itself.

• Use of ANY
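
Connectivity can be tested with a plain graph traversal once a root is chosen. The sketch below assumes the DTD has already been reduced to a dictionary from element name to the set of names in its content model. That representation and the function names are illustrative, not the survey's internal data model. An element declared ANY would have to be mapped to every declared name, which this sketch leaves to the caller.

```python
def reachable(graph, root):
    """Return the set of element names reachable from root via content models."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(graph.get(name, ()))
    return seen

def disconnected(graph, root):
    """Return declared elements that no document rooted at root can ever contain."""
    return set(graph) - reachable(graph, root)

graph = {
    "xbel": {"title", "folder", "bookmark"},
    "folder": {"title", "folder", "bookmark"},
    "bookmark": {"title"},
    "title": set(),
    "orphan": {"title"},
}
print(disconnected(graph, "xbel"))
```

Running the same check with each declared element as a candidate root shows whether any single root connects the whole DTD.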

Encoding of tuples

• Given the absence of the AND content-model, most DTDs represent tuple <a,b,c> as (a|b|c)*.

• The correct syntax would be:
– SGML: ( a & b & c )
– XML: (a,b,c) | (a,c,b) | (b,c,a) | (b,a,c) | (c,b,a) | (c,a,b)
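
The XML expansion above grows factorially with the number of fields, which helps explain why authors fall back on (a|b|c)*. A small sketch, with a hypothetical function name, that generates the expansion for any list of element names:

```python
from itertools import permutations

def unordered_tuple_model(names):
    """Spell out SGML's (a & b & ...) as an XML choice over every ordering."""
    alternatives = ["(" + ",".join(p) + ")" for p in permutations(names)]
    return " | ".join(alternatives)

print(unordered_tuple_model(["a", "b", "c"]))
# Number of alternatives for a five-field record: 5! orderings.
print(len(list(permutations(["a", "b", "c", "d", "e"]))))
```

Note that several alternatives begin with the same element, so the literal expansion also runs into the content-model ambiguity raised later in these slides.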

“|” is used and overused

Encoding Inheritance

• Parameter entities are used to capture “syntactic” inheritance

Some features are almost never used

• Attribute types like NMTOKEN
• IDREF
• Notations
• Processing instructions

Global comments

• DTDs are full of mistakes
• DTDs -- when expanded -- are really messy
• People tend to pick a “type” much larger than they really need

What’s wrong with DTDs

• too document-oriented
– DTDs have been designed to interface with text processing tools
• too simple and too complicated at the same time
• too limited to represent complex structures
• IDREFs are not typed
• No notion of record
• No notion of inheritance/sub-typing
• Content models can be ambiguous
• too many ways to represent the same thing
• names are global, not local
• no obvious way to offer versioning, extension, evolution

Improvements

Analogy XML/ProgLang

Programming language    XML
type-checking           validation
constants               entity reference
macros                  parameter entity
void*                   ANY
void*                   IDREF
header file             DTD
#ifdef                  conditional section
standard library        key entities
namespace               namespace

It is interesting to remark that features like inheritance, type inference, polymorphism or modules are missing.

Analogy XML/ProgLang (cont’d)

XML     Functional             Object Oriented
=================================================
|       variant, union type    abstract class
,       record with order      ordered inheritance
&       record                 inheritance
?       option                 null
+/*     list                   list

Have a look at Phil Wadler’s talk.

Immediate Solutions

• Remove ANY
• DTD = single rooted connected graph
• Support for “&”
• Need for DTD validators
• Forget about:
– notations
– conditional sections
– ID IDREF (as they are)

I thought I was too drastic, but on the xml-dev mailing list there is an even more brutal proposal (XML 2.0 alpha, by Rick Jelliffe).

What is the future of DTDs

What should DTDs be used for

• validation
• efficient XML storage (persistency extension, or database storage)
• optimization of path-expression queries
• documentation
• design of DTD extensions (to resolve shortcomings)
• efficient parsing
• design of supporting tools

How DTDs could be used

• Like software components/libraries
– import, export
– inheritance
– over-riding

• Is there a need for a DTD/schema repository?

Major Challenges

• Combining text and data processing in a unified framework
– Structured query vs document query
– Markup algebra

• Doing “versioning” the right way
– subtyping
– need for backward/forward compatibility

My message

• XML needs some modern PL tools/features

• XML and Java
– Java is far from perfect but it established some features like gc, threads, distributed computation as sine qua non requirements of a programming language
– XML should try to do the same for text/data processing

• XML is not an abstract thing. People are using it and we should keep that in mind.

On-going research

XML algebra

• ECFG (Frank Neven)
• XML model (Phil Wadler)
• Semi-structured schema (Beeri/Milo)
• Deterministic Data Model (Penn)
• Union-types (Peter, Benjamin)

• Mark-up Algebra
– WebL
– Algebra for querying Text Regions (Consens/Milo)
– Nested Text-Region Algebra (Jaakkola, Kilpeläinen) + sgrep

XML semantics

• Path constraints (Jérôme Siméon, Wenfei Fan)
• XSLT and XPath (Phil Wadler)
• Extending DB constraints for Codi to XML (Penn)
• F-Logic

From XML to XML query languages

• XPath and XSLT (Phil Wadler)
• XSLT (Frank Neven)
• XML-QL
• UnQL

Mapping from OO to XML (POQL/INRIA)

• Class Person tuple(name: string, age: integer, spouse: Person)

• <!ELEMENT person (name, age, person?)>
  <!ATTLIST person id ID #IMPLIED
                   spouse IDREF #REQUIRED>

<person id="p1" spouse="p2">
  <name>Vassilis</name>
  <age>32</age>
  <person id="p2" spouse="p1">
    <name>Irène</name>
    <age>29</age>
  </person>
</person>

<person id="p1" spouse="p2">
  <name>Vassilis</name>
  <age>32</age>
</person>
<person id="p2" spouse="p1">
  <name>Irène</name>
  <age>29</age>
</person>
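
The flat representation can be produced mechanically from objects. The sketch below is not the POQL/INRIA mapping itself. It is an illustration in Python in which the `Person` class, the `to_xml` function, the generated ids `p1`, `p2`, ... and the wrapping `people` element are all assumptions. It also assumes every spouse appears in the list being serialized.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # identity-based equality, since spouses reference each other
class Person:
    name: str
    age: int
    spouse: Optional["Person"] = None

def to_xml(people):
    """Serialize people as sibling <person> elements linked by id/spouse attributes."""
    ids = {id(p): f"p{i}" for i, p in enumerate(people, start=1)}
    root = ET.Element("people")  # hypothetical wrapper so the output has one root
    for p in people:
        attrs = {"id": ids[id(p)]}
        if p.spouse is not None:
            attrs["spouse"] = ids[id(p.spouse)]  # object reference becomes an IDREF
        elem = ET.SubElement(root, "person", attrs)
        ET.SubElement(elem, "name").text = p.name
        ET.SubElement(elem, "age").text = str(p.age)
    return root

vassilis = Person("Vassilis", 32)
irene = Person("Irène", 29, spouse=vassilis)
vassilis.spouse = irene
print(ET.tostring(to_xml([vassilis, irene]), encoding="unicode"))
```

The type information is lost on the way: nothing in the output says that `spouse` must point at a person, which is the "IDREFs are not typed" complaint made earlier.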

Schematron (Rick Jelliffe)

• Idea: encoding structure using tree constraints
• Not based on grammars but on tree patterns
• Semantics
– find a context node in the document
– check for constraints (i.e. XPath expressions)

• Features
– in the spirit of XSL (patterns, rules)
– based on XPath

• Benefits
– a “schema” specification can be more or less refined
– supports variations of the schema (versions, etc.)
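
The two-step semantics (find context nodes, then check constraints) can be sketched in a few lines. This is not Schematron's implementation. It is a toy evaluator in Python that assumes rules are given as (context, test, message) triples and supports only two kinds of test, `@attr` and a child element name, instead of full XPath. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def holds(node, test):
    """Evaluate a tiny subset of tests: '@attr' or the name of a child element."""
    if test.startswith("@"):
        return test[1:] in node.attrib
    return node.find(test) is not None

def check(document, rules):
    """Return the messages of all assertions that fail somewhere in the document."""
    failures = []
    root = ET.fromstring(document)
    for context, test, message in rules:
        for node in root.iter(context):  # step 1: find the context nodes
            if not holds(node, test):    # step 2: check the constraint
                failures.append(message)
    return failures

rules = [
    ("schema", "pattern", "A schema element should contain at least one pattern."),
    ("pattern", "rule", "A pattern element should contain at least one rule."),
    ("pattern", "@name", "A pattern element should have an attribute called name."),
]
doc = '<schema><pattern name="p"><rule/></pattern><pattern/></schema>'
for message in check(doc, rules):
    print(message)
```

Dropping or adding triples refines the "schema" without touching any grammar, which is the flexibility listed under Benefits.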

Example

<!ELEMENT schema ( title?, pattern+ )>
<!ELEMENT assert ( #PCDATA )>
<!ELEMENT pattern ( rule+ )>
<!ELEMENT report ( #PCDATA )>
<!ELEMENT rule ( assert | report )+>
<!ELEMENT title ( #PCDATA )>
<!ATTLIST schema ns CDATA #IMPLIED >
<!ATTLIST assert test CDATA #REQUIRED >
<!ATTLIST pattern
  name CDATA #REQUIRED
  see CDATA #IMPLIED >
<!ATTLIST report test CDATA #REQUIRED >
<!ATTLIST rule context CDATA #REQUIRED >

<schema>
 <title>Demonstration Patterns for the Schematron Itself</title>
 <pattern name="The Open Schematron DTD 1.0">
  <rule context="schema">
   <assert test="pattern">A schema element should contain at least one pattern elements.</assert>
  </rule>
  <rule context="pattern">
   <assert test="rule">A pattern element should contain at least one rule elements.</assert>
   <assert test="@name">A pattern element should have an attribute called name.</assert>
  </rule>
  <rule context="rule">
   <assert test="assert | report ">A rule element should contain at least one assert or report elements.</assert>
   <assert test="@context">A rule element should have an attribute called context. This should be an XPath for selecting nodes to make assertions and reports about.</assert>
  </rule>
  <rule context="assert">
   <assert test="@test">An assert element should have an attribute called test. This should be an XSLT expression.</assert>
  </rule>
  <rule context="report">
   <assert test="@test">A report element should have an attribute called test. This should be an XSLT expression.</assert>
  </rule>
 </pattern>
</schema>

Example (cont’d)

<schema>
 <pattern name="The Closed Schematron DTD 1.0a">
  <rule context="schema">
   <assert test="count(*) = count(pattern | title)">Unexpected element(s) found: a schema element should contain only pattern elements.</assert>
   <assert test="pattern">A schema element should contain at least one pattern element.</assert>
   <report test="phase">The element phase is only used in the 1.2 DTD</report>
  </rule>
  <rule context="pattern">
   <assert test="count(*) = count(rule)">Unexpected element(s) found: A pattern element should contain only rule elements.</assert>
   <assert test="rule">A pattern element should contain at least one rule elements.</assert>
   <assert test="@name">A pattern element should have an attribute called name.</assert>
  </rule>
  <rule context="rule">
   <assert test="count(*) = count(assert | report ) ">Unexpected element(s) found: a rule element should contain only assert and report elements.</assert>
   <assert test="assert | report ">A rule element should contain at least one assert or report elements.</assert>
   <assert test="@context">A rule element should have an attribute called context. This should be an XPath for selecting nodes to make assertions and reports about.</assert>
   <report test="key">The element key is only used in the 1.2 DTD</report>
  </rule>
  <rule context="assert">
   <assert test="@test">An assert element should have an attribute called test. This should be an XSLT expression.</assert>
   <report test="name">The element name is only used in the 1.1 DTD</report>
  </rule>
  <rule context="report">
   <assert test="@test">A report element should have an attribute called test. This should be an XSLT expression.</assert>
   <report test="name">The element name is only used in the 1.1 DTD</report>
  </rule>
 </pattern>
</schema>

Anonymous Content Types (Rick Jelliffe)

• EMPTY and ANY are built-in content types
• What about offering some new ones

• SINGLE (only 1 child, no data element)
• PAIR (two children of different types, no data element)
• PAIRS (multiple of PAIR)
• SAME (zero or more elements of the same type)
• LEAF (1 empty sub-element or PCDATA)
• LEAVES (multiple of LEAF)
• UNIQUE (any number of elements; one type appears once)
• NONRECURSIVE (element cannot contain itself)

Looks like polymorphism to me :-)
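
A few of these content types can be read as predicates over an element, whatever its children are called. The sketch below is one possible reading in Python, not part of the proposal. It checks instances rather than declarations, treats "no data element" as "no non-whitespace text", and reads NONRECURSIVE as "no descendant with the same name". The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def has_text(elem):
    """True if the element directly carries non-whitespace character data."""
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)

def is_single(elem):
    """SINGLE: exactly one child element and no character data."""
    return len(elem) == 1 and not has_text(elem)

def is_same(elem):
    """SAME: zero or more children, all of the same element type."""
    return len({child.tag for child in elem}) <= 1

def is_nonrecursive(elem):
    """NONRECURSIVE: no descendant shares the element's own name."""
    return all(d is elem or d.tag != elem.tag for d in elem.iter())

box = ET.fromstring("<box><box/></box>")
print(is_single(box), is_same(box), is_nonrecursive(box))
```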

A new data-model for XML...

• Node, NodeList
• Only “&”, “+” and “?” are allowed

<?xml version="1.0" ?>
<!DOCTYPE box [
<!ELEMENT box ( box )* >
<!ATTLIST box
  id ID #REQUIRED
  length-breadth-width NMTOKENS #REQUIRED
  units NMTOKEN #REQUIRED >
]>
<box id="b1" length-breadth-width="3 5 8" units="cm">
  <box id="b2" />
</box>

…and what you can do with it

• Some attributes always go together
– ( id & ( units & length-breadth-width )? )

• IDREFS can be modeled as ( terminal-node & Element )
– household IDREF can be specified as (mother? & father? & child* & grandparents* & grandchild* & unmarried-sibling* & refugee* & pet* & ghost*)
