Sebastian Bitzer ([email protected])
Seminar Semistructured Data
University of Osnabrueck
May 2, 2003
XML
An introduction in relation to semistructured data
02.05.2003 XML 2
Overview
• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia

Overview
• Background / History
  – SGML
  – SGML, HTML and XML
  – World Wide Web Consortium
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia

Standard Generalized Markup Language (SGML)
• models information exclusively on the basis of its internal structure and its function
  → platform-independent storage of structured information
• standard: ISO 8879, published in 1986

SGML, HTML and XML
• HTML = SGML applied to the web (HTML is one special application of SGML)
• XML ⊂ SGML (XML is a simplified subset of SGML)

Why XML from SGML?
SGML:
– is exceedingly complex and difficult to understand
– is formally so complex that online applications have difficulty processing it in reasonable time
– has many properties that were not designed for use in network environments (remember that it is a standard from 1986)

World Wide Web Consortium
• Nov 1996: initial XML draft
• Dec 1997: XML 1.0 Proposed Recommendation
• Feb 1998: W3C Recommendation: Extensible Markup Language (XML) 1.0
• Oct 2000: XML 1.0, 2nd edition

Overview
• Background / History
• Basic syntax
  – Elements
  – Attributes
  – Well-formed XML documents
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia

Elements
• element = <tag> content </tag>
• <tag>, </tag> = markup (start tag and end tag)
• content = everything between the start tag and the end tag
• no predefined tags
• basic content (no markup) is treated as text: PCDATA (Parsed Character Data)
• abbreviation for empty elements: <tag />

Example
<personnel>
  <person>
    <name> John Cage </name>
    <function> Bearer </function>
  </person>
  <person>
    <name> Elaine Vassal </name>
    <function> chief secretary </function>
  </person>
  …
</personnel>
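A minimal sketch of how such a document could be read programmatically, using Python's standard xml.etree.ElementTree module. The constant PERSONNEL and the helper list_personnel are illustrative names, not part of the talk, and the elided "…" entries are left out.

```python
import xml.etree.ElementTree as ET

# The personnel example from the slide, without the elided entries.
PERSONNEL = """
<personnel>
  <person>
    <name> John Cage </name>
    <function> Bearer </function>
  </person>
  <person>
    <name> Elaine Vassal </name>
    <function> chief secretary </function>
  </person>
</personnel>
"""


def list_personnel(xml_text):
    """Return one (name, function) pair per <person> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for person in root.findall("person"):
        # PCDATA keeps the surrounding blanks, so strip them here.
        name = person.findtext("name", default="").strip()
        function = person.findtext("function", default="").strip()
        pairs.append((name, function))
    return pairs


if __name__ == "__main__":
    for name, function in list_personnel(PERSONNEL):
        print(name, "-", function)
```

Because there are no predefined tags, the code has to know the tag names "person", "name" and "function" in advance; nothing in the document itself announces them.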

Attributes
• sometimes called “property” in data models
• (name="value") pairs
• value is always a string (type CDATA by default; a DTD can declare a more restricted type such as NMTOKEN)
• allow building groups of elements
• ambiguity: should information be represented as an attribute or as an element?

Example
<personnel>
  <person sex="m">
    <name> John Cage </name>
    <function department="civil rights"> Bearer </function>
  </person>
  <person sex="f">
    <name> Elaine Vassal </name>
    <function department="admin"> chief secretary </function>
  </person>
  …
</personnel>
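As a rough illustration of how attributes allow building groups of elements, the following sketch groups the persons of the example by an attribute value. The helper group_by_attribute and the constant PERSONNEL_WITH_ATTRIBUTES are hypothetical names introduced here; the elided entries are omitted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The attribute example from the slide, without the elided entries.
PERSONNEL_WITH_ATTRIBUTES = """
<personnel>
  <person sex="m">
    <name> John Cage </name>
    <function department="civil rights"> Bearer </function>
  </person>
  <person sex="f">
    <name> Elaine Vassal </name>
    <function department="admin"> chief secretary </function>
  </person>
</personnel>
"""


def group_by_attribute(xml_text, tag, attribute):
    """Group the names of all <tag> elements by the value of one attribute."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for element in root.iter(tag):
        key = element.get(attribute)  # None if the attribute is absent
        groups[key].append(element.findtext("name", default="").strip())
    return dict(groups)


if __name__ == "__main__":
    print(group_by_attribute(PERSONNEL_WITH_ATTRIBUTES, "person", "sex"))
```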

Well-formed XML documents
• an XML document is well-formed if:
  – tags nest properly
    (not <t1><t2></t1></t2>)
  – attributes are unique within one element
    (not <tag att="a" att="b">)
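Both conditions can be checked with any conforming parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which raises ParseError for input that is not well-formed; is_well_formed is an illustrative helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<t1><t2></t2></t1>",        # proper nesting
        "<t1><t2></t1></t2>",        # tags overlap
        '<tag att="a" att="b" />',   # attribute repeated in one element
    ]
    for sample in samples:
        print(sample, is_well_formed(sample))
```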

Overview
• Background / History
• Basic syntax
• XML and semistructured data
  – Simple transformations
  – Differences that make transformation more difficult
  – Additional constructs
• Document type definitions
• Extensions for XML
• Paraphernalia

Simple transformations
with basic XML syntax (no attributes, tree as data structure):
• from XML to ssd:
  <person>
    <name> John Cage </name>
    <function> Bearer </function>
  </person>
  becomes
  {person : {name : "John Cage", function : "Bearer"}}
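The XML-to-ssd direction can be sketched in a few lines of Python. This is only an illustration under the slide's restrictions (no attributes, no mixed content); the names xml_to_ssd and format_ssd and the choice of (label, value) pairs as the in-memory form of an ssd expression are assumptions made here.

```python
import xml.etree.ElementTree as ET

PERSON = """
<person>
  <name> John Cage </name>
  <function> Bearer </function>
</person>
"""


def xml_to_ssd(element):
    """Map an element to a (label, value) pair.

    A leaf element becomes (tag, text); any other element becomes
    (tag, [pairs of its children]). Attributes are ignored.
    """
    children = list(element)
    if not children:
        return (element.tag, (element.text or "").strip())
    return (element.tag, [xml_to_ssd(child) for child in children])


def format_ssd(pair):
    """Render a (label, value) pair in the label : value notation."""
    label, value = pair
    if isinstance(value, str):
        return f'{label} : "{value}"'
    inner = ", ".join(format_ssd(child) for child in value)
    return f"{label} : {{{inner}}}"


if __name__ == "__main__":
    pair = xml_to_ssd(ET.fromstring(PERSON))
    print("{" + format_ssd(pair) + "}")
```

A list of pairs is used instead of a dictionary so that repeated labels (for example several person elements) are not lost.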

Simple transformations II
• from ssd to XML (transformation function T):
  T(atomic value) = atomic value
  T({l1 : v1, …, ln : vn}) =
    <l1> T(v1) </l1>
    …
    <ln> T(vn) </ln>
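The definition of T translates almost literally into a recursive function. The sketch below assumes that an ssd expression is held as a nested Python dict (so each label can occur only once per level, which is a simplification) and that atomic values are strings or numbers; escape from xml.sax.saxutils protects characters such as "<" in the text.

```python
from xml.sax.saxutils import escape


def T(value):
    """Transform an ssd expression (nested dicts) into XML text."""
    if isinstance(value, dict):
        # T({l1 : v1, ..., ln : vn}) = <l1> T(v1) </l1> ... <ln> T(vn) </ln>
        return "".join(f"<{label}>{T(v)}</{label}>" for label, v in value.items())
    # T(atomic value) = atomic value
    return escape(str(value))


if __name__ == "__main__":
    ssd = {"person": {"name": "John Cage", "function": "Bearer"}}
    print(T(ssd))
```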

Differences that make transformation more difficult
• different semantics of labels
• element or attribute
• order
• mixing elements and text

Semantics of labels
XML
• graphs with labels on nodes
ssd
• graphs with labels on edges
[Figure: the record person (name Alan, age 42, email ab@com) drawn twice, once as a graph with labels on nodes and once as a graph with labels on edges]
<person>
  <name>Alan</name>
  <age>42</age>
  <email>ab@com</email>
</person>
{person : {name : "Alan", age : 42, email : "ab@com"}}

Element or attribute
• ambiguity between representing information as an element or as an attribute
  → different possibilities of encoding
• in particular in combination with references:
  <a> <b id="&o123"> some string </b> </a>
  <a c="&o123" />
  or:
  <a b="&o123" />
  <a> <c id="&o123"> some string </c> </a>
[Figure: two nodes labelled a, with an edge b and an edge c leading to the same value "some string"]

Order
• the ssd model is based on unordered collections
• XML elements are ordered
• but: XML attributes are not
• unordered data can be processed more efficiently
  → data exchange applications can ignore the order of XML elements
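One way a data exchange application might ignore element order is to compare documents through an order-insensitive canonical form. The following is only a sketch: canonical and equal_ignoring_order are hypothetical helpers, and text that follows a child element (mixed content) is deliberately not considered.

```python
import xml.etree.ElementTree as ET


def canonical(element):
    """Build an order-insensitive description of an element.

    Attributes and child elements are sorted, so two documents that
    differ only in the order of siblings get the same description.
    """
    attributes = tuple(sorted(element.attrib.items()))
    text = (element.text or "").strip()
    children = tuple(sorted(canonical(child) for child in element))
    return (element.tag, attributes, text, children)


def equal_ignoring_order(xml_a, xml_b):
    """Compare two XML texts as unordered trees."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


if __name__ == "__main__":
    first = "<person><name>Alan</name><age>42</age></person>"
    second = "<person><age>42</age><name>Alan</name></person>"
    print(equal_ignoring_order(first, second))
```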

Mixing elements and text
• XML allows mixing of PCDATA and subelements:
<talk>XML - An introduction in relation to semistructured data
<speaker> Sebastian Bitzer </speaker>
</talk>
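Mixed content has no direct counterpart in a simple label/value model, so APIs need their own convention for it. As an illustration, Python's standard xml.etree.ElementTree keeps the text before the first child in the parent's text field and the text after a child in that child's tail field. The helper split_mixed_content is an illustrative name.

```python
import xml.etree.ElementTree as ET

TALK = """<talk>XML - An introduction in relation to semistructured data
<speaker> Sebastian Bitzer </speaker>
</talk>"""


def split_mixed_content(xml_text):
    """Return the root's own text and, per child, (tag, text, tail)."""
    root = ET.fromstring(xml_text)
    own_text = (root.text or "").strip()
    children = [
        (child.tag, (child.text or "").strip(), (child.tail or "").strip())
        for child in root
    ]
    return own_text, children


if __name__ == "__main__":
    print(split_mixed_content(TALK))
```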

Additional constructs in XML
• comments: <!-- comment -->
• processing instructions: <?application-name instruction-text ?>
• CDATA sections (for escaping): <![CDATA[ markup won’t be processed here ]]>
• entities, e.g. “&auml;” for “ä”; external files can also be declared as entities, e.g. a GIF file as “&pic-1;”

Overview
• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
  – DTDs as grammars
  – DTDs as schemas
  – Attributes
  – Valid XML documents
  – Limitations
• Extensions for XML
• Paraphernalia

DTDs as grammar
• a document type definition (DTD) serves as a grammar for the underlying XML document
• is precisely a context-free grammar (non-terminal → ordered list of one or more terminals and non-terminals)
• can be recursive

Definitions
DTD:
  <!DOCTYPE root-name [ element definitions ]>
element definitions:
  <!ELEMENT name ( content model )>
  …
content model:
  ordered list of the names of the elements that can occur inside the element being defined

Variations of content model
<!ELEMENT r1 (a?, b*, (c | d+))>
means that elements of type "r1" contain:
– 0 or 1 "a" ("a" is optional) and
– arbitrarily many "b" (0 - ∞) and
– either: exactly 1 "c"
  or: at least 1 "d"
groups can be built, too:
<!ELEMENT r2 ((a, b)+, c?)>
means: at least one sequence of "a" followed by "b" comes before the optional "c"
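Content models are regular expressions over element names, so the two models above can be tried out with an ordinary regular expression engine. The sketch below relies on an assumption that only holds for this example: every element name is a single letter, so the sequence of child names can be joined into one string. R1, R2 and matches are illustrative names.

```python
import re

# Content models rewritten as regular expressions over one-letter names:
#   r1: (a?, b*, (c | d+))   r2: ((a, b)+, c?)
R1 = re.compile(r"a?b*(c|d+)")
R2 = re.compile(r"(ab)+c?")


def matches(model, child_names):
    """Check whether a sequence of child element names fits a content model."""
    return model.fullmatch("".join(child_names)) is not None


if __name__ == "__main__":
    print(matches(R1, ["a", "b", "b", "c"]))  # optional a, two b, one c
    print(matches(R1, ["c", "d"]))            # c and d together are not allowed
    print(matches(R2, ["a", "b", "a", "b", "c"]))
```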

DTDs as Schemas
• DTD:
  <!DOCTYPE db [
    <!ELEMENT db ((r1,r2)*)>
    <!ELEMENT r1 ((a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a))>
    <!ELEMENT r2 ((c,d) | (d,c))>
    <!ELEMENT a (#PCDATA)>
    <!ELEMENT b (#PCDATA)>
    <!ELEMENT c (#PCDATA)>
    <!ELEMENT d (#PCDATA)>
  ]>
  can be seen as a representation of the relational schema
  r1(a,b,c), r2(c,d)

Declaring attributes
<!ATTLIST element-name
  attribute-name1 type1 spec1
  attribute-name2 type2 spec2
  … >
element-name: the element to which the attributes belong
type: often "CDATA", but also more restricted, e.g. "(m|f)" for male or female in the attribute "sex"
spec: #REQUIRED, #IMPLIED, #FIXED (followed by the fixed value) or a default value

Unique Identifiers
e.g.:
<!ATTLIST person
  id       ID     #REQUIRED
  mom      IDREF  #IMPLIED
  dad      IDREF  #IMPLIED
  children IDREFS #IMPLIED>
instance:
<person id="john" mom="jane" dad="james" children="jack jim">

Valid XML documents
• an XML document is valid if it:
  – is well-formed
  – additionally has a DTD
  – conforms to that DTD:
    • elements are nested only as described in the DTD
    • only attributes allowed by the DTD are used
    • all attributes of type ID have distinct values
    • all IDREF and IDREFS values refer to existing identifiers
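The last two conditions can be illustrated without a validating parser. The sketch below does not read a DTD; instead it assumes, as in the person example, that "id" is the ID attribute, "mom" and "dad" are IDREF attributes and "children" is an IDREFS attribute. The constants and check_references are hypothetical names, and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Attribute roles assumed from the person example (not read from a DTD).
ID_ATTRIBUTE = "id"
IDREF_ATTRIBUTES = ("mom", "dad")
IDREFS_ATTRIBUTES = ("children",)


def check_references(xml_text):
    """Return a list of ID/IDREF violations found in the document."""
    root = ET.fromstring(xml_text)
    errors = []
    seen = set()
    # First pass: all ID values must be distinct.
    for element in root.iter():
        identifier = element.get(ID_ATTRIBUTE)
        if identifier is None:
            continue
        if identifier in seen:
            errors.append(f"duplicate ID: {identifier}")
        seen.add(identifier)
    # Second pass: every reference must point to an existing ID.
    for element in root.iter():
        references = [element.get(name) for name in IDREF_ATTRIBUTES]
        for name in IDREFS_ATTRIBUTES:
            # IDREFS is a blank-separated list of references.
            references.extend((element.get(name) or "").split())
        for reference in references:
            if reference is not None and reference not in seen:
                errors.append(f"dangling reference: {reference}")
    return errors


if __name__ == "__main__":
    family = (
        '<family>'
        '<person id="john" mom="jane" children="jack"/>'
        '<person id="jane"/>'
        '<person id="jack" dad="john"/>'
        '</family>'
    )
    print(check_references(family))
```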

Limitations of DTDs as schemas (summarized)
• order: content models always fix an order of the elements
• only one atomic type (PCDATA, but no INT etc.)
• names are global (partial solution: namespaces)
• IDREFs are not constrained to a certain type (a “mother” reference should point to a “person”)

Overview
• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
  – DCD
  – Document navigation
• Paraphernalia

Document Content Definitions
• aim: making typing more precise
• seems to have been abandoned
• more recent approach: XML Schema, which must e.g.:
  – provide for primitive data typing, including byte, date, integer, sequence, SQL & Java primitive data types, etc.
  – allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of their properties
  – provide a mechanism for URI references to a standard semantic understanding of a construct
  – … (http://www.w3.org/TR/NOTE-xml-schema-req)

XLink & XPointer
• pointing to arbitrary positions in documents
• using IDs or relative position
• links can be defined externally to both source and target (files)

Overview
• Background / History
• Basic syntax
• XML and semistructured data
• Document type definitions
• Extensions for XML
• Paraphernalia
  – RDF
  – Stylesheets
  – SAX and DOM

Resource Description Framework
• for representing metadata
• consists of a data model and a syntax
• simple form: edge-labelled graph
• additionally:
  – containers (bag, sequence or alternative)
  – higher-order statements (“John says that …”)

Stylesheets
• specify the presentation of data
• Cascading Style Sheets (CSS):
  associate a presentation with each element type
• Extensible Stylesheet Language (XSL):
  specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary
  http://www.w3.org/Style/XSL/

SAX and DOM
• Application Programming Interfaces
• Simple API for XML (SAX):
  – de facto standard for event-based parsing
• Document Object Model (DOM):
  interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  – parses the whole document and builds a tree representation of it
  http://www.w3.org/DOM/
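The difference between the two APIs shows up clearly when the same task is solved with both. The sketch below uses the SAX and DOM implementations in Python's standard library (xml.sax and xml.dom.minidom); the sample document, the NameCollector class and the two helper functions are illustrative assumptions, not part of the talk.

```python
import xml.sax
from xml.dom import minidom

DOCUMENT = (
    "<personnel>"
    "<person><name>John Cage</name></person>"
    "<person><name>Elaine Vassal</name></person>"
    "</personnel>"
)


class NameCollector(xml.sax.ContentHandler):
    """SAX: react to parsing events; no tree is kept in memory."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside_name = False
        self._buffer = []

    def startElement(self, tag, attributes):
        if tag == "name":
            self._inside_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it.
        if self._inside_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buffer).strip())
            self._inside_name = False


def names_with_sax(xml_text):
    handler = NameCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


def names_with_dom(xml_text):
    """DOM: build the whole tree first, then navigate it."""
    document = minidom.parseString(xml_text)
    return [
        node.firstChild.data.strip()
        for node in document.getElementsByTagName("name")
    ]


if __name__ == "__main__":
    print(names_with_sax(DOCUMENT))
    print(names_with_dom(DOCUMENT))
```

SAX suits large documents that are read once from start to end; DOM is more convenient when the program needs to move around in the document or update it.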

Outlook
• Database issues:
  – How are we going to model XML? (graphs)
  – How are we going to query XML? (XML-QL)
  – How are we going to store XML? (in a relational database? object-oriented?)
  – How are we going to process XML efficiently? (uh… well..., um..., ah..., get some good grad students!)
Raghu Ramakrishnan, http://www.cs.wisc.edu/~cs784-1/handouts/intro-ssxml.ppt

References
• S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web. From relations to Semistructured Data and XML, Morgan Kaufmann Publishers, San Francisco 2000
• H. Lobin, Informationsmodellierung in XML und SGML, Berlin, Heidelberg, 2000
• World Wide Web Consortium, Extensible Markup Language (XML), http://www.w3.org/XML/