sdpl 20112: xml basics1 2 basics of xml and xml documents survivor's guide to xml, or xml for...

41
SDPL 2011 2: XML Basics 1 2 Basics of XML and XML 2 Basics of XML and XML documents documents Survivor's Guide to XML, or Survivor's Guide to XML, or XML for Computer Scientists / XML for Computer Scientists / Dummies Dummies 2.1 XML and XML documents 2.1 XML and XML documents 2.2 Basics of XML DTDs 2.2 Basics of XML DTDs 2.3 XML Namespaces 2.3 XML Namespaces

Upload: sharlene-king

Post on 16-Jan-2016

243 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 1

2 Basics of XML and XML documents2 Basics of XML and XML documents

Survivor's Guide to XML, orSurvivor's Guide to XML, orXML for Computer Scientists / Dummies XML for Computer Scientists / Dummies

2.1 XML and XML documents2.1 XML and XML documents2.2 Basics of XML DTDs2.2 Basics of XML DTDs2.3 XML Namespaces2.3 XML Namespaces

Page 2: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 2

2.1 XML and XML documents2.1 XML and XML documents

XML 1.0 - Extensible Markup Language,XML 1.0 - Extensible Markup Language,W3C Recommendation, February 1998W3C Recommendation, February 1998– not an official standard, but a stable industry standardnot an official standard, but a stable industry standard– 22ndnd Ed 2000, 3 Ed 2000, 3rdrd Ed 2004, 4 Ed 2004, 4thth Ed 2006, 5 Ed 2006, 5thth Ed Nov 2008 Ed Nov 2008

» editorial revisions, editorial revisions, notnot new versions of XML 1.0 new versions of XML 1.0

a simplified subset of SGML, Standard a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987Generalized Markup Language, ISO 8879:1987– what is said later about what is said later about validvalid XML documents applies to XML documents applies to

SGML documents, tooSGML documents, too

Page 3: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 3

What is XML?What is XML?

ExtensibleExtensible Markup Language Markup Language is is notnot a markup a markup language! language! – does not fix a tag set nor its semantics does not fix a tag set nor its semantics

(unlike markup languages, e.g. HTML)(unlike markup languages, e.g. HTML)

XML documents have XML documents have no inherent no inherent (processing or (processing or presentation) presentation) semanticssemantics– even though many think that XML is semantic or self-even though many think that XML is semantic or self-

describing; See nextdescribing; See next

Page 4: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 4

Semantics of XML MarkupSemantics of XML Markup

Meaning of this XML fragment?Meaning of this XML fragment?

– The computer cannot know it either!The computer cannot know it either!– Implementing the semantics is the topic of this courseImplementing the semantics is the topic of this course

Page 5: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 5

What is XML (2)?What is XML (2)?

XML XML isis a a(A) way to use markup to represent information(A) way to use markup to represent information

(B) (B) metalanguagemetalanguage » supports definition of specific markup languages through XML supports definition of specific markup languages through XML

DTDs or SchemasDTDs or Schemas» E.g. XHTML a reformulation of HTML using XMLE.g. XHTML a reformulation of HTML using XML

(C) Often “XML” (C) Often “XML” XML + XML technology XML + XML technology– that is, processing models and languages we’re that is, processing models and languages we’re

studying (and many others ...)studying (and many others ...)

Page 6: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 6

How does it look?How does it look?

<?xml version=’1.0’ encoding=”iso-8859-1” ?><?xml version=’1.0’ encoding=”iso-8859-1” ?>

<invoice num=”1234”><invoice num=”1234”>

<client clNum=”00-01”><client clNum=”00-01”> <name>Pekka Kilpeläinen</name> <name>Pekka Kilpeläinen</name>

<email>[email protected]</email><email>[email protected]</email>

</client></client>

<item price=”60” unit=”EUR”><item price=”60” unit=”EUR”>XML Handbook</item> XML Handbook</item>

<item price=”350” unit=”FIM”><item price=”350” unit=”FIM”>XSLT Programmer’s Ref</item>XSLT Programmer’s Ref</item>

</invoice></invoice>

Page 7: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 7

Essential Features of XMLEssential Features of XML

Overview of XML essentialsOverview of XML essentials– many details skippedmany details skipped

» some to be discussed in exercises or with other some to be discussed in exercises or with other topics when the need arisestopics when the need arises

– Learn to consult original sources Learn to consult original sources (specifications, documentation etc) for details!(specifications, documentation etc) for details!» The XML specification is easy to browseThe XML specification is easy to browse

Page 8: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 8

XML Document CharactersXML Document Characters

XML documents are made of ISO-10646 (32-bit) XML documents are made of ISO-10646 (32-bit) characterscharacters; in practice of 16-bit Unicode chars (cf. Java); in practice of 16-bit Unicode chars (cf. Java)

Three aspects or charactersThree aspects or characters (see next):(see next):

representationrepresentation by bytes/octetsby bytes/octets

numeric numeric code code pointspoints

visual visual presentationpresentation

979798989999......

Page 9: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 9

External Aspects of CharactersExternal Aspects of Characters

Documents are stored/transmitted as a sequence Documents are stored/transmitted as a sequence of bytes (or of bytes (or octetsoctets). An ). An encodingencoding determines how determines how characters are characters are representedrepresented by bytes. by bytes.– UTF-8 (UTF-8 ( 7-bit ASCII) is the XML default encoding 7-bit ASCII) is the XML default encoding– encoding="iso-8859-1"encoding="iso-8859-1"

~>~> 256 Western European chars as single bytes 256 Western European chars as single bytes

A A fontfont (collection of character images called (collection of character images called glyphsglyphs) determines the ) determines the visual presentation visual presentation of of characterscharacters

Page 10: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 10

XML Encoding of Structure 1XML Encoding of Structure 1

XML is, essentially, just a textual encoding XML is, essentially, just a textual encoding scheme of scheme of labelledlabelled, , orderedordered and and attributedattributed treestrees, in which, in which– internal nodes are internal nodes are elementselements labelled by type names labelled by type names– leaves are leaves are text nodestext nodes labelled by string values, or labelled by string values, or

empty element nodesempty element nodes– the left-to-right order of children of a node mattersthe left-to-right order of children of a node matters– element nodes may carry element nodes may carry attributesattributes

(= name-string-value pairs)(= name-string-value pairs)

Page 11: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 11

XML Encoding of Structure 2XML Encoding of Structure 2

XML encoding of a treeXML encoding of a tree– corresponds to a pre-order walkcorresponds to a pre-order walk– start of an element node with type name A start of an element node with type name A

denoted by a denoted by a start tagstart tag <A>, and its end <A>, and its end denoted by denoted by end tagend tag </A> </A>

– possible attributes written within the start tag:possible attributes written within the start tag:<A attr<A attr11=“value=“value11” … attr” … attrnn=“value=“valuenn”>”>

» names must be unique: attrnames must be unique: attrkk attrattrh h when k when k h h

– text nodes written as their string valuetext nodes written as their string value

Page 12: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 12

XML Encoding of Structure: XML Encoding of Structure: ExampleExample

<S><S>

SS

EE

<W><W>

WW

HelloHello

HelloHello</W></W>

WW

world!world!

</W></W><W><W> world!world! </S></S><E A=‘1’/><E A=‘1’/>

A=A=11

Page 13: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 13

XML: Logical Document StructureXML: Logical Document Structure

ElementsElements – indicated by matching (case-sensitive!) tagsindicated by matching (case-sensitive!) tags<ElementTypeName> <ElementTypeName> … … </ElementTypeName></ElementTypeName>

– can contain text and/or subelementscan contain text and/or subelements– can be can be emptyempty::

<elem-type></elem-type><elem-type></elem-type> or or <elem-type/><elem-type/> (e.g. (e.g. <br/><br/> in in

XHTML)XHTML)– unique root element unique root element document a single tree document a single tree

Page 14: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 14

Logical document structure (2)Logical document structure (2)

AttributesAttributes– name-value pairs attached to elementsname-value pairs attached to elements– ‘‘‘‘metadata’’, usually not treated as contentmetadata’’, usually not treated as content– in start-tag after the element type namein start-tag after the element type name

<div class="preface" date='990126'><div class="preface" date='990126'> … …

Also:Also:– <!--<!-- commentscomments outsideoutside other markup other markup -->-->– <?PI value of <?PI value of processing instructionprocessing instruction ‘PI’?> ‘PI’?>

Page 15: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 15

CDATA SectionsCDATA Sections

““CDATA Sections” to include XML markup characters as CDATA Sections” to include XML markup characters as textual contenttextual content

<![CDATA[<![CDATA[ Here we can include for example code Here we can include for example code fragments: fragments: <example>if (Count < 5 && Count > 0) <example>if (Count < 5 && Count > 0) </example> </example> ]]>]]>

Page 16: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 16

Two levels of correctness (1)Two levels of correctness (1)

Well-formedWell-formed documents documents – roughly: follows the syntax of XML,roughly: follows the syntax of XML,

markup correct (elements properly nested, tag markup correct (elements properly nested, tag names match, attributes of an element have names match, attributes of an element have unique names, ...)unique names, ...)

– violation is a fatal errorviolation is a fatal error ValidValid documentsdocuments

– (in addition to being well-formed) (in addition to being well-formed) obey an associated grammar (DTD/Schema)obey an associated grammar (DTD/Schema)

Page 17: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 17

XML docs and valid XML docsXML docs and valid XML docs

XML documents = well-formed XML documentsXML documents = well-formed XML documents

DTD-valid documentsDTD-valid documents Schema-valid documentsSchema-valid documents

Page 18: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 18

An XML Processor (Parser)An XML Processor (Parser)

Reads XML documents and reports their contents Reads XML documents and reports their contents to an application to an application – relieves the application from details of markup relieves the application from details of markup – XML Recommendation specifies, partly, the behaviour XML Recommendation specifies, partly, the behaviour

of XML processors: of XML processors: – recognition of characters as markup or data; what recognition of characters as markup or data; what

information to pass to applications; information to pass to applications; how to check the correctness of documents; how to check the correctness of documents;

– validation based on comparing document against its validation based on comparing document against its grammar grammar

Page 19: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 19

2. 2 Basics of XML DTDs2. 2 Basics of XML DTDs

A A Document Type DeclarationDocument Type Declaration provides a grammar provides a grammar ((document type definitiondocument type definition,, DTD DTD) for a class of ) for a class of documents [Defined in XML Rec]documents [Defined in XML Rec]

Syntax (in the prolog of a document instance):Syntax (in the prolog of a document instance):<!<!DOCTYPEDOCTYPE rootElemType rootElemType SYSTEMSYSTEM "ex.dtd" "ex.dtd"<!-- "<!-- "external subsetexternal subset" in file ex.dtd --> " in file ex.dtd --> [[ <!-- " <!-- "internal subsetinternal subset" may come here --> " may come here --> ]]>>

DTD = union of the external and internal subset DTD = union of the external and internal subset (could be empty); internal has higher precedence(could be empty); internal has higher precedence– can override entity and attribute declarations (see next)can override entity and attribute declarations (see next)

Page 20: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 20

Markup DeclarationsMarkup Declarations

DTD consists of DTD consists of markup declarationsmarkup declarations – element type declarationselement type declarations

» similar to productions of ECFGssimilar to productions of ECFGs

– attribute-list declarationsattribute-list declarations » for declared element typesfor declared element types

– entity declarationsentity declarations» short-hand notations and physical storage unitsshort-hand notations and physical storage units

– notation declarationsnotation declarations » information about external (binary) objectsinformation about external (binary) objects

Page 21: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 21

How do Declarations Look Like?How do Declarations Look Like?

<!ELEMENT invoice (client, item+)><!ELEMENT invoice (client, item+)>

<!ATTLIST invoice num NMTOKEN #REQUIRED><!ATTLIST invoice num NMTOKEN #REQUIRED>

<!ELEMENT client (name, email?)> <!ELEMENT client (name, email?)>

<!ATTLIST client num NMTOKEN #IMPLIED><!ATTLIST client num NMTOKEN #IMPLIED>

<!ELEMENT name (#PCDATA)> <!ELEMENT name (#PCDATA)>

<!ELEMENT email (#PCDATA)> <!ELEMENT email (#PCDATA)>

<!ELEMENT item (#PCDATA)><!ELEMENT item (#PCDATA)>

<!ATTLIST item <!ATTLIST item

priceprice NMTOKEN #REQUIREDNMTOKEN #REQUIRED

unit (FIM | EUR) ”EUR” >unit (FIM | EUR) ”EUR” >

string of string of name characters;;(Letter | [0-9] | (Letter | [0-9] |

. | - | _ | : ). | - | _ | : )++

Page 22: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 22

Element type declarationsElement type declarations

The general form isThe general form is<!ELEMENT<!ELEMENT elemType elemType ((EE)>)>

where where EE is a is a content modelcontent model (regular expr.); (regular expr.); (≈ ECFG production (≈ ECFG production elemType elemType E ) E )

Content model operators:Content model operators:E | F : alternationE | F : alternation EE,, F: concatenation F: concatenationE? : optionalE? : optional E* : zero or moreE* : zero or moreE+ : one or moreE+ : one or more (E) : grouping(E) : grouping

No prioritiesNo priorities: either (A,B)|C or A,(B|C), but : either (A,B)|C or A,(B|C), but nono A,B|C A,B|C

Page 23: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 23

Attribute-List DeclarationsAttribute-List Declarations

Can declare attributes for elements:Can declare attributes for elements:– Name, data type and possible default value:Name, data type and possible default value: <!<!ATTLIST FIGATTLIST FIG

idid IDID #IMPLIED#IMPLIEDdescr CDATA #REQUIREDdescr CDATA #REQUIREDclass (a | b | c) class (a | b | c) "a">"a">

Semantics mainly up to the applicationSemantics mainly up to the application– parser checks that parser checks that IDID attributes are unique and that targets attributes are unique and that targets

of of IDREFIDREF and and IDREFSIDREFS attributes exist attributes exist

Page 24: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 24

Mixed, Empty and Arbitrary ContentMixed, Empty and Arbitrary Content

Mixed contentMixed content::<!ELEMENT P<!ELEMENT P (#PCDATA | I | IMG)*>(#PCDATA | I | IMG)*>

– may contain text (may contain text (#PCDATA#PCDATA) and elements) and elements Empty contentEmpty content::

<!ELEMENT IMG <!ELEMENT IMG EMPTYEMPTY>>

Arbitrary content:Arbitrary content:<!ELEMENT HTML <!ELEMENT HTML ANYANY>>

(= (= <!ELEMENT HTML<!ELEMENT HTML (#PCDATA |(#PCDATA | choice-of-all-declared-element-typeschoice-of-all-declared-element-types)*>)*>))

Page 25: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 25

Entities (1)Entities (1)

Named storage units or XML fragments (Named storage units or XML fragments (~~ macros macros in some languages)in some languages)– character entitiescharacter entities::

» &lt;&lt; &#60;&#60; and and &#x3C;&#x3C; all expand to ‘ all expand to ‘<<‘‘(treated as data, not as start-of-markup)(treated as data, not as start-of-markup)

» other other predefined entitiespredefined entities: : &amp; &gt; &apos; &quote;&amp; &gt; &apos; &quote;

forfor & & > > ' ' ""

– general entitiesgeneral entities are shorthand notations: are shorthand notations:<!<!ENTITYENTITY CS “<it>School</it> of Computing"> CS “<it>School</it> of Computing">

Page 26: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 26

Entities (2)Entities (2)

physical storage units comprising a documentphysical storage units comprising a document– parsed entitiesparsed entities

<!ENTITY chap1 SYSTEM <!ENTITY chap1 SYSTEM "http://myweb/ch1">"http://myweb/ch1">

– document entity document entity is the starting point of processingis the starting point of processing– entities and elements must nest properly:entities and elements must nest properly:

<!DOCTYPE doc [<!DOCTYPE doc [<!ENTITY chap1 <!ENTITY chap1

((… as above …)… as above …) > ]> ]<doc><doc>

&chap1;&chap1;</doc></doc>

<sec num="1"><sec num="1">… …

</sec></sec><sec num="2"><sec num="2">

… … </sec></sec>

Page 27: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 27

Unparsed EntitiesUnparsed Entities

For connecting external binary objects to XML For connecting external binary objects to XML documents; (XML processor handles only text)documents; (XML processor handles only text)

<!NOTATION TIFF "bin/gimp"><!NOTATION TIFF "bin/gimp"><!ENTITY fig123 SYSTEM "figs/f123.tif" <!ENTITY fig123 SYSTEM "figs/f123.tif"

NDATANDATA TIFF> TIFF><!ATTLIST IMG<!ATTLIST IMG

filefile ENTITYENTITY #REQUIRED>#REQUIRED>

Usage:Usage: <IMG file="fig123"/><IMG file="fig123"/>– parser provides notation and address to the applicationparser provides notation and address to the application– an SGML legacy technique(?)an SGML legacy technique(?)

Page 28: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 28

An Alternative WayAn Alternative Way

I have rarely used unparsed entities or notationsI have rarely used unparsed entities or notations Easier "HTML-style" linking: Easier "HTML-style" linking:

<IMG type="TIFF" file="figs/fig123"/><IMG type="TIFF" file="figs/fig123"/>

– the link indicates the format and the address directlythe link indicates the format and the address directly– maintenance of numerous entities might be easier with maintenance of numerous entities might be easier with

(indirect references through) explicit declarations(indirect references through) explicit declarations

Page 29: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 29

Parameter EntitiesParameter Entities

to parameterize and modularize DTDs:to parameterize and modularize DTDs:

<!ENTITY <!ENTITY %% tableDTD SYSTEM "dtds/tab.dtd"> tableDTD SYSTEM "dtds/tab.dtd">%%tableDTD; <!-- include sub-dtd -->tableDTD; <!-- include sub-dtd -->

<!ENTITY <!ENTITY %% stAttr "ready (yes | no) 'no'"> stAttr "ready (yes | no) 'no'"><!ATTLIST CHAP <!ATTLIST CHAP %%stAttr; >stAttr; ><!ATTLIST SECT <!ATTLIST SECT %%stAttr; >stAttr; >

(Parameter entities inside a markup declaration allowed only in (Parameter entities inside a markup declaration allowed only in externalexternal DTD subset) DTD subset)

Page 30: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 30

Speculations about XML ParsingSpeculations about XML Parsing

Parsing involves two things:Parsing involves two things:1. Pulling the entities together, and checking the well-1. Pulling the entities together, and checking the well-formednessformedness2. Building a parse tree for the input (a'la DOM), 2. Building a parse tree for the input (a'la DOM), or or otherwise informing the application about document otherwise informing the application about document content and structure (e.g. a'la SAX)content and structure (e.g. a'la SAX)

Task 2 is simple (Task 2 is simple ( simplicity of XML markup; see simplicity of XML markup; see next) next)

Checking well-formedness is straightforward; Checking well-formedness is straightforward; Implementing validation is a bit more challengingImplementing validation is a bit more challenging

Page 31: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 31

Building an XML Parse TreeBuilding an XML Parse Tree

<S><S>

SS

<W><W>

WW

HelloHello

HelloHello

</W></W> <E><E>

EE

<W><W>

WW

world!world!

world!world!

</W></W> </S></S></E></E>

Page 32: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

Simplified XML Tree Building AlgorithmSimplified XML Tree Building Algorithm

C ← C ← newnew Document(); // current node Document(); // current node

Scan document text Scan document text untiluntil at end: at end:

case case of scanned text: of scanned text:

""<Type><Type>":": N ← N ← newnew ElementNode( ElementNode(TypeType); );

C.addChildNode(N); C.addChildNode(N);

C ← N;C ← N;

""TxtTxt":": C.addChildNode( C.addChildNode(newnew TextNode( TextNode(TxtTxt));));

""</Type></Type>":": C ← C.parentNode();C ← C.parentNode();

returnreturn C; C;

SDPL 2011 2: XML Basics 32

Uses DOM like Uses DOM like tree operationstree operations

Page 33: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 33

Sketching XML validationSketching XML validation

Treat the document as a tree Treat the document as a tree d d Document is valid w.r.t. a grammar Document is valid w.r.t. a grammar

(DTD/Schema) G iff (DTD/Schema) G iff dd is a syntax tree over G is a syntax tree over G– Check that the root is labelled by the start symbol Check that the root is labelled by the start symbol

of Gof G– For each element node For each element node nn of the tree, check that of the tree, check that

itsits» attributes match with those of the element typeattributes match with those of the element type» content matches the content model of its type: If content matches the content model of its type: If nn is of is of

type A and its children of type Btype A and its children of type B11, …, B, …, Bnn, check that the , check that the grammar has a production A grammar has a production A E for which E for which

B B11 … B … Bn n L(E) L(E) (1)(1)

Page 34: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 34

Sketching the validation… Sketching the validation… (2)(2)

How to check condition (1; matching of children How to check condition (1; matching of children with a content model)?with a content model)?– by an automaton built from content model Eby an automaton built from content model E– Example: Example: <!ELEMENT A ((B|C)+, D)><!ELEMENT A ((B|C)+, D)>

Page 35: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 35

2.3 XML Namespaces2.3 XML Namespaces

Documents often comprise parts processed by Documents often comprise parts processed by different applications (and/or defined in different different applications (and/or defined in different schemas)schemas)– for example, in XSLT scripts:for example, in XSLT scripts:

<xsl:template match="doc/title"> <xsl:template match="doc/title"> <H1><H1>

<xsl:apply-templates /><xsl:apply-templates /> </H1> </H1>

</xsl:template></xsl:template>

– How to manage multiple sets of names?How to manage multiple sets of names?

HTML HTML elementselements

XSLT XSLT elements/elements/

instructionsinstructions

Page 36: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 36

XML Namespaces (2/5) XML Namespaces (2/5)

Solution: XML Namespaces, W3C Rec. Jan’99, for Solution: XML Namespaces, W3C Rec. Jan’99, for separating possibly overlapping “vocabularies” separating possibly overlapping “vocabularies” (sets of element type and attribute names) within a (sets of element type and attribute names) within a single documentsingle document

by introducing (arbitrary) local name by introducing (arbitrary) local name prefixesprefixes, and , and binding them to (fixed) globally unique URIsbinding them to (fixed) globally unique URIs– For example, the local prefix “For example, the local prefix “xsl:xsl:” ”

conventionally used in XSLT scriptsconventionally used in XSLT scripts

Page 37: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 37

XML Namespaces briefly (3/5)XML Namespaces briefly (3/5)

Namespace identified by a URI (through Namespace identified by a URI (through the associated local prexif) the associated local prexif) e.g. e.g. http://www.w3.org/http://www.w3.org/1999/XSL/Transform1999/XSL/Transform for XSLTfor XSLT

– conventional but not required to use URLsconventional but not required to use URLs– the identifying URI has to be unique, but it does not the identifying URI has to be unique, but it does not

have to be an existing addresshave to be an existing address

Association inherited to sub-elementsAssociation inherited to sub-elements– see the next example (of an XSLT script)see the next example (of an XSLT script)

Page 38: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 38

XML Namespaces (4/5) XML Namespaces (4/5)

<xsl:stylesheet version=<xsl:stylesheet version="1.0""1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/TR/xhtml1/strict">xmlns="http://www.w3.org/TR/xhtml1/strict">

<!-- XHTML is the ’default namespace’ --><!-- XHTML is the ’default namespace’ --><xsl:template match="doc/title"> <xsl:template match="doc/title"> <H1><H1>

<xsl:apply-templates /><xsl:apply-templates /> </H1> </H1> </xsl:template> </xsl:template>

</xsl:stylesheet></xsl:stylesheet>

Page 39: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

SDPL 2011 2: XML Basics 39

XML Namespaces briefly (5/5)XML Namespaces briefly (5/5)

Built on top of basic XMLBuilt on top of basic XML– By overloading attribute syntax (By overloading attribute syntax (xmlns:foo=...xmlns:foo=...))– does not affect validation does not affect validation

» namespace attributes must be declared for DTD-validitynamespace attributes must be declared for DTD-validity

» all element type names must be declared (with their all element type names must be declared (with their prefixes!)prefixes!)

– > Other schema languages (XML Schema, Relax NG) > Other schema languages (XML Schema, Relax NG) better for validating documents with Namespacesbetter for validating documents with Namespaces

Processing languages allow to declare NS Processing languages allow to declare NS prefixes; see nextprefixes; see next

Page 40: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

Example: Namespaces in XSLTExample: Namespaces in XSLT

SDPL 2011 2: XML Basics 40

<test xmlns:foo="http://www.uef.fi/Test"><test xmlns:foo="http://www.uef.fi/Test"> <foo:A>aaa</foo:A><foo:A>aaa</foo:A> <foo:B>bbb</foo:B><foo:B>bbb</foo:B></test></test>

<xsl:transform version="1.0"<xsl:transform version="1.0"xmlns:xsl="http://www.w3.org/1999/XSL/Transform"xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns:bar="http://www.uef.fi/Test">xmlns:bar="http://www.uef.fi/Test">

<xsl:template match="/"><xsl:template match="/"> <xsl:apply-templates select="//bar:*" /><xsl:apply-templates select="//bar:*" /></xsl:template></xsl:template>

</xsl:transform></xsl:transform>

XML:XML:

Page 41: SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

Example: Namespaces in XQueryExample: Namespaces in XQuery

SDPL 2011 2: XML Basics 41

<test xmlns:foo="http://www.uef.fi/Test"><test xmlns:foo="http://www.uef.fi/Test"> <foo:A>aaa</foo:A><foo:A>aaa</foo:A> <foo:B>bbb</foo:B><foo:B>bbb</foo:B></test></test>

xquery version "1.0";xquery version "1.0";declare namespace zap = "http://www.uef.fi/Test";declare namespace zap = "http://www.uef.fi/Test";

doc("ns-test.xml")//zap:*doc("ns-test.xml")//zap:*

ns-test.xml ns-test.xml ::

For NS support in XML APIs, see nextFor NS support in XML APIs, see next