XML Concepts
  Prof. Andrea Omicini
  DEIS, Ingegneria Due
  Alma Mater Studiorum, Università di Bologna a Cesena
  andrea.omicini@unibo.it
Outline
  Introducing XML
  XML Fundamentals
  Document Type Definitions (DTDs)
  Namespaces
  Internationalisation
  XML & CSS
  DOM & SAX
Introducing XML
What is XML?
  A W3C Standard
    http://www.w3.org/XML
  A mark-up language for text documents
    derived from SGML (Standard Generalized Markup Language)
      ISO 8879, http://www.iso.ch/cate/d16387.html
  eXtensible Markup Language
  A meta-markup language
    to define markup languages
      such as XHTML, XSLT, XML Schema, …
  A formally-defined, text-based language
    verifiable for well-formedness and validity
    usable across platforms and technologies
What XML is not
  XML is not
    a programming language
    a network-transport protocol
    a document presentation language
    a database (manager)
  It can be used (and actually is) in all of those contexts, but it remains a markup language
Why Markup?
  Markup
    encoding embodied in the document, specifying document properties as well as properties of the information contained
      for instance, formatting instructions
      more generally, structural / semantic information
      knowledge vs. data
  Marks / Markups
    tags used to qualify / label text chunks
      e.g., HTML tags
  XML example
      <student>
        <studentname>
          <name>Carlo</name>
          <surname>Nervo</surname>
        </studentname>
        <studentnumber>0000145678</studentnumber>
      </student>
XML: X for eXtensible
  Basic idea of XML
    a simple meta-language for humans and automata
    to build electronic documents
    allowing users to define ad hoc markup languages
  Then
    XML is quite free, in general
    it can be "extended"
      actually, specialised
      to define more specific ad hoc markup languages
  No predefined XML markups, as happens instead in HTML
    they need to be defined
Hey, Too Many Languages?
  Application domains are more and more
    numerous
    complex
    specific
  Special / specialised languages as the engineers' tools
    to represent, denote & express behaviours and computations
  Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
    to know and understand computational models and paradigms
    to select languages and paradigms
    to define and build new languages
XML Applications
  XML per se is "small" & simple
    languages defined via XML are instead many, and complex
  XML Applications
    XML-defined markup languages
    defined through a precise syntax
      DTD or XML Schema
    they may be either standard or custom
  Most standard XML applications are W3C standards
    such as
      XSLT
      XML Schema
      XHTML
XML for Portable Data
  Cross-platform, long-term data format
    passing XML data through space and time
    along with Unicode and text-based standard formats
  Text, text, text
    both data and markup
    all in the XML file
  XML document structure
    simple & clear
    easy to parse
    well-documented
  That is why XML is already everywhere
How XML Looks Like
    <?xml version="1.0" encoding="utf-8"?>
    <docroot>
      <head>
        <title>This is my document</title>
      </head>
      <body>
        <p>A list of things I like</p>
        <list>
          <item>weekends</item>
          <item>good beer</item>
          <item>midnight snacks</item>
          <item>ice cream
            <list>
              <item>chocolate</item>
              <item>cookie dough</item>
              <item>white russian</item>
            </list>
          </item>
          <item>shade trees</item>
        </list>
      </body>
    </docroot>
How to Work with XML
  XML is text
    so any text editor is perfectly fine
  A number of XML editors are around
    but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
  Visualisation is a different matter
    browsers do something
    but XML is not a presentation language, so…
  We need to understand
    what an XML document is
    how XML works
What is an XML Document?
  It can be
    a text file
    a record in a database
    a run-time construction in memory
    …
  In any case, it can be handled and transmitted by any system capable of dealing with text
    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber> 0000145678 </studentnumber>
      <course>2036</course>
    </student>
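As a minimal sketch of the "run-time construction in memory" case, the student document can be built as a tree and then turned into text. The sketch assumes Python's standard xml.etree.ElementTree module, which the slides do not mention; it is only one of many possible ways to do this.

```python
import xml.etree.ElementTree as ET

# Build the <student> document as a run-time structure in memory.
student = ET.Element("student")
studentname = ET.SubElement(student, "studentname")
ET.SubElement(studentname, "name").text = "Carlo"
ET.SubElement(studentname, "surname").text = "Nervo"
ET.SubElement(student, "studentnumber").text = "0000145678"
ET.SubElement(student, "course").text = "2036"

# Serialise the in-memory tree to text, ready to be stored or transmitted.
as_text = ET.tostring(student, encoding="unicode")
print(as_text)

# Any system able to deal with text can read it back into a tree.
reparsed = ET.fromstring(as_text)
print(reparsed.findtext("studentname/surname"))
```

The same text could equally be written to a file or stored as a database record; the document does not change with the container.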
How does XML Work?
  Who handles an XML document
    after it has been produced?
    how? why?
  XML parsers
    working out the structure of the XML document
    verifying well-formedness and basic respect of XML syntax
  XML validating parsers
    when applicable
      there is either a DTD or a Schema
    checking validity
  Examples
    web browsers, word processors, database servers, drawing programs, spreadsheets
Where is XML?
  Everywhere, already
Some History of XML
  Lot to be written still…
  SGML is where it comes from
    HTML was the first successful application of SGML
    but SGML had obvious limitations
      too complex
        more than 150 pages
        never implemented fully
      too complex for the Internet
  SGML "Lite" (1996; Bosak, Bray, et al.)
    XML 1.0 (February 1998)
  Then a flow
    namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
XML Fundamentals
A Simple XML Document
    <player>
      Carlo Nervo
    </player>
XML Document &
  This is a complete XML document
  It can be stored, recorded, built in the form of a number of different files, or even in other forms
    Carlonervo.xml, player.txt
    a record in a database
    a memory area built by a CGI and then transmitted
    sent by a Web server with MIME type application/xml or text/xml
    <player>
      Carlo Nervo
    </player>
XML Elements & Tags
  The document contains a single element
    of type player
  Such an element is delimited by the tag player
    between start tag <player> and end tag </player>
  In between the tags lies the element's content: Carlo Nervo
    tags are markup
      the most common form of markup, but there are other kinds
    content is character data
      including the white space between Carlo & Nervo
    <player>
      Carlo Nervo
    </player>
Tag Syntax
  Very similar to HTML tags
    at least superficially
  <tag> for start tags, </tag> for end tags
  <tag /> for empty tags
    tags with no content, like <br /> or <hr />
  XML is case sensitive
    so <player> cannot be closed by end tag </Player>
    NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
      HTML, JavaScript & XHTML, …
XML Trees: A Simple Example
                  player
    name     surname     team       team
    Carlo    Nervo       Bologna    Mantova

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
An XML Document is a Tree
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
  An XML document has a tree-like structure
    one and only one root
      root element, or document element
    each node / element can have one or more child elements
      each element, except the root, has exactly one parent
      child elements of the same parent are siblings
      leaves are either content or empty elements
  Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
      nesting needs to be perfect: overlapping is not allowed
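The tree view can be tried out with a few lines of code. This is only a sketch, assuming Python's standard xml.etree.ElementTree parser (not part of the slides): it walks the children of the root of the player document, then feeds the overlapping <em>/<b> example to the parser to show that it is rejected.

```python
import xml.etree.ElementTree as ET

PLAYER = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

root = ET.fromstring(PLAYER)              # the one and only root element
children = [child.tag for child in root]  # siblings sharing the same parent
print(root.tag, children)

# Overlapping elements break the tree structure, so the parser rejects them.
try:
    ET.fromstring("<em><b>Wrong </em> XML</b>")
    overlapping_accepted = True
except ET.ParseError as err:
    overlapping_accepted = False
    print("not well-formed:", err)
```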
Narrative-Organised XML Documents
    <biography>
      <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
      After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
      …
    </biography>
  XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and …
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
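A short sketch of how attributes look from a program, assuming Python's standard xml.etree.ElementTree (an assumption, the slides name no tool). It reads the current attribute of each team, written once with double quotes and once with single quotes and spaces around the equals sign, and checks that a repeated attribute name is rejected.

```python
import xml.etree.ElementTree as ET

doc = """<player>
  <name>Carlo</name>
  <team current="yes">Bologna</team>
  <team current = 'no'>Mantova</team>
</player>"""

root = ET.fromstring(doc)
# Attributes qualify a node; they are not children in the tree.
teams = {team.text: team.get("current") for team in root.iter("team")}
print(teams)

# The same attribute name twice on one element is not well-formed.
try:
    ET.fromstring('<team current="yes" current="no"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```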
Using Elements or Attributes?
  Attributes are for meta-data about the element, and content is the information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure; elements can be nested as needed
    attributes are unique within an element, any …
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>
XML Names
  XML names are used, and are the same, for the names of elements, attributes and some other constructs
    to increase efficiency and reduce complexity
  An XML name can include
    any letter
      Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore
Parsed Character Data
  An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
  E.g., the content of the element
      <superheroes>Batman &amp; Robin</superheroes>
    becomes the parsed character data
      Batman & Robin
Entity References
  &entity; reference
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
      used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon; for instance, through DTDs
Pre-defined XML Entities
  Markup    Entity    Description
  &lt;      <         less-than
  &gt;      >         greater-than
  &amp;     &         ampersand
  &quot;    "         double quote
  &apos;    '         single quote
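The pre-defined entities can be seen at work in both directions. The sketch below assumes Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree), which is a choice made here and not something the slides prescribe: text is escaped before being placed in a document, and the parser hands back the plain characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if (a < b && b > c)"
# escape() replaces &, < and > with the pre-defined entity references.
escaped = escape(raw)
print(escaped)

# Once parsed, the references are turned back into plain characters.
element = ET.fromstring(f"<code>{escaped}</code>")
print(element.text)

hero = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>")
print(hero.text)
```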
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
  CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own closing delimiter ]]>
  After parsing, there is no way to tell whether a text came from a CDATA section or not
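A small sketch of the last point, assuming Python's standard xml.etree.ElementTree as the parser (an assumption of this example): the same code chunk is written once in a CDATA section and once with entity references, and the parsed character data is identical.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b > c) run();]]></code>")
with_entities = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &gt; c) run();</code>")

print(with_cdata.text)
# After parsing, both forms yield the same character data.
same_text = with_cdata.text == with_entities.text
print(same_text)
```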
Comments
  Easy
    <!-- Comment -->
  A comment cannot contain --, nor can it end with --->
  Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or another comment
  Parsers may either drop or keep them, at their will
  Comments are meant to improve human legibility of XML docs
    to give info to computational agents: processing instructions
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions have this very end
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
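To see that processing instructions and comments are markup living beside the root element, here is a sketch that assumes Python's standard xml.dom.minidom, a parser that happens to keep both kinds of node. The stylesheet name player.css is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="player.css"?>'  # hypothetical stylesheet
    '<!-- a comment before the root -->'
    '<player>Carlo Nervo</player>'
)

# Top-level nodes: the processing instruction and the comment sit beside the root.
kinds = [node.nodeType for node in doc.childNodes]
pi = doc.childNodes[0]
print(pi.target, "|", pi.data)
print(doc.documentElement.tagName)
```

The XML declaration does not show up among the nodes: it is not a processing instruction, even though it looks like one.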
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it should be the first thing in the document, absolutely
      not even comments allowed before
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
  Standalone means that it has no external DTD
    optional, default no
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
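Some of the rules above can be checked with any non-validating parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample labels are invented for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "matching tags": "<player>Carlo Nervo</player>",
    "mismatched case": "<player>Carlo Nervo</Player>",
    "two roots": "<name>Carlo</name><surname>Nervo</surname>",
    "unquoted attribute": "<team current=yes>Bologna</team>",
    "unescaped ampersand": "<duo>Batman & Robin</duo>",
}
results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample passes; each of the others breaks one of the main rules.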
DTD
Flexibility or
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature
      within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      e.g., a football player should have at least one team
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, as well-formedness is
    how to handle validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  Many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
A Simple DTD
  We do not go too deep into DTD syntax
    we just look at the example below and comment
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
DTD Declaration
  The DTD is declared here as internal
    but could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
DTD Declarations
  So, you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document, and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
  ANY for free-form content
  EMPTY for empty elements
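Content models read much like regular expressions over child element names. The sketch below is not a DTD validator (Python's standard library does not validate against DTDs): it hand-translates the model (name, surname, team+) into a regular expression, purely to illustrate what sequence and the + suffix mean. PLAYER_MODEL and children_match are names invented here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical translation of the content model (name, surname, team+)
# into a regular expression over the sequence of child element names.
PLAYER_MODEL = re.compile(r"name surname( team)+")

def children_match(element: ET.Element, model: re.Pattern) -> bool:
    """Check the child-name sequence of an element against a content model."""
    sequence = " ".join(child.tag for child in element)
    return model.fullmatch(sequence) is not None

valid = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>")
no_team = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname></player>")
wrong_order = ET.fromstring(
    "<player><surname>Nervo</surname><name>Carlo</name><team>Bologna</team></player>")

print(children_match(valid, PLAYER_MODEL))
print(children_match(no_team, PLAYER_MODEL))
print(children_match(wrong_order, PLAYER_MODEL))
```

A real validating parser also checks attribute declarations, #PCDATA content and everything else the DTD states; this sketch covers only the order and number of children.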
Attribute Declarations
  A team element has a current attribute
    which is mandatory
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal, quoted string
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    more general than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by white space
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
Other DTD Declarations
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespaces
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
  URIs are standardised, not prefixes
    but svg, rdf and other customary prefixes are usually kept by convention
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
      <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need for the svg prefix to be made explicit
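A sketch of how a namespace-aware parser treats prefixes and default namespaces, assuming Python's standard xml.etree.ElementTree. The office elements and the city names are invented for the example; the company URIs are the ones used for the large-firm example.

```python
import xml.etree.ElementTree as ET

doc = """<company xmlns:local="http://www.company.it/xml"
                  xmlns:euro="http://www.company.eu/xml">
  <local:office>Cesena</local:office>
  <euro:office>Brussels</euro:office>
</company>"""

root = ET.fromstring(doc)
# ElementTree replaces each prefix with its URI, written as {uri}local_part.
tags = [child.tag for child in root]
print(tags)

# Prefixes are local shorthands: a query may pick its own prefix for the same URI.
ns = {"it": "http://www.company.it/xml"}
italian_office = root.find("it:office", ns).text
print(italian_office)

# A default namespace applies to the declaring element and its unprefixed descendants.
svg = ET.fromstring('<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>')
print(svg.tag, svg[0].tag)
```

The two office elements share a local part but end up with different full names, which is exactly the disambiguation namespaces are for.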
Internationalisation
What does Text
  "Text" can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
      character set
      ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
      UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so its encoding should be declared
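The difference between a character set and an encoding can be made concrete in a few lines. A minimal sketch in Python, chosen here only for illustration; the sample word is written with an escape so that the example does not depend on the encoding of its own source file.

```python
text = "Universit\u00e0"  # "Università", ten characters

# A character set maps each character to an integer code point.
code_point = ord("\u00e0")
print(hex(code_point))

# An encoding maps code points onto bytes: one character set, several encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
latin1 = text.encode("iso-8859-1")
sizes = (len(text), len(utf8), len(utf16), len(latin1))
print(sizes)

# Reading bytes with the wrong encoding garbles the text without any error.
garbled = utf8.decode("iso-8859-1")
print(garbled != text)
```

This is why the encoding is best declared inside the document itself: the bytes alone do not say how they should be read.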
The XML Encoding Declaration
  Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, but no longer an XML declaration
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two letters for each language known
    if not there: IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
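A sketch of how a spell-checker might find the language of each subpart, assuming Python's standard xml.etree.ElementTree; the article document and the languages helper are invented for the example. It relies on the XML 1.0 rule that an xml:lang value applies to an element's content, nested elements included, unless overridden.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    '<article xml:lang="en">'
    '<p>Good morning</p>'
    '<p xml:lang="it">Buongiorno <em>a tutti</em></p>'
    '</article>'
)

def languages(element, inherited=None, found=None):
    """Collect (tag, language) pairs; an element without xml:lang inherits its parent's."""
    found = [] if found is None else found
    lang = element.get(XML_LANG, inherited)
    found.append((element.tag, lang))
    for child in element:
        languages(child, lang, found)
    return found

result = languages(doc)
print(result)
```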
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
  W3C standard
    http://www.w3.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
[Figure: CSS, an example]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css" ?>
  CSS style sheet defining presentation style for the XML document tags
      tagname { property1: value1; … }
[Figure: XML + CSS example, the XML doc]
[Figure: XML + CSS example, how Mozilla visualises it]
DOM & SAX
Manipulating XML Documents
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Mostly, we need to manipulate
    access, delete, modify
    parts of an XML document
      which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are DOM and SAX
Document Object Model
  http://www.w3.org/DOM
    W3C standard, as usual
  "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kinds of child nodes each of them should / could have
Properties & Methods
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
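The general node properties and methods can be tried out with any DOM implementation. The sketch below assumes Python's standard xml.dom.minidom, a small DOM implementation picked here only because it needs no installation; the player document is the running example of these slides.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<player><name>Carlo</name><team current="yes">Bologna</team></player>'
)
root = doc.documentElement

# Every node has a name, a value and a type.
team = root.getElementsByTagName("team")[0]
text = team.firstChild
print(team.nodeName, team.nodeType == team.ELEMENT_NODE)
print(text.nodeName, repr(text.nodeValue))

# attributes gives access to all the attributes of the node.
team_attributes = dict(team.attributes.items())
print(team_attributes)

# appendChild(newChild) adds newChild after the other child nodes.
former = doc.createElement("team")
former.setAttribute("current", "no")
former.appendChild(doc.createTextNode("Mantova"))
root.appendChild(former)
print(root.toxml())
```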
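A Simple Java DOM Example
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;  // assumed: Xerces parser, imports are not shown on the slide
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                        // loads the whole document in memory
        Document doc = p.getDocument();
        // scan the children of the root for the first recipe element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);                         // print(Node, PrintStream): helper not shown on the slide
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }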
Main Problem of DOM
  The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
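The event-based model can be sketched with a handler that reacts to events while the document flows through the parser, without ever building a tree. This assumes Python's standard xml.sax module; the TeamCollector class is invented for the example.

```python
import xml.sax

class TeamCollector(xml.sax.ContentHandler):
    """Collect the text of <team> elements as parsing events arrive."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.teams = []
        self._inside_team = False

    def startDocument(self):
        self.events.append("start document")

    def startElement(self, name, attrs):
        self.events.append(f"start {name}")
        if name == "team":
            self._inside_team = True
            self.teams.append("")

    def characters(self, content):
        # Character data may arrive in several chunks, so it is accumulated.
        if self._inside_team:
            self.teams[-1] += content

    def endElement(self, name):
        self.events.append(f"end {name}")
        if name == "team":
            self._inside_team = False

    def endDocument(self):
        self.events.append("end document")

handler = TeamCollector()
xml.sax.parseString(
    b"<player><name>Carlo</name><team>Bologna</team><team>Mantova</team></player>",
    handler,
)
print(handler.teams)
print(handler.events[:3])
```

Only the state kept by the handler stays in memory, which is the main reason to prefer this model for large documents.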
Introducing XMLXML FundamentalsDocument Types Definitions (DTDs)NamespacesInternationalisationXML amp CSSDOM amp SAX
Outline
2
Introducing XML
3
What is XMLA W3C Standard
httpwwww3orgXMLA mark-up language for text documents
derived from SGML (Standard General Markup Language)
ISO 8879 httpwwwisochcated16387html
eXtensible Markup LanguageA meta-markup language
to define markup languagessuch as XHTML XSLT XML Schemahellip
A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies
4
What XML is not
XML is nota programming languagea network-transport protocola document presentation languagea database (manager)
It can be used (and it is actually) in all of those contexts but it remains a markup language
5
Why Markup Markup
encoding embodied in the document specifying document properties as well as properties of information contained
for instance formatting instructionsmore generally structural semantic information
knowledge vs dataMarks Markups
tag used to qualify label text chunkseg HTML tags
XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt
6
XML X for Basic idea of XML
a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages
ThenXML is quite free in generalit can be ldquoextended
actually specialisedto define more specific ad hoc markup languages
No predefined XML markups as it happens instead in HTML
they need to be defined7
Hey too many Application domains are more and more
numerouscomplexspecific
Special specialised languages as the engineers tools
to represent denote amp express behaviours and computations
Engineers working with computational ICT systems will be called to use a number of different artificial languages but also
to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages
8
XML ApplicationsXML per se is ldquosmallrdquo amp simple
languages defined via XML are instead so many and complex
XML ApplicationsXML-defined markup languages
defined through a precise syntaxDTD or XML Schema
they may be either standard or customMost standard XML applications are W3C
such asXSLTXML SchemaXHTML
9
XML for Portable
Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format
Text text textboth data and markupall in the XML file
XML document structure simple amp cleareasy to parsewell-documented
That is why XML is already everwhere
10
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like
12
How to Work with
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML It can be
A text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
14
How does XML Who handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML Lot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGML
but had obvious limitationstoo complex
more than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
18
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised
<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>

XML Documents for written narrative such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, while child elements with the same name can be repeated

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
28
Parsed Character Data
An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  parsing of character data starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
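As an illustration of how the five pre-defined entities are used when producing XML text, here is a minimal sketch. The class XmlEscape and its escape method are hypothetical names introduced for this example; real applications would normally rely on the serialiser of their XML library instead.

```java
// Hypothetical helper: replaces the five special characters with the
// pre-defined entity references listed in the table.
final class XmlEscape {

    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The superheroes example from the Parsed Character Data slide.
        System.out.println("<superheroes>" + escape("Batman & Robin") + "</superheroes>");
    }
}
```

Escaping > , " and ' is not always required by XML, but doing so consistently keeps the same helper usable for both element content and attribute values.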
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
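To see that a CDATA section and the equivalent escaped text carry the same character data, one can compare the text content a parser returns for both. This is a small sketch assuming the JAXP DOM parser bundled with the JDK; CdataDemo and textOf are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: returns the character data of the root element.
final class CdataDemo {

    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // getTextContent concatenates text and CDATA children alike.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && b < c)]]></code>";
        String escaped = "<code>if (a &lt; b &amp;&amp; b &lt; c)</code>";
        System.out.println(textOf(withCdata).equals(textOf(escaped)));
    }
}
```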
32
Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone means that it has no external DTD
  optional, default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional
38
Validation
A valid XML Document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
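A validating parser can check documents against the declarations of this example. The following is a sketch only, assuming the JAXP parser bundled with the JDK; DtdValidation and problems are hypothetical names, and the DOCTYPE name is assumed to be player so that it matches the root element, as validity requires.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates a player document against the slide's DTD.
final class DtdValidation {

    // Internal DTD subset; the DOCTYPE name "player" is an assumption made
    // here so that it matches the root element.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Parses DTD + body with validation on; returns the reported problems. */
    static List<String> problems(String body) {
        List<String> found = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check validity, not only well-formedness
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    found.add(e.getMessage()); // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (SAXException e) {
            found.add(e.getMessage()); // fatal error: not even well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document problems: " + problems(valid));
        System.out.println("player without team problems: " + problems(noTeam));
    }
}
```

Collecting messages rather than stopping at the first one reflects the fact that validity errors, unlike well-formedness errors, do not force the parser to give up.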
41
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
42
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
45
Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but customary prefixes such as svg, rdf and others are usually kept
53
Setting Default
xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need for the svg prefix to be made explicit
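The effect of a default namespace can be observed with a namespace-aware parser: an element is identified by its namespace URI plus local name, whether or not a prefix is written. This is a minimal sketch assuming the JAXP DOM parser bundled with the JDK; NamespaceDemo and expandedNames are hypothetical names, and the rect child element is only an illustrative choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: lists every element as {namespace URI}local name.
final class NamespaceDemo {

    static List<String> expandedNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagNameNS("*", "*"); // every element
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node element = all.item(i);
                names.add("{" + element.getNamespaceURI() + "}" + element.getLocalName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String byDefault = "<svg xmlns=\"http://www.w3.org/2000/svg\"><rect/></svg>";
        String byPrefix =
            "<s:svg xmlns:s=\"http://www.w3.org/2000/svg\"><s:rect/></s:svg>";
        System.out.println(expandedNames(byDefault));
        System.out.println(expandedNames(byPrefix));
    }
}
```

Both documents should yield the same list, which is the sense in which the URI, and not the prefix, carries the identity of a name.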
54
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
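The distinction between a code point and its byte representation can be made concrete with a few lines of Java. This is only a sketch; EncodingDemo and byteLength are hypothetical names, and the sample characters are chosen for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper: shows that one code point maps to different byte
// sequences under different character encodings.
final class EncodingDemo {

    /** Number of bytes used to encode the text in the named encoding. */
    static int byteLength(String text, String encoding) {
        return text.getBytes(Charset.forName(encoding)).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // the letter e with grave accent, code point U+00E8
        System.out.println("code point: U+"
            + Integer.toHexString(eGrave.codePointAt(0)).toUpperCase());
        for (String encoding : new String[] {"ISO-8859-1", "UTF-8", "UTF-16BE"}) {
            System.out.println(encoding + ": " + byteLength(eGrave, encoding) + " byte(s)");
        }
    }
}
```

UTF-16BE is used instead of plain UTF-16 so that no byte-order mark is counted in the length.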
56
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
Encoding for
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises it
66
Example: How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access / delete / modify
  parts of an XML document
    which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM/
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes they should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
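A Simple Java DOM

  import java.io.PrintStream;
  // Assumption: DOMParser is the Apache Xerces class; the slide shows no imports.
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element, or the end of the list
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }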
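The general node properties and methods mentioned in these slides (attributes, appendChild, sibling and child navigation) can also be tried without external libraries. Below is a minimal sketch that assumes the JAXP DOM implementation bundled with the JDK rather than the parser class used on the slide; DomDemo, currentTeam and addTeam are hypothetical names, and "Example FC" is made-up sample data.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: explores and updates the player tree through DOM.
final class DomDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Text of the team element whose current attribute is "yes", or null. */
    static String currentTeam(Document doc) {
        NodeList teams = doc.getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            Element team = (Element) teams.item(i);
            if ("yes".equals(team.getAttribute("current"))) {
                return team.getTextContent();
            }
        }
        return null;
    }

    /** Appends a new team element after the other children of the root. */
    static void addTeam(Document doc, String name, boolean current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current ? "yes" : "no");
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team>"
            + "<team current=\"no\">Mantova</team></player>");
        System.out.println(currentTeam(doc));
        addTeam(doc, "Example FC", false); // made-up team name
        System.out.println(doc.getElementsByTagName("team").getLength());
    }
}
```

The whole document is held in memory as a tree here, which is exactly the DOM behaviour the slides describe.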
74
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
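The event-based model can be sketched with the SAX classes shipped in the JDK: the parser calls back a handler as the text flows by, and no tree is built. SaxDemo and elementEvents are hypothetical names introduced for this example only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records element start/end events in arrival order.
final class SaxDemo {

    static List<String> elementEvents(String xml) {
        List<String> events = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    events.add("start:" + qName);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    events.add("end:" + qName);
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(elementEvents("<player><name>Carlo</name></player>"));
    }
}
```

Only the list of events is kept in memory, so the same handler would work on a document far larger than the available heap.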
76
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and each of them contains just parsed character data (#PCDATA)
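A validating parser can check these element declarations mechanically. The sketch below uses the JAXP API bundled with the JDK; the class and method names are illustrative, and the DOCTYPE name is written as player so that it matches the root element, as validity requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // XML declaration plus internal DTD subset, as in the football player example
    static final String PROLOG =
          "<?xml version=\"1.0\" standalone=\"yes\"?>\n"
        + "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Parses PROLOG + body with validation on and returns the validity errors reported. */
    static List<String> validityErrors(String body) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validity errors go to ErrorHandler.error
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(PROLOG + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                     + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";
        String wrongOrder = "<player><surname>Nervo</surname><name>Carlo</name>"
                          + "<team current=\"yes\">Bologna</team></player>";
        String noAttribute = "<player><name>Carlo</name><surname>Nervo</surname>"
                           + "<team>Bologna</team></player>";
        System.out.println("valid:        " + validityErrors(valid));
        System.out.println("wrong order:  " + validityErrors(wrongOrder));
        System.out.println("no attribute: " + validityErrors(noAttribute));
    }
}
```

Note that the wrong-order document is still well-formed: only the validating parser, guided by the DTD, reports a problem.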
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements
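The operators can be combined in a single content model. As a hedged illustration, the sketch below validates a few documents against a made-up recipe DTD (element names and class name are assumptions, not taken from the slides) whose root uses sequence, choice, grouping and all three suffixes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ContentModelSketch {
    // title, then an optional author, then any mix of ingredients and notes, then at least one step
    static final String PROLOG =
          "<!DOCTYPE recipe [\n"
        + "  <!ELEMENT recipe (title, author?, (ingredient | note)*, step+)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "  <!ELEMENT author (#PCDATA)>\n"
        + "  <!ELEMENT ingredient (#PCDATA)>\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ELEMENT step (#PCDATA)>\n"
        + "]>\n";

    /** Returns true when PROLOG + body is valid with respect to the DTD above. */
    static boolean isValid(String body) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { valid[0] = false; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(PROLOG + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid("<recipe><title>T</title><step>s</step></recipe>"));
        System.out.println(isValid("<recipe><title>T</title><author>A</author><note>n</note>"
                + "<ingredient>i</ingredient><note>n</note><step>s</step><step>s</step></recipe>"));
        System.out.println(isValid("<recipe><title>T</title></recipe>"));
        System.out.println(isValid("<recipe><title>T</title><author>A</author><author>B</author>"
                + "<step>s</step></recipe>"));
    }
}
```

The first two documents satisfy the content model; the third lacks the mandatory step, and the fourth repeats author, which the ? suffix allows at most once.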
45
Attribute Declarations
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, the attribute has the given value
literal
  the default value is the literal quoted string
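The effect of attribute defaults is visible to any program reading the document. In the sketch below, which is illustrative only, the slide's DTD is varied on purpose: current gets the literal default "no", and two hypothetical attributes, sport (#FIXED) and coach (#IMPLIED), are added. The JAXP parser bundled with the JDK reads the internal DTD subset and fills in the defaults even when validation is off.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultsSketch {
    static final String DOC =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team\n"
        + "    current (yes | no) \"no\"\n"               // literal default
        + "    sport   CDATA      #FIXED \"football\"\n"  // always this value
        + "    coach   CDATA      #IMPLIED>\n"            // optional, no default
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Parses DOC and returns its index-th team element. */
    static Element team(int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            return (Element) doc.getElementsByTagName("team").item(index);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element bologna = team(0);
        Element mantova = team(1);
        System.out.println("Bologna current = " + bologna.getAttribute("current")); // written in the document
        System.out.println("Mantova current = " + mantova.getAttribute("current")); // supplied by the DTD default
        System.out.println("Mantova sport   = " + mantova.getAttribute("sport"));   // supplied by #FIXED
        System.out.println("Mantova coach?    " + mantova.hasAttribute("coach"));   // #IMPLIED and absent
    }
}
```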
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID value(s) of elements in the same document
48
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces for
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but svg, rdf and other common prefixes are usually kept as they are
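That the URI, not the prefix, identifies a namespace can be observed with a namespace-aware DOM parser. This is a minimal sketch with the JAXP API bundled with the JDK; the extra prefix it, the office element and the class name are hypothetical additions to the company example.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespacePrefixSketch {
    // local and it are two prefixes bound to the same namespace URI
    static final String DOC =
          "<company xmlns:local=\"http://www.company.it/xml\""
        + " xmlns:it=\"http://www.company.it/xml\""
        + " xmlns:world=\"http://www.company.com/xml\">"
        + "<local:office/><it:office/><world:office/>"
        + "</company>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the index-th child element of the root of DOC. */
    static Element office(int index) {
        return (Element) parse(DOC).getDocumentElement().getChildNodes().item(index);
    }

    /** Two names are the same when namespace URI and local part match; the prefix plays no role. */
    static boolean sameName(Element a, Element b) {
        return Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
            && a.getLocalName().equals(b.getLocalName());
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            Element e = office(i);
            System.out.println(e.getPrefix() + " + " + e.getLocalName() + " -> " + e.getNamespaceURI());
        }
        System.out.println("local:office same as it:office?    " + sameName(office(0), office(1)));
        System.out.println("local:office same as world:office? " + sameName(office(0), office(2)));
    }
}
```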
53
Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
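A namespace-aware parser makes the effect of the default namespace explicit. In this sketch (JAXP from the JDK; the circle element, the attribute values and the class name are illustrative) the unprefixed elements end up in the SVG namespace, while unprefixed attributes do not: the default namespace applies to elements only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultNamespaceSketch {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String DOC =
        "<svg xmlns=\"" + SVG_NS + "\" width=\"100\" height=\"100\"><circle r=\"40\"/></svg>";

    /** Parses DOC with namespace support and returns its root element. */
    static Element root() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element svg = root();
        Element circle = (Element) svg.getFirstChild();
        System.out.println("svg    namespace: " + svg.getNamespaceURI() + ", prefix: " + svg.getPrefix());
        System.out.println("circle namespace: " + circle.getNamespaceURI());
        System.out.println("width  namespace: " + svg.getAttributeNode("width").getNamespaceURI());
    }
}
```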
54
Internationalisation
55
What does "Text" Mean
"Text" can be encoded according to so many different alphabets
Character set
  mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
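The difference between a code point and its encodings can be shown with a single character. The sketch below (class and method names are illustrative) prints the bytes that represent the letter è, code point U+00E8, in three encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    /** Hex dump of the bytes that encode text in the given character encoding. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E8"; // e with grave accent, written as an escape to stay ASCII-safe
        System.out.println("code point: U+" + String.format("%04X", text.codePointAt(0))); // U+00E8
        System.out.println("UTF-8:      " + hex(text, StandardCharsets.UTF_8));      // C3 A8
        System.out.println("UTF-16BE:   " + hex(text, StandardCharsets.UTF_16BE));   // 00 E8
        System.out.println("ISO-8859-1: " + hex(text, StandardCharsets.ISO_8859_1)); // E8
    }
}
```

The same character yields three different byte sequences, which is why a reader of the bytes needs to know the encoding.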
56
The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  see also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
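A parser uses the encoding declaration to turn bytes back into characters. As a small sketch with the JAXP parser bundled with the JDK (the university element and the class name are illustrative), the same text is stored once as ISO-8859-1 and once as UTF-16, and in both cases the parser recovers the same string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationSketch {
    /** Builds a document declaring the given encoding and stores it with that encoding. */
    static byte[] document(String declaredName, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredName + "\"?>"
                   + "<university>Universit\u00E0 di Bologna</university>";
        return xml.getBytes(charset);
    }

    /** Parses raw bytes and returns the text content of the root element. */
    static String rootText(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = document("ISO-8859-1", StandardCharsets.ISO_8859_1);
        byte[] utf16 = document("UTF-16", StandardCharsets.UTF_16); // written with a byte order mark
        System.out.println(latin1.length + " bytes, " + utf16.length + " bytes");
        System.out.println(rootText(latin1).equals(rootText(utf16)));
    }
}
```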
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
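A tool such as a spell-checker would look up the language of a node by finding the nearest xml:lang, since the attribute applies to the content of its element unless overridden further down. The sketch below is one possible way to do it with the DOM API from the JDK; the sample document, class name and helper names are illustrative.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangSketch {
    static final String DOC =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao <em>mondo</em></p></doc>";

    static Document parse() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look up xml:lang by namespace
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(DOC)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Walks up from node to the nearest element carrying xml:lang; null if there is none. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return null;
    }

    /** Language in force for the first element with the given tag name. */
    static String languageOfFirst(String tagName) {
        return languageOf(parse().getElementsByTagName(tagName).item(0));
    }

    public static void main(String[] args) {
        System.out.println("first p: " + languageOfFirst("p"));  // inherited from doc
        System.out.println("em:      " + languageOfFirst("em")); // inherited from the second p
    }
}
```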
Encoding for Portability
Dealing with encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML Documents
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
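The different kinds of node can be seen by walking a small tree. The following sketch, based on the org.w3c.dom API shipped with the JDK, counts nodes by kind; the sample document and all class and method names are illustrative. Attributes are reached through getAttributes() because DOM does not treat them as children of their element.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeWalkSketch {
    static final String DOC =
          "<!-- squad list --><player><name>Carlo</name><?audit checked?>"
        + "<team current=\"yes\"><![CDATA[Bologna]]></team></player>";

    /** Human-readable name for the node kinds that occur in DOC. */
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.ATTRIBUTE_NODE: return "attribute";
            case Node.TEXT_NODE: return "text";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    /** Depth-first walk that tallies node and all its descendants (plus attributes) by kind. */
    static void count(Node node, Map<String, Integer> counts) {
        counts.merge(kind(node), 1, Integer::sum);
        NamedNodeMap attributes = node.getAttributes(); // null unless node is an element
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); i++) {
                counts.merge(kind(attributes.item(i)), 1, Integer::sum);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count(child, counts);
        }
    }

    static Map<String, Integer> countAll() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            Map<String, Integer> counts = new TreeMap<>();
            count(doc, counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countAll());
    }
}
```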
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
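In the Java binding of DOM, the general properties become getter methods on org.w3c.dom.Node. The sketch below uses the API bundled with the JDK on the football player document; the class name and the addTeam helper are illustrative, not part of any standard API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomPropertiesSketch {
    static final String DOC =
          "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team></player>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a new team element and appends it after the other children of the root. */
    static Element addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element player = doc.getDocumentElement();
        // name, value and type are available on every node
        System.out.println(player.getNodeName() + " type=" + player.getNodeType()
                + " value=" + player.getNodeValue());
        // attributes of the last child (the team element)
        Node team = player.getLastChild();
        NamedNodeMap attributes = team.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node a = attributes.item(i);
            System.out.println("  " + a.getNodeName() + "=" + a.getNodeValue());
        }
        addTeam(doc, "Mantova", "no");
        System.out.println("last child is now: " + player.getLastChild().getTextContent());
    }
}
```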
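A Simple Java DOM Example

import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser; // Xerces parser, not part of the JDK
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// The class name is illustrative: the slide shows only the main method
public class FirstRecipe {
    public static void main(String[] args) {
        try {
            // Load the whole XML file named on the command line into a DOM tree
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Scan the children of the root element for the first "recipe" node
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}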
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
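The event-based model can be observed by logging the callbacks a SAX parser makes while the document flows through it. This is a minimal sketch with the SAX API bundled with the JDK; the class name and the log format are illustrative. No tree is ever built: the handler sees each event once and keeps only what it chooses to record.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    /** Parses xml with SAX and returns the sequence of events as readable strings. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName,
                                               Attributes attributes) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```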
76
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
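As an illustration of the general properties (name, value, type) and of appendChild, here is a small sketch based on the org.w3c.dom interfaces bundled with the JDK. The class DomNodeDemo and its helper methods are hypothetical names, not part of any standard API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNodeDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a new element holding the given text as the last child of the root element. */
    static Element appendTextElement(Document doc, String name, String text) {
        Element child = doc.createElement(name);
        child.appendChild(doc.createTextNode(text));
        doc.getDocumentElement().appendChild(child);
        return child;
    }

    /** Writes one line per node: name, numeric type, value (null for elements). */
    static void dump(Node n, String indent, StringBuilder out) {
        out.append(indent).append(n.getNodeName())
           .append(" type=").append(n.getNodeType())
           .append(" value=").append(n.getNodeValue()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, indent + "  ", out);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name></player>");
        appendTextElement(doc, "surname", "Nervo");
        StringBuilder out = new StringBuilder();
        dump(doc.getDocumentElement(), "", out);
        System.out.print(out);
    }
}
```

Note that the character content is not the value of the element node: it lives in a separate text node, child of the element.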
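A Simple Java DOM Example
// DOMParser is presumably the Apache Xerces parser class
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.PrintStream;

public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);                      // load the whole document in memory
    Document doc = p.getDocument();
    Node n = doc.getDocumentElement().getFirstChild();
    // skip siblings until the first recipe element is found
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);                       // print(Node, PrintStream) is a helper not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}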
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well-supported as DOM is
  in terms of standardisation
  as well as of tools
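The event-based model can be sketched with the SAX classes included in the JDK (javax.xml.parsers and org.xml.sax). The class SaxEventsDemo and the event labels are invented for the example; the handler just records events in the order the parser generates them, and no tree is ever built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    /** Parses the document and returns the sequence of events seen by the handler. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<player><name>Carlo</name><surname>Nervo</surname></player>")
                .forEach(System.out::println);
    }
}
```

Whatever the application needs to remember (here, the log) has to be stored by the handler itself: this is the price for the low memory footprint.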
Hey, too many…
Application domains are more and more
  numerous
  complex
  specific
Special / specialised languages as the engineer's tools
  to represent, denote & express behaviours and computations
Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  to know and understand computational models and paradigms
  to select languages and paradigms
  to define and build new languages
XML Applications
XML per se is "small" & simple
  languages defined via XML are instead so many and complex
XML Applications: XML-defined markup languages
  defined through a precise syntax: DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are W3C
  such as XSLT, XML Schema, XHTML
XML for Portable Data
Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard format
Text, text, text
  both data and markup
  all in the XML file
XML document structure: simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere
How XML Looks like
<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head>
    <title>This is my document</title>
  </head>
  <body>
    <p>A list of things I like</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>
How to Work with XML
XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works
What is an XML Document
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text
<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber> 0000145678 </studentnumber>
  <course>2036</course>
</student>
How does XML Work
Who handles XML documents
  after they have been produced
  how, why
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable: there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
Where is XML
Everywhere, already
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
XML Fundamentals
A Simple XML Document
<player> Carlo Nervo</player>
XML Document & Files
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player> Carlo Nervo</player>
XML Elements & Tags
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player> Carlo Nervo</player>
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …
XML Trees: A Simple Example
player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
An XML Document is a Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have zero or more child elements
  each element, except the root, has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect, overlapping is not allowed
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
Narrative-Organised XML Documents
<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
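How a program uses attributes as qualifiers of nodes can be sketched as follows, again assuming the DOM API bundled with the JDK. The class CurrentTeam and its method are hypothetical; the document is the player example of the slide.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CurrentTeam {
    /** Returns the content of the first team element whose current attribute is "yes", or null. */
    static String currentTeam(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                // getAttribute returns the empty string when the attribute is missing
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team>"
                + "<team current='no'>Mantova</team></player>";
        System.out.println(currentTeam(xml));
    }
}
```

The second team uses single quotes on purpose: both quoting styles reach the application as the same attribute value.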
Using Elements or Attributes
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
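The rules above can be turned into a rough check. This is only an approximation: it relies on Java's own notion of Unicode letters and digits rather than on the exact character classes of the XML specification, and the class name XmlNameCheck is made up.

```java
class XmlNameCheck {
    /** Approximates the slide's rules for XML names; not the exact grammar of the XML spec. */
    static boolean looksLikeXmlName(String s) {
        if (s == null || s.isEmpty()) return false;
        int first = s.codePointAt(0);
        // must begin with a letter (ideographs count as letters) or an underscore
        if (!(Character.isLetter(first) || first == '_')) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int c = s.codePointAt(i);
            boolean allowed = Character.isLetterOrDigit(c)
                    || c == '_' || c == '-' || c == '.'
                    || c == ':'; // accepted here, but meant for namespace prefixes only
            if (!allowed) return false;
            i += Character.charCount(c);
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"player", "first_name", "xsl:template", "2nd", "my name"}) {
            System.out.println(name + " -> " + looksLikeXmlName(name));
        }
    }
}
```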
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
Entity References
&entity; reference
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
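Entity references and CDATA sections are two ways of writing the same character data. The sketch below, which assumes the JDK's DOM parser and uses the invented class name EscapingDemo, reads the text content of the root element in both cases.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EscapingDemo {
    /** Returns the character data of the root element, as the application sees it after parsing. */
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // escaped with a pre-defined entity reference
        System.out.println(rootText("<superheroes>Batman &amp; Robin</superheroes>"));
        // same content, protected by a CDATA section
        System.out.println(rootText("<superheroes><![CDATA[Batman & Robin]]></superheroes>"));
        // a code chunk with < needs no escaping inside CDATA
        System.out.println(rootText("<code><![CDATA[if (a < b) swap();]]></code>"));
    }
}
```

The first two calls give the same text, Batman & Robin, which is the point made by the last line of the slide.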
Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root element
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if present, it should be the first thing in the document, absolutely
  not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone "yes" means that the document does not depend on an external DTD
  optional, default no
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
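Besides the tools on the Web, any non-validating parser can act as a well-formedness checker, since violations of the rules above are fatal errors. A minimal sketch with the JDK's parser follows; WellFormedCheck and isWellFormed are names chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** True if the parser accepts the text as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing them itself
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // some well-formedness rule was broken
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player><name>Carlo</name></player>"));
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>")); // overlapping elements
        System.out.println(isWellFormed("<a/><b/>"));                   // two root elements
        System.out.println(isWellFormed("<team current=yes/>"));        // unquoted attribute value
    }
}
```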
DTD
Flexibility or…
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature, within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, whereas well-formedness is
  how to handle validity errors is optional
Validation
A valid XML document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
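Checking a document against its DTD can be tried with the JDK's validating parser. The sketch below is an assumption-laden illustration: the class DtdValidation and its members are invented, and the DTD is the one of the player example. For a document to be valid, the name in the DOCTYPE declaration has to match the type of the root element, so player is used there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidation {
    /** Internal DTD subset of the player example. */
    static final String DTD = "<!DOCTYPE player ["
            + "<!ELEMENT player (name, surname, team+)>"
            + "<!ELEMENT name (#PCDATA)>"
            + "<!ELEMENT surname (#PCDATA)>"
            + "<!ELEMENT team (#PCDATA)>"
            + "<!ATTLIST team current (yes | no) #REQUIRED>"
            + "]>";

    /** Returns the validity errors reported by the parser; an empty list means valid. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turns on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // validity errors are recoverable: collect them and go on
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // well-formedness errors are fatal
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validate(valid));
        System.out.println(validate(noTeam));
    }
}
```

The second document is well-formed but not valid, since team+ asks for at least one team: the parser still builds the tree, and it is up to the application to decide what to do with the errors.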
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
Namespaces
What are Namespaces
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local:, euro: and world: everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but usually svg, rdf and other conventional prefixes are kept as they are
Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need for the svg prefix to be made explicit
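What a namespace-aware parser makes of prefixes and default namespaces can be observed through DOM. In the sketch below the class NamespaceDemo and the doc element are invented; the two namespace URIs are the ones of SVG and MathML, which both define a set element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceDemo {
    /** Lists, in document order, {namespace URI}local name for every element with that local name. */
    static List<String> expandedNames(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagNameNS("*", localName);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                names.add("{" + n.getNamespaceURI() + "}" + n.getLocalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String mixed = "<doc xmlns:svg=\"http://www.w3.org/2000/svg\""
                + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
                + "<svg:set/><m:set/></doc>";
        System.out.println(expandedNames(mixed, "set"));
        // default namespace: no prefix, same namespace for svg and its children
        String svgOnly = "<svg xmlns=\"http://www.w3.org/2000/svg\"><set/></svg>";
        System.out.println(expandedNames(svgOnly, "set"));
    }
}
```

The two set elements share the local name but not the namespace URI, so an application can tell them apart whatever the prefixes chosen by the author.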
Internationalisation
What does "Text" Mean
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
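The difference between character set and encoding, and the role of the encoding declaration, can be seen with a few lines of Java. This is a sketch under the assumption that the JDK's parser is used; EncodingDemo is an invented name and the sample word is arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingDemo {
    /** Parses raw bytes, letting the parser pick the encoding, and returns the root element text. */
    static String rootText(byte[] document) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(document))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // ten characters, the last one is not ASCII
        // same code points, different byte sequences
        System.out.println(word.getBytes(StandardCharsets.UTF_8).length);      // 11
        System.out.println(word.getBytes(StandardCharsets.UTF_16BE).length);   // 20
        System.out.println(word.getBytes(StandardCharsets.ISO_8859_1).length); // 10

        // a Latin-1 document that declares its own encoding
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><w>" + word + "</w>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1).equals(word)); // true
    }
}
```

The parser is given bytes, not characters: it is the declaration embodied in the document that tells it how to decode them.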
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there: IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
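A spell-checker would need the language in force for each element. According to the XML specification, xml:lang applies to the element where it is specified and to the elements in its content, unless overridden; the sketch below follows that rule by climbing towards the root. XmlLangDemo and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangDemo {
    /** Language in force for an element: its own xml:lang, or the nearest ancestor's; null if none. */
    static String languageOf(Element element) {
        for (Node n = element; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return null;
    }

    /** Convenience for the example: language of the index-th element with the given tag name. */
    static String languageOf(String xml, String tag, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tag).item(index));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang=\"it\"><p>ciao</p><p xml:lang=\"en\">hello</p></doc>";
        System.out.println(languageOf(xml, "p", 0)); // inherited from doc
        System.out.println(languageOf(xml, "p", 1)); // overridden locally
    }
}
```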
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
Why Markup Markup
encoding embodied in the document specifying document properties as well as properties of information contained
for instance formatting instructionsmore generally structural semantic information
knowledge vs dataMarks Markups
tag used to qualify label text chunkseg HTML tags
XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt
6
XML X for Basic idea of XML
a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages
ThenXML is quite free in generalit can be ldquoextended
actually specialisedto define more specific ad hoc markup languages
No predefined XML markups as it happens instead in HTML
they need to be defined7
Hey too many Application domains are more and more
numerouscomplexspecific
Special specialised languages as the engineers tools
to represent denote amp express behaviours and computations
Engineers working with computational ICT systems will be called to use a number of different artificial languages but also
to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages
8
XML ApplicationsXML per se is ldquosmallrdquo amp simple
languages defined via XML are instead so many and complex
XML ApplicationsXML-defined markup languages
defined through a precise syntaxDTD or XML Schema
they may be either standard or customMost standard XML applications are W3C
such asXSLTXML SchemaXHTML
9
XML for Portable
Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format
Text text textboth data and markupall in the XML file
XML document structure simple amp cleareasy to parsewell-documented
That is why XML is already everwhere
10
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like
12
How to Work with
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML It can be
A text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
14
How does XML Who handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML Lot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGML
but had obvious limitationstoo complex
more than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
18
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
An XML Document is …
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
  An XML Document has a tree-like structure
    one and only one root
      root element, or document element
    each node element can have one or more child elements
      each element other than the root has exactly one parent
      child elements from the same parent are siblings
      leaves are either content or empty elements
  Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
      nesting needs to be perfect, overlapping not allowed
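A parser rejects anything that is not a proper tree. The sketch below, again based on Python's standard xml.etree.ElementTree, checks well-formedness by simply trying to parse, and prints the tree of a well-formed document; the helper names is_well_formed and outline are illustrative choices, not part of any standard API.

```python
import xml.etree.ElementTree as ET

PLAYER = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""


def is_well_formed(xml_text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def outline(element, depth=0):
    """Return the element tree as a list of indented element names."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children of the same parent are siblings
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print(is_well_formed("<em><b>Wrong </em> XML</b>"))  # overlapping elements
    print("\n".join(outline(ET.fromstring(PLAYER))))
```

The same check also rejects a document with two root elements, since the result would be a forest rather than a tree.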
Narrative-Organised …
  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams, such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
  </biography>
  XML Documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and …
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
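How attributes qualify nodes without adding new ones can be sketched with Python's standard xml.etree.ElementTree: each element carries a dictionary of its attributes, separate from its children. The function name current_teams is an illustrative assumption; the second team deliberately uses the alternative quoting and spacing mentioned above.

```python
import xml.etree.ElementTree as ET

PLAYER = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current = 'no'>Mantova</team>
</player>"""

# Two attributes with the same name on one element: not well-formed.
DUPLICATE = '<team current="yes" current="no"/>'


def current_teams(xml_text):
    """Return the content of every team element whose current attribute is 'yes'."""
    root = ET.fromstring(xml_text)
    return [team.text for team in root.findall("team") if team.get("current") == "yes"]


if __name__ == "__main__":
    print(current_teams(PLAYER))
    try:
        ET.fromstring(DUPLICATE)
    except ET.ParseError as error:
        print("rejected:", error)
```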
Using Elements or …
  Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, any …
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
  XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
    to increase efficiency and abate complexity
  An XML name can include
    any letter
      latin or even non-latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs, or underscore
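The naming rules above can be approximated with a regular expression. This is only a sketch: it relies on Python's Unicode-aware \w class and is not the full Name production of the XML 1.0 specification, which lists the allowed character ranges explicitly. The function name looks_like_xml_name is an assumption, and the colon is rejected here because it is reserved to namespaces.

```python
import re

# First character: a letter, an ideograph, or an underscore (a word character
# that is not a digit). Following characters: word characters, period, hyphen.
# Approximation only, NOT the exact XML 1.0 "Name" production.
_NAME = re.compile(r"[^\W\d][\w.\-]*")


def looks_like_xml_name(text):
    """Return True if text satisfies the simplified naming rules."""
    return _NAME.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["player", "first_name", "team-name.v2", "2nd", "foot ball"]:
        print(candidate, looks_like_xml_name(candidate))
```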
Parsed Character …
  An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
  E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
    used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs
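A user-defined entity and its reference can be sketched as follows, using an internal DTD subset and Python's standard xml.etree.ElementTree, whose underlying parser expands internal entities. The entity name club and its value are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# The entity "club" is declared out of the XML tree, in the internal DTD subset,
# and used inside the tree through the entity reference &club;
DOCUMENT = """<!DOCTYPE player [
  <!ENTITY club "Bologna F.C.">
]>
<player>Carlo Nervo plays for &club;</player>"""

# Same reference, but no declaration: the parser has nothing to expand.
UNDECLARED = "<player>Carlo Nervo plays for &club;</player>"


def content_of(xml_text):
    """Return the parsed character data of the root element."""
    return ET.fromstring(xml_text).text


if __name__ == "__main__":
    print(content_of(DOCUMENT))
    try:
        content_of(UNDECLARED)
    except ET.ParseError as error:
        print("rejected:", error)
```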
Pre-defined XML …
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
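Escaping by hand is error-prone, so most libraries offer helpers for the five pre-defined entities. A minimal sketch with Python's standard xml.sax.saxutils follows; escape and unescape handle &, < and > by default, and the QUOTES table and the wrapper names to_markup and from_markup are assumptions added here to cover the two quote entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; quotes are passed as extra entities.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def to_markup(text):
    """Replace the five special characters with their pre-defined entities."""
    return escape(text, QUOTES)


def from_markup(markup):
    """Turn the five pre-defined entity references back into characters."""
    return unescape(markup, {entity: char for char, entity in QUOTES.items()})


if __name__ == "__main__":
    print(to_markup("Batman & Robin"))
    print(from_markup("&lt;player&gt;"))
    # A parser performs the same replacement while reading character data.
    print(ET.fromstring("<s>" + to_markup("a < b") + "</s>").text)
```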
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
  CDATA Section
    between <![CDATA[ and ]]>
    can contain anything but its own delimiters
  After parsing, no way to tell whether a text came from a CDATA section or not
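The last point can be checked directly: the same text written once inside a CDATA section and once with entity references gives the same parsed character data. A minimal sketch with Python's standard xml.etree.ElementTree, where the script element and its content are made up for illustration:

```python
import xml.etree.ElementTree as ET

WITH_CDATA = "<script><![CDATA[if (a < b && b < c) run();]]></script>"
WITH_ESCAPES = "<script>if (a &lt; b &amp;&amp; b &lt; c) run();</script>"


def text_of(xml_text):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(xml_text).text


if __name__ == "__main__":
    print(text_of(WITH_CDATA))
    print(text_of(WITH_CDATA) == text_of(WITH_ESCAPES))
```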
Comments
  Easy
    <!-- Comment -->
  It cannot contain --, nor can it end with --->
  Comments do not affect the document tree-structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
  Parsers may either drop or keep them, at their will
  Comments are meant to improve human legibility of XML docs
    to give info to computational agents: processing instructions
XML Processing …
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions have this very end
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear everywhere out of a tag, even before or after the root
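Whether comments and processing instructions survive parsing depends on the tool. As a sketch, Python's standard xml.dom.minidom keeps both as nodes next to the root element, so they can be listed; the style sheet name style.css and the function name top_level_nodes are hypothetical.

```python
from xml.dom import Node, minidom

DOCUMENT = (
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<!-- a comment before the root -->"
    "<player>Carlo Nervo</player>"
)


def top_level_nodes(xml_text):
    """Return (node type, node name) for each node at document level."""
    doc = minidom.parseString(xml_text)
    return [(node.nodeType, node.nodeName) for node in doc.childNodes]


if __name__ == "__main__":
    for node_type, name in top_level_nodes(DOCUMENT):
        print(node_type, name)
```

For a processing instruction the node name is its target, which is how an application recognises the instructions meant for it.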
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but, if present, it must be the very first thing in the document
      not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
  Standalone set to yes means that the document has no external DTD
    optional, default no
Checking Well-…
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    Just look around
DTD
Flexibility or …
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature, within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      like, a football player should have at least one team
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, as well-formedness is
    how to handle validity errors is optional
Validation
  A valid XML Document includes a DTD the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  Many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
A Simple DTD
  We do not go too deep into DTD syntax
    we just look at the example here and comment
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declaration
  DTD is declared here as internal
    but could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declarations
  So you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
  ANY for free-form content
  EMPTY for empty elements
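The operators above are those of regular expressions, applied to sequences of child elements instead of characters. The sketch below makes that correspondence explicit by translating an element-only content model into a Python regular expression and matching it against the names of an element's children. It is an illustration of the idea, not a DTD validator: #PCDATA, ANY, EMPTY and attributes are not handled, and the names model_to_regex and children_match are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# An element name, or one of the content-model operators.
_TOKEN = re.compile(r"[A-Za-z_][\w.\-]*|[(),|*+?]")


def model_to_regex(model):
    """Translate an element-only content model into a regular expression.

    Each child is represented as its name followed by one space, so the
    sequence operator ',' becomes plain concatenation.
    """
    parts = []
    for token in _TOKEN.findall(model):
        if token == ",":
            continue
        if token == "(":
            parts.append("(?:")
        elif token in {")", "|", "*", "+", "?"}:
            parts.append(token)
        else:
            parts.append("(?:%s )" % re.escape(token))
    return "".join(parts)


def children_match(element, model):
    """Return True if the children of element satisfy the content model."""
    names = "".join(child.tag + " " for child in element)
    return re.fullmatch(model_to_regex(model), names) is not None


if __name__ == "__main__":
    model = "(name, surname, team+)"
    player = ET.fromstring("<player><name/><surname/><team/><team/></player>")
    no_team = ET.fromstring("<player><name/><surname/></player>")
    print(children_match(player, model), children_match(no_team, model))
```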
Attribute …
  A team element has a current attribute
    which is mandatory
      #IMPLIED would say optional, instead
    and can be either yes or no
      enumeration as an attribute type
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    more than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by white spaces
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name unique in the document, working as an identifier
  IDREF, IDREFS
Other DTD …
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are …
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for …
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes …
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
    then you can use local, euro, and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
  URIs are standardised, not prefixes
    but usually svg, rdf, and other prefixes are not …
Setting Default …
  xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need for the svg prefix to be made explicit
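The set example from the namespace slides can be sketched as follows. Python's standard xml.etree.ElementTree resolves prefixes and default namespaces while parsing and reports every element name in the form {uri}local_part, so the two set elements stay distinct. The prefix m, the attribute values and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Default namespace for SVG, explicit prefix "m" for MathML.
DOCUMENT = (
    '<svg xmlns="%s" xmlns:m="%s" width="10" height="10">'
    "<set/><m:set/>"
    "</svg>" % (SVG, MATHML)
)


def split_name(tag):
    """Split ElementTree's '{uri}local' notation into (uri, local_part)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


def element_names(xml_text):
    """Return (namespace URI, local part) for every element, in document order."""
    root = ET.fromstring(xml_text)
    return [split_name(element.tag) for element in root.iter()]


if __name__ == "__main__":
    for uri, local in element_names(DOCUMENT):
        print(local, "in", uri)
```

What identifies the namespace is the URI: replacing the prefix m with any other prefix bound to the same URI gives the same result.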
Internationalisation
What does Text …
  "Text" can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
      character set
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
      UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared
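The difference between a character set (characters to code points) and an encoding (code points to bytes) can be seen on a single word. In this sketch the sample word, the word element and the function names are assumptions; the parser used is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

WORD = "Universit\u00e0"  # ten characters, the last one is not ASCII


def code_points(text):
    """Character set: map each character to its integer code point."""
    return [ord(character) for character in text]


def as_document(text, encoding):
    """Build a one-element document that declares its encoding, as bytes."""
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    return (declaration + "<word>" + text + "</word>").encode(encoding)


if __name__ == "__main__":
    print(code_points(WORD))
    # Same code points, different byte sequences.
    print(len(WORD.encode("utf-8")), len(WORD.encode("utf-16-le")))
    # Thanks to the declaration, the parser recovers the same text either way.
    for encoding in ("utf-8", "iso-8859-1"):
        print(ET.fromstring(as_document(WORD, encoding)).text)
```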
The XML Encoding …
  Part of the XML Declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments, or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, but no longer an XML declaration
Multi-Lingual …
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two letters for each language known
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
Encoding for …
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style …
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://www.w3.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
  [example shown as an image on the original slide]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing directive
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css" ?>
  CSS style sheet defining presentation style for the XML document tags
    tagname { attribute1: value1; … }
XML + CSS Example: The XML Doc
  [document shown as an image on the original slide]
Example: How Mozilla Visualises it
  [two screenshots on the original slides]
DOM & SAX
Manipulating XML …
  Representing information in an XML Document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Mostly, we need to manipulate
    access, delete, modify
    parts of an XML document
      which either may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are …
Document Object …
  http://www.w3.org/DOM
    standard W3C, as usual
  "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be one of
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kind of child nodes they should / could have
Properties & …
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
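The general properties and methods named above look the same in every DOM binding. As a sketch, here they are in Python's standard xml.dom.minidom, a lightweight DOM implementation; the shortened player document and the helper names describe and add_team are assumptions made for illustration.

```python
from xml.dom import Node, minidom

PLAYER = '<player><name>Carlo</name><team current="yes">Bologna</team></player>'


def describe(node):
    """Return the three properties every DOM node has: name, value, type."""
    return node.nodeName, node.nodeValue, node.nodeType


def add_team(xml_text, team_name, current):
    """Load the whole document in memory and append a new team element."""
    doc = minidom.parseString(xml_text)
    team = doc.createElement("team")
    team.setAttribute("current", current)
    team.appendChild(doc.createTextNode(team_name))
    # appendChild puts the new node after the other child nodes of the root
    doc.documentElement.appendChild(team)
    return doc


if __name__ == "__main__":
    document = add_team(PLAYER, "Mantova", "no")
    root = document.documentElement
    print(describe(root))
    for child in root.childNodes:
        print(child.nodeName, dict(child.attributes.items()))
```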
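A Simple Java DOM …
  // Assumption: DOMParser is the Xerces class org.apache.xerces.parsers.DOMParser
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      // parse the file named on the command line into a DOM tree
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // scan the children of the root for the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);   // helper that prints a node; not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }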
Main Problem of …
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
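The event-based model can be sketched with Python's standard xml.sax module: the application supplies a handler, and the parser calls its methods while the text flows through, without ever building a tree. The handler class TeamHandler and the function read_teams are assumptions made for illustration.

```python
import xml.sax

PLAYER = (
    "<player><name>Carlo</name><surname>Nervo</surname>"
    '<team current="yes">Bologna</team><team current="no">Mantova</team></player>'
)


class TeamHandler(xml.sax.ContentHandler):
    """Records the events and collects (current, team name) pairs."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.teams = []
        self._in_team = False
        self._current = None
        self._text = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)
        if name == "team":
            self._in_team = True
            self._current = attrs.get("current")
            self._text = []

    def characters(self, content):
        if self._in_team:
            self._text.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        self.events.append("end " + name)
        if name == "team":
            self.teams.append((self._current, "".join(self._text)))
            self._in_team = False


def read_teams(xml_text):
    """Stream the document through the SAX parser and return the handler."""
    handler = TeamHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


if __name__ == "__main__":
    result = read_teams(PLAYER)
    print(result.teams)
    print(result.events)
```

Only the data the handler chooses to keep stays in memory, which is the point of the approach when documents are large.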
Where is XML
Everywhere already
16
Some History of XML Lot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGML
but had obvious limitationstoo complex
more than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
18
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
45
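The element and attribute declarations above can be checked mechanically by a validating parser. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name DtdValidationSketch and the method isValid are illustrative and do not come from the slides. It prepends the internal DTD of the player example to a document body and reports whether any validity error was raised.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Internal DTD from the player example; the DOCTYPE name matches the root element.
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns true if DTD + body parses without any validity or well-formedness error. */
    static boolean isValid(String body) {
        final boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for DTD validation, off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                // Validity violations are reported as recoverable errors.
                @Override public void error(SAXParseException e) { valid[0] = false; }
                // Well-formedness violations are fatal.
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            return false;
        }
        return valid[0];
    }

    public static void main(String[] args) {
        String good = "<player><name>Carlo</name><surname>Nervo</surname>"
                    + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("with a team: " + isValid(good));
        System.out.println("without a team: " + isValid(noTeam));
    }
}
```

A player without any team breaks the team+ suffix, and a team without the current attribute breaks #REQUIRED, so both are reported as invalid even though they are well-formed.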
Attribute
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
47
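The literal default can be observed from a program: when the attribute is missing, the parser supplies the declared value. A minimal sketch, assuming the JAXP parser bundled with the JDK; the DTD below is a hypothetical variant of the team declaration with "no" as the default, and the names AttributeDefaultSketch and currentOf are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaultSketch {
    // Hypothetical variant of the slide's ATTLIST: a literal default instead of #REQUIRED.
    static final String DTD =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\">\n"
        + "]>\n";

    /** Parses DTD + team element and returns the value of its current attribute. */
    static String currentOf(String teamElement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(DTD + teamElement)));
            return doc.getDocumentElement().getAttribute("current");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // No attribute in the document: the default from the DTD is supplied.
        System.out.println(currentOf("<team>Mantova</team>"));
        // Attribute given explicitly: the document's own value wins.
        System.out.println(currentOf("<team current=\"yes\">Bologna</team>"));
    }
}
```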
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces for
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  also called QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
URIs are standardised, not prefixes
  but customary prefixes such as svg and rdf are usually kept
53
Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
54
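What matters to a namespace-aware parser is the pair of namespace URI and local part, not the prefix. A minimal sketch, assuming the JAXP parser bundled with the JDK; the names NamespaceSketch and expandedRootName are illustrative. The same svg element is written once with a default namespace and once with an arbitrary prefix, and both resolve to the same expanded name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceSketch {
    /** Returns the root element's name in the form {namespace-URI}local-part. */
    static String expandedRootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String svgNs = "http://www.w3.org/2000/svg";
        // Default namespace: no prefix on the element.
        System.out.println(expandedRootName("<svg xmlns=\"" + svgNs + "\"/>"));
        // Explicit prefix: a different spelling of the same name.
        System.out.println(expandedRootName("<s:svg xmlns:s=\"" + svgNs + "\"/>"));
    }
}
```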
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
  character set
    mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
56
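The difference between a code point and its encodings can be made concrete with a few lines of code. A minimal sketch using only the Java standard library; the names EncodingSketch and byteLength are illustrative. One character (è, code point U+00E8) is turned into bytes under three encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    /** Number of bytes needed to store the text in the given encoding. */
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE
        // The code point is a property of the character set (Unicode), not of the encoding.
        System.out.println("code point: U+00" + Integer.toHexString(eGrave.codePointAt(0)).toUpperCase());
        // The byte count depends on the encoding chosen for that same code point.
        System.out.println("UTF-8:      " + byteLength(eGrave, StandardCharsets.UTF_8) + " bytes");
        System.out.println("UTF-16BE:   " + byteLength(eGrave, StandardCharsets.UTF_16BE) + " bytes");
        System.out.println("ISO-8859-1: " + byteLength(eGrave, StandardCharsets.ISO_8859_1) + " byte");
    }
}
```

UTF_16BE is used rather than UTF_16 so that no byte-order mark is added to the count.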
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  see also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-
58
Encoding for Portability
Dealing with encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises it
66
Example: How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes they should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
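The general node properties and methods listed above can be tried out on the player example. A minimal sketch, assuming the org.w3c.dom API and the JAXP parser bundled with the JDK; the names DomWalkSketch, parse, outline and outlineOf are illustrative. It prints one line per node with its name, its value (for text nodes) and its attributes, then appends a new child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalkSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends one indented line per node: name, value (text nodes only), attributes. */
    static void outline(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(node.getNodeName());
        if (node.getNodeType() == Node.TEXT_NODE) out.append('=').append(node.getNodeValue());
        NamedNodeMap attrs = node.getAttributes(); // null for anything but elements
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                out.append(' ').append(a.getNodeName()).append('=').append(a.getNodeValue());
            }
        }
        out.append('\n');
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            outline(c, depth + 1, out);
        }
    }

    static String outlineOf(String xml) {
        StringBuilder out = new StringBuilder();
        outline(parse(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name>"
                           + "<team current=\"yes\">Bologna</team></player>");
        // appendChild puts the new element after the existing children.
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode("Mantova"));
        doc.getDocumentElement().appendChild(team);
        StringBuilder out = new StringBuilder();
        outline(doc.getDocumentElement(), 0, out);
        System.out.print(out);
    }
}
```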
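A Simple Java DOM Example
import java.io.PrintStream;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
// DOMParser is presumably the Apache Xerces class; the slide does not show its import
import org.apache.xerces.parsers.DOMParser;

public static void main(String[] args) {
  try {
    // Parse the file named on the command line into an in-memory tree
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Walk the children of the root until the first one named "recipe"
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    // print(Node, PrintStream) is a helper that is not shown on the slide
    if (n != null)
      print(n, out);
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}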
74
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
76
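The event-based model can be seen at work by logging the callbacks a SAX parser makes while the document flows through it. A minimal sketch, assuming the SAX parser bundled with the JDK; the names SaxEventsSketch and events are illustrative. No tree is built: the handler only sees one event at a time.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    /** Parses the document and returns the sequence of SAX events as readable strings. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            // A parser may deliver one run of text in several chunks.
            @Override public void characters(char[] ch, int start, int length) {
                log.add("chars:" + new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><team>Bologna</team></player>")) {
            System.out.println(event);
        }
    }
}
```

Since nothing is retained between events unless the handler stores it, memory use stays roughly constant however large the document is, which is the trade-off against DOM mentioned above.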
Hey, too many
Application domains are more and more
  numerous
  complex
  specific
Special / specialised languages as the engineer's tools
  to represent, denote & express behaviours and computations
Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  to know and understand computational models and paradigms
  to select languages and paradigms
  to define and build new languages
8
XML Applications
XML per se is "small" & simple
  languages defined via XML are instead so many and complex
XML Applications
  XML-defined markup languages
    defined through a precise syntax
    DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are W3C
  such as
    XSLT
    XML Schema
    XHTML
9
XML for Portable Data
Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard formats
Text, text, text
  both data and markup
  all in the XML file
XML document structure simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere
10
How XML Looks like
<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head>
    <title>This is my document</title>
  </head>
  <body>
    <p>A list of things I like:</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>
11
How XML Looks like
12
How to Work with XML
XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works
13
What is an XML Document
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text
<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber> 0000145678 </studentnumber>
  <course>2036</course>
</student>
14
How does XML
Who handles XML documents
  after they have been produced
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML "Lite" (1996: Bosak, Bray, et al.)
  XML 1.0 (February 1998)
Then, a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML
<player>
  Carlo Nervo
</player>
19
XML Document & Files
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player>
  Carlo Nervo
</player>
20
XML Elements & Tags
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player>
  Carlo Nervo
</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
  <tag /> for empty tags
    tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …
22
XML Trees: A Simple Example
player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is a Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element other than the root has exactly one parent
    child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong</em> XML</b> is not permitted
  nesting needs to be perfect, overlapping is not allowed
24
Narrative-Organised
<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair, of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or Attributes
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
28
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities
Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote
31
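When a program writes character data into an XML document, the five pre-defined entities are what it has to emit in place of the raw characters. A minimal sketch using only the Java standard library; the names XmlEscapeSketch and escape are illustrative. It escapes all five, which is more than strictly needed in element content but safe in attribute values too.

```java
class XmlEscapeSketch {
    /** Replaces the five characters that have pre-defined entities with their references. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The superheroes example: the ampersand must be written as &amp;
        System.out.println("<superheroes>" + escape("Batman & Robin") + "</superheroes>");
        System.out.println(escape("if (a < b) print(\"ok\")"));
    }
}
```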
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
32
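That a CDATA section and escaped character data carry the same text can be checked directly. A minimal sketch, assuming the JAXP parser bundled with the JDK; the names CdataSketch and textOf are illustrative. The same code fragment is written once inside a CDATA section and once with entity references, and the text content read back is compared.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataSketch {
    /** Parses the document and returns the text content of its root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Raw < and & are allowed inside the CDATA section.
        String viaCdata = textOf("<code><![CDATA[if (a < b && c) {}]]></code>");
        // Outside a CDATA section they must be escaped.
        String viaEntities = textOf("<code>if (a &lt; b &amp;&amp; c) {}</code>");
        System.out.println(viaCdata);
        System.out.println(viaCdata.equals(viaEntities));
    }
}
```

The DOM API itself still keeps a separate node type for CDATA sections, so the comparison here is on text content rather than on node types.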
Comments
Easy
  <!-- Comment -->
  it cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode
Standalone, when set to yes, means that the document does not depend on an external DTD
  optional, default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
36
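Any XML parser can serve as a well-formedness checker, since it must refuse documents that break these rules. A minimal sketch, assuming the JAXP parser bundled with the JDK; the names WellFormedSketch and isWellFormed are illustrative. It parses without validation and reports whether a fatal error occurred.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {
    /** Returns true if the text parses as XML, false on any well-formedness error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to the console.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));  // proper nesting
        System.out.println(isWellFormed("<em><b>Wrong</em> XML</b>"));  // overlapping elements
        System.out.println(isWellFormed("<player>Carlo</Player>"));     // case mismatch
        System.out.println(isWellFormed("<team current=yes/>"));        // unquoted attribute value
    }
}
```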
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
XML Applications
XML per se is "small" & simple
  the languages defined via XML are instead many and complex
XML applications are XML-defined markup languages
  defined through a precise syntax: a DTD or an XML Schema
  they may be either standard or custom
Most standard XML applications are W3C standards
  such as XSLT, XML Schema, XHTML
9
XML for Portable Data
Cross-platform, long-term data format
  passing XML data through space and time
  relying on Unicode and a standard text-based format
Text, text, text
  both data and markup
  all in the XML file
XML document structure is simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere
10
How XML Looks Like
<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head>
    <title>This is my document</title>
  </head>
  <body>
    <p>A list of things I like</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>
11
How XML Looks like
12
How to Work with XML
XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but general text editors with some programming / Web-oriented capabilities are typically good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works
13
What is an XML Document
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text
<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber>0000145678</studentnumber>
  <course>2036</course>
</student>
14
How does XML Work
Who handles an XML document after it has been produced? How? Why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness, that is, basic respect of XML syntax
XML validating parsers
  when applicable, that is, when there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
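What a non-validating parser does can be sketched in a few lines of Java. This is only an illustration, assuming the JAXP parser bundled with the JDK; the class name WellFormedCheck and the method isWellFormed are hypothetical names introduced here, not part of the course material.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    /** Returns true if the text parses as well-formed XML (no validation is attempted). */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player>Carlo Nervo</player>"));
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>")); // overlapping elements
    }
}
```

The second call fails because the em and b elements overlap, so the parser stops with a fatal error before any tree is built.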
15
Where is XML
Everywhere already
16
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex: more than 150 pages of specification, never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML Document
<player>
  Carlo Nervo
</player>
19
XML Document amp
This is a complete XML document
It can be stored, recorded or built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player>
  Carlo Nervo
</player>
20
XML Elements amp
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player>
  Carlo Nervo
</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML …
22
XML Trees: A Simple Example
player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is a Tree
An XML document has a tree-like structure
  one and only one root
    the root element, or document element
  each element can have zero or more child elements
  each element other than the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect: overlapping is not allowed
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
24
Narrative-Organised XML Documents
<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    or in the only tag of an empty element
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and allow spaces before and after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
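To see attributes acting as qualifiers of nodes, the sketch below reads the current attribute of each team element of the player document. It assumes the JAXP DOM parser bundled with the JDK; CurrentTeam and currentTeam are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CurrentTeam {
    /** Returns the content of the first team element whose current attribute is "yes", or null. */
    static String currentTeam(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                // the attribute qualifies the node; it is not a child in the element tree
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // the second team uses the alternative single-quote form
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team>"
                + "<team current = 'no'>Mantova</team></player>";
        System.out.println(currentTeam(xml));
    }
}
```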
26
Using Elements or Attributes
Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML names are used for the names of elements, attributes and some other constructs, with the same rules for all of them
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:), which is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
28
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless the escape character & is encountered
  character data to parse starts again after the closing ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entity; is an entity reference
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities
Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote
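The table of pre-defined entities translates directly into a small escaping routine. The following is a minimal sketch written by hand for illustration (XmlEscape and escape are hypothetical names); production code would normally rely on the escaping done by an XML library when serialising.

```java
public class XmlEscape {
    /** Replaces the five characters that have pre-defined entities with their references. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Batman & Robin"));
        System.out.println(escape("if (a < b) print('ok')"));
    }
}
```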
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
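The last point can be checked with a short experiment: the same text written with entity references and inside a CDATA section yields the same character data once parsed. The sketch assumes the JAXP DOM parser bundled with the JDK; CdataDemo and rootText are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataDemo {
    /** Returns the character data contained in the root element of the given document. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<code>if (a &lt; b &amp;&amp; b &lt; c)</code>";
        String cdata = "<code><![CDATA[if (a < b && b < c)]]></code>";
        System.out.println(rootText(escaped));
        System.out.println(rootText(cdata));
        System.out.println(rootText(escaped).equals(rootText(cdata)));
    }
}
```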
32
Comments
Easy
  <!-- Comment -->
  a comment cannot contain --, so it cannot end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or another comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give information to computational agents, use processing instructions instead
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, but not an element
  it can appear anywhere outside a tag, even before or after the root element
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
  not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default UTF-8 (or UTF-16 when signalled by a byte order mark)
standalone="yes" means that the document does not depend on external markup declarations, such as an external DTD
  optional, default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  e.g., a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify only what is allowed (positive examples)
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example on this slide and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
DTD Declaration
The DTD is declared here as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
  in any case, the name after DOCTYPE must be the type of the root element (here player)
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
42
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
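The effect of these declarations can be tried out with a validating parser. The sketch below assumes the JAXP parser bundled with the JDK, and writes the DOCTYPE name as player so that it matches the root element, as validity requires. DtdValidation, DTD and isValid are hypothetical names introduced for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DtdValidation {
    // Internal DTD subset from the slides; the DOCTYPE name is assumed to be "player"
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns true if the document is well-formed and valid against its own DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported to the handler, so rethrow them
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<player><name>Carlo</name><surname>Nervo</surname>"
                  + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(isValid(DTD + ok));
        System.out.println(isValid(DTD + noTeam)); // team+ demands at least one team
    }
}
```

Both documents are well-formed; only the first is valid, since the second has no team element.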
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses are for grouping
  at any level of nesting
  operators and suffixes are applicable at any level
ANY for free-form content
EMPTY for empty elements
45
Attribute Declarations
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, the attribute has the given value
literal
  the default value is the quoted literal string
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one token, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to ID values of elements in the same document
48
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both the SVG and the MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  also called QNames or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but they could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but conventional prefixes such as svg and rdf are usually kept
53
Setting Default Namespaces
xmlns attribute
  alone, with no :prefix suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
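A namespace-aware parser makes the effect of a default namespace visible: an unprefixed child element reports the namespace URI declared on its ancestor. The sketch assumes the JAXP DOM parser bundled with the JDK; DefaultNamespace and namespaceOfFirst are hypothetical names, and the rect child is just an illustrative SVG element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DefaultNamespace {
    /** Returns the namespace URI of the first element with the given local name, or null. */
    static String namespaceOfFirst(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches elements in any namespace
            NodeList nodes = doc.getElementsByTagNameNS("*", localName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns=\"http://www.w3.org/2000/svg\"><rect/></svg>";
        System.out.println(namespaceOfFirst(svg, "rect"));
        // the same namespace, this time bound to an explicit prefix
        String prefixed = "<s:svg xmlns:s=\"http://www.w3.org/2000/svg\"><s:rect/></s:svg>";
        System.out.println(namespaceOfFirst(prefixed, "rect"));
    }
}
```

Both calls report the same URI, which illustrates that the URI, not the prefix, identifies the namespace.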
54
Internationalisation
55
What does Text
"Text" can be written in many different alphabets
A character set is a mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
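The difference between a code point and its encodings can be observed directly in Java. In this small sketch (EncodingDemo and byteCount are hypothetical names) the same character, è, has one code point but a different byte length depending on the encoding.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    /** Number of bytes needed to encode the text in the named character encoding. */
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String e = "\u00e8"; // è, LATIN SMALL LETTER E WITH GRAVE
        System.out.println(e.codePointAt(0));           // the code point, as an integer
        System.out.println(byteCount(e, "ISO-8859-1")); // Latin-1
        System.out.println(byteCount(e, "UTF-8"));
        System.out.println(byteCount(e, "UTF-16BE"));   // big-endian, no byte order mark
    }
}
```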
56
The XML Encoding Declaration
Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  see also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, as for user-defined tags
    prefix x-
58
Encoding for Portability
Working on encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  the Web-based one: XML + CSS
  the XML-based one: XSL
In the following, we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://www.w3.org/Style/CSS/
Goals
  describing how to present the elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things are needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining the presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML Documents
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
69
Document Object Model
http://www.w3.org/DOM/
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, entity reference, notation
It also defines which kinds of child nodes each of them may have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
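In the Java binding of DOM, the general properties become getter methods (getNodeName, getNodeValue, getAttributes) next to appendChild. The sketch below assumes the org.w3c.dom interfaces and the JAXP parser bundled with the JDK; DomMethods and appendTeam are hypothetical names introduced for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomMethods {
    /** Appends a new team element to the root, then lists all teams as name[current]. */
    static String appendTeam(String xml, String name, String current) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element team = doc.createElement("team");
            team.setAttribute("current", current);
            team.appendChild(doc.createTextNode(name));
            doc.getDocumentElement().appendChild(team); // goes after the existing children

            StringBuilder out = new StringBuilder();
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Node t = teams.item(i);
                if (out.length() > 0) out.append(",");
                // getAttributes() returns all the attributes of the node
                String value = t.getAttributes().getNamedItem("current").getNodeValue();
                out.append(t.getTextContent()).append("[").append(value).append("]");
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<player><team current=\"yes\">Bologna</team></player>";
        System.out.println(appendTeam(xml, "Mantova", "no"));
    }
}
```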
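A Simple Java DOM Example
  // DOMParser is presumably the Apache Xerces parser (org.apache.xerces.parsers.DOMParser);
  // Document and Node come from org.w3c.dom, PrintStream from java.io.
  // print(Node, PrintStream) is a helper that is not shown on the slide.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                        // loads the whole document in memory
      Document doc = p.getDocument();
      // look for the first child of the root element named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }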
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has had problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as it goes: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
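The event-based model can be illustrated with a handler that simply records the events it receives while the document flows through the parser, without ever building a tree. The sketch assumes the SAX parser bundled with the JDK; SaxEvents and events are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEvents {
    /** Parses the text and returns the sequence of SAX events as readable strings. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser may deliver one run of text in several chunks
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

Only the event log is kept in memory here; an application would typically keep just the state it needs, which is what makes SAX attractive for large documents.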
76
XML for Portable
Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format
Text text textboth data and markupall in the XML file
XML document structure simple amp cleareasy to parsewell-documented
That is why XML is already everwhere
10
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like
12
How to Work with
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML It can be
A text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
14
How does XML Who handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML Lot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGML
but had obvious limitationstoo complex
more than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
18
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere outside a tag, even before or after the root
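How an application might pick up processing instructions placed before the root can be sketched with the JDK DOM API. The class name, the targets method and the player.css file name are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ProcessingInstructionDemo {
    // Lists "target -> data" for every processing instruction at document level.
    static List<String> targets(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    result.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
                + "<!-- a comment, which the parser may drop -->"
                + "<player>Carlo Nervo</player>";
        System.out.println(targets(xml));
    }
}
```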
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default UTF-8 (or UTF-16, when signalled by a byte-order mark)
standalone="yes" means that the document does not depend on an external DTD
  optional, default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
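Besides Web tools, any XML parser can act as a well-formedness checker, since it has to report a fatal error on text that breaks these rules. A minimal sketch with the JDK parser, where WellFormednessChecker is a hypothetical name:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormednessChecker {
    // Returns true if the text parses without a fatal (well-formedness) error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the text
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player>Carlo Nervo</player>"));    // true
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));      // false: overlapping elements
        System.out.println(isWellFormed("<team current=yes>Bologna</team>")); // false: unquoted attribute value
        System.out.println(isWellFormed("<a>1 < 2</a>"));                    // false: unescaped <
    }
}
```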
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like "a football player should have at least one team"
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but, after all, easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
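A validating parser can check a document against such a DTD. The sketch below uses the JDK parser with validation switched on; DtdValidation and validate are hypothetical names, and the DTD is written with player as the document type name so that it matches the root element, as validity requires.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidation {
    static final String DTD = "<!DOCTYPE player ["
            + "<!ELEMENT player (name, surname, team+)>"
            + "<!ELEMENT name (#PCDATA)>"
            + "<!ELEMENT surname (#PCDATA)>"
            + "<!ELEMENT team (#PCDATA)>"
            + "<!ATTLIST team current (yes | no) #REQUIRED>"
            + "]>";

    // Returns the validity errors reported by the parser (empty if the document is valid).
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validate(valid).isEmpty());  // true: the document is valid
        System.out.println(validate(noTeam).isEmpty()); // false: team+ needs at least one team
    }
}
```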
41
DTD Declaration
In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements
45
Attribute
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
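The effect of a literal default can be observed from an application: when the attribute is missing in the document, the parser supplies the declared value. A small sketch with the JDK DOM parser follows; the AttributeDefaults class and the default "no" for current are assumptions made for the example, since the slides declare the attribute as #REQUIRED.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AttributeDefaults {
    // Internal DTD where the current attribute defaults to the literal "no".
    static final String DTD = "<!DOCTYPE team ["
            + "<!ELEMENT team (#PCDATA)>"
            + "<!ATTLIST team current (yes | no) \"no\">"
            + "]>";

    // Returns the value of the current attribute of the first team element.
    static String currentOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element team = (Element) doc.getElementsByTagName("team").item(0);
            return team.getAttribute("current");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(currentOf(DTD + "<team>Mantova</team>"));                 // default applied
        System.out.println(currentOf(DTD + "<team current=\"yes\">Bologna</team>")); // explicit value kept
    }
}
```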
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared in the DTD
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the same document
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
URIs are standardised, not prefixes
  but svg, rdf and other well-known prefixes are usually kept by convention
53
Setting Default Namespaces

xmlns attribute
  alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
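That the URI, and not the prefix, identifies a name can be seen from a namespace-aware parser. A minimal sketch with the JDK DOM parser, where NamespaceDemo and expandedName are illustrative names:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NamespaceDemo {
    // Returns the root element name as {namespace URI}local_part.
    static String expandedName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        // Default namespace and explicit prefix lead to the same expanded name.
        System.out.println(expandedName("<svg xmlns=\"http://www.w3.org/2000/svg\"/>"));
        System.out.println(expandedName("<s:svg xmlns:s=\"http://www.w3.org/2000/svg\"/>"));
        // both print: {http://www.w3.org/2000/svg}svg
    }
}
```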
54
Internationalisation
55
What does Text
"Text" can be encoded according to many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
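The difference between a code point and its encodings can be made concrete with a few lines of standard Java. EncodingDemo is a hypothetical class, and the character à (U+00E0) is just a convenient example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Number of bytes needed to store the text in the given encoding.
    static int byteCount(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String text = "\u00e0"; // the single character à, code point U+00E0
        System.out.println(byteCount(text, StandardCharsets.UTF_8));      // 2
        System.out.println(byteCount(text, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteCount(text, StandardCharsets.ISO_8859_1)); // 1
    }
}
```

The same code point thus ends up as different byte sequences, which is why a reader of the bytes needs to know the encoding.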
56
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
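A spell-checker of the kind mentioned above would need the language in force for each element, which is the element's own xml:lang or that of its nearest ancestor. The sketch below shows one possible lookup with the JDK DOM API; XmlLangDemo and languageOf are made-up names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlLangDemo {
    // Language of the index-th element named tagName: its own xml:lang,
    // or the one of its nearest ancestor; "" if none is declared.
    static String languageOf(String xml, String tagName, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node n = doc.getElementsByTagName(tagName).item(index);
            while (n != null && n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
                n = n.getParentNode(); // climb towards the root
            }
            return "";
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>";
        System.out.println(languageOf(xml, "p", 0)); // en (inherited from doc)
        System.out.println(languageOf(xml, "p", 1)); // it (declared on the element)
    }
}
```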
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C Standard
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM/
  W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
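In Java, the generic properties and methods listed above surface as getNodeName(), getAttributes(), appendChild() and so on in the org.w3c.dom package of the JDK. The sketch below appends a team to a player and then walks the children; DomNodesDemo and appendTeam are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomNodesDemo {
    // Appends a non-current team to the root element, then describes each child element.
    static List<String> appendTeam(String xml, String teamName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element player = doc.getDocumentElement();

            Element team = doc.createElement("team");
            team.setAttribute("current", "no");
            team.appendChild(doc.createTextNode(teamName));
            player.appendChild(team); // goes after the other child nodes

            List<String> result = new ArrayList<>();
            for (Node n = player.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) continue; // skip text nodes
                StringBuilder sb = new StringBuilder(n.getNodeName());
                NamedNodeMap attrs = n.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    sb.append(' ').append(a.getNodeName()).append('=').append(a.getNodeValue());
                }
                result.add(sb.append(": ").append(n.getTextContent()).toString());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        System.out.println(appendTeam(xml, "Mantova"));
        // prints: [name: Carlo, team current=yes: Bologna, team current=no: Mantova]
    }
}
```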
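A Simple Java DOM

    // DOMParser is presumably the Xerces parser (org.apache.xerces.parsers.DOMParser);
    // Document and Node come from org.w3c.dom, PrintStream from java.io.
    // print(Node, PrintStream) is a helper that is not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // look for the first recipe child of the root element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }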
74
Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well-supported as DOM is
  in terms of standardisation
  as well as of tools
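The event-based model can be sketched with the SAX parser bundled with the JDK: the application registers a handler and simply records the events as the document flows by. SaxEventsDemo and events are made-up names for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Returns a log of the SAX events raised while reading the document.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("start document"); }
            @Override public void endDocument() { log.add("end document"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<player><name>Carlo</name></player>"));
    }
}
```

No tree is ever built here: once an event has been handled, the parser keeps nothing about it, which is where the memory savings come from.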
76
How XML Looks Like
<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head> <title>This is my document</title> </head>
  <body>
    <p>A list of things I like</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>
11
How XML Looks like
12
How to Work with
XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works
13
What is an XML Document
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text

<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber> 0000145678 </studentnumber>
  <course>2036</course>
</student>
14
How does XML
Who handles XML documents
  after they have been produced
  how, why
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  SGML itself was too complex
    more than 150 pages
    never implemented fully
  too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML
<player> Carlo Nervo</player>
19
XML Document &

This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player> Carlo Nervo</player>
20
XML Elements &

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player> Carlo Nervo</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
  HTML, JavaScript & XHTML, …
22
XML Trees: A Simple Example

player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is a Tree

An XML document has a tree-like structure
  one and only one root
    root element or document element
  each element can have zero or more child elements
    each element, except the root, has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed
24
Narrative-Organised
<biography><name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can, in principle, be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or
"Attributes are for meta-data about the element, and content is information of the element"
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, while child elements with the same name can be repeated

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing directive
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
69
Document Object
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes each of them should / could have
72
Properties &
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
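As an illustration of the general node properties and of appendChild, here is a small sketch using the DOM API shipped with the JDK (javax.xml.parsers and org.w3c.dom). The class DomNodeDemo and the helpers parse, describe and addTeam are hypothetical names, not part of any standard.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNodeDemo {
    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Every node exposes a name, a numeric type code and a value
    // (the value is null for element nodes).
    static String describe(Node n) {
        return n.getNodeName() + " type=" + n.getNodeType()
                + " value=" + n.getNodeValue();
    }

    // Builds <team current="yes|no">name</team> and appends it
    // after the other children of the player element.
    static Element addTeam(Element player, String name, boolean current) {
        Document doc = player.getOwnerDocument();
        Element team = doc.createElement("team");
        team.setAttribute("current", current ? "yes" : "no");
        team.appendChild(doc.createTextNode(name));
        player.appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<player><name>Carlo</name><surname>Nervo</surname></player>");
        Element player = doc.getDocumentElement();
        addTeam(player, "Bologna", true);
        for (Node n = player.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(describe(n) + " attributes=" + n.getAttributes().getLength());
        }
    }
}
```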
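A Simple Java DOM
// Assumption: DOMParser is the Apache Xerces parser (org.apache.xerces.parsers.DOMParser).
// print(Node, PrintStream) is a helper that is not shown on the slide.
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    // Parse the file named on the command line into a DOM tree
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Scan the children of the root for the first recipe element
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    // Emit that recipe wrapped in a collection element
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}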
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
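The event-based model can be sketched with the SAX parser bundled with the JDK: the handler below only records the events it receives, and no tree is ever built. The class SaxEventsDemo and the method events are hypothetical names; note that a parser is allowed to deliver character content in several chunks, so the text events are not guaranteed to arrive as one piece.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Parses the document and returns the sequence of SAX events seen.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // Skip chunks made only of white space to keep the log short
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Because each event is handled and then forgotten, memory use does not grow with the size of the document, which is the trade-off against DOM described above.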
76
How XML Looks like
12
How to Work with
XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works
13
What is an XML
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text
<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber> 0000145678 </studentnumber>
  <course>2036</course>
</student>
14
How does XML
Who handles XML documents
  after they have been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML "Lite" (1996: Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML
<player> Carlo Nervo</player>
19
XML Document &
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player> Carlo Nervo</player>
20
XML Elements &
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player> Carlo Nervo</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
  HTML, JavaScript & XHTML, …
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is
An XML document has a tree-like structure
  one and only one root
    root element or document element
  each element can have zero or more child elements
    each element except the root has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping not allowed
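A quick way to see these rules at work is to hand a few strings to a parser and observe which ones it rejects. The sketch below uses the JDK parser; the class WellFormedDemo and the method isWellFormed are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedDemo {
    // Returns true when the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness error was reported
        } catch (Exception e) {
            throw new IllegalStateException("parser not available", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<em><b>Right</b> XML</em>",   // properly nested
            "<em><b>Wrong </em> XML</b>",  // overlapping elements
            "<a/><b/>"                     // two root elements
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```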
24
Narrative-Organised
<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
but they are qualifiers for the nodes and
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
28
Parsed Character
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; char
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
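The superheroes example can be checked directly with a parser, and the reverse step (escaping text before putting it inside an element) can be sketched in a few lines. The class EntityDemo and the methods textOf and escape are hypothetical names; escape covers only the five pre-defined entities.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityDemo {
    // Parses the document and returns the character data of its root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Replaces markup characters with the pre-defined entity references.
    // The ampersand must be handled first, or it would be escaped twice.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
        String code = "if (a < b && b > 0)";
        System.out.println(escape(code));
        System.out.println(textOf("<code>" + escape(code) + "</code>").equals(code));
    }
}
```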
29
Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML
Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter
After parsing, no way to tell whether a text came from a CDATA section or not
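A small sketch comparing a CDATA section with the equivalent escaped text: the character data obtained is the same in both cases. Whether the API still lets you see that a CDATA section was used depends on the API and its settings; in the JDK DOM a CDATASection node is kept unless the factory is asked to coalesce it into plain text. The class CdataDemo and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CdataDemo {
    static Document parse(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // When true, CDATA sections are merged into ordinary text nodes
            factory.setCoalescing(coalescing);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Character data of the root element, CDATA or not.
    static String textOf(String xml) {
        return parse(xml, false).getDocumentElement().getTextContent();
    }

    // DOM type code of the first child of the root element.
    static short firstChildType(String xml, boolean coalescing) {
        return parse(xml, coalescing).getDocumentElement()
                .getFirstChild().getNodeType();
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && b < c) run();]]></code>";
        String escaped = "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>";
        System.out.println(textOf(withCdata).equals(textOf(escaped)));
        System.out.println(firstChildType(withCdata, false) == Node.CDATA_SECTION_NODE);
        System.out.println(firstChildType(withCdata, true) == Node.TEXT_NODE);
    }
}
```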
32
Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
33
XML Processing
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
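To show how an application can pick up processing instructions after parsing, here is a sketch that lists the ones found at the top level of a document, next to the root element. The class PiDemo, the method instructions and the file name player.css are hypothetical, chosen only for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class PiDemo {
    // Returns "target -> data" for each processing instruction
    // that is a direct child of the document node.
    static List<String> instructions(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<String> found = new ArrayList<String>();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof ProcessingInstruction) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                found.add(pi.getTarget() + " -> " + pi.getData());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
                + "<player>Carlo Nervo</player>";
        for (String pi : instructions(xml)) {
            System.out.println(pi);
        }
    }
}
```

The XML declaration does not show up in the list, since it is not a processing instruction.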
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode
Standalone "yes" means that the document does not depend on an external DTD
  optional, default no
35
Checking Well-
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional
38
Validation
A valid XML document includes a DTD the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
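The player DTD can be tried out with a validating parser. The sketch below uses the JDK parser with validation switched on and treats any reported validity error as a failure. The class DtdValidationDemo and the method isValid are hypothetical names, and the DOCTYPE name is taken here to be player, since validity requires it to match the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {
    // Internal DTD subset for the player example (DOCTYPE name assumed to be player)
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Returns true when the document is well-formed and valid against its DTD.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity errors are recoverable by default: make them fail
                @Override public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("parser not available", e);
        }
    }

    public static void main(String[] args) {
        String ok = "<player><name>Carlo</name><surname>Nervo</surname>"
                  + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(isValid(DTD + ok));
        System.out.println(isValid(DTD + noTeam));
    }
}
```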
41
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
45
Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world as prefixes everywhere inside that element
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but conventional prefixes such as svg and rdf are usually kept as they are
53
Setting Default
xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need for the svg prefix to be made explicit
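The set example can be sketched with a namespace-aware parser: elements with the same local name are told apart by their namespace URI, whether that URI is bound to a prefix or set as the default. The class NamespaceDemo and the methods parse and count are hypothetical names; the two URIs are the SVG and MathML namespace names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts the elements with the given local name in the given namespace.
    static int count(Document doc, String namespaceUri, String localName) {
        return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
    }

    public static void main(String[] args) {
        Document mixed = parse(
              "<doc xmlns:svg=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
            + "<svg:set/><m:set/><m:set/></doc>");
        System.out.println("SVG set elements: " + count(mixed, SVG, "set"));
        System.out.println("MathML set elements: " + count(mixed, MATHML, "set"));

        // Default namespace: no prefix, yet every element belongs to SVG
        Document plain = parse("<svg xmlns=\"" + SVG + "\"><set/></svg>");
        System.out.println(plain.getDocumentElement().getNamespaceURI());
    }
}
```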
54
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
56
The XML Encoding
Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
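To see the encoding declaration at work, the sketch below serialises a tiny document to bytes in a given encoding and lets the parser decode the raw bytes on its own, guided by the declaration. The class EncodingDeclarationDemo, its methods and the city element are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationDemo {
    // Builds a tiny document whose declaration names the given encoding.
    static String document(String encodingName, String text) {
        return "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>"
             + "<city>" + text + "</city>";
    }

    // Turns the document into bytes with the given charset, then parses the
    // raw bytes: the parser reads the declaration to learn how to decode them.
    static String parseFromBytes(String xml, Charset charset) {
        try {
            byte[] bytes = xml.getBytes(charset);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not decode or parse", e);
        }
    }

    public static void main(String[] args) {
        // "Forli" with a grave accent on the i, escaped to keep this source ASCII
        String text = "Forl\u00EC";
        String latin1 = parseFromBytes(document("ISO-8859-1", text),
                                       StandardCharsets.ISO_8859_1);
        String utf8 = parseFromBytes(document("utf-8", text),
                                     StandardCharsets.UTF_8);
        // The bytes differ between the two encodings, the parsed text does not
        System.out.println(latin1.equals(utf8));
    }
}
```

When the declared encoding does not match the actual bytes, a parser will typically report an error or produce wrong characters, which is why the declaration has to travel with the document.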
57
How to Work with
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML It can be
A text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
14
How does XML Who handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML Lot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGML
but had obvious limitationstoo complex
more than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
18
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
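The last point can be checked directly. The sketch below, again assuming Python's `xml.etree.ElementTree` as the parser, writes the same code chunk once with entity references and once inside a CDATA section, and compares what the application sees.

```python
import xml.etree.ElementTree as ET

# Same content, written with entity references...
plain = ET.fromstring("<code>if (a &lt; b &amp;&amp; c) run();</code>")

# ...and inside a CDATA section, where < and & need no escaping.
cdata = ET.fromstring("<code><![CDATA[if (a < b && c) run();]]></code>")

# After parsing, the two are indistinguishable.
print(plain.text)
print(plain.text == cdata.text)
```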
32
Comments
Easy
  <!-- Comment -->
  it cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve exactly this purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
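A parser that keeps processing instructions hands them to the application as a target plus a data string. A minimal sketch with Python's standard `xml.dom.minidom`; the file name `style.css` is a made-up placeholder.

```python
from xml.dom import minidom

# A processing instruction placed before the root element.
source = (
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<player>Carlo Nervo</player>"
)
doc = minidom.parseString(source)

# The instruction is a node of the document, but not an element.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # the application or identifier the instruction is meant for
print(pi.data)    # everything after the target, passed through untouched
```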
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if present, it must be the very first thing in the document
  not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 when a byte-order mark is present)
Standalone says whether the document can do without an external DTD
  optional, default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
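Any non-validating parser can act as a well-formedness checker, since it must reject documents that break the rules above. A minimal sketch, assuming Python's `xml.etree.ElementTree`; `is_well_formed` is a helper name chosen here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<player>Carlo Nervo</player>",        # fine
    "<em><b>Wrong </em> XML</b>",          # overlapping elements
    "<name/><surname/>",                   # more than one root element
    "<team current=yes/>",                 # unquoted attribute value
    '<team current="yes" current="no"/>',  # duplicated attribute
    "<p>Batman & Robin</p>",               # unescaped ampersand
]
for sample in samples:
    print(is_well_formed(sample), sample)
```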
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional
38
Validation
A valid XML Document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, a DTD specifies positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment on it
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE football_player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE football_player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
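The content-model operators read very much like regular expressions over the sequence of child element names. The sketch below makes that analogy explicit for the model `(name, surname, team+)`. It is a hand-written illustration, not a DTD validator (Python's standard library does not validate against DTDs), and `PLAYER_MODEL` and `matches_player_model` are names invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written translation of the content model (name, surname, team+):
# the sequence operator becomes concatenation, the + suffix stays a +.
# Each child name is followed by one space so that names cannot run together.
PLAYER_MODEL = re.compile(r"name surname (team )+")


def matches_player_model(xml_text: str) -> bool:
    """Check the children of a player element against the content model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return root.tag == "player" and PLAYER_MODEL.fullmatch(children) is not None


valid = ("<player><name>Carlo</name><surname>Nervo</surname>"
         "<team>Bologna</team><team>Mantova</team></player>")
no_team = "<player><name>Carlo</name><surname>Nervo</surname></player>"
wrong_order = ("<player><surname>Nervo</surname><name>Carlo</name>"
               "<team>Bologna</team></player>")

for sample in (valid, no_team, wrong_order):
    print(matches_player_model(sample))
```

A real validating parser also checks attribute declarations and the content of every other element, which this sketch ignores.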
45
Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but well-known prefixes such as svg and rdf are usually left unchanged
53
Setting Default
xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
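That the URI matters and the prefix does not can be seen from what a namespace-aware parser reports. A minimal sketch with Python's `xml.etree.ElementTree`, which names every element as `{namespace URI}local_part`; the prefix `s` below is deliberately arbitrary.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Default namespace: no prefix anywhere in the document.
default_form = ET.fromstring(f'<svg xmlns="{SVG_NS}"><set/></svg>')

# Same namespace bound to an arbitrary prefix.
prefixed_form = ET.fromstring(f'<s:svg xmlns:s="{SVG_NS}"><s:set/></s:svg>')

# Both documents yield the same expanded names.
print(default_form.tag)
print(default_form[0].tag == prefixed_form[0].tag)
```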
54
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so the encoding should be declared
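The difference between a code point and its encodings is easy to see on a word with one non-ASCII letter. A small sketch in plain Python; the sample word is an arbitrary choice.

```python
# "Università": ten characters, the last one outside ASCII.
text = "Universit\u00e0"

# One character set (Unicode), one code point for the accented letter...
code_point = ord(text[-1])

# ...but different byte sequences depending on the encoding.
utf8 = text.encode("utf-8")         # 1 byte per ASCII letter, 2 for the accented one
utf16 = text.encode("utf-16-be")    # 2 bytes per character here
latin1 = text.encode("iso-8859-1")  # 1 byte per character

print(hex(code_point), len(utf8), len(utf16), len(latin1))
```

Reading the bytes back with the wrong encoding yields garbage or an error, which is why the declaration matters.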
56
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
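A tool such as the spell-checker mentioned above has to work out the language in force for each element. The sketch below does so with Python's `xml.etree.ElementTree`, under the assumption (from the XML specification, not from the slide) that an `xml:lang` value also applies to nested elements until overridden. The function name `languages` and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p><p xml:lang="it">Ciao</p></doc>'
)


def languages(element, inherited=None):
    """Yield (tag, language) pairs, passing the language down the tree."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


for tag, lang in languages(doc):
    print(tag, lang)
```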
Encoding for
Dealing with encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing directive
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access, delete, modify
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
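The same properties and methods exist in DOM bindings for other languages too. As a sketch, here they are in Python's standard `xml.dom.minidom`, which follows the W3C naming (`nodeName`, `nodeType`, `attributes`, `appendChild`); the player document is the running example of these slides.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
player = doc.documentElement
team = player.firstChild

# General properties available on every node: name, type, value.
print(team.nodeName, team.nodeType == team.ELEMENT_NODE)
print(team.attributes["current"].value)  # attributes: all attributes of the node
print(team.firstChild.nodeValue)         # the text node holding the content

# Updating the tree: build a new element and append it after the other children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(new_team)

print(player.toxml())
```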
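A Simple Java DOM
// DOMParser is assumed here to be the Apache Xerces class org.apache.xerces.parsers.DOMParser;
// Document and Node come from org.w3c.dom, PrintStream from java.io.
public static void main(String[] args) {
    try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                  // parse the file named on the command line
        Document doc = p.getDocument();
        // look for the first "recipe" child of the root element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out);                 // print(...) is a helper not shown on the slide
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}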
74
Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
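The event-based model can be sketched with Python's standard `xml.sax` module. The handler below only records the events in the order the parser emits them; `EventLog` is a class written for this example, and no tree is ever built.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Collect SAX events as plain strings, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append(f"start {name}")

    def endElement(self, name):
        self.events.append(f"end {name}")

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append(f"text {content.strip()}")


handler = EventLog()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler
)
for event in handler.events:
    print(event)
```

Since the application only sees one event at a time, memory use does not grow with the size of the document, at the price of having to keep track of the context by hand.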
76
What is an XML Document
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text
<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber>0000145678</studentnumber>
  <course>2036</course>
</student>
14
How does XML
Who handles XML documents
  after they have been produced
  how, why
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but had obvious limitations
  too complex
    more than 150 pages
    never implemented fully
  too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML
<player>
  Carlo Nervo
</player>
19
XML Document amp
This is a complete XML document
It can be stored, recorded or built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player>
  Carlo Nervo
</player>
20
XML Elements amp
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player>
  Carlo Nervo
</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
  HTML, JavaScript & XHTML, …
22
XML Trees: A Simple Example

                  player
  name     surname     team        team
  Carlo    Nervo       Bologna     Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is

An XML Document has a tree-like structure
  one and only one root
    root element, or document element
  each node element can have zero or more child elements
    each element, except the root, has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
24
Narrative-Organised
<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>
XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
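A possible way to produce both pieces programmatically, sketched with Python's xml.dom.minidom. The file name player.css and the CSS rules are invented for the example.

```python
from xml.dom.minidom import parseString

doc = parseString("<player><name>Carlo</name></player>")

# Associate a (hypothetical) style sheet through a processing instruction
# placed before the root element.
pi = doc.createProcessingInstruction(
    "xml-stylesheet", 'type="text/css" href="player.css"')
doc.insertBefore(pi, doc.documentElement)
serialized = doc.toxml()
print(serialized)

# Matching style sheet: one rule per element type of the document.
css = (
    "player { display: block; font-family: sans-serif; }\n"
    "name { font-weight: bold; }\n"
)
```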
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
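The general node properties and methods named above can be tried out with any DOM binding. Here is a short sketch with Python's xml.dom.minidom, chosen only for brevity (the slides target Java), on a made-up player document.

```python
from xml.dom.minidom import parseString

doc = parseString('<player><team current="yes">Bologna</team></player>')
player = doc.documentElement
team = player.firstChild          # the <team> element
text = team.firstChild            # its text node

# Every node has a name, a value and a type.
element_info = (team.nodeName, team.nodeValue, team.nodeType == team.ELEMENT_NODE)
text_info = (text.nodeName, text.nodeValue, text.nodeType == text.TEXT_NODE)

# attributes: all the attributes of the node.
attrs = dict(team.attributes.items())

# appendChild(newChild): newChild goes after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(new_team)
serialized = player.toxml()
print(serialized)
```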
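A Simple Java DOM
// Needs java.io.PrintStream, org.w3c.dom.Document, org.w3c.dom.Node and a
// DOMParser class (assumed here: org.apache.xerces.parsers.DOMParser).
// print(Node, PrintStream) is a helper that is not shown on the slide.
public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);                      // the whole document is loaded in memory
    Document doc = p.getDocument();
    Node n = doc.getDocumentElement().getFirstChild();
    // skip siblings until the first recipe element (or the end)
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}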
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
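To make the event-based model concrete, here is a minimal sketch using Python's xml.sax. The handler class EventLogger and the sample document are illustrative, not part of the slides. No tree is built: the handler only records the events as the text flows through the parser.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records SAX events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)

    def characters(self, content):
        # Character content may arrive in several chunks.
        self.text.append(content)

handler = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler)
print(handler.events)
```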
76
How does XML
Who handles XML documents
  after they have been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets
15
Where is XML
Everywhere already
16
Some History of XML
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  SGML itself was too complex
    more than 150 pages
    never implemented fully
    too complex for the Internet
SGML “Lite” (1996, Bosak, Bray et al.)
XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML
<player> Carlo Nervo</player>
19
XML Document &
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player> Carlo Nervo</player>
20
XML Elements &
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element’s content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player> Carlo Nervo</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is a Tree
An XML Document has a tree-like structure
  one and only one root
    root element, or document element
  each node / element can have one or more child elements
  each element except the root has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect: overlapping not allowed
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
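The tree reading of the player document can be checked with a few lines of code. This is a sketch with Python's xml.etree.ElementTree, where outline is a hypothetical helper that prints one tag per line, indented by depth.

```python
import xml.etree.ElementTree as ET

DOC = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

def outline(element, depth=0):
    """Return the element tags, indented by their depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:          # children of the same parent are siblings
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```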
24
Narrative-Organised
<biography><name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>
XML Documents for written narrative such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or Attributes
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
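The naming rules above can be approximated in a few lines. The function is_xml_name below is a hypothetical helper that leans on Python's Unicode-aware str.isalpha and str.isalnum, so it is only a rough stand-in for the exact character classes of the XML specification. The colon is rejected because it is reserved to namespaces.

```python
def is_xml_name(name):
    """Rough check of the naming rules; not the exact XML grammar."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # A name can begin only with a letter, an ideograph or an underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # After that: letters, digits, underscore, hyphen and period.
    return all(ch.isalnum() or ch in "_-." for ch in rest)

checks = {
    name: is_xml_name(name)
    for name in ["player", "_id", "first-name.v2", "名前",
                 "2team", "-x", "first name", "svg:set"]
}
print(checks)
```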
28
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; char
E.g. the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entityreference;
an entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
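A quick round trip through the predefined entities, sketched with Python's xml.sax.saxutils.escape and xml.etree.ElementTree. The code fragment used as content is made up.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "if (a < b && b > 0)"

# escape() rewrites &, < and > as entity references.
escaped = escape(raw)
print(escaped)

# The parser turns the references back into the original characters.
parsed = ET.fromstring("<code>" + escaped + "</code>").text
heroes = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>").text
print(parsed, "|", heroes)
```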
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser “do not parse this”
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own delimiters
After parsing, no way to tell whether a text came from a CDATA section or not
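The last point can be observed directly. In this sketch (Python's xml.etree.ElementTree, invented content) the same text is written once inside a CDATA section and once with entity references, and the parser hands back identical character data.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<code><![CDATA[if (a < b && b > 0)]]></code>").text
with_refs = ET.fromstring(
    "<code>if (a &lt; b &amp;&amp; b &gt; 0)</code>").text

# After parsing, both forms yield the same string.
print(with_cdata == with_refs, with_cdata)
```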
32
Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode
Standalone set to yes means that it has no external DTD
  optional, default no
35
Checking Well-formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
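Any non-validating parser can serve as a well-formedness checker. The sketch below uses Python's xml.etree.ElementTree, and the helper name is_well_formed is illustrative. It runs one small sample per rule listed above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<player>Carlo Nervo</player>",          # fine
    "<player>Carlo</Player>",                # start/end tags do not match
    "<em><b>Wrong </em> XML</b>",            # overlapping elements
    "<a/><b/>",                              # more than one root
    "<team current=yes/>",                   # unquoted attribute value
    '<team current="yes" current="no"/>',    # duplicate attribute
    "<p>Batman & Robin</p>",                 # unescaped &
]
verdicts = [is_well_formed(sample) for sample in samples]
print(verdicts)
```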
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory as well-formedness is
  how to handle errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example above and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE football_player SYSTEM "football_player.dtd">
  even referring to an external, shared resource
    <!DOCTYPE football_player SYSTEM "http://…">
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
42
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of indentation
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
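The operators behave much like regular-expression operators. Python's standard library has no validating parser, so the sketch below is not a DTD validator. It only hand-translates the content model (name, surname, team+) and the enumerated current attribute of the player example into code, with hypothetical names, to show what a validating parser would check.

```python
import re
import xml.etree.ElementTree as ET

# (name, surname, team+) as a regular expression over the child tags,
# each tag followed by a single space.
PLAYER_MODEL = re.compile(r"name surname (team )+")

def matches_player_dtd(xml_text):
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    model_ok = PLAYER_MODEL.fullmatch(children) is not None
    # current (yes | no) #REQUIRED on every team element
    attrs_ok = all(team.get("current") in ("yes", "no")
                   for team in root.findall("team"))
    return root.tag == "player" and model_ok and attrs_ok

valid = ('<player><name>Carlo</name><surname>Nervo</surname>'
         '<team current="yes">Bologna</team>'
         '<team current="no">Mantova</team></player>')
no_team = '<player><name>Carlo</name><surname>Nervo</surname></player>'
wrong_order = ('<player><surname>Nervo</surname><name>Carlo</name>'
               '<team current="yes">Bologna</team></player>')
bad_attr = ('<player><name>Carlo</name><surname>Nervo</surname>'
            '<team current="maybe">Bologna</team></player>')

results = [matches_player_dtd(doc)
           for doc in (valid, no_team, wrong_order, bad_attr)]
print(results)
```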
45
Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespaces
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not
53
Setting Default
xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need for the svg prefix to be made explicit
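A namespace-aware parser makes the disambiguation visible. In this sketch (Python's xml.etree.ElementTree, with invented wrapper documents) the two set elements end up with different expanded names, and a default namespace gives unprefixed elements the same expanded form. The prefix m for MathML is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Two "set" elements from different XML applications in one document.
MIXED = """<doc xmlns:svg="http://www.w3.org/2000/svg"
     xmlns:m="http://www.w3.org/1998/Math/MathML">
  <svg:set/>
  <m:set/>
</doc>"""
mixed_tags = [child.tag for child in ET.fromstring(MIXED)]
print(mixed_tags)

# Default namespace: no prefix needed on svg or on its children.
DEFAULT = '<svg xmlns="http://www.w3.org/2000/svg"><set/></svg>'
root = ET.fromstring(DEFAULT)
default_tags = [root.tag] + [child.tag for child in root]
print(default_tags)
```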
54
Internationalisation
55
What does Text
“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
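The difference between a code point and its encodings is easy to see on a single character. A small sketch in Python, where the character chosen is arbitrary:

```python
char = "à"

# Character set: the character is mapped to an integer (code point).
code_point = ord(char)               # U+00E0

# Character encodings: the same code point mapped onto different bytes.
utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-be")     # big-endian, no byte-order mark
latin1 = char.encode("iso-8859-1")

print(hex(code_point), utf8, utf16, latin1)
```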
56
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
  it is then a text declaration, no longer an XML declaration
57
Where is XML
Everywhere already
16
Some History of XML Lot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGML
but had obvious limitationstoo complex
more than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
18
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
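Any XML parser doubles as a well-formedness checker. A minimal sketch, assuming Python and the standard xml.etree.ElementTree module (the helper name is_well_formed and the sample snippets are made up for the example):

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "matching tags": is_well_formed("<player><name>Carlo</name></player>"),
    "mismatched end tag": is_well_formed("<player><name>Carlo</Name></player>"),
    "overlapping elements": is_well_formed("<p><em><b>Wrong</em> XML</b></p>"),
    "two root elements": is_well_formed("<player/><player/>"),
    "unquoted attribute": is_well_formed("<team current=yes>Bologna</team>"),
    "duplicate attribute": is_well_formed('<team current="yes" current="no"/>'),
    "unescaped ampersand": is_well_formed("<duo>Batman & Robin</duo>"),
}

for rule, accepted in checks.items():
    print(rule, accepted)
```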
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes strict rules are required
  some control over syntax should be enforced
  e.g., a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  handling validity errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but, after all, easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment on it
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
Note: for the document to be valid, the name in the DOCTYPE declaration must match the root element (player)
41
DTD Declaration
In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So, you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and each of them contains just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
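Content models read much like regular expressions over child element names. The sketch below makes that analogy concrete for (name, surname, team+). It is a hand-written check, not a DTD validator (the Python standard library does not include one), and the names PLAYER_MODEL and matches_player_model are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# (name, surname, team+) rewritten as a regular expression over
# the space-terminated names of the child elements
PLAYER_MODEL = re.compile(r"name surname (team )+")

def matches_player_model(xml_text):
    """Check the children of the root element against the content model."""
    player = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in player)
    return PLAYER_MODEL.fullmatch(child_names) is not None

valid = ("<player><name>Carlo</name><surname>Nervo</surname>"
         "<team>Bologna</team><team>Mantova</team></player>")
no_team = "<player><name>Carlo</name><surname>Nervo</surname></player>"
wrong_order = ("<player><surname>Nervo</surname><name>Carlo</name>"
               "<team>Bologna</team></player>")

print(matches_player_model(valid))
print(matches_player_model(no_team))
print(matches_player_model(wrong_order))
```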
45
Attribute Declarations
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would make it optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, the attribute has the given value
literal
  the default value is the literal quoted string
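Even a non-validating parser applies the defaults declared in an internal DTD. A minimal sketch, assuming Python and the standard xml.etree.ElementTree module (built on the expat parser); the sport attribute and the default value "no" for current are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE player [
  <!ATTLIST team
      current (yes | no) "no"
      sport   CDATA      #FIXED "football">
]>
<player>
  <team current="yes">Bologna</team>
  <team>Mantova</team>
</player>"""

teams = ET.fromstring(doc).findall("team")
# Attributes as seen by the application, after defaults have been applied
attributes = [dict(team.attrib) for team in teams]

print(attributes)
```

The second team element never mentions current or sport, yet the application sees both.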
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the DTD
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document
48
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces?
Distinguish
  different XML applications may use the same names
  at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
  to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
  prefix:local_part
  also called QNames, or raw names
Examples of qualified names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but svg, rdf and other well-known prefixes are usually left unchanged
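That the URI counts and the prefix does not can be shown with a short sketch, assuming Python and the standard xml.etree.ElementTree module, which reports each name as {URI}local_part. The prefixes s and m are arbitrary.

```python
import xml.etree.ElementTree as ET

# Two documents using different prefixes for the same namespace URI
doc_a = '<svg:set xmlns:svg="http://www.w3.org/2000/svg"/>'
doc_b = '<s:set xmlns:s="http://www.w3.org/2000/svg"/>'
# Same local name, different namespace (MathML)
doc_c = '<m:set xmlns:m="http://www.w3.org/1998/Math/MathML"/>'

tag_a = ET.fromstring(doc_a).tag
tag_b = ET.fromstring(doc_b).tag
tag_c = ET.fromstring(doc_c).tag

print(tag_a)
print(tag_a == tag_b, tag_a == tag_c)
```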
53
Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
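A minimal sketch of the effect of a default namespace, assuming Python and the standard xml.etree.ElementTree module; the rect child and the attribute values are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">
  <rect width="100" height="50"/>
</svg>"""

root = ET.fromstring(doc)
# Every element, prefixed or not, ends up in the default namespace
tags = [element.tag for element in root.iter()]
# Attributes without a prefix are not placed in the default namespace
attribute_names = sorted(root.attrib)

print(tags)
print(attribute_names)
```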
54
Internationalisation
55
What does Text Mean?
“Text” can be encoded according to many different alphabets
Character set
  mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
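The difference between a code point and its encodings, sketched in Python (a choice made for this example); the character used is arbitrary.

```python
ch = "\u00e8"            # LATIN SMALL LETTER E WITH GRAVE
code_point = ord(ch)     # one integer in the Unicode character set

# One code point, several byte sequences depending on the encoding
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_latin1 = ch.encode("iso-8859-1")

print(hex(code_point), as_utf8, as_utf16, as_latin1)
```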
56
The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration
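A sketch of why the declared encoding matters, assuming Python and the standard xml.etree.ElementTree module; the city element and its content are made up for the example.

```python
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="ISO-8859-1"?><city>Forl\u00ec</city>'
raw_bytes = text.encode("iso-8859-1")   # what would be stored in a file

# The parser reads the declaration and decodes the bytes accordingly
city = ET.fromstring(raw_bytes).text

# The same bytes, wrongly labelled as UTF-8, are rejected
mislabelled = raw_bytes.replace(b"ISO-8859-1", b"utf-8")
try:
    ET.fromstring(mislabelled)
    rejected = False
except ET.ParseError:
    rejected = True

print(city, rejected)
```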
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two-letter codes for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-
58
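A sketch of how a tool such as a spell-checker could work out the language of each element from xml:lang, assuming Python and the standard xml.etree.ElementTree module. The helper name languages and the sample document are made up; child elements are taken to inherit the language of their parent unless they override it.

```python
import xml.etree.ElementTree as ET

# The xml prefix is bound to this namespace by definition
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = """<biography xml:lang="en">
  <p>He played for Bologna.</p>
  <p xml:lang="it">Ha giocato nel Bologna.</p>
</biography>"""

def languages(element, inherited=None):
    """Yield (tag, language) pairs; children inherit the language of their parent."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

found = list(languages(ET.fromstring(doc)))
print(found)
```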
Encoding for Portability
Dealing with encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises it
66
Example: How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often, we need to manipulate
  access, delete, modify
parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
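The general node properties and methods can be tried out in a short sketch. It assumes Python and the standard xml.dom.minidom module, one of many DOM bindings and picked here only for brevity; the sample document is the football player one.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<player><name>Carlo</name><team current="yes">Bologna</team></player>'
)
player = doc.documentElement
team = player.getElementsByTagName("team")[0]

# Name, value and type of an element node and of its text child
element_info = (team.nodeName, team.nodeValue, team.nodeType == team.ELEMENT_NODE)
text_node = team.firstChild
text_info = (text_node.nodeName, text_node.nodeValue)

# attributes: all the attributes of the node
current = team.attributes["current"].value

# appendChild(newChild): the new node goes after the existing children
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(new_team)
child_names = [child.nodeName for child in player.childNodes]

print(element_info, text_info, current)
print(player.lastChild.toxml())
```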
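A Simple Java DOM
// Assumptions: DOMParser is taken to be the Apache Xerces parser,
// and the class name FirstRecipe is arbitrary (the slide shows only main)
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class FirstRecipe {
    public static void main(String[] args) {
        try {
            // Parse the XML file named on the command line into a DOM tree
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Scan the children of the root element for the first "recipe" node
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            // Wrap the recipe found (if any) in a new collection document
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}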
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has had problems of acceptance
  but indeed has a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
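The event-based model can be sketched in a few lines, assuming Python and its standard xml.sax module (one SAX binding among many); the class name EventLogger is made up for the example.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record the events fired by the parser, in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        # Ignore whitespace-only chunks between elements
        if content.strip():
            self.events.append("text " + content.strip())

logger = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", logger
)
for event in logger.events:
    print(event)
```

No tree is ever built: the handler only keeps what it chooses to keep, which is where the memory savings come from.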
76
Some History of XML
A lot still to be written…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML “Lite” (1996, Bosak, Bray, et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
17
XML Fundamentals
18
A Simple XML Document
<player>
  Carlo Nervo
</player>
19
XML Document & …
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
<player>
  Carlo Nervo
</player>
20
XML Elements & Tags
The document contains a single element
  of type player
Such an element is delimited by the player tags
  between start tag <player> and end tag </player>
In between the tags lies the element’s content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
<player>
  Carlo Nervo
</player>
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …
22
XML Trees: A Simple Example
player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is a Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have zero or more child elements
  each element except the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect, overlapping is not allowed
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
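The tree behind the football player document can be walked with a few lines of code. A minimal sketch, assuming Python and the standard xml.etree.ElementTree module; the helper name outline is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

def outline(element, depth=0):
    """Return one indented line per element, children after their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)
tree_lines = outline(root)
siblings = [child.tag for child in root]

print("\n".join(tree_lines))
```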
24
Narrative-Organised XML Documents
<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
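Why mixed content is awkward for automated processing shows up as soon as a program reads it. A minimal sketch, assuming Python and the standard xml.etree.ElementTree module, on a shortened version of the biography (the shortened wording is made up for the example):

```python
import xml.etree.ElementTree as ET

doc = ("<biography><name>Carlo Nervo</name> played for "
       "<football_team>Mantova</football_team>, then moved to "
       "<football_team>Bologna</football_team>.</biography>")

bio = ET.fromstring(doc)
# Text is scattered: part inside each child (.text), part right after it (.tail)
pieces = [(child.tag, child.text, child.tail) for child in bio]
# Marked-up items are still easy to pick out
teams = [team.text for team in bio.iter("football_team")]
# The narrative itself has to be stitched back together
plain_text = "".join(bio.itertext())

print(pieces)
print(plain_text)
```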
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
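A sketch of attributes as seen by a program, assuming Python and the standard xml.etree.ElementTree module: they qualify elements without adding nodes to the tree, and the alternative quoting forms are equivalent.

```python
import xml.etree.ElementTree as ET

player = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    '<team current="yes">Bologna</team>'
    '<team current="no">Mantova</team></player>'
)

# Attributes are name-value pairs attached to an element
current_teams = [t.text for t in player.findall("team") if t.get("current") == "yes"]

# Double quotes, single quotes, spaces around "=": all the same attribute
double_quoted = ET.fromstring('<team current="yes"/>').attrib
single_quoted = ET.fromstring("<team current = 'yes'/>").attrib

print(current_teams)
print(double_quoted == single_quoted)
```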
26
Using Elements or Attributes?
Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML names follow the same rules for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:), which is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscores
28
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities
Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote
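In practice the pre-defined entity references are rarely typed by hand. A minimal sketch, assuming Python with the standard xml.sax.saxutils and xml.etree.ElementTree modules; the duo element is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Batman & Robin <superheroes>"

# escape() replaces &, < and > with the pre-defined entity references
escaped = escape(raw)
# Parsing the escaped text gives back the original characters
parsed = ET.fromstring("<duo>" + escaped + "</duo>").text

# Quotes are escaped only on request, e.g. for use inside attribute values
quoted = escape('say "hi"', {'"': "&quot;"})

print(escaped)
print(parsed)
print(quoted)
```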
31
XML Fundamentals
A Simple XML
    <player> Carlo Nervo</player>

XML Document &
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
    Carlonervo.xml, player.txt
    a record in a database
    a memory area built by a CGI and then transmitted
    sent by a Web server with MIME type application/xml or text/xml

XML Elements &
The document contains a single element
    of type player
Such an element is delimited by the tag player
    between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
    tags are markup
        the most common form of markup, but there are other kinds
    content is character data
        including the white space between Carlo & Nervo

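A minimal sketch of the element / content distinction, assuming Python's standard xml.etree.ElementTree parser (not part of the original slides): the tag gives the element type, while the text keeps the character data exactly as written, white space included.

```python
import xml.etree.ElementTree as ET

# Parse the one-element document from the slide.
root = ET.fromstring("<player> Carlo Nervo</player>")

print(root.tag)         # the element type
print(repr(root.text))  # the character data, leading space included
```
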
Tag Syntax
Very similar to HTML tags
    at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
    tags with no content, like <br /> or <hr />
XML is case sensitive
    so <player> cannot be closed by end tag </Player>
    NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
        HTML, JavaScript & XHTML, …

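The case-sensitivity rule can be checked with a short sketch, again assuming Python's xml.etree.ElementTree; the helper name closes_properly is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def closes_properly(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(closes_properly("<player>Carlo Nervo</player>"))  # matching tags
print(closes_properly("<player>Carlo Nervo</Player>"))  # case mismatch
print(closes_properly("<br />"))                        # empty-element tag
```
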
XML Trees: A Simple Example
    player
        name: Carlo
        surname: Nervo
        team: Bologna
        team: Mantova

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

An XML Document is
An XML document has a tree-like structure
    one and only one root
        root element, or document element
    each element can have one or more child elements
    each element has exactly one parent, except the root
    child elements of the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect: overlapping is not allowed

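To make the tree view concrete, here is a small sketch that walks the player document depth-first with Python's xml.etree.ElementTree. The function name outline is an assumption introduced for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The root is the only element without a parent; the four children printed below it are siblings.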
Narrative-Organised
    <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
    </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
    attributes are specified in the start tag
        and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes

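A short sketch of the attribute rules, assuming Python's xml.etree.ElementTree: both quoting styles are accepted, spaces around the equals sign are allowed, and a repeated attribute name is rejected by the parser.

```python
import xml.etree.ElementTree as ET

# Double quotes, and single quotes with spaces around "=", are both fine.
root = ET.fromstring(
    '<player><team current="yes">Bologna</team>'
    "<team current = 'no'>Mantova</team></player>"
)
current = {team.text: team.get("current") for team in root.findall("team")}
print(current)

# The same attribute name twice in one tag is not well-formed.
try:
    ET.fromstring('<team current="yes" current="no"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```
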
Using Elements or
Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure; elements can be nested as needed
    attributes are unique within an element; any number of child elements may share a name
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
    to increase efficiency and abate complexity
An XML name can include
    any letter
        Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore

Parsed Character
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
    Batman & Robin

Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
    out of the XML tree
    used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs

Pre-defined XML
    Markup    Entity    Description
    &lt;      <         less-than
    &gt;      >         greater-than
    &amp;     &         ampersand
    &quot;    "         double quote
    &apos;    '         single quote

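The pre-defined entities can be produced and consumed programmatically. The sketch below assumes Python's xml.sax.saxutils helpers (escape and unescape) together with xml.etree.ElementTree; the sample strings are made up for the illustration.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

raw = "Batman & Robin <3"
escaped = escape(raw)  # replaces &, < and > by default
print(escaped)

# The parser turns the entity references back into plain characters.
element = ET.fromstring("<superheroes>" + escaped + "</superheroes>")
print(element.text)

# Quotes are only escaped on request, e.g. for attribute values.
quoted = escape('say "hi"', {'"': "&quot;"})
print(quoted)
```
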
CDATA Sections
Including code chunks from any language with < or other markup characters can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not

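A brief sketch, assuming Python's xml.etree.ElementTree, showing that a CDATA section and the equivalent escaped text are indistinguishable after parsing. The code fragment inside the element is an invented sample.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<code><![CDATA[if (a < b && b < c) run();]]></code>"
)
with_entities = ET.fromstring(
    "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>"
)

print(with_cdata.text)
print(with_cdata.text == with_entities.text)  # same character data either way
```
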
Comments
Easy
    <!-- Comment -->
    It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
    to give info to computational agents: processing instructions

XML Processing
Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
Processing instructions have this very end
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root

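As an illustration of comments and processing instructions living outside the root element, the following sketch uses Python's xml.dom.minidom, which keeps both kinds of node. The file name style.css is a hypothetical placeholder.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<!-- a comment before the root -->"
    "<player>Carlo Nervo</player>"
)

# Classify the top-level children of the document node.
kinds = []
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target, node.data))
    elif node.nodeType == node.COMMENT_NODE:
        kinds.append(("comment", node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))
print(kinds)
```

Note that the XML declaration does not show up as a node: it is not a processing instruction.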
The XML Declaration
Looks like an XML processing instruction
    but it is not: it is just the XML declaration
It is optional
    but, if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the encoding of the text (a Unicode encoding in the example)
    optional, default Unicode (UTF-8, or UTF-16 with a byte order mark)
Standalone means that it has no external DTD
    optional, default no

Checking Well-Formedness
Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
Tools on the Web
    Just look around

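Any non-validating parser can serve as a well-formedness checker. A minimal sketch, assuming Python's xml.etree.ElementTree; the helper name well_formed and the sample documents are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

def well_formed(text):
    """Return (True, None) if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as error:
        return False, str(error)

samples = {
    "ok": "<player><name>Carlo</name></player>",
    "overlap": "<em><b>Wrong </em> XML</b>",
    "two roots": "<a/><b/>",
    "unquoted attribute": "<team current=yes/>",
    "bare ampersand": "<p>Batman & Robin</p>",
}
for label, text in samples.items():
    print(label, well_formed(text)[0])
```
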
DTD
Flexibility or
XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
    some control over syntax should be enforced
        like "a football player should have at least one team"
Document Type Definition (DTD)
    to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional

Validation
A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
There are many things a DTD does not say
    we stick with what we can specify

DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD
We do not go too deep into DTD syntax
    we just look at the example below and comment
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declaration
The DTD of the example is declared as internal
    but could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE player SYSTEM "http://…">

DTD Declarations
So you may
    define your own DTD, and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax for your XML docs

Element Declarations
    <!ELEMENT player (name, surname, team+)>
A player element contains one name, one surname and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)

Some Syntax
, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
parentheses for grouping
    at any level of nesting
    operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

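Python's standard library parsers do not validate against a DTD, so the sketch below only mimics one content model by hand: (name, surname, team+) is rewritten as a regular expression over the sequence of child element names. The names PLAYER_MODEL and matches_player_model are assumptions made for this example, not a DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# (name, surname, team+) over a list of child names, each followed by a space.
PLAYER_MODEL = re.compile(r"name surname (team )+")

def matches_player_model(xml_text):
    """Check the children of <player> against the hand-translated model."""
    root = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in root)
    return root.tag == "player" and PLAYER_MODEL.fullmatch(child_names) is not None

good = "<player><name/><surname/><team/><team/></player>"
no_team = "<player><name/><surname/></player>"
wrong_order = "<player><surname/><name/><team/></player>"
print(matches_player_model(good))
print(matches_player_model(no_team))
print(matches_player_model(wrong_order))
```

The sequence operator becomes concatenation and the + suffix keeps its meaning, which is why DTD content models are often described as regular expressions over element names.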
Attribute
    <!ATTLIST team current (yes | no) #REQUIRED>
A team element has a current attribute
    which is mandatory
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type

Attribute Defaults
#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, it has a given value
"literal"
    the default value is the literal quoted string

Attribute Types
CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character
    the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name unique in the document, working as an identifier
IDREF, IDREFS

Other DTD
ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it

Namespaces
What are
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

Syntax for
Qualified names
    or QNames, or raw names
    prefix:local_part
Examples of qualified names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes
Example
    a large firm could have a number of namespaces for different purposes
        <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3c.org/TR/REC-rdf-syntax">
URIs are standardised, prefixes are not
    but svg, rdf and other customary prefixes are usually kept by convention

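A sketch of how a namespace-aware parser sees prefixes, assuming Python's xml.etree.ElementTree: each qualified name is expanded to {URI}local_part, so the prefix chosen by the reader need not match the one used in the document. The office elements and their content are hypothetical additions to the company example.

```python
import xml.etree.ElementTree as ET

DOC = """<company xmlns:local="http://www.company.it/xml"
         xmlns:world="http://www.company.com/xml">
  <local:office>Cesena</local:office>
  <world:office>New York</world:office>
</company>"""

root = ET.fromstring(DOC)
tags = [child.tag for child in root]
print(tags)  # prefixes are replaced by their namespace URIs

# A different prefix ("it") bound to the same URI finds the same element.
NAMES = {"it": "http://www.company.it/xml"}
italian_office = root.find("it:office", NAMES).text
print(italian_office)
```
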
Setting Default
xmlns attribute
    alone, with no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit

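The effect of a default namespace can be observed with the sketch below, which assumes Python's xml.etree.ElementTree and the SVG namespace URI published by the W3C. The rect child and the attribute values are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
root = ET.fromstring(
    '<svg xmlns="' + SVG_NS + '" width="10" height="10">'
    '<rect width="10" height="10"/></svg>'
)

print(root.tag)     # the root is in the default namespace
print(root[0].tag)  # so is every unprefixed descendant element
print(root.attrib)  # unprefixed attributes are not put in the namespace
```
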
Internationalisation
What does Text
"Text" can be encoded according to so many different alphabets
    character set: a mapping between characters and integers (code points)
        ASCII being the most (in)famous; now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
        UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared

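The difference between a code point and its encodings can be shown in a few lines of plain Python (no XML library needed). The letter chosen, U+00E0, is just an example.

```python
letter = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE

code_point = ord(letter)              # the integer assigned by the character set
utf8 = letter.encode("utf-8")         # two bytes
utf16 = letter.encode("utf-16-be")    # two different bytes
latin1 = letter.encode("iso-8859-1")  # one byte

print(hex(code_point), utf8, utf16, latin1)
```

One character, one code point, three different byte sequences: this is why a document has to say which encoding it uses.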
The XML Encoding
Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is then a text declaration, no longer an XML declaration

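A hedged sketch of why the encoding declaration matters, assuming Python's xml.etree.ElementTree fed with raw bytes: the same Latin-1 bytes parse correctly when the encoding is declared and are rejected when the parser falls back to UTF-8. The city element is an invented sample.

```python
import xml.etree.ElementTree as ET

declared = '<?xml version="1.0" encoding="ISO-8859-1"?><city>Forl\u00ec</city>'
root = ET.fromstring(declared.encode("iso-8859-1"))
print(root.text)

# Same byte for the accented letter, but no declaration: not valid UTF-8.
try:
    ET.fromstring(b"<city>Forl\xec</city>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```
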
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element
Values are to be found in ISO 639
    standard two letters for each language known
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
        prefix x-

Encoding for
Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

XML amp CSS
XML on Browsers
Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
    http://w3c.org/Style/CSS
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { attribute1: value1; … }

XML + CSS Example: The XML Doc
Example: How Mozilla Visualises It

DOM amp SAX
Manipulating XML
Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
    parts of an XML document
    which either may or may not be an XML file
This is typically done through programming languages of many sorts
    through an ad hoc API
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model
http://www.w3.org/DOM
    standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
    document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java

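The generic node properties and methods named above (name, value, type, attributes, appendChild) are exposed almost verbatim by Python's xml.dom.minidom, used here as one possible language binding. The document content is the running player example; the Mantova node is added by the sketch.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
team = doc.documentElement.firstChild

# Every node has a name, a value and a type.
print(team.nodeName, team.nodeValue, team.nodeType == team.ELEMENT_NODE)
print(team.firstChild.nodeName, team.firstChild.nodeValue)

# attributes gives access to all the attributes of the node.
print(team.attributes["current"].value)

# appendChild adds a new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
doc.documentElement.appendChild(new_team)
print(doc.documentElement.toxml())
```
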
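A Simple Java DOM
    // Assumes a DOMParser class such as the Xerces one (org.apache.xerces.parsers.DOMParser),
    // the org.w3c.dom Document and Node interfaces, java.io.PrintStream,
    // and a print(Node, PrintStream) helper that is not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);                 // load the whole document in memory
            Document doc = p.getDocument();
            Node n = doc.getDocumentElement().getFirstChild();
            // skip siblings until the first recipe element (or the end of the list)
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
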
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools

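The event-based model can be sketched with Python's xml.sax module, used here as one available SAX binding. The handler class EventLogger is a made-up name: it simply records each event as the document flows through the parser, without ever building a tree.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record SAX events in arrival order instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        if content.strip():  # ignore white-space-only chunks
            self.events.append("text " + content)

handler = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler
)
print("\n".join(handler.events))
```

Only the handler's own state is kept in memory, which is the reason SAX suits large documents better than DOM.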
A Simple XML
ltplayergt Carlo Nervoltplayergt
19
XML Document amp
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
#IMPLIED
the attribute is optional
#REQUIRED
the attribute is mandatory
#FIXED
whether it is explicitly specified or not, it has a given value
literal
the default value is the literal quoted string
47
Attribute Types
CDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
like an XML name, but more permissive: any name character is accepted as the first character
the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
name(s) of unparsed entities declared elsewhere in the document
ID
an XML name, unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
<!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
who cares, actually
We stop here
more only for those who need it
49
Namespaces
50
What are
Distinguish
different XML applications may use the same names
at any scale, from personal to world-wide
a namespace allows them to be clearly distinguished
Group
names of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
what if I have to use them together?
namespaces can be used to disambiguate names
51
Syntax for
Qualified names
prefix:local_part
Examples of qualified names
or QNames, or raw names
rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
a large firm could have a number of namespaces for different purposes
<company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
then you can use local, euro and world everywhere as prefixes
typically declared in the topmost element, but could be declared anywhere
example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
but usually svg, rdf and other conventional prefixes are not changed
53
Setting Default
xmlns attribute
alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
no need to make the svg prefix explicit
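To see what a parser makes of prefixes and default namespaces, the Python sketch below parses a small document with the standard xml.etree.ElementTree module, which reports every element name in the expanded form {namespace-URI}local_part. The office elements, their contents and the nested svg fragment are made up for illustration; only the namespace URIs follow the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two prefixed namespaces plus a default one on svg.
doc = """<company xmlns:local="http://www.company.it/xml"
                  xmlns:world="http://www.company.com/xml">
  <local:office>Bologna</local:office>
  <world:office>Elsewhere</world:office>
  <svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>
</company>"""

root = ET.fromstring(doc)

# iter() walks the tree in document order, root included.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

The two office elements share a local part but end up with different expanded names, and rect inherits the default namespace declared on svg without any prefix being written.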
54
Internationalisation
55
What does Text
“Text” can be encoded according to so many different alphabets
mapping between characters and integers (code points)
character set
ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
so encoding should be declared
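The difference between a character set and an encoding can be checked directly. The Python sketch below takes one string, lists its code points, and encodes it in three ways; the sample word is only an illustration.

```python
text = "Università"

# Character set: each character maps to one integer (code point).
code_points = [ord(ch) for ch in text]

# Encodings: the same code points mapped onto bytes in different ways.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")   # big-endian, no byte-order mark
latin1 = text.encode("iso-8859-1")

print(code_points[-1])                                # 224, i.e. U+00E0
print(len(text), len(utf8), len(utf16), len(latin1))  # 10 11 20 10
```

The accented letter takes two bytes in UTF-8, two in UTF-16 and one in Latin-1, which is why a reader of the bytes must know the encoding.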
56
The XML Encoding
Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
utf-8, utf-16 (Unicode)
ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
Unicode and ISO are the most used families
Used also for external parsed entities
like DTD fragments or XML chunks
which may have different encodings
there, version may be dropped
it is a text declaration, but no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
for multi-lingual docs
xml:lang attribute
can be associated to any element
determines the language of the element
Values are to be found in ISO 639
standard two letters for each language known
if not there, IANA
prefix i-
such as i-navajo, i-klingon, …
if not there too, such as for user-defined tags
prefix x-
58
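A tool such as a spell-checker could find the language of each element as sketched below in Python. The xml prefix is bound by definition to the namespace http://www.w3.org/XML/1998/namespace, so ElementTree exposes xml:lang under that expanded name. The helper languages and the sample biography are assumptions for illustration; the helper also applies the rule that an element without its own xml:lang takes the value of its closest ancestor.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree reports the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical multi-lingual document.
doc = """<biography xml:lang="en">
  <p>He moved to Bologna.</p>
  <p xml:lang="it">Si trasferì a Bologna.</p>
</biography>"""

def languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from the parent."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

result = list(languages(ET.fromstring(doc)))
print(result)  # [('biography', 'en'), ('p', 'en'), ('p', 'it')]
```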
Encoding for
Working around encoding is not simply an “internationalisation” issue
it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
which are often not easy to catch
XML abilities to
handle encoding precisely and accurately
embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
when trying to visualise an XML document
XML, however, can be transformed
to become easier to handle by standard browsers
Two main approaches
Web-based one: XML + CSS
XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style
Cascading Style Sheets (CSS)
a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
http://w3c.org/Style/CSS
Goals
describing how to present elements of a document
spanning over a range of different media
separating style description from content and structure
In this course, we assume that you already know the basics
if not, look at http://www.w3.org/Style/CSS/learning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
a CSS style sheet referring to the proper element types of the XML document
the association between the XML document and the CSS style sheet
Processing instruction
to associate CSS to XML
<?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
defining presentation style for the XML document tags
tagname { attribute1: value1; … }
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML Document
and presenting it somehow
is not enough for most non-trivial application scenarios
Most often, we need to manipulate / access / delete / modify
parts of an XML document
which may or may not be an XML file
This is typically done through programming languages of many sorts
through ad hoc APIs
The most used / hated / deprecated / widespread are
69
Document Object
http://www.w3.org/DOM
W3C standard, as usual
“The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”
It applies to HTML as well as XML
It is essentially an API
standardised for Java & ECMAScript
but can be extended to other languages
There is no time here to go deep into DOM
we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
primitive navigation and manipulation of XML trees
other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
one of them is a root node
nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
72
Properties &
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
attributes returns all the attributes of the node
appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
many solutions for Java
73
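The same generic properties and methods are available outside Java. As a minimal sketch, the Python standard library module xml.dom.minidom exposes nodeName, nodeValue, nodeType, attributes and appendChild with the names used in the DOM specification. The small player document below is adapted from the slides; everything else is illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<player><team current="yes">Bologna</team></player>'
)
team = doc.documentElement.firstChild

# General properties: every node has a name, a value and a type.
print(team.nodeName, team.nodeType == team.ELEMENT_NODE)  # team True
print(team.attributes["current"].value)                   # yes
text = team.firstChild
print(text.nodeName, text.nodeValue)                      # #text Bologna

# appendChild adds the new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
doc.documentElement.appendChild(new_team)
print(doc.documentElement.toxml())
```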
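A Simple Java DOM
// Assumptions: DOMParser is the Apache Xerces parser;
// print(Node, PrintStream) is a helper that is not shown on the slide
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);                      // loads the whole document in memory
    Document doc = p.getDocument();
    // skip the children of the root until the first recipe element
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}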
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to manage
wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
which did not start as a standard
has problems of acceptance
but has indeed a long tail of followers
and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
flowing through the SAX parser
and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
good for simple applications
and also to avoid memory abuse
Not so well-supported as DOM is
in terms of standardisation
as well as of tools
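The event-based model can be tried with the xml.sax module of the Python standard library. In the sketch below the handler class EventLogger is an illustrative name: it only records each event as it arrives, and no tree is ever built. Note that a SAX parser is free to deliver character content in more than one chunk.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record SAX events in the order the parser generates them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        self.events.append("text " + content)

handler = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>",
    handler,
)
for event in handler.events:
    print(event)
```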
76
XML Document &
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
Carlonervo.xml, player.txt
a record in a database
a memory area built by a CGI and then transmitted
sent by a Web server with MIME type application/xml or text/xml
<player> Carlo Nervo</player>
20
XML Elements &
The document contains a single element
of type player
Such an element is delimited by the tag player
between start tag <player> and end tag </player>
In between the tags lies the element’s content: Carlo Nervo
tags are markup
the most common form of markup, but there are other kinds
content is character data
including the white space between Carlo & Nervo
<player> Carlo Nervo</player>
21
Tag Syntax
Very similar to HTML tags
at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
tags with no content, like <br /> or <hr />
XML is case sensitive
so <player> cannot be closed by end tag </Player>
NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
HTML, JavaScript & XHTML, …
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
An XML Document is
An XML Document has a tree-like structure
one and only one root
root element, or document element
each element can have one or more child elements
each element, except the root, has exactly one parent
child elements from the same parent are siblings
leaves are either content or empty elements
Well-formedness stems from here
<em><b>Wrong </em> XML</b> is not permitted
nesting needs to be perfect, overlapping not allowed
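Any XML parser can act as a well-formedness checker, since it must reject documents that break these rules. The Python sketch below wraps the standard ElementTree parser in a helper; the name is_well_formed is an illustrative choice.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<em><b>Right</b> XML</em>"))   # True: perfect nesting
print(is_well_formed("<em><b>Wrong </em> XML</b>"))  # False: overlapping elements
print(is_well_formed("<a/><b/>"))                    # False: more than one root
```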
24
Narrative-Organised
<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>
XML Documents for written narrative such as articles, reports, blogs, books, novels
elements with mixed content
not easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributes
attributes are specified in the start tag
and in the only tag of empty elements
any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
but they are qualifiers for the nodes and
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
26
Using Elements or
“Attributes are for meta-data about the element, and content is information of the element”
maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure, elements can be nested as needed
attributes are unique within an element, any
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
to increase efficiency and abate complexity
An XML name can include
any letter
Latin or even non-Latin, like ideographs
any digit
underscore, hyphen, and period (_ - .)
a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
and can begin only with letters, ideographs, or underscore
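These rules are easy to turn into a check. The Python sketch below is a simplification, not the exact Name production of the XML specification: it uses Python's own str.isalpha and str.isalnum as stand-ins for the XML letter and digit classes, and it rejects the colon because that character is reserved to namespaces. The function name is_simple_xml_name is an illustrative choice.

```python
def is_simple_xml_name(name):
    """Approximate check of the naming rules (colon deliberately excluded)."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # A name can begin only with a letter (Latin or not) or an underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # After that: letters, digits, underscore, hyphen and period.
    return all(ch.isalnum() or ch in "_-." for ch in rest)

print(is_simple_xml_name("first_name"))  # True
print(is_simple_xml_name("team-2.0"))    # True
print(is_simple_xml_name("2nd_team"))    # False: starts with a digit
print(is_simple_xml_name("first name"))  # False: contains white space
print(is_simple_xml_name("rdf:RDF"))     # False here: colon left to namespaces
```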
28
Parsed Character
An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
so, for instance, < is always taken as the beginning of a tag
what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
unless an escape character & is encountered
character data to parse starts again after the ; char
E.g., the content of the element
<superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
Batman & Robin
29
Entity References
&entityreference;
an entity is something defined outside the normal flow of the XML document
out of the XML tree
used for constants, common values, external values, etc.
through an entity reference
Users of any sort may define their own entities
we'll see how soon, for instance through DTDs
30
Pre-defined XML
Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote
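In practice the pre-defined entity references are rarely typed by hand. As a sketch, the Python standard library offers xml.sax.saxutils.escape and unescape: by default they handle &, < and >, while the two quote characters have to be passed in explicitly through the optional entities mapping. The sample strings are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "if (a < b && b > c)"
escaped = escape(raw)
print(escaped)  # if (a &lt; b &amp;&amp; b &gt; c)

# Quotes are escaped only when asked for explicitly.
quoted = escape('say "hi"', {'"': "&quot;"})
print(quoted)   # say &quot;hi&quot;

# A parser turns the references back into the original characters.
parsed = ET.fromstring("<code>" + escaped + "</code>").text
print(parsed == raw, unescape(escaped) == raw)  # True True
```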
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
we need to tell the parser “do not parse this”
good, for instance, to include segments of XML code to show
CDATA Section
between <![CDATA[ and ]]>
can contain anything but its own closing delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
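The last point can be verified with any parser. The Python sketch below parses the same content once as a CDATA section and once with entity references, using the standard ElementTree module; the code element and its content are illustrative.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b > c)]]></code>")
with_refs = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &gt; c)</code>")

print(with_cdata.text)                    # if (a < b && b > c)
print(with_cdata.text == with_refs.text)  # True: the origin is not recorded
```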
32
Comments
Easy
<!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
they can appear anywhere, even before the root element
but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
to give info to computational agents: processing instructions
33
XML Processing
Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very end
<?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
<?php … ?>
<?xml-stylesheet … ?>
A processing instruction is markup, not an element
it can appear anywhere outside a tag, even before or after the root
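A DOM parser hands processing instructions over to the application as nodes of their own kind, with the target separated from the rest. The Python sketch below uses the standard xml.dom.minidom module; the style sheet name style.css is a made-up placeholder.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<player>Carlo Nervo</player>"
)

# The processing instruction sits before the root, as a child of the document.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)  # True
print(pi.target)                                      # xml-stylesheet
print(pi.data)                                        # type="text/css" href="style.css"
print(doc.documentElement.nodeName)                   # player
```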
34
The XML Declaration
Looks like an XML processing instruction
but it is not: it is just the XML declaration
It is optional
but if present, it must be the first thing in the document, absolutely
not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
optional, default Unicode
Standalone set to yes means that the document has no external DTD
optional, default no
35
Checking Well-
Main rules
perfect match between start and end tags
no overlapping elements
one and only one root element
attribute values are always quoted
at most one attribute with a given name per element
neither comments nor processing instructions within tags
no unescaped < or & signs in the character data of elements or attributes
…
Tools on the Web
Just look around
36
DTD
37
Flexibility or
XML is flexible
whatever this means
but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
some control over syntax should be enforced
like: a football player should have at least one team
Document Type Definition (DTD)
to define which XML documents are valid
Validity is not mandatory, as well-formedness is
how to handle validity errors is optional
38
Validation
A valid XML Document includes a DTD that the document satisfies
Main principle
everything not permitted is forbidden
that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
then the document is valid
otherwise the document is invalid
Many things a DTD does not say
we stick with what we can specify
39
DTD is…
SGML-based
syntax a bit awkward
but after all easy to understand
and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approach
maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be that
but understanding DTDs, how to modify them, and how to write your own ones is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
we just look at the following example and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
DTD Declaration
DTD is declared here as internal
but could be declared separately
<!DOCTYPE football_player SYSTEM "football_player.dtd">
even referring to an external shared resource
<!DOCTYPE football_player SYSTEM "http://…">
XML Elements amp
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervo
tags are markupthe most common form of markup but there are other kinds
content is character dataincluding the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
to increase efficiency and abate complexity
An XML name can include
any letter
latin or even non-latin, like ideographs
any digit
underscore, hyphen and period (_ - .)
a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
and can begin only with letters, ideographs or underscore
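The naming rules above can be approximated in a few lines of Java. This is a simplified sketch, not the exact grammar: the XML specification lists precise character ranges, while the hypothetical method isPlausibleName below relies on Character.isLetter and Character.isLetterOrDigit, and rejects the colon because it is reserved to namespaces.

```java
public class XmlNames {
    // Approximates the slide's rules: letters, digits, '_', '-' and '.' are allowed,
    // and the first character must be a letter (ideographs included) or an underscore.
    static boolean isPlausibleName(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        char first = s.charAt(0);
        if (!(Character.isLetter(first) || first == '_')) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean allowed = Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
            if (!allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleName("football_team")); // true
        System.out.println(isPlausibleName("first name"));    // false: white space
    }
}
```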
28
Parsed Character Data
An XML Parser interprets the character sequences it is fed with, trying to work out its tree-like structure
so, for instance, < is always taken as the beginning of a tag
what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
unless an escape character & is encountered
character data to parse starts again after the ; character
E.g., the content of the element
<superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
Batman & Robin
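The superheroes example can be reproduced with the JDK parser. A minimal sketch, where ParsedText and textOf are hypothetical names chosen for this illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class ParsedText {
    // Parses the document and returns the parsed character data of its root element.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The reference &amp; in the source becomes a plain & in the parsed data.
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
    }
}
```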
29
Entity References
&entityreference;
an entity is something defined outside the normal flow of the XML document
out of the XML tree
used for constants, common values, external values, etc.
through an entity reference
Users of any sort may define their own entities
we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
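The table above translates directly into an escaping routine. The following is a minimal sketch; Escape and escape are hypothetical names, and real applications would normally leave escaping to their XML library.

```java
public class Escape {
    // Replaces each of the five special characters with its pre-defined entity reference.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Batman & Robin")); // Batman &amp; Robin
    }
}
```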
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
we need to tell the parser: do not parse this
good, for instance, to include segments of XML code to show
CDATA Section
between <![CDATA[ and ]]>
can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
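A small sketch of the equivalence between a CDATA section and escaped text, using the JDK parser. CdataDemo and textOf are hypothetical names; the code fragment inside the section is just sample content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class CdataDemo {
    // Returns the character content of the root element, whatever markup produced it.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && c) go();]]></code>";
        String escaped = "<code>if (a &lt; b &amp;&amp; c) go();</code>";
        // Both documents carry the same character content.
        System.out.println(textOf(withCdata).equals(textOf(escaped)));
    }
}
```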
32
Comments
Easy
<!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
they can appear anywhere, even before the root element
but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
to give info to computational agents: processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions serve this very purpose
<?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
<?php … ?>
<?xml-stylesheet … ?>
A processing instruction is markup, not an element
it can appear everywhere out of a tag, even before or after the root
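How an application retrieves a processing instruction can be sketched with the JDK DOM API. The names PiReader and instructionData, as well as the file name player.css, are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

public class PiReader {
    // Looks among the top-level nodes for a processing instruction with the given
    // target, and returns its data (everything after the target), or null.
    static String instructionData(String xml, String target) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    if (pi.getTarget().equals(target)) {
                        return pi.getData();
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?><player/>";
        System.out.println(instructionData(xml, "xml-stylesheet"));
    }
}
```

The data part is handed over as an uninterpreted string: what looks like attributes inside it is only a convention of the target application.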
34
The XML Declaration
Looks like an XML processing instruction
but it is not: it is just the XML declaration
It is optional
but, if there, it should be the first thing in the document, absolutely
not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone set to yes means that the document does not depend on an external DTD
optional, default no
35
Checking Well-Formedness
Main rules
perfect match between start and end tags
no overlapping elements
one and only one root element
attribute values are always quoted
at most one attribute with a given name per element
neither comments nor processing instructions within tags
no unescaped < or & signs in the character data of elements or attributes
…
Tools on the Web
Just look around
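Any XML parser can serve as a well-formedness checker, since it must refuse documents that break the rules above. A minimal sketch with the parser bundled in the JDK; WellFormed and isWellFormed are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormed {
    // Returns true if the parser accepts the text, false on any well-formedness error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));  // true
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>")); // false: overlapping
    }
}
```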
36
DTD
37
Flexibility or
XML is flexible
whatever this means
but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
some control over syntax should be enforced
like: a football player should have at least one team
Document Type Definition (DTD)
to define which XML documents are valid
Validity is not mandatory, as well-formedness is
how to handle validity errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfies
Main principle
everything not permitted is forbidden
that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
then the document is valid
otherwise the document is invalid
Many things a DTD does not say
we stick with what we can specify
39
DTD is…
SGML-based
syntax a bit awkward
but, after all, easy to understand
and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approach
maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be that
but understanding DTDs, how to modify them, and how to write your own ones is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
The name in the DOCTYPE declaration must match the root element (player here), otherwise the document is not valid
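Validation against an internal DTD can be tried out with the validating mode of the JDK parser. The sketch below assumes the DTD of this slide with player as the document type name; ValidatePlayer and isValid are hypothetical names. Validity violations are reported to the error handler as recoverable errors, so the handler simply records them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ValidatePlayer {
    // Internal DTD subset of the slide, prepended to every document under test.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Returns true only if the body is well-formed and satisfies the DTD.
    static boolean isValid(String body) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // a validity constraint was violated
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return valid[0];
        } catch (SAXException e) {
            return false; // not even well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team></player>"));
        // A player without any team violates the content model (team+).
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```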
41
DTD Declaration
DTD is declared here as internal
but could be declared separately
<!DOCTYPE player SYSTEM "football_player.dtd">
even referring to an external shared resource
<!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
define your own DTD and
either include it in your XML document
or save it as an independent document and refer to it from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to, or a standardisation body of any sort
by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
in that precise order
and they are just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
to define ordered lists
| is for choice
to provide for alternatives
suffixes
* for zero or more occurrences
+ for one or more occurrences
? for zero or one occurrence
parentheses for grouping
at any level of nesting
operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
45
Attribute Declarations
A team element has a current attribute
which is mandatory
#IMPLIED would say optional instead
and can be either yes or no
enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
the attribute is optional
#REQUIRED
the attribute is mandatory
#FIXED
whether it is explicitly specified or not, it has a given value
literal
the default value is the literal quoted string
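Literal and #FIXED defaults are filled in by the parser when the attribute is missing from the document. The sketch below illustrates this with the JDK parser and a small variant of the player DTD; the sport attribute, the class AttributeDefaults and the method attribute are assumptions introduced only for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeDefaults {
    // Variant of the player DTD: current defaults to "no", sport is fixed to "football".
    static final String DOC =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST player sport CDATA #FIXED \"football\">\n"
        + "  <!ATTLIST team current (yes | no) \"no\">\n"
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    // Returns the value of the named attribute on the index-th element with the given tag.
    static String attribute(String tag, int index, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOC)));
            Element element = (Element) doc.getElementsByTagName(tag).item(index);
            return element.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute("team", 1, "current")); // supplied by the DTD default
        System.out.println(attribute("player", 0, "sport")); // supplied by #FIXED
    }
}
```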
47
Attribute Types
CDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
more than an XML name: any name character is accepted as the first character
the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
name(s) of unparsed entities declared elsewhere in the document
ID
an XML name, unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD Declarations
ENTITY declarations
<!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
who cares, actually
We stop here
more only for those who need it
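An entity declaration and its use through an entity reference can be sketched as follows. To stay self-contained, the example declares an internal entity with a made-up replacement text instead of the external SYSTEM entity of the slide; EntityDemo and textOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EntityDemo {
    // The footer entity is declared in the internal DTD subset and referenced in the content.
    static final String DOC =
        "<!DOCTYPE page [\n"
        + "  <!ENTITY footer \"Course material, DEIS\">\n"
        + "]>\n"
        + "<page>XML Concepts. &footer;</page>";

    // Returns the character content of the root element, with entity references expanded.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(DOC));
    }
}
```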
49
Namespaces
50
What are Namespaces for?
Distinguish
different XML applications may use the same names
at any scale, from personal to world-wide
a namespace allows them to be clearly distinguished
Group
names of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
what if I have to use them together?
namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
prefix:local_part
Examples of qualified names
or QNames, or raw names
rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
a large firm could have a number of namespaces for different purposes
<company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
then you can use local, euro and world everywhere as prefixes
typically declared in the topmost element, but could be declared anywhere
example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
but usually svg, rdf and other customary prefixes are not changed
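The association between prefixes and URIs becomes visible once the parser is made namespace-aware. A minimal sketch based on the company example; the office elements, the class PrefixDemo and the method describe are assumptions added for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PrefixDemo {
    static final String COMPANY =
        "<company xmlns:local=\"http://www.company.it/xml\""
        + " xmlns:world=\"http://www.company.com/xml\">"
        + "<local:office/><world:office/></company>";

    // Describes the index-th office element as "prefix | local part | namespace URI".
    static String describe(String xml, int index) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node office = doc.getElementsByTagNameNS("*", "office").item(index);
            return office.getPrefix() + " | " + office.getLocalName()
                + " | " + office.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(COMPANY, 0));
        System.out.println(describe(COMPANY, 1));
    }
}
```

The two elements share the local part office, and only the namespace URI tells them apart; the prefix is just a local shorthand for that URI.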
53
Setting Default Namespaces
xmlns attribute
alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
no need for the svg prefix to be made explicit
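The effect of a default namespace can be checked in the same way. In this sketch the rect child and the concrete width and height values are made up; DefaultNamespace and its methods are hypothetical names. Note that, according to the Namespaces recommendation, the default namespace applies to elements but not to unprefixed attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DefaultNamespace {
    static final String SVG =
        "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"100\" height=\"50\"><rect/></svg>";

    // Returns the first element with the given tag, parsed with namespace support.
    private static Element element(String tag) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(SVG)));
            return (Element) doc.getElementsByTagName(tag).item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String namespaceOf(String tag) {
        return element(tag).getNamespaceURI();
    }

    static String prefixOf(String tag) {
        return element(tag).getPrefix();
    }

    static String attributeNamespace(String tag, String attribute) {
        return element(tag).getAttributeNode(attribute).getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println(namespaceOf("rect"));                // inherited from svg
        System.out.println(prefixOf("rect"));                   // null: no prefix needed
        System.out.println(attributeNamespace("svg", "width")); // null: attributes are not affected
    }
}
```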
54
Internationalisation
55
What does Text Mean?
“Text” can be encoded according to so many different alphabets
character set
mapping between characters and integers (code points)
ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
so encoding should be declared
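The difference between a code point and its encodings can be made concrete in a few lines of Java. The class CodePointsAndBytes and the method byteCount are hypothetical names; the character è is just a sample.

```java
import java.nio.charset.Charset;

public class CodePointsAndBytes {
    // Number of bytes needed to represent the text in the named encoding.
    static int byteCount(String text, String encoding) {
        return text.getBytes(Charset.forName(encoding)).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // è, Unicode code point U+00E8 (232)
        System.out.println(eGrave.codePointAt(0));            // 232: one code point
        System.out.println(byteCount(eGrave, "UTF-8"));       // 2 bytes
        System.out.println(byteCount(eGrave, "UTF-16BE"));    // 2 bytes
        System.out.println(byteCount(eGrave, "ISO-8859-1"));  // 1 byte
    }
}
```

The same code point thus takes a different number of bytes depending on the encoding, which is why a parser needs to know which one was used.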
56
The XML Encoding Declaration
Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
utf-8, utf-16 (Unicode)
ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
Unicode and ISO are the most used families
Used also for external parsed entities
like DTD fragments or XML chunks
which may have different encodings
there, version may be dropped
it is a text declaration, but no longer an XML declaration
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
for multi-lingual docs
xml:lang attribute
can be associated with any element
determines the language of the element
Values are to be found in ISO 639
standard two letters for each language known
if not there, IANA
prefix i-
such as i-navajo, i-klingon, …
if not there too, such as for user-defined tags
prefix x-
58
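A tool such as the spell-checker mentioned in the multi-lingual slide could determine the language of a subpart by looking for the nearest xml:lang attribute, since the declared language also covers the content of the element unless overridden. A minimal sketch; the doc and p elements, the class LanguageLookup and the method languageOf are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class LanguageLookup {
    static final String DOC =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>";

    // Walks from the element up to the root, returning the first xml:lang found, or null.
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return null;
    }

    // Convenience wrapper: language of the index-th element with the given tag.
    static String languageOf(String xml, String tag, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tag).item(index));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageOf(DOC, "p", 0)); // en, inherited from doc
        System.out.println(languageOf(DOC, "p", 1)); // it, declared on the element
    }
}
```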
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
which are often not easy to catch
XML abilities to
handle encoding precisely and accurately
embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
when trying to visualise an XML document
XML, however, can be transformed
to become easier to handle by standard browsers
Two main approaches
Web-based one: XML + CSS
XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
http://w3c.org/Style/CSS
Goals
describing how to present elements of a document
spanning over a range of different media
separating style description from content and structure
In this course, we assume that you already know the basics
if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
a CSS style sheet referring to the proper element types of the XML document
the association between the XML document and the CSS style sheet
Processing instruction
to associate CSS to XML
<?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises it
66
Example: How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML Document
and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
access / delete / modify
parts of an XML document
which either may or may not be an XML file
This is typically done through programming languages of many sorts
through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
standardised for Java & ECMAScript
but can be extended to other languages
There is no time here to go deep into DOM
we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
primitive navigation and manipulation of XML trees
other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
one of them is a root node
nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
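The different kinds of DOM nodes can be observed by listing the children of a parsed document. The sketch below uses the JDK parser with its default settings; the sample document, the class NodeKinds and its methods are hypothetical and only cover the node kinds that occur in the sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeKinds {
    static final String DOC =
        "<!-- squad list --><?app keep?>"
        + "<player id=\"7\">Carlo<![CDATA[ & ]]><team/></player>";

    // Maps the numeric DOM node type onto a readable name.
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.ELEMENT_NODE:                return "element";
            case Node.TEXT_NODE:                   return "text";
            case Node.CDATA_SECTION_NODE:          return "CDATA section";
            case Node.COMMENT_NODE:                return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default:                               return "other";
        }
    }

    // Lists the kinds of the direct children of a node, in document order.
    static String childKinds(Node parent) {
        List<String> kinds = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            kinds.add(kind(c));
        }
        return String.join(", ", kinds);
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    static String topLevel(String xml) {
        return childKinds(parse(xml));
    }

    static String insideRoot(String xml) {
        return childKinds(parse(xml).getDocumentElement());
    }

    public static void main(String[] args) {
        System.out.println(topLevel(DOC));
        System.out.println(insideRoot(DOC));
    }
}
```

The id attribute does not show up among the children of player: attributes are nodes too, but they are reached through the attributes property rather than through the child list.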
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
attributes returns all the attributes of the node
appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
many solutions for Java
73
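A Simple Java DOM Example
// DOMParser is the Xerces parser class (org.apache.xerces.parsers.DOMParser);
// Document and Node come from org.w3c.dom, PrintStream from java.io.
public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // scan the children of the root until the first recipe element is found
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out); // print is a helper that is not shown on the slide
    out.println("</collection>");
  } catch (Exception e) { e.printStackTrace(); }
}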
74
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
it might be time-consuming and difficult to manage
wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
which did not start as a standard
has problems of acceptance
but has indeed a long tail of followers
and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
flowing through the SAX parser
and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
good for simple applications
and also to avoid memory abuse
Not so well-supported as DOM is
in terms of standardisation
as well as of tools
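The event-based model can be sketched with the SAX parser included in the JDK. The handler below just records the events it receives; SaxEvents and events are hypothetical names, and the player document is the one used throughout these slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEvents {
    // Parses the text and returns the sequence of events seen by the handler.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("start document"); }
            @Override public void endDocument() { log.add("end document"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

Nothing is kept in memory except what the handler decides to store, which is the reason for the low memory footprint compared with DOM.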
76
Tag Syntax
Very similar to HTML tags
at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
tags with no content, like <br /> or <hr />
XML is case sensitive
so <player> cannot be closed by end tag </Player>
NOTE: thus pay attention to non-case sensitive technologies when combined with XML
HTML, JavaScript & XHTML, …
22
XML Trees: A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
23
An XML Document is a Tree
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
An XML Document has a tree-like structure
one and only one root
root element or document element
each node element can have one or more child elements
each element, except the root, has exactly one parent
child elements from the same parent are siblings
leaves are either content or empty elements
Well-formedness stems from here
<em><b>Wrong </em> XML</b> is not permitted
nesting needs to be perfect: overlapping is not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML
Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Most often we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used, hated, deprecated, widespread are
    DOM and SAX

Document Object Model
http://www.w3.org/DOM
    W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals, and scope

DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java

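As a minimal sketch of these general node properties and methods, the following uses the JAXP DOM API bundled with the JDK, which is only one of the Java solutions mentioned. The class name DomNodeDemo and the sample document are illustrative assumptions, not part of the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical demo class: explores and updates a small DOM tree.
class DomNodeDemo {
    /** Parses XML text into a DOM Document (whole document loaded in memory). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    /** Appends <team current="no">name</team> to the root; returns the new child count. */
    static int addTeam(Document doc, String name) {
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode(name));
        Element root = doc.getDocumentElement();
        root.appendChild(team); // goes after the other child nodes
        return root.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>");
        Element root = doc.getDocumentElement();
        // every node has a name, a value and a type
        System.out.println(root.getNodeName() + " type=" + root.getNodeType());
        // "attributes" is exposed in Java as getAttributes()
        NamedNodeMap attrs = root.getLastChild().getAttributes();
        System.out.println(attrs.getNamedItem("current").getNodeValue());
        System.out.println(addTeam(doc, "Mantova"));
    }
}
```

The sample document has no whitespace between elements, so the child counts refer to element nodes only; with indented XML the parser would also report whitespace text nodes as children.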
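A Simple Java DOM
    // DOMParser is assumed to be the Xerces class org.apache.xerces.parsers.DOMParser;
    // Document and Node come from org.w3c.dom, PrintStream from java.io
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // skip siblings until the first recipe child of the root element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);   // print(Node, PrintStream) is a helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
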
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools

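The event-based model can be sketched with the SAX API shipped in the JDK (javax.xml.parsers and org.xml.sax). The class name SaxEventsDemo and the way events are logged as strings are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records SAX events as the document flows through the parser.
class SaxEventsDemo {
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser may deliver text in several chunks; blank chunks are skipped here
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each event once, in document order, and whatever it does not store is gone, which is what keeps memory usage low.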
XML Trees: A Simple Example
                    player
    name     surname     team        team
    Carlo    Nervo       Bologna     Mantova

<player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
</player>

An XML Document is a Tree
An XML document has a tree-like structure
    one and only one root
        root element or document element
    each element can have one or more child elements
    each element, except the root, has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

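A quick way to see the nesting rule at work is to feed a parser both forms and check whether it accepts them. This is a sketch using the JAXP DOM parser of the JDK; the class name WellFormedCheck is an assumption, and any conforming XML parser would serve equally well.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a document is well-formed if a parser accepts it without fatal errors.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the text
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>")); // false: overlapping
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));  // true: perfect nesting
        System.out.println(isWellFormed("<a/><b/>"));                   // false: two roots
    }
}
```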
Narrative-Organised
<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
    attributes are specified in the start tag
        and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes

Using Elements or Attributes?
Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, whereas any number of child elements can share the same name
<player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
</player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
    to increase efficiency and abate complexity
An XML name can include
    any letter
        latin, or even non-latin like ideographs
    any digit
    underscore, hyphen, and period (_ - .)
    a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs, or underscore

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; char
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
    Batman & Robin

Entity References
&entityreference;
    an entity is something defined outside the normal flow of the XML document
        out of the XML tree
    used for constants, common values, external values, etc.
        through an entity reference
Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs

Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

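How the pre-defined entity references turn into plain characters once parsed can be sketched as follows, again with the JAXP DOM parser of the JDK. The class name EntityDemo and the second sample element are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows the parsed character data of the root element.
class EntityDemo {
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // the parser has already replaced entity references with their characters
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // prints: Batman & Robin
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
        // prints: a < b && b > c
        System.out.println(textOf("<code>a &lt; b &amp;&amp; b &gt; c</code>"));
    }
}
```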
CDATA Sections
Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good for instance to include segments of XML code to show
CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not

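The sketch below compares a CDATA section with the equivalent escaped text, using the JAXP DOM parser of the JDK. The character data is the same either way; a DOM parser may still expose a CDATA section as its own node type unless it is asked to coalesce it into ordinary text. The class name CdataDemo and the sample content are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: CDATA sections versus escaped character data.
class CdataDemo {
    static Document parse(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing); // true: CDATA sections become plain text nodes
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    /** Character data of the root element, whatever markup produced it. */
    static String textOf(String xml) {
        return parse(xml, true).getDocumentElement().getTextContent();
    }

    /** DOM node type of the first child of the root element. */
    static short firstChildType(String xml, boolean coalescing) {
        return parse(xml, coalescing).getDocumentElement().getFirstChild().getNodeType();
    }

    public static void main(String[] args) {
        String cdata = "<code><![CDATA[if (a < b && c) {}]]></code>";
        String escaped = "<code>if (a &lt; b &amp;&amp; c) {}</code>";
        System.out.println(textOf(cdata).equals(textOf(escaped)));                     // true
        System.out.println(firstChildType(cdata, false) == Node.CDATA_SECTION_NODE);   // true
        System.out.println(firstChildType(cdata, true) == Node.TEXT_NODE);             // true
    }
}
```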
Comments
Easy
    <!-- Comment -->
    it cannot contain --, so it cannot end with --->
Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
    to give info to computational agents: processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
Processing instructions have this very end
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root

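A minimal sketch of how an application could pick up processing instructions from a parsed document, using the JAXP DOM API of the JDK. The class name ProcessingInstructionDemo, the style sheet name player.css and the audit target are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class: lists the processing instructions found outside the root element.
class ProcessingInstructionDemo {
    static List<String> instructions(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList topLevel = doc.getChildNodes(); // siblings of the root element
            for (int i = 0; i < topLevel.getLength(); i++) {
                Node node = topLevel.item(i);
                if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) node;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
                + "<player/>"
                + "<?audit checked?>";
        for (String pi : instructions(xml)) {
            System.out.println(pi);
        }
    }
}
```

The XML declaration on the first line of the sample is not reported: it only looks like a processing instruction.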
The XML Declaration
Looks like an XML processing instruction
    but it is not: it is just the XML declaration
It is optional
    but, if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
Standalone set to yes means that the document does not rely on an external DTD
    optional, default no

Checking Well-Formedness
Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
Tools on the Web
    just look around

DTD

Flexibility or
XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
    some control over syntax should be enforced
    e.g. a football player should have at least one team
Document Type Definition (DTD)
    to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional

Validation
A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
Many things a DTD does not say
    we stick with what we can specify

DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD
We do not go too deep into DTD syntax
    we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
</player>

DTD Declaration
In the player example the DTD is declared as internal
    but it could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE player SYSTEM "http://…">
The name after DOCTYPE must match the root element of the document

DTD Declarations
So you may
    define your own DTD and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)

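Validation against these element declarations can be tried out with a validating parser. The sketch below uses the JAXP DOM parser of the JDK with an internal DTD; the class name DtdValidation is an assumption, and the DOCTYPE is named player so that it matches the root element, which validity requires.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: collects the validity errors of a player document.
class DtdValidation {
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity problems of DTD + body; an empty list means valid. */
    static List<String> validate(String body) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (SAXException e) {
            problems.add("not well-formed: " + e.getMessage());
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validate(valid));  // no problems reported
        System.out.println(validate(noTeam)); // at least one team is required
    }
}
```

Validity errors are reported through the error callback and parsing goes on, whereas a well-formedness violation is fatal: this mirrors the point that validity is optional while well-formedness is not.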
Some Syntax
, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations
A team element has a current attribute
    which is mandatory
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type

Attribute Defaults
#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, it has a given value
literal
    the default value is the literal quoted string

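The literal default can be observed from code: when an attribute is omitted, the parser supplies the value declared in the DTD. This is a sketch with the JAXP DOM parser of the JDK; the class name AttributeDefaultDemo and the choice of "no" as default for current are assumptions (the slides declare current as #REQUIRED).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads an attribute whose default comes from the DTD.
class AttributeDefaultDemo {
    // assumed variant of the team declaration, with a literal default instead of #REQUIRED
    static final String DTD =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\">\n"
        + "]>\n";

    /** Value of the current attribute of the given team element, default included. */
    static String currentOf(String teamXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamXml)));
            return doc.getDocumentElement().getAttribute("current");
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(currentOf("<team current=\"yes\">Bologna</team>")); // yes
        System.out.println(currentOf("<team>Mantova</team>"));                 // no (default)
    }
}
```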
Attribute Types
CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS
    reference(s) to ID values declared elsewhere in the document

Other DTD Declarations
ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it

Namespaces

What are Namespaces For?
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

Syntax for Namespaces
Qualified names
    or QNames, or raw names
    prefix:local_part
Examples of qualified names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes
Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
    then you can use local, euro, and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
    but svg, rdf, and other well-known prefixes are usually kept as they are

Setting Default Namespaces
xmlns attribute
    alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit

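Prefixed and default namespaces end up in the same place once parsed: each element gets a namespace URI plus a local name. The sketch below shows this with the JAXP DOM parser of the JDK, which has to be made namespace-aware explicitly. The class name NamespaceDemo, the office element, and the use of the .com URI as default namespace are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: resolves qualified names to {namespace URI}local_part.
class NamespaceDemo {
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    /** Expanded name of an element: the prefix itself plays no role. */
    static String describe(Element e) {
        return "{" + e.getNamespaceURI() + "}" + e.getLocalName();
    }

    public static void main(String[] args) {
        Document doc = parse(
              "<company xmlns=\"http://www.company.com/xml\""
            + " xmlns:local=\"http://www.company.it/xml\">"
            + "<local:office/><office/></company>");
        Element root = doc.getDocumentElement();
        Element first = (Element) root.getFirstChild();
        Element second = (Element) first.getNextSibling();
        System.out.println(describe(root));   // {http://www.company.com/xml}company
        System.out.println(describe(first));  // {http://www.company.it/xml}office
        System.out.println(describe(second)); // {http://www.company.com/xml}office
    }
}
```

The two office elements share a local name but belong to different namespaces, which is the disambiguation that namespaces are meant to provide.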
Internationalisation

What does "Text" Mean?
"Text" can be encoded according to so many different alphabets
Character set
    mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so encoding should be declared

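The difference between a code point and its encodings can be made concrete with a few lines of plain Java. The class name EncodingDemo and the choice of the character è (U+00E8) are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one code point, several byte encodings.
class EncodingDemo {
    /** Bytes of the text in the given encoding, as space-separated hex pairs. */
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String e = "\u00E8"; // the character è, written as an escape to be source-encoding safe
        System.out.println(Integer.toHexString(e.codePointAt(0)));   // e8: the code point
        System.out.println(hexBytes(e, StandardCharsets.ISO_8859_1)); // E8
        System.out.println(hexBytes(e, StandardCharsets.UTF_8));      // C3 A8
        System.out.println(hexBytes(e, StandardCharsets.UTF_16BE));   // 00 E8
    }
}
```

The same character yields three different byte sequences, which is why a reader of the file needs to be told which encoding was used.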
The XML Encoding Declaration
Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments, or XML chunks
    which may have different encodings
    there, version may be dropped
        it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element
Values are to be found in ISO 639
    standard two letters for each language known
    if not there: IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
        prefix x-

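A tool such as the spell-checker mentioned above could find the language of a given element by looking for xml:lang on the element itself and then on its ancestors, since the attribute applies to the whole content of the element carrying it. The sketch below does so with the JAXP DOM API of the JDK; the class name XmlLangDemo and the sample bio document are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: finds the language in force for a node.
class XmlLangDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    /** Nearest xml:lang value on the node or its ancestors, or "" if none is set. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<bio xml:lang=\"en\"><p>He played for <team xml:lang=\"it\">Bologna</team></p></bio>");
        System.out.println(languageOf(doc.getElementsByTagName("p").item(0)));    // en
        System.out.println(languageOf(doc.getElementsByTagName("team").item(0))); // it
    }
}
```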
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed
24
Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
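As an illustration of literal and #FIXED defaults, the sketch below feeds a small document to the expat parser from the Python standard library, which does not validate but does supply attribute defaults declared in the internal DTD subset. The sport and note attributes and the helper team_attributes are hypothetical additions to the football example.

```python
import xml.parsers.expat

DOC = """<?xml version="1.0"?>
<!DOCTYPE player [
  <!ELEMENT player (team+)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) "no"
                 sport   CDATA      #FIXED "football"
                 note    CDATA      #IMPLIED>
]>
<player><team current="yes">Bologna</team><team>Mantova</team></player>"""


def team_attributes(document):
    """Collect the attributes reported for each team start tag."""
    seen = []

    def start(name, attrs):
        if name == "team":
            seen.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(document, True)
    return seen


for attrs in team_attributes(DOC):
    print(attrs)
```

The second team carries no attribute in the document, yet the parser reports the literal default for current and the fixed value for sport; the #IMPLIED attribute note is simply absent.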
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document
48
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, prefixes are not
  although svg, rdf and other customary prefixes are usually kept
53
Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
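To see how prefixes and default namespaces resolve, here is a minimal sketch using xml.etree.ElementTree from the Python standard library, which reports every element name as {namespace URI}local_part. The office elements and their contents are hypothetical; the company URIs are the ones from the example.

```python
import xml.etree.ElementTree as ET

DOC = """<company xmlns:local="http://www.company.it/xml"
         xmlns:world="http://www.company.com/xml">
  <local:office>Cesena</local:office>
  <world:office>Boston</world:office>
  <svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>
</company>"""

root = ET.fromstring(DOC)

# The prefix disappears after parsing: only the namespace URI identifies the name.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)

# rect has no prefix, but inherits the default namespace declared on svg.
svg = root.find("{http://www.w3.org/2000/svg}svg")
rect = svg[0]
print(rect.tag)
```

The two office elements share a local part but end up with different expanded names, which is exactly the disambiguation namespaces are meant to provide.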
54
Internationalisation
55
What does "Text" Mean?
"Text" can be encoded according to many different alphabets
A character set is a mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
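The difference between a character set and a character encoding can be checked directly. The sketch below takes one sample word (chosen here only for its accented letter) and shows that its code points stay the same while the bytes change with the encoding.

```python
text = "Università"

# Character set: each character maps to one integer code point.
code_points = [ord(ch) for ch in text]

# Character encodings: the same code points map onto different byte sequences.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
latin1 = text.encode("iso-8859-1")

print(code_points[-1])   # code point of the final accented letter
print(len(utf8), len(utf16), len(latin1))
```

Ten characters become 11 bytes in UTF-8, 20 in UTF-16 and 10 in Latin-1, so a reader that guesses the wrong encoding will misread the accented letter.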
56
The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
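A short sketch of why the declared encoding matters, using xml.etree.ElementTree from the Python standard library. The sample documents and the helper name parsed_text are hypothetical; the third document deliberately declares utf-8 while its bytes are Latin-1.

```python
import xml.etree.ElementTree as ET

BODY = "<name>Università</name>"

latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY).encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="utf-8"?>' + BODY).encode("utf-8")
# Declaration and actual bytes disagree: declared utf-8, stored as Latin-1.
mismatched_doc = ('<?xml version="1.0" encoding="utf-8"?>' + BODY).encode("iso-8859-1")


def parsed_text(data):
    """Return the text of the root element, or None if the bytes cannot be parsed."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(parsed_text(latin1_doc))
print(parsed_text(utf8_doc))
print(parsed_text(mismatched_doc))
```

Both correctly declared documents yield the same text even though their bytes differ, while the mismatched one is rejected by the parser instead of being silently misread.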
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
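A minimal sketch of how a tool such as a spell-checker could read xml:lang, using xml.etree.ElementTree from the Python standard library. The xml prefix is bound to the fixed namespace http://www.w3.org/XML/1998/namespace; the sample document and the helper name languages are hypothetical, and the sketch assumes the usual rule that a child inherits the language of its parent unless it declares its own.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the namespace bound to the xml prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = '<doc xml:lang="en"><p>Hello</p><p xml:lang="it">Ciao</p></doc>'


def languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting the language from the parent."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


result = list(languages(ET.fromstring(DOC)))
for tag, lang in result:
    print(tag, lang)
```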
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
69
Document Object Model
http://www.w3.org/DOM/
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
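The general node properties and methods can be tried out with xml.dom.minidom, a small DOM implementation in the Python standard library. This is only a sketch on a cut-down version of the football player document, not the Java API discussed in the slides.

```python
from xml.dom import Node, minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
player = doc.documentElement
team = player.firstChild

# Every node has a name, a value and a type.
print(team.nodeName, team.nodeValue, team.nodeType == Node.ELEMENT_NODE)

# attributes gives access to all the attributes of the node.
print(team.attributes["current"].value)

# The text inside the element is itself a child node.
print(team.firstChild.nodeType == Node.TEXT_NODE, team.firstChild.nodeValue)

# appendChild(newChild) adds newChild after the other child nodes.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(new_team)

print(player.toxml())
```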
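A Simple Java DOM Example
  // DOMParser is presumably the Apache Xerces parser (org.apache.xerces.parsers.DOMParser);
  // Document and Node come from org.w3c.dom, PrintStream from java.io.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // Skip the children of the root until the first recipe element is found.
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }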
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
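The event-based model can be seen at work with xml.sax from the Python standard library. In this sketch the handler class EventLog is a hypothetical name; it only records the events in the order the parser reports them, and no tree is ever built.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Records each SAX event as it is reported by the parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("start-document", None))

    def endDocument(self):
        self.events.append(("end-document", None))

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may deliver one run of text in several chunks.
        self.events.append(("text", content))


log = EventLog()
xml.sax.parseString(
    b"<player><name>Carlo</name><team>Bologna</team></player>", log
)
for kind, value in log.events:
    print(kind, value)
```

The application decides what to keep from each event, which is why memory use stays low even for very large documents.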
76
Narrative-Organised XML Documents
  <biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
  </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
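Mixed content is harder to process because text sits between the child elements rather than inside them. A minimal sketch with xml.etree.ElementTree from the Python standard library, on a shortened and hypothetical version of the biography:

```python
import xml.etree.ElementTree as ET

bio = ET.fromstring(
    "<biography><name>Carlo Nervo</name> moved from "
    "<football_team>Mantova</football_team> to "
    "<football_team>Bologna</football_team>.</biography>"
)

# Marked-up items are easy to extract...
teams = [team.text for team in bio.iter("football_team")]
print(teams)

# ...but the narrative is scattered: the text following an element is stored as its tail.
print(repr(bio.find("name").tail))

# itertext() walks text and tails in document order to rebuild the sentence.
plain = "".join(bio.itertext())
print(plain)
```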
25
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
26
Using Elements or Attributes?
"Attributes are for meta-data about the element, and content is information of the element"
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
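The naming rules above can be approximated in a few lines. This is a deliberately simplified sketch: the function name is_simple_xml_name is hypothetical, Python's notion of letter and digit is used in place of the exact character ranges of the XML specification, and the colon is rejected outright because it is reserved to namespaces.

```python
def is_simple_xml_name(name):
    """Approximate check of the XML naming rules listed above."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # A name may begin only with a letter (including ideographs) or underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # After that: letters, digits, underscore, hyphen and period.
    return all(ch.isalnum() or ch in "_-." for ch in rest)


for candidate in ["team", "_id", "first-name.v2", "2nd", "-x", "first name", "a:b"]:
    print(candidate, is_simple_xml_name(candidate))
```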
28
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; char
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities
  Markup    Entity    Description
  &lt;      <         less-than
  &gt;      >         greater-than
  &amp;     &         ampersand
  &quot;    "         double quote
  &apos;    '         single quote
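The pre-defined entities rarely need to be typed by hand. As a sketch, the escape and unescape helpers in xml.sax.saxutils (Python standard library) convert between raw text and its escaped form; the sample string is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "Batman & Robin < Superman"

# escape() replaces &, < and > with the pre-defined entity references.
escaped = escape(raw)
print(escaped)

# Quotes are escaped only on request, through an extra mapping.
print(escape('"', {'"': "&quot;"}))

# A parser turns the references back into the original characters.
parsed = ET.fromstring(f"<superheroes>{escaped}</superheroes>").text
print(parsed)
```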
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good for instance to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
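A quick check of the last point, sketched with xml.etree.ElementTree from the Python standard library: the same code fragment (a hypothetical one-liner) is written once inside a CDATA section and once with entity references, and the parsed text is identical.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[if (a < b && c) run();]]></code>")
with_refs = ET.fromstring("<code>if (a &lt; b &amp;&amp; c) run();</code>")

print(with_cdata.text)
print(with_cdata.text == with_refs.text)
```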
32
Comments
Easy
  <!-- Comment -->
  it cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions
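That parsers may drop or keep comments can be observed with two parsers from the Python standard library. In this sketch, on a hypothetical one-line document, xml.etree.ElementTree with its default settings discards the comment while xml.dom.minidom keeps it as a node.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

DOC = "<player><!-- current team only --><team>Bologna</team></player>"

# ElementTree (default parser settings): the comment is dropped.
et_children = [child.tag for child in ET.fromstring(DOC)]
print(et_children)

# minidom: the comment is kept as a child node of player.
dom_children = minidom.parseString(DOC).documentElement.childNodes
first = dom_children[0]
print(first.nodeType == Node.COMMENT_NODE, repr(first.data))
```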
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root
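A sketch of how an application receives a processing instruction, using xml.dom.minidom from the Python standard library. The style sheet name player.css is hypothetical; the parser hands over the target and the rest of the instruction as an uninterpreted string.

```python
from xml.dom import Node, minidom

DOC = '<?xml-stylesheet type="text/css" href="player.css"?><player/>'

doc = minidom.parseString(DOC)

# The processing instruction comes before the root element, outside the element tree.
pi = doc.firstChild
print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)
print(pi.data)
print(doc.documentElement.tagName)
```

What looks like attributes inside the instruction is just text: it is up to the target application to interpret it.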
34
The XML Declaration
Looks like an XML processing instruction
  but it is not: just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
  not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the form of the text (Unicode in the example)
  optional, default Unicode
standalone="yes" means that the document has no external DTD
  optional, default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
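Any XML parser doubles as a well-formedness checker, since it must reject documents that break the rules. A minimal sketch with xml.etree.ElementTree from the Python standard library; the helper name is_well_formed and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<a><b></b></a>",      # matching, properly nested tags
    "<a><b></a></b>",      # overlapping elements
    "<a/><b/>",            # more than one root element
    "<a x=1/>",            # unquoted attribute value
    '<a x="1" x="2"/>',    # same attribute name twice
    "<a>1 < 2</a>",        # unescaped < in character data
    "<a>1 &lt; 2</a>",     # the same text, properly escaped
]
for sample in samples:
    print(sample, is_well_formed(sample))
```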
36
DTD
37
Flexibility or …
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
41
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
42
XML AttributesElements can be labelled by attributes
attributes are specified in the start tagand in the only tag of empty elements
any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=value
alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML document
but they are qualifiers for the nodes and
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD is…
SGML-based
syntax a bit awkward
but after all easy to understand
and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approach
maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be that
but understanding DTDs, how to modify them, and how to write your own is likely to remain useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
we just look at the following example and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
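Validation against an internal DTD like this one can be tried out with a validating parser. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name DtdValidation is hypothetical, and the document type name is taken to be player so that it matches the root element, as validity requires.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidation {
    /** Internal DTD subset for the football player example. */
    static final String DOCTYPE =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity (and well-formedness) errors found; empty means valid. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);   // ask for DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());   // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            errors.add(e.getMessage());           // fatal (well-formedness) error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DOCTYPE + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DOCTYPE + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document:   " + validate(valid));
        System.out.println("player, no team:  " + validate(noTeam));
    }
}
```

A document with no team, or with a team lacking its current attribute, is still well-formed; only the validating parser complains.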
41
DTD Declaration
In the example, the DTD is declared as internal
but it could be declared separately
<!DOCTYPE player SYSTEM "football_player.dtd">
even referring to an external shared resource
<!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
define your own DTD and
either include it in your XML document
or save it as an independent document and refer to it from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to, or a standardisation body of any sort
by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
in that precise order
and each of them just contains parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
to define ordered lists
| is for choice
to provide for alternatives
suffixes
* for zero or more occurrences
+ for one or more occurrences
? for zero or one occurrence
parentheses for grouping
at any level of nesting
operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
45
Attribute
A team element has a current attribute
which is mandatory (#REQUIRED)
#IMPLIED would make it optional instead
and can be either yes or no
enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
the attribute is optional
#REQUIRED
the attribute is mandatory
#FIXED
whether it is explicitly specified or not, the attribute has the given value
literal
the default value is the literal quoted string
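The effect of these defaults can be observed on the parsed tree. A minimal sketch, assuming the JAXP parser bundled with the JDK, which reads the internal DTD subset and supplies declared defaults even without validation; the class name AttributeDefaults and the sport and nickname attributes are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaults {
    /** Hypothetical declarations: a literal default, a #FIXED value, an #IMPLIED attribute. */
    static final String DOCTYPE =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team\n"
        + "    current  (yes | no) \"no\"\n"
        + "    sport    CDATA      #FIXED \"football\"\n"
        + "    nickname CDATA      #IMPLIED>\n"
        + "]>\n";

    /** Parses a single team element preceded by the DTD above. */
    static Element parseTeam(String teamMarkup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOCTYPE + teamMarkup)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element team = parseTeam("<team>Bologna</team>");
        // Neither attribute is written in the document: both come from the DTD
        System.out.println("current  = " + team.getAttribute("current"));
        System.out.println("sport    = " + team.getAttribute("sport"));
        // An #IMPLIED attribute that is not written is simply absent
        System.out.println("nickname present? " + team.hasAttribute("nickname"));
    }
}
```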
47
Attribute Types
CDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
like an XML name, except that any name character is also accepted as the first character
the plural form accepts more than one token, separated by white space
ENTITY, ENTITIES
name(s) of unparsed entities declared elsewhere in the DTD
ID
an XML name, unique in the document, working as an identifier
IDREF, IDREFS
reference(s) to ID values of elements in the same document
48
Other DTD
ENTITY declarations
<!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
who cares, actually
We stop here
more only for those who need it
49
Namespaces
50
What are
Distinguish
different XML applications may use the same names
at any scale, from personal to world-wide
a namespace allows them to be clearly distinguished
Group
names of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
what if I have to use them together?
namespaces can be used to disambiguate names
51
Syntax for
Qualified names
prefix:local_part
Examples of qualified names
or QNames, or raw names
rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
a large firm could have a number of namespaces for different purposes
<company xmlns:local="http://www.company.it/xml"
         xmlns:euro="http://www.company.eu/xml"
         xmlns:world="http://www.company.com/xml">
then you can use local, euro and world as prefixes everywhere inside that element
typically declared in the topmost element, but could be declared anywhere
example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, prefixes are not
but conventional prefixes such as svg and rdf are usually left unchanged
53
Setting Default
xmlns attribute
alone, with no :prefix suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
no need to make the svg prefix explicit
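How prefixes and the default namespace resolve to URIs can be checked with a namespace-aware parser. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name NamespaceDemo and the office and rect elements are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceDemo {
    /** Parses with namespace support switched on (it is off by default in JAXP). */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Renders a node name as {namespace URI}local part. */
    static String expandedName(Node node) {
        return "{" + node.getNamespaceURI() + "}" + node.getLocalName();
    }

    public static void main(String[] args) {
        String company =
              "<company xmlns:local=\"http://www.company.it/xml\""
            + " xmlns:world=\"http://www.company.com/xml\">"
            + "<local:office/><world:office/></company>";
        Element root = parse(company).getDocumentElement();
        // Same local name, different namespaces: the two offices are distinct names
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeName() + " -> " + expandedName(n));
        }
        // Default namespace: unprefixed descendants inherit it
        Element svg = parse("<svg xmlns=\"http://www.w3.org/2000/svg\"><rect/></svg>")
                .getDocumentElement();
        System.out.println("rect -> " + expandedName(svg.getFirstChild()));
    }
}
```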
54
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
character set: a mapping between characters and integers (code points)
ASCII being the most (in)famous; now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
so its encoding should be declared
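The difference between a code point and its encodings can be made concrete by printing the bytes. A minimal sketch in plain Java; the class name EncodingDemo and the sample character è (code point U+00E8) are chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Returns the bytes of the text in the given encoding, as hexadecimal pairs. */
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8";   // one character, one code point
        System.out.println("code point U+00"
                + Integer.toHexString(eGrave.codePointAt(0)).toUpperCase());
        // The same code point becomes different byte sequences
        System.out.println("UTF-8      : " + hexBytes(eGrave, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE   : " + hexBytes(eGrave, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1 : " + hexBytes(eGrave, StandardCharsets.ISO_8859_1));
    }
}
```

A reader that assumes the wrong encoding would turn the two UTF-8 bytes into two unrelated characters, which is why the declaration matters.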
56
The XML Encoding Declaration
Part of the XML declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
utf-8, utf-16 (Unicode)
ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
Unicode and ISO are the most used families
Used also for external parsed entities
like DTD fragments or XML chunks
which may have different encodings
there, version may be dropped
it is then a text declaration, no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
for multi-lingual docs
xml:lang attribute
can be associated with any element
determines the language of the element
Values are to be found in ISO 639
standard two-letter codes for each known language
if not there, IANA
prefix i-
such as i-navajo, i-klingon, …
if not there either, such as for user-defined tags
prefix x-
58
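A tool such as the spell-checker mentioned above has to find out which language applies to a given element. A minimal sketch, assuming the usual reading that xml:lang set on an element also covers its descendants until overridden, and assuming the JAXP parser bundled with the JDK; the class name XmlLangDemo and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangDemo {
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // needed to look up xml:lang by namespace
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the nearest xml:lang value on the node or its ancestors, or "" if none. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Document doc = parse("<doc xml:lang=\"en\"><p>Hello</p>"
                + "<p xml:lang=\"it\"><em>Ciao</em></p></doc>");
        System.out.println("first p: " + languageOf(doc.getElementsByTagName("p").item(0)));
        System.out.println("em:      " + languageOf(doc.getElementsByTagName("em").item(0)));
    }
}
```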
Encoding for
Dealing with encoding is not simply an "internationalisation" issue
it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
which are often not easy to catch
XML's abilities to
handle encoding precisely and accurately
embed encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
when trying to visualise an XML document
XML, however, can be transformed
to become easier to handle by standard browsers
Two main approaches
Web-based one: XML + CSS
XML-based one: XSL
In the following we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
http://w3c.org/Style/CSS
Goals
describing how to present elements of a document
spanning over a range of different media
separating style description from content and structure
In this course we assume that you already know the basics
if not, look at http://www.w3.org/Style/CSS/learning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
a CSS style sheet referring to the proper element types of the XML document
the association between the XML document and the CSS style sheet
Processing instruction
to associate CSS with XML
<?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
tagname { property1: value1; … }
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
and presenting it somehow
is not enough for most non-trivial application scenarios
Often we need to manipulate
access, delete, modify
parts of an XML document
which may or may not be an XML file
This is typically done through programming languages of many sorts
through ad hoc APIs
The most used, hated, deprecated, widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
standardised for Java & ECMAScript
but can be extended to other languages
There is no time here to go deep into DOM
we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
primitive navigation and manipulation of XML trees
other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
one of them is the root node
nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them may have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
attributes returns all the attributes of the node
appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
many solutions for Java
73
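A Simple Java DOM
// DOMParser is presumably the Apache Xerces class org.apache.xerces.parsers.DOMParser;
// print(Node, PrintStream) is a helper that the slide does not show.
public static void main(String[] args) {
    try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                     // parse the file named on the command line
        Document doc = p.getDocument();
        // walk the children of the root until the first recipe element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out);                    // copy the recipe found, if any
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}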
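The same navigation can be tried without an external parser library. The following is a sketch, not the slide's code: it swaps in the JAXP DocumentBuilder bundled with the JDK, and the class name FirstRecipeJaxp, the helper firstChildNamed and the sample collection are hypothetical. It also touches the general node properties (name, type, attributes) and appendChild.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstRecipeJaxp {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the first child of parent with the given node name, or null. */
    static Node firstChildNamed(Node parent, String name) {
        Node n = parent.getFirstChild();
        while (n != null && !n.getNodeName().equals(name)) {
            n = n.getNextSibling();
        }
        return n;
    }

    public static void main(String[] args) {
        Document doc = parse("<collection><description>My recipes</description>"
                + "<recipe id=\"r1\">Pasta</recipe><recipe id=\"r2\">Pizza</recipe></collection>");
        Element root = doc.getDocumentElement();

        Node recipe = firstChildNamed(root, "recipe");
        System.out.println("name: " + recipe.getNodeName()
                + ", is element: " + (recipe.getNodeType() == Node.ELEMENT_NODE)
                + ", attributes: " + recipe.getAttributes().getLength()
                + ", text: " + recipe.getTextContent());

        // appendChild puts the new node after the existing children
        Element extra = doc.createElement("recipe");
        extra.setAttribute("id", "r3");
        root.appendChild(extra);
        System.out.println("last child id: " + ((Element) root.getLastChild()).getAttribute("id"));
    }
}
```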
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to manage
wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
which did not start as a standard
has problems of acceptance
but has indeed a long tail of followers
and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
flowing through the SAX parser
and generating events such as document started / ended, element started / ended, character content, etc.
A very simple model
good for simple applications
and also to avoid memory abuse
Not as well-supported as DOM is
in terms of standardisation
as well as of tools
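The event-based model can be seen by logging the callbacks a SAX parser makes while the text flows through it. A minimal sketch, assuming the SAX parser bundled with the JDK; the class name SaxEventLog and the event labels are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLog extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument()   { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // may be called several times for one run of text, so just accumulate
        text.append(ch, start, length);
    }

    List<String> events() { return events; }
    String text() { return text.toString(); }

    /** Feeds the XML text through a SAX parser and returns the filled-in log. */
    static SaxEventLog run(String xml) {
        try {
            SaxEventLog log = new SaxEventLog();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), log);
            return log;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SaxEventLog log = run("<player><name>Carlo</name><surname>Nervo</surname></player>");
        log.events().forEach(System.out::println);
        System.out.println("text seen: " + log.text());
    }
}
```

No tree is ever built: the handler keeps only what it chooses to keep, which is where the memory saving comes from.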
76
Using Elements or
Attributes are for meta-data about the element, and content is the information of the element
maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
attributes provide for a flat data structure, elements can be nested as needed
attributes are unique within an element any
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>
27
XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
to increase efficiency and abate complexity
An XML name can include
any letter
Latin or even non-Latin, like ideographs
any digit
underscore, hyphen and period (_ - .)
a colon (:) is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
and can begin only with letters, ideographs or underscore
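The rules above can be turned into a rough checker. This is only an approximation under stated assumptions: the XML specification defines name characters through explicit Unicode ranges, whereas this sketch relies on Java's Character classification and rejects colons altogether; the class name XmlNameCheck is hypothetical.

```java
class XmlNameCheck {
    /** Approximate test for a colon-free XML name, following the rules listed above. */
    static boolean isPlausibleName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        char first = s.charAt(0);
        // must begin with a letter (ideographs count as letters) or an underscore
        if (!(Character.isLetter(first) || first == '_')) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean allowed = Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
            if (!allowed) {
                return false;   // white space, colon, other punctuation
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"student_name", "_id", "team-2.a", "2nd", "-x", "first name", "rdf:RDF"};
        for (String s : samples) {
            System.out.println(isPlausibleName(s) + "  " + s);
        }
    }
}
```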
28
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
so for instance < is always taken as the beginning of a tag
what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
unless an escape character & is encountered
character data to parse starts again after the ; character
E.g. the content of the element
<superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
Batman & Robin
29
Entity References
&entityreference;
an entity is something defined outside the normal flow of the XML document
out of the XML tree
used for constants, common values, external values, etc.
through an entity reference
Users of any sort may define their own entities
we'll see how soon, for instance through DTDs
30
Pre-defined XML Entities

Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote
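The table reads in both directions: a writer replaces the special characters with entity references, and the parser turns them back into characters. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name EntityEscaping and its two helpers are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EntityEscaping {
    /** Replaces the five special characters with the pre-defined entity references. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Parses the document and returns the character data of its root element. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("Batman & Robin"));
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
    }
}
```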
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
we need to tell the parser "do not parse this"
good for instance to include segments of XML code to show
CDATA section
between <![CDATA[ and ]]>
can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
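That a CDATA section and escaped text carry the same character data can be checked directly. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name CdataDemo and the code fragment are made up. A DOM tree may still keep CDATA sections as separate nodes, so the comparison here is on the text content only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class CdataDemo {
    /** Parses the document and returns the character data of its root element. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && c) { run(); }]]></code>";
        String escaped   = "<code>if (a &lt; b &amp;&amp; c) { run(); }</code>";
        System.out.println(textOf(withCdata));
        System.out.println(textOf(escaped));
        System.out.println("same text: " + textOf(withCdata).equals(textOf(escaped)));
    }
}
```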
32
Comments
Easy
<!-- Comment -->
It cannot contain --, hence it cannot end with --->
Comments do not affect the document tree-structure
they can appear anywhere, even before the root element
but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
to give info to computational agents, use processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very purpose
<?target … ?>
The target may be the application that has to handle the instruction, or just an identifier for the particular processing instruction
<?php … ?>
<?xml-stylesheet … ?>
A processing instruction is markup, but not an element
it can appear anywhere outside a tag, even before or after the root element
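An application can pick up the processing instructions addressed to it from the parsed document. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name ProcessingInstructionDemo, the app target and the style.css file name are invented for the example. Only instructions outside the root element are listed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class ProcessingInstructionDemo {
    /** Lists "target -> data" for the processing instructions before or after the root. */
    static List<String> instructions(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> found = new ArrayList<>();
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
                + "<!-- a comment, which a parser may drop -->"
                + "<doc/>"
                + "<?app done?>";
        // The XML declaration is not reported: it is not a processing instruction
        instructions(xml).forEach(System.out::println);
    }
}
```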
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore
28
Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsed
unless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes strict rules are required
  some control over syntax should be enforced
  e.g., a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is left to the application
38
Validation
A valid XML document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, a DTD specifies positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Note: the name after DOCTYPE must be the name of the root element (player), otherwise the document is not valid
41
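Validation against an internal DTD like the football player one can be tried from Java by switching the JAXP parser to validating mode and collecting the validity errors it reports. This is a sketch under assumptions: the class name DtdValidationDemo is hypothetical, and the document type is named player so that it matches the root element, as a validating parser requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {
    // Internal DTD subset for the football player example.
    static final String DTD =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (name, surname, team+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT surname (#PCDATA)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
          + "]>\n";

    // Returns the validity errors found in the given document body (empty list = valid).
    static List<String> validityErrors(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation is off by default
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // validity errors are recoverable: collect them
            }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validityErrors(valid));
        System.out.println(validityErrors(noTeam));
    }
}
```

The second document is well-formed but invalid: it parses without a fatal error, and the missing team shows up only in the list of validity errors.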
DTD Declaration
In the football player example (slide 41) the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
In all cases, the name after DOCTYPE is the name of the root element
42
DTD Declarations
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and name, surname and team contain just parsed character data (#PCDATA)
From the football player example (slide 41):
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
45
Attribute
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would make it optional instead
  and can be either yes or no
    enumeration as an attribute type
From the football player example (slide 41):
  <!ATTLIST team current (yes | no) #REQUIRED>
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether explicitly specified or not, the attribute always has the given value
literal
  the default value is the literal, quoted string
47
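The effect of a literal default and of #FIXED can be observed from an application: the parser supplies the attribute even when the document omits it. A minimal sketch, assuming the JDK's built-in parser (which reads attribute defaults from the internal DTD subset even without validation); the class name and the sport attribute are hypothetical additions to the team element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultDemo {
    // "current" defaults to "no"; "sport" (hypothetical) is fixed to "football".
    static final String DTD =
            "<!DOCTYPE team [\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) \"no\"\n"
          + "                 sport CDATA #FIXED \"football\">\n"
          + "]>\n";

    // Parses DTD + body and returns the root element.
    static Element parseRoot(String body) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(DTD + body)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element team = parseRoot("<team>Mantova</team>");
        System.out.println(team.getAttribute("current"));                    // supplied by the DTD
        System.out.println(team.getAttributeNode("current").getSpecified()); // not written in the document
        System.out.println(team.getAttribute("sport"));
    }
}
```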
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to ID values defined in the same document
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it
49
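What an ENTITY declaration buys can be sketched with an internal entity, which keeps the example self-contained (the footer entity of the slide is external and would require network access). The class name and the entity's replacement text are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityDemo {
    // Parses an XML string and returns the text of the root element,
    // with entity references replaced by their declared content.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE page [\n"
                + "  <!ENTITY footer \"Hypothetical footer text\">\n"
                + "]>\n"
                + "<page>Content. &footer;</page>";
        System.out.println(rootText(xml));
    }
}
```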
Namespaces
50
What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  also called QNames or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml"
           xmlns:euro="http://www.company.eu/xml"
           xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world as prefixes everywhere inside the company element
Prefixes are typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but conventional prefixes such as svg and rdf are usually left unchanged
53
Setting Default
xmlns attribute
  alone, with no :prefix suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
54
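How a default namespace and a prefix disambiguate the two set elements can be checked with a namespace-aware DOM parser. A minimal sketch: the class name and the prefix m are hypothetical, while the two namespace URIs are the standard SVG and MathML ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Renders an element name as {namespace}local_part.
    static String describe(Element e) {
        return "{" + e.getNamespaceURI() + "}" + e.getLocalName();
    }

    public static void main(String[] args) throws Exception {
        // Default namespace for SVG, prefix m for MathML.
        String xml = "<svg xmlns=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
                + "<set/><m:set/></svg>";
        Element root = parse(xml).getDocumentElement();
        System.out.println(describe(root));
        System.out.println(describe((Element) root.getFirstChild()));
        System.out.println(describe((Element) root.getLastChild()));
    }
}
```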
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
Character set
  mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
56
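The difference between a character set (code points) and an encoding (bytes) can be made concrete in a few lines of Java. A small sketch; the class name and the sample word are chosen for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // 10 characters, the last one is a-grave (U+00E0)
        // The code point is the same whatever the encoding...
        System.out.println(Integer.toHexString(word.codePointAt(word.length() - 1)));
        // ...but the number of bytes depends on the encoding.
        System.out.println(byteLength(word, StandardCharsets.UTF_8));      // a-grave takes 2 bytes
        System.out.println(byteLength(word, StandardCharsets.UTF_16BE));   // 2 bytes per character here
        System.out.println(byteLength(word, StandardCharsets.ISO_8859_1)); // 1 byte per character
    }
}
```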
The XML Encoding
Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  see also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-
58
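A spell-checker could find the language in effect for a piece of text by looking for the nearest xml:lang on the element or its ancestors, since the attribute applies to the whole content of the element unless overridden. A minimal sketch; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangDemo {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up xml:lang by namespace
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Returns the nearest xml:lang value on the node or its ancestors, or null if none.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc xml:lang=\"en\"><p>Hello</p>"
                + "<p xml:lang=\"it\">Ciao <b>mondo</b></p></doc>");
        System.out.println(languageOf(doc.getElementsByTagName("p").item(0))); // inherited from doc
        System.out.println(languageOf(doc.getElementsByTagName("b").item(0))); // inherited from second p
    }
}
```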
Encoding for
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS with XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining the presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope
70
DOM & Levels
DOM views an XML document as a tree data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
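In the Java binding, the general properties and methods above appear as methods of org.w3c.dom.Node (getNodeName, getNodeValue, getNodeType, getAttributes, appendChild). A minimal sketch that builds a small tree in memory; the class name DomNodeDemo is hypothetical, the data come from the football player example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class DomNodeDemo {
    // Builds <player><name>Carlo</name><team current="yes">Bologna</team></player>.
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element player = doc.createElement("player");
        doc.appendChild(player);
        Element name = doc.createElement("name");
        name.appendChild(doc.createTextNode("Carlo"));
        player.appendChild(name);
        Element team = doc.createElement("team");
        team.setAttribute("current", "yes");
        team.appendChild(doc.createTextNode("Bologna"));
        player.appendChild(team); // appended after the existing child <name>
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Node team = build().getDocumentElement().getLastChild();
        // Name, value and type are available on every node.
        System.out.println(team.getNodeName() + " type=" + team.getNodeType()
                + " value=" + team.getNodeValue());
        // The attributes of the node, as a map of attribute nodes.
        NamedNodeMap attrs = team.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(attrs.item(i).getNodeName() + "=" + attrs.item(i).getNodeValue());
        }
        // The text is a child node with its own value.
        System.out.println(team.getFirstChild().getNodeValue());
    }
}
```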
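A Simple Java DOM
// imports assumed (not shown on the slide): org.apache.xerces.parsers.DOMParser,
// org.w3c.dom.Document, org.w3c.dom.Node, java.io.PrintStream
public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);                    // parse the file named on the command line
    Document doc = p.getDocument();
    // look for the first recipe child of the root element
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);                     // print(Node, PrintStream) is a helper not shown on this slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}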
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has had problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is read, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM
  in terms of standardisation
  as well as of tools
76
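The event-based model can be sketched with the SAX API shipped with the JDK: a handler receives callbacks while the text flows through the parser, and no tree is ever built. The class name SaxDemo and the event labels are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    // Returns the sequence of events generated while parsing the document.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser is free to deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

The handler only sees one event at a time, so whatever state the application needs (here, the log) has to be kept by the application itself.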
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
29
Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
30
Pre-defined XML
Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote
31
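The five pre-defined references are all an application needs in order to write arbitrary text into an XML document. A minimal sketch of an escaping function, checked by parsing its output back; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EscapeDemo {
    // Replaces the five special characters with the pre-defined entity references.
    static String escape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Writes the text into an element, parses the document, and reads the text back.
    static String roundTrip(String raw) throws Exception {
        String xml = "<t>" + escape(raw) + "</t>";
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(escape("Batman & Robin"));
        System.out.println(roundTrip("if (a < b && c > \"d\") 'x'"));
    }
}
```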
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML
Markup Entity Description
amplt lt less-then
ampgt gt grater-than
ampamp amp ampersand
ampquot double quote
ampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
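Validation against such an internal DTD can be tried directly with a validating parser. The sketch below assumes the JAXP parser bundled with the JDK; the class DtdValidation and its validate method are illustrative names, and the document type name is taken to be player so that it matches the root element, as validity requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the slides.
final class DtdValidation {
    // Internal DTD subset modelled on the football player example
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity errors reported for DTD + body (empty list = valid). */
    static List<String> validate(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // switch on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // A player without any team: well-formed, but not valid
        System.out.println(validate("<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```

Validity errors are collected rather than thrown here, which reflects the point that how to handle them is left to the application.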
41
DTD Declaration
In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
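The operators above can be combined in a single content model. The sketch below uses an invented book DTD (not from the slides) with a sequence, a choice, and the three suffixes, and checks a few documents against it with the JDK's validating parser; ContentModelDemo and isValid are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; the book vocabulary is made up for this example.
final class ContentModelDemo {
    static final String DTD =
          "<!DOCTYPE book [\n"
        + "  <!ELEMENT book (title, (author | editor)+, note?, chapter*)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "  <!ELEMENT author (#PCDATA)>\n"
        + "  <!ELEMENT editor (#PCDATA)>\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ELEMENT chapter (#PCDATA)>\n"
        + "]>\n";

    /** True if DTD + body is both well-formed and valid. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat validity errors like fatal ones
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<book><title>t</title><author>a</author></book>"));
        System.out.println(isValid("<book><title>t</title></book>")); // + needs at least one author or editor
    }
}
```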
45
Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, the attribute has a given value
literal
  the default value is the literal quoted string
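The effect of the four kinds of default shows up when reading attributes back after parsing. A sketch under assumptions: the nickname, sport and country attributes are invented for the example, the JDK parser is used without validation (it still reads the internal DTD subset), and AttributeDefaultsDemo is an illustrative name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; only the current attribute comes from the slides.
final class AttributeDefaultsDemo {
    static final String DTD =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team\n"
        + "    current  (yes | no) #REQUIRED\n"
        + "    nickname CDATA      #IMPLIED\n"
        + "    sport    CDATA      #FIXED \"football\"\n"
        + "    country  CDATA      \"Italy\">\n"
        + "]>\n";

    /** Value of the named attribute on the root element, or null if it is absent. */
    static String attribute(String body, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + body)))
                    .getDocumentElement();
            return root.hasAttribute(name) ? root.getAttribute(name) : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String team = "<team current=\"yes\">Bologna</team>";
        System.out.println(attribute(team, "sport"));    // supplied by #FIXED
        System.out.println(attribute(team, "country"));  // supplied by the literal default
        System.out.println(attribute(team, "nickname")); // #IMPLIED and not given
    }
}
```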
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
  or QNames, or raw names
Examples of qualified names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but svg, rdf and other conventional prefixes are usually left unchanged
53
Setting Default
xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
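How prefixes and the default namespace resolve can be checked with a namespace-aware parser. The sketch below mixes SVG (as default namespace) with MathML (under an arbitrary prefix m) so that the two set elements can be told apart; NamespaceDemo and expandedNames are illustrative names, and the JDK's JAXP parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the slides.
final class NamespaceDemo {
    /** Lists every element as {namespace-URI}local-name, in document order. */
    static String expandedNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagNameNS("*", "*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node node = all.item(i);
                names.add("{" + node.getNamespaceURI() + "}" + node.getLocalName());
            }
            return String.join(" ", names);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedNames(
              "<svg xmlns=\"http://www.w3.org/2000/svg\""
            + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
            + "<set/><m:set/></svg>"));
    }
}
```

The prefix itself never appears in the result: only the URI and the local part identify a name, which is why prefixes can be chosen freely.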
54
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
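The difference between a code point and its encodings can be made concrete by counting bytes. A small sketch using only the Java standard library; the sample word and the names EncodingDemo and byteLength are chosen for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the slides.
final class EncodingDemo {
    /** Number of bytes needed to store the text in the named encoding. */
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String text = "Universit\u00E0"; // ten characters, the last one is U+00E0
        System.out.println(text.codePointAt(text.length() - 1)); // the code point, as an integer
        System.out.println(byteLength(text, "UTF-8"));      // non-ASCII characters take more than one byte
        System.out.println(byteLength(text, "UTF-16BE"));   // two bytes per character here, no byte-order mark
        System.out.println(byteLength(text, "ISO-8859-1")); // one byte per character
    }
}
```

The same ten code points give three different byte sequences, which is why a parser has to be told (or has to detect) the encoding before it can read the text.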
56
The XML Encoding
Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  see also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
Encoding for
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
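As one possible Java rendering of the general properties and methods just listed, the sketch below appends a new team to the football player example and then walks the children of the root, reading each node's name and attributes. It assumes the standard org.w3c.dom interfaces shipped with the JDK; DomNodeDemo and addTeam are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the slides.
final class DomNodeDemo {
    /** Appends a non-current team to the root, then lists the root's children as name/attribute-count. */
    static String addTeam(String xml, String teamName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            Element team = doc.createElement("team");
            team.setAttribute("current", "no");
            team.appendChild(doc.createTextNode(teamName));
            root.appendChild(team); // lands after the existing children

            List<String> parts = new ArrayList<>();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                NamedNodeMap attrs = n.getAttributes(); // null for non-element nodes
                int count = (attrs == null) ? 0 : attrs.getLength();
                parts.add(n.getNodeName() + "/" + count);
            }
            return String.join(" ", parts);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addTeam(
            "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>", "Mantova"));
    }
}
```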
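A Simple Java DOM Example
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser; // assumption: the slide's DOMParser is the Apache Xerces class
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    // parse the file named on the command line into a DOM tree
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // scan the children of the root for the first recipe element
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    // wrap that recipe, if any, in a new collection document
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}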
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as the document starts / ends, elements start / end, character content is read, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well-supported as DOM is
  in terms of standardisation
  as well as of tools
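The event-based model can be sketched with a handler that simply records each event as it arrives. The example assumes the SAX parser available through JAXP in the JDK; SaxEventsDemo and events are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the slides.
final class SaxEventsDemo {
    /** Returns the sequence of SAX events generated while the document flows through the parser. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser may deliver one run of text in several chunks
                log.add("chars " + new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: once an event has been handled, the corresponding part of the document can be forgotten, which is where the memory saving comes from.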
76
Pre-defined XML Entities
Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote
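A parser replaces these five references with the characters they stand for, so the application never sees the escaped form. A small sketch, assuming the JDK's JAXP parser; EntityDemo and text are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the slides.
final class EntityDemo {
    /** Text content of the root element, after the parser has expanded entity references. */
    static String text(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text("<t>1 &lt; 2 &amp;&amp; 3 &gt; 2</t>"));
        System.out.println(text("<t>&quot;quoted&apos;</t>"));
    }
}
```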
31
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good for instance to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
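The last point can be observed with a DOM parser configured to coalesce CDATA sections into ordinary text: escaped text and a CDATA section then give the same result. The sketch assumes the JDK's JAXP parser, whose DocumentBuilderFactory can also be asked to keep CDATA sections as separate nodes; CdataDemo and firstChild are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the slides.
final class CdataDemo {
    /** Describes the first child of the root as "cdata:value" or "text:value". */
    static String firstChild(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing); // true: CDATA sections become plain text nodes
            Node child = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getFirstChild();
            String kind = child.getNodeType() == Node.CDATA_SECTION_NODE ? "cdata" : "text";
            return kind + ":" + child.getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChild("<c><![CDATA[a < b]]></c>", true));
        System.out.println(firstChild("<c>a &lt; b</c>", true));
    }
}
```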
32
Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
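Unlike comments, processing instructions are handed over to the application, with target and data kept apart. A sketch that looks for a top-level processing instruction by target, assuming the JDK's JAXP parser; ProcessingInstructionDemo and data are illustrative names, and style.css is a made-up file name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the slides.
final class ProcessingInstructionDemo {
    /** Data of the first top-level processing instruction with the given target, or null. */
    static String data(String xml, String target) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // processing instructions outside the root are children of the document node
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    if (pi.getTarget().equals(target)) {
                        return pi.getData();
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?><doc/>";
        System.out.println(data(xml, "xml-stylesheet"));
        System.out.println(data(xml, "xml")); // the XML declaration is not reported as a processing instruction
    }
}
```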
34
CDATA Sections
Including code chunks from any language with lt or can be tedious
we need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
CommentsEasy
lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether or not it is explicitly specified, it has a given value
literal
    the default value is the literal quoted string
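As a rough illustration of literal and #FIXED defaults, the sketch below parses a small document with Python's standard ElementTree module. It assumes the underlying (non-validating) parser fills in defaults declared in the internal DTD subset; the sport and note attributes are hypothetical additions to the football example, and #REQUIRED is not enforced by this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of the football example with three kinds of default.
DOC = """<?xml version="1.0"?>
<!DOCTYPE player [
  <!ELEMENT player (team+)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) "no"
                 sport   CDATA      #FIXED "football"
                 note    CDATA      #IMPLIED>
]>
<player>
  <team current="yes">Bologna</team>
  <team>Mantova</team>
</player>"""

teams = ET.fromstring(DOC).findall("team")
for team in teams:
    # The second team never wrote current or sport: both come from the DTD.
    # The #IMPLIED attribute note has no default, so it is simply absent.
    print(team.text, dict(team.attrib))
```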
47
Attribute Types
CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS
    reference(s) to the ID of some element in the document
48
Other DTD Declarations
ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it
49
Namespaces
50
What are Namespaces?
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
    prefix:local_part
Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
    a large firm could have a number of namespaces for different purposes
        <company xmlns:local="http://www.company.it/xml"
                 xmlns:euro="http://www.company.eu/xml"
                 xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
    but usually svg, rdf and other well-known prefixes are not changed
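To see that the URI, not the prefix, identifies a namespace, here is a minimal sketch with Python's standard ElementTree module. The office elements, their content and the it prefix used for the lookup are made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<company xmlns:local="http://www.company.it/xml"
                 xmlns:euro="http://www.company.eu/xml">
  <local:office>Cesena</local:office>
  <euro:office>Bruxelles</euro:office>
</company>"""

root = ET.fromstring(DOC)

# The parser replaces each prefix with its URI: {uri}local_part.
tags = [child.tag for child in root]
print(tags)

# Lookups go through the URI; the prefix used here ("it") is our own choice
# and does not need to match the one used in the document.
ns = {"it": "http://www.company.it/xml"}
local_office = root.find("it:office", ns)
print(local_office.text)  # Cesena
```

The two office elements share a local name but end up with different expanded names, which is exactly the disambiguation namespaces are meant to provide.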
53
Setting Default Namespaces
xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
        no need to make the svg prefix explicit
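A small sketch of the default namespace at work, again with Python's standard ElementTree module. The rect child and the numeric sizes are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
DOC = '<svg xmlns="%s" width="10" height="10"><rect width="5" height="5"/></svg>' % SVG_NS

root = ET.fromstring(DOC)
print(root.tag)     # {http://www.w3.org/2000/svg}svg
print(root[0].tag)  # {http://www.w3.org/2000/svg}rect
print(root.attrib)  # {'width': '10', 'height': '10'}
```

Note that the default namespace applies to element names only: the unprefixed width and height attributes stay in no namespace.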
54
Internationalisation
55
What does Text Mean?
"Text" can be encoded according to many different alphabets
    character set: a mapping between characters and integers (code points)
        ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
        UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared
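The difference between a code point and its encodings can be checked directly in Python; the character chosen here is just an example.

```python
ch = "\u00e8"  # LATIN SMALL LETTER E WITH GRAVE

code_point = ord(ch)
print(hex(code_point))         # 0xe8: one code point in the Unicode character set

# The same code point mapped onto bytes by three different encodings.
print(ch.encode("utf-8"))      # b'\xc3\xa8' (two bytes)
print(ch.encode("utf-16-be"))  # b'\x00\xe8' (two bytes, big-endian)
print(ch.encode("latin-1"))    # b'\xe8'     (one byte)
```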
56
The XML Encoding Declaration
Part of the XML Declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is a text declaration, but no longer an XML declaration
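The following sketch shows why the declared encoding matters, using Python's standard ElementTree module on raw bytes. The sample text is arbitrary, and the exact error raised for a mislabelled document is an assumption about this particular parser.

```python
import xml.etree.ElementTree as ET

text = "<name>Universit\u00e0</name>"
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + text.encode("iso-8859-1")
utf8_doc = b'<?xml version="1.0" encoding="utf-8"?>' + text.encode("utf-8")

# Different bytes on disk, same characters once the declaration is honoured.
same_bytes = latin1_doc == utf8_doc
same_text = ET.fromstring(latin1_doc).text == ET.fromstring(utf8_doc).text
print(same_bytes, same_text)  # False True

# Latin-1 bytes mislabelled as UTF-8: the parser cannot decode them.
mislabelled = b'<?xml version="1.0" encoding="utf-8"?>' + text.encode("iso-8859-1")
try:
    ET.fromstring(mislabelled)
    mislabelled_accepted = True
except ET.ParseError:
    mislabelled_accepted = False
print(mislabelled_accepted)  # False
```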
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element
Values are to be found in ISO 639
    standard two letters for each known language
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there either, such as for user-defined tags
        prefix x-
58
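A tool such as the spell-checker mentioned above has to work out the language of each element. A minimal sketch with Python's standard ElementTree module follows; the sample document and the helper name languages are hypothetical, and the code relies on the XML rule that an element without its own xml:lang takes the value of its closest ancestor.

```python
import xml.etree.ElementTree as ET

# The xml prefix is predefined and bound to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<doc xml:lang="en">
  <p>Hello</p>
  <p xml:lang="it">Ciao <em>mondo</em></p>
</doc>"""

def languages(element, inherited=None):
    """Yield (tag, language) pairs; an element without xml:lang inherits it."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

result = list(languages(ET.fromstring(DOC)))
print(result)
```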
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
    http://w3c.org/Style/CSS
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises it
66
Example: How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML Documents
Representing information in an XML document
    and presenting it somehow
is not enough for most non-trivial application scenarios
Most often, we need to manipulate / access / delete / modify
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
    standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
73
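The generic node properties and methods listed above can be tried out with Python's standard xml.dom.minidom module, used here only as one convenient DOM binding (the slides themselves refer to Java). The document content reuses the football example; the second team is added programmatically.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
team = doc.documentElement.firstChild

# Every node has a name, a value and a type.
print(team.nodeName, team.nodeType == team.ELEMENT_NODE)    # team True
print(team.attributes["current"].value)                     # yes
print(team.firstChild.nodeName, team.firstChild.nodeValue)  # #text Bologna

# appendChild adds the new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
doc.documentElement.appendChild(new_team)

serialised = doc.documentElement.toxml()
print(serialised)
```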
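A Simple Java DOM Example
// Assumption: DOMParser is org.apache.xerces.parsers.DOMParser (the slide does not show the imports).
// print(Node, PrintStream) is a helper defined elsewhere in the same class, not shown on the slide.
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
    try {
        // Parse the file named on the command line into a DOM tree
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Scan the children of the root until the first recipe element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        // Print that recipe, if any, wrapped in a new collection element
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out);
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}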
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
75
Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
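The event-based model can be sketched with Python's standard xml.sax module: the handler below only records the events in the order in which the parser reports them, and no tree is ever built. The class name EventLogger and the sample document are made up for illustration.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record the SAX events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        # Ignore whitespace-only chunks between elements.
        if content.strip():
            self.events.append("text " + content)

handler = EventLogger()
xml.sax.parseString(b"<player><name>Carlo</name></player>", handler)
print(handler.events)
```

Since the application only keeps what it chooses to store in the handler, memory use does not grow with the size of the document, which is the point made against DOM above.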
76
Comments
Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
    to give info to computational agents, use processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
Processing instructions have this very end
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
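A processing instruction reaches the application as a node of its own, with a target and a data string. A minimal sketch with Python's standard xml.dom.minidom module; the style sheet name style.css is a placeholder.

```python
from xml.dom import minidom

DOC = '<?xml-stylesheet type="text/css" href="style.css"?><player/>'
doc = minidom.parseString(DOC)

# The processing instruction comes before the root element in the tree.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)  # True
print(pi.target)  # xml-stylesheet
print(pi.data)    # type="text/css" href="style.css"
```

The data part only looks like a list of attributes: to the parser it is an opaque string, and interpreting it is left to the application named by the target.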
34
The XML Declaration
Looks like an XML processing instruction
    but it is not: it is just the XML declaration
It is optional
    but if present, it must be the very first thing in the document
        not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
Standalone (yes) means that the document has no external DTD
    optional, default no
35
Checking Well-Formedness
Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
Tools on the Web
    Just look around
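Any XML parser doubles as a well-formedness checker. The sketch below runs a few of the rules above through Python's standard ElementTree module; the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a (non-validating) parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "matching tags": "<a><b>text</b></a>",
    "overlapping elements": "<a><b>text</a></b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<a x=1/>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "unescaped ampersand": "<a>Tom & Jerry</a>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, outcome in results.items():
    print(label, outcome)  # only "matching tags" is True
```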
36
DTD
37
Flexibility or
XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
    some control over syntax should be enforced
        like: a football player should have at least one team
Document Type Definition (DTD)
    to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
    how to handle errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
Many things a DTD does not say
    we stick with what we can specify
39
DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
    we just look at the following example and comment on it
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
DTD Declaration
DTD is declared here as internal
    but could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE player SYSTEM "http://…">
XML Processing Need to pass information for a given application through the parser
comments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an element
it can appear everywhere out of a tag even before or after the root
34
The XML DeclarationLooks like an XML processing instruction
but it is not just the XML declarationIt is optional
but if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Main rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS with XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining the presentation style for the XML document tags
  tagname { property1: value1; … }
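When the XML document is generated by a program, the style sheet association can be added through DOM as well. A small sketch, assuming the JDK's JAXP implementation; the class name, the helper and the file name style.css are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

class StylesheetLinkDemo {
    // Builds an empty document whose prolog links the given CSS file,
    // and describes the first node of the result as "target data".
    static String describeFirstNode(String rootName, String cssHref) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement(rootName);
            doc.appendChild(root);
            ProcessingInstruction link = doc.createProcessingInstruction(
                "xml-stylesheet", "type=\"text/css\" href=\"" + cssHref + "\"");
            doc.insertBefore(link, root); // the instruction goes before the root element
            Node first = doc.getFirstChild();
            return first.getNodeName() + " " + first.getNodeValue();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFirstNode("recipes", "style.css"));
    }
}
```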
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
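The node kinds can be made visible by walking a parsed tree and printing the type of each node. A sketch under the assumption that the JDK's default JAXP parser is used; the class, its helpers and the sample document are hypothetical, and only a few node kinds are named.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsDemo {
    static final String DOC = "<a><!--note--><b>text</b><![CDATA[raw]]></a>";

    // Maps the numeric DOM node type to a readable name (subset only).
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    // Depth-first walk over child nodes; attributes are not children, so they are not visited.
    static void walk(Node n, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(kind(n)).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(DOC));
    }
}
```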
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
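In the Java binding, the attributes property and appendChild become getAttributes() and appendChild(). A short sketch, assuming the JDK's JAXP parser; the class name, the helper and the player document are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AppendChildDemo {
    // Appends a new, non-current team to the player and lists all teams afterwards.
    static String addTeam(String playerXml, String teamName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(playerXml)));
            Element player = doc.getDocumentElement();

            Element team = doc.createElement("team");
            team.setAttribute("current", "no");
            team.appendChild(doc.createTextNode(teamName));
            player.appendChild(team); // lands after the existing children

            StringBuilder names = new StringBuilder();
            NodeList teams = player.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element t = (Element) teams.item(i);
                if (i > 0) names.append(", ");
                // getAttributes() returns every attribute of the node as a map
                String current = t.getAttributes().getNamedItem("current").getNodeValue();
                names.append(t.getTextContent()).append('[').append(current).append(']');
            }
            return names.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<player><team current=\"yes\">Bologna</team></player>";
        System.out.println(addTeam(xml, "Mantova"));
    }
}
```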
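A Simple Java DOM
// DOMParser is the Xerces parser class (org.apache.xerces.parsers.DOMParser);
// Document and Node come from org.w3c.dom, PrintStream from java.io.
// print(Node, PrintStream) is a helper that is not shown on the slide.
public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    Node n = doc.getDocumentElement().getFirstChild();
    // skip siblings until the first recipe element is found
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}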
74
Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as they occur: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
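The event-based style can be sketched with the SAX classes bundled with the JDK. The handler below simply records the events it receives; the class name, the helper and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Parses the text and returns the sequence of events seen by the handler.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser may deliver one run of text in several chunks
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: whatever the handler does not store is gone once the event has passed.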
76
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must absolutely be the first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 when signalled by a byte-order mark)
standalone="yes" means that the document does not depend on an external DTD
  optional, default no
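The DOM Level 3 methods of the Java binding expose what the declaration said. A sketch assuming the JDK's JAXP parser; the class name, the helpers and the sample element are hypothetical, and the bytes are deliberately written in Latin-1 to show that the declared encoding drives the decoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDeclarationDemo {
    // Prepends a Latin-1 XML declaration, encodes the text as Latin-1 bytes and parses them.
    static Document parseLatin1(String body) {
        try {
            String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>" + body;
            byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    static String rootText(String body) {
        return parseLatin1(body).getDocumentElement().getTextContent();
    }

    static boolean standalone(String body) {
        return parseLatin1(body).getXmlStandalone();
    }

    public static void main(String[] args) {
        Document doc = parseLatin1("<city>Forl\u00EC</city>");
        System.out.println("version:    " + doc.getXmlVersion());
        System.out.println("encoding:   " + doc.getXmlEncoding());
        System.out.println("standalone: " + doc.getXmlStandalone());
        System.out.println("text:       " + doc.getDocumentElement().getTextContent());
    }
}
```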
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
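Any XML parser is also a well-formedness checker, since it must refuse a document that breaks the rules. A minimal sketch with the JDK's JAXP parser; the class name and the helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedDemo {
    // True when the parser accepts the text, false when it reports a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to the console
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException notWellFormed) {
            return false;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b></b></a>")); // matching tags
        System.out.println(isWellFormed("<a><b></a></b>")); // overlapping elements
        System.out.println(isWellFormed("<a x=1/>"));       // unquoted attribute value
        System.out.println(isWellFormed("<a/><b/>"));       // two root elements
        System.out.println(isWellFormed("<a>1 < 2</a>"));   // unescaped < in character data
    }
}
```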
36
DTD
37
Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
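Validation can be tried out with a validating JAXP parser, which reports validity problems to an error handler instead of aborting. The sketch below uses a football-player DTD held in a string; the class name, the helper and the test documents are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {
    // Internal DTD: a player has a name, a surname and at least one team.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Returns the validity errors found in the given document body (empty list = valid).
    static List<String> validationErrors(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validationErrors(valid));
        System.out.println(validationErrors(noTeam));
    }
}
```

Both documents are well-formed; only the second one is rejected, because the DTD does not permit a player without a team.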
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
DTD Declaration
In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
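The operators can be explored by checking small child sequences against a content model. This sketch relies on the JDK's validating JAXP parser; the class name, the helper, and the element names list, a, b and c are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ContentModelDemo {
    // True when <list> with the given children satisfies the given content model.
    static boolean accepts(String contentModel, String children) {
        String xml = "<!DOCTYPE list [\n"
            + "<!ELEMENT list " + contentModel + ">\n"
            + "<!ELEMENT a EMPTY><!ELEMENT b EMPTY><!ELEMENT c EMPTY>\n"
            + "]>\n<list>" + children + "</list>";
        boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        String model = "((a | b)*, c?)"; // any mix of a and b, then an optional c
        System.out.println(accepts(model, "<a/><b/><a/><c/>"));
        System.out.println(accepts(model, ""));
        System.out.println(accepts(model, "<c/><a/>")); // c must come last
        System.out.println(accepts("(a, b+)", "<a/>")); // at least one b is required
    }
}
```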
45
Attribute
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
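Defaults declared in the DTD are filled in by the parser, so a program sees them as if they had been written in the document. A sketch assuming the JDK's JAXP parser and an internal DTD; the class name, the helper and the attributes sport and nickname are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultsDemo {
    // The team element below specifies no attribute at all.
    static final String DOC =
        "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"
        + "                 sport CDATA #FIXED \"football\"\n"
        + "                 nickname CDATA #IMPLIED>\n"
        + "]>\n"
        + "<team>Bologna</team>";

    // Value of the attribute on the root element, or "(absent)" when there is none.
    static String attribute(String xml, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return root.hasAttribute(name) ? root.getAttribute(name) : "(absent)";
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println("current:  " + attribute(DOC, "current"));  // literal default
        System.out.println("sport:    " + attribute(DOC, "sport"));    // #FIXED value
        System.out.println("nickname: " + attribute(DOC, "nickname")); // #IMPLIED, not supplied
    }
}
```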
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it
49
Namespaces
50
What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
Then local, euro and world can be used everywhere as prefixes
  typically declared in the topmost element, but they could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but conventional prefixes such as svg and rdf are usually left unchanged
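That the URI matters and the prefix does not can be verified with a namespace-aware parser. A sketch assuming the JDK's JAXP implementation; the class name, the helpers, the element name office and the alternative prefix it are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class PrefixDemo {
    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // The name an application should compare: namespace URI plus local part.
    static String expandedName(String xml) {
        Element e = root(xml);
        return "{" + e.getNamespaceURI() + "}" + e.getLocalName();
    }

    static String prefix(String xml) {
        return root(xml).getPrefix();
    }

    public static void main(String[] args) {
        String one = "<local:office xmlns:local=\"http://www.company.it/xml\"/>";
        String two = "<it:office xmlns:it=\"http://www.company.it/xml\"/>";
        // Different prefixes, same expanded name
        System.out.println(prefix(one) + " -> " + expandedName(one));
        System.out.println(prefix(two) + " -> " + expandedName(two));
    }
}
```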
53
DTD
37
Flexibility or
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one team
Document Type Definition (DTD)to define which XML documents are valid
Validity is not mandatory as well-formednesshow to handle errors is optional
38
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify
39
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
41
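The validity check described on the Validation and Simple DTD slides can be sketched with the JDK's built-in DOM parser. This is only a minimal sketch: the class name, the constants and the isValid helper are made up for illustration, and the document type name is assumed to be player so that it matches the root element, as validity requires. The second document is a hypothetical counter-example with no team element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {

    // Internal DTD subset from the slide (document type name assumed to be "player").
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    static final String VALID = DTD
        + "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team>"
        + "<team current=\"no\">Mantova</team></player>";

    // Hypothetical counter-example: the mandatory team element is missing.
    static final String NO_TEAM = DTD
        + "<player><name>Carlo</name><surname>Nervo</surname></player>";

    /** Returns true only if the document is well-formed and satisfies its DTD. */
    static boolean isValid(String xml) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask the parser to check the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                // Validity errors are reported here; parsing goes on afterwards.
                @Override public void error(SAXParseException e) { valid[0] = false; }
                // Well-formedness errors are fatal.
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return false; // not even well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println("document from the slide is valid: " + isValid(VALID));
        System.out.println("document without a team is valid: " + isValid(NO_TEAM));
    }
}
```

Validity errors reach the error callback while well-formedness errors reach fatalError, which mirrors the point that validity is optional and well-formedness is not.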
DTD Declaration
DTD is declared here as internal
  <!DOCTYPE player [ … ]>
but could be declared separately
  <!DOCTYPE player SYSTEM "football_player.dtd">
even referring to an external shared resource
  <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
<!ELEMENT player (name, surname, team+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT team (#PCDATA)>
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
45
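To see the content-model operators at work, the sketch below validates a few tiny documents against one declaration that combines sequence, choice, the * and ? suffixes, EMPTY and ANY. The order vocabulary (order, customer, item, gift, note) is invented for this illustration and does not come from the slides; the class and method names are hypothetical as well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ContentModelSketch {

    // Hypothetical DTD: one customer, then any mix of items and gifts, then an optional note.
    static final String DTD =
          "<!DOCTYPE order [\n"
        + "  <!ELEMENT order (customer, (item | gift)*, note?)>\n"
        + "  <!ELEMENT customer (#PCDATA)>\n"
        + "  <!ELEMENT item EMPTY>\n"
        + "  <!ELEMENT gift EMPTY>\n"
        + "  <!ELEMENT note ANY>\n"
        + "]>\n";

    /** Returns true if the given root element is valid with respect to the DTD above. */
    static boolean accepts(String rootElement) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Called once for every validity error found by the parser.
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            builder.parse(new InputSource(new StringReader(DTD + rootElement)));
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // Sequence, choice and * together: any mix of item and gift after the customer.
        System.out.println(accepts("<order><customer>Ada</customer><item/><gift/><item/></order>"));
        // * and ? both allow zero occurrences.
        System.out.println(accepts("<order><customer>Ada</customer></order>"));
        // Sequence is ordered: customer must come first.
        System.out.println(accepts("<order><item/><customer>Ada</customer></order>"));
        // ? allows at most one note.
        System.out.println(accepts("<order><customer>Ada</customer><note>a</note><note>b</note></order>"));
    }
}
```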
Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<!ATTLIST team current (yes | no) #REQUIRED>
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether or not it is explicitly specified, it has a given value
"literal"
  the default value is the literal quoted string
47
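The effect of attribute defaults can be observed through DOM: the parser fills in declared defaults for attributes that the document omits. The sketch below assumes a variant of the player DTD in which current has the literal default "no", plus two invented attributes (sport, which is #FIXED, and since, which is #IMPLIED); none of these variations appear on the slides, and the class and helper names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultsSketch {

    // Hypothetical variant of the player DTD, showing three kinds of attribute default.
    static final String XML =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team\n"
        + "    current (yes | no) \"no\"\n"
        + "    sport   CDATA      #FIXED \"football\"\n"
        + "    since   CDATA      #IMPLIED>\n"
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Parses the document above and returns its index-th team element. */
    static Element team(int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getElementsByTagName("team").item(index);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element second = team(1); // <team>Mantova</team>, written with no attributes at all
        // Literal default: supplied by the parser, and flagged as not specified in the document.
        System.out.println("current = " + second.getAttribute("current"));
        System.out.println("specified in the document? "
            + second.getAttributeNode("current").getSpecified());
        // #FIXED: always present, always with the declared value.
        System.out.println("sport = " + second.getAttribute("sport"));
        // #IMPLIED: optional, and simply absent when omitted.
        System.out.println("has since? " + second.hasAttribute("since"));
    }
}
```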
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
48
Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
49
Namespaces
50
What are Namespaces
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:Description, xlink:type, xsl:template
Used for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml"
           xmlns:euro="http://www.company.eu/xml"
           xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but conventional prefixes such as svg and rdf are usually kept unchanged
53
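The SVG / MathML set clash mentioned earlier can be resolved programmatically once the parser is namespace-aware: elements are then matched by namespace URI plus local name, whatever the prefix. A minimal sketch follows; the doc wrapper element, the m prefix and the class and method names are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespacePrefixSketch {

    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // Hypothetical document mixing the two vocabularies; both have a "set" element.
    static final String XML =
          "<doc xmlns:svg=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
        + "<svg:set/><m:set/><m:set/>"
        + "</doc>";

    /** Counts elements with the given local name in the given namespace ("*" matches any). */
    static int count(String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK factory
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SVG set elements:    " + count(SVG, "set"));
        System.out.println("MathML set elements: " + count(MATHML, "set"));
        System.out.println("set in any namespace: " + count("*", "set"));
    }
}
```

Because the lookup uses the URI, renaming the m prefix to anything else in the document would not change the counts.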
Setting Default
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
54
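A short sketch of how a default namespace shows up in DOM, using a namespace-aware parse of a small SVG fragment. The rect child, the attribute values and the class name are invented for the example. The last check relies on a rule the slide does not state: a default namespace applies to element names only, not to unprefixed attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultNamespaceSketch {

    static final String SVG = "http://www.w3.org/2000/svg";

    // Hypothetical fragment: no prefix anywhere, only a default namespace declaration.
    static final String XML =
        "<svg xmlns=\"" + SVG + "\" width=\"10\" height=\"10\"><rect/></svg>";

    /** Parses the fragment with namespace support and returns the svg root element. */
    static Element root() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element svg = root();
        // The root and its unprefixed descendants are all in the default namespace.
        System.out.println("svg  namespace: " + svg.getNamespaceURI());
        System.out.println("rect namespace: " + svg.getFirstChild().getNamespaceURI());
        System.out.println("svg  prefix:    " + svg.getPrefix());
        // Unprefixed attributes are in no namespace.
        System.out.println("width namespace: " + svg.getAttributeNode("width").getNamespaceURI());
    }
}
```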
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
Character set
  mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
56
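The difference between a code point and its encoded bytes can be made concrete with a few lines of Java. The sketch takes one character, è (code point U+00E8), and prints the bytes it becomes under three encodings. The class and helper names are hypothetical; UTF_16BE is used rather than UTF_16 so that no byte-order mark is added to the output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointsAndBytesSketch {

    /** Encodes the text with the given charset and renders the bytes as hex pairs. */
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // the single character è
        // One character, one code point...
        System.out.println("code point: U+00" + Integer.toHexString(eGrave.codePointAt(0)).toUpperCase());
        // ...but different byte sequences, depending on the encoding.
        System.out.println("UTF-8:      " + hexBytes(eGrave, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + hexBytes(eGrave, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1: " + hexBytes(eGrave, StandardCharsets.ISO_8859_1));
    }
}
```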
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
57
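Why the declared encoding matters can be shown by handing the parser raw bytes instead of a string. In the sketch below the same Latin-1 bytes are parsed twice, once with a truthful declaration and once wrongly declared as UTF-8. The sample text, the class name and the helper are assumptions for illustration; the rejection in the second case is what the JDK's built-in parser is expected to do with a malformed UTF-8 byte sequence.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationSketch {

    /**
     * Builds a small document whose bytes are always Latin-1, but whose XML
     * declaration names the given encoding. Returns the text read by the parser,
     * or null if the parser rejects the byte stream.
     */
    static String textWhenDeclaredAs(String declaredEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                   + "<name>Universit\u00e0</name>";
        // In ISO-8859-1 the final à is the single byte 0xE0.
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(latin1Bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null; // bytes and declaration disagree
        }
    }

    public static void main(String[] args) {
        // Declaration matches the bytes: the text is read back correctly.
        System.out.println(textWhenDeclaredAs("ISO-8859-1"));
        // Declaration lies: 0xE0 followed by '<' is not a legal UTF-8 sequence.
        System.out.println(textWhenDeclaredAs("UTF-8"));
    }
}
```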
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
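A tool such as the spell-checker mentioned above has to find the language that applies to a given element. One possible way, sketched here, is to walk from the element up through its ancestors until an xml:lang attribute is found, on the assumption that a language set on an element also covers its content unless a descendant overrides it. The sample document and the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangSketch {

    // Hypothetical two-language document.
    static final String XML =
          "<doc xml:lang=\"en\">"
        + "<title>Hello</title>"
        + "<quote xml:lang=\"it\"><em>Ciao</em></quote>"
        + "</doc>";

    /** Returns the nearest xml:lang value on the node or its ancestors, or null if none is set. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element) {
                Element e = (Element) n;
                // The xml prefix is always bound to the XML namespace URI.
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return null;
    }

    /** Convenience helper: language in effect for the first element with the given tag name. */
    static String languageOfFirst(String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return languageOf(doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("title: " + languageOfFirst("title")); // inherited from doc
        System.out.println("em:    " + languageOfFirst("em"));    // inherited from quote
    }
}
```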
Encoding for
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML document as a tree data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes they should / could have
72
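The node kinds listed above can be seen by walking a small parsed document and recording the type and name of each node. This is a sketch under a few assumptions: the sample document, the class name and the kind labels are made up, and the JDK parser is used with its default settings, under which comments and CDATA sections are kept as separate nodes. Attributes are reached through getAttributes rather than as children, so the walk does not visit them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsSketch {

    // Hypothetical document containing several kinds of node.
    static final String XML =
          "<?xml version=\"1.0\"?><!-- sample -->"
        + "<recipe id=\"r1\"><title>Pasta</title><?render fast?><![CDATA[a < b]]></recipe>";

    /** Maps the numeric DOM node type onto a readable label. */
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "document";
            case Node.ELEMENT_NODE:                return "element";
            case Node.TEXT_NODE:                   return "text";
            case Node.CDATA_SECTION_NODE:          return "CDATA section";
            case Node.COMMENT_NODE:                return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default:                               return "other";
        }
    }

    /** Depth-first walk over child nodes, in document order. */
    static void walk(Node n, List<String> out) {
        out.add(kind(n) + " " + n.getNodeName());
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    /** Parses the sample document and lists "kind name" for every node reached by the walk. */
    static List<String> kinds() {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            List<String> out = new ArrayList<String>();
            walk(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : kinds()) {
            System.out.println(line);
        }
    }
}
```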
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
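In the Java binding of DOM, the general properties and methods named above appear as getNodeName, getNodeValue, getNodeType, getAttributes and appendChild on org.w3c.dom.Node. The sketch below exercises them on a one-team player document and then appends a second team. The sample data reuses the football example; the class and helper names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodePropertiesSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses a one-team player document and appends a second team to it. */
    static Document withExtraTeam() {
        Document doc = parse("<player><team current=\"yes\">Bologna</team></player>");
        Element added = doc.createElement("team");
        added.setAttribute("current", "no");
        added.appendChild(doc.createTextNode("Mantova"));
        // appendChild puts the new node after the existing children of player.
        doc.getDocumentElement().appendChild(added);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = withExtraTeam();
        Node team = doc.getDocumentElement().getFirstChild();

        // Name, value and type of an element node (elements have no value of their own).
        System.out.println(team.getNodeName() + " / " + team.getNodeValue()
            + " / is element: " + (team.getNodeType() == Node.ELEMENT_NODE));

        // The text inside the element is a separate node with its own name and value.
        Node text = team.getFirstChild();
        System.out.println(text.getNodeName() + " / " + text.getNodeValue());

        // attributes: a map of attribute nodes, each with a name and a value.
        NamedNodeMap attrs = team.getAttributes();
        System.out.println(attrs.item(0).getNodeName() + " = " + attrs.item(0).getNodeValue());

        // The appended team is now the last child of player.
        System.out.println(doc.getDocumentElement().getLastChild().getTextContent());
    }
}
```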
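A Simple Java DOM
// Imports are not shown on the slide; DOMParser is assumed here to be
// Apache Xerces' org.apache.xerces.parsers.DOMParser.
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    // Parse the file named on the command line into an in-memory DOM tree
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Scan the children of the root element for the first one named "recipe"
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);  // print is a helper that is not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}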
74
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
76
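The event-based model can be sketched with the SAX support bundled in the JDK. The handler below reacts to element-start, character and element-end events to collect the team names of a player document, without ever building a tree. The class name, the teamNames helper and the sample constant are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTeamsSketch {

    static final String PLAYER =
          "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";

    /** Streams through the document and returns the text of every team element. */
    static List<String> teamNames(String xml) {
        final List<String> teams = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a team element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("team")) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one run of text in several chunks, so accumulate.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("team")) {
                    teams.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return teams;
    }

    public static void main(String[] args) {
        System.out.println(teamNames(PLAYER));
    }
}
```

Only the text of the team currently being read is held in memory, which is the main attraction of SAX over DOM for large documents.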
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declaration
then the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellipSGML-based
syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
typical syntax-based approachmaybe limited but easy to implement
Maybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
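The association is an ordinary processing instruction in the document prolog, so any program can read it, not only a browser. A minimal sketch, assuming the JAXP DOM parser bundled with the JDK; the class name and the file name player.css are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class StylesheetPiDemo {
    // Illustrative document with a style sheet association in its prolog.
    static final String DOC =
        "<?xml version=\"1.0\"?>"
        + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
        + "<player><name>Carlo</name></player>";

    // Returns the data of the first xml-stylesheet instruction in the prolog, or null.
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                        && "xml-stylesheet".equals(n.getNodeName())) {
                    return n.getNodeValue(); // everything between the target and "?>"
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(DOC)); // type="text/css" href="player.css"
    }
}
```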
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  W3C standard, as usual
“The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them may have
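A minimal sketch of the tree view, assuming the JAXP DOM parser bundled with the JDK: a depth-first walk that records the name of every node it meets. The class name and the sample document are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsDemo {
    // Illustrative document mixing elements, a comment, text and a CDATA section.
    static final String DOC = "<a><!--note--><b>text</b><![CDATA[raw]]></a>";

    // Depth-first visit: record this node, then each of its children in order.
    static void walk(Node n, List<String> out) {
        out.add(n.getNodeName());
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    // Parses the text and returns the node names in document order.
    static List<String> kinds(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // [#document, a, #comment, b, #text, #cdata-section]
        System.out.println(kinds(DOC));
    }
}
```

Attributes do not show up in this walk: in DOM they are nodes, but not children of their element.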
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
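In the Java binding the attributes property becomes getAttributes() and appendChild keeps its name. A minimal sketch, assuming the JAXP DOM implementation bundled with the JDK; the class name is illustrative and the element names are borrowed from the football player example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

class NodeApiDemo {
    // Builds a small tree in memory and reports the last child and its attribute.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element player = doc.createElement("player");
            doc.appendChild(player);

            Element name = doc.createElement("name");
            name.setTextContent("Carlo");
            player.appendChild(name);

            Element team = doc.createElement("team");
            team.setAttribute("current", "yes");
            team.setTextContent("Bologna");
            player.appendChild(team); // lands after <name>, as the last child

            NamedNodeMap attrs = team.getAttributes(); // all the attributes of the node
            return player.getLastChild().getNodeName() + " "
                    + attrs.item(0).getNodeName() + "=" + attrs.item(0).getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // team current=yes
    }
}
```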
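A Simple Java DOM Example
  // needs java.io.PrintStream, org.w3c.dom.Document, org.w3c.dom.Node
  // and the Xerces parser class org.apache.xerces.parsers.DOMParser
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                     // loads the whole document in memory
      Document doc = p.getDocument();
      // skip the children of the root until the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                      // print is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }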
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as they occur: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well-supported as DOM
  in terms of standardisation
  as well as of tools
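The event-based model can be sketched with the SAX parser bundled with the JDK: the handler below only logs the events it receives, and no tree is ever built. The class name and the log format are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Parses the text and returns the sequence of events seen by the handler.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName,
                                               Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver long text in several chunks; short text comes in one.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

The handler keeps only what it decides to keep (here, a log), so memory use does not depend on the size of the document.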
76
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
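A sketch of how such a document can be checked against its internal DTD, assuming the validating JAXP parser bundled with the JDK. The class and method names are hypothetical; the DOCTYPE name is player so that it matches the root element, as a validating parser requires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {
    // Internal DTD subset of the football player example.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // True when the body is well-formed and valid with respect to the DTD above.
    static boolean isValid(String body) {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setValidating(true); // turn on DTD validation
        final boolean[] valid = {true};
        try {
            DocumentBuilder b = f.newDocumentBuilder();
            b.setErrorHandler(new DefaultHandler() {
                // Validity violations are reported as recoverable errors.
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            b.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            return false; // not even well-formed
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>"));          // true
        System.out.println(isValid("<player><name>Carlo</name>"
                + "<team current=\"yes\">Bologna</team></player>"));          // false: no surname
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team>Bologna</team></player>"));                          // false: no current
    }
}
```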
41
DTD Declaration
In the example, the DTD is declared as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
42
DTD Declarations
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs
43
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
44
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
45
Attribute Declarations
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
46
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
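A sketch of literal and #FIXED defaults in action, assuming the JAXP DOM parser bundled with the JDK, which applies defaults declared in the internal DTD subset. The class name, the league attribute and its value are hypothetical additions to the team example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultsDemo {
    // current has a literal default ("no"); league is fixed to a single value.
    static final String DTD =
        "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"
        + "                 league CDATA #FIXED \"Serie A\">\n"
        + "]>\n";

    // Parses one team element and returns "current/league" as seen by the application.
    static String attributesOf(String teamMarkup) {
        try {
            Element team = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamMarkup)))
                    .getDocumentElement();
            return team.getAttribute("current") + "/" + team.getAttribute("league");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOf("<team>Mantova</team>"));                 // no/Serie A
        System.out.println(attributesOf("<team current=\"yes\">Bologna</team>")); // yes/Serie A
    }
}
```

Neither attribute is written in the first element, yet the application sees both values: they are supplied by the parser from the DTD.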
47
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document
48
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
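A sketch of what an entity declaration buys, assuming the JAXP DOM parser bundled with the JDK. To stay self-contained it declares an internal entity with a literal replacement text, rather than an external SYSTEM entity as on the slide; the class name and the texts are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityDemo {
    // The footer entity is declared once and referenced as &footer; in the content.
    static final String DOC =
        "<!DOCTYPE page [\n"
        + "  <!ELEMENT page (#PCDATA)>\n"
        + "  <!ENTITY footer \"End of page\">\n"
        + "]>\n"
        + "<page>Body text. &footer;</page>";

    // Returns the text of the root element after entity references have been replaced.
    static String expandedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedText(DOC)); // Body text. End of page
    }
}
```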
49
Namespaces
50
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example
  set is an element in both SVG and MathML applications
    what if I have to use them together?
  namespaces can be used to disambiguate names
51
Syntax for Namespaces
Qualified names
  prefix:local_part
  also called QNames, or raw names
Examples of qualified names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
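A sketch of qualified names seen from a program, assuming the namespace-aware JAXP DOM parser bundled with the JDK. It uses the two set elements mentioned earlier; the class name, the prefixes svg and m, and the doc wrapper element are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class QNameDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";
    // One SVG set and two MathML set elements in the same document.
    static final String DOC =
        "<doc xmlns:svg=\"" + SVG_NS + "\" xmlns:m=\"" + MATHML_NS + "\">"
        + "<svg:set/><m:set/><m:set/></doc>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the elements with the given local name in the given namespace.
    static int count(String xml, String namespace, String localName) {
        return parse(xml).getElementsByTagNameNS(namespace, localName).getLength();
    }

    public static void main(String[] args) {
        Element first = (Element) parse(DOC).getDocumentElement().getFirstChild();
        System.out.println(first.getNodeName());          // svg:set (the qualified name)
        System.out.println(first.getPrefix());            // svg
        System.out.println(first.getLocalName());         // set
        System.out.println(count(DOC, SVG_NS, "set"));    // 1
        System.out.println(count(DOC, MATHML_NS, "set")); // 2
    }
}
```

The lookup is done by namespace URI and local part, so the result does not change if the document picks different prefixes.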
52
Associating Prefixes to URIs
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world as prefixes everywhere inside the company element
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but well-known prefixes such as svg and rdf are usually kept by convention
53
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone else
like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but usually svg, rdf and other well-known prefixes are kept unchanged
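Since it is the URI and not the prefix that identifies a namespace, two elements with different prefixes can still carry the same name. A small sketch, assuming the JAXP DOM parser from the JDK; the office element and the it prefix are hypothetical, and the company URIs are those of the example above.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespacePrefixSketch {
    // Parses the given XML text with namespace processing switched on.
    static Element parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    // Two elements have the same name when namespace URI and local part match, whatever the prefix.
    static boolean sameName(Element a, Element b) {
        return Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
            && a.getLocalName().equals(b.getLocalName());
    }

    public static void main(String[] args) throws Exception {
        Element a = parse("<local:office xmlns:local=\"http://www.company.it/xml\"/>");
        Element b = parse("<it:office xmlns:it=\"http://www.company.it/xml\"/>");
        Element c = parse("<euro:office xmlns:euro=\"http://www.company.eu/xml\"/>");
        System.out.println(a.getPrefix() + ":" + a.getLocalName() + " in " + a.getNamespaceURI());
        System.out.println("a and b same name? " + sameName(a, b));
        System.out.println("a and c same name? " + sameName(a, c));
    }
}
```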
53
Setting Default
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
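A short sketch of what a default namespace looks like from the program side, again assuming the JDK's JAXP DOM parser. The rect child and the width and height values are only illustrative. Note that the default namespace applies to elements, not to unprefixed attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultNamespaceSketch {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Parses the given XML text with namespace processing switched on.
    static Element parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element svg = parse("<svg xmlns=\"" + SVG_NS + "\" width=\"100\" height=\"50\"><rect/></svg>");
        Element rect = (Element) svg.getFirstChild();
        // The unprefixed child inherits the default namespace and has no prefix at all.
        System.out.println("rect namespace: " + rect.getNamespaceURI());
        System.out.println("rect prefix:    " + rect.getPrefix());
        // Unprefixed attributes stay outside any namespace.
        System.out.println("width namespace: " + svg.getAttributeNode("width").getNamespaceURI());
    }
}
```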
54
Internationalisation
55
What does Text
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
56
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
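Why the declaration matters can be shown with a few bytes. This is a sketch assuming the JDK's JAXP DOM parser; the name element and its content are illustrative. The same Latin-1 bytes are parsed once with and once without the encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclSketch {
    // Parses raw bytes, letting the parser work out the encoding by itself.
    static Document parse(byte[] bytes) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(bytes));
    }

    // The same element as ISO-8859-1 bytes, with or without the XML declaration.
    static byte[] latin1(boolean declared) {
        String declaration = declared ? "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" : "";
        return (declaration + "<name>Universit\u00E0 di Bologna</name>")
            .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(latin1(true));
        System.out.println("declared encoding: " + doc.getXmlEncoding());
        System.out.println("content: " + doc.getDocumentElement().getTextContent());
        try {
            // With no declaration (and no byte order mark) the parser assumes UTF-8,
            // and the single Latin-1 byte for the accented letter is not valid UTF-8.
            parse(latin1(false));
        } catch (Exception e) {
            System.out.println("undeclared Latin-1 rejected: " + e.getClass().getSimpleName());
        }
    }
}
```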
57
Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
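A tool such as the spell-checker mentioned above would need the language in force at a given node. Since xml:lang applies to the content of the element where it is specified unless overridden, one possible approach is to look for the nearest declaration among the ancestors. The sketch below assumes the JDK's JAXP DOM parser; the class name, the helper languageOf and the sample document are illustrative.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangSketch {
    // Returns the nearest xml:lang value on the node or its ancestors, or null if none is declared.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return null;
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so that the xml prefix is resolved to the XML namespace
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>");
        Node first = doc.getDocumentElement().getFirstChild();
        System.out.println(first.getTextContent() + " -> " + languageOf(first));
        Node second = first.getNextSibling();
        System.out.println(second.getTextContent() + " -> " + languageOf(second));
    }
}
```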
Encoding for
Dealing with encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object
http://www.w3.org/DOM/
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
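The node kinds listed above can be seen by walking a small tree. This is a sketch assuming the JDK's JAXP DOM parser; the class and helper names are illustrative, and only some node kinds are labelled explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeWalkSketch {
    // Maps the numeric DOM node type onto the names used on the slide.
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.DOCUMENT_TYPE_NODE: return "doc type";
            case Node.ELEMENT_NODE: return "element";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            case Node.COMMENT_NODE: return "comment";
            case Node.TEXT_NODE: return "text";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            default: return "other";
        }
    }

    // Depth-first walk over child nodes; attributes are not children, so they do not show up here.
    static void walk(Node n, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) line.append("  ");
        out.add(line.append(kind(n)).append(' ').append(n.getNodeName()).toString());
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static List<String> outline(String xml) throws Exception {
        Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        walk(doc, 0, out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><!-- roster --><player><name>Carlo</name>"
                   + "<team current=\"yes\">Bologna</team></player>";
        for (String line : outline(xml)) System.out.println(line);
    }
}
```

The document node itself is the root of the DOM tree; the root element (player here) is one of its children, next to the comment.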
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
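In the Java binding (org.w3c.dom, part of the JDK) the general properties become getter methods such as getNodeName, getNodeValue, getNodeType and getAttributes. A minimal sketch, with illustrative class and helper names, that builds a tree from scratch with appendChild:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class AppendChildSketch {
    // Creates an empty document holding a single player root element.
    static Element newPlayer() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element player = doc.createElement("player");
        doc.appendChild(player);
        return player;
    }

    // Appends a team element, after any existing children, and returns it.
    static Element addTeam(Element player, String name, String current) {
        Document doc = player.getOwnerDocument(); // nodes must be created by their owner document
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        player.appendChild(team);
        return team;
    }

    public static void main(String[] args) throws Exception {
        Element player = newPlayer();
        addTeam(player, "Bologna", "yes");
        Element last = addTeam(player, "Mantova", "no");

        // Name, value and type of a node: an element has no value, its text child has one.
        Node text = last.getFirstChild();
        System.out.println(last.getNodeName() + " / " + last.getNodeValue() + " / " + last.getNodeType());
        System.out.println(text.getNodeName() + " / " + text.getNodeValue() + " / " + text.getNodeType());

        // The attributes property, as a name-to-node map.
        NamedNodeMap attributes = last.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node a = attributes.item(i);
            System.out.println(a.getNodeName() + " = " + a.getNodeValue());
        }
    }
}
```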
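A Simple Java DOM
  // Imports assumed here: the DOMParser class is taken to be the Apache Xerces one
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                        // the whole document is loaded in memory
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // move along the siblings until the first recipe element, or the end of the list
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                         // print(...) is a helper not shown on this slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }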
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as it goes: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
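The event-based model can be sketched with the SAX API included in the JDK (org.xml.sax and javax.xml.parsers). The handler below only records the events it receives; the class name and the sample document are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several chunks; short runs usually come in one.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("chars " + text);
    }

    // Streams the document through the parser; no tree is ever built.
    static List<String> eventsFor(String xml) throws Exception {
        SaxEventsSketch handler = new SaxEventsSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : eventsFor("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

The application keeps only what it chooses to store in the callbacks, which is where the memory advantage over DOM comes from.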
76
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute value
NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFS
48
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
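The association between document and style sheet is an ordinary processing instruction, so any parser can see it, not only a browser. A minimal sketch, assuming JAXP DOM; the class name StylesheetPiDemo, the method stylesheetData and the file name style.css are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class StylesheetPiDemo {
    // Hypothetical sample: style.css is an assumed file name.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
      + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
      + "<doc/>";

    /** Returns the data of the first xml-stylesheet instruction in the prolog, or null. */
    static String stylesheetData(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // Processing instructions before the root element are children of the Document node.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof ProcessingInstruction) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    return pi.getData();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(SAMPLE));
    }
}
```

The type and href parts are not real attributes: DOM hands them over as one uninterpreted string, and it is up to the application to pick them apart.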
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it
DOM & SAX
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Most often, we need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
Document Object Model
http://www.w3.org/DOM/
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope
DOM & Levels
DOM views an XML document as a tree data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
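The general node properties and methods listed above map directly onto the org.w3c.dom interfaces of the JDK. A minimal sketch; the class name DomNodeDemo, the helper addSurname and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomNodeDemo {
    static final String SAMPLE = "<player id=\"p1\"><name>Carlo</name></player>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends <surname>text</surname> after the existing children of the root element. */
    static Element addSurname(Document doc, String text) {
        Element surname = doc.createElement("surname");
        surname.appendChild(doc.createTextNode(text));
        doc.getDocumentElement().appendChild(surname);
        return surname;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element root = doc.getDocumentElement();
        // Every node has a name, a value and a type; the value of an element is null.
        System.out.println(root.getNodeName() + " / " + root.getNodeValue()
                + " / isElement=" + (root.getNodeType() == Node.ELEMENT_NODE));
        // "attributes" returns all the attributes of the node.
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(attrs.item(i).getNodeName() + "=" + attrs.item(i).getNodeValue());
        }
        // appendChild puts the new node after the other children.
        addSurname(doc, "Nervo");
        System.out.println("last child: " + root.getLastChild().getNodeName());
    }
}
```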
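A Simple Java DOM Example
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;  // assumed: the Xerces DOMParser class
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  // print(Node, PrintStream) is a helper that is not shown on this slide
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }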
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as they occur: document start / end, element start / end, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
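The event-based model can be seen by logging the callbacks a SAX parser issues while the document flows through it. A minimal sketch, assuming the SAX parser bundled with the JDK; the class name SaxEventsDemo and the log format are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventsDemo {
    /** Parses the document and returns the events in the order they were fired. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<player><name>Carlo</name></player>").forEach(System.out::println);
    }
}
```

No tree is ever built: once a callback returns, the parser keeps nothing about that part of the document, which is where the memory saving comes from.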
Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
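What #REQUIRED and the enumeration mean in practice can be checked by validating variants of the player document against its DTD. A minimal sketch, assuming the validating JAXP parser bundled with the JDK; the class name DtdValidationDemo and its helper methods are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdValidationDemo {
    static final String DTD =
        "<!DOCTYPE player [\n"
      + "  <!ELEMENT player (name, surname, team+)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT surname (#PCDATA)>\n"
      + "  <!ELEMENT team (#PCDATA)>\n"
      + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
      + "]>\n";

    /** Builds a player document whose single team element uses the given start tag. */
    static String player(String teamStartTag) {
        return "<?xml version=\"1.0\" standalone=\"yes\"?>\n" + DTD
             + "<player><name>Carlo</name><surname>Nervo</surname>"
             + teamStartTag + "Bologna</team></player>";
    }

    /** Validity errors reported while parsing; an empty list means the document is valid. */
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setValidating(true); // DTD validation is off by default
            DocumentBuilder b = f.newDocumentBuilder();
            b.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            b.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validationErrors(player("<team current=\"yes\">")));   // valid
        System.out.println(validationErrors(player("<team>")));                   // attribute missing
        System.out.println(validationErrors(player("<team current=\"maybe\">"))); // value not in the enumeration
    }
}
```

Validity errors are recoverable, so they reach the error callback instead of aborting the parse; only well-formedness problems are fatal.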
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
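The parser itself supplies literal and #FIXED defaults, so an application sees them even when the document omits the attribute. A minimal sketch, assuming JAXP DOM and an internal DTD subset; the class name AttributeDefaultsDemo and the attributes sport and nickname are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaultsDemo {
    // Hypothetical sample: sport and nickname are invented attributes.
    static final String SAMPLE =
        "<!DOCTYPE player [\n"
      + "  <!ELEMENT player (team+)>\n"
      + "  <!ELEMENT team (#PCDATA)>\n"
      + "  <!ATTLIST team current  (yes | no) \"no\"\n"
      + "                 sport    CDATA #FIXED \"football\"\n"
      + "                 nickname CDATA #IMPLIED>\n"
      + "]>\n"
      + "<player><team>Mantova</team></player>";

    /** Parses SAMPLE and returns its team element, which has no attribute in the source text. */
    static Element firstTeam() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            return (Element) doc.getElementsByTagName("team").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element team = firstTeam();
        System.out.println("current  = " + team.getAttribute("current"));   // literal default
        System.out.println("sport    = " + team.getAttribute("sport"));     // #FIXED value
        System.out.println("nickname present: " + team.hasAttribute("nickname")); // #IMPLIED, absent
        // getSpecified() tells defaulted attributes apart from written ones.
        System.out.println("current specified: " + team.getAttributeNode("current").getSpecified());
    }
}
```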
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  looser than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of an element in the document
Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
Namespaces
What are Namespaces for
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
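The set example can be replayed in code: two elements with the same local part stay distinct because their prefixes map to different namespace URIs. A minimal sketch, assuming a namespace-aware JAXP parser; the class name QualifiedNameDemo, the prefix mml and the wrapper element doc are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class QualifiedNameDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

    // Hypothetical sample mixing the two vocabularies in one document.
    static final String SAMPLE =
        "<doc xmlns:svg=\"" + SVG_NS + "\" xmlns:mml=\"" + MATHML_NS + "\">"
      + "<svg:set/><mml:set/></doc>";

    static Element root() {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed for getPrefix / getLocalName / getNamespaceURI
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Splits a qualified name into its parts: prefix, local part, namespace URI. */
    static String describe(Node n) {
        return n.getPrefix() + " : " + n.getLocalName() + " -> " + n.getNamespaceURI();
    }

    public static void main(String[] args) {
        Element root = root();
        System.out.println(describe(root.getFirstChild()));
        System.out.println(describe(root.getLastChild()));
    }
}
```

Applications should compare namespace URI and local part, never the prefix, since the same URI may be bound to any prefix.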
Other DTD
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but well-known prefixes such as svg and rdf are usually kept by convention
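
The point that the URI, not the prefix, identifies a namespace can be checked with a small Java sketch. It assumes the standard JAXP DOM parser shipped with the JDK; the `office` element name and the `PrefixDemo` class are hypothetical, introduced only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PrefixDemo {
    // Parses an XML string with a namespace-aware DOM parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Two prefixes bound to two different URIs, as in the company example.
        String xml = "<company xmlns:local='http://www.company.it/xml'"
                   + " xmlns:euro='http://www.company.eu/xml'>"
                   + "<local:office/><euro:office/></company>";
        Element root = parse(xml).getDocumentElement();
        Element first = (Element) root.getFirstChild();
        Element second = (Element) root.getLastChild();
        // Same local name, different namespaces: the URI disambiguates them.
        System.out.println(first.getLocalName() + " in " + first.getNamespaceURI());
        System.out.println(second.getLocalName() + " in " + second.getNamespaceURI());
    }
}
```

Renaming the prefixes in the input leaves the reported namespace URIs unchanged, which is why applications should compare URIs rather than prefixes.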
53
Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
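
A minimal Java sketch of the default namespace at work, assuming the JDK's JAXP DOM parser; the `rect` child and the `DefaultNamespaceDemo` class are illustrative choices, not part of the slides. It also shows a detail worth remembering: unprefixed attributes do not inherit the default namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class DefaultNamespaceDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Parses the string with a namespace-aware parser and returns the root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element svg = parseRoot("<svg xmlns='" + SVG_NS + "' width='10'><rect/></svg>");
        Element rect = (Element) svg.getFirstChild();
        // The root and its unprefixed child both fall in the default namespace.
        System.out.println("svg:   " + svg.getNamespaceURI());
        System.out.println("rect:  " + rect.getNamespaceURI());
        // No prefix was written in the document, so none is reported.
        System.out.println("prefix of rect: " + rect.getPrefix());
        // Unprefixed attributes belong to no namespace, default or otherwise.
        System.out.println("width: " + svg.getAttributeNode("width").getNamespaceURI());
    }
}
```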
54
Internationalisation
55
What does Text Mean?
“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
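
The difference between a code point and its encodings can be made concrete with a short Java sketch. It uses only `java.nio.charset.StandardCharsets`; the `EncodingDemo` class and its helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Number of bytes the same text occupies under three different encodings.
    static int utf8Length(String s)   { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Length(String s)  { return s.getBytes(StandardCharsets.UTF_16BE).length; }
    static int latin1Length(String s) { return s.getBytes(StandardCharsets.ISO_8859_1).length; }

    public static void main(String[] args) {
        String s = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE, a single code point
        // One character set entry (the code point), several byte representations.
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));
        System.out.println("UTF-8 bytes:      " + utf8Length(s));
        System.out.println("UTF-16 bytes:     " + utf16Length(s));
        System.out.println("ISO-8859-1 bytes: " + latin1Length(s));
    }
}
```

The code point stays the same in all three cases; only the byte sequence changes, which is exactly why a document has to say which encoding it uses.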
56
The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
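
As a sketch of why the declaration matters, the following Java fragment feeds raw ISO-8859-1 bytes to the JDK's DOM parser and lets the parser pick the decoder from the encoding declaration. The `name` element and the `DeclaredEncodingDemo` class are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {
    // Parses raw bytes; the parser chooses the decoder from the XML declaration.
    static String rootText(byte[] bytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>Universit\u00E0</name>";
        // Serialise as Latin-1: the accented letter becomes the single byte 0xE0.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        // Thanks to the declaration, the byte is decoded back to the right character.
        System.out.println(rootText(latin1).equals("Universit\u00E0"));
    }
}
```

Without the declaration the parser would assume UTF-8, where a lone 0xE0 byte is not a valid sequence, so the same bytes would be rejected.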
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard: two letters for each language known
  if not there: IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
58
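
A tool such as the spell-checker mentioned above has to find the language in force for a given piece of content. Since xml:lang applies to an element and everything inside it unless overridden, one possible approach is to walk up the ancestors. The sketch below assumes the JDK's DOM parser; `LangDemo` and `languageOf` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class LangDemo {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the nearest xml:lang value on the node or its ancestors, or null if none.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element && ((Element) n).hasAttribute("xml:lang")) {
                return ((Element) n).getAttribute("xml:lang");
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc xml:lang='en'><p>Hello</p><p xml:lang='it'>Ciao</p></doc>");
        Node first = doc.getDocumentElement().getFirstChild();
        Node second = doc.getDocumentElement().getLastChild();
        System.out.println(languageOf(first));  // inherited from <doc>
        System.out.println(languageOf(second)); // overridden locally
    }
}
```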
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
59
XML & CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
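
The association between document and style sheet is an ordinary processing instruction, so any XML tool can read it. A minimal Java sketch, assuming the JDK's DOM parser; `StylesheetPiDemo`, `stylesheetData` and the file name `style.css` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

class StylesheetPiDemo {
    // Returns the data of the first xml-stylesheet processing instruction, or null.
    static String stylesheetData(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // The instruction sits in the prolog, as a child of the document node itself.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                    && "xml-stylesheet".equals(((ProcessingInstruction) n).getTarget())) {
                return ((ProcessingInstruction) n).getData();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
                   + "<collection/>";
        // The type and href pseudo-attributes come back as one plain string.
        System.out.println(stylesheetData(xml));
    }
}
```

Note that `type` and `href` are not real attributes: the parser hands back the instruction's data as a single string, and it is up to the application (here, the browser) to interpret it.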
64
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises it
66
Example: How Mozilla Visualises it
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
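
The tree-of-typed-nodes view can be explored with a short recursive walk. The sketch below assumes the JDK's DOM parser and counts nodes of a given kind; `NodeKindsDemo` and `count` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class NodeKindsDemo {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Counts the nodes of the given type in the subtree rooted at n (n included).
    static int count(Node n, short type) {
        int total = (n.getNodeType() == type) ? 1 : 0;
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            total += count(c, type);
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b>t</b><!--c--><b/></a>");
        // Every item in the document, not only elements, is a node with a type.
        System.out.println("elements: " + count(doc, Node.ELEMENT_NODE));
        System.out.println("texts:    " + count(doc, Node.TEXT_NODE));
        System.out.println("comments: " + count(doc, Node.COMMENT_NODE));
    }
}
```

Attributes do not show up in this walk: in DOM they are nodes too, but they are reached through the element's attribute map rather than through its children.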
72
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
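
In the Java binding the properties listed above become getter methods (`getNodeName`, `getNodeValue`, `getNodeType`, `getAttributes`). The following sketch, assuming the JDK's DOM implementation, builds a small tree in memory; the `id` and `lang` attributes and the `NodeMethodsDemo` class are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NodeMethodsDemo {
    // Builds <collection><recipe id="r1" lang="it"/><recipe id="r2"/></collection> in memory.
    static Document buildCollection() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element collection = doc.createElement("collection");
        doc.appendChild(collection);
        Element first = doc.createElement("recipe");
        first.setAttribute("id", "r1");
        first.setAttribute("lang", "it");
        collection.appendChild(first);
        Element second = doc.createElement("recipe");
        second.setAttribute("id", "r2");
        collection.appendChild(second); // appended after the existing child
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Element root = buildCollection().getDocumentElement();
        // Name, value and type are available on every node.
        System.out.println(root.getNodeName() + " / " + root.getNodeValue() + " / " + root.getNodeType());
        // The attribute map of the first recipe.
        System.out.println(root.getFirstChild().getAttributes().getLength());
        // The last child is the element appended last.
        System.out.println(((Element) root.getLastChild()).getAttribute("id"));
    }
}
```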
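A Simple Java DOM Example
  // Assumes the Apache Xerces parser is on the classpath.
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      // Load the whole document named on the command line into memory
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // Scan the children of the root until the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // helper that serialises a node; not shown in this transcript
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }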
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
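
The event-based model can be sketched in a few lines of Java with the SAX classes bundled in the JDK (`SAXParserFactory`, `DefaultHandler`). The `SaxEventsDemo` class, the `events` method and the sample document are hypothetical; the handler simply logs each event as it arrives, and no tree is ever built.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Parses the string and returns the sequence of SAX events that were fired.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("<collection><recipe>Pasta</recipe></collection>")) {
            System.out.println(event);
        }
    }
}
```

The application only keeps what it chooses to keep (here, a log of strings), which is where the memory advantage over DOM comes from.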
76
Namespaces
50
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
What are Distinguish
different XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
Syntax for
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
Associating Prefixes Example
a large firm could have a number of namespaces for different purposes
ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not
53
Setting Default
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
55
What does Text ldquoTextrdquo can be encoded according so many different alphabets
mapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there: IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
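A spell-checker could resolve the language of a subpart roughly as in the sketch below, which looks for xml:lang on the element itself and then on its ancestors, since the attribute applies to the content of the element it is set on unless overridden. It assumes the JDK's JAXP DOM parser; the class LangDemo, its methods and the sample document are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative class, not part of the slides.
final class LangDemo {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so that xml:lang is resolved via its namespace
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    // Language of an element: its own xml:lang, else the nearest ancestor's, else "".
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
            }
        }
        return ""; // no language declared anywhere up to the root
    }

    // Convenience helper: language of the first element with the given tag name.
    static String languageOfFirst(String xml, String tagName) {
        return languageOf((Element) parse(xml).getElementsByTagName(tagName).item(0));
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\"><q>Ciao</q></p></doc>";
        System.out.println("first p: " + languageOfFirst(xml, "p"));
        System.out.println("q:       " + languageOfFirst(xml, "q"));
    }
}
```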
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
  spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example [slide figure]
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
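The association between document and style sheet is an ordinary processing instruction, so a program can read it like any other node. The following Java sketch assumes the JDK's JAXP DOM parser; the class StylesheetLinkDemo, the recipe document and the file name recipes.css are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

// Illustrative class, not part of the slides.
final class StylesheetLinkDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    // Data of the first xml-stylesheet processing instruction before the root, or null.
    static String stylesheetData(String xml) {
        Document doc = parse(xml);
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if (pi.getTarget().equals("xml-stylesheet")) {
                    return pi.getData();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"recipes.css\"?>"
                + "<recipe><title>Pasta</title></recipe>";
        System.out.println(stylesheetData(xml));
    }
}
```

A matching style sheet would then contain rules keyed on the document's own element names, for instance a rule for title.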
XML + CSS Example: The XML Doc [slide figure]
Example: How Mozilla Visualises it [slide figures]
DOM & SAX
Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
We often need to manipulate (access / delete / modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
Document Object Model
http://www.w3.org/DOM/
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
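To see several of these node kinds side by side, the sketch below lists the kinds of the children of a root element. It assumes the JDK's JAXP DOM parser with default settings; the class NodeKindsDemo and the sample document are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative class, not part of the slides.
final class NodeKindsDemo {

    // Readable name for the node kinds used in this example.
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    // Kinds of the children of the root element, in document order.
    static List<String> childKinds(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
        List<String> kinds = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            kinds.add(kind(n));
        }
        return kinds;
    }

    public static void main(String[] args) {
        String xml = "<recipe><!--note--><title>Pasta</title>salt"
                + "<![CDATA[<b>]]><?print now?></recipe>";
        System.out.println(childKinds(xml));
    }
}
```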
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
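In the Java binding, the general properties become getter methods on org.w3c.dom.Node. The sketch below exercises a few of them, assuming the JDK's JAXP DOM parser; the class NodeMethodsDemo and the recipe and note elements are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative class, not part of the slides.
final class NodeMethodsDemo {

    // Parses <recipe id="r1"><title>Pasta</title></recipe> and appends a <note> child.
    static Element buildAndExtend() {
        Document doc;
        try {
            String xml = "<recipe id=\"r1\"><title>Pasta</title></recipe>";
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
        Element root = doc.getDocumentElement();
        Element note = doc.createElement("note");
        note.appendChild(doc.createTextNode("al dente"));
        root.appendChild(note); // lands after the existing <title> child
        return root;
    }

    public static void main(String[] args) {
        Element root = buildAndExtend();
        System.out.println("name:       " + root.getNodeName());
        System.out.println("type:       " + root.getNodeType());
        System.out.println("value:      " + root.getNodeValue()); // elements have no value
        System.out.println("attributes: " + root.getAttributes().getLength());
        System.out.println("last child: " + root.getLastChild().getNodeName());
    }
}
```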
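A Simple Java DOM Example
    // Needs java.io.PrintStream, org.w3c.dom.Document and org.w3c.dom.Node;
    // DOMParser is assumed to be org.apache.xerces.parsers.DOMParser (Apache Xerces).
    // print(Node, PrintStream) is a helper that is not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // look for the first <recipe> child of the document element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }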
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM
  in terms of standardisation
  as well as of tools
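The event-based model can be sketched in Java with the SAX parser bundled in the JDK: the handler below simply records each event as it arrives, without ever building a tree. The class SaxEventsDemo and the sample document are illustrative, not from the slides.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class, not part of the slides.
final class SaxEventsDemo {

    // Parses the XML text and returns the sequence of SAX events as readable strings.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<recipe><title>Pasta</title></recipe>")) {
            System.out.println(event);
        }
    }
}
```

Only the current event is held at any time, which is where the memory advantage over DOM comes from; the price is that the application must keep track of context by itself.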
Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tags
prefix x-58
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
62
CSS: An Example
63
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { attribute1: value1; … }
64
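The association between document and style sheet can also be produced programmatically. The following is only a sketch under stated assumptions: the element names (collection, recipe, title), the style rules in CSS, the file name recipes.css and the class StylesheetLinkDemo are all hypothetical. It builds a tiny document with the standard Java DOM API, puts the xml-stylesheet processing instruction before the root element, and serialises the result.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StylesheetLinkDemo {
    // Hypothetical style sheet: one rule per element type of the XML document.
    static final String CSS =
        "recipe { display: block; margin: 1em; }\n"
      + "title  { display: block; font-weight: bold; }\n";

    // Builds a small document linked to the given style sheet and returns it as text.
    static String styledDocument(String cssHref) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("collection");
            doc.appendChild(root);
            Element recipe = doc.createElement("recipe");
            Element title = doc.createElement("title");
            title.setTextContent("Tiramisu");
            recipe.appendChild(title);
            root.appendChild(recipe);
            // The processing instruction must come before the root element.
            doc.insertBefore(
                doc.createProcessingInstruction("xml-stylesheet",
                    "type=\"text/css\" href=\"" + cssHref + "\""),
                root);
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(styledDocument("recipes.css"));
        System.out.print(CSS);
    }
}
```

Saving the two outputs as an XML file and as recipes.css in the same folder should be enough for a CSS-aware browser to apply the rules to the recipe and title elements.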
XML + CSS Example: The XML Doc
65
Example: How Mozilla Visualises It
66
Example: How Mozilla Visualises It
67
DOM & SAX
68
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often, we need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX
69
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope
70
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
71
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
72
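To make the node kinds above concrete, here is a hedged sketch that walks a DOM tree and records the kind of every node it meets. The sample document and the names NodeKindsDemo, describe and walk are hypothetical. Only a few of the node kinds listed on the slide are distinguished, the rest are reported as "other".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsDemo {
    // Hypothetical sample: one attribute, one comment, one nested element with text.
    static final String SAMPLE =
        "<recipe id=\"r1\"><!--quick--><title>Pasta</title></recipe>";

    // Parses the document and lists its nodes in document order.
    static List<String> describe(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static void walk(Node n, List<String> out) {
        String kind;
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:               kind = "document"; break;
            case Node.ELEMENT_NODE:                kind = "element"; break;
            case Node.TEXT_NODE:                   kind = "text"; break;
            case Node.CDATA_SECTION_NODE:          kind = "cdata"; break;
            case Node.COMMENT_NODE:                kind = "comment"; break;
            case Node.PROCESSING_INSTRUCTION_NODE: kind = "pi"; break;
            default:                               kind = "other";
        }
        out.add(kind + " " + n.getNodeName());
        // Attributes are nodes too, but they are not children of their element.
        NamedNodeMap attrs = n.getAttributes();   // null for non-element nodes
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                out.add("attribute " + attrs.item(i).getNodeName());
            }
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    public static void main(String[] args) {
        for (String line : describe(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```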
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes: returns all the attributes of the node
  appendChild(newChild): appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
73
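As an illustration of the general properties and methods just listed, the sketch below uses the standard Java binding (org.w3c.dom), where the attributes property is exposed as getAttributes() and name, value and type as getNodeName(), getNodeValue() and getNodeType(). The sample document, the class NodeApiDemo and the helper addIngredient are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class NodeApiDemo {
    // Hypothetical sample document.
    static final String SAMPLE =
        "<recipe id=\"r1\" course=\"main\"><title>Pasta</title></recipe>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // appendChild puts the new node after the existing children of the root.
    static Element addIngredient(Document doc, String name) {
        Element ingredient = doc.createElement("ingredient");
        ingredient.setAttribute("name", name);
        doc.getDocumentElement().appendChild(ingredient);
        return ingredient;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element root = doc.getDocumentElement();
        // Name, value and type: for an element the value is null.
        System.out.println(root.getNodeName() + " " + root.getNodeValue()
                + " " + root.getNodeType());
        // All the attributes of the node.
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(attrs.item(i).getNodeName() + "="
                    + attrs.item(i).getNodeValue());
        }
        addIngredient(doc, "basil");
        System.out.println(root.getLastChild().getNodeName());
    }
}
```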
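A Simple Java DOM
  // Assumption: DOMParser is the Xerces class org.apache.xerces.parsers.DOMParser;
  // Document and Node come from org.w3c.dom, PrintStream from java.io.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                      // loads the whole document in memory
      Document doc = p.getDocument();
      // look for the first "recipe" child of the root element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                       // print(Node, PrintStream): helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }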
74
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
75
Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events such as: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
76
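The event-based model can be sketched with the SAX binding shipped with Java (org.xml.sax). The handler below just logs the events in the order the parser fires them, and no tree is ever built. The class name SaxEventsDemo, the helper events and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Hypothetical sample document.
    static final String SAMPLE = "<recipe><title>Pasta</title></recipe>";

    // Parses xml and returns the sequence of SAX events as readable strings.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("chars " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(SAMPLE)) {
            System.out.println(event);
        }
    }
}
```

Since the handler only keeps what it decides to keep (here, a log), memory use does not grow with the size of the document tree, which is the point made against DOM on the previous slide.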
Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue
it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms across applications across time
59
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
XML amp CSS
60
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
Cascading Style Cascading Style Sheets (CSS)
a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basics
if not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tags
nometag attributo1 valore1 hellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object httpwwww3orgDOM
standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998
primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000
adds Namespace support and minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can contain
document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Java73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it
66
Example How Mozilla Visualises it
67
DOM amp SAX
68
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sorts
through ad hoc APIThe most used hated deprecated widespread are
69
Document Object Model
http://www.w3.org/DOM
A W3C standard, as usual.
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents."
It applies to HTML as well as XML.
It is essentially an API, standardised for Java & ECMAScript, but it can be extended to other languages.
There is no time here to go deep into DOM: we just try to understand its nature, goals and scope.
70
DOM & Levels
DOM views an XML tree as a data structure, similar to the DOM from JavaScript.
DOM loads the whole XML document in memory to manipulate it, with possibly huge memory consumption.
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
primitive navigation and manipulation of XML trees
other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
adds Namespace support and minor new features
71
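To make the Level 2 namespace support concrete, here is a minimal sketch using the JAXP parser bundled with the JDK (an assumption: the slides do not name a parser at this point). The class NamespaceCount, its method count and the namespace URIs urn:a and urn:b are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the slides.
final class NamespaceCount {

    // Count the elements with the given namespace URI and local name.
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // JAXP parsers are not namespace-aware unless asked to be.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getElementsByTagNameNS is one of the DOM Level 2 additions.
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // urn:a and urn:b are made-up namespace URIs.
        String xml = "<r xmlns:a='urn:a' xmlns:b='urn:b'>"
                + "<a:item/><b:item/><a:item/></r>";
        System.out.println(count(xml, "urn:a", "item")); // prints 2
        System.out.println(count(xml, "urn:b", "item")); // prints 1
    }
}
```

The two item elements with prefix a and the one with prefix b share a local name, so only a namespace-aware lookup can tell them apart.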
DOM Nodes
An XML document is a tree.
The tree contains nodes: one of them is a root node; nodes possibly have siblings, children, one parent, content, a tag, etc.
The DOM specification states that a node can be one of the following kinds:
document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation.
It also defines which kinds of child nodes each of them should / could have.
72
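The node kinds listed above can be observed directly by walking a small tree. The following is a sketch under stated assumptions: it uses the JAXP parser shipped with the JDK, and the class NodeKinds with its methods parse, kind and walk are hypothetical names, not part of the slides. Only the most common kinds are mapped to a label.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
final class NodeKinds {

    // Parse an XML string into a DOM Document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Map the numeric node type to a readable label (common kinds only).
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.DOCUMENT_TYPE_NODE: return "doc type";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    // Depth-first walk that appends one indented line per node.
    static void walk(Node n, String indent, StringBuilder out) {
        out.append(indent).append(kind(n)).append(' ')
           .append(n.getNodeName()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, indent + "  ", out);
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        walk(parse("<student><!-- demo --><name>Carlo</name></student>"), "", out);
        System.out.print(out);
    }
}
```

Note that attributes are nodes too, but they are not children of their element, so a child-by-child walk like this one never visits them.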
Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree.
Every DOM node has a name, a value, a type.
There are general properties and methods for all kinds of nodes:
attributes returns all the attributes of the node
appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods.
These properties and methods are made available by the suitable API for the language of choice:
many solutions for Java
73
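As an illustration of the two general members named above, the sketch below reads the attributes of an element and appends a new child. In the Java binding the attributes property becomes the method getAttributes(). The class NodeProperties and its helpers attributesOf and appendTextChild are hypothetical names, and the JDK's JAXP parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
final class NodeProperties {

    // Parse an XML string into a DOM Document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    // Return "name=value" for every attribute of the element.
    static List<String> attributesOf(Element e) {
        NamedNodeMap attrs = e.getAttributes();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i); // every attribute has a name and a value
            result.add(a.getNodeName() + "=" + a.getNodeValue());
        }
        Collections.sort(result); // a NamedNodeMap has no guaranteed order
        return result;
    }

    // Append <tag>text</tag> after the existing children of parent.
    static Element appendTextChild(Element parent, String tag, String text) {
        Document doc = parent.getOwnerDocument();
        Element child = doc.createElement(tag);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
        return child;
    }

    public static void main(String[] args) {
        Element root = parse("<student id='42'><name>Carlo</name></student>")
                .getDocumentElement();
        System.out.println(attributesOf(root));
        appendTextChild(root, "surname", "Nervo");
        System.out.println(root.getLastChild().getNodeName());
    }
}
```

New nodes are created by the owning Document and only then attached with appendChild, which is why appendTextChild first asks the parent for its owner document.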
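A Simple Java DOM Example
// Imports are omitted on the slide. DOMParser is presumably the Apache Xerces
// class org.apache.xerces.parsers.DOMParser (an assumption); Document and Node
// come from org.w3c.dom, PrintStream from java.io.
public static void main(String[] args) {
    try {
        // Parse the file named on the command line into a DOM tree
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Scan the children of the root element for the first <recipe>
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        // Emit that recipe wrapped in a new <collection> element
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out); // print is a helper that is not shown on the slide
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}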
74
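The same idea can be tried without external libraries. The following is a sketch, not the slide's program: it assumes the JAXP parser bundled with the JDK in place of DOMParser, and it uses an identity Transformer as a stand-in for the print helper that the slide does not show. The class FirstRecipe, the method firstRecipe and the sample element names description and title are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class, not part of the slides.
final class FirstRecipe {

    // Return the first <recipe> child of the root, wrapped in <collection>.
    static String firstRecipe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Skip siblings (including whitespace text nodes) until a recipe.
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe")) {
                n = n.getNextSibling();
            }
            StringWriter out = new StringWriter();
            out.write("<collection>");
            if (n != null) {
                // Identity transform: serialises the subtree rooted at n.
                Transformer t = TransformerFactory.newInstance().newTransformer();
                t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                t.transform(new DOMSource(n), new StreamResult(out));
            }
            out.write("</collection>");
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<collection>\n"
                + "  <description>demo</description>\n"
                + "  <recipe><title>Pasta</title></recipe>\n"
                + "  <recipe><title>Soup</title></recipe>\n"
                + "</collection>";
        System.out.println(firstRecipe(xml));
    }
}
```

The loop has to test the node name because the first child of the root is often a whitespace text node (named #text) rather than an element.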
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory:
it might be time-consuming and difficult to manage.
Wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX, which did not start as a standard and has had problems of acceptance, but has indeed a long tail of followers, and also its good reasons to exist.
75
Simple API for XML
Differently from DOM, SAX is event-based.
It sees the document not as a tree, but as a text document flowing through the SAX parser and generating events as they occur: document started / ended, elements started / ended, character content, etc.
A very simple model, good for simple applications, and also to avoid memory abuse.
Not as well supported as DOM, in terms of standardisation as well as of tools.
76
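The event-based model can be sketched with the SAX API shipped with the JDK (org.xml.sax plus the JAXP SAXParserFactory). The handler below simply records each event as the document flows through the parser. The class SaxEvents and its method trace are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the slides: records events in order.
final class SaxEvents extends DefaultHandler {

    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver text in several chunks; blank chunks are skipped.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text " + text);
        }
    }

    // Run a SAX parse over the string and return the recorded events.
    static List<String> trace(String xml) {
        try {
            SaxEvents handler = new SaxEvents();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String event : trace("<student><name>Carlo</name></student>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built here: once an event has been handled, the parser keeps nothing of it, which is where the memory saving over DOM comes from. The price is that the application must keep any state it needs by itself.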